
What is Harness Engineering?

Prompt → Context → Harness: the infrastructure layer for AI agents 🔧

Diagram showing prompt, context, and harness engineering as stacked layers

Harness engineering is the complete environment you build around an AI model — tools, constraints, feedback loops, and context. Learn what it is, how it differs from prompt and context engineering, and how to start building one.


If you’ve been following AI tooling over the last few years, you’ve probably noticed the terminology keeps shifting. First it was prompt engineering, then context engineering, and now harness engineering is entering the conversation.

Each term reflects a real shift in how we work with AI — and they build on each other. Understanding the difference isn’t just academic; it changes how you architect your AI workflows. And right now, harness engineering is where the leverage is.


The Formula: Agent = Model + Harness

Before getting into the three layers, it helps to have a clear mental model.

An AI agent is not the model. An agent is the model plus a harness. The harness is everything else: the tools available to the model, the system prompt, the context management strategy, the permissions and guardrails, and the feedback loops that catch and correct errors.

Viv Trivedy, who coined the term harness engineering, puts it directly: “If you’re not the model, you’re the harness.”

💡 Tip: Think of the model as a car engine — powerful, but completely inert on its own. The harness is the rest of the vehicle: steering, brakes, transmission, chassis. The engine doesn’t decide where to go. The harness does.

This distinction matters because most of the gains people attribute to “better AI” actually come from better harnesses. More on that shortly.


Layer 1: Prompt Engineering (2022–2024)

Prompt engineering dominated the early years of working with LLMs, when most workflows involved a single model or a single agent. The core idea was simple: give the model a better input, get a better output.

You were optimizing the words and structure of what you sent to the model. That meant:

  • A well-crafted system prompt (“act as a senior back-end engineer”)
  • Chain-of-thought instructions
  • Few-shot examples before the user message
  • Explicit output format instructions (“return in JSON”)
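
Assembled into a single request, those pieces might look like the sketch below. Everything here is illustrative: call_model() is a hypothetical stand-in for whatever chat-completion client you use, and the example content is made up.

```python
def call_model(messages: list[dict]) -> str:
    """Hypothetical stand-in: wire this to your model provider's chat API."""
    raise NotImplementedError

messages = [
    # System prompt: role, chain-of-thought instruction, and output format in one place.
    {
        "role": "system",
        "content": (
            "Act as a senior back-end engineer. Reason through the problem step by step, "
            "then return only JSON with the keys 'diagnosis' and 'fix'."
        ),
    },
    # Few-shot example: shows the exact shape of answer you expect.
    {"role": "user", "content": "The /orders endpoint returns 500 when quantity is 0."},
    {
        "role": "assistant",
        "content": '{"diagnosis": "division by zero in unit-price calculation", '
                   '"fix": "guard against quantity == 0 before dividing"}',
    },
    # The actual task goes last.
    {"role": "user", "content": "The nightly export job silently drops rows with null emails."},
]

# reply = call_model(messages)
```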

This is still relevant — good prompt engineering is the foundation everything above it sits on. But it only gets you so far once you move beyond a single LLM call.


Layer 2: Context Engineering (2025)

Venn diagram showing context engineering as the umbrella over prompt engineering, RAG, memory, state/history, and structured outputs

Context engineering was popularized by Andrej Karpathy in 2025. It’s the practice of managing what information goes into the AI model’s context window.

As agents became more capable, the question shifted from “how do I phrase this?” to “what should the model even see?” That includes:

  • RAG — linking agents to databases and documents
  • Memory management — what the agent remembers across turns
  • File and schema selection — which files, schemas, and tool definitions are in scope
  • MCP tools — providing the right tools at the right time
  • Context compression — avoiding bloated context windows by summarizing or pruning

Context engineering builds on top of prompt engineering. Good prompts matter less if the model is drowning in irrelevant context — or missing the file it actually needs.

⚠️ Warning: Models degrade as context fills — a phenomenon called context rot. When a model’s context exceeds 50,000–100,000 tokens, precision can drop roughly 50%. Feeding an agent your entire codebase at once hurts more than it helps.

There are four practical techniques for managing this:

  • Compaction — intelligently summarize and offload older context as you approach window limits
  • Tool-call offloading — keep only the head and tail of large tool outputs; write the full content to a file on disk
  • Progressive disclosure — reveal tools and context only when the current task requires them (Agent Skills are a direct application of this)
  • Context resets — for very long jobs, tear down the session entirely and rebuild from a compact handoff file
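
As a sketch of tool-call offloading, the helper below keeps only the head and tail of a large tool output in context and writes the full text to disk. The size thresholds and scratch directory are arbitrary assumptions, not values from any particular agent runtime.

```python
import hashlib
from pathlib import Path

SCRATCH_DIR = Path(".agent-scratch")  # assumption: the agent can read files from here on demand
HEAD_CHARS = 1_000
TAIL_CHARS = 1_000

def offload_tool_output(output: str) -> str:
    """Return a truncated view of a tool output; persist the full text to disk."""
    if len(output) <= HEAD_CHARS + TAIL_CHARS:
        return output  # small enough to keep inline

    SCRATCH_DIR.mkdir(exist_ok=True)
    path = SCRATCH_DIR / f"{hashlib.sha256(output.encode()).hexdigest()[:12]}.txt"
    path.write_text(output)

    return (
        output[:HEAD_CHARS]
        + f"\n\n[... truncated: {len(output)} chars total, full output saved to {path} ...]\n\n"
        + output[-TAIL_CHARS:]
    )
```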

Layer 3: Harness Engineering

Diagram showing how a harness transforms a limited LLM into a capable agent fleet through tools like filesystem, code execution, sandboxes, memory, and long-horizon patterns

Harness engineering takes both of the above and wraps them in infrastructure. It’s the complete environment you design for an agent — not just what it sees or how you phrase things, but how it executes, what it’s allowed to do, and how you verify it did the right thing.

This reframes agent failures entirely. When an agent makes a mistake, the instinct is to blame the model. The better frame — sometimes called the “skill issue” reframe — is that agent failures are almost always configuration problems. The gap between what today’s models can do and what you see them doing is largely a harness gap.

The practical implication is what Mitchell Hashimoto calls the ratchet pattern: every model error you encounter becomes the basis for a harness solution that prevents that error from recurring. You stop fixing outputs and start fixing the system. A good rule of thumb: every line in your AGENTS.md should be traceable back to a specific thing that went wrong.

Guides and Sensors

The Martin Fowler framing is useful here. A harness operates through two types of controls:

Guides (feedforward controls) — shape agent behavior before an action happens. They increase the probability of getting a good result on the first attempt.

  • Architectural documentation and coding conventions
  • Bootstrap scripts and codemods
  • Harness templates for common service topologies
  • CLAUDE.md / AGENTS.md files with machine-readable project context

Sensors (feedback controls) — observe post-action behavior and enable self-correction.

  • Tests, linters, type checkers, structural analysis
  • AI code review agents (“LLM as judge”)
  • Mutation testing
  • Runtime monitoring (SLO degradation, log anomaly detection)

A feedback-only system creates repetitive errors — the agent keeps making the same mistakes because nothing is steering it away in the first place. A feedforward-only system never validates whether it’s working. You need both.
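
A minimal sketch of the two working together: the guide injects machine-readable project context before the agent acts, and the sensor runs the test suite afterwards and feeds failures back. The run_agent callable and the use of pytest are assumptions for illustration.

```python
import subprocess
from pathlib import Path

def build_guide() -> str:
    """Feedforward: project conventions the agent sees before it acts."""
    agents_md = Path("AGENTS.md")
    return agents_md.read_text() if agents_md.exists() else ""

def run_sensor() -> tuple[bool, str]:
    """Feedback: run the test suite and report what failed."""
    result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr

def attempt_task(task: str, run_agent, max_attempts: int = 3) -> bool:
    guide = build_guide()
    feedback = ""
    for _ in range(max_attempts):
        # run_agent is a stand-in for whatever executes the agent loop.
        run_agent(f"{guide}\n\nTask: {task}\n{feedback}")
        ok, report = run_sensor()
        if ok:
            return True
        feedback = f"\nThe previous attempt failed these checks:\n{report}"
    return False
```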

Three Types of Harness

There are three distinct regulation areas where harnesses apply:

1. Maintainability harness — regulates internal code quality. Linters catch structural issues (duplication, complexity, coverage gaps); AI review agents handle semantic issues that static analysis misses.

2. Architecture fitness harness — defines and enforces what the codebase is supposed to look like. Performance requirements, observability conventions, dependency constraints — all made machine-readable so an agent can check conformance automatically. This draws on the fitness functions pattern.

3. Behavior harness — verifies functional correctness. This is currently the weakest area for most teams. Feedforward: functional specifications. Feedback: AI-generated test suites with coverage and mutation testing. Heavy reliance on manual testing here is a sign of an immature harness.
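
To make the second type concrete, here's a sketch of one machine-readable fitness check: a script that fails if code in a domain layer imports from a web or infrastructure layer. The layer names and src/ layout are assumptions about a hypothetical project, not a standard convention.

```python
import ast
from pathlib import Path

# Dependency rule: domain code may not import from these layers.
FORBIDDEN = {"domain": {"web", "infrastructure"}}

def check_dependencies(src_root: str = "src") -> list[str]:
    violations = []
    for layer, banned in FORBIDDEN.items():
        for py_file in Path(src_root, layer).rglob("*.py"):
            tree = ast.parse(py_file.read_text())
            for node in ast.walk(tree):
                if isinstance(node, ast.ImportFrom) and node.module:
                    modules = [node.module]
                elif isinstance(node, ast.Import):
                    modules = [alias.name for alias in node.names]
                else:
                    continue
                for mod in modules:
                    if mod.split(".")[0] in banned:
                        violations.append(f"{py_file}: imports '{mod}' (forbidden for the {layer} layer)")
    return violations

if __name__ == "__main__":
    found = check_dependencies()
    for violation in found:
        print(violation)
    raise SystemExit(1 if found else 0)
```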

When Controls Run

Where in the development lifecycle a control runs matters as much as what it checks:

| Phase | Controls |
| --- | --- |
| Pre-integration (local) | Fast linters, basic tests, code review agents |
| Post-integration (CI/CD) | Mutation testing, comprehensive architecture review |
| Continuous monitoring | Dead code detection, test coverage drift, runtime anomalies |

The principle is to keep quality left — earlier detection is dramatically cheaper. A linter that fires in milliseconds on every save prevents an architectural violation that would otherwise slip through to a multi-hour review cycle.

Long-Horizon Execution Patterns

As tasks grow longer and more autonomous, harnesses need patterns that keep agents on track across many steps:

  • Ralph Loops — hooks intercept every completion attempt, re-inject the original prompt into a fresh context window, and force continuation. Prevents agents from declaring done prematurely on long jobs.
  • Planner/Generator/Evaluator splits — separate agents for planning, generating, and evaluating. Outperforms self-evaluation because the same model that generated the output is a poor judge of it.
  • Sprint Contracts — before coding begins, the generator and evaluator negotiate explicit “done” conditions. Prevents scope drift and makes the feedback loop deterministic.
  • Self-verification hooks — post-tool hooks run the test suite automatically; failures inject the error text back into the loop without human intervention.
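
Here is a sketch of the generator/evaluator split combined with a sprint contract: a separate evaluator agent judges the output against pre-agreed "done" conditions. Both agent callables are hypothetical stand-ins for whatever runs your agents.

```python
def run_sprint(task, done_conditions, generator_agent, evaluator_agent, max_rounds=5):
    """Loop a generator and a separate evaluator until the contract is met."""
    feedback = ""
    for _ in range(max_rounds):
        # Generator: produce (or revise) the work, given the latest evaluator feedback.
        output = generator_agent(f"Task: {task}\n\nEvaluator feedback so far:\n{feedback}")

        # Evaluator: a different agent checks the explicit, pre-negotiated conditions.
        verdict = evaluator_agent(
            f"Task: {task}\n\nCandidate output:\n{output}\n\n"
            "For each condition below, answer PASS or FAIL with a reason. "
            "Start your reply with OVERALL: PASS or OVERALL: FAIL.\n"
            + "\n".join(f"- {c}" for c in done_conditions)
        )
        if verdict.strip().upper().startswith("OVERALL: PASS"):
            return output
        feedback = verdict

    raise RuntimeError("Evaluator never accepted the output; escalate to a human.")
```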

The Numbers Don’t Lie

The most compelling argument for investing in harness engineering isn’t theoretical — it’s empirical.

LangChain tested the same model on Terminal Bench 2.0 and improved its score from 52.8% to 66.5% just by changing the harness — different system prompt, different available tools, different verification middleware. Model weights unchanged.

Can.ac went further: testing 16 models while only changing the harness editing format, they pushed one model from 6.7% to 68.3% — over an order of magnitude improvement without touching the model at all.

OpenAI’s Codex team built software exceeding one million lines of code without manually writing a single line. The differentiating factor wasn’t the base model. It was the surrounding environment.

The same logic explains Stripe shipping 1,300 AI pull requests a week — fully reviewed, merged, and deployed. They’re not using better models than everyone else. They built better infrastructure around their agents.

💡 Pro tip: Cursor discovered that removing 80% of available tools from their harness improved agent performance. Fewer options mean less distraction. When building your harness, remove tools the agent doesn’t need — don’t add everything available.


What a Harness Looks Like in Practice

Here’s a concrete example: an automated bug-fix pipeline.

1. Error tracker (Sentry) detects a bug
2. Agent retrieves the relevant files from the repo
3. Generator agent writes a fix
4. Evaluator agent runs the test suite
   • Tests pass? → Open PR
   • Tests fail? → Loop back to generator
5. PR is created and logged
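
Sketched as code, the loop might look like this. Every callable here is a hypothetical stand-in for a real integration (the Sentry webhook handler, repository search, the agents themselves, the pull-request API):

```python
MAX_ATTEMPTS = 3

def handle_bug_report(issue, *, retrieve_files, generate_fix, run_tests, open_pr, log_result):
    """Hypothetical pipeline: each callable stands in for a real integration."""
    files = retrieve_files(issue)                     # pull the relevant code for this bug
    feedback = ""
    for attempt in range(MAX_ATTEMPTS):
        patch = generate_fix(issue, files, feedback)  # generator agent writes a candidate fix
        passed, report = run_tests(patch)             # evaluator step: the test suite is the judge
        if passed:
            pr_url = open_pr(patch, issue)            # tests pass: open the PR
            log_result(issue, pr_url, attempt)
            return pr_url
        feedback = report                             # tests fail: loop back with the failure text
    log_result(issue, None, MAX_ATTEMPTS)             # give up and leave a trace for a human
    return None
```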

Notice what you’re not doing: writing prompts. You’re designing the flow, defining the tools, deciding what triggers what, and specifying the success condition. The harness is the infrastructure that makes this loop run without human intervention on each step.

This is also what makes multi-agent orchestration a form of harness engineering. If you’re coordinating multiple specialized agents — one to generate, one to evaluate, one to log — the routing and coordination logic between them is your harness.


Building Your First Harness

You don’t need to build everything at once. A practical harness can start with four components:

1. Documentation as code — write CLAUDE.md or AGENTS.md files as machine-readable source of truth: directory structure, naming conventions, build instructions, project-specific rules. These are guides — they shape agent behavior before anything runs. Keep them under 60 lines (the HumanLayer benchmark); longer files signal speculation rather than hard-won rules. Every line should trace to a real failure or a non-negotiable constraint. One more thing: MCP tool descriptions are trusted text — a malicious or poorly-scoped MCP can prompt-inject your agent, so treat tool selection as a security decision.

2. Architectural constraints — codify your dependency rules, module boundaries, and mandatory patterns in a form the agent can check. This turns “please follow the architecture” into an enforceable sensor.

3. Layered verification — stack controls by speed and cost. PostToolUse hooks fire in milliseconds. Pre-commit checks take seconds. CI pipelines take minutes. Human review takes hours. Each layer catches what the previous one misses.

4. Garbage collection — schedule periodic agent tasks to detect documentation drift, dead code, and architectural violations that accumulate over time. The harness degrades without active maintenance.

Teams that implement these four report 2–5x reliability improvements in agentic workflows.
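
For the layered-verification component, the fastest layer is usually a small hook script. Below is a sketch of a post-tool-use hook that lints only the file the agent just touched. It assumes the agent runtime pipes the tool call as JSON on stdin and treats a nonzero exit code as a signal to block and report; the exact contract, the tool names, and the choice of ruff as the linter all vary by harness.

```python
#!/usr/bin/env python3
"""Post-tool-use hook sketch: lint the file the agent just edited."""
import json
import subprocess
import sys

event = json.load(sys.stdin)  # assumption: the runtime sends the tool call as JSON on stdin

# Only react to file edits (tool names here are examples); let everything else through.
if event.get("tool_name") not in {"Edit", "Write"}:
    sys.exit(0)

path = event.get("tool_input", {}).get("file_path", "")
if path.endswith(".py"):
    result = subprocess.run(["ruff", "check", path], capture_output=True, text=True)
    if result.returncode != 0:
        # Surface the lint errors so the agent can self-correct on its next turn.
        print(result.stdout, file=sys.stderr)
        sys.exit(2)  # assumption: nonzero exit blocks the action and reports the message

sys.exit(0)
```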

One important mindset shift: a harness is a living system, not a static config. You remove rules when capable models make them redundant. You add rules only after observing real failures. The harness evolves with the agent — and with the models it wraps.


Where This Is Heading

The tooling ecosystem is shifting from LLM APIs (completion-focused) to what Viv Trivedy calls Harness-as-a-Service (HaaS) — runtime-focused platforms like the Claude Agent SDK, Codex SDK, and OpenAI Agents SDK that provide the loop, tool-calling, conversation state, and approval flows out of the box. You customize the system prompt, tools, context, and subagents instead of rebuilding the fundamentals yourself.

There’s also a convergence happening at the product level. The top coding agents — Claude Code, Cursor, Codex, Aider, Cline — look more alike than their underlying models differ. Harness patterns are converging. The interesting engineering isn’t in picking the model; it’s in designing the scaffolding.


The Three Layers Together

Think of them as a stack, each built on the one below:

Stack diagram showing prompt engineering at the base, context engineering in the middle, and harness engineering at the top

You don’t graduate from prompt engineering when you learn context engineering. You need all three. But as your systems become more autonomous, the leverage shifts upward — and the distinction between them matters for knowing where to invest your time.

A bad output? That might be a prompt problem. A good output that still breaks things? Probably a context problem. Reliable outputs that still fail at scale? That’s a harness problem.

If you want to go deeper on context engineering specifically, check out AGENTS.md, Instructions, and Path-Specific Rules.
