Harness Engineering

The model is only half the system. Harness engineering is the discipline of building the other half — the directives, agents, checks, and feedback loops that wrap an AI coding agent and make it reliable. It is where prompt engineering and context engineering were always heading, and it is what DevArch has been since v1.0.

Agent = Model + Harness

The formula spread through 2026, anchored by OpenAI's "Harness engineering: leveraging Codex in an agent-first world" and Martin Fowler's companion piece. The harness is everything that is not the model: the runtime that validates, authorizes, executes, and logs every action the model proposes. Most agents that fail in production do not fail because the underlying model is weak; they fail because the harness around it is brittle, unobservable, or unconstrained. The model returns a proposed action; the harness decides whether, how, and with what guardrails that action actually happens.
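The validate, authorize, execute, and log sequence can be sketched in a few lines. This is an illustrative toy, not how Claude Code or DevArch is implemented: the `Action` and `Harness` classes, the tool registry, and the allow-list are all hypothetical names introduced for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Action:
    """An action the model proposes; the harness decides what happens to it."""
    tool: str
    args: dict


@dataclass
class Harness:
    # Hypothetical registry: tool name -> callable that performs the action.
    tools: dict[str, Callable[..., str]]
    # Hypothetical allow-list used for the authorization step.
    allowed: set[str]
    log: list[str] = field(default_factory=list)

    def handle(self, action: Action) -> Optional[str]:
        # 1. Validate: the proposed tool must exist.
        if action.tool not in self.tools:
            self.log.append(f"rejected {action.tool}: unknown tool")
            return None
        # 2. Authorize: the tool must be on the allow-list.
        if action.tool not in self.allowed:
            self.log.append(f"denied {action.tool}: not authorized")
            return None
        # 3. Execute the action, then 4. log the outcome.
        result = self.tools[action.tool](**action.args)
        self.log.append(f"executed {action.tool}")
        return result


harness = Harness(
    tools={
        "read_file": lambda path: f"<contents of {path}>",
        "delete_file": lambda path: f"deleted {path}",
    },
    allowed={"read_file"},
)
print(harness.handle(Action("read_file", {"path": "README.md"})))
print(harness.handle(Action("delete_file", {"path": "README.md"})))
print(harness.log)
```

The model never calls a tool directly in this sketch. Every proposal passes through `handle`, so the log records denied and rejected actions as well as executed ones.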

[Figure: concentric bands showing the agent stack. Claude (the model) sits at the core, wrapped by Claude Code (runtime, tools, permissions), wrapped by DevArch (directives, agents, hooks). The model is labelled the core; the surrounding rings are labelled the harness, the part you engineer. An arrow points from the rings down to "Your Project: coherent, disciplined output".]
The model is the core; Claude Code and DevArch are the harness wrapping it. DevArch is the outermost ring — the discipline that governs everything inside.

The lineage: prompt → context → harness

AI engineering has matured through three phases, each subsuming the last. The question moved from “which words get the best output?” to “what information should the model see?” to “what system makes the agent reliable at all?”

Prompt engineering

Language

Tuned the words of a single interaction. The unit of work was one message.

Context engineering

Information

Tuned what the model is given — retrieval, memory, the shape of the input. The unit of work was one request.

Harness engineering

Environment

Tunes the system that governs how the agent works across a whole task, session, and project. The unit of work is the agent itself.

Harness engineering does not replace the earlier disciplines; it contains them. Prompting is one part of context engineering, and context engineering is one part of a harness. The harness is the outermost level: the complete environment that governs the agent end to end.

Spec-driven development needs a harness

A specification says what to build, but a spec is an inert document: nothing enforces it while the agent works. The harness is the missing layer that turns specs into continuously enforced operating conditions. That is the shift from in-the-loop, where a human validates every output by hand, to on-the-loop, where the system enforces the conditions automatically and pulls the human in only where judgment is most valuable. Repeated mistakes get encoded as rules; review bottlenecks get replaced by automated checks; quality debt stops accumulating because the expertise lives in the harness instead of in one person's attention.
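One way to picture the on-the-loop idea is a rule set that handles most violations automatically and escalates only the ones marked as needing judgment. The sketch below is hypothetical: the `Rule` class, the `review` function, and both example rules are invented for illustration and are not DevArch's rules.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Rule:
    name: str
    # Returns a problem description, or None if the output passes.
    check: Callable[[str], Optional[str]]
    # True when a violation needs human judgment rather than automatic feedback.
    needs_human: bool = False


def review(output: str, rules: list[Rule]) -> tuple[list[str], list[str]]:
    """Return (automatic_feedback, human_escalations) for one agent output."""
    auto: list[str] = []
    human: list[str] = []
    for rule in rules:
        problem = rule.check(output)
        if problem is None:
            continue
        target = human if rule.needs_human else auto
        target.append(f"{rule.name}: {problem}")
    return auto, human


# A repeated mistake encoded once as a rule, plus one judgment call.
rules = [
    Rule("no-print-debugging",
         lambda code: "stray print()" if "print(" in code else None),
    Rule("schema-change",
         lambda code: "alters a table" if "ALTER TABLE" in code else None,
         needs_human=True),
]

auto, human = review('print("debug")\nrun("ALTER TABLE users ...")', rules)
print("sent back to the agent:", auto)
print("escalated to a human:", human)
```

Automatic feedback goes straight back to the agent, while the human sees only the escalations.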

Guides and Sensors

Martin Fowler frames a harness as two kinds of control. Guides are feed-forward: they anticipate the agent's behavior and steer it before it acts. Sensors are feedback: they observe after the agent acts and help it self-correct. Each runs in one of two modes — computational (deterministic and fast: linters, tests, type-checkers) or inferential (semantic and richer: AI analysis). A good harness does not aim to eliminate human input; it directs that input to where it matters most.
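The two axes (guide or sensor, computational or inferential) can be modelled as a small data structure. This is a sketch under assumptions, not Fowler's or DevArch's code: `Kind`, `Mode`, `Control`, and `run_step` are hypothetical names, and the agent is a stub function standing in for a model call.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Kind(Enum):
    GUIDE = "feed-forward"   # steers the agent before it acts
    SENSOR = "feedback"      # observes after the agent acts


class Mode(Enum):
    COMPUTATIONAL = "computational"  # deterministic: linters, tests, type-checkers
    INFERENTIAL = "inferential"      # semantic: AI analysis


@dataclass
class Control:
    name: str
    kind: Kind
    mode: Mode
    # Guides map a task to extra instructions; sensors map output to findings.
    run: Callable[[str], list[str]]


def run_step(task: str, agent: Callable[[str], str],
             controls: list[Control]) -> tuple[str, list[str]]:
    # Guides act first: their instructions are appended to the task.
    guidance = [g for c in controls if c.kind is Kind.GUIDE for g in c.run(task)]
    output = agent("\n".join([task, *guidance]))
    # Sensors act afterwards: they inspect the output and report findings.
    findings = [f for c in controls if c.kind is Kind.SENSOR for f in c.run(output)]
    return output, findings


controls = [
    Control("style-rules", Kind.GUIDE, Mode.COMPUTATIONAL,
            lambda task: ["Use type hints on every function."]),
    Control("todo-scan", Kind.SENSOR, Mode.COMPUTATIONAL,
            lambda out: ["unfinished TODO"] if "TODO" in out else []),
]


def stub_agent(prompt: str) -> str:
    # Stand-in for a real model call.
    return "def f(): pass  # TODO"


output, findings = run_step("Write f.", stub_agent, controls)
print(findings)
```

Both example controls are computational. An inferential control would have the same shape, with `run` wrapping a model call instead of a string check.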

How DevArch implements it

DevArch is a harness for Claude Code. Its mechanisms map directly onto the vocabulary above — each is a Guide or a Sensor, computational or inferential.

| DevArch mechanism | Role | Mode |
| --- | --- | --- |
| CLAUDE.md directives: coding-discipline rules, Boundary Statements, Behavior Statements written before code | Guides | feed-forward |
| Agents: mutation-verification, pre-session-audit, seam-detector, pattern-recurrence-detector | Sensors | inferential feedback |
| Hooks: boundary-check, budget gates, dotfile enforcement | Sensors | computational feedback |

The whole agent-lifecycle ruleset (session start, planning, coding, session end) is the on-the-loop layer: it runs without being invoked. And the controls line up with Fowler's three regulation categories: maintainability (modularity and boundary rules), architecture fitness (ADRs and Boundary Statements), and behavior (Behavior Statements and RED / YELLOW / GREEN test grading).

DevArch did not adopt the term and then build to fit it. The tooling came first — forged building the product under its own discipline — and the industry arrived at the name afterward. The category caught up to the practice.

Further reading