Harness Engineering
The model is only half the system. Harness engineering is the discipline of building the other half — the directives, agents, checks, and feedback loops that wrap an AI coding agent and make it reliable. It is where prompt engineering and context engineering were always heading, and it is what DevArch has been since v1.0.
Agent = Model + Harness
The formula spread through 2026, anchored by OpenAI's “Harness engineering: leveraging Codex in an agent-first world” and Martin Fowler's companion piece. The harness is everything that is not the model: the runtime that validates, authorizes, executes, and logs every action the model proposes. Most agents that fail in production do not fail because the underlying model is weak; they fail because the harness around it is brittle, unobservable, or unconstrained. The model returns a proposed action; the harness decides whether, how, and with what guardrails that action actually happens.
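The validate, authorize, execute, log sequence described above can be sketched in a few lines. This is a minimal illustration, not DevArch's or OpenAI's implementation: the `Action` and `Harness` classes, the tool names, and the log format are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Action:
    """A proposed action returned by the model (hypothetical shape)."""
    tool: str
    argument: str


@dataclass
class Harness:
    """Validates, authorizes, executes, and logs each proposed action."""
    tools: dict[str, Callable[[str], str]]
    allowed_tools: set[str]
    log: list[str] = field(default_factory=list)

    def run(self, action: Action) -> Optional[str]:
        # Validate: the action must name a tool the harness knows about.
        if action.tool not in self.tools:
            self.log.append(f"rejected {action.tool}: unknown tool")
            return None
        # Authorize: a known tool can still be off-limits for this session.
        if action.tool not in self.allowed_tools:
            self.log.append(f"denied {action.tool}: not authorized")
            return None
        # Execute, then record the outcome.
        result = self.tools[action.tool](action.argument)
        self.log.append(f"executed {action.tool}")
        return result


# Illustrative tools only; a real harness would wrap shell, file, or API calls.
harness = Harness(
    tools={"echo": lambda text: text, "delete": lambda path: f"deleted {path}"},
    allowed_tools={"echo"},
)

print(harness.run(Action("echo", "hello")))   # authorized, so it executes
print(harness.run(Action("delete", "src/")))  # known tool, but not authorized
print(harness.run(Action("deploy", "prod")))  # unknown tool
print(harness.log)
```

The model never calls a tool directly in this sketch: every proposal passes through `run`, so each refusal leaves a log entry that can be inspected later.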
The lineage: prompt → context → harness
AI engineering has matured through three phases, each subsuming the last. The question moved from “which words get the best output?” to “what information should the model see?” to “what system makes the agent reliable at all?”
| Phase | Focus | What it tunes |
|---|---|---|
| Prompt engineering | Language | Tuned the words of a single interaction. The unit of work was one message. |
| Context engineering | Information | Tuned what the model is given: retrieval, memory, the shape of the input. The unit of work was one request. |
| Harness engineering | Environment | Tunes the system that governs how the agent works across a whole task, session, and project. The unit of work is the agent itself. |
Harness engineering does not replace the earlier disciplines; it contains them. Context engineering is one part of a harness, and prompting is one part of context. The harness is the outermost level: the complete environment that governs the agent end to end.
Spec-driven development needs a harness
A specification says what to build, but a spec is an inert document — nothing enforces it while the agent works. The harness is the missing layer that turns specs into continuously-enforced operating conditions. That is the shift from in-the-loop — a human validating every output by hand — to on-the-loop, where the system enforces the conditions automatically and pulls the human in only where judgment is most valuable. Repeated mistakes get encoded as rules; review bottlenecks get replaced by automated checks; quality debt stops accumulating because the expertise lives in the harness instead of one person's attention.
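One way to picture the move from in-the-loop to on-the-loop is a set of rules that run on every proposed change and escalate to a human only when a rule is marked as a judgment call. The sketch below is an assumption-laden illustration: the `Rule` shape, the `review` function, and the two example rules are hypothetical and do not come from DevArch.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    """One spec condition encoded as an automated check (hypothetical shape)."""
    name: str
    check: Callable[[str], bool]  # returns True when the change passes
    needs_human: bool = False     # on failure, escalate instead of rejecting


def review(change: str, rules: list[Rule]) -> str:
    """Return 'accept', 'reject', or 'escalate' for a proposed change."""
    verdict = "accept"
    for rule in rules:
        if rule.check(change):
            continue
        if not rule.needs_human:
            return "reject"   # deterministic failure: no human attention needed
        verdict = "escalate"  # judgment call: pull the human in
    return verdict


# A repeated mistake becomes a rule; a sensitive area becomes an escalation.
RULES = [
    Rule("no leftover debug prints", lambda change: "print(" not in change),
    Rule("public API untouched", lambda change: "def api_" not in change,
         needs_human=True),
]

print(review("total = price * qty", RULES))
print(review("print(total)", RULES))
print(review("def api_v2(): ...", RULES))
```

Hard failures win over escalations here, so a person is only asked to look at changes that already pass every automated check.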
Guides and Sensors
Martin Fowler frames a harness as two kinds of control. Guides are feed-forward: they anticipate the agent's behavior and steer it before it acts. Sensors are feedback: they observe after the agent acts and help it self-correct. Each runs in one of two modes — computational (deterministic and fast: linters, tests, type-checkers) or inferential (semantic and richer: AI analysis). A good harness does not aim to eliminate human input; it directs that input to where it matters most.
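The two axes above (Guide or Sensor, computational or inferential) can be modeled as a small data structure. This is a sketch for illustration only: the class names, the example controls, and the choice to run cheap computational sensors before inferential ones are assumptions, not part of Fowler's framing.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    GUIDE = "feed-forward"  # steers the agent before it acts
    SENSOR = "feedback"     # observes after the agent acts


class Mode(Enum):
    COMPUTATIONAL = "computational"  # deterministic and fast
    INFERENTIAL = "inferential"      # semantic, model-based analysis


@dataclass(frozen=True)
class Control:
    name: str
    kind: Kind
    mode: Mode


def guides(controls: list[Control]) -> list[Control]:
    """Controls to apply before the agent acts."""
    return [c for c in controls if c.kind is Kind.GUIDE]


def sensors(controls: list[Control]) -> list[Control]:
    """Controls to run after the agent acts, computational ones first.

    Running the fast deterministic checks first is a design assumption here.
    """
    found = [c for c in controls if c.kind is Kind.SENSOR]
    return sorted(found, key=lambda c: c.mode is Mode.INFERENTIAL)


# Illustrative classification; treating the rules file as inferential is an assumption.
CONTROLS = [
    Control("project rules file", Kind.GUIDE, Mode.INFERENTIAL),
    Control("AI review", Kind.SENSOR, Mode.INFERENTIAL),
    Control("linter", Kind.SENSOR, Mode.COMPUTATIONAL),
    Control("test suite", Kind.SENSOR, Mode.COMPUTATIONAL),
]

print([c.name for c in guides(CONTROLS)])
print([c.name for c in sensors(CONTROLS)])
```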
How DevArch implements it
DevArch is a harness for Claude Code. Its mechanisms map directly onto the vocabulary above — each is a Guide or a Sensor, computational or inferential.
| DevArch mechanism | Role | Direction and mode |
|---|---|---|
| CLAUDE.md directives: coding-discipline rules, Boundary Statements, Behavior Statements written before code | Guides | feed-forward |
| Agents: mutation-verification, pre-session-audit, seam-detector, pattern-recurrence-detector | Sensors | inferential feedback |
| Hooks: boundary-check, budget gates, dotfile enforcement | Sensors | computational feedback |
The whole agent-lifecycle ruleset (session start, planning, coding, session end) is the on-the-loop layer: it runs without being invoked. The controls also line up with Fowler's three regulation categories: maintainability (modularity and boundary rules), architecture fitness (ADRs and Boundary Statements), and behavior (Behavior Statements and RED / YELLOW / GREEN test grading).