Harness Engineering — The New Discipline Powering Software Development in the AI Agent Era
AI agents writing code is no longer a novelty. It’s becoming standard practice. But throwing prompts at an agent doesn’t produce production-quality software. For agents to deliver reliable results consistently, you need to design the environment they work in: the constraints, tools, documentation, and feedback loops that keep them on track.
The systematic approach to designing that environment is called Harness Engineering.
What is Harness Engineering?
Harness Engineering is the discipline of systematically building constraints, tools, documentation, and feedback loops that enable AI coding agents to do reliable work.
The term was popularized by HashiCorp founder Mitchell Hashimoto, who proposed a six-stage AI adoption journey and named stage five “Engineer the Harness.”
Anytime you find an agent makes a mistake, you take the time to engineer a solution such that the agent never makes that mistake again. That’s Harness Engineering.
— Mitchell Hashimoto
The concept gained industry-wide attention when OpenAI published a detailed article in February 2026 describing how they built roughly one million lines of code, with zero lines written by hand, using Codex agents. Martin Fowler introduced Birgitta Böckeler’s analysis of it as “a valuable framing of a key part of AI-enabled software development.”
Prompt → Context → Harness: Three Layers
To understand Harness Engineering, it helps to see where it sits relative to related concepts.
| Layer | Focus | Key Question |
|---|---|---|
| Prompt Engineering | Optimizing instructional text to the LLM | “What do we tell it?” |
| Context Engineering | Managing all tokens the LLM receives (tools, RAG, schemas, memory) | “What do we show it?” |
| Harness Engineering | System-wide constraints, feedback, and improvement cycles | “What do we prevent, measure, and control?” |
While Prompt and Context Engineering optimize the quality of a single inference, Harness Engineering ensures ongoing quality at the system level.
OpenAI’s Experiment: One Million Lines, Zero Hand-Written
Ryan Lopopolo of OpenAI shared a detailed account of a five-month experiment building an internal product entirely with Codex agents.
By the Numbers
- Code volume: ~1 million lines (application logic, infrastructure, tooling, documentation)
- Hand-written code: 0 lines
- Team size: Started with 3 engineers, grew to 7
- Throughput: Average 3.5 PRs per engineer per day
- Total PRs: ~1,500 over five months
- Estimated speed: Built in ~1/10th the time of hand-writing
Philosophy: Humans Steer. Agents Execute.
The team’s guiding principle was clear: humans design environments, specify intent, and build feedback loops; agents write the code. The engineer’s job shifted from implementation to system design.
Early progress was slow, not because Codex was incapable, but because the environment was underspecified. The agent lacked tools, abstractions, and internal structure. When something failed, the answer was never “try harder.” It was always: “What capability is missing, and how do we make it both legible and enforceable for the agent?”
The Four Pillars of a Harness
Synthesizing insights from OpenAI, Mitchell Hashimoto, and Birgitta Böckeler, a harness consists of four pillars.
1. Architecture as Guardrails
Agents flounder in unconstrained environments. Paradoxically, stricter constraints produce more reliable agent output.
OpenAI adopted a rigid layered architecture:
Types → Config → Repo → Service → Runtime → UI
Each business domain can only depend “forward” through this fixed layer sequence. Cross-cutting concerns (auth, telemetry, feature flags) enter through a single explicit interface: Providers. These constraints are enforced mechanically via custom linters and structural tests.
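A forward-only layer rule like this is easy to check mechanically. Here is a minimal sketch of such a structural lint, assuming nothing about OpenAI's actual tooling: the layer names follow the sequence above, but the module naming scheme (`app.service.billing` and so on) and the check itself are illustrative.

```python
# Sketch of a structural lint enforcing "forward-only" layer dependencies.
# Layer sequence mirrors the article; the module naming is hypothetical.

LAYER_ORDER = ["types", "config", "repo", "service", "runtime", "ui"]


def layer_of(module: str) -> str:
    """Extract the layer from a dotted module path like 'app.service.billing'."""
    for part in module.split("."):
        if part in LAYER_ORDER:
            return part
    raise ValueError(f"module {module!r} is not in a known layer")


def check_import(importer: str, imported: str) -> bool:
    """A module may only depend on its own layer or earlier layers in the sequence."""
    return LAYER_ORDER.index(layer_of(imported)) <= LAYER_ORDER.index(layer_of(importer))


# A service module may use the repo layer...
print(check_import("app.service.billing", "app.repo.invoices"))  # True
# ...but a repo module must not reach into the UI layer.
print(check_import("app.repo.invoices", "app.ui.dashboard"))     # False
```

A real version would walk the repository's import graph in CI and in pre-commit hooks, so the agent gets the violation as immediate, local feedback.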
This is the kind of architecture you usually postpone until you have hundreds of engineers. With coding agents, it’s an early prerequisite: the constraints are what allows speed without decay or architectural drift.
— Ryan Lopopolo, OpenAI
2. Documentation as System of Record
One of OpenAI’s most important lessons: treat AGENTS.md as a table of contents, not an encyclopedia.
The “one big AGENTS.md” approach failed predictably:
- Context is a scarce resource: a giant instruction file crowds out the task and code, so agents miss key constraints
- When everything is “important,” nothing is: agents fall back to local pattern-matching
- Monolithic docs go stale instantly: agents can’t tell what’s current, and humans stop maintaining them
Instead, AGENTS.md was kept to ~100 lines as a “map,” pointing into a structured docs/ directory treated as the system of record:
| File / Directory | Role |
|---|---|
| AGENTS.md | Table of contents (~100 lines). Pointers to deeper sources of truth |
| ARCHITECTURE.md | Top-level map of domains and package layering |
| docs/design-docs/ | Design documents and core beliefs |
| docs/exec-plans/active/ | Active execution plans |
| docs/exec-plans/completed/ | Completed execution plans |
| docs/product-specs/ | Product specifications |
| docs/references/ | Reference materials for LLMs |
Design docs, execution plans, and technical debt trackers are all version-controlled, enabling agents to operate autonomously without external context.
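To make the "map, not encyclopedia" idea concrete, a table-of-contents-style AGENTS.md might look like the fragment below. The file and directory names mirror the table above; the section headings and pointers are illustrative, not OpenAI's actual file.

```markdown
# AGENTS.md — table of contents (keep to ~100 lines)

## Start here
- ARCHITECTURE.md — map of domains and package layering
- docs/design-docs/ — design documents and core beliefs

## Before changing code
- Check docs/exec-plans/active/ for an execution plan covering your task.
- Product behavior lives in docs/product-specs/; specs win over code.

## Conventions (pointers, not rules)
- Layering and dependency rules: see ARCHITECTURE.md.
- Reference material curated for LLMs: docs/references/.
```

Every entry points to a deeper source of truth instead of inlining it, so the file stays small enough that agents actually read all of it.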
Mitchell Hashimoto recommends a similar approach: update AGENTS.md every time an agent makes a mistake, accumulating implicit prompting improvements. Documentation becomes a living feedback loop, not a static artifact.
3. Observability and Feedback Loops (Application Legibility)
Agents need to understand not just the code but also the running behavior of the application.
OpenAI made the application legible to Codex by:
- Wiring the Chrome DevTools Protocol into the agent runtime for DOM snapshots, screenshots, and navigation, enabling bug reproduction and fix validation
- Providing a local observability stack (Victoria Logs / Metrics / Traces) per git worktree, queryable via LogQL and PromQL
- Making the app bootable per git worktree, so Codex could launch and validate one instance per change
This enabled prompts like “ensure service startup completes in under 800ms” or “no span in these critical user journeys exceeds two seconds.” Single Codex runs regularly work on tasks for over six hours, often while humans are sleeping.
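Latency assertions like these reduce to deterministic checks an agent can run against the local observability stack. The sketch below shows the shape of such a check with inline span data; in practice the spans would be fetched from the per-worktree trace store, whose query API is not described in the article and is left out here.

```python
# Sketch of a deterministic latency check an agent could run after a change.
# Span data is inline for illustration; a real check would query the local
# per-worktree trace/metrics store.

def violations(spans, budget_ms):
    """Return spans in critical user journeys that exceed the latency budget."""
    return [s for s in spans if s["critical"] and s["duration_ms"] > budget_ms]


spans = [
    {"name": "checkout.load", "duration_ms": 640, "critical": True},
    {"name": "checkout.pay", "duration_ms": 2300, "critical": True},
    {"name": "admin.report", "duration_ms": 5000, "critical": False},
]

# "No span in these critical user journeys exceeds two seconds."
bad = violations(spans, budget_ms=2000)
print([s["name"] for s in bad])  # ['checkout.pay']
```

Because the check is pass/fail rather than a judgment call, the agent can loop on it unattended: change code, re-run the journey, re-query, repeat.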
4. Entropy Management and Garbage Collection
Full agent autonomy introduces entropy. Codex replicates existing patterns, even suboptimal ones. Over time, this leads to drift.
OpenAI initially spent every Friday (20% of the week) cleaning up “AI slop.” That didn’t scale.
Instead, they encoded golden principles directly into the repository and built a recurring cleanup process:
- Background Codex tasks scan for deviations, update quality grades, and open targeted refactoring PRs
- Most of these PRs are reviewed in under a minute and auto-merged
- Technical debt is treated like a high-interest loan, paid down daily in small increments rather than left to accumulate
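The recurring cleanup process can be pictured as a scan that grades sources against a few golden principles and emits targets for a refactoring PR. The specific rules below (a file-size cap, a banned pattern) are illustrative stand-ins, not OpenAI's actual checks.

```python
# Sketch of a recurring "garbage collection" scan: grade source files against
# golden principles and emit refactoring targets. Rules here are illustrative.

MAX_LINES = 400
BANNED = ["# type: ignore", "TODO(hack)"]


def grade(path: str, text: str) -> list[str]:
    """Return the quality issues found in one file."""
    issues = []
    if text.count("\n") + 1 > MAX_LINES:
        issues.append(f"{path}: exceeds {MAX_LINES} lines, split the module")
    for pattern in BANNED:
        if pattern in text:
            issues.append(f"{path}: contains banned pattern {pattern!r}")
    return issues


# Example run over an in-memory "repo"; a real task would walk the worktree
# and open one targeted refactoring PR per cluster of issues.
repo = {
    "service/billing.py": "def charge():\n    pass\n",
    "service/legacy.py": "x = 1  # type: ignore\n",
}
report = [issue for path, text in repo.items() for issue in grade(path, text)]
print(report)
```

Run as a background agent task, each finding becomes a small, reviewable PR rather than a Friday-sized cleanup.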
Mitchell Hashimoto’s Six-Stage Framework
Mitchell Hashimoto’s AI adoption journey provides a practical roadmap for individual practitioners.
| Stage | Action | Description |
|---|---|---|
| 1 | Drop the Chatbot | Move from chat UIs to agents that can read files, execute programs, and make HTTP requests |
| 2 | Reproduce Your Own Work | Do the task manually, then with an agent. Learn what agents excel at and where they fall short |
| 3 | End-of-Day Agents | Deploy agents during the last 30 minutes of your workday for research and exploratory tasks |
| 4 | Outsource the Slam Dunks | Delegate high-confidence tasks to agents while you focus on deep work. Turn off notifications |
| 5 | Engineer the Harness | Every time an agent makes a mistake, build a system to prevent it from happening again |
| 6 | Always Have an Agent Running | Maintain continuous background agent work, starting at 10-20% of your workday |
The key insight is that Stage 5, Harness Engineering, compounds. Every improvement applies to every future agent run.
Practical Steps to Get Started
How do you begin applying Harness Engineering to your own projects?
Start Today
- Create an AGENTS.md (or CLAUDE.md): Document your project’s conventions, forbidden patterns, and test procedures. Update it immediately when an agent makes a mistake
- Review your pre-commit hooks: Ensure linters, formatters, and type checks run locally, not just in CI. These provide instant feedback to agents
- Invest in test coverage: Tests are the foundation agents use to verify correctness. Without tests, agents can’t validate their own work
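The value of local checks is that the agent gets one consolidated, machine-readable verdict instead of waiting on CI. A minimal sketch of such a check runner is below; the individual checks are stand-ins, and in a real hook each would shell out to your actual linter, formatter, and type checker.

```python
# Sketch of a local check runner giving an agent one parseable verdict.
# The lambdas are stand-ins for real tool invocations (see comments).

def run_checks(checks):
    """Run named checks; return (ok, report) so an agent can parse the result."""
    failures = [name for name, check in checks if not check()]
    ok = not failures
    report = "all checks passed" if ok else "failed: " + ", ".join(failures)
    return ok, report


checks = [
    ("lint", lambda: True),    # e.g. subprocess.run(["ruff", "check", "."])
    ("types", lambda: False),  # e.g. subprocess.run(["mypy", "src/"])
    ("tests", lambda: True),   # e.g. subprocess.run(["pytest", "-q"])
]
ok, report = run_checks(checks)
print(ok, "|", report)  # False | failed: types
```

Wired into a pre-commit hook, this turns every failed check into instant feedback the agent can act on in the same run.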
Medium-Term Investments
- Enforce architectural constraints mechanically: Use custom linters or scripts to validate dependency directions, file size limits, and naming conventions
- Structure your documentation: Split documentation by purpose with cross-references, rather than maintaining one giant file
- Make application behavior visible to agents: Give agents access to logs and metrics, not just test results
Team Adoption
- Treat agent mistakes as incidents: Conduct postmortems and improve the harness when agents fail
- Schedule regular garbage collection: Periodically audit agent-generated code for quality and detect pattern deviations
Redefining Code Review: Spec-Driven Quality Assurance
If Harness Engineering is the discipline of designing the agent’s execution environment, its most visible impact is on code review. Ankit Jain, CEO of Aviator, makes a sharp argument in How to Kill the Code Review about how review processes must transform in the AI agent era.
Traditional Code Review Becomes a Bottleneck
The productivity gains from AI agents create a serious bottleneck in review processes. Teams with high AI adoption report 98% more PR merges but 91% longer PR review times. In an era where agents produce code at scale, having humans read every diff line-by-line is unsustainable.
This problem surfaced in OpenAI’s own Harness Engineering practice. Processing 1,500 PRs over five months, with “most PRs reviewed in under a minute and auto-merged,” is fundamentally different from traditional code review.
From “Is the Code Correct?” to “Is the Spec Correct?”
Ankit Jain poses a fundamental question: if AI generates the code and AI reviews it, what’s the point of a human staring at a diff in a review UI?
The answer is to shift human attention upstream.
| Traditional Review | AI Agent Era Review |
|---|---|
| Read code lines and verify correctness | Verify that specs and constraints are correctly defined |
| Focus on implementation details | Focus on intent and acceptance criteria |
| Review then merge | Auto-merge after passing deterministic verification |
| Humans as gatekeepers | Systems as gatekeepers, humans as architects |
This aligns perfectly with Harness Engineering’s “Humans steer. Agents execute.” philosophy. The spec becomes the source of truth, and code becomes an artifact of the spec.
Defense in Depth: Replacing Review with Systems
Ankit Jain proposes a five-layer verification model to replace human code review. These layers complement the four pillars of Harness Engineering.
- Multi-solution comparison: Have multiple agents implement different approaches, then select the one that passes the most verification steps. Compete rather than gamble on a single solution
- Deterministic guardrails: Tests, type checks, and contract verification — fact-based, opinion-free checks. Harness Engineering’s architectural constraints and custom linters live here
- Human-defined acceptance criteria: Use behavior-driven development (BDD) frameworks to define verification criteria from specs before implementation. Acceptance criteria should never be invented after implementation
- Structured permission systems: Minimize agent access scope and auto-trigger human review for specific patterns (security-related changes, database migrations, etc.)
- Adversarial verification: Separate responsibilities between implementation agents and review agents, creating a mutual oversight structure
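The structured-permission layer, in particular, lends itself to a simple mechanical gate: classify each change set by what it touches and route sensitive changes to mandatory human review. The sketch below uses hypothetical path patterns; the idea, not the pattern list, is the point.

```python
# Sketch of a structured permission check: route a change set to auto-merge
# or mandatory human review based on what it touches. Patterns are illustrative.
import fnmatch

SENSITIVE = ["migrations/*", "auth/*", "*.tf", "secrets/*"]


def review_route(changed_files):
    """Return 'human-review' if any changed file matches a sensitive pattern."""
    for path in changed_files:
        if any(fnmatch.fnmatch(path, pat) for pat in SENSITIVE):
            return "human-review"
    return "auto-merge"


print(review_route(["service/billing.py", "docs/plan.md"]))  # auto-merge
print(review_route(["migrations/0042_add_index.sql"]))       # human-review
```

Checks like this make the "systems as gatekeepers, humans as architects" split from the table above enforceable rather than aspirational.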
When these layers function correctly, the need for humans to read code line-by-line drops dramatically. The human job shifts from “reading code” to designing and improving this defense-in-depth system itself — which is precisely Harness Engineering.
Ship Fast, Observe, Roll Back
Instead of the traditional “review then merge” model, Ankit Jain advocates for “ship fast, observe everything, roll back quickly.”
This is the same philosophy behind OpenAI’s observability stack (Victoria Logs / Metrics / Traces) and their entropy management through background scans and automated refactoring PRs. The quality checkpoint is moving from “a gate before merge” to “continuous observation and feedback.”
What Lies Ahead
OpenAI is candid in their conclusion: they’re still learning.
Building software still demands discipline, but that discipline shows up more in the scaffolding than in the code. The tooling, abstractions, and feedback loops that keep the codebase coherent are increasingly important.
The engineer’s value is shifting from “how fast you write code” to “how well you design environments where agents produce reliable results.” As the transformation of code review shows, the human intervention point is moving from implementation details to upstream spec, constraint, and verification design. This isn’t just a tool upgrade. It’s a redefinition of the software engineering profession itself.
Harness Engineering is the concept at the core of that redefinition.
That’s all from the Gemba on breaking down Harness Engineering, the new discipline for the AI agent era.