Harness Engineering — The New Discipline Powering Software Development in the AI Agent Era

Tadashi Shigeoka · Sun, February 22, 2026

AI agents writing code is no longer a novelty. It’s becoming standard practice. But throwing prompts at an agent doesn’t produce production-quality software. For agents to deliver reliable results consistently, you need to design the environment they work in: the constraints, tools, documentation, and feedback loops that keep them on track.

The systematic approach to designing that environment is called Harness Engineering.

What is Harness Engineering?

Harness Engineering is the discipline of systematically building constraints, tools, documentation, and feedback loops that enable AI coding agents to do reliable work.

The term was popularized by HashiCorp co-founder Mitchell Hashimoto, who proposed a six-stage AI adoption journey and named stage five “Engineer the Harness.”

Anytime you find an agent makes a mistake, you take the time to engineer a solution such that the agent never makes that mistake again. That’s Harness Engineering.

— Mitchell Hashimoto

The concept gained industry-wide attention when OpenAI published a detailed article in February 2026 describing how they built roughly one million lines of code with Codex agents, with zero lines written by hand. Martin Fowler introduced Birgitta Böckeler’s analysis as “a valuable framing of a key part of AI-enabled software development.”

Prompt → Context → Harness: Three Layers

To understand Harness Engineering, it helps to see where it sits relative to related concepts.

Layer               | Focus                                                              | Key Question
------------------- | ------------------------------------------------------------------ | ---------------------------------------------
Prompt Engineering  | Optimizing the instructional text sent to the LLM                  | “What do we tell it?”
Context Engineering | Managing all tokens the LLM receives (tools, RAG, schemas, memory) | “What do we show it?”
Harness Engineering | System-wide constraints, feedback, and improvement cycles          | “What do we prevent, measure, and control?”

While Prompt and Context Engineering optimize the quality of a single inference, Harness Engineering ensures ongoing quality at the system level.

OpenAI’s Experiment: One Million Lines, Zero Hand-Written

Ryan Lopopolo of OpenAI shared a detailed account of a five-month experiment building an internal product entirely with Codex agents.

By the Numbers

  • Code volume: ~1 million lines (application logic, infrastructure, tooling, documentation)
  • Hand-written code: 0 lines
  • Team size: Started with 3 engineers, grew to 7
  • Throughput: Average 3.5 PRs per engineer per day
  • Total PRs: ~1,500 over five months
  • Estimated speed: Built in roughly one-tenth the time hand-writing would have taken

Philosophy: Humans Steer. Agents Execute.

The team’s guiding principle was clear: humans design environments, specify intent, and build feedback loops; agents write the code. The engineer’s job shifted from implementation to system design.

Early progress was slow, not because Codex was incapable, but because the environment was underspecified. The agent lacked tools, abstractions, and internal structure. When something failed, the answer was never “try harder.” It was always: “What capability is missing, and how do we make it both legible and enforceable for the agent?”

The Four Pillars of a Harness

Synthesizing insights from OpenAI, Mitchell Hashimoto, and Birgitta Böckeler, a harness consists of four pillars.

1. Architecture as Guardrails

Agents flounder in unconstrained environments. Paradoxically, stricter constraints produce more reliable agent output.

OpenAI adopted a rigid layered architecture:

Types → Config → Repo → Service → Runtime → UI

Each business domain can only depend “forward” through this fixed layer sequence. Cross-cutting concerns (auth, telemetry, feature flags) enter through a single explicit interface: Providers. These constraints are enforced mechanically via custom linters and structural tests.

This is the kind of architecture you usually postpone until you have hundreds of engineers. With coding agents, it’s an early prerequisite: the constraints are what allows speed without decay or architectural drift.

— Ryan Lopopolo, OpenAI
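Mechanical enforcement of this kind can start as a small custom lint. Below is a minimal Python sketch of a structural check that only allows dependencies to flow toward earlier layers in the sequence; the `<domain>.<layer>.<module>` naming scheme is an assumption for illustration, not OpenAI’s actual convention.

```python
# Minimal structural lint: a module may only depend on its own layer or on
# earlier layers in the fixed sequence ("forward" dependencies only).
# The "<domain>.<layer>.<module>" naming scheme here is hypothetical.
LAYERS = ["types", "config", "repo", "service", "runtime", "ui"]
RANK = {name: i for i, name in enumerate(LAYERS)}

def layer_of(module: str) -> int:
    """Return the layer rank of a dotted module path like 'billing.service.api'."""
    for part in module.split("."):
        if part in RANK:
            return RANK[part]
    raise ValueError(f"module {module!r} is not in any known layer")

def check_import(importer: str, imported: str) -> bool:
    """Allow an import only if the imported module sits at or below the importer's layer."""
    return layer_of(imported) <= layer_of(importer)
```

A CI step or pre-commit hook would walk the repository’s import graph and fail the build on any `check_import` violation, turning the layering rule from convention into a hard gate.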

2. Documentation as System of Record

One of OpenAI’s most important lessons: treat AGENTS.md as a table of contents, not an encyclopedia.

The “one big AGENTS.md” approach failed predictably:

  • Context is a scarce resource: a giant instruction file crowds out the task and code, so agents miss key constraints
  • When everything is “important,” nothing is: agents fall back to local pattern-matching
  • Monolithic docs go stale instantly: agents can’t tell what’s current, and humans stop maintaining them

Instead, AGENTS.md was kept to ~100 lines as a “map,” pointing into a structured docs/ directory treated as the system of record:

File / Directory           | Role
-------------------------- | -------------------------------------------------------------------
AGENTS.md                  | Table of contents (~100 lines); pointers to deeper sources of truth
ARCHITECTURE.md            | Top-level map of domains and package layering
docs/design-docs/          | Design documents and core beliefs
docs/exec-plans/active/    | Active execution plans
docs/exec-plans/completed/ | Completed execution plans
docs/product-specs/        | Product specifications
docs/references/           | Reference materials for LLMs
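As a concrete illustration, a table-of-contents style AGENTS.md might look like the sketch below. The entries and commands are hypothetical, invented to match the directory layout above; they are not taken from OpenAI’s actual file.

```markdown
# AGENTS.md — a map, not an encyclopedia (~100 lines max)

## Ground rules
- Follow the layering in ARCHITECTURE.md; dependencies flow forward only.
- Run the lint and test suite locally before opening a PR.
- Never silence a failing check; fix the underlying issue or flag it.

## Where the truth lives
- Architecture map: ARCHITECTURE.md
- Design decisions and core beliefs: docs/design-docs/
- Work in progress: docs/exec-plans/active/
- Product behavior: docs/product-specs/
- Reference material: docs/references/
```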

Design docs, execution plans, and technical debt trackers are all version-controlled, enabling agents to operate autonomously without external context.

Mitchell Hashimoto recommends a similar approach: update AGENTS.md every time an agent makes a mistake, accumulating implicit prompting improvements. Documentation becomes a living feedback loop, not a static artifact.

3. Observability and Feedback Loops (Application Legibility)

Agents need to understand not just the code but also the running behavior of the application.

OpenAI made the application legible to Codex by:

  • Wiring the Chrome DevTools Protocol into the agent runtime for DOM snapshots, screenshots, and navigation, enabling bug reproduction and fix validation
  • Providing a local observability stack (Victoria Logs / Metrics / Traces) per git worktree, queryable via LogQL and PromQL
  • Making the app bootable per git worktree, so Codex could launch and validate one instance per change

This enabled prompts like “ensure service startup completes in under 800ms” or “no span in these critical user journeys exceeds two seconds.” Single Codex runs regularly work on tasks for over six hours, often while humans are sleeping.
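Once metrics are queryable, budgets like these become scriptable checks rather than manual inspections. Below is a minimal Python sketch that evaluates a Prometheus-style instant-query result against a latency budget; the metric name, labels, and sample payload are invented for illustration (a real harness would fetch the JSON from the local metrics stack with a PromQL query).

```python
# Evaluate a latency budget against a Prometheus-style instant-query result.
# The metric name and payload below are hypothetical; a real harness would
# fetch this JSON from the local observability stack via PromQL.
def check_latency_budget(query_result: dict, budget_seconds: float) -> list[str]:
    """Return the series that exceed the budget (empty list means pass)."""
    violations = []
    for series in query_result["data"]["result"]:
        labels = series["metric"]
        _, value = series["value"]  # instant query value: [timestamp, "value-as-string"]
        if float(value) > budget_seconds:
            violations.append(f"{labels.get('span', '?')}: {value}s")
    return violations

# Hypothetical result for a query like:
#   max_over_time(span_duration_seconds{journey="checkout"}[5m])
sample = {"data": {"result": [
    {"metric": {"span": "load_cart"}, "value": [0, "0.41"]},
    {"metric": {"span": "submit_order"}, "value": [0, "2.35"]},
]}}
print(check_latency_budget(sample, budget_seconds=2.0))  # prints ['submit_order: 2.35s']
```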

4. Entropy Management and Garbage Collection

Full agent autonomy introduces entropy. Codex replicates existing patterns, even suboptimal ones. Over time, this leads to drift.

OpenAI initially spent every Friday (20% of the week) cleaning up “AI slop.” That didn’t scale.

Instead, they encoded golden principles directly into the repository and built a recurring cleanup process:

  • Background Codex tasks scan for deviations, update quality grades, and open targeted refactoring PRs
  • Most of these PRs are reviewed in under a minute and auto-merged
  • Technical debt is treated like a high-interest loan, paid down daily in small increments rather than left to accumulate
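Automated cleanup can begin as nothing more than a pattern scan over the codebase. The sketch below encodes two hypothetical “golden principles” as forbidden patterns and reports violations; OpenAI’s actual scans run as background Codex tasks that go further and open targeted refactoring PRs.

```python
import re

# Hypothetical "golden principles" encoded as forbidden patterns; a real
# harness would encode its own rules and hand each violating file to an
# agent to open a refactoring PR.
FORBIDDEN = {
    r"\bprint\(": "use the structured logger instead of print",
    r"#\s*type:\s*ignore": "fix the type error instead of silencing it",
}

def scan(path: str, text: str) -> list[str]:
    """Return one finding per violated rule in a file's contents."""
    findings = []
    for pattern, rule in FORBIDDEN.items():
        if re.search(pattern, text):
            findings.append(f"{path}: {rule}")
    return findings
```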

Mitchell Hashimoto’s Six-Stage Framework

Mitchell Hashimoto’s AI adoption journey provides a practical roadmap for individual practitioners.

Stage | Action                       | Description
----- | ---------------------------- | -------------------------------------------------------------------------------------------
1     | Drop the Chatbot             | Move from chat UIs to agents that can read files, execute programs, and make HTTP requests
2     | Reproduce Your Own Work      | Do the task manually, then with an agent; learn what agents excel at and where they fall short
3     | End-of-Day Agents            | Deploy agents during the last 30 minutes of your workday for research and exploratory tasks
4     | Outsource the Slam Dunks     | Delegate high-confidence tasks to agents while you focus on deep work; turn off notifications
5     | Engineer the Harness         | Every time an agent makes a mistake, build a system to prevent it from happening again
6     | Always Have an Agent Running | Maintain continuous background agent work, starting at 10–20% of your workday

The key insight is that Stage 5, Harness Engineering, compounds. Every improvement applies to every future agent run.

Practical Steps to Get Started

How do you begin applying Harness Engineering to your own projects?

Start Today

  1. Create an AGENTS.md (or CLAUDE.md): Document your project’s conventions, forbidden patterns, and test procedures. Update it immediately when an agent makes a mistake
  2. Review your pre-commit hooks: Ensure linters, formatters, and type checks run locally, not just in CI. These provide instant feedback to agents
  3. Invest in test coverage: Tests are the foundation agents use to verify correctness. Without tests, agents can’t validate their own work

Medium-Term Investments

  1. Enforce architectural constraints mechanically: Use custom linters or scripts to validate dependency directions, file size limits, and naming conventions
  2. Structure your documentation: Split documentation by purpose with cross-references, rather than maintaining one giant file
  3. Make application behavior visible to agents: Give agents access to logs and metrics, not just test results

Team Adoption

  1. Treat agent mistakes as incidents: Conduct postmortems and improve the harness when agents fail
  2. Schedule regular garbage collection: Periodically audit agent-generated code for quality and detect pattern deviations

Redefining Code Review: Spec-Driven Quality Assurance

If Harness Engineering is the discipline of designing the agent’s execution environment, its most visible impact is on code review. Ankit Jain, CEO of Aviator, makes a sharp argument in How to Kill the Code Review about how review processes must transform in the AI agent era.

Traditional Code Review Becomes a Bottleneck

The productivity gains from AI agents create a serious bottleneck in review processes. Teams with high AI adoption report 98% more PR merges but 91% longer PR review times. In an era where agents produce code at scale, having humans read every diff line-by-line is unsustainable.

This problem surfaced in OpenAI’s own Harness Engineering practice. Processing 1,500 PRs over five months, with “most PRs reviewed in under a minute and auto-merged,” is fundamentally different from traditional code review.

From “Is the Code Correct?” to “Is the Spec Correct?”

Ankit Jain poses a fundamental question: if AI generates the code and AI reviews it, what’s the point of a human staring at a diff in a review UI?

The answer is to shift human attention upstream.

Traditional Review                     | AI Agent Era Review
-------------------------------------- | ----------------------------------------------------------
Read code lines and verify correctness | Verify that specs and constraints are correctly defined
Focus on implementation details        | Focus on intent and acceptance criteria
Review, then merge                     | Auto-merge after passing deterministic verification
Humans as gatekeepers                  | Systems as gatekeepers; humans as architects

This aligns perfectly with Harness Engineering’s “Humans steer. Agents execute.” philosophy. The spec becomes the source of truth, and code becomes an artifact of the spec.

Defense in Depth: Replacing Review with Systems

Ankit Jain proposes a five-layer verification model to replace human code review. These layers complement the four pillars of Harness Engineering.

  1. Multi-solution comparison: Have multiple agents implement different approaches, then select the one that passes the most verification steps. Compete rather than gamble on a single solution
  2. Deterministic guardrails: Tests, type checks, and contract verification — fact-based, opinion-free checks. Harness Engineering’s architectural constraints and custom linters live here
  3. Human-defined acceptance criteria: Use behavior-driven development (BDD) frameworks to define verification criteria from specs before implementation. Acceptance criteria should never be invented after implementation
  4. Structured permission systems: Minimize agent access scope and auto-trigger human review for specific patterns (security-related changes, database migrations, etc.)
  5. Adversarial verification: Separate responsibilities between implementation agents and review agents, creating a mutual oversight structure
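Layers 1 and 2 combine naturally into a selection loop: run every candidate implementation through the same deterministic checks and keep the highest scorer. A minimal Python sketch (the string candidates and toy checks stand in for real implementations and real verifiers like test suites and type checkers):

```python
from typing import Callable

# Deterministic guardrails as pass/fail predicates over a candidate.
# Candidates are plain strings here for illustration; in practice each
# would be a patch or branch, and each check a real verifier run.
Check = Callable[[str], bool]

def select_best(candidates: list[str], checks: list[Check]) -> tuple[str, int]:
    """Return the candidate that passes the most checks, with its score.

    Ties break lexicographically on the candidate text."""
    score, best = max((sum(check(c) for check in checks), c) for c in candidates)
    return best, score
```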

When these layers function correctly, the need for humans to read code line-by-line drops dramatically. The human job shifts from “reading code” to designing and improving this defense-in-depth system itself — which is precisely Harness Engineering.

Ship Fast, Observe, Roll Back

Instead of the traditional “review then merge” model, Ankit Jain advocates for “ship fast, observe everything, roll back quickly.”

This is the same philosophy behind OpenAI’s observability stack (Victoria Logs / Metrics / Traces) and their entropy management through background scans and automated refactoring PRs. The quality checkpoint is moving from “a gate before merge” to “continuous observation and feedback.”

What Lies Ahead

OpenAI is candid in their conclusion: they’re still learning.

Building software still demands discipline, but that discipline now shows up in the scaffolding more than in the code itself. The tooling, abstractions, and feedback loops that keep the codebase coherent are increasingly important.

The engineer’s value is shifting from “how fast you write code” to “how well you design environments where agents produce reliable results.” As the transformation of code review shows, the human intervention point is moving from implementation details to upstream spec, constraint, and verification design. This isn’t just a tool upgrade. It’s a redefinition of the software engineering profession itself.

Harness Engineering is the concept at the core of that redefinition.

That’s all from the Gemba: a breakdown of Harness Engineering, the new discipline for the AI agent era.

References