Specification-Driven Code Generation with LLMs — An Empirical Study on Telling AI What to Build

Tadashi Shigeoka ·  Sat, January 10, 2026

LLM-based code generation tools like Claude Code, Codex, Cursor, and GitHub Copilot have dramatically boosted developer productivity. Yet their effectiveness depends heavily on prompt quality, and there are several fundamental challenges:

  • Prompt ambiguity: Natural language prompts are open to interpretation, leading to code that doesn’t match expectations
  • High iteration costs: Getting correct code often requires multiple prompt revisions
  • Lack of structured process: Current tools assume free-form conversation, lacking systematic development methodology

What if a specification-driven approach could address these issues and improve both quality and efficiency of LLM code generation? That’s the question posed by a study accepted at SANER 2026. This article introduces the paper “Understanding Specification-Driven Code Generation with LLMs: An Empirical Study Design” by Giovanni Rosa et al.

Methodology — The CURRANTE Tool and 3-Phase Workflow

At the core of this research is CURRANTE, a Visual Studio Code extension for human-in-the-loop LLM code generation. CURRANTE enforces a three-phase workflow:

Phase 1: Specification

Developers define requirements in a structured format rather than free-form natural language. This includes:

  • Function signatures and parameter types
  • Input constraints and valid ranges
  • Expected output formats
  • Edge cases and error handling requirements
  • Performance constraints
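
The paper does not publish CURRANTE's exact specification format, but a minimal sketch of this kind of structured spec, using a hypothetical Python representation (all field and function names below are illustrative), might look like:

```python
from dataclasses import dataclass, field

# Hypothetical structured specification; CURRANTE's actual format may differ.
@dataclass
class FunctionSpec:
    name: str
    signature: str                                # signature and parameter types
    input_constraints: list[str] = field(default_factory=list)
    output_format: str = ""
    edge_cases: list[str] = field(default_factory=list)
    error_handling: list[str] = field(default_factory=list)
    performance: str = ""

spec = FunctionSpec(
    name="moving_average",
    signature="moving_average(values: list[float], window: int) -> list[float]",
    input_constraints=["window >= 1", "window <= len(values)"],
    output_format="list of floats of length len(values) - window + 1",
    edge_cases=["window == len(values)", "single-element values"],
    error_handling=["raise ValueError if window < 1"],
    performance="O(n) time",
)
```

The point is that each bullet above becomes an explicit, machine-readable field rather than a sentence buried in a free-form prompt.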

Phase 2: Tests

Test suites are generated and refined based on the specification. The LLM proposes test cases, and developers review and modify them. This phase often reveals gaps in the specification.
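
As a sketch of what this phase produces, here are tests that an LLM might propose for a hypothetical moving_average spec and that a developer would then review. The placeholder implementation stands in for the LLM-generated code so the tests are runnable; none of these names come from the paper.

```python
# Placeholder implementation standing in for LLM-generated code under test.
def moving_average(values: list[float], window: int) -> list[float]:
    if window < 1:
        raise ValueError("window must be >= 1")
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

# Happy path from the spec's output format.
assert moving_average([1.0, 2.0, 3.0, 4.0], 2) == [1.5, 2.5, 3.5]
# Edge case: window equals the input length.
assert moving_average([2.0, 4.0], 2) == [3.0]
# Error handling: invalid window must raise, as the spec requires.
try:
    moving_average([1.0], 0)
except ValueError:
    pass
else:
    raise AssertionError("expected ValueError for window < 1")
```

Reviewing tests like these is exactly where specification gaps surface: if you cannot decide what the expected value should be, the spec was ambiguous.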

Phase 3: Function

The LLM generates code that satisfies both the specification and tests. Automated test execution provides immediate feedback, enabling accept-or-regenerate decisions.

```mermaid
flowchart LR
    A[Define Spec] --> B[Generate & Refine Tests]
    B --> C[Generate Code]
    C --> D{Run Tests}
    D -->|Pass| E[Accept]
    D -->|Fail| F[Revise Spec or Tests]
    F --> B
```

This workflow applies TDD (Test-Driven Development) principles to LLM code generation: lock down the specification first, make it verifiable through tests, then generate code that satisfies both.
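
The accept-or-regenerate loop could be driven by something like the sketch below. The helpers generate_code, run_tests, and revise are hypothetical stand-ins, not CURRANTE's actual API.

```python
# Hypothetical driver for the spec -> tests -> code loop.
# generate_code, run_tests, and revise are illustrative callables:
#   generate_code(spec, tests) -> source code string
#   run_tests(code, tests)     -> (all_passed: bool, failures: list)
#   revise(spec, tests, failures) -> (new_spec, new_tests)
def generation_loop(spec, tests, generate_code, run_tests, revise, max_iters=5):
    for attempt in range(1, max_iters + 1):
        code = generate_code(spec, tests)          # Phase 3: LLM generates code
        passed, failures = run_tests(code, tests)  # automated feedback
        if passed:
            return code, attempt                   # accept
        spec, tests = revise(spec, tests, failures)  # loop back to spec/tests
    raise RuntimeError("no accepted solution within the iteration budget")
```

Bounding the loop with max_iters mirrors the study's interest in iteration count as an efficiency metric: every pass through revise is a measurable human intervention.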

Research Questions

The central question is: how does human intervention in specification and test refinement influence the quality and dynamics of LLM-generated code? The study analyzes this along four dimensions:

  • Effectiveness: test pass rates, code quality scores (cyclomatic complexity, maintainability index), verification against ground-truth solutions
  • Efficiency: generation latency (spec input to first output), task completion time, iteration count
  • Iteration patterns: types of modifications (logic fixes, API corrections, edge-case handling), failure classification
  • Human intervention: ratio of auto-generated to manually edited code, edit distance between initial and final code, intervention classification (minor fixes, major restructuring, complete rewrites)
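
One of those metrics, the edit distance between initial and final code, can be approximated with Python's standard difflib. This is a proxy sketch, not the paper's exact measurement:

```python
import difflib

def edit_similarity(initial: str, final: str) -> float:
    """Similarity ratio in [0, 1]; 1.0 means the human made no edits."""
    return difflib.SequenceMatcher(None, initial, final).ratio()

# Illustrative: an LLM draft vs. a human-annotated final version.
initial = "def add(a, b):\n    return a + b\n"
final = "def add(a: int, b: int) -> int:\n    return a + b\n"
score = edit_similarity(initial, final)  # closer to 1.0 = fewer manual edits
```

A low ratio on a task would flag heavy manual restructuring, which is exactly the signal the intervention-classification dimension tries to capture.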

Experiment Design

The study recruits 32 professional developers (2+ years of experience) to solve medium-difficulty problems from LiveCodeBench, a continuously updated benchmark of real-world programming problems sourced from platforms like LeetCode and Codeforces.

The experiment defines four conditions with varying levels of supplementary information to compare which inputs most affect code quality:

  • Condition A: specification only
  • Condition B: specification + reference implementation samples
  • Condition C: specification + test cases
  • Condition D: specification + reference implementation + test cases

CURRANTE uses IDE-level keystroke analysis to distinguish auto-generated code from manual input, capturing fine-grained interaction logs that enable quantitative evaluation across all four dimensions above.

Practical Takeaways

This study has been accepted as a Stage 1 Registered Report: the research plan (hypotheses, methodology, and analysis methods) was peer-reviewed and approved before data collection, and publication is guaranteed regardless of results. This structurally prevents HARKing (Hypothesizing After the Results are Known) and strengthens the credibility of whatever findings emerge.

Results are yet to come, but the research design itself offers actionable insights.

1. The Value of Writing Specifications

Communicating “what to build” to an LLM in a structured way is an evolution of prompt engineering. CURRANTE’s approach goes further by using typed specifications rather than free-form prompts as input.

In practice, clarifying the following before asking an LLM to generate code can improve output quality:

  • Input/output types and constraints
  • Edge case enumeration
  • Expected error handling behavior
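
Even without a dedicated tool, this can be as simple as templating those three items into the prompt. A hypothetical template (the field names and example task are illustrative, not a CURRANTE format):

```python
# Hypothetical spec-style prompt template; fields are illustrative.
SPEC_PROMPT = """\
Implement: {signature}
Input constraints: {constraints}
Expected output: {output}
Edge cases: {edge_cases}
Error handling: {errors}
Return only the function definition."""

prompt = SPEC_PROMPT.format(
    signature="parse_price(text: str) -> float",
    constraints="text is a non-empty string such as '$1,299.99'",
    output="the price as a float, e.g. 1299.99",
    edge_cases="missing currency symbol; thousands separators",
    errors="raise ValueError on non-numeric input",
)
```

Compared with a one-line request ("write a price parser"), every field here removes a degree of freedom the LLM would otherwise fill with a guess.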

2. Test-First LLM Usage

The spec-test-implement ordering is an effective pattern for LLM code generation. Writing tests first (or having the LLM generate them for your review) enables automated verification of implementation correctness.

3. Optimizing Human Intervention Points

Rather than delegating everything to the LLM, it’s important to identify where humans should intervene. This research tests the hypothesis that specification definition and test review are the highest-impact intervention points.

Conclusion

Giovanni Rosa et al.’s study aims to shift LLM code generation from a “throw a prompt and hope” style to a structured, specification-driven process.

CURRANTE’s three-phase workflow (spec → tests → implementation) reinterprets TDD principles for the LLM era and is an attempt to redefine “the human role” in AI-assisted development.

Results are yet to come, but the direction suggested by the research design itself — constrain LLMs with clear specifications and tests rather than letting them write code freely — is a principle you can apply in your daily development right now.

That’s all for this introduction to the empirical study on specification-driven code generation.

References

  • Giovanni Rosa et al. “Understanding Specification-Driven Code Generation with LLMs: An Empirical Study Design.” SANER 2026 (accepted as a Stage 1 Registered Report).