How to Build a Deterministic Repeatable AI Coding Harness
A deterministic AI coding harness comes down to controlling five variables: the model version, the temperature/seed parameters, the system prompt, the context window contents, and the output validation. Pin all five, and you get repeatable results across runs and team members. The harness itself is a thin orchestration layer, typically a Python or TypeScript script, that assembles a frozen prompt, calls the API with fixed parameters, and validates the output against a schema or test suite before writing anything to disk.
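One way to make the pinning explicit is to gather the frozen settings into a single immutable config object that also produces a fingerprint for cache keys and logs. This is a sketch; the field names and version strings are illustrative, and the context contents and validation schema are pinned separately (via the manifest and schema files described below):

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class HarnessConfig:
    """Every pinned input that can change the model's output, in one place."""
    model: str = "claude-sonnet-4-20250514"
    temperature: float = 0.0
    system_prompt_version: str = "v2.4.1"
    schema_path: str = "schemas/codegen_output.json"

    def fingerprint(self) -> str:
        # Short hash of the pinned settings, usable as a cache-key component.
        raw = f"{self.model}|{self.temperature}|{self.system_prompt_version}|{self.schema_path}"
        return hashlib.sha256(raw.encode()).hexdigest()[:12]
```

Because the dataclass is frozen, any change to a pinned setting requires constructing a new config, which in turn changes the fingerprint and invalidates stale cache entries.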
The Core Harness: Pinned Model, Seed, and Structured Output
The most impactful thing you can do is pin the model version and set temperature to 0, plus a fixed seed where the API offers one. OpenAI exposes a seed parameter; Anthropic does not, and the determinism guarantees differ either way (see the gotchas section). The following minimal harness calls Claude with deterministic settings and hashes its inputs for caching.
```python
import hashlib

import anthropic

client = anthropic.Anthropic()

# Freeze every input to the model.
MODEL = "claude-sonnet-4-20250514"
SYSTEM_PROMPT = open("prompts/codegen_system.md").read()
TEMPERATURE = 0.0

def generate_code(user_prompt: str, context_files: list[str]) -> tuple[str, str]:
    # Build a deterministic context block from sorted file paths.
    context = "\n".join(open(f).read() for f in sorted(context_files))
    # Hash inputs so you can cache and compare across runs.
    input_hash = hashlib.sha256(
        f"{MODEL}{SYSTEM_PROMPT}{context}{user_prompt}".encode()
    ).hexdigest()[:12]
    response = client.messages.create(
        model=MODEL,
        max_tokens=4096,
        temperature=TEMPERATURE,
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": f"<context>\n{context}\n</context>\n\n{user_prompt}"}],
    )
    return response.content[0].text, input_hash
```
Prompt Versioning and Frozen Context
The system prompt is code. Treat it like code: commit it to version control, tag releases, and never edit it in place. Your harness should load prompts from files, not inline strings, so diffs are visible in pull requests. The same applies to context. If your harness injects reference files into the prompt, sort them deterministically and hash the combined input. Two runs with the same hash must produce the same output; if they don't, your harness has a leak.
This directory structure enforces that discipline.
```
ai-harness/
├── prompts/
│   ├── codegen_system.md          # v-tagged system prompt
│   ├── review_system.md
│   └── CHANGELOG.md
├── schemas/
│   └── codegen_output.json        # JSON Schema for structured output
├── context/
│   └── .context_manifest.json     # lists files to inject, in order
├── cache/
│   └── {input_hash}.json          # cached responses
├── harness.py
└── tests/
    └── test_harness_determinism.py
```

The manifest and cache helpers in harness.py are short:

```python
import json
from pathlib import Path

def load_context(manifest_path: str) -> list[str]:
    # Load file list from a committed manifest for reproducibility.
    manifest = json.loads(Path(manifest_path).read_text())
    return sorted(manifest["files"])

def check_cache(input_hash: str) -> str | None:
    # Return cached output if the exact same inputs were seen before.
    cache_file = Path(f"cache/{input_hash}.json")
    if cache_file.exists():
        return json.loads(cache_file.read_text())["output"]
    return None

def write_cache(input_hash: str, output: str) -> None:
    Path("cache").mkdir(exist_ok=True)
    Path(f"cache/{input_hash}.json").write_text(
        json.dumps({"hash": input_hash, "output": output}, indent=2)
    )
```
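The determinism test in tests/ can then be as simple as hashing the assembled inputs twice and asserting the hashes match. A sketch: the helper below takes file contents as a mapping so the test needs no filesystem, and mirrors the hashing logic of the harness rather than importing it:

```python
import hashlib

def build_input_hash(model: str, system_prompt: str,
                     context_files: dict[str, str], user_prompt: str) -> str:
    # Recompute the input hash from sorted file paths, mirroring the harness.
    context = "\n".join(context_files[path] for path in sorted(context_files))
    raw = f"{model}{system_prompt}{context}{user_prompt}"
    return hashlib.sha256(raw.encode()).hexdigest()[:12]

def test_same_inputs_same_hash():
    files = {"b.py": "print('b')", "a.py": "print('a')"}
    h1 = build_input_hash("claude-sonnet-4-20250514", "sys", files, "add pagination")
    # Reversed insertion order must not change the hash, because paths are sorted.
    reordered = dict(reversed(list(files.items())))
    h2 = build_input_hash("claude-sonnet-4-20250514", "sys", reordered, "add pagination")
    assert h1 == h2
```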
Output Validation: The Gatekeeper
Deterministic inputs still don't guarantee correct outputs. Your harness must validate every response before it touches your codebase. A production harness should apply three levels of validation: schema validation (does the output have the right structure), syntax validation (does it parse as valid code), and semantic validation (does it pass your test suite). The schema check is cheap and catches the most common failure mode, where the model wraps code in markdown fences or adds conversational preamble.
```python
import ast
import subprocess
from pathlib import Path

def validate_python_output(code: str) -> bool:
    # Parse the generated code to confirm it is valid Python.
    try:
        ast.parse(code)
    except SyntaxError as e:
        print(f"Syntax error in generated code: {e}")
        return False
    return True

def run_tests_on_output(code: str, test_file: str) -> bool:
    # Write generated code to a temp file and run the test suite.
    Path("_generated_tmp.py").write_text(code)
    result = subprocess.run(
        ["python", "-m", "pytest", test_file, "-x", "--tb=short"],
        capture_output=True, text=True,
    )
    Path("_generated_tmp.py").unlink()
    return result.returncode == 0
```
Deterministic Retry with Feedback Loop
When validation fails, a good harness doesn't give up. It feeds the error back to the model as a correction round. But this retry loop itself must be deterministic: cap the number of retries, include the full error text verbatim (not a summary), and log every attempt. Without a cap, you get infinite loops that burn tokens and still fail.
```python
MAX_RETRIES = 3

def generate_with_retry(user_prompt: str, context_files: list[str], test_file: str) -> str:
    code, input_hash = generate_code(user_prompt, context_files)
    for attempt in range(MAX_RETRIES):
        if not validate_python_output(code):
            print(f"Attempt {attempt + 1}: syntax error, retrying")
            # Feed the exact failing code back for a targeted fix.
            code, _ = generate_code(
                f"{user_prompt}\n\nPrevious attempt had a syntax error. Fix it:\n```\n{code}\n```",
                context_files,
            )
            continue
        if run_tests_on_output(code, test_file):
            write_cache(input_hash, code)
            return code
        print(f"Attempt {attempt + 1}: tests failed, retrying")
        # Feed the test failure back to the model.
        code, _ = generate_code(
            f"{user_prompt}\n\nPrevious attempt failed tests. Fix it:\n```\n{code}\n```",
            context_files,
        )
    raise RuntimeError(f"Failed to generate valid code after {MAX_RETRIES} retries")
```
Tool Comparison: AI Coding Harness Approaches
Custom Python harness (shown above) gives you full control over caching, validation, and prompt versioning. This is the right choice when you need auditable, reproducible codegen for CI pipelines or regulated environments. The downside is that you maintain it yourself.
Claude Code (CLI agent) fits interactive development better. It has built-in context management through CLAUDE.md files that act as persistent memory across sessions. You can make it semi-deterministic by committing your CLAUDE.md and using --print mode in CI, but it's designed for human-in-the-loop workflows, not fully automated pipelines.
Aider takes a middle ground. It integrates directly with git, creates commits for each change, and supports --model pinning. It's excellent for iterative refactoring but less controllable than a custom harness for structured output validation.
Use a custom harness for CI/CD codegen pipelines, and Claude Code or Aider for developer-facing workflows where a human reviews every change. A CI pipeline step for the custom-harness path looks like this:
```bash
#!/usr/bin/env bash
# CI pipeline step: generate code deterministically and run tests.
set -euo pipefail

# Pin the exact harness and prompt versions.
export HARNESS_PROMPT_VERSION="v2.4.1"
export ANTHROPIC_MODEL="claude-sonnet-4-20250514"

# Run the harness with a specific task prompt.
python harness.py \
  --task prompts/tasks/add_pagination.md \
  --context context/.context_manifest.json \
  --tests tests/test_pagination.py \
  --output src/pagination.py

# Verify determinism: re-run and diff.
python harness.py \
  --task prompts/tasks/add_pagination.md \
  --context context/.context_manifest.json \
  --tests tests/test_pagination.py \
  --output src/pagination_check.py

diff src/pagination.py src/pagination_check.py
```
Common Gotchas in Deterministic AI Code Generation
Temperature 0 isn't truly deterministic. Both OpenAI and Anthropic document that temperature: 0 reduces but doesn't eliminate variation. OpenAI's seed parameter gets closer but still has caveats around infrastructure changes. Anthropic doesn't expose a seed parameter. The only way to guarantee identical output is to cache it. That's why the input hash and cache layer above aren't optional; they're essential.
Context ordering matters enormously. If you inject three files into the prompt in a different order, you get different output even with temperature 0. Always sort deterministically. Alphabetical by path is the simplest approach.
Model deprecation breaks everything. When you pin claude-sonnet-4-20250514 and Anthropic eventually retires that version, your harness breaks. Build an alert that checks model availability on a schedule and gives your team lead time to re-validate prompts against the new version.
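A sketch of such a check, assuming the Anthropic Python SDK's Models API (`client.models.list()`); the alerting hook is illustrative, and the pure membership check is split out so the logic is testable without network access:

```python
PINNED_MODEL = "claude-sonnet-4-20250514"

def is_model_available(pinned: str, available_ids: list[str]) -> bool:
    # Pure check: is the pinned model still among the listed model IDs?
    return pinned in available_ids

def check_pinned_model() -> None:
    import anthropic  # deferred so the pure check needs no SDK installed
    client = anthropic.Anthropic()
    available = [model.id for model in client.models.list()]
    if not is_model_available(PINNED_MODEL, available):
        # Swap in your real alerting channel (Slack webhook, pager, etc.).
        raise RuntimeError(f"Pinned model {PINNED_MODEL} is no longer listed by the API")
```

Run `check_pinned_model()` from a daily scheduled job so a deprecation notice surfaces well before the retirement date.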
Long context degrades repeatability. The longer your context window, the more sensitive the output becomes to tiny input changes. Keep injected context under 30% of the model's context window. If you need more, use a retrieval step to select only the relevant files, but make that retrieval step itself deterministic (embedding model pinned, similarity threshold fixed, tie-breaking by file path).
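Such a retrieval step can be sketched as a pure ranking function with a fixed threshold and path-based tie-breaking; the similarity scores are assumed to come from a pinned embedding model:

```python
def select_context_files(scored: dict[str, float], threshold: float, k: int) -> list[str]:
    # Deterministic retrieval: fixed threshold, ties broken alphabetically by path.
    candidates = [(path, score) for path, score in scored.items() if score >= threshold]
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return [path for path, _ in candidates[:k]]
```

Because ties are broken by path rather than by dict iteration order, two runs over the same scores always inject the same files in the same order.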