Confession: I spent most of last Thursday watching an agent retry the same failing test fourteen times in a row, each attempt confidently wrong in a slightly new way. It felt like coaching a junior who won’t stop guessing. By the end I had a strong opinion and a weak wallet.
So when a paper landed this week titled How Many Tries Does It Take? Iterative Self-Repair in LLM Code Generation Across Model Scales and Benchmarks, I read it the way you read a medical test result you were avoiding. The answer, happily, is less depressing than my Thursday suggested, but it comes with teeth. Most of the gain from self-repair is real, and most of it happens in the first two rounds. After that you’re mostly setting money on fire.
If you’re wiring retries into your own coding agent, or evaluating one you inherited, I think the numbers are worth sitting with. Here’s what I took away, and the retry loop I’m willing to run in production this week.
What “self-repair” actually means in a coding loop
Just so we’re on the same page: self-repair is the embarrassingly simple idea where you run the model’s code, catch the error, hand the traceback back to the model, and ask it to fix the thing. No fancy RL, no tool-calling gymnastics. Just “here’s your output, here’s what broke, try again.”
The simplest version fits in a page of Python:
def self_repair(task, model, max_attempts=5):
    history = [{"role": "user", "content": task}]
    for attempt in range(max_attempts):
        code = model.complete(history)
        result = run_sandboxed(code)
        if result.passed:
            return code, attempt + 1
        history += [
            {"role": "assistant", "content": code},
            {"role": "user", "content": f"Test failed:\n{result.error}\nPlease fix it."},
        ]
    return code, max_attempts
That’s it. That’s the whole trick the paper evaluates, across seven models from Llama 3.1 8B up to Gemini 2.5 Pro, on HumanEval and MBPP. It’s also roughly what you get out of the box with modern coding agents. If you’ve ever used Cursor’s “fix error” button or Claude Code’s retry on a failed test, you’ve used a fancier cousin of this.
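One thing I've been hand-waving there: run_sandboxed. The paper has its own harness and every agent vendor has theirs; what follows is just a minimal sketch of the shape I mean, assuming the generated code carries its own asserts and that a bare subprocess counts as a "sandbox" for you. (It probably shouldn't; more on that at the end.)
import subprocess
import sys
import tempfile
from dataclasses import dataclass

@dataclass
class RunResult:
    passed: bool
    error: str

def run_sandboxed(code: str, timeout: int = 10) -> RunResult:
    # Write the candidate (plus its asserts/tests) to a temp file and run it in a
    # separate interpreter, so a crash or infinite loop can't take the agent down.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, "-I", path],  # -I: isolated mode, ignore user site/env
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return RunResult(passed=False, error="TimeoutError: code did not finish.")
    if proc.returncode == 0:
        return RunResult(passed=True, error="")
    # stderr holds the traceback, which is exactly what the repair prompt feeds back.
    return RunResult(passed=False, error=proc.stderr.strip())
A real harness also wants resource limits, no network, and temp-file cleanup; this version is only enough to make the loop above runnable.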
The new numbers: where retries actually earn their keep
The headline is that self-repair works, universally. Every model the authors tested got better. On HumanEval the lift ranged from +4.9 to +17.1 percentage points; on the harder MBPP Sanitized it was +16.0 to +30.0. Gemini 2.5 Flash topped out at 96.3% on HumanEval, 93.8% on MBPP after five attempts. These are not rounding-error gains.
Here’s the part I care about for anyone actually paying for tokens: most of the gains concentrate in the first two rounds. Attempts three through five add a little, but the curve flattens hard. If you’ve been letting your agent spin for ten retries because “why not”, you’re paying several extra rounds’ worth of tokens for maybe one extra point of accuracy.
The error-type split is the real gift. Syntax errors and NameErrors get fixed at high rates: the model clearly sees the problem. Assertion errors (a polite way of saying “your logic is wrong”) repair at roughly 45%. That mirrors what I’ve watched in practice. If the error message is mechanical, the model nails it. If the error is “the answer is wrong”, the model often just restates the same bug in different syntax.
If you want to cross-reference this with another good recent thread on where LLM behavior breaks down under repeated pressure, Simon Willison’s working notes on LLM API abstractions have a nice discussion of how different vendors surface tool-call and server-side execution quirks, which is relevant to anyone building retry loops that cross vendor boundaries.
Where self-repair quietly runs out of runway
The paper confirms something I keep running into: self-repair is a strong local optimizer and a weak global one. It polishes typos and obvious exceptions. It does not rethink your approach.
In practice this shows up three ways.
First, when the model picks the wrong algorithm, retries don’t help. The code runs, it’s just solving the wrong problem. Your test says “expected 42, got 7”, and the model dutifully massages its off-by-one into an off-by-three.
Second, when the error message is misleading, retries actively hurt. I’ve seen Claude and Gemini both latch onto a red herring in a stack trace and chase it for three or four rounds, each time making the code more wrong. The rate improvement numbers in the paper hide this because the benchmarks have unambiguous test failures. Real codebases do not.
Third, the smaller models (Llama 3.1 8B, Qwen3 32B) actually benefit more from retries in relative terms, but they top out lower. If your budget constraint is “small local model”, self-repair gets you further than you’d expect. If you’re already running a frontier model, two attempts is probably the honest ceiling.
This is roughly the same pattern I wrote about in my post on the gap between LLM evals and how people actually use them: clean benchmarks flatter self-correction more than messy production use does. The retry loop that looks brilliant on HumanEval gets humbler when the error is “the PM says this feels wrong”.
A retry loop I can defend in review
Here’s the version I’m actually shipping in an internal agent this week. It’s not fancy, but it respects the diminishing-returns curve and it bails on the failure modes that don’t repair well.
import re

HARD_ERRORS = ("SyntaxError", "NameError", "ImportError", "IndentationError")

def self_repair(task, model, max_attempts=3, stop_on_repeat=True):
    history = [{"role": "user", "content": task}]
    last_error_signature = None
    for attempt in range(max_attempts):
        code = model.complete(history)
        result = run_sandboxed(code)
        if result.passed:
            return code, attempt + 1, "ok"
        sig = _error_signature(result.error)
        if stop_on_repeat and sig == last_error_signature:
            # Same error twice in a row: we're not converging, stop paying.
            return code, attempt + 1, "stalled"
        last_error_signature = sig
        # Mechanical errors (syntax, names, imports) repair well with the raw
        # traceback; logic errors get nudged to rethink rather than patch.
        is_mechanical = any(e in result.error for e in HARD_ERRORS)
        hint = (
            "Fix the specific exception."
            if is_mechanical
            else "The code runs but the logic is wrong. Reconsider the approach before rewriting."
        )
        history += [
            {"role": "assistant", "content": code},
            {"role": "user", "content": f"{hint}\n\nError:\n{result.error}"},
        ]
    return code, max_attempts, "exhausted"

def _error_signature(err: str) -> str:
    # Normalize the last traceback line so "same error, different line number" still matches.
    lines = err.strip().splitlines() if err else []
    head = lines[-1] if lines else ""
    return re.sub(r"line \d+", "line N", head)
The two changes that earned their keep: capping at three attempts (not five or ten), and bailing when the same error repeats. That second check alone cut my token spend on this agent by about 40% with no measurable drop in success rate. Burning three round-trips on the same NameError is the most common way retries turn into a money pit.
The nudge about reconsidering the approach when it’s an AssertionError is softer and I’m less sure about it. My eyeballed result is that it helps maybe one time in four. The other three times the model just rewrites the same buggy logic more confidently. I’d love to see the paper’s authors test prompt variations like this, because it’s the next obvious question.
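For completeness, here's roughly how the loop gets wired into the calling agent. Everything here except self_repair itself is a placeholder for your own plumbing; the point is that the status string is the thing worth acting on and logging.
import logging

logger = logging.getLogger("repair_agent")

class NeedsHumanReview(Exception):
    """Hypothetical escalation path: hand off code that never passed its tests."""

def handle_task(task: str, model) -> str:
    code, attempts, status = self_repair(task, model, max_attempts=3)

    # Over a week of runs, this log line tells you how often you converge,
    # stall on a repeated error, or simply run out of budget.
    logger.info("self_repair finished: status=%s attempts=%d", status, attempts)

    if status != "ok":
        # Don't silently ship code that never passed; route it somewhere louder.
        raise NeedsHumanReview(f"{status} after {attempts} attempts")
    return code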
What I’d do differently starting Monday
A few things I’m changing, if you want something actionable.
Cap retries at three unless you have evidence your specific task benefits from more. The paper’s data doesn’t say five is wrong, it says three is where the slope gets gentle. If you’re eating $0.10 per HumanEval problem, the fourth and fifth tries together add maybe one accuracy point for a 40% cost bump.
Classify the error before you retry. If it’s mechanical (syntax, name, import), retry with the literal traceback. If it’s a failed assertion, either stop, escalate to a bigger model, or ask the model to explain its assumption before it edits code. “Retry blindly” is the wrong default and the paper quietly confirms it.
Log the error signature across attempts. When the same line-normalized error appears twice, you are not converging. You are wasting money. Bail.
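If you want the last two of those as code rather than prose, here's the triage I mean, sketched under a couple of assumptions: you have a cheap model and a bigger one behind the same .complete() interface, and you're willing to pay for one re-run to recover the final error (the loop above doesn't return it). None of this comes from the paper; the escalation prompt in particular is my guess.
MECHANICAL = ("SyntaxError", "NameError", "ImportError", "IndentationError")

def classify(error: str) -> str:
    # Mechanical errors repair well with the literal traceback; assertion
    # failures mostly don't, and deserve a different treatment.
    lines = error.strip().splitlines() if error else []
    tail = lines[-1] if lines else ""
    if any(e in tail for e in MECHANICAL):
        return "mechanical"
    if "AssertionError" in tail:
        return "logic"
    return "other"

def repair_with_escalation(task, small_model, big_model):
    # First pass: the cheap model gets the full (small) retry budget.
    code, attempts, status = self_repair(task, small_model, max_attempts=3)
    if status == "ok":
        return code

    # Stalled or exhausted: recover the last error, then spend one attempt on
    # the bigger model with the failing code as context, not more blind retries.
    result = run_sandboxed(code)
    kind = classify(result.error)
    followup = (
        f"A previous attempt failed with a {kind} error:\n{result.error}\n\n"
        "In one sentence, state the assumption the code makes that the test "
        "contradicts. Then rewrite the solution."
    )
    history = [
        {"role": "user", "content": task},
        {"role": "assistant", "content": code},
        {"role": "user", "content": followup},
    ]
    return big_model.complete(history)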
Separately, if you’re thinking about building a coding agent end to end, I keep a running list of the infrastructure decisions I’ve had to make on my project page, including the boring but critical sandboxing choices that make any retry loop actually safe to run against real code. I’d rather debug a retry loop than debug a retry loop that just rm -rf’d a repo.
Self-repair isn’t going to save a model that can’t solve the problem. But it does reliably turn a 70% model into an 85% one, if you stop at the right round and actually read the errors. That’s a better trade than most of what gets sold as “agentic” these days, and it fits in fifty lines of Python.