{"id":288,"date":"2026-05-28T05:00:44","date_gmt":"2026-05-28T05:00:44","guid":{"rendered":"https:\/\/abrarqasim.com\/blog\/ai-debugging-why-i-stopped-letting-one-model-fix-itself\/"},"modified":"2026-05-28T05:00:44","modified_gmt":"2026-05-28T05:00:44","slug":"ai-debugging-why-i-stopped-letting-one-model-fix-itself","status":"publish","type":"post","link":"https:\/\/abrarqasim.com\/blog\/ai-debugging-why-i-stopped-letting-one-model-fix-itself\/","title":{"rendered":"AI Debugging: Why I Stopped Letting One Model Fix Itself"},"content":{"rendered":"<p>I watched an AI coding agent fix the same bug four times last week. Same bug. Four separate attempts, and each one arrived with total confidence: &ldquo;The issue was a missing await. I&rsquo;ve corrected it.&rdquo; Then the test failed again, I pasted the error back in, and it produced a fresh, equally certain explanation. By the third round it was rewriting lines that had nothing to do with the failure. By the fourth I gave up and read the stack trace myself, which took about ninety seconds.<\/p>\n<p>I&rsquo;ve done that dance enough times to finally ask the obvious question. Why am I letting the same model that wrote the bug also decide whether it&rsquo;s fixed? If a junior dev kept grading their own homework and handing it back, I&rsquo;d notice within a day. When a model does it, I just keep feeding errors into the same chat window and hoping the next roll comes up green.<\/p>\n<p>A paper that landed on arXiv this month put real numbers on that hunch. The numbers are not subtle, and they changed how I wire up AI coding loops.<\/p>\n<h2 id=\"the-loop-most-of-us-actually-run\">The loop most of us actually run<\/h2>\n<p>Here is the workflow nearly everyone uses with an AI coding tool. You ask for code. You run it. It breaks. You paste the error back and say &ldquo;fix it.&rdquo; You repeat that until the thing works or you give up and fix it by hand.<\/p>\n<p>That loop has one model playing every role at once. 
Author, reviewer, and judge. It writes the code, it decides what went wrong, and it decides when the work is done. The model most committed to the original approach is also the one grading it.<\/p>\n<p>When you debug your own code, you at least know you&rsquo;re biased toward your first idea, and you can push back against that bias. A model doesn&rsquo;t push back against anything. It re-reads its own output, finds it plausible (it generated that output precisely because the tokens looked plausible), and then tweaks around the edges. I have watched agents &ldquo;fix&rdquo; a logic error by wrapping it in a try block that swallows the exception. Test goes green. Bug is still sitting right there, now harder to find than before.<\/p>\n<p>The single-model loop feels like progress because something changes every round. It just isn&rsquo;t progress that converges on anything.<\/p>\n<h2 id=\"what-the-a-pros-paper-measured\">What the A-ProS paper measured<\/h2>\n<p>A group of researchers built a system called A-ProS and asked one clean question: does it help to split code generation from debugging across different models? They tested it on 367 competitive programming problems pulled from ICPC World Finals between 2011 and 2024, plus <a href=\"https:\/\/codeforces.com\/\" rel=\"nofollow noopener\" target=\"_blank\">Codeforces<\/a> problems rated 1200 to 1800. Competitive programming is a useful test bed because the verdict is binary. An automated judge either accepts your solution or it rejects it. No partial credit, no vibes.<\/p>\n<p>The setup was a two-by-three grid. Two code generators, GPT-4 and GPT-5, each paired with three separate debugging models acting as critics: Codestral, Llama 3.3 70B, and DeepSeek-R1. The generator writes a solution. A different model reads the execution feedback and proposes the fix. 
Then the loop runs again, up to three rounds.<\/p>\n<p>The results, reported in the <a href=\"https:\/\/arxiv.org\/abs\/2605.18073\" rel=\"nofollow noopener\" target=\"_blank\">A-ProS paper<\/a>, are what got my attention. GPT-5 went from 39 problems solved on the first attempt to somewhere between 85 and 90 after three rounds of cross-model refinement. GPT-4 climbed from 15 to somewhere between 31 and 38. Both generators at least doubled their solve count, and the only thing that changed was the structure of the feedback loop. Same models. Different plumbing.<\/p>\n<h2 id=\"why-a-different-model-is-the-part-that-matters\">Why a different model is the part that matters<\/h2>\n<p>You could skim those numbers and conclude &ldquo;more refinement rounds equals better,&rdquo; then move on. That is not quite the lesson. The refinement compounds because a different model runs the critique.<\/p>\n<p>Think about what a separate critic brings to the table. It never committed to the original solution. It has different training data, different blind spots, a different instinct for what a stack trace is telling you. When it reads the failing test, it reads it cold, the way a reviewer reads a pull request from a stranger. The original generator reads that same trace and mostly sees evidence that its approach was basically fine and just needs a small patch.<\/p>\n<p>The A-ProS authors also ran an ablation showing the refinement has to be stateful. The critic needs the history of what was already attempted. Without it, the critic happily suggests the same dead-end fix that the previous round already disproved. That matches my experience with single-model loops exactly. Strip out the memory of failed attempts and the model just cycles through the same three wrong ideas in rotation.<\/p>\n<p>This is the same reason code review works between humans. Not because reviewers are smarter than authors. 
Because they are not the author, and they did not spend the last hour quietly convincing themselves the design was good.<\/p>\n<h2 id=\"how-i-rewired-my-own-loop\">How I rewired my own loop<\/h2>\n<p>I can&rsquo;t run a two-by-three model grid for every bug, and neither can you. But the principle scales down to one developer with two browser tabs.<\/p>\n<p>Here is the version I want to stop doing. One model, every role:<\/p>\n<pre><code class=\"language-python\"># one model plays author, reviewer, and judge\ndef fix_until_green(task):\n    code = model.generate(task)\n    for _ in range(5):\n        result = run_tests(code)\n        if result.passed:\n            return code\n        code = model.generate(\n            f&quot;This failed:\\n{result.error}\\nFix it.&quot;\n        )\n    return code\n<\/code><\/pre>\n<p>And here is the version that separates the jobs, and where possible the models:<\/p>\n<pre><code class=\"language-python\"># generator and critic are different models\ndef fix_until_green(task, max_rounds=3):\n    code = generator.generate(task)\n    history = []\n    for _ in range(max_rounds):\n        result = run_tests(code)\n        if result.passed:\n            return code\n        history.append({&quot;code&quot;: code, &quot;error&quot;: result.error})\n        # the critic sees every prior attempt, not just the latest error\n        diagnosis = critic.review(task, history)\n        code = generator.apply_fix(code, diagnosis)\n    return code\n<\/code><\/pre>\n<p>Two things changed and both of them carry weight. The critic is a separate model with exactly one job: read the failure and the history, then say what is actually wrong. And it receives every previous attempt, so it cannot keep recommending a fix that already failed last round.<\/p>\n<p>In practice, for me, this looks low-tech. I generate with one tool. Then I open a fresh chat in a different model and paste the code plus the error with no leading commentary from me. 
No &ldquo;I think the problem is in the parser.&rdquo; I let the second model build its own theory from scratch. When the two models land on the same cause, the fix almost always holds. When they disagree, that disagreement is the most useful signal I get all day, because it tells me exactly where to go look myself. I went deeper on which tools I pair this way in my <a href=\"https:\/\/abrarqasim.com\/blog\/cursor-vs-copilot-vs-claude-code-2026-what-i-reach-for\" rel=\"noopener\">comparison of Cursor, Copilot, and Claude Code<\/a>.<\/p>\n<h2 id=\"what-the-headline-number-leaves-out\">What the headline number leaves out<\/h2>\n<p>Doubling your solve rate sounds incredible until you check the denominator. GPT-5&rsquo;s best run solved around 90 problems out of 367. The strongest configuration in the entire study still failed roughly three of every four competitive programming problems it was handed.<\/p>\n<p>So this is not &ldquo;AI solves coding.&rdquo; It is &ldquo;a better-structured loop gets more out of the same models.&rdquo; Competitive programming is also its own special world: tight constraints and clean specs, with an automated judge waiting at the end. Most of my real bugs do not arrive with a judge attached. They arrive as a Slack message saying the dashboard &ldquo;looks weird sometimes,&rdquo; which is a debugging problem and a translation problem at the same time. Benchmarks built on real repositories, like <a href=\"https:\/\/www.swebench.com\/\" rel=\"nofollow noopener\" target=\"_blank\">SWE-bench<\/a>, consistently report lower numbers than tidy competitive sets, and that gap is the honest picture.<\/p>\n<p>The takeaway I trust is narrower than the headline, and more useful for it. If you already have a way to produce real execution feedback, a test suite, a type checker, a failing repro, then routing that feedback through a second and different model is nearly free and clearly helps. 
If you do not have that feedback loop, no amount of model shuffling rescues you. The signal has to be real first. That is the same reason I run a dedicated review pass before merging anything an agent wrote, which I broke down in my <a href=\"https:\/\/abrarqasim.com\/blog\/ai-code-review-tools-what-i-run-before-merging-ai-written-prs\" rel=\"noopener\">post on AI code review tools<\/a>.<\/p>\n<h2 id=\"one-thing-to-try-this-week\">One thing to try this week<\/h2>\n<p>Take the next bug your AI tool cannot crack in two rounds. Stop the loop right there. Open a different model, hand it only the code and the error with no theory of your own attached, and ask it to diagnose the cause before it writes a single line of fix. See if the second opinion lands somewhere the first one kept missing.<\/p>\n<p>That is the whole move. Different model, separate job, real feedback, and a memory of what already failed. It is not exotic, and it is roughly what the A-ProS results point at. Most of the AI features I build into client apps now use some version of this generator-and-critic split, and a few of them are written up in my <a href=\"https:\/\/abrarqasim.com\/work\" rel=\"noopener\">work<\/a>. It is a boring pattern. Boring patterns tend to be the ones that survive contact with production.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>A new paper on autonomous programming found that separating code generation from debugging across models nearly doubled solved problems. Here&#8217;s the takeaway.<\/p>\n","protected":false},"author":2,"featured_media":287,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"rank_math_title":"","rank_math_description":"A new paper on autonomous programming found that separating code generation from debugging across models nearly doubled solved problems. 
Here's the takeaway.","rank_math_focus_keyword":"ai debugging","rank_math_canonical_url":"","rank_math_robots":"","footnotes":""},"categories":[4],"tags":[335,334,336,5],"class_list":["post-288","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai","tag-ai-coding-agents-2","tag-ai-debugging","tag-developer-workflow","tag-llm"],"_links":{"self":[{"href":"https:\/\/abrarqasim.com\/blog\/wp-json\/wp\/v2\/posts\/288","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/abrarqasim.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/abrarqasim.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/abrarqasim.com\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/abrarqasim.com\/blog\/wp-json\/wp\/v2\/comments?post=288"}],"version-history":[{"count":0,"href":"https:\/\/abrarqasim.com\/blog\/wp-json\/wp\/v2\/posts\/288\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/abrarqasim.com\/blog\/wp-json\/wp\/v2\/media\/287"}],"wp:attachment":[{"href":"https:\/\/abrarqasim.com\/blog\/wp-json\/wp\/v2\/media?parent=288"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/abrarqasim.com\/blog\/wp-json\/wp\/v2\/categories?post=288"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/abrarqasim.com\/blog\/wp-json\/wp\/v2\/tags?post=288"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}