{"id":284,"date":"2026-05-27T05:05:10","date_gmt":"2026-05-27T05:05:10","guid":{"rendered":"https:\/\/abrarqasim.com\/blog\/agentic-coding-why-the-feedback-loop-beats-a-smarter-model\/"},"modified":"2026-05-27T05:05:10","modified_gmt":"2026-05-27T05:05:10","slug":"agentic-coding-why-the-feedback-loop-beats-a-smarter-model","status":"publish","type":"post","link":"https:\/\/abrarqasim.com\/blog\/agentic-coding-why-the-feedback-loop-beats-a-smarter-model\/","title":{"rendered":"Agentic Coding: Why the Feedback Loop Beats a Smarter Model"},"content":{"rendered":"<p>Confession: for most of last year I picked AI coding tools the lazy way. A new model dropped, I read the benchmark tweet, I switched. If something scored four points higher on an eval, that was my afternoon decided.<\/p>\n<p>Then I actually timed myself. I wasn&rsquo;t shipping working code any faster. I was shipping the same broken first draft faster, and then spending the same forty minutes fixing it by hand. The model got smarter. My afternoon did not.<\/p>\n<p>Here&rsquo;s the part I&rsquo;d been ignoring, and it&rsquo;s almost embarrassingly simple. One prompt to the best model on earth is still one guess. The thing that finally moved my output wasn&rsquo;t a better guesser. It was wrapping an ordinary guesser in a loop that runs the code, reads the error, and tries again. That loop has a name now, agentic coding, and there&rsquo;s finally a study that puts a number on what it&rsquo;s worth.<\/p>\n<h2 id=\"the-loop-is-the-product-not-the-model\">The loop is the product, not the model<\/h2>\n<p>Think about how you write a function. You don&rsquo;t type it out, ship it unread, and walk away. You run it. You read the red text. You poke at the thing that broke. The real skill was never the first draft. It&rsquo;s the second, third, and fourth.<\/p>\n<p>For two years we asked AI to do the one thing no working developer does: write once, never check. 
A single completion, graded on whether the first attempt happened to land. No wonder it felt like a coin flip.<\/p>\n<p>Agentic coding closes that gap. The model writes code, a harness runs it, the failure goes back to the model, and it tries again with the error in hand. The model is one part of that. The harness, the boring plumbing that compiles and tests and captures the traceback, is doing real work that used to fall on you.<\/p>\n<p>This isn&rsquo;t a new idea. The <a href=\"https:\/\/arxiv.org\/abs\/2303.17651\" rel=\"nofollow noopener\" target=\"_blank\">Self-Refine paper<\/a> showed back in 2023 that a single model improves its own output when it generates feedback on that output and revises, with no extra training and no fine-tuning. What changed since is that the idea stopped being a research demo and got wired into the tools we open every day.<\/p>\n<h2 id=\"what-the-a-pros-numbers-actually-show\">What the A-ProS numbers actually show<\/h2>\n<p>The study I keep coming back to is <a href=\"https:\/\/arxiv.org\/abs\/2605.18073\" rel=\"nofollow noopener\" target=\"_blank\">A-ProS<\/a>, published this year. The researchers built an autonomous coding agent and tested it on 367 competitive programming problems pulled from ICPC World Finals and mid-tier Codeforces rounds. They split the job in two: one model writes the solution, and separate models act as debugging critics that read the failures.<\/p>\n<p>The headline I care about: their GPT-5 workflow solved 39 problems on the first attempt and 85 to 90 after three rounds of refinement. The GPT-4 workflow climbed from 15 to somewhere between 31 and 38.<\/p>\n<p>Sit with those two numbers for a second. GPT-4 inside the loop lands around 35. GPT-5 with no loop lands at 39. A weaker model that&rsquo;s allowed to see its own mistakes roughly matches a stronger model that isn&rsquo;t. 
The loop bought more than a full model generation of capability, and a model generation is the thing we all keep waiting for and paying for.<\/p>\n<p>One more detail worth stealing. A-ProS ran a controlled ablation on 47 problems and found that stateful refinement, where the agent remembers what it already tried, beats refinement that starts cold each round. The loop needs memory. An agent that forgets its last three failures will cheerfully hand you fix number one again on round four.<\/p>\n<h2 id=\"a-feedback-loop-you-can-build-in-an-afternoon\">A feedback loop you can build in an afternoon<\/h2>\n<p>You don&rsquo;t need an agent framework to get most of this. Here&rsquo;s the version of &ldquo;AI writes code&rdquo; that most people actually run:<\/p>\n<pre><code class=\"language-python\">def one_shot(task: str) -&gt; str:\n    return llm(f&quot;Write a Python function. {task}&quot;)\n<\/code><\/pre>\n<p>One call, one guess, and you become the harness. Here&rsquo;s the same thing with a loop wrapped around it:<\/p>\n<pre><code class=\"language-python\">def with_feedback(task: str, tests: str, max_rounds: int = 3) -&gt; str:\n    code = llm(f&quot;Write a Python function. {task}&quot;)\n    history = []\n\n    for attempt in range(max_rounds):\n        result = run_in_sandbox(code + &quot;\\n&quot; + tests)\n        if result.passed:\n            return code\n\n        history.append(result.stderr)\n        transcript = &quot;\\n\\n&quot;.join(\n            f&quot;Attempt {i + 1} failed:\\n{err}&quot;\n            for i, err in enumerate(history)\n        )\n        code = llm(\n            &quot;Your code failed its tests. Fix it.\\n&quot;\n            f&quot;Task: {task}\\n&quot;\n            f&quot;{transcript}\\n&quot;\n            f&quot;Current code:\\n{code}&quot;\n        )\n\n    return code  # best effort: the rounds ran out, and this last revision was never run against the tests\n<\/code><\/pre>\n<p>The whole trick lives in two places. 
<code>run_in_sandbox<\/code> gives you a real pass-or-fail signal instead of a vibe, and it has to be a sandbox, because you&rsquo;re about to execute code a model wrote without reading it first. And passing the full <code>history<\/code> back, not just the latest error, is the stateful part the A-ProS ablation cared about. Drop that history and you&rsquo;ve built the goldfish version of the agent.<\/p>\n<p>This is maybe twenty lines on top of the one-shot call. It&rsquo;s also the line between a model that guesses and a model that debugs. If you want to get fancier later, the obvious next step is routing by difficulty, so a one-liner skips the loop and a hard function gets all three rounds.<\/p>\n<h2 id=\"where-the-loop-still-falls-apart\">Where the loop still falls apart<\/h2>\n<p>I&rsquo;d be lying if I said this fixed everything. It didn&rsquo;t.<\/p>\n<p>The loop is only as good as its signal. Competitive programming hands you a clean pass-or-fail for free. Your actual codebase does not, unless you&rsquo;ve written tests. With no tests, the loop has nothing real to optimize against, and the model will happily &ldquo;fix&rdquo; your code into a fresh kind of wrong. If you want the loop to work, the tests are the homework. I wrote up the specific checks I run before merging anything a model produced in my post on <a href=\"https:\/\/abrarqasim.com\/blog\/ai-code-review-tools-what-i-run-before-merging-ai-written-prs\" rel=\"noopener\">reviewing AI-written pull requests<\/a>, and that habit matters more once an agent is in the picture, not less.<\/p>\n<p>There&rsquo;s also the gaming problem. An agent that only sees &ldquo;make the tests pass&rdquo; will, given enough rounds, write code that passes the tests without solving the problem. You still read the diff. The loop changes how the code gets written. It doesn&rsquo;t retire your judgment.<\/p>\n<p>And it isn&rsquo;t free. Three rounds means three or four model calls plus three sandbox runs. 
That&rsquo;s fine for a gnarly function and silly for a one-liner. Spend the calls where the difficulty actually is.<\/p>\n<p>One last reality check. Even strong agent setups don&rsquo;t resolve every real-world issue. <a href=\"https:\/\/www.swebench.com\" rel=\"nofollow noopener\" target=\"_blank\">SWE-bench<\/a>, which grades models on actual GitHub issues, keeps a verified set of 500 problems that human engineers confirmed are solvable, and no system clears all of them. The loop is a large, measurable gain. It&rsquo;s not a magic wand, and anyone selling it as one hasn&rsquo;t run it on their own repo.<\/p>\n<h2 id=\"what-id-try-this-week\">What I&rsquo;d try this week<\/h2>\n<p>Pick the AI coding tool you already pay for. Stop judging it by the model name on the box. Ask one question instead: when its code fails, does anything feed that failure back automatically, or are you the one copying the traceback into the chat?<\/p>\n<p>If you&rsquo;re doing that by hand, you are the loop, and that&rsquo;s the forty minutes I burned every afternoon last year. Wire up even a crude three-round loop around a cheaper model, give it real tests to run, and measure it against the expensive one-shot setup you&rsquo;ve got now. I&rsquo;ve been rebuilding my own tooling around this idea, and some of it is in my <a href=\"https:\/\/abrarqasim.com\/work\" rel=\"noopener\">recent work<\/a>.<\/p>\n<p>My bet, and the A-ProS numbers back it, is that the loop wins, and it&rsquo;s not close.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>A smarter model still writes the same broken first draft. What actually moved my output was wrapping a cheaper model in a feedback loop that debugs itself.<\/p>\n","protected":false},"author":2,"featured_media":283,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"rank_math_title":"","rank_math_description":"A smarter model still writes the same broken first draft. 
What actually moved my output was wrapping a cheaper model in a feedback loop that debugs itself.","rank_math_focus_keyword":"agentic coding","rank_math_canonical_url":"","rank_math_robots":"","footnotes":""},"categories":[4],"tags":[330,331,317,72,332,5],"class_list":["post-284","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai","tag-agentic-coding","tag-ai-coding-agents","tag-ai-tools","tag-code-generation","tag-feedback-loop","tag-llm"],"_links":{"self":[{"href":"https:\/\/abrarqasim.com\/blog\/wp-json\/wp\/v2\/posts\/284","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/abrarqasim.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/abrarqasim.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/abrarqasim.com\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/abrarqasim.com\/blog\/wp-json\/wp\/v2\/comments?post=284"}],"version-history":[{"count":0,"href":"https:\/\/abrarqasim.com\/blog\/wp-json\/wp\/v2\/posts\/284\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/abrarqasim.com\/blog\/wp-json\/wp\/v2\/media\/283"}],"wp:attachment":[{"href":"https:\/\/abrarqasim.com\/blog\/wp-json\/wp\/v2\/media?parent=284"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/abrarqasim.com\/blog\/wp-json\/wp\/v2\/categories?post=284"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/abrarqasim.com\/blog\/wp-json\/wp\/v2\/tags?post=284"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}