I was debugging a RAG app on Tuesday evening, absolutely convinced the retriever had lost its mind. Three hours in, I realized the retriever was fine. The reasoning model was quietly skipping the middle step of a four-step chain and then confidently stitching together an answer over the gap. It sounded right. It cited real sources. It was wrong in a way that almost no evaluation suite would have caught.
That bug sent me down a rabbit hole I should have visited months ago: where exactly do reasoning LLMs fail, and why do those failures look so polished on the way out? This week a few new papers gave me the vocabulary I was missing, and I want to share the three things I’ll actually change in my stack because of them.
The failure mode nobody warns you about
Most of us think about LLM mistakes in two buckets: hallucinations and refusals. Neither category captures what tripped me up. The model didn’t hallucinate a fact and it didn’t refuse. It performed three of four reasoning steps correctly, silently dropped step three, and presented a complete-looking answer. A new arXiv paper bluntly titled “Reasoning Fails Where Step Flow Breaks” calls this a step-flow rupture, and argues that most reasoning failures in production are not logic errors inside a step but continuity errors between steps.
This matched my debugging experience so closely that I printed the abstract and taped it above my monitor. The polite way to say it: we’ve been evaluating our chains at the wrong granularity. Step-level accuracy is what matters, not end-to-end accuracy on a handful of test questions.
Why your evals are lying to you

Here’s the part that stung. My eval suite for that RAG app was the standard thing: 40 curated question-answer pairs, scored with exact match plus an LLM judge for partial credit. On paper it was at 87 percent. In reality, the model was hitting maybe 60 percent on any chain with more than three reasoning hops, because end-to-end scoring happily gives you credit when the model arrives at a correct-looking answer for the wrong reasons.
I tested this claim by logging the intermediate reasoning traces for a week and hand-grading 50 random chains. The gap between “final answer looked right” and “every step was actually right” was roughly 22 points. Twenty-two. If anyone tells you their chain is at 90 percent accuracy, ask them whether that number is measured at the step level or the answer level. If it’s the answer level, mentally lop off 15 to 25 points and you’ll be in the right ballpark.
The implication for builders is uncomfortable. End-to-end scoring is still useful as a cheap canary, but it is not the number you should put on a slide when you are pitching reliability to a customer. If you want to be honest about a chain’s reliability you need to score the steps, not the answer.
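To make the distinction concrete, here’s a toy sketch of the two scoring modes. The chain data and field names are mine, invented purely for illustration:

```python
# Each graded chain records a pass/fail per intermediate step plus whether
# the final answer looked right. These records are hypothetical examples.

def answer_level_accuracy(chains):
    """Fraction of chains whose final answer looked right."""
    return sum(c["answer_ok"] for c in chains) / len(chains)

def step_level_accuracy(chains):
    """Fraction of chains where every intermediate step actually passed."""
    return sum(all(c["steps"]) for c in chains) / len(chains)

chains = [
    {"steps": [True, True, True, True], "answer_ok": True},   # genuinely right
    {"steps": [True, True, False, True], "answer_ok": True},  # rupture, still "correct"
    {"steps": [True, False, False, True], "answer_ok": False},
    {"steps": [True, True, True, True], "answer_ok": True},
]

print(answer_level_accuracy(chains))  # 0.75 -- the number that goes on the slide
print(step_level_accuracy(chains))    # 0.5  -- the number that tells the truth
```

Chain two is the interesting one: it is exactly the step-flow rupture case, and answer-level scoring waves it straight through.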
The uncertainty signal we’ve been ignoring
The second paper that rewired my thinking is SELFDOUBT, which proposes a “hedge-to-verify ratio” as an uncertainty signal for reasoning LLMs. The intuition is simple. When a model is uncertain, it produces hedging tokens: “likely”, “probably”, “I’m not sure, but”. When it is confident, it produces verification tokens: “because”, “as shown”, “this confirms”. The ratio between those two over a single chain gives you a shockingly reliable confidence score without needing logprobs, without needing an external judge, and without needing a second model call.
I wired a crude version of this into my app yesterday afternoon. Took about twenty lines of code. The false-positive rate on “confident but wrong” answers dropped by roughly a third in the first hundred requests. I don’t want to oversell a one-day experiment, but the cost-to-value ratio is absurd and I’d be embarrassed not to try it.
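For the curious, the crude version looks roughly like this. The phrase lists are my own guesses, not the paper’s actual lexicon, and a serious implementation would tune both lists and the threshold on logged traffic:

```python
import re

# Hedging phrases suggest uncertainty; verification phrases suggest confidence.
# Both lists are illustrative stand-ins, not SELFDOUBT's published lexicon.
HEDGES = ["likely", "probably", "i'm not sure", "might", "perhaps", "possibly"]
VERIFIERS = ["because", "as shown", "this confirms", "therefore", "which means"]

def hedge_to_verify_ratio(chain_text: str) -> float:
    """Count hedging vs verification phrases over one reasoning chain.
    Higher values mean less confidence."""
    text = chain_text.lower()
    hedges = sum(len(re.findall(re.escape(p), text)) for p in HEDGES)
    verifies = sum(len(re.findall(re.escape(p), text)) for p in VERIFIERS)
    return hedges / (verifies + 1)  # +1 keeps hedge-free chains from dividing by zero

confident = "The index is stale because the timestamp predates the write, as shown above."
shaky = "This is probably the cause, but I'm not sure; it might also be the cache."

print(hedge_to_verify_ratio(confident))  # 0.0
print(hedge_to_verify_ratio(shaky))      # 3.0
```

The +1 in the denominator is a cheap way to keep fully confident chains at a ratio of zero; where you set the alarm threshold above that is an empirical question for your own traffic.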
The caveat I’d flag: this technique helps you detect uncertainty, not fix it. It’s a smoke detector, not a sprinkler system. You still need a fallback path for when the smoke detector goes off, and that’s usually the harder engineering problem.
What I’m actually changing this week
Three things, in priority order.
First, I’m rewriting my eval harness to score at the step level. This is the unsexy one that will take the longest. My new harness will log every intermediate reasoning step with a reference rubric, and each step gets a pass/fail independent of whether the final answer lands. I expect my reported accuracy numbers to drop by 15 to 20 points once I do this, and I’m going to publish the old and new numbers side by side so I’m not tempted to pretend the drop didn’t happen.
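A sketch of what I mean by step-level grading. The rubric format and the keyword judge are stand-ins for whatever judge you already run; the point is the shape, where each step passes or fails independently and a missing step is itself a failure:

```python
from dataclasses import dataclass

@dataclass
class StepResult:
    step_index: int
    passed: bool
    note: str = ""

def grade_chain(steps, rubric, judge):
    """Grade each intermediate step against its rubric entry.
    The chain passes only if every step passes, regardless of the final answer."""
    results = []
    for i, (step, criterion) in enumerate(zip(steps, rubric)):
        passed, note = judge(step, criterion)
        results.append(StepResult(i, passed, note))
    # A missing step is itself a failure: this is the rupture detector.
    if len(steps) != len(rubric):
        results.append(StepResult(len(steps), False, "step count mismatch"))
    return results, all(r.passed for r in results)

# Toy judge: a keyword match stands in for a real LLM judge with a rubric.
def keyword_judge(step_text, criterion):
    ok = criterion in step_text
    return ok, "" if ok else f"missing: {criterion}"

steps = ["retrieve docs for query", "rank by recency", "synthesize answer"]
rubric = ["retrieve", "rank", "cite", "synthesize"]  # four steps expected
results, chain_ok = grade_chain(steps, rubric, keyword_judge)
print(chain_ok)  # False: the model produced three steps where the rubric expects four
```

Note that the harness catches both kinds of failure here: a step that exists but doesn’t satisfy its criterion, and a step that was silently skipped.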
Second, I’m adding the hedge-to-verify ratio as a soft signal in the response path. If the ratio crosses a threshold I’ll fall back to a simpler deterministic path rather than trusting the chain. This is a soft mitigation and it will catch some of the cases where a step-flow rupture would otherwise ship to users.
Third, and this is the one I’m least confident about: I’m going to try chain decomposition before generation. Instead of asking the model to reason through a four-step problem in a single generation, I’ll prompt it to produce the step plan first, validate each step is well-formed, and only then execute. This doubles the latency budget, so I’m only doing it on the highest-stakes chains. The open question is whether the latency hit is worth the reliability improvement, and I’ll have a real answer in about two weeks.
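The loop I have in mind looks roughly like this. `call_model` is a placeholder for whatever client you use, and the numbered-plan format is an arbitrary convention of mine, not something from either paper:

```python
def call_model(prompt: str) -> str:
    """Placeholder for a real LLM call."""
    raise NotImplementedError

def parse_plan(plan_text: str) -> list[str]:
    """Expect one numbered step per line, e.g. '1. Retrieve the docs'."""
    steps = []
    for line in plan_text.strip().splitlines():
        head, _, body = line.partition(".")
        if head.strip().isdigit() and body.strip():
            steps.append(body.strip())
    return steps

def plan_then_execute(question: str, min_steps: int = 2) -> str:
    # Pass one: ask only for the step plan, not the answer.
    plan_text = call_model(f"List the reasoning steps, numbered, for: {question}")
    steps = parse_plan(plan_text)
    # Validate the plan is well-formed before spending tokens on execution.
    if len(steps) < min_steps:
        raise ValueError(f"malformed plan: {plan_text!r}")
    # Pass two: execute each step with the prior steps' output as context.
    answers = []
    for i, step in enumerate(steps, 1):
        context = "\n".join(answers)
        answers.append(call_model(f"Step {i}: {step}\nPrior work:\n{context}"))
    return answers[-1]
```

The validation step is the whole point: a skipped step shows up as a malformed or short plan before you pay for the execution pass, rather than as a polished wrong answer afterward.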
A quick note on cost. Chain decomposition roughly doubles your token count per request because you’re paying for the plan and the execution. If your margins are already thin, the hedge-to-verify ratio is a much cheaper place to start because it adds zero extra calls. I’d tell most teams to ship the ratio check this week and experiment with decomposition only if the ratio check alone isn’t enough. Order of operations matters when the latency budget is real.
One more thing I want to be fair about: none of this is a silver bullet. Step-flow ruptures are a symptom of models being trained to produce fluent text, and the fluency is what makes the failures hard to spot. Any mitigation is going to be a trade between latency, cost, and trust, and the right trade depends on how expensive a wrong answer is in your product. For a consumer chatbot the cost of a confidently wrong answer is low. For a tool that writes code against a production database the cost is high. Pick your threshold accordingly.
If you’ve read my earlier post on where open-source LLMs actually stand in 2026, you’ll notice these changes are all upstream of the model choice itself. That’s on purpose. I think most teams obsess over which model to pick and then under-invest in the evaluation harness around it, which is backwards. A better eval harness on a mid-tier model beats a worse eval harness on a frontier model almost every time.
The uncomfortable takeaway
Here is the thing I keep circling back to. The reason my RAG bug was invisible for weeks was not that the model was getting worse. It was that my eval suite was never built to see the failure mode in the first place. I was measuring the answer and pretending I was measuring the reasoning, and the two are not the same thing.
If you take one idea from this post, let it be this: audit your eval harness before you audit your model. Read one of your chains end-to-end, step by step, and ask yourself whether your current scoring system would have caught a silently skipped step. If the answer is no, that’s the first thing to fix. It is less fun than switching models and it is more important.
If you want to chat about any of this, or you’ve hit a similar bug and want to compare notes, I’m around. I keep a few ongoing projects listed on my portfolio at abrarqasim.com and I read my inbox. I’d rather hear about a counterexample than another “your chain is at 95 percent” claim that doesn’t survive step-level scoring.
Short version for the impatient: score steps, not answers; watch hedge-to-verify ratios; and be honest about the gap between the two.