Your reasoning model is more fragile than the benchmarks say

I ran a quick test last week. I took an AIME 2024 math problem that Claude Opus and o3 both ace, changed the variable names from x and y to Greek letters, and fed it to an open-weight reasoning model. Accuracy dropped by over 40%. Same math. Different symbols. Total collapse.

This isn’t just me poking at edge cases for fun. A new paper called Robust Reasoning Benchmark did something similar but way more systematic. The researchers built 14 perturbation techniques and applied them to the full AIME 2024 dataset, then tested 8 models. The results should worry anyone shipping products on top of reasoning models.

What the perturbations actually are

The paper doesn’t just swap variable names. The researchers tried 14 different modifications to standard math problems, none of which changes the underlying math. Things like using Unicode symbols instead of ASCII, reordering the premises in a word problem, adding irrelevant context sentences, or changing number formats (writing “one hundred” instead of “100”).
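To make that concrete, here’s roughly what a few of those perturbations look like as code. This is my own sketch, not the paper’s pipeline; the function names and the specific substitutions are just illustrative.

```python
import re

def to_unicode_symbols(problem: str) -> str:
    # Crude character substitution, purely for illustration
    return problem.replace("x", "α").replace("y", "β")

def spell_out_numbers(problem: str) -> str:
    # Rewrite a few standalone numbers as words, e.g. "100" -> "one hundred"
    words = {"100": "one hundred", "12": "twelve", "7": "seven"}
    return re.sub(r"\b(100|12|7)\b", lambda m: words[m.group(1)], problem)

def add_irrelevant_context(problem: str) -> str:
    # Prepend a sentence that has nothing to do with the math
    return "The weather in Paris was unusually warm that day. " + problem

original = "Find all integers x and y such that 2x + 3y = 100."
for perturb in (to_unicode_symbols, spell_out_numbers, add_irrelevant_context):
    print(perturb(original))
```

None of these change the answer to the problem. That’s the whole point.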

These aren’t adversarial attacks in the traditional sense. Nobody’s injecting prompts or confusing the tokenizer on purpose. They’re the kind of variation that shows up naturally when real users type real problems. A student might write “alpha” instead of “x”. A finance person might format numbers with commas. These are normal things.

The damage is worse than you’d guess

Frontier models handled it reasonably well. GPT-5.1 and Gemini 2.5 Pro barely flinched. Claude Opus 4.6 was resilient too, though the paper noted some accuracy decay on sequential problems (more on that in a second).

Open-weight reasoning models got destroyed. The paper reports average accuracy drops of up to 55% across perturbations. On some individual perturbation types, certain models hit 100% accuracy loss. That means every single answer was wrong on a problem set they normally get mostly right.

The models tested ranged from 7B to 120B parameters. Size didn’t help much. A 70B reasoning model that scores 80%+ on standard AIME still fell apart when the formatting changed.

I’ve seen this pattern before in my own work. I was building an evaluation pipeline for a client last year and kept getting inconsistent scores. Turned out the evaluation prompts had slightly different whitespace in different runs, and the reasoning model was sensitive enough that it mattered. I wrote about the gap between API and chat evaluations around that time, but this paper puts real numbers on the problem.

The sequential problem is maybe scarier

Here’s the part that got me. The researchers also tested something separate: they gave models multiple unperturbed math problems in sequence within a single context window. No perturbations at all. Just “solve problem 1, now solve problem 2, now solve problem 3.”

Accuracy decayed on later problems. The intermediate reasoning steps from earlier problems polluted the attention mechanism, and subsequent answers got worse. This happened across open-weight models from 7B to 120B, and it happened with Claude Opus too.

If you’re batching multiple reasoning tasks in one context window to save on API costs (and a lot of people do this), your later tasks are getting worse answers than your earlier ones. The model isn’t forgetting. It’s being confused by its own earlier work.

This is the kind of thing that’s almost impossible to catch in standard evals because most benchmarks test one problem at a time. In production, you’re rarely doing that.

What this means if you’re building on reasoning models

First, don’t trust benchmark scores as deployment guarantees. AIME scores tell you what a model can do in ideal conditions with standardized formatting. Your users won’t type in ideal conditions. If your application takes free-form input and feeds it to a reasoning model, you need to test with messy, real-world formatting. Swap variable names, add extra whitespace, change number formats, reorder sentences. If accuracy drops dramatically, you have a robustness problem that no amount of prompt engineering will fully fix.
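Here’s a rough sketch of what that kind of robustness check could look like. `call_model` and `is_correct` are placeholders for your own model client and answer checker, and `perturb` is any transformation like the ones sketched earlier.

```python
from statistics import mean

def call_model(prompt: str) -> str:
    # Placeholder: wire this up to whatever model API you're testing
    raise NotImplementedError

def is_correct(response: str, expected: str) -> bool:
    # Naive answer check; a real eval needs something stricter
    return expected.strip() in response

def robustness_gap(problems, perturb) -> float:
    """Accuracy on clean problems minus accuracy on perturbed ones."""
    clean = mean(is_correct(call_model(p["question"]), p["answer"]) for p in problems)
    messy = mean(is_correct(call_model(perturb(p["question"])), p["answer"]) for p in problems)
    return clean - messy
```

If that gap is large on your own task distribution, no benchmark score is going to save you.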

Second, be careful with context window batching. If you’re packing multiple reasoning tasks into one call, benchmark the later tasks separately. You might find that the third or fourth problem in a batch performs significantly worse than the same problem in isolation. The savings from batching might not be worth the accuracy hit.
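A crude way to measure this, reusing the same placeholder `call_model` and `is_correct` helpers from above: score each position in a batch separately, then compare against the same problems run in isolation.

```python
def batched_accuracy_by_position(problems, batch_size=4):
    """Accuracy per batch position when several problems share one prompt."""
    per_position = [[] for _ in range(batch_size)]
    for start in range(0, len(problems) - batch_size + 1, batch_size):
        batch = problems[start:start + batch_size]
        prompt = "\n\n".join(
            f"Problem {j + 1}: {p['question']}" for j, p in enumerate(batch)
        )
        response = call_model(prompt)  # one call, several problems
        for j, p in enumerate(batch):
            # Crude scoring: does the expected answer appear anywhere in the response?
            per_position[j].append(is_correct(response, p["answer"]))
    return [sum(scores) / len(scores) for scores in per_position if scores]
```

If position 3 or 4 consistently scores below position 1, you’re paying for the batching in accuracy.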

Third, model selection matters more than model size. The paper found that frontier models (GPT-5.1, Gemini 2.5 Pro) were far more robust than open-weight alternatives, even at comparable parameter counts. If your application depends on reasoning reliability and not just peak accuracy, you might need to stay on frontier APIs rather than self-hosting. This is frustrating if you’re trying to control costs or run on-prem, but the robustness gap is real.

A formatting layer might help (but it’s a band-aid)

One mitigation I’ve been trying: normalize inputs before they hit the reasoning model. Strip weird Unicode, standardize number formats, canonicalize variable names. It’s a preprocessing layer that costs almost nothing and can absorb some of the variation that trips models up.
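Here’s a minimal version of that layer. The specific mappings are illustrative; a real deployment would need a much bigger table and some care about what you’re actually allowed to rewrite.

```python
import re
import unicodedata

# Hypothetical mapping; extend it for whatever symbols show up in your traffic
GREEK_TO_ASCII = {"α": "x", "β": "y", "γ": "z"}

def normalize(problem: str) -> str:
    # Canonicalize Unicode forms (full-width digits, compatibility characters, etc.)
    text = unicodedata.normalize("NFKC", problem)
    # Map a few common Greek letters back to ASCII variable names
    for greek, ascii_name in GREEK_TO_ASCII.items():
        text = text.replace(greek, ascii_name)
    # Drop thousands separators inside numbers: "1,000" -> "1000"
    text = re.sub(r"(?<=\d),(?=\d{3}\b)", "", text)
    # Collapse irregular whitespace
    return re.sub(r"\s+", " ", text).strip()

print(normalize("Let α = 1,000 and  β = 3."))  # -> "Let x = 1000 and y = 3."
```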

It’s not a complete fix. You can’t normalize away premise reordering or irrelevant context without understanding the problem deeply enough that you might as well solve it yourself. But for the low-hanging fruit like character encoding and number formatting, it helps.

The Robust Reasoning Benchmark paper is on arXiv if you want the full results. The perturbation pipeline they built is actually useful as a testing tool. If you’re evaluating reasoning models for production use, running your test set through their 14 perturbation types would tell you a lot more than standard benchmarks do.

I also found the LMSYS Chatbot Arena useful for getting a gut feel for model robustness, since users there type things every possible way. It’s not rigorous but it gives you a sense of which models handle messy inputs gracefully.

For more on how I evaluate models for production use, I’ve been documenting my process on my site. The short version: standard benchmarks are a starting point, not an endpoint. Real-world robustness testing is where the surprises live.