LLMs Do Worse at Math When the Names Aren’t English

Okay, I was poking at a set of grade-school math problems last weekend, trying to see if Claude Opus 4.6 and GPT-5 would behave differently when I changed the names but kept the numbers. I expected nothing interesting. What I found kind of bugged me.

Both models solved the “Janet bakes 18 muffins” version fine. Both models flubbed the exact same problem when “Janet” became “Ayesha” and the muffins became samosas in Karachi. Same arithmetic. Same logic. Different answer. I did not believe it at first, so I ran it ten more times. The pattern held, and then a paper confirmed I was not imagining things.

The study I wish I had read first

Researchers recently published a benchmark that rewrites GSM8K problems for six non-Western cultural contexts: Haiti, Moldova, Pakistan, Solomon Islands, Somalia, and Suriname. They kept the mathematical operations identical. They only swapped cultural entities: names, foods, places, currency. Then they ran 14 frontier models through 1,198 problems each.

Accuracy dropped across every single model. The drops ranged from 0.3% on Claude 3.5 Sonnet at the low end to 5.9% on LLaMA 3.1 8B at the high end. The statistical tests ruled out noise (McNemar’s test, p < 0.01). Of the 18,887 failed instances the authors looked at, roughly 55% of errors were reasoning failures, and another 35% were calculation errors.

Read that again. Calculation errors. On problems where the numbers never changed.

Why swapping “Janet” for “Ayesha” breaks math

My first instinct was to blame tokenization. Less frequent names tokenize into more pieces, which burns more of a model’s attention budget before it even gets to the arithmetic. That story is partly true. It is not the whole story, because the drops happened across models with very different tokenizers and very different training corpora.

What the paper suggests, and what I now believe after my own follow-ups, is that models build an implicit expectation about a problem from its surface features. If the surface says “Janet, cafeteria, cupcakes,” the model pattern-matches to the thousands of similar grade-school problems it has seen. If the surface says “Ayesha, canteen, samosas,” the model has to do more lifting to recognize this as a GSM8K-shaped problem at all, and it spends some of its reasoning budget on that recognition instead of on the sum itself.

You can reproduce this at home in about ten minutes. It is not a subtle effect.
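Here is roughly what my ten-minute check looked like. The call_model helper below uses the OpenAI client purely as an example, and the model name, the problem wording, and the answer extractor are all mine, not the paper’s; swap in whatever client and model you actually care about.

```python
import re
from openai import OpenAI

client = OpenAI()  # any chat client works; this one is just an example

def call_model(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # swap in whatever model you are actually testing
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Same arithmetic, two surface forms.
VARIANTS = {
    "us": "Janet bakes 18 muffins. She sells 7 at the school cafeteria and gives 5 "
          "to her neighbors. How many muffins does she have left?",
    "pk": "Ayesha makes 18 samosas. She sells 7 at the canteen in Karachi and gives 5 "
          "to her neighbors. How many samosas does she have left?",
}
EXPECTED = 6
RUNS = 10

def last_number(text: str) -> int | None:
    """Crude answer extractor: take the final integer in the reply."""
    nums = re.findall(r"-?\d+", text.replace(",", ""))
    return int(nums[-1]) if nums else None

for locale, problem in VARIANTS.items():
    correct = sum(last_number(call_model(problem)) == EXPECTED for _ in range(RUNS))
    print(f"{locale}: {correct}/{RUNS} correct")
```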

This is not a story about AI being racist

I want to be careful here, because the “AI is biased” framing gets clicks and it also gets people to stop thinking. This result is not proof that a model is prejudiced. It is proof that a model’s reasoning is not as decoupled from surface features as its marketing claims. That is a different, and much more useful, problem.

The practical implication for anyone shipping LLM features for a global audience is that your evaluations are probably lying to you. If you built your eval set in English with English names, you are measuring performance on your test set, not performance on your users. The gap can be small on frontier models. It is not small on the open-weight models most startups actually deploy: the literature reports drops of 3 to 6 points for those, and even the original GSM8K paper from OpenAI noted sensitivity to surface perturbations back in 2021.

There is also a compounding effect. If your downstream system uses the LLM’s output to compute something, and then feeds the result back into another prompt, a 4% error rate at each hop becomes a very different number after three hops. I have gotten this wrong in production before. It is the kind of bug that passes every test and then shows up as “our Pakistani users keep complaining about the math being weird” in a support ticket six weeks later.
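The arithmetic is worth writing out once. Under the simplifying assumption that errors at each hop are independent:

```python
# If each hop is right 96% of the time and errors are independent,
# the chance that at least one of three chained hops goes wrong is:
per_hop_error = 0.04
hops = 3
chain_error = 1 - (1 - per_hop_error) ** hops
print(f"{chain_error:.1%}")  # -> 11.5%
```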

What I am changing in how I evaluate

I used to test LLMs the way most people do. Pick a benchmark, run it in English, look at the score, ship. After reading this paper I am adding three things to my workflow.

First, localized replicas of every eval I care about. If I have 50 test cases in English, I generate 50 culturally adapted replicas for every locale I plan to support. This is cheap to do with a script and a good localization prompt. The whole point is to keep the math identical and vary only the surface features.
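Here is the shape of that script, as a sketch rather than the paper’s method: the LOCALIZATION_PROMPT wording, the locale list, and the file names are all placeholders, and call_model is the same kind of helper as in the earlier sketch.

```python
import json

def call_model(prompt: str) -> str:
    """Same helper as in the earlier sketch; wire it to your own client."""
    ...

# Placeholder instruction; the constraint in the last sentence is the whole point.
LOCALIZATION_PROMPT = (
    "Rewrite this math problem for a reader in {locale}. Replace names, foods, "
    "places, and currency with ones common there. Do not change any number, "
    "unit, or mathematical operation.\n\nProblem: {problem}"
)

LOCALES = ["Pakistan", "Haiti", "Suriname"]  # whatever you plan to support

def localize_eval(cases: list[dict]) -> list[dict]:
    """Expand each English test case into one adapted replica per locale."""
    adapted = []
    for case in cases:
        for locale in LOCALES:
            adapted.append({
                "locale": locale,
                "question": call_model(
                    LOCALIZATION_PROMPT.format(locale=locale, problem=case["question"])
                ),
                "answer": case["answer"],  # the ground truth never changes
            })
    return adapted

with open("eval_cases_en.json") as f:
    english_cases = json.load(f)
with open("eval_cases_localized.json", "w") as f:
    json.dump(localize_eval(english_cases), f, indent=2, ensure_ascii=False)
```

The one thing worth spot-checking by hand is whether the rewrite quietly changed a number; a quick diff of the digits in each original and replica pair catches most of that.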

Second, a per-locale slice in my eval report. A single average across all locales hides exactly the kind of drop this paper found. If one locale scores 88% and another scores 82%, I want to see both numbers side by side, not the mean.
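If your results land in a DataFrame anyway, the slice is a one-liner; the column names below are my assumption about your result schema, not a standard.

```python
import pandas as pd

# Assumed schema: one row per test case with columns locale, question_id, correct (0/1).
results = pd.read_csv("results.csv")

by_locale = results.groupby("locale")["correct"].agg(accuracy="mean", n="count")
by_locale["accuracy"] = (by_locale["accuracy"] * 100).round(1)
print(by_locale)

spread = by_locale["accuracy"].max() - by_locale["accuracy"].min()
print(f"worst locale trails the best by {spread:.1f} points")
```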

Third, I am being much more skeptical about “reasoning” benchmarks in general. The whole premise of a reasoning eval is that it isolates logic from surface form. If a model’s accuracy moves when you change names and places, the eval is not isolating what it claims to isolate. I wrote more about this in my earlier piece on what reasoning LLMs actually measure when they show their work, and this new paper is a strong piece of evidence that the concern is real, not philosophical.

What you can actually do this week

If you have ten minutes, pick your five most-used prompts, swap the names and place nouns for ones common to a region you care about, and rerun them. Note any answers that change. If any of them do, you have a reproducible failure case that is worth a GitHub issue or a Linear ticket.
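A sketch of that ten-minute check. The swap table is obviously a placeholder; fill it with names, foods, and places your users would actually recognize, and reuse whatever client helper you already have.

```python
def call_model(prompt: str) -> str:
    """Same helper as in the earlier sketch; wire it to your own client."""
    ...

# Placeholder surface-feature swaps; pick ones common to a region you care about.
SWAPS = {
    "Janet": "Ayesha",
    "muffins": "samosas",
    "cafeteria": "canteen",
    "Springfield": "Karachi",
    "$": "Rs ",
}

def localize(prompt: str) -> str:
    for original, replacement in SWAPS.items():
        prompt = prompt.replace(original, replacement)
    return prompt

prompts: list[str] = []  # paste in your five most-used prompts

for p in prompts:
    original_answer = call_model(p)
    localized_answer = call_model(localize(p))
    # For verbose outputs, compare the extracted final answers instead of raw text.
    if original_answer.strip() != localized_answer.strip():
        print("ANSWER CHANGED:", p[:60], "...")
```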

If you have a couple of hours, run the Haiti or Pakistan variant of GSM8K on whatever model you are about to ship on. The translated test sets are public on arXiv. You will know within an afternoon whether your model is carrying a meaningful locale penalty.
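And the afternoon version, as a sketch under some assumptions: that you have downloaded one locale’s adapted split as a local JSONL file, and that each record has question and answer fields with a plain numeric answer. Neither the file name nor the schema here is the paper’s; check the release and adjust.

```python
import json
import re

def call_model(prompt: str) -> str:
    """Same helper as in the earlier sketch; wire it to the model you plan to ship."""
    ...

def last_number(text: str) -> float | None:
    """Crude answer extractor: take the final number in the reply."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(nums[-1]) if nums else None

correct = total = 0
with open("gsm8k_pakistan.jsonl") as f:      # assumed local path
    for line in f:
        item = json.loads(line)              # assumed fields: question, answer
        prediction = last_number(call_model(item["question"]))
        if prediction is not None and prediction == float(item["answer"]):
            correct += 1
        total += 1

print(f"locale accuracy: {correct}/{total} = {correct / total:.1%}")
```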

If you have a weekend, do what the authors did and build your own culturally adapted variant of an internal eval. It is the single cheapest “oh no we shipped a bug” prevention I have seen all year. I cover more of these kinds of small, cheap evaluation upgrades in my work on practical LLM deployments, because the best time to catch a locale regression is before your users find it, not after.

None of this makes your model smarter. It makes your measurements honest, which is the first step toward actually being able to improve anything.