Skip to content
AI

Your LLM Evals Are Testing the Wrong Thing

Your LLM Evals Are Testing the Wrong Thing

I spent most of last month building an eval suite for a client’s support chatbot. Hundreds of test cases, carefully scored, all running through the API. The results looked solid. The model pushed back on misinformation, stayed on topic, handled edge cases cleanly. I was feeling good about it. Then someone on the client’s team had a 15-minute conversation with the actual chat interface and walked away believing the bot had confirmed a conspiracy theory about their insurance claim.

That gap between what the API shows you and what users actually experience? There’s now a study that measured it. The numbers are worse than I expected.

The study that should make you uncomfortable

A team of researchers recently published a paper called “LLM Spirals of Delusion.” It’s an audit study that did something almost nobody does in LLM evaluation: they tested chatbots through the actual chat interface, not just the API.

They ran 56 twenty-turn conversations with ChatGPT-4o and ChatGPT-5, split evenly between API calls and the desktop/web chat interface. The test scenario: present the model with conspiratorial or delusional statements and observe how it responds over an extended conversation.

The headline finding is that the API and the chat interface produce measurably different behaviors. Models were more likely to reinforce problematic beliefs when accessed through the chat interface, which is the environment actual humans use every day. Two research assistants and GPT-5 itself graded each conversation, and the gap between API and interface performance held across all evaluators.

You can read the full paper on arXiv. I’d start with the methodology section, which lays out exactly how they kept the comparison fair.

Why the API and the chat interface are different products

Your LLM Evals Are Testing the Wrong Thing

This isn’t about the API being dishonest. It’s about the stack between the base model and the user.

When you send a message through the API, you’re sending a clean JSON payload with your own system prompt and parameters. When someone opens ChatGPT on their laptop, they get personalization layers, conversation memory, UI-driven system prompts, and whatever fine-tuning the vendor has applied to that specific interface. Those aren’t the same product, even if they share a model name.

From an engineering perspective, your API eval hits the model with controlled inputs and measures outputs. The chat interface wraps those same inputs in context you don’t control and can’t see. System prompts change without notice. Memory features inject prior conversation history. Safety filters operate at different thresholds depending on the platform version. You’re testing a clean-room version of something that ships with messy middleware attached.

The researchers were blunt about it: automated testing through the API is not sufficient to assess real-world chatbot impact. That’s their conclusion, not my spin on it.

Multi-turn conversations are where things fall apart

Most LLM eval benchmarks test single exchanges. Send a prompt, grade the response, move on. That approach completely misses how conversations evolve.

In the study’s 20-turn conversations, models gradually shifted from resisting questionable claims to reinforcing them. The researchers called these patterns “spirals.” Once the model made a small concession to keep the conversation flowing, subsequent turns built on that concession until the model was actively supporting the problematic belief.

This connects to what the AI safety community calls sycophancy, the tendency of RLHF-trained models to tell users what they want to hear. Research out of Anthropic documented this behavior, showing that models will flip correct answers to incorrect ones when users push back with disagreement. What this new study adds is evidence that sycophancy through multi-turn chat interfaces is worse than what you’d measure through the API. The interface environment, with its memory and personalization features, seems to amplify the drift.

The pattern shows up in practice too. I’ve seen support bots that test perfectly in isolation but gradually adopt a user’s framing during long troubleshooting sessions. The bot starts by providing the correct diagnosis, but after the user pushes back three or four times, it starts hedging, then agreeing, then actively helping the user pursue the wrong fix.

I’ve talked before about what to measure in LLM reasoning flows, and that advice still holds. But it assumed your eval environment is a reasonable approximation of what users experience. This study says that assumption might be wrong in ways that matter.

What this breaks in practice

If you’re building any LLM-powered product and your evaluation pipeline only hits the API, you’ve got a blind spot. Here’s what goes wrong.

Your safety evals pass because the API-level model correctly pushes back on harmful claims in single-turn tests. You ship the product. Users have multi-turn conversations through your actual UI, where context accumulates, the model adapts its tone to match the user, and safety behavior degrades gradually. By turn 12, the chatbot is agreeing with things it would have flagged at turn 1.

This is especially risky for customer support bots, health-related assistants, and anything where users bring strong prior beliefs into the conversation. A user who starts by saying “I heard my medication has been recalled” gets increasingly confident in that false belief if the bot doesn’t push back firmly at every single turn. It only takes one soft “I understand your concern” without a correction to start the spiral.

The eval pipeline that catches this doesn’t exist at most companies. Single-turn benchmarks are easy to run and they look good on a slide deck. Multi-turn adversarial testing through the real product interface requires actual automation work that most teams quietly skip.

I get why it happens. API evals are cheap, fast, and automatable with three lines of Python. Browser automation is flaky, slow, and breaks every time the UI changes. But cheap tests that miss the failure mode aren’t actually cheap. They’re invisible risks you haven’t priced yet.

What to change this week

Here’s what I’d do differently if I were setting up an eval pipeline from scratch today.

Add multi-turn test scenarios that run for at least 15 to 20 turns. Not three-turn “conversations” that barely scratch the surface. Build extended interactions where the simulated user escalates a claim gradually, and measure whether the model’s resistance to bad information holds at turn 15 the way it did at turn 2.

Test through the real interface, not just the API. If you’re deploying a chatbot, your evals need to exercise the same stack your users hit. Automate browser-level testing with Playwright or a similar tool. It’s slower than API calls. It’s also closer to what people actually experience.

Build specific sycophancy drift tests. Start with a false claim. Have the simulated user restate it with slightly more confidence each turn. Track the exact turn number where the model stops pushing back. That number is your real safety threshold, and I’d bet it’s lower than you think.

Stop trusting single-turn safety benchmarks at face value. When a vendor says their model scores 95% on safety evals, ask whether those evals were multi-turn and whether they ran through the product interface. The answer is almost always no to both.

I build LLM evaluation and integration systems for clients, and this paper has already changed how I approach scoping. The API is where you start testing, not where you finish.

Your eval pipeline is probably grading a version of your product that nobody uses. Fix that before you ship.