{"id":153,"date":"2026-04-26T05:03:21","date_gmt":"2026-04-26T05:03:21","guid":{"rendered":"https:\/\/abrarqasim.com\/blog\/llm-safety-stateless-multi-turn-attack-what-it-means\/"},"modified":"2026-04-26T05:03:21","modified_gmt":"2026-04-26T05:03:21","slug":"llm-safety-stateless-multi-turn-attack-what-it-means","status":"publish","type":"post","link":"https:\/\/abrarqasim.com\/blog\/llm-safety-stateless-multi-turn-attack-what-it-means\/","title":{"rendered":"LLM Safety Has a Stateless Multi-Turn Blind Spot (And What It Means for Your App)"},"content":{"rendered":"<p>Okay, this is going to sound dumb, but I spent half of yesterday staring at a new paper and thinking: we probably already knew this, and we still keep shipping the thing it breaks. If you run an LLM app and you&rsquo;ve been comforted by &ldquo;the model refuses bad requests,&rdquo; you should read this before the next release.<\/p>\n<p>The paper is <a href=\"https:\/\/arxiv.org\/abs\/2604.21860\" rel=\"nofollow noopener\" target=\"_blank\">Transient Turn Injection: Exposing Stateless Multi-Turn Vulnerabilities in Large Language Models<\/a>. I&rsquo;m not going to rewrite it. I want to talk about the part of it that has practical consequences for the shape of LLM apps I see people build, including ones I&rsquo;ve built.<\/p>\n<h2 id=\"the-one-sentence-version-of-the-result\">The one-sentence version of the result<\/h2>\n<p>Most LLM safety moderation is stateless. Each turn gets scored on its own. An attacker that splits malicious intent across several turns, none of which looks bad in isolation, walks straight past the moderator. The authors call this Transient Turn Injection (TTI), and they show it working across a bunch of frontier and open-source models.<\/p>\n<p>That&rsquo;s it. No magic. No new adversarial math. The failure mode is an architectural seam between &ldquo;the model&rdquo; and &ldquo;the moderator layer around it,&rdquo; and the seam has been sitting there the whole time.<\/p>\n<h2 id=\"why-the-blind-spot-exists-in-the-first-place\">Why the blind spot exists in the first place<\/h2>\n<p>If you&rsquo;re a platform deploying an LLM at scale, you almost always run moderation as a separate classifier on each message. It&rsquo;s cheaper. It&rsquo;s easier to reason about. It composes with rate limiting. And for single-turn abuse it actually works pretty well.<\/p>\n<p>The problem is that &ldquo;assistant&rdquo; is a multi-turn abstraction. The user writes three benign-looking messages that, together, describe a coherent attack. The moderator sees three benign messages. The model sees the accumulated context and happily executes the composite request. Nobody is looking at the shape of the conversation.<\/p>\n<p>I wrote a related post on <a href=\"https:\/\/abrarqasim.com\/blog\/llm-context-window-numbers-sepseq-fix\/\" rel=\"noopener\">why LLM evals miss context-window failures<\/a> and this is cousin to that problem. The thing you&rsquo;re evaluating is narrower than the thing that actually runs in production, so evals keep passing while production keeps failing in the same structural way.<\/p>\n<h2 id=\"what-this-means-for-your-app-the-practical-bit\">What this means for your app (the practical bit)<\/h2>\n<p>If you&rsquo;re building on top of a frontier model through an API, three things are true:<\/p>\n<ol>\n<li>The provider&rsquo;s safety layer is better than whatever you&rsquo;re going to build yourself for most single-turn harm.<\/li>\n<li>That safety layer probably doesn&rsquo;t see your full conversation history \u2014 it sees the current turn plus whatever system prompt you send.<\/li>\n<li>Therefore, anything that relies on &ldquo;the provider will refuse bad requests&rdquo; is under-protected against multi-turn attacks that spread intent across turns.<\/li>\n<\/ol>\n<p>This isn&rsquo;t a theoretical knock. Any app that uses the pattern &ldquo;user chats, model responds, each message goes through the provider moderator&rdquo; is vulnerable to the TTI class of attack. That includes most of the customer-support bots, internal knowledge assistants, and content generators I&rsquo;ve seen deployed.<\/p>\n<p>The fix isn&rsquo;t &ldquo;switch models.&rdquo; The paper shows the technique working across many frontier models. The fix is architectural: you have to moderate the <em>conversation<\/em>, not the <em>turn<\/em>.<\/p>\n<h2 id=\"three-things-you-can-actually-do\">Three things you can actually do<\/h2>\n<h3 id=\"1-keep-a-conversation-level-risk-score\">1. Keep a conversation-level risk score<\/h3>\n<p>Maintain a running score on the conversation as a whole. Each turn contributes to the score based on its content, but so does the <em>pattern<\/em> of the conversation \u2014 requests that look like probing, requests that try to establish a premise, requests that ask the model to adopt a persona. When the conversation&rsquo;s accumulated score crosses a threshold, switch to a hardened system prompt, start logging more aggressively, or cut the session.<\/p>\n<p>You don&rsquo;t need a fancy model for this. A small classifier that scores transitions works fine. The point is that the state lives at the conversation level, not the turn level.<\/p>\n<h3 id=\"2-replay-the-full-conversation-through-a-reviewer-model-periodically\">2. Replay the full conversation through a reviewer model periodically<\/h3>\n<p>Once every N turns, or whenever the score crosses a bar, send the full transcript (not just the latest turn) to a reviewer model with a system prompt asking &ldquo;is this conversation, as a whole, trying to get me to do something my guidelines prohibit?&rdquo;<\/p>\n<p>This is more expensive than per-turn moderation, but you don&rsquo;t run it on every turn. You run it on sessions that the cheap-turn-moderator has flagged as interesting. It catches the distributed-intent case because you&rsquo;re handing the reviewer the thing it needs to see: the whole arc.<\/p>\n<p>The <a href=\"https:\/\/owasp.org\/www-project-top-10-for-large-language-model-applications\/\" rel=\"nofollow noopener\" target=\"_blank\">OWASP Top 10 for LLM applications<\/a> lists Prompt Injection as the #1 risk, and the multi-turn variant belongs in the same category. If you&rsquo;ve been treating that section as &ldquo;theoretical&rdquo; because your vendor handles it, please do not.<\/p>\n<h3 id=\"3-shorten-sessions-where-its-safe-to\">3. Shorten sessions where it&rsquo;s safe to<\/h3>\n<p>The longer the conversation, the more surface area an attacker has to spread intent across turns. For use cases that don&rsquo;t genuinely need long context (customer support, form filling, FAQ lookup), cap the session length and force a reset. It&rsquo;s a blunt instrument, but it removes the TTI attack surface almost entirely.<\/p>\n<p>This is uncomfortable because we&rsquo;ve built UX patterns around &ldquo;the assistant remembers you.&rdquo; For user-facing brand experiences it&rsquo;s fine. For anything high-trust (financial, medical, policy), short sessions are a genuinely defensible choice and it sidesteps a whole class of attacks.<\/p>\n<h2 id=\"the-part-nobody-likes-to-admit\">The part nobody likes to admit<\/h2>\n<p>The uncomfortable read of TTI is that &ldquo;safety&rdquo; in LLM apps is mostly a marketing surface, not a property of the system. Real safety in LLM deployments looks the same as it does everywhere else in security: defense in depth, assume breach, monitor the abstraction you actually run.<\/p>\n<p>I was going to write this same post six months ago about a related attack and put it off because &ldquo;the providers will patch it.&rdquo; Providers did patch some things. Then the attack surface moved, because attack surfaces always move, and now we&rsquo;re staring at the stateless-moderation gap. The <a href=\"https:\/\/arxiv.org\/abs\/2604.19139\" rel=\"nofollow noopener\" target=\"_blank\">Rise of Verbal Tics paper<\/a> from the same week is orthogonal but makes a similar point from a different angle: evaluating LLMs on isolated prompts misses properties that only emerge at conversation scale.<\/p>\n<p>If your LLM app is in production, the thing to do this week is not rewrite the system prompt. It&rsquo;s audit what your moderation layer actually looks at. Does it see individual turns or the full history? If individual turns, you now have a specific attack class to test against.<\/p>\n<h2 id=\"where-i-land-on-the-bigger-picture\">Where I land on the bigger picture<\/h2>\n<p>I&rsquo;ve stopped pretending LLM safety is a feature of the model and started treating it as a property of the system around the model. Something I wrote about from a different angle in my piece on <a href=\"https:\/\/abrarqasim.com\/blog\/your-reasoning-model-is-more-fragile-than-benchmarks-say\/\" rel=\"noopener\">why reasoning models are more fragile than benchmarks claim<\/a>.<\/p>\n<p>The infrastructure work I take on (some of which is listed on my <a href=\"https:\/\/abrarqasim.com\/about\" rel=\"noopener\">portfolio<\/a>) has increasingly included conversation-level moderation as a line item rather than a nice-to-have. It&rsquo;s a few hundred lines of code, a cheap classifier, and a policy for what to do when the risk score climbs. The payoff: you catch the attacks that slip past vendor moderation, and you&rsquo;ve got a log trail when someone argues &ldquo;our AI didn&rsquo;t do that.&rdquo;<\/p>\n<h2 id=\"one-thing-you-can-do-today\">One thing you can do today<\/h2>\n<p>Open your LLM app&rsquo;s logs. Find a session longer than ten turns. Read it end to end and ask: could I, as an attacker, have strung together something problematic across these turns that wouldn&rsquo;t have been caught by looking at any single turn?<\/p>\n<p>If the answer is &ldquo;no, obviously not,&rdquo; great. Your use case probably isn&rsquo;t a juicy target. If the answer is &ldquo;I don&rsquo;t actually know,&rdquo; you have homework. The research community just handed you a very clear name for the attack class. Use it. Test against it. Ship the fix before someone else discovers it in your logs.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>A new attack exploits the fact that most LLM safety layers evaluate turns in isolation. Here&#8217;s what the Transient Turn Injection result means for anyone shipping an LLM app.<\/p>\n","protected":false},"author":2,"featured_media":152,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"rank_math_title":"","rank_math_description":"A new attack exploits the fact that most LLM safety layers evaluate turns in isolation. Here's what the Transient Turn Injection result means for anyone shipping an LLM app.","rank_math_focus_keyword":"llm safety","rank_math_canonical_url":"","rank_math_robots":"","footnotes":""},"categories":[4,151],"tags":[152,155,5,153,154],"class_list":["post-153","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai","category-security","tag-ai-safety","tag-jailbreak","tag-llm","tag-prompt-injection","tag-security"],"_links":{"self":[{"href":"https:\/\/abrarqasim.com\/blog\/wp-json\/wp\/v2\/posts\/153","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/abrarqasim.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/abrarqasim.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/abrarqasim.com\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/abrarqasim.com\/blog\/wp-json\/wp\/v2\/comments?post=153"}],"version-history":[{"count":0,"href":"https:\/\/abrarqasim.com\/blog\/wp-json\/wp\/v2\/posts\/153\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/abrarqasim.com\/blog\/wp-json\/wp\/v2\/media\/152"}],"wp:attachment":[{"href":"https:\/\/abrarqasim.com\/blog\/wp-json\/wp\/v2\/media?parent=153"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/abrarqasim.com\/blog\/wp-json\/wp\/v2\/categories?post=153"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/abrarqasim.com\/blog\/wp-json\/wp\/v2\/tags?post=153"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}