Confession: I spent an embarrassing amount of last weekend staring at a latency flamegraph for a self-hosted 32B model, trying to figure out where the time was going. The short version is: nowhere clever. It was going exactly where LLM decode latency always goes. Token by token. One autoregressive step at a time. No matter how many GPUs I threw at it, the generation loop was still marching forward like a kid reading aloud in third grade.
I wanted to throw my laptop out the window. Instead I ended up reading a paper that made me feel slightly better about the situation, and slightly worse about the status quo.
Why LLM inference feels slow even when the hardware is fast
If you have ever served an open-weight model in production, you already know this. Your GPU can push tens of thousands of tokens per second during prefill, because every prompt position is computed in parallel. Then decode starts and everything collapses to tens of tokens per second, because every new token has to wait for the previous one to be sampled before it can even be computed. This is the core reason vLLM, TensorRT-LLM, and llama.cpp all spend so much engineering effort on tricks like paged attention and continuous batching. Those tricks raise aggregate throughput by keeping the GPU busy across requests; they do not make any single request decode faster. They are not fixing the autoregressive bottleneck. They are working around it.
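To make the shape of the problem concrete: the decode loop is, at its core, a serial dependency chain. This is a stand-in sketch, not any particular engine's code; `model` is just a placeholder for a full forward pass returning per-position logits.

```python
import torch

def decode(model, tokens, max_new_tokens):
    """The whole problem in four lines: step t+1 cannot start until step t
    has produced a token, so this loop is serial no matter how many GPUs
    sit behind `model`."""
    for _ in range(max_new_tokens):
        logits = model(tokens)                 # one full forward pass per new token
        next_id = logits[-1].argmax()          # greedy pick of the next token
        tokens = torch.cat([tokens, next_id.view(1)])  # only now can the next step begin
    return tokens
```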
The usual workarounds are speculative decoding, where a small draft model proposes several tokens and the big model verifies them in one pass, and Medusa-style multi-head prediction, where extra decoding heads are grafted onto the model so it can guess a few tokens ahead. Both help. Both also feel like patches bolted onto a pipeline that was never designed to run at interactive latency in the first place.
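Speculative decoding is worth a quick sketch, because FBS ends up attacking the same latency from a completely different direction. The version below is the simplified greedy-verification variant, not the full rejection-sampling algorithm from the speculative decoding papers; `draft_step` and `target_forward` are stand-ins for a small draft model and the big target model, each mapping a token sequence to next-token logits.

```python
import torch

def speculative_step(tokens, draft_step, target_forward, k=4):
    """One speculative step: propose k tokens with the cheap draft model,
    then verify the whole proposal with a single big-model forward pass."""
    proposal = tokens.clone()
    for _ in range(k):                                   # k cheap draft steps
        logits = draft_step(proposal)                    # (vocab,) next-token logits
        proposal = torch.cat([proposal, logits.argmax().view(1)])

    # One expensive pass scores every proposed position at once.
    target_preds = target_forward(proposal).argmax(dim=-1)   # (len(proposal),)

    # Keep draft tokens only while the big model agrees; at the first
    # disagreement, take the big model's token instead and stop.
    accepted = tokens.clone()
    for i in range(len(tokens), len(proposal)):
        big_choice = target_preds[i - 1]                 # big model's pick for position i
        accepted = torch.cat([accepted, big_choice.view(1)])
        if proposal[i] != big_choice:
            break
    return accepted
```

The win comes entirely from the fact that the verification pass is parallel over positions, which is exactly the parallelism decode normally throws away.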
What nobody has really touched, until recently, is the underlying assumption that a Transformer has to read its own output the way a first grader reads a sentence: one word, pause, one word, pause. Humans do not read like that, once we learn how. We chunk. We preview. We skim. We skip ahead when we already know what is coming.
The weird thing humans do when we read

Reading research has known for decades that fluent readers do three things at once. We hold a narrow area of sharp focus (the fovea), we grab peripheral information from the upcoming words to plan saccades (parafoveal preview), and we allocate less attention to words we have already predicted internally. That is why you can absolutely fly through a paragraph in a novel but crawl through a legal contract. The predictable text gets skimmed. The dense text gets re-read.
Transformers do none of that. Every layer attends the same way to every token. Every decode step does the same amount of work as every other decode step. There is no preview. There is no skip. There is no structural awareness that some tokens are boring and some are load-bearing. The model spends the same FLOPs on the word “the” as it does on a number that decides the answer.
That is a lot of wasted compute. And it is the thing a paper called Fovea-Block-Skip Transformer, or FBS, tries to fix.
What Fovea-Block-Skip actually changes
FBS adds three small modules to a standard Transformer. None of them are huge. All three are trainable. The trick is that they are designed to play together.
The first is Parafovea-Attention Window, or PAW. Instead of a single fixed attention window, the model gets a narrow sharp-focus window and a wider but cheaper peripheral window. The sharp window looks at the current position the way normal attention does. The peripheral window grabs a cheap summary of what is coming up next. It is, as best I can describe it, a cost-aware preview mechanism built into the attention op itself.
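The paper's exact formulation is more involved than I want to reproduce from memory, but the core idea is easy to cartoon: exact attention over a narrow window, pooled summaries over a wider preview window. Everything below, the names and the mean-pooling included, is my toy, not PAW itself.

```python
import torch
import torch.nn.functional as F

def fovea_preview_attention(q, k, v, i, sharp=8, preview=64, pool=8):
    """Toy 'sharp focus + cheap preview' attention for position i while
    reading a sequence (prefill-style, so tokens after i already exist).

    q:       (d,)    query at position i
    k, v:    (T, d)  keys/values for the whole sequence being read
    sharp:   the last `sharp` positions up to i get exact attention
    preview: up to `preview` upcoming tokens are visible, but only through
             mean-pooled summaries, `pool` tokens per summary slot
    """
    d = q.shape[-1]

    # Fovea: ordinary attention over the narrow window ending at i.
    lo = max(0, i - sharp + 1)
    k_parts, v_parts = [k[lo:i + 1]], [v[lo:i + 1]]

    # Parafoveal preview: coarse summaries of what is coming up next, so a
    # 64-token preview only costs 64 // pool key/value slots.
    k_next, v_next = k[i + 1:i + 1 + preview], v[i + 1:i + 1 + preview]
    n = (len(k_next) // pool) * pool
    if n:
        k_parts.append(k_next[:n].view(-1, pool, d).mean(dim=1))
        v_parts.append(v_next[:n].view(-1, pool, d).mean(dim=1))

    k_all, v_all = torch.cat(k_parts), torch.cat(v_parts)
    attn = F.softmax(q @ k_all.T / d ** 0.5, dim=-1)
    return attn @ v_all
```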
The second is Chunk-Head, or CH. The model learns to predict chunk boundaries, not just token boundaries. Once a chunk is identified, the model can reallocate compute across the chunk instead of spending the same amount of work on every token inside it. Think of it as the model learning to say, “this whole phrase is boilerplate, I can cheap out,” or “this number here matters, spend compute.”
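Again, a cartoon rather than the paper's CH: a per-position boundary score, plus a crude per-chunk importance that downstream logic could use to budget compute chunk by chunk.

```python
import torch
import torch.nn as nn

class ToyChunkHead(nn.Module):
    """Toy chunk-boundary head. Not the paper's CH module, just the shape of
    the idea: predict where chunks end, then score each chunk so compute can
    be allocated per chunk instead of per token."""

    def __init__(self, d_model):
        super().__init__()
        self.boundary = nn.Linear(d_model, 1)   # does a chunk end here?
        self.salience = nn.Linear(d_model, 1)   # does this chunk deserve compute?

    def forward(self, h):                        # h: (T, d) hidden states
        ends = torch.sigmoid(self.boundary(h)).squeeze(-1) > 0.5   # (T,) bool
        budgets, start = [], 0
        for t in range(len(h)):
            if ends[t] or t == len(h) - 1:       # close the current chunk
                chunk = h[start:t + 1]
                score = torch.sigmoid(self.salience(chunk.mean(0))).item()
                budgets.append((start, t, score))
                start = t + 1
        return budgets                            # [(chunk_start, chunk_end, importance), ...]
```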
The third is Skip-Gate, or SG. This is the part that actually skips. Based on the output of the first two modules, SG decides whether the current token can be advanced with a cheap path, or whether it needs the full-fat transformer stack. It is a learned, content-aware early-exit.
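Same caveat one more time: the sketch below is my guess at the shape of a learned early exit, not the actual Skip-Gate. The gate here only reads the hidden state; in FBS it is described as also consuming the outputs of PAW and Chunk-Head.

```python
import torch
import torch.nn as nn

class ToySkipGate(nn.Module):
    """Toy content-aware early exit: run a cheap prefix of the stack, let a
    tiny learned gate decide whether the expensive remainder is worth it."""

    def __init__(self, d_model, cheap_layers, expensive_layers, lm_head, threshold=0.5):
        super().__init__()
        self.cheap_layers = cheap_layers          # e.g. the first N transformer blocks
        self.expensive_layers = expensive_layers  # the blocks we would like to skip
        self.gate = nn.Linear(d_model, 1)         # learned "does this token need more?"
        self.lm_head = lm_head
        self.threshold = threshold

    def forward(self, h):                          # h: (1, T, d), batch-1 decoding assumed
        for layer in self.cheap_layers:
            h = layer(h)
        need_more = torch.sigmoid(self.gate(h[:, -1])).item()   # judge the current token
        if need_more < self.threshold:             # boring token: exit early
            return self.lm_head(h), "cheap"
        for layer in self.expensive_layers:        # load-bearing token: full-fat stack
            h = layer(h)
        return self.lm_head(h), "full"
```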
The authors report that FBS improves the quality-efficiency tradeoff across a range of benchmarks without increasing parameter counts, and that the three modules are complementary. That last bit matters, because a lot of efficiency papers show a gain from one trick in isolation and then lose it when you try to stack it with another trick.
What this means for anyone running a model in production
It is tempting to read a paper like this and think, “cool, I will swap this into vLLM next week.” You will not. FBS is an architectural change. You cannot bolt it onto a pretrained LLaMA or Qwen without retraining. For existing production systems, the near-term lesson is not “use FBS.” The near-term lesson is that the autoregressive ceiling is not a law of nature. It is a design choice, and the people who ship the next wave of open-weight models get to make it differently.
If you are a startup choosing between model families for self-hosting, this is another reason to watch the base architectures, not just the benchmark scores. A 32B model with a skim-capable architecture could feel meaningfully snappier to users than a 70B that grinds through every token. Users do not read eval scores. They feel latency.
I wrote a longer comparison of the open-weight model families worth tracking in my earlier post on where the tradeoffs in open-source LLMs actually live, and FBS is a good example of the kind of work that will show up in a future version of that post as soon as someone trains a real model with it.
The other implication is more uncomfortable. If your product latency budget is tight, you should probably stop benchmarking “tokens per second” as a single number, and start benchmarking “tokens per second on the boring parts” versus “tokens per second on the load-bearing parts.” A skim-capable architecture will look fantastic on the first number and merely normal on the second. Averaged together, it beats a uniform model. Sliced apart, it tells you where the user actually waits.
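Concretely, that slicing is about ten lines of Python once you are already logging per-token step times, using next-token entropy as a rough stand-in for boring versus load-bearing. The cutoff below is arbitrary; pick one from your own entropy histogram.

```python
import statistics

def split_decode_report(step_times_ms, entropies, cutoff=2.0):
    """Split per-token decode timings into 'boring' (low-entropy) and
    'load-bearing' (high-entropy) buckets and report tokens/sec for each."""
    boring = [t for t, h in zip(step_times_ms, entropies) if h < cutoff]
    hard = [t for t, h in zip(step_times_ms, entropies) if h >= cutoff]

    def tps(times_ms):
        return 1000.0 / statistics.mean(times_ms) if times_ms else float("nan")

    return {
        "boring_tok_per_s": tps(boring),
        "load_bearing_tok_per_s": tps(hard),
        "boring_fraction": len(boring) / max(1, len(step_times_ms)),
    }
```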
How I would test this if I had a weekend
If I had a clean weekend, three GPUs, and no family obligations, here is what I would do. First, pick a modest open-weight base, like a 7B Qwen or an 8B Llama 3. Second, instrument the decode loop to log the per-token time and the per-token entropy. Third, build a tiny toy version of Skip-Gate: a one-layer classifier that, given the last K hidden states, predicts whether the next token will be high-entropy or low-entropy. Fourth, only run the full forward pass when the classifier says high-entropy, and otherwise cheat with a smaller head.
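Strung together, the skeleton looks something like this. Every name in it is a stand-in: `full_forward`, `cheap_head`, and `hidden_state_of` are whatever model plumbing you wire in, and `gate` is the one-layer classifier from step three, trained offline on logged (hidden state, entropy) pairs.

```python
import time
import torch
import torch.nn.functional as F

def gated_decode(tokens, full_forward, cheap_head, gate, hidden_state_of,
                 max_new_tokens=128, k_ctx=4):
    """Entropy-gated decode loop: cheap path for predictable tokens, full
    forward pass only when the gate thinks the next token is hard."""
    log = []  # one (token_id, step_ms, entropy, path) tuple per decode step
    for _ in range(max_new_tokens):
        t0 = time.perf_counter()
        ctx = hidden_state_of(tokens)[-k_ctx:]               # last K hidden states, (K, d)
        hard = bool(torch.sigmoid(gate(ctx.mean(dim=0))) > 0.5)
        logits = full_forward(tokens) if hard else cheap_head(tokens)   # (vocab,)
        probs = F.softmax(logits, dim=-1)
        entropy = -(probs * probs.clamp_min(1e-9).log()).sum().item()
        next_id = probs.argmax()
        tokens = torch.cat([tokens, next_id.view(1)])
        log.append((next_id.item(), (time.perf_counter() - t0) * 1000.0,
                    entropy, "full" if hard else "cheap"))
    return tokens, log
```

Feeding `log` into the split_decode_report function from the previous section gives the boring-versus-load-bearing numbers directly.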
That is not FBS. It is a duct-tape cartoon of FBS. But it would let me measure, end to end, whether skim-capable inference is worth chasing on the models I actually ship. That kind of cheap empirical check is exactly the work I like best, and it is the kind of thing I cover more of on the rest of my site.
I am going to try this experiment this week. If the numbers are boring, I will say so in a follow-up post. If they are interesting, I will write a much longer one. Either way, the autoregressive ceiling deserves more attention than it currently gets, and papers like FBS are a useful nudge in the right direction.