Attention Is Not Expensive. Remembering Attention Is.

April 2026

A note on KV cache, MQA/GQA, MLA, and why long-context models are becoming memory systems.

When I first learned attention, I thought the expensive part was the attention matrix.

That is not wrong. During training, self-attention really does create a pairwise interaction between tokens: the familiar O(n^2) cost in sequence length. It is the first thing everyone learns about Transformers, and it is still a useful warning label.

But I now think that framing is incomplete. During autoregressive inference, a different object quietly becomes the center of the system: the KV cache.

The shift

Training asks: how do we compute attention over a sequence?
Inference asks: what form should the past take so the next token can read from it cheaply?

The object we keep carrying forward

In a decoder-only Transformer, each new token produces a query q_t. That query attends to all previous keys and reads from all previous values:

q_t attends to K_1 ... K_t
and reads from V_1 ... V_t

The new query is temporary. The past keys and values are not. They must be saved for every layer, for every previous token, for every KV head.
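To make this concrete, here is a minimal single-head decode step in NumPy. The function name and shapes are mine, not from any framework, and real implementations preallocate tensors rather than growing Python lists; the point is only to show which state persists across steps.

```python
import numpy as np

def decode_step(q_t, k_cache, v_cache, k_t, v_t):
    """One autoregressive step: append the new token's K/V, then attend.

    q_t, k_t, v_t: (head_dim,) vectors for the current token.
    k_cache, v_cache: lists of past (head_dim,) vectors that persist
    across steps -- this is the object the text calls the KV cache.
    """
    k_cache.append(k_t)           # the past grows by one entry per step
    v_cache.append(v_t)
    K = np.stack(k_cache)         # (t, head_dim)
    V = np.stack(v_cache)
    scores = K @ q_t / np.sqrt(q_t.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()      # softmax over all past positions
    return weights @ V            # (head_dim,) attention output
```

Note what is thrown away and what is kept: `q_t` is used once and discarded, while `k_cache` and `v_cache` must survive the entire generation, at every layer and every KV head.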

Figure 1. In autoregressive inference, the current query q_t is cheap to create. The expensive part is carrying a growing memory of past K/V states, one entry per previous token, through the whole generation. The past is not recomputed from scratch: it is stored as keys and values, then read again and again.

A rough size formula is:

KV cache bytes ~= layers * context_length * kv_heads * head_dim * 2 * bytes_per_element

The 2 is for keys and values. This is not an implementation footnote. Once the context gets long, this object becomes one of the main economic constraints of serving a model.
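As a sanity check on the formula, here is a tiny calculator. The 7B-class configuration below is hypothetical, chosen only to show the scale, and assumes fp16/bf16 storage (2 bytes per element).

```python
def kv_cache_bytes(layers, context_length, kv_heads, head_dim,
                   bytes_per_element=2):
    """Apply the size formula: the factor of 2 is for keys and values,
    bytes_per_element=2 assumes fp16/bf16 storage."""
    return layers * context_length * kv_heads * head_dim * 2 * bytes_per_element

# Hypothetical full-MHA config: 32 layers, 32 KV heads, head_dim 128.
# At a 32k context this works out to exactly 16 GiB of cache.
cache_gib = kv_cache_bytes(32, 32_768, 32, 128) / 2**30  # -> 16.0
```

Sixteen gigabytes of cache for a single 32k-token request is why the formula is an economic constraint, not a footnote: it competes directly with model weights and other requests for accelerator memory.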

Figure 2. The cache is a product of architectural choices (layers, KV heads, head dimension) and workload length (context). Long context hurts because this memory grows across depth and time; reducing the cache can be as important as making attention compute faster.

This changed how I think about attention. The question is not only whether each token can look back. The question is: what exactly are we forcing the model to carry forward as its past?

Why MQA and GQA exist

Standard multi-head attention gives each query head its own key and value head. That is expressive, but expensive at inference time. More KV heads means more cache.

Multi-Query Attention makes many query heads share a single K/V head. Grouped-Query Attention is the less extreme version: groups of query heads share K/V heads.
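The head-sharing pattern is easy to sketch. Shapes and names below are illustrative, not from any library; setting n_kv_heads = 1 recovers MQA, and n_kv_heads = n_q_heads recovers full MHA.

```python
import numpy as np

def gqa_attention(Q, K, V, n_q_heads, n_kv_heads):
    """Grouped-query attention for one query position (a sketch).

    Q: (n_q_heads, head_dim) -- one query per query head.
    K, V: (t, n_kv_heads, head_dim) -- the cache stores only
    n_kv_heads entries per past token, not n_q_heads.
    """
    group = n_q_heads // n_kv_heads   # query heads per shared KV head
    outs = []
    for h in range(n_q_heads):
        kv = h // group               # which stored KV head this query reads
        scores = K[:, kv] @ Q[h] / np.sqrt(Q.shape[1])
        w = np.exp(scores - scores.max())
        w /= w.sum()
        outs.append(w @ V[:, kv])
    return np.stack(outs)             # (n_q_heads, head_dim)
```

The cache saving is visible in the shapes: the stored K and V carry n_kv_heads per token, so going from 32 KV heads to 8 cuts the cache by 4x without touching the number of query heads.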

Figure 3. In MHA, every query head has its own stored K/V head; in MQA, all query heads share a single K/V head; in GQA, groups of query heads share K/V heads. These are not cosmetic variants. They are cache reductions that came from serving pressure.

This is easy to describe, but the reason matters. MQA and GQA are not just small architecture variants. They are evidence that inference pressure feeds back into model design. The cache is not a byproduct anymore. It is part of the architecture.

MLA changes the question

MQA and GQA ask: can we store fewer KV heads?

DeepSeek's Multi-Head Latent Attention asks a deeper question: can we change the representation of the cache itself?

In the simplified picture, instead of storing the full key-value representation for every token in the obvious form, MLA stores a compressed latent representation and reconstructs the needed K/V-like objects through projections.
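A toy version of that idea, with made-up dimensions and random projection matrices. This is the shape of the trick, not DeepSeek's actual scheme, which among other things handles rotary embeddings and multiple heads differently.

```python
import numpy as np

class LatentKVCache:
    """Sketch of MLA-style caching: store a small latent per token,
    project up to K/V only when attention needs them."""

    def __init__(self, d_model=512, d_latent=64, head_dim=64, seed=0):
        rng = np.random.default_rng(seed)
        self.W_down = rng.normal(size=(d_model, d_latent)) * 0.02  # compress
        self.W_uk = rng.normal(size=(d_latent, head_dim)) * 0.02   # latent -> K
        self.W_uv = rng.normal(size=(d_latent, head_dim)) * 0.02   # latent -> V
        self.latents = []   # the only per-token state we keep

    def append(self, h_t):
        # cache d_latent numbers per token instead of 2 * head_dim
        self.latents.append(h_t @ self.W_down)

    def keys_values(self):
        C = np.stack(self.latents)             # (t, d_latent)
        return C @ self.W_uk, C @ self.W_uv    # K, V recovered on demand
```

The design choice is in `append`: what persists per token is the latent, so the cache footprint is set by d_latent rather than by the K/V dimensions, and the up-projections W_uk and W_uv are learned as part of the model.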

Figure 4. Before: cache K and V directly, a large per-token representation. With MLA: cache a compressed latent per token, then project K and V back out when attention needs them. MLA is interesting because it treats the cache representation as a design object, not a leftover tensor from attention.

The exact projection scheme matters, but the larger point is simpler: MLA treats the cache not as a leftover tensor from attention, but as a first-class object to be designed.

This is the point where attention starts to look less like a layer and more like a memory system. The model is no longer only learning how to attend. It is also learning what form its remembered past should take.

DeepSeek V4 feels less surprising after this

This is why I think DeepSeek V4 is interesting, but not as an isolated event.

Its model card describes a hybrid attention architecture combining Compressed Sparse Attention and Heavily Compressed Attention. It also claims that, in the 1M-token setting, DeepSeek-V4-Pro uses 27% of the single-token inference FLOPs and 10% of the KV cache compared with DeepSeek-V3.2.

The important idea is not simply "1M context". A million-token window is only useful if the system can afford to keep, read, and route that history. Otherwise it is a benchmark number rather than a working memory.

Figure 5. Long context as a memory hierarchy: a local window keeps recent tokens at high fidelity, compressed memory holds a cheaper representation of longer history, and sparse retrieval selects the useful blocks. DeepSeek V4 is a useful trigger because it makes the memory hierarchy visible: local, compressed, and sparse access are no longer separate ideas.

Once the KV cache becomes a central bottleneck, a hybrid attention design is much less surprising. Some parts of the past should be local. Some should be compressed. Some should be retrieved sparsely. Some may be stored in a much more compact global representation.

This makes long-context modeling feel closer to memory hierarchy design than to just making the attention matrix bigger.

What changed in my model

I used to think attention was expensive because every token looks at every previous token. I now think that is only half of the story.

In production, the more interesting question is not just how tokens look back, but what form the past is allowed to take.

This is a better way for me to read recent model architecture papers. MQA, GQA, MLA, sparse attention, and hybrid attention are not separate tricks. They are different answers to the same pressure: the model wants more context, but the system cannot afford to remember everything in the most literal form.

Attention is not just getting longer. It is being forced to become more selective about memory.

References

  1. Noam Shazeer, Fast Transformer Decoding: One Write-Head is All You Need
  2. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
  3. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
  4. DeepSeek-V3 Technical Report
  5. DeepSeek-V4-Pro Technical Report and Model Card
  6. Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM)