Attention Is Not Expensive. Remembering Attention Is.
A note on KV cache, MQA/GQA, MLA, and why long-context models are becoming memory systems.
When I first learned attention, I thought the expensive part was the attention matrix.
That is not wrong. During training, self-attention really does create a
pairwise interaction between tokens. The familiar cost is
O(n^2) in sequence length. It is the first thing everyone learns about
Transformers, and it is still a useful warning label.
But I now think that framing is incomplete. During autoregressive inference, a different object quietly becomes the center of the system: the KV cache.
Inference asks: what form should the past take so the next token can read from it cheaply?
The object we keep carrying forward
In a decoder-only Transformer, each new token produces a query
q_t. That query attends to every key produced so far and reads from
every value produced so far:
q_t attends to K_1 ... K_t
and reads from V_1 ... V_t
The new query is temporary. The past keys and values are not. They must be saved for every layer, for every previous token, for every KV head.
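To make that concrete, here is a toy single-head decode step in NumPy. The function and variable names are mine, not any library's; the point is only which tensors get thrown away after the step and which are carried forward.

```python
import numpy as np

def decode_step(x_t, W_q, W_k, W_v, cache_k, cache_v):
    """One autoregressive step for a single attention head (toy sketch).

    x_t: (d_model,) embedding of the newest token.
    cache_k, cache_v: lists of (head_dim,) vectors, one per earlier token.
    """
    q_t = W_q @ x_t            # temporary: used once for this step, then discarded
    cache_k.append(W_k @ x_t)  # kept for the rest of decoding
    cache_v.append(W_v @ x_t)  # kept for the rest of decoding

    K = np.stack(cache_k)      # (t, head_dim): every key so far
    V = np.stack(cache_v)      # (t, head_dim): every value so far

    scores = K @ q_t / np.sqrt(q_t.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V         # attention output for the new token

# The cache grows by one entry per decoded token, per layer, per KV head:
rng = np.random.default_rng(0)
d_model, head_dim = 16, 8
W_q, W_k, W_v = (rng.standard_normal((head_dim, d_model)) for _ in range(3))
cache_k, cache_v = [], []
for x_t in rng.standard_normal((5, d_model)):   # decode 5 tokens
    out = decode_step(x_t, W_q, W_k, W_v, cache_k, cache_v)
print(len(cache_k))   # 5 cached keys (and 5 values) for this one head of one layer
```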
A rough size formula, per sequence, is:
KV cache ~= layers * context_length * kv_heads * head_dim * 2 * bytes_per_element
The 2 is for keys and values. This is not an implementation
footnote. Once the context gets long, this object becomes one of the main
economic constraints of serving a model.
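It helps to plug in numbers. The configuration below is illustrative, not any specific model's, and it assumes an fp16/bf16 cache (2 bytes per element).

```python
def kv_cache_bytes(layers, context_length, kv_heads, head_dim, bytes_per_element=2):
    # 2x for keys and values; bytes_per_element=2 assumes fp16/bf16 storage
    return layers * context_length * kv_heads * head_dim * 2 * bytes_per_element

# Illustrative configuration (not any particular model):
size = kv_cache_bytes(layers=60, context_length=128_000, kv_heads=8, head_dim=128)
print(f"{size / 2**30:.1f} GiB per sequence")   # ~29.3 GiB, before any batching
```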
This changed how I think about attention. The question is not only whether each token can look back. The question is: what exactly are we forcing the model to carry forward as its past?
Why MQA and GQA exist
Standard multi-head attention gives each query head its own key and value head. That is expressive, but expensive at inference time. More KV heads means more cache.
Multi-Query Attention has all query heads share a single K/V head. Grouped-Query Attention is the less extreme version: each group of query heads shares one K/V head.
This is easy to describe, but the reason matters. MQA and GQA are not just small architecture variants. They are evidence that inference pressure feeds back into model design. The cache is not a byproduct anymore. It is part of the architecture.
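A minimal NumPy sketch of the sharing, with shapes and names of my own choosing: only K and V are cached, so shrinking the number of KV heads shrinks the cache directly, and MQA is just the one-KV-head case.

```python
import numpy as np

def grouped_query_attention(Q, K, V):
    """Toy causal GQA over one sequence.

    Q: (n_q_heads, t, head_dim) query heads.
    K, V: (n_kv_heads, t, head_dim) shared key/value heads -- this is what gets cached.
    Assumes n_q_heads is a multiple of n_kv_heads; n_kv_heads == 1 is MQA.
    """
    n_q, t, d = Q.shape
    group = n_q // K.shape[0]

    # Each cached KV head serves a whole group of query heads.
    K_exp = np.repeat(K, group, axis=0)                 # (n_q_heads, t, head_dim)
    V_exp = np.repeat(V, group, axis=0)

    scores = Q @ K_exp.transpose(0, 2, 1) / np.sqrt(d)  # (n_q_heads, t, t)
    scores += np.triu(np.full((t, t), -np.inf), k=1)    # causal mask
    scores -= scores.max(axis=-1, keepdims=True)        # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V_exp                              # (n_q_heads, t, head_dim)
```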
MLA changes the question
MQA and GQA ask: can we store fewer KV heads?
DeepSeek's Multi-Head Latent Attention asks a deeper question: can we change the representation of the cache itself?
In the simplified picture, instead of caching full per-head keys and values for every token, MLA caches a low-dimensional latent vector per token and reconstructs the keys and values it needs through learned up-projections.
The exact projection scheme matters, but the larger point is simpler: MLA treats the cache not as a leftover tensor from attention, but as a first-class object to be designed.
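Here is a deliberately oversimplified NumPy sketch of just the storage idea. It is not DeepSeek's exact formulation: real MLA also routes positional information through a separate path and can absorb the up-projections into the attention computation instead of materializing K and V. All names and dimensions below are mine.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent, n_heads, head_dim = 1024, 128, 16, 64

W_down = rng.standard_normal((d_latent, d_model)) * 0.02            # applied once per token
W_up_k = rng.standard_normal((n_heads * head_dim, d_latent)) * 0.02
W_up_v = rng.standard_normal((n_heads * head_dim, d_latent)) * 0.02

def cache_token(x_t, latent_cache):
    """Cache a compressed latent instead of full per-head keys and values."""
    latent_cache.append(W_down @ x_t)                 # (d_latent,) per token

def reconstruct_kv(latent_cache):
    """Expand cached latents into per-head keys and values when attention needs them."""
    C = np.stack(latent_cache)                        # (t, d_latent)
    K = (C @ W_up_k.T).reshape(-1, n_heads, head_dim)
    V = (C @ W_up_v.T).reshape(-1, n_heads, head_dim)
    return K, V

# Cached numbers per token: d_latent (128) instead of 2 * n_heads * head_dim (2048),
# a 16x smaller cache in this toy configuration.
```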
This is the point where attention starts to look less like a layer and more like a memory system. The model is no longer only learning how to attend. It is also learning what form its remembered past should take.
DeepSeek V4 feels less surprising after this
This is why I think DeepSeek V4 is interesting, but not as an isolated event.
Its model card describes a hybrid attention architecture combining Compressed Sparse Attention and Heavily Compressed Attention. It also claims that, in the 1M-token setting, DeepSeek-V4-Pro uses 27% of the single-token inference FLOPs and 10% of the KV cache compared with DeepSeek-V3.2.
The important idea is not simply "1M context". A million-token window is only useful if the system can afford to keep, read, and route that history. Otherwise it is a benchmark number rather than a working memory.
Once the KV cache becomes a central bottleneck, a hybrid attention design is much less surprising. Some parts of the past should be local. Some should be compressed. Some should be retrieved sparsely. Some may be stored in a much more compact global representation.
This makes long-context modeling feel closer to memory hierarchy design than to just making the attention matrix bigger.
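To make the hierarchy framing concrete, here is a toy sketch of one such split: a recent window kept exactly, plus a mean-pooled summary of everything older. This is my own illustration of the general idea, not DeepSeek's CSA/HCA design.

```python
import numpy as np

def compress_past(K, V, window=4096, stride=64):
    """Toy two-level memory over a KV cache (illustrative only).

    K, V: (t, head_dim) cached keys and values for one head.
    The most recent `window` tokens are kept exactly; older tokens are
    replaced by one mean-pooled entry per `stride` tokens.
    """
    K_old, V_old = K[:-window], V[:-window]
    K_read, V_read = K[-window:], V[-window:]

    n = (len(K_old) // stride) * stride          # drop a ragged tail for simplicity
    if n:
        K_sum = K_old[:n].reshape(-1, stride, K.shape[-1]).mean(axis=1)
        V_sum = V_old[:n].reshape(-1, stride, V.shape[-1]).mean(axis=1)
        K_read = np.concatenate([K_sum, K_read])
        V_read = np.concatenate([V_sum, V_read])

    return K_read, V_read   # what attention actually reads: far fewer than t entries
```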
What changed in my model
I used to think attention was expensive because every token looks at every previous token. I now think that is only half of the story.
In production, the more interesting question is not just how tokens look back, but what form the past is allowed to take.
This is a better way for me to read recent model architecture papers. MQA, GQA, MLA, sparse attention, and hybrid attention are not separate tricks. They are different answers to the same pressure: the model wants more context, but the system cannot afford to remember everything in the most literal form.
Attention is not just getting longer. It is being forced to become more selective about memory.
References
- Noam Shazeer, "Fast Transformer Decoding: One Write-Head is All You Need" (2019)
- Joshua Ainslie et al., "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints" (2023)
- DeepSeek-AI, "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model" (2024)
- DeepSeek-AI, "DeepSeek-V3 Technical Report" (2024)
- DeepSeek-AI, "DeepSeek-V4-Pro Technical Report and Model Card"
- Woosuk Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention" (vLLM, 2023)