KV Cache Compression and Its Infra Problems

📝 Weian Mao, Yukang Chen, Wei Huang, Shuai Yang, Luozhou Wang, Song Han
📅 June 12, 2026 ⏳ ~10 min read

Consider serving OpenClaw on a local GPU with a capable reasoning model. The agent picks up a simple task — read a handful of documents, write a weekly report — and crashes halfway through. Not because the task is hard, but because the GPU ran out of memory.

This blog examines KV cache compression through an infrastructure lens: not only what the algorithms do, but whether they can actually run in production. An entire field is devoted to compressing the KV cache so that long inference fits in GPU memory; most of it works in the lab, much less in real deployment. The gap comes down to two infrastructure problems that papers almost never discuss. This post surveys the field's main ideas, traces where they collide with these problems, and shows how a geometric property hidden beneath RoPE has recently been used to resolve both.

The hard part of KV cache compression is not choosing which tokens to keep — it is two collisions with production infrastructure. Existing methods need historical attention scores, but the FlashAttention kernel never writes those scores into GPU memory. And paged attention reclaims memory only when a block holds no KV at all — yet after token eviction the survivors leave almost every block partially occupied, so the "freed" memory never actually comes back.
Contents
  1. Background: The KV Cache and Memory Exhaustion in Long Inference
  2. The Existing Methods — and Two Infrastructure Problems
  3. A System Solution to the Two Infrastructure Problems
  4. KV Cache Infra in Video Generation
  5. Closing Thoughts

1. Background: The KV Cache and Memory Exhaustion in Long Inference

When a Transformer generates a token, it computes a query, a key, and a value vector; the new query attends over all previous keys and values to read the model's own history. The KV cache is what makes this affordable: each token's K and V are computed once and saved, and every later step reuses them instead of recomputing — O(n) work per step instead of O(n²). But nothing ever shrinks it: every generated token appends new K and V rows across all layers and heads.

The arithmetic ends badly. Qwen3-32B with 4-bit quantized weights — a configuration people actually deploy — crashes with an out-of-memory error on a 24 GB GPU after roughly 24,000 generated tokens, short of the 32K-token traces reasoning models routinely produce on hard problems (Figure 1).

KV cache growth consuming GPU memory (schematic) A horizontal bar represents GPU memory, schematic and not to scale. A fixed segment of about one third holds model weights plus runtime; the remaining two thirds fills with the growing KV cache until it reaches a dashed out-of-memory line at the bar's right end. Below the bar, an LLM box emits output tokens into a history box aligned directly under the KV cache segment; the history fills with token squares one by one, and with each new token the KV cache segment above extends in sync. GPU memory (schematic — not to scale) Weights + runtime (fixed) KV cache grows with every generated token → OOM LLM generating… output tokens history — every token's K/V stays here Every generated token appends its keys and values to the history — that history is the KV cache. Nothing is freed while generation continues, so long outputs march the bar into the OOM wall.
Figure 1. Schematic, not to scale: every token the model generates appends its keys and values to the running history — the KV cache — so cache memory grows with output length while the weights stay fixed. Let generation run long enough, and the cache alone drives the GPU out of memory.

Compression is possible because attention is sparse. A small fraction of tokens collect the overwhelming majority of attention weight; a token that never gets attended to could, in principle, be deleted from the cache without changing the model's outputs. The question is: which tokens are safe to delete, and when?

2. The Existing Methods — and Two Infrastructure Problems

The field launched in 2023 from one observation: attention is highly non-uniform. Models devote a disproportionate share of attention weight to the very first tokens — "attention sinks," not semantically important but always present (StreamingLLM [2]) — while roughly 20% of tokens collect 80% of the total weight, "heavy hitters" that tend to stay important over time (H2O [3]; Scissorhands [4]). The standard recipe followed (Figure 2): always keep the sinks, always keep a sliding window of recent tokens, and keep a budget of heavy hitters from the middle. On long-document tasks — question answering, summarization, code retrieval — the recipe works well.

StreamingLLM: attention sinks plus a sliding window A row of tokens. The first two are attention sinks and are always kept. A window outline covers the four most recent tokens. As generation proceeds, new tokens appear on the right, the window slides forward, and tokens that fall out of it are evicted into dashed ghost slots. The sinks never move and are never evicted. StreamingLLM — keep the sinks, keep a sliding window, evict the rest ← oldest        token sequence grows during generation        newest → sink KEEP ✓ sink KEEP ✓ sliding window — kept sliding window — kept sliding window — kept sliding window — kept sliding window — kept sliding window — kept sliding window — kept Attention sink — always kept In sliding window — kept Fell out of window — evicted
Figure 2. StreamingLLM keeps only two kinds of tokens: a handful of attention sinks at the very start, and a sliding window of the most recent tokens. As generation proceeds (animated), the window slides forward and everything that falls out of it is evicted — memory stays constant, but information outside the window is permanently lost.

The recipe leaves its key question open: how does a method know which middle tokens are the heavy hitters? The answer that defines the largest category of compression methods is to read the model's own historical attention scores. Every decode step computes attention over the whole history anyway, and those scores are a running record of which cached tokens the model actually uses. H2O [3] is the canonical example: it maintains, for every cached token, a cumulative sum of the attention that token has received across all decode steps so far, and after each step it evicts the lowest-scoring token to hold the cache at a fixed budget (Figure 3).

H2O: cumulative historical attention scoring A query box at the top attends to eight cached tokens. With each decode step, every token's score bar grows by the attention it just received. After several steps, tokens with high cumulative scores are kept; tokens with low scores are evicted into dashed ghost slots. H2O — every decode step adds to each token's cumulative attention score new token step t step t+1 step t+2 sink KEEP ✓ evict KEEP ✓ evict evict KEEP ✓ evict recent KEEP ✓ Cumulative attention score (grows each step) Lowest scores — evicted at each step
Figure 3. H2O's "historical attention" scoring. Every decode step, the newest token attends over the whole cache (flashing arrows), and each cached token's cumulative score grows by the attention it just received. As decoding proceeds, the lowest-scoring tokens are evicted one per step, holding the cache at a fixed budget. These per-step scores are exactly what FlashAttention never writes to memory — Infrastructure Problem 1, below.

SnapKV [6] is the most influential refinement of the same idea. Instead of accumulating scores continuously, it scores once (Figure 4): the most recent W tokens of the prompt form an "observation window," and the attention they pay to the full history decides — at the end of prefill, once and for all — which tokens stay. This removes the per-step bookkeeping and avoids cumulative scoring's bias toward tokens that have simply been around longer. The category has many further variants on the same ingredients: Scissorhands [4] and TOVA [5] swap the decode-time scoring signal; PyramidKV [7] and Ada-KV [8] split the one-shot budget per layer and per head.

But the observation window cannot be made large. Because RoPE — the rotation that encodes each token's position into its query vector — orients every query by its position, only the most recent queries, empirically about 25, reflect where the model is actually attending. So any piece of information that draws no attention during one phase of the reasoning process gets evicted — even if later reasoning depends on it.

Observation-window snapshot scoring The most recent W tokens form an observation window. The attention they pay to history identifies heavy-hitter tokens. High-scoring tokens are kept within budget B; low-scoring tokens are evicted. Observation-Window Snapshot Scoring (SnapKV-style) ← token history (oldest left → newest right) → sink KEEP ✓ evict evict KEEP ✓ evict evict KEEP ✓ evict evict W KEEP ✓ W KEEP ✓ W KEEP ✓ observation window (W recent tokens) attn score
Figure 4. Snapshot-based methods in the SnapKV style assign each cached token an importance score by measuring how much attention it receives from a window of recent queries; tokens outside the top-B budget are evicted.

This entire category — whether it scores continuously like H2O or once like SnapKV — rests on observing attention scores; among the methods above, only StreamingLLM's fixed sink-plus-recency rule needs none. That shared requirement, not the quality of any particular heuristic, collides with production infrastructure in two ways.

Infrastructure Problem 1: FlashAttention does not expose attention scores. Production inference runs on FlashAttention, which tiles the attention computation through SRAM (the GPU's small, fast on-chip memory) and never materializes the N×N score matrix in HBM (the GPU's main memory). That is what makes it fast — and it means the scores eviction methods want to observe are never written anywhere compression code can read them. This hits hardest where the method needs the full attention history. H2O-style cumulative scoring requires the attention scores of every decode step — the running sums cannot be updated without observing each step's scores as they are computed. The reference H2O implementation resolves the conflict by falling back to eager attention, materializing the full score matrix and abandoning FlashAttention outright.
Infrastructure Problem 2: Repeated token eviction cannot actually free GPU memory in production systems. Production serving systems like vLLM manage the KV cache with paged attention: GPU memory is divided into fixed-size physical blocks, each holding KV data for about 16 tokens, and a block can be freed only when it is completely empty. Repeated decode-time eviction scatters survivors across blocks (Figure 5). Evicting 14,400 of 16,000 tokens leaves 1,600 survivors spread over the roughly 1,000 blocks originally allocated; almost every block keeps at least one survivor, so the allocator reclaims almost nothing. R-KV [10] — the strongest prior evictor designed specifically for reasoning models — reports 90% memory savings, but those savings were measured with pre-allocated contiguous tensors, not in vLLM, and its own Appendix D acknowledges that integrating with paged attention "presents a non-trivial challenge and may require further investigation." Quest [9] sidesteps the block-reclamation problem by keeping the full cache and only selecting which pages each query reads, so KV memory still grows with context length.
Paged attention fragmentation after token eviction Before eviction: four paged blocks all full. After evicting most tokens, survivors scatter across all blocks. No block is empty, so none can be freed. Before eviction — 4 blocks, all full Block 1 Block 2 Block 3 Block 4 All 4 slots in each block occupied — 4 blocks × 4 tokens = 16 tokens After evicting most — survivors scattered Block 1 Block 2 Block 3 Block 4 Survivor token Ghost slot (evicted) Every block still has a survivor → zero blocks freed VRAM usage unchanged despite evicting most tokens evict most
Figure 5. After a decode-time eviction pass (H2O, R-KV style), one survivor token lands in each physical block. Since no block is completely empty, the vLLM block allocator cannot reclaim any of them — GPU memory usage is identical before and after eviction.

3. A System Solution to the Two Infrastructure Problems

The scoring function does not need attention scores at all.

The way out starts from a different question. Instead of "which tokens received high attention recently?", ask: "does the geometry of the model's learned representation space predict that a token will be important?" The distinction changes what information a method needs at runtime — and therefore what infrastructure it requires. TriAttention [1] is the method built on this question.

Solution to Problem 1: No Attention Scores Needed — Pre-RoPE Geometry

TriAttention's answer to Problem 1 is to never need attention scores in the first place. It decides which KV entries to keep from the geometry of the model's learned representation space — no attention score is ever observed at runtime, so the records FlashAttention refuses to write are records the method never asks for. Problem 1 is not worked around; it simply does not apply.

The mechanism rests on a stable geometric property of the model's learned Q/K vectors — Figure 6 walks through it step by step.

— playing —
Figure 6. The mechanism, animated step by step on real Q/K data (DeepSeek-R1, layer 10, head 1). The live caption above the progress bar narrates each step as it plays; the full loop runs about 73 seconds.

Solution to Problem 2: Forward-Packing Compaction

A score by itself frees nothing. In paged memory, a block can only be returned to the allocator once every slot in it is dead — eviction scores merely mark tokens, so the survivors must be physically consolidated until whole blocks actually empty out. TriAttention runs this consolidation roughly every 128 decoded tokens, and there are two ways to do it. The order-preserving repack (Figure 7) slides every survivor forward so the cache stays in original token order: no extra position bookkeeping, and the changes to an inference engine like vLLM stay minimal. The hole-filling variant (Figure 8) instead drops the newest survivors straight into the slots vacated by eviction: far less data movement per compaction, at the cost of a scrambled physical order that must then be tracked explicitly. Both variants live in the TriAttention codebase (the hole-filling one since v0.1.0) and remain maintained. The two figures below play the same round of decoding through each strategy.

Order-preserving repack: survivors slide forward, one whole block is freed Six memory blocks of four slots each. B1 to B5 hold the past context, tokens T0 through T19, every slot full; B6 receives the current round's newly generated tokens T20 to T23, standing in for about 128 decoded tokens. Scoring evicts T2, T9 and T13 from the history and T21 from the new round, leaving dashed empty frames. Every surviving token then slides forward to close the holes while keeping its original order, so B1 to B5 end up holding T0, T1, T3, T4, then T5 to T8, then T10, T11, T12, T14, then T15 to T18, then T19, T20, T22, T23 — and B6 ends up completely empty and is returned to the allocator as a freed block. Order-preserving repack — survivors slide forward, the free space drifts to the tail past context — every slot occupied current round T2 T9 T13 T21 T0 T1 T3 T4 T5 T6 T7 T8 T10 T11 T12 T14 T15 T16 T17 T18 T19 T20 T22 T23 freed ✓ B1 B2 B3 B4 B5 B6 history token (kept) new-round token evicted slot (ghost) freed block decoding — a new round (~128 tokens) fills B6 scoring — deciding which tokens to keep: 4 lose their slots repacking — every survivor slides forward in order; the holes drift to the tail B6 is empty — one whole block returned to the allocator
Figure 7. Order-preserving repack, played on one round of decoding (B6's four slots stand in for ~128 new tokens). After scoring, every surviving entry slides forward, block by block, so the cache stays dense and in original token order — follow the IDs — while the holes drift to the tail until B6 holds nothing at all. The move is a single gather–clone–scatter pass: every survivor is read before any is written, so overlapping source and destination ranges cannot corrupt each other. The emptied block goes straight back to the paged allocator.
Hole-filling: survivors stay put, the new round drops into the holes Same six blocks and the same evictions as the previous figure: B1 to B5 hold tokens T0 through T19 with every slot full, B6 receives the current round's new tokens T20 to T23, and scoring evicts T2, T9, T13 and T21. This time the history survivors never move: the new round's three survivors T20, T22 and T23 drop directly into the three holes in B1, B3 and B4, so B1 ends up reading T0, T1, T20, T3 — dense but out of original order. B6 again ends up empty and is returned to the allocator as a freed block. Hole-filling — survivors stay put, the new round drops into the holes past context — every slot occupied current round T2 T9 T13 T21 T0 T1 T3 T4 T5 T6 T7 T8 T10 T11 T12 T14 T15 T16 T17 T18 T19 T20 T22 T23 freed ✓ B1 B2 B3 B4 B5 B6 history token (never moves) new-round token evicted slot (ghost) freed block decoding — a new round (~128 tokens) fills B6 scoring — deciding which tokens to keep: 4 lose their slots hole-filling — only B6's survivors move, straight into the holes B6 is empty — same block freed with 3 copies instead of 18, order scrambled
Figure 8. Hole-filling on the exact same input as Figure 7. The history survivors in B1–B5 never move; only the new round's three survivors drop into the three holes that scoring opened. The same whole block is reclaimed for a fraction of the data movement — 3 copies instead of 18 — but the physical order comes out scrambled (T20 now sits between T1 and T3), so each entry's logical position must be tracked separately.

Results

The table below shows accuracy on competition math, where KV cache compression matters most. For AIME 2024 and AIME 2025 the KV budget is 2,048 tokens — one-sixteenth of what a full 32K-token trace would otherwise cache; for MATH500 it is 512. All methods run at the same budget.

Method AIME 2024 AIME 2025 MATH 500
Full Attention (no compression) 57.1% 40.8% 69.6%
SnapKV [6] 34.6% 20.0% 49.2%
R-KV [10] 25.4% 17.5% 46.4%
TriAttention [1] 42.1% 32.9% 56.0%

The accuracy gap is large. At budget 2,048, TriAttention nearly doubles R-KV's AIME 2025 accuracy (32.9% vs. 17.5%). At a slightly larger budget of 3,072 tokens, it matches the full-attention baseline (40.8%) while delivering 2.5× higher throughput (563 vs. 223 tokens/second) and 10.7× KV memory reduction. Code is available at github.com/WeianMao/triattention.

Accuracy-matched comparison on AIME 2025: TriAttention vs. full attention Two bar-chart panels comparing full attention and TriAttention on AIME 2025 with Qwen3-8B at identical 40.8 percent accuracy. Left panel, decoding throughput in tokens per second: full attention reaches 222.8, TriAttention reaches 563.5, annotated as 2.5 times faster. Right panel, relative KV cache memory: full attention is 100 percent, TriAttention is 9.3 percent, annotated as 10.7 times smaller. Full Attention TriAttention Decoding throughput (tokens/s) 222.8 563.5 2.5× faster KV cache memory (relative) 100% 9.3% 10.7× smaller
Figure 9. Accuracy-matched comparison on AIME 2025 (Qwen3-8B, 32K-token generation): at a KV budget of 3,072 tokens, TriAttention matches full attention’s 40.8% accuracy while decoding 2.5× faster (563.5 vs. 222.8 tokens/s) and holding 10.7× less KV cache memory.

4. KV Cache Infra in Video Generation

The memory pressure of long inference reappears in video generation — at larger scale. A video model's tokens are spatial patches rather than words, and an autoregressive generator caches KV entries for every frame produced so far, exactly the way an LLM caches past tokens; within seconds of 480p video, the cache outgrows the model’s own weights [11]. The video community has pursued the same two tracks as text — quantization and token eviction — and it has learned the lesson this blog keeps returning to: the algorithmic idea is half the work, and the demons live in the infrastructure details. Each of the three systems below earns its gains there.

On the quantization track, Quant VideoGen [11] pushes the cache down to 2 bits per element. The enabler is a property specific to video: adjacent frames are nearly identical, so the cache is full of near-duplicate tokens; grouping those tokens and storing each one’s small residual from its group average instead of full-precision K and V tensors makes the data quantization-friendly, and progressively refining and quantizing the residuals yields up to 7× memory compression — no retraining, and less than 4% latency overhead. LongLive 2.0 [13] takes the same track to production scale, quantizing both weights and KV cache to NVFP4 (4-bit) on a 5B-parameter generator for 1.84× throughput at negligible quality cost — and its detail worth dwelling on is the fused parallel dequantization kernel (Figure 10). Storing the cache in 4 bits means every block must be dequantized before each attention step, and the naive approach launches one GPU kernel per block. Dequantization, however, is embarrassingly parallel in principle — every 4-bit value can be decoded independently of all the others — and the per-block loop hides that parallelism behind a queue of small sequential launches; the fused kernel instead issues one launch whose threads map one-to-one onto every packed pair of FP4 values across the whole cache window, presenting the entire workload to the GPU at once. In practice, this keeps the overall quantization/dequantization overhead below 2%.

LongLive 2.0 pipeline: an NVFP4 KV cache window of chunks is dequantized in parallel on the model GPU, while latent chunks are asynchronously VAE-decoded to video on a second GPU.
Figure 10. LongLive 2.0’s two-GPU inference pipeline. The model on GPU 0 attends over a sliding window of NVFP4-quantized KV cache chunks — cutting cache memory by roughly 10 GB — with a fused parallel dequantization kernel expanding the quantized window before attention; finished latent chunks stream to GPU 1, where the VAE decodes them to video asynchronously, overlapping decoding with generation for the annotated 1.5× speedup.

On the eviction track, §3's system solution carries over directly: TriAttention's trigonometric scoring [1] has been applied to LongLive [12], a real-time video generator built on Wan2.1-T2V-1.3B. Each KV entry corresponds to spatial patches from one frame, and the scoring decides which frames to evict — no attention scores observed, nothing for the serving stack to fight — cutting the cache by 50% with negligible quality drop; the integration ships in the TriAttention repository.

Both domains are converging on the same insight: good compression is not primarily about finding the right scoring heuristic. It is about understanding why a signal has the structure it has — and exploiting that structure at the kernel level.

5. Closing Thoughts

From 2023 to 2025 the field's framing was: find the right heuristic for which tokens matter — heavy hitters, sinks plus recency, observation windows, redundancy. All reasonable hypotheses. But these methods run into two walls. Methods that select tokens by observed attention hit the first: FlashAttention never exposes the scores they need. Methods that evict repeatedly during decode hit the second: eviction frees no GPU memory when the allocator works in whole blocks and the survivors stay scattered. A method that cannot clear the second wall saves compute but not memory — the scarcer resource during long inference.

TriAttention clears both walls by changing the question: score tokens from the model's stable pre-RoPE geometry rather than from observed attention, and physically consolidate survivors into a dense prefix so that tail blocks are actually freed. The result is a method that reduces the physical memory footprint of the KV cache in production paged-attention deployment — and, because pre-RoPE geometry predicts a token's importance at any distance rather than through a ~25-query observation window, sustains accuracy at compression ratios where observation-window methods collapse.

One open question remains. TriAttention uses a uniform KV budget across heads, yet the per-head distance-preference curves it computes already classify heads by behavior — local, sink, range-specific. The natural next step is to allocate budget in proportion to a head's complexity. Whether the pre-RoPE insight generalizes further — to model interpretability, to dynamic sparse attention, to other modalities — is an open problem worth exploring.

References

  1. TriAttention: Efficient Long Reasoning with Trigonometric KV Compression. Mao W., Lin X., Huang W., Xie Y., Fu T., Zhuang B., Han S., Chen Y. ICML, 2026. github.com/WeianMao/triattention
  2. Efficient Streaming Language Models with Attention Sinks. Xiao G., Tian Y., Chen B., Han S., Lewis M. ICLR, 2024. arXiv:2309.17453
  3. H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. Zhang Z., Sheng Y., Zhou T., Chen T., Zheng L., Cai R., Song Z., Tian Y., Ré C., Barrett C., Wang Z., Chen B. NeurIPS, 2023. arXiv:2306.14048
  4. Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time. Liu Z., Desai A., Liao F., Wang W., Xie V., Xu Z., Kyrillidis A., Shrivastava A. NeurIPS, 2023. arXiv:2305.17118
  5. Transformers are Multi-State RNNs. Oren M., Hassid M., Yarden N., Adi Y., Schwartz R. EMNLP, 2024. arXiv:2401.06104 (TOVA)
  6. SnapKV: LLM Knows What You Are Looking for Before Generation. Li Y., Huang Y., Yang B., Venkitesh B., Locatelli A., Ye H., Cai T., Lewis P., Chen D. NeurIPS, 2024. arXiv:2404.14469
  7. PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling. Cai Z., Zhang Y., Gao B., Liu Y., Li Y., Liu T., Lu K., Xiong W., Dong Y., Hu J., Xiao W. arXiv preprint, 2024. arXiv:2406.02069
  8. Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference. Feng Y., Lv J., Cao Y., Xie X., Zhou S.K. NeurIPS, 2025. arXiv:2407.11550
  9. Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference. Tang J., Zhao Y., Zhu K., Xiao G., Kasikci B., Han S. ICML, 2024. arXiv:2406.10774
  10. R-KV: Redundancy-aware KV Cache Compression for Reasoning Models. Cai Z., Xiao W., Sun H., Luo C., Zhang Y., et al. NeurIPS, 2025. arXiv:2505.24133
  11. Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization. Xi H., Yang S., Zhao Y., Li M., Cai H., et al., Han S., Keutzer K. ICML, 2026. arXiv:2602.02958
  12. LongLive: Real-time Interactive Long Video Generation. Yang S., Huang W., Chu R., Xiao Y., Zhao Y., et al., Han S., Chen Y. ICLR, 2026. arXiv:2509.22622
  13. LongLive 2.0: An NVFP4 Parallel Infrastructure for Long Video Generation. Chen Y., Wang L., Huang W., Yang S., Zhang B., Xiao Y., Chu R., Mao W., Hu Q., Liu S., Zhao Y., Mao H., Chen Y.-C., Xie E., Qi X., Han S. arXiv preprint, 2026. arXiv:2605.18739