Long context isn't a capability. It's a standing charge.
One active user on Llama 3 70B at 128K context takes 40GB of GPU memory. That’s half an H100. Per token, the cache footprint runs 320KB in BF16. Every additional user you put on that GPU costs you another half-H100-worth of memory before a single forward pass has run.
This is part two of a three-article series on context. The first covered why the advertised window is bigger than the work-window. This one covers what it costs you per active user to keep that window open. The third covers what the harness around the model looks like in 2026 once you’ve accepted the cost.
The numbers the slides don’t show
The headline context window is sold in tokens. The thing that actually costs money is bytes of GPU memory. The two are different, and the conversion factor varies by more than 30x across models shipping today.
Put plainly: a 1M context window doesn’t cost the same on every model, and the spread is the story.
Take four production models at their advertised windows.
Llama 3 70B at 128K context costs 40GB per active user, half an H100. Per token, the key-value cache runs 320KB in BF16. The formula is the same one every transformer paper uses: 2 times the number of KV heads, times the head dimension, times the number of layers, times 2 bytes per element.
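The formula is small enough to check by hand. A minimal sketch, assuming Llama 3 70B's published shape (80 layers, 8 KV heads, head dimension 128) and taking 128K to mean 131,072 tokens; the function name is illustrative, not from any library.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache that one token occupies across all layers."""
    # The leading 2 covers the key tensor and the value tensor.
    return 2 * n_kv_heads * head_dim * n_layers * bytes_per_elem


# Assumed shape for Llama 3 70B: 80 layers, 8 KV heads, head dimension 128.
per_token = kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128)
per_user = per_token * 128 * 1024  # 128K tokens, taken as 131,072

print(per_token // 1024, "KiB per token")  # 320 KiB per token
print(per_user / 2**30, "GiB per user")    # 40.0 GiB per user
```

Swapping in another model's layer count, KV head count and head dimension gives that model's per-token figure, as long as it uses a plain per-token KV cache.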
DeepSeek V3.2 at 1M context costs 83.9GiB per active user. Per token, the cache runs about 87,000 bytes. That’s Multi-head Latent Attention doing some work, but it’s still roughly 87GB per million tokens, and the cache scales linearly with the window.
DeepSeek V4-Pro at 1M context costs 9.62GiB per active user. Per token, the cache runs about 9,800 bytes. CSA plus HCA plus SWA get the footprint down to roughly 10 percent of V3.2’s. Same vendor, same family, roughly one tenth the memory per token.
Qwen 3.6-35B-A3B at 128K context costs 2.5GB per active user. Per token, the cache runs about 20,480 bytes. Hybrid Gated DeltaNet with only 10 attention layers. The KV cache is mostly gone, replaced by a different mechanism that doesn’t pay the per-token tax.
The same advertised 128K window costs anywhere from 2.5GB to 40GB per active user, depending on which model you picked. Same input, same line on the marketing slide, 16x spread. The architecture choice is the cost.
Why the cache scales the way it does
The KV cache is the model’s working memory for everything you’ve already sent. Every token the model reads stays in cache so it can be attended to on every subsequent decode step. At 1K tokens, the cache is small enough to ignore. At 100K, it’s a budget item. At 1M, it’s the binding constraint.
The binding constraint shifts during inference. During the prefill phase, when the model reads your input, it’s compute-bound. The H100 is busy doing matrix multiplies and the cache is being written. During the decode phase, when the model is generating tokens one at a time, the bottleneck moves to memory bandwidth. The model has to read the entire cache to attend to every token it generates, and the cache is now 40GB or 80GB or whatever the model has accumulated.
The numbers from published studies put the cache read at between 68 percent and 94 percent of decode latency at 500K context. The GPU is mostly waiting on its own memory, not on the model. The bottleneck isn’t the model thinking. The bottleneck is the model looking at everything you sent it, on every token, no matter how far back the relevant material sits.
This is the thing that makes long context expensive even when the input tokens themselves are cheap. The cost is in the cache reads, and the cache reads scale linearly with the window. Double the window, double the per-decode-step memory traffic. Quadruple the window, quadruple it. There’s no engineering reason for the model to read tokens from the start of your prompt when it’s generating the last token, but that’s how attention works. It reads everything, every time.
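The linear scaling is easy to put a floor under. A rough sketch, assuming each decode step streams the whole cache from HBM exactly once and that the GPU sustains about 3.35 TB/s (the H100 SXM spec-sheet peak; real kernels land below it). It ignores weight reads and batching, and the function name is illustrative.

```python
HBM_BYTES_PER_SEC = 3.35e12  # assumed: H100 SXM peak HBM bandwidth


def min_decode_ms(cache_bytes: float,
                  bandwidth: float = HBM_BYTES_PER_SEC) -> float:
    """Lower bound on one decode step if the whole cache is read once."""
    return cache_bytes / bandwidth * 1000.0


BYTES_PER_TOKEN = 327_680  # the article's Llama 3 70B figure (320KB)

for tokens in (32_768, 65_536, 131_072):
    ms = min_decode_ms(tokens * BYTES_PER_TOKEN)
    print(f"{tokens:>7} tokens: at least {ms:.1f} ms per generated token")
```

Each doubling of the window doubles the floor, which is the whole point: the bound depends only on bytes held and bytes per second, not on how much of the cache is relevant to the next token.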
What the per-call cost actually looks like
The published per-call numbers, mid 2026, are all over the place.
DeepSeek V4 Pro at 256K context runs about $0.445 per call. GPT-5.5 at the same context runs about $1.28 per call. GPT-5.4 Pro at the same context runs about $7.68 per call. That’s the same window, same input length, three models from two vendors, a 17x spread on the dollar number.
The headline figure for this article sits in that spread: 200K tokens on Opus at $1.00 per call. The same task on a 4K RAG pipeline runs about $0.06. 17x is the per-call ratio between reading the full window and retrieving only what the answer needs.
The architectural reasons for the spread sit in the next section. The cost figure is the symptom. The architecture is the cause.
What the architecture choices actually do
The spread in cache footprint, more than 30x per token across the models above, comes from four architectural moves, and they’re worth knowing about up front because they explain why one 1M window can cost roughly a tenth of another.
Multi-Head Attention, the original, is the worst case: one KV head per query head. Llama 3 70B has 64 query heads, so full MHA would mean 64 KV heads per layer and about 2.5MB per token. MQA, from the Shazeer 2019 paper, cut that by sharing a single KV head across all query heads. GQA, the Ainslie 2023 paper (arXiv 2305.13245), generalised the trick to a small fixed number of shared heads. Llama 3 70B ships with GQA, eight KV heads per layer, and that is where its 320KB per token comes from.
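To see what head sharing buys, the same cache formula can be run with only the KV head count changed. A small sketch, assuming Llama 3 70B's shape (80 layers, 64 query heads, head dimension 128, BF16); the MHA and MQA rows are hypothetical variants of that shape, not shipped models.

```python
def kv_bytes_per_token(n_kv_heads: int, head_dim: int = 128,
                       n_layers: int = 80, bytes_per_elem: int = 2) -> int:
    """Per-token KV cache size; the leading 2 is keys plus values."""
    return 2 * n_kv_heads * head_dim * n_layers * bytes_per_elem


variants = {
    "MHA (64 KV heads, hypothetical)": 64,
    "GQA (8 KV heads, as shipped)": 8,
    "MQA (1 KV head, hypothetical)": 1,
}

for name, heads in variants.items():
    print(f"{name}: {kv_bytes_per_token(heads) // 1024} KiB per token")
# MHA (64 KV heads, hypothetical): 2560 KiB per token
# GQA (8 KV heads, as shipped): 320 KiB per token
# MQA (1 KV head, hypothetical): 40 KiB per token
```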
MLA, the DeepSeek-V2 paper (arXiv 2405.04434), compressed the keys and values into a single low-rank latent vector. The paper reports a 93.3 percent KV cache reduction relative to DeepSeek 67B. The mechanism is what’s behind DeepSeek V3.2’s 87,000 bytes per token at 1M. Without a 93.3 percent compression (inverting to roughly 15x the cache size), that number would be around 1.3MB per token.
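The inversion is a one-liner. A quick check, assuming the 93.3 percent figure applies unchanged to the article's 87,000 bytes per token, which is the article's extrapolation and not a number the paper reports.

```python
reduction = 0.933          # KV cache reduction reported in the DeepSeek-V2 paper
compressed_bytes = 87_000  # the article's per-token figure for DeepSeek V3.2

expansion = 1 / (1 - reduction)  # how many times larger the uncompressed cache is
uncompressed = compressed_bytes * expansion

print(f"{expansion:.1f}x, {uncompressed / 1e6:.2f} MB per token")
# 14.9x, 1.30 MB per token
```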
Sliding Window Attention, from Mistral 7B (arXiv 2310.06825), capped the cache by having each layer attend only to a window of W = 4096 tokens. No layer can read anything older than the window directly, so the cache stops growing. The trade-off is that older material survives only indirectly, as whatever earlier layers managed to carry forward. W is the hard bound on what the model can attend to directly, no matter how big the advertised window is.
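The cap is easiest to see as a ring buffer. A minimal sketch of a rolling cache in the spirit of the one the Mistral 7B paper describes, with position i stored in slot i mod W; the class and method names are illustrative, and plain integers stand in for key-value tensors.

```python
class RollingKVCache:
    """Fixed-size cache: position i lives in slot i % window."""

    def __init__(self, window: int):
        self.window = window
        self.slots = [None] * window
        self.seen = 0  # total tokens appended so far

    def append(self, kv) -> None:
        # Once the buffer is full, this overwrites the oldest entry.
        self.slots[self.seen % self.window] = kv
        self.seen += 1

    def visible(self) -> list:
        """Entries still attendable, oldest first."""
        start = max(0, self.seen - self.window)
        return [self.slots[pos % self.window] for pos in range(start, self.seen)]


cache = RollingKVCache(window=4)
for position in range(10):
    cache.append(position)  # stand-in for a (key, value) pair

print(cache.visible())   # [6, 7, 8, 9]
print(len(cache.slots))  # 4
```

Memory stays at W entries however long the sequence runs, and positions 0 through 5 are simply gone.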
Hybrid architectures, the Qwen 3.6 family being the cleanest example, replace most of the attention layers with a different mechanism. Gated DeltaNet and similar linear-attention and state-space variants don’t keep a per-token KV cache at all. They keep running state. For those layers, the cache size stops depending on token count and depends on the model’s hidden dimensions instead. Only the 10 remaining attention layers still pay per token. That’s the move that gets Qwen’s per-token cost down to 20,480 bytes.
A fifth move worth knowing about is native sparse attention at decode time. NSA, the DeepSeek paper from early 2025 (arXiv 2502.11089), does coarse-grained token compression plus fine-grained token selection plus sliding windows, all natively trainable, all hardware-aligned. The reported numbers match full attention quality at a fraction of the cache reads. MoBA, the Moonshot paper (arXiv 2502.13189), takes a different route: block-level MoE routing with parameterless top-k gating, production-deployed in Kimi’s long-context requests. Neither of these is a free lunch, but both are evidence that the architectural menu is widening, not narrowing, in 2026.
Five moves, five different cost profiles. Picking a model is picking an architecture. Picking an architecture is picking a cost.
Where the cache economics get interesting
The cache cost isn’t just a function of the model. It’s a function of the cache hit rate, the time the cache sits between reads, and what’s backing the storage.
Anthropic’s prompt caching breaks even at 1.4 reads per write on the 5-minute tier. Below 1.4, the cache costs more than it saves. Above it, every additional read adds to the saving. The 1-hour tier needs a 67 percent cache hit rate before it pays for itself. Most production agents don’t hit 67 percent on the 1-hour tier. They do on the 5-minute tier. Pick the tier the workload actually supports.
Tutti’s SSD-backed KV cache is roughly 100x cheaper per GB than DRAM-backed cache. The published numbers from their evaluation are 78.3 percent time-to-first-token reduction and 27 percent cost reduction versus LMCache-SSD. The trade-off is latency. SSD is slower than HBM. For prefill-heavy workloads, that trade-off makes sense. For decode-heavy workloads at 500K context, it doesn’t. Tutti is a fit for the right shape of workload, not a universal win.
The 90 percent compression cliff is the binding constraint on all of this. Compression ratios above 90 percent start killing retrieval accuracy. TurboQuant’s tbq3_0 in llama.cpp hits 81 percent KV memory reduction at FP16-equivalent perplexity. That’s below the 90 percent cliff and it works because they also preserve the attention pattern, not just the cache size. The lesson: KV cache compression is a mechanism design problem, not a number-tweaking problem. Past 90 percent, you’re trading accuracy for memory, and the trade gets worse the further past 90 percent you push.
None of this is free further up the stack either. Memory systems that build their own summarisation tiers on top of a model’s context window report paying 20 or more LLM calls in a single interaction just to keep those tiers current. The cache moves cut the GPU bill. The orchestration layer sitting above it can still spend the saving back.
What the kernels actually look like
The architecture choice only matters if the kernels can run it. The 2026 state of the art for KV cache attention has three names worth knowing, and the difference between them is the difference between a research paper and a production system.
FlashAttention 4, from the Dao-AILab group, is the current reference. It’s written in CuTeDSL, targets Hopper and Blackwell specifically, and uses async warp-specialised kernels to hit 1.3x over FlashAttention 3 on the H100. Like the rest of the FlashAttention line, it keeps attention memory at O(n) rather than the O(n squared) of the standard implementation. Install is one line, pip install flash-attn-4. The earlier generations set up the pieces: FlashAttention 2.5 added paged KV cache support that’s compatible with vLLM’s PagedAttention, and FlashAttention 3 hits 660 TFLOPS on the H800. The progression is the same pattern the rest of the field follows: each generation buys a real speedup, and the speedup is paid for by more careful memory access, not by smarter maths.
FlashMLA, from DeepSeek, is the production reference for MLA-style attention at decode time. The KV cache format is FP8: 656 bytes per token, made up of 512 bytes of quantised NoPE in float8_e4m3, 16 bytes of scale factors, and 128 bytes of RoPE in BF16. The sparse indices are page-aware. The decode loop is get_mla_metadata once, flash_mla_with_kvcache per layer. That’s the shape of the code that lets DeepSeek V3.2 ship 1M context at a cost most MHA models can’t match at 128K.
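The 656-byte figure is just three fields added up. A small sketch of the arithmetic, assuming 512 NoPE elements, 64 RoPE elements, and one float32 scale per group of 128 quantised values; those element counts are assumptions chosen to match the byte totals above, and the BF16 comparison is an illustration of mine, not a published number.

```python
NOPE_DIMS = 512    # assumed: quantised NoPE elements per token
ROPE_DIMS = 64     # assumed: RoPE elements per token
SCALE_GROUP = 128  # assumed: quantised values covered by one float32 scale

nope_bytes = NOPE_DIMS * 1                    # float8_e4m3 is 1 byte per element
scale_bytes = (NOPE_DIMS // SCALE_GROUP) * 4  # float32 is 4 bytes per scale
rope_bytes = ROPE_DIMS * 2                    # BF16 is 2 bytes per element

fp8_total = nope_bytes + scale_bytes + rope_bytes
bf16_total = (NOPE_DIMS + ROPE_DIMS) * 2      # same entry held entirely in BF16

print(nope_bytes, scale_bytes, rope_bytes, fp8_total)  # 512 16 128 656
print(bf16_total)                                      # 1152
```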
vLLM prefix caching is the production reference for cache reuse across requests. The mechanism is block-level hashing: a hash covers parent hash, block tokens, and extra hashes for LoRA IDs, multimodal input hashes, and cache salts. The block size is fixed and small enough to keep collision risk manageable. xxhash is faster than sha256, but it isn’t a cryptographic hash, so it carries collision risk in multi-tenant environments. There’s an LRU free queue. None of it is glamorous, all of it is load-bearing.
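The chaining is what makes a block hash stand for the whole prefix behind it. A simplified sketch of the idea using only the standard library; this is not vLLM's code, and the function name, the block size of 16, and the `extra` tuple standing in for LoRA IDs and cache salts are all assumptions for illustration.

```python
import hashlib
from typing import Optional, Sequence

BLOCK_SIZE = 16  # assumed block size for this sketch


def block_hashes(tokens: Sequence[int], extra: tuple = (),
                 block_size: int = BLOCK_SIZE) -> list[str]:
    """Hash each full block together with its parent block's hash."""
    hashes: list[str] = []
    parent: Optional[str] = None
    # Only full blocks are hashed here; a partial tail block is skipped.
    full = len(tokens) - len(tokens) % block_size
    for start in range(0, full, block_size):
        block = tuple(tokens[start:start + block_size])
        payload = repr((parent, block, extra)).encode()
        parent = hashlib.sha256(payload).hexdigest()
        hashes.append(parent)
    return hashes


shared_prefix = list(range(32))  # two full blocks both requests have in common
a = block_hashes(shared_prefix + [100] * 16)
b = block_hashes(shared_prefix + [200] * 16)

print(a[:2] == b[:2], a[2] == b[2])  # True False
```

Two requests that share a prefix produce identical hashes for the shared blocks and diverge from the first block that differs. Passing a different `extra` value changes every hash, which is how a salt keeps tenants from sharing cache entries.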
The three names together explain why long context is a viable product category at all in 2026. Without the kernels, the architectural moves are theoretical. With them, the cost ratios land where the architecture said they would.
What this means if you’re shipping
The architectural argument is settled. RAG is dominant past 500K tokens, and it’s dominant because the cost ratio is dominant. 17x isn’t a quality preference, it’s a budget line. Past 500K, RAG with a reranker is cheaper than long context by enough that the long-context choice only makes sense if RAG can’t get you the answer.
The corollary is less obvious. The 17x number compounds. If you’re running 1000 active users on a 200K window on Opus, you’re spending $1,000 per turn of your entire user base. The same 1000 users on a 4K RAG pipeline cost $60 per turn. The 17x is per turn, not per session. Across a session of 20 turns, a single user costs you $20 on long context versus $1.20 on RAG. The annualised number on a million-turn-per-day production system is the difference between a line item and a balance-sheet event.
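The arithmetic behind those figures, worked in integer cents to avoid floating-point noise. The per-call prices are the article's; the helper name is illustrative.

```python
LONG_CONTEXT_CENTS = 100  # 200K-token call, the article's figure
RAG_CENTS = 6             # 4K RAG call, the article's figure


def dollars(cents_per_call: int, users: int, turns: int = 1) -> float:
    """Total spend in dollars for every user taking the given number of turns."""
    return cents_per_call * users * turns / 100


# One turn across a 1000-user base.
print(dollars(LONG_CONTEXT_CENTS, users=1000), dollars(RAG_CENTS, users=1000))
# 1000.0 60.0

# One user across a 20-turn session.
print(dollars(LONG_CONTEXT_CENTS, users=1, turns=20),
      dollars(RAG_CENTS, users=1, turns=20))
# 20.0 1.2

print(round(LONG_CONTEXT_CENTS / RAG_CENTS, 1))  # 16.7
```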
The 4K figure on the RAG side isn’t arbitrary. It’s what a two-step retrieval interface gets you. Ten full matches at 1,500 tokens each is 15,000 tokens injected into context whether the agent uses them or not. Identifiers and short previews for the same ten matches run about 450 tokens, and only the records the agent actually needs get fetched in full after that. Across 20 recall steps in a session, the difference compounds to around 200,000 tokens saved. That’s not a model choice or an architecture choice. It’s a retrieval-interface decision, and it costs nothing to adopt.
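The numbers line up if the agent opens about three records in full per recall step. That fetch count is an assumption, picked because it lands near the 200,000-token figure; the text does not state it.

```python
MATCHES = 10
FULL_RECORD_TOKENS = 1_500
PREVIEW_TOKENS_TOTAL = 450  # identifiers plus previews for all ten matches
FETCHED_PER_STEP = 3        # assumption: records the agent opens in full
STEPS = 20

eager = MATCHES * FULL_RECORD_TOKENS  # every match injected in full
two_step = PREVIEW_TOKENS_TOTAL + FETCHED_PER_STEP * FULL_RECORD_TOKENS
saved = (eager - two_step) * STEPS

print(eager, two_step, saved)  # 15000 4950 201000
```

If the agent fetched nothing in full, the saving would rise to 291,000 tokens; if it fetched all ten every time, the two-step interface would cost 450 tokens more per step than the eager one.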
The decision isn’t “use long context” or “use RAG”. It’s “use long context for the queries where the answer can’t be retrieved, and use RAG for the queries where it can”. Most production systems are somewhere around 80 percent RAG-friendly, 20 percent genuinely long-context. The 80 percent should not be paying for the 20 percent’s window.
The shame’s in not knowing which trade-off you’re making. The 17x is real. The architecture moves are real. The compression cliff is real. The decision is which slice of your traffic pays the long-context tax, and whether that slice is the one that genuinely needs it.
The third article in this series covers what the harness around the model looks like when the cost is this asymmetric and the model isn’t going to get cheaper. The short version: the work is moving into the prompt, into the tools, into the subagent structure, and into the cache policy. The model is the constant. Everything around it is what’s changing.
The unit of engineering isn’t the model. It’s the harness. That’s where the rest of the series goes from here.
Share & discuss
The X Article covers the same ground in a different form. The site version is the canonical one; the X version exists for the conversation in the replies.