context

You don't have a context problem. You have a harness problem.

08 July 2026 · ~12 min read · by Steven Batchelor-Manning

Contents

What context engineering is
How big the effect actually is
What the field is shipping
What still doesn't work
What this looks like at the source-code level
Where this leaves you

You don't have a context problem. You have a harness problem. - hero image.

The harness effect on real workloads runs 5 to 40 percentage points. The model effect between frontier models on the same benchmark runs 5 to 15 percentage points. The scaffold matters more than what’s inside it.

The cleanest single data point on this: same model, different harness, sixteen points apart. Cursor on Claude Opus 4.6 scores 93. Claude Code on Claude Opus 4.6 scores 77. The harness, the system prompt, the tool surface, the way the context is sliced before it reaches the model, all of that is in that sixteen points. None of it is fixed by buying a bigger window.

This is part three of a three-article series on context. The first covered why the advertised window is bigger than the work-window. The second covered what it costs per active user to keep that window open. This one covers what the field has settled on as the engineering answer to both problems: context engineering, the discipline of shaping what reaches the model so the model can do the work.

What context engineering is

The term has a vendor source. Anthropic’s late-2025 article on context engineering for agents put it into mainstream engineering discourse, and their compact_20260112 API made it operational. The mechanism is simple enough to fit on a slide: the conversation approaches a configurable threshold (default 150K), the API summarises older context, the summarised version replaces the original. Claude Code auto-compacts at 50 percent context usage. The same Workaccount2 HN workaround from June 2025, the one where users manually pasted their own summaries when the conversation got long, is now a product feature.

Put plainly: the harness is now a first-class engineering surface, and the work most teams used to do in the model is moving into the prompt, the tools, the subagent structure, and the cache policy.

Three things follow from that shift. The first is that the field has a vocabulary for the discipline, and the vocabulary is “context engineering” rather than “prompt engineering” or “RAG”. The second is that the harness is where most of the engineering effort is going, not into the model. The third is that the discipline has a measurable effect, and the measurement is bigger than the inter-model gap on most tasks.

How big the effect actually is

The harness effect, measured across published benchmarks, runs 5 to 40 percentage points from scaffold alone. The model effect between frontier models on the same benchmark runs 5 to 15 percentage points. Both ranges are real. Both come from the same benchmark families. Neither is a guess.

The Cursor 93 versus Claude Code 77 result from Article 1 is the cleanest single data point. Same model, same tasks, sixteen points apart. The harness, the system prompt, the tool surface, the way the context is sliced before it reaches the model, all of that is in that sixteen points. Anthropic reports similar patterns internally on their own compact-enabled tooling: same model, different scaffold, ten-point swings are routine, twenty-point swings are achievable, forty-point swings require deliberate redesign.

LOCA-bench (arXiv 2602.07962, ICML 2026) reports a 10 percentage point improvement from context engineering alone, which lands inside the harness-effect range and above most inter-model gaps on the same benchmark. The benchmark isolates the eight mitigation strategies this article covers and measures each in isolation. Compaction plus tool-result clearing plus memory tools produces the largest combined gains among the tested strategies, with programmatic tool calling as the strongest single-strategy finding. The pattern LOCA-bench reveals is that the harness isn’t a wrapper around the model. The harness is the system, and the model is one component of it.

The practitioner consensus is that harness quality matters as much as the model you put in it. Cursor employs people whose full-time job is to rewrite system prompts and tool descriptions every time a new model ships. That’s the operation in 2026. The system prompt and tool surface are not fixed code. They’re living engineering artefacts that get re-tuned for every model release. The same dynamic shows up across the agent framework ecosystem: Claude Code, Cursor, Codex CLI, Windsurf, Cline. The teams shipping production agents treat the harness as a versioned, tested, deployed artefact. The model is the dependency.

The implication is uncomfortable. Most of what gets attributed to the model is the model plus the harness plus the system prompt plus the way the developer chose to fill the context. None of those is fixed by buying a bigger window. All of them are within the developer’s control.

Harness effect versus model effect: two parallel vertical ranges, the wider one on the left, the narrower one on the right.

What the field is shipping

Eight strategies have converged across the production systems in 2026. They are not mutually exclusive. Most systems ship three or four of them at once.

Context compaction sits at the top of the list. Anthropic’s compact_20260112 is the canonical reference, but the pattern is now standard across Claude Code, OpenAI’s ChatGPT auto-summarise, Cursor’s long-session mode, and most agent frameworks that run for more than a few hundred turns. The discipline is to summarise earlier context before it crowds out later context, and to do it at a configurable threshold.

Compaction isn’t free at the point it runs. Memory-tiering systems that lean on summarisation as their main budget mechanism report paying 20 or more LLM calls in a single interaction to keep their summaries current. The harness that adopts compaction has to budget for the compaction, not just for the context it saves.

Structured prompt templates, the second strategy, lock the context shape. The system prompt is structured. The tool descriptions are structured. The user messages get a wrapper. The variability of natural-language prompting gets bounded by templates that the system itself can parse. LOCA-bench (arXiv 2602.07962, ICML 2026) treats structured prompting as a separate experimental condition because the effect is large enough to be worth isolating.

Just-in-time retrieval, the third strategy, is RAG. RAG with a reranker. RAG past 500K tokens is mathematically dominant, by enough that the long-context choice only makes sense for the slice of traffic RAG can’t service. The strategy isn’t new. What’s new is treating it as one strategy among several rather than as a competitor to long context. They’re complementary, and the 17x cost ratio from Article 2 is the reason RAG handles the 80 percent while long context handles the 20 percent.

The cheapest version of this strategy doesn’t need a reranker at all. Returning identifiers and short previews instead of full records, then fetching only what the agent asks for, turns 15,000 tokens of unread matches into roughly 450 tokens of signal, with the full record paid for only when it’s actually used. Across a long session that difference is the gap between compacting every few turns and not needing to compact at all.

Tool-result clearing, the fourth strategy, is the discipline of removing tool outputs from context after the model has processed them. Claude Code implements this for shell and file-read tool outputs. LOCA-bench’s context-reset strategy tests it explicitly. The mechanism is simple, the effect is noticeable: tool outputs are often kilobytes of structured data the model needed once but doesn’t need to keep in the active context.

Programmatic tool calling, the fifth strategy, is the strongest finding in the LOCA-bench paper. Writing code that orchestrates tool calls, rather than calling tools directly, substantially reduces the intermediate context the model has to carry, and it improves orchestration accuracy at the same time. The model writes a Python script, runs the script, reads the script’s output. The tool calls don’t appear in the active context. The intermediate state stays in the script. The reduction in context pressure is the largest single lever LOCA-bench measured.

Subagents with context firewalls, the sixth strategy, are how Claude Code’s Agent Teams handle multi-step work. Each subagent operates in its own context window with its own worktree. The parent agent sees the subagent’s summary, not its full trace. If one context window rots, you use many small ones instead. The pattern is operationally similar to processes with isolated address spaces. The benefit is that context pressure is per-subagent, not per-conversation.

Summarisation mid-conversation, the seventh strategy, sits between compaction and tool-result clearing. The discipline is to summarise a tool’s output or a sub-task’s result before it goes back into the parent context. It’s compaction applied at finer granularity, and it’s most useful in agent loops where each step’s output would otherwise compound.

Attention steering and calibration, the eighth strategy, is the most architectural. Differential Transformer, the Microsoft and Tsinghua paper (arXiv 2410.05258), reframes attention as the difference of two softmax maps, which cancels noise and promotes sparse patterns. The paper reports 30 percent less hallucination in QA and robustness to order permutation in in-context learning. The mechanism directly addresses the attention noise that causes context rot. The takeaway for the harness engineer is that some of the rot can be steered away at the model level, not just managed at the prompt level.

The eighth strategy is sometimes called “recite then reason”, and it isn’t quite the same thing. Recite-then-reason asks the model to restate the relevant material from its context before answering. It’s a prompting discipline, not an architectural one, and it’s worth keeping in the toolkit but not as the headline strategy.

Eight strategies, eight different points on the cost-versus-complexity curve. The practitioner who ships all eight is shipping a system, not a wrapper. The practitioner who ships three is shipping a wrapper. The difference is the engineering budget, not the model.

Eight mitigation strategies on the cost-versus-complexity curve, arranged as a grid of cubes, the strongest one marked in lime.

What still doesn’t work

The mitigations are real. The gap they close is real. The gap they don’t close is also real, and it’s worth naming.

NIAH benchmarks still systematically overestimate usable context compared to MRCR v2. The HELMET paper (ICLR 2025) confirms NIAH does not predict downstream task performance: GPT-5.5 can score 96 percent on NIAH-2 at 1M tokens and still fall well short on multi-needle MRCR at 128K. Vendors quote the high NIAH number. Production agents hit the MRCR floor. The two numbers live on the same vendor slide.

The intra-family gap is also still large. Claude Sonnet 4.5 to Claude Opus 4.6 was an 18.5 percent to 76 percent jump on MRCR v2 at 1M, the widest intra-family gap of 2026. Opus 4.7, the newest of the family, sits at 32.2 percent on the same benchmark at 1M, well above Sonnet 4.5 but well below Opus 4.6. The lesson: the model gets better, the gap to the leader sometimes narrows and sometimes widens, but the gap to the work-window the model can actually use remains. The harness has to keep doing the engineering work the model can’t.

Put plainly: the harness is closing the gap the model can’t. The work-window is smaller than the advertised window. The harness makes the work-window larger. It does not make the work-window equal to the advertised window. Nothing does.

What this looks like at the source-code level

The eight strategies aren’t abstractions. They show up as code in the systems shipping in 2026, and the mechanisms are worth knowing about up front because they explain what production looks like.

vLLM prefix caching, the same mechanism from Article 2, is the harness-level implementation of cache reuse across requests. Block-level hashing, fixed-size blocks, LRU free queue. The harness decides which blocks survive across requests. The model just sees the cache hits as part of its context.

Anthropic’s compact_20260112 is the harness-level implementation of compaction. The conversation approaches 150K (default), the API summarises older context, the summarised version replaces the original. The harness decides when to compact. The model just sees the compacted context as the new state.

Subagents with isolated worktrees are the harness-level implementation of context firewalls. Each subagent runs in its own context window. The parent agent sees the summary. The harness decides the partitioning. The model just sees its own context.

Self-guiding tool responses are the harness-level implementation of the cheapest strategy in this list. Every tool response closes with a hint about what to call next, a follow-up identifier, a suggested next tool, a warning about what the result implies. The agent learns the API by using it rather than by reading documentation for it. The harness decides what the hint says. The model just follows the thread it’s handed.

Four mechanisms, four different harness-level decisions, four different places where the engineering effort goes. The model is the constant. The harness is where the system actually lives.

Where this leaves you

If you’ve been filling 200K-token windows and wondering why the agent loses the thread, the answer is almost never that you needed 400K. It’s that the harness isn’t doing the engineering work the model can’t. The advertised number is the size of the door. The number that does work is the size of the room the model can see when it’s actually doing the work. The harness is what shapes that room.

The eight strategies above are what the field has converged on. Pick three to start. Compaction, tool-result clearing, structured prompts. Those three shippable in a weekend. The rest follow when the first three stop closing the gap.

The shame’s in not knowing which trade-off you’re making. The harness matters more than the model. The cost is asymmetric. The work-window is smaller than the slide says. The discipline is context engineering, and the engineering is in the harness, not in the model.

The field has known this for two years and said it loudly for one. The shift from “buy a bigger window” to “build a better harness” is the working consensus of 2026. The model is the constant. The work is in everything around it.

Differential attention as two parallel flows of small cubes converging into a single output stream, with a single violet cube at the convergence point marking the difference operation.

Tagged

#context #llm #memory #architecture #engineering

Share & discuss

Share on X Discuss on X

The X Article covers the same ground in a different form. The site version is the canonical one; the X version exists for the conversation in the replies.