You don't have a context problem. You have a harness problem.
Harness effect is 5 to 40 percentage points. Model effect is 5 to 15. Same model, different scaffold, ten-point swings are routine. The work is in the harness, not the model.
Exploration, building, and the things worth shipping. The work that earns the build: skip the boilerplate, aim at meaningful impact.
Lenses borrowed from running high-performance engineering teams, applied to context, inference, memory. When the open-source gap is worth closing, the work goes on this site, in the posts, and on Hugging Face when it's a quant.
Steven Batchelor-Manning.
Long context isn't a capability, it's a standing charge. The same task can cost seventeen times more when the context is read in full than when it is retrieved.
The advertised context window is 2 to 8 times larger than the effective context for multi-hop work, and 50 to 100 times larger than the effective context for reasoning. The number on the slide is the size of the door. The number that does work is the size of the room.
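To put rough numbers on the door and the room, here is a minimal sketch of the arithmetic. It assumes a hypothetical 1,000,000-token advertised window (that figure and the helper name `effective_window` are illustrative, not measurements) and simply divides by the ratios quoted above.

```python
def effective_window(advertised_tokens: int, low_ratio: int, high_ratio: int) -> tuple[int, int]:
    """Return the (smallest, largest) effective context, in tokens.

    The advertised window is divided by the largest ratio to get the
    pessimistic bound and by the smallest ratio to get the optimistic one.
    """
    return advertised_tokens // high_ratio, advertised_tokens // low_ratio


# Hypothetical advertised window, chosen only to make the arithmetic concrete.
ADVERTISED = 1_000_000

# Ratios taken from the text: 2 to 8 times for multi-hop, 50 to 100 times for reasoning.
multi_hop = effective_window(ADVERTISED, 2, 8)
reasoning = effective_window(ADVERTISED, 50, 100)

print(f"door (advertised):  {ADVERTISED:,} tokens")
print(f"room for multi-hop: {multi_hop[0]:,} to {multi_hop[1]:,} tokens")
print(f"room for reasoning: {reasoning[0]:,} to {reasoning[1]:,} tokens")
```

Under those assumptions, a million-token door opens onto a room of 125,000 to 500,000 tokens for multi-hop work, and 10,000 to 20,000 tokens for reasoning.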