llm memory research
You are not running out of tokens. You are wasting them. Here is the difference.
I have been down the rabbit hole on context budget management for a while now. The assumption I kept running into, before I started this research, was that longer context windows had more or less retired the problem. If you can fit 200,000 tokens in a single call, the argument goes, you stop worrying about what goes in.
That assumption is wrong. It is not even close to right.
The finding that keeps surfacing across the 19 systems I looked at is that bigger windows intensify the budget problem rather than dissolve it. A 200K-token window does not pay equal attention to all 200K tokens. Performance degrades long before the window fills. The degradation is non-uniform: material in the middle of a long context is reliably attended to less than material at the edges. And the agent’s actual task occupies a fixed slice of the window regardless of how large the window is, which means everything else is overhead competing for the same attention budget.
The systems that handle this well have converged on six mechanisms. None of them are exotic. Several are embarrassingly simple. But the systems that skip them pay for it.
The six mechanisms
1. Compaction passes
The most visible mechanism. You take a long conversation or a large memory segment, summarise it, and replace the original with the summary. MemoryOS does this at the segment level: its segment summariser fires when a conversation segment grows past a threshold, collapsing it to a compact representation before it can crowd out working context. The Karpathy-pattern wikis (purpose.md, overview.md) do a version of this at the knowledge level: the wiki is the compacted form of everything the agent has learned about a topic, maintained across sessions.
The trade-off is information loss. Compaction is a lossy operation by definition. The summary captures what the summariser judged relevant at the time of compaction. If the agent later needs a detail that was not judged relevant, it is gone. This is not a reason to avoid compaction, but it is a reason not to treat it as the only mechanism.
There is a second cost that is easy to miss. Compaction is not free at runtime. MemoryOS can pay 20 or more LLM calls in a single interaction to maintain its segment summaries. For systems with high interaction frequency, that is a real operational cost.
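Both costs show up in even a toy version of threshold-triggered compaction. The sketch below is not MemoryOS code: Segment, add_turn, the word-count tokenizer and the default thresholds are all assumptions made for illustration, and summarise stands in for an LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Segment:
    turns: List[str] = field(default_factory=list)  # verbatim recent turns
    summary: str = ""                               # lossy stand-in for older turns
    llm_calls: int = 0                              # summariser calls paid so far


def approx_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: count whitespace-separated words.
    return len(text.split())


def add_turn(
    seg: Segment,
    turn: str,
    summarise: Callable[[List[str]], str],
    max_tokens: int = 50,   # hypothetical threshold
    keep_recent: int = 2,   # hypothetical number of turns kept verbatim
) -> None:
    """Append a turn; if the segment exceeds max_tokens, compact the older turns."""
    seg.turns.append(turn)
    total = sum(approx_tokens(t) for t in seg.turns)
    if total <= max_tokens or len(seg.turns) <= keep_recent:
        return
    split = len(seg.turns) - keep_recent
    old, seg.turns = seg.turns[:split], seg.turns[split:]
    prior = [seg.summary] if seg.summary else []
    # The originals in `old` are discarded here: this is the lossy step.
    seg.summary = summarise(prior + old)
    seg.llm_calls += 1  # every compaction is one more model call on the hot path
```

Once the segment is over the threshold, every new turn triggers another summariser call, which is how a busy conversation ends up paying for compaction many times in a single interaction.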
2. Result-preview truncation
Rather than returning full memory content on every retrieval, return a short preview and let the agent decide whether to fetch the full record. supermemory exposes snippet-length controls that let callers tune how much text comes back per result. mem9 goes further: its source-turn decoration is governed by three environment variables (MEM9_SOURCE_TURN_MIN_SCORE, MEM9_SOURCE_TURN_PER_MEMORY_LIMIT, MEM9_SOURCE_TURN_TOTAL_LIMIT) that give operators precise control over how many source turns appear and at what minimum relevance score.
The trade-off is an extra tool call. If the agent needs the full content, it has to ask for it explicitly. For most retrieval patterns this is the right trade-off: the agent gets enough signal to decide whether the record is relevant before paying the token cost of reading it in full.
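To make the three knobs concrete, here is a minimal sketch of how caps like these could be applied. The variable names come from mem9, but everything else is an assumption: the semantics are inferred from the names alone, the default values are invented, and select_source_turns is a hypothetical helper rather than anything in mem9's codebase.

```python
import os
from typing import Dict, List, Mapping, Tuple

SourceTurn = Tuple[str, str, float]  # (memory_id, text, relevance score)


def select_source_turns(
    turns: List[SourceTurn],
    env: Mapping[str, str] = os.environ,
) -> List[SourceTurn]:
    """Keep the best-scoring turns, subject to three operator-set caps."""
    # Defaults below are invented for this sketch, not mem9's real defaults.
    min_score = float(env.get("MEM9_SOURCE_TURN_MIN_SCORE", "0.5"))
    per_memory = int(env.get("MEM9_SOURCE_TURN_PER_MEMORY_LIMIT", "2"))
    total = int(env.get("MEM9_SOURCE_TURN_TOTAL_LIMIT", "5"))

    kept: List[SourceTurn] = []
    per_memory_count: Dict[str, int] = {}
    for turn in sorted(turns, key=lambda t: t[2], reverse=True):
        memory_id, _text, score = turn
        if score < min_score or len(kept) >= total:
            break  # sorted by score, so nothing further can qualify
        if per_memory_count.get(memory_id, 0) >= per_memory:
            continue  # this memory has already contributed its share
        per_memory_count[memory_id] = per_memory_count.get(memory_id, 0) + 1
        kept.append(turn)
    return kept
```

The point of exposing the caps as configuration is that an operator can shrink retrieval volume without touching code or prompts.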
3. Two-step retrieval
A specific and important variant of preview truncation. Search returns identifiers and short previews. A separate GetByID call fetches the full record when needed. mem9’s MemoryRepo interface is built around this pattern: search and fetch are distinct operations with distinct token footprints.
The numbers make the case plainly. Ten matches at 1,500 tokens each is 15,000 tokens injected into context whether the agent uses them or not. Two-step retrieval returns 10 identifiers and short previews at roughly 450 tokens total, then fetches only the records the agent actually needs. Across 20 recall steps in a session, one-step injection costs 300,000 tokens. If the agent ends up fetching about three full records per step, two-step retrieval costs roughly 99,000, so the difference compounds to around 200,000 tokens saved.
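The arithmetic is easy to check. In the sketch below, session_cost is a hypothetical helper, and fetched_per_step (how many full records the agent pulls after seeing the previews) is an assumption: roughly three fetches per step is what makes the savings land near 200,000, while zero fetches gives the upper bound.

```python
def session_cost(steps, matches, full_tokens, preview_total, fetched_per_step):
    """Return (one_step_tokens, two_step_tokens, tokens_saved) for a session."""
    # One-step: every match is injected in full on every recall step.
    one_step = steps * matches * full_tokens
    # Two-step: previews on every step, plus only the records actually fetched.
    two_step = steps * (preview_total + fetched_per_step * full_tokens)
    return one_step, two_step, one_step - two_step


print(session_cost(20, 10, 1500, 450, 0))  # (300000, 9000, 291000)
print(session_cost(20, 10, 1500, 450, 3))  # (300000, 99000, 201000)
```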
This is the cheapest discipline you can adopt. It requires no architectural change to the memory store, no additional LLM calls, and no information loss. It is a retrieval interface decision.
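As an interface, the pattern is small. What follows is a sketch under stated assumptions, not mem9's MemoryRepo: the class name InMemoryRepo, the method names search and get_by_id, the Preview type and the 60-character snippet length are all hypothetical, and substring matching stands in for whatever ranking a real store uses.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Preview:
    record_id: str
    snippet: str  # short excerpt, cheap to put in context


class InMemoryRepo:
    def __init__(self, records: Dict[str, str], snippet_chars: int = 60) -> None:
        self._records = records
        self._snippet_chars = snippet_chars

    def search(self, query: str, limit: int = 10) -> List[Preview]:
        """Step one: return identifiers and truncated previews only."""
        needle = query.lower()
        hits = [
            Preview(rid, text[: self._snippet_chars])
            for rid, text in self._records.items()
            if needle in text.lower()
        ]
        return hits[:limit]

    def get_by_id(self, record_id: str) -> str:
        """Step two: pay the full token cost only for a record the agent chose."""
        return self._records[record_id]
```

The stored record is never modified, which is why this mechanism costs no information: the preview is a view onto the record, not a replacement for it.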
4. Decompose-then-recall
Rather than sending the full user query to the retrieval layer, decompose it into sub-queries first. SimpleMem’s intent-aware retrieval planner breaks incoming queries into atomic retrieval intents before hitting the memory store. GitNexus does something similar with its query tool decomposition: complex queries are split into targeted sub-queries, each of which retrieves a focused slice of the memory graph.
The benefit is precision. A decomposed query retrieves less irrelevant material, which means less noise in context. The trade-off is latency: decomposition adds a planning step before retrieval begins. For interactive agents this matters. For batch or background agents it usually does not.
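A toy version shows the shape of the idea. This is not SimpleMem's planner or GitNexus's query tool: decompose here is a naive regex splitter standing in for an LLM-based planning step, and recall, search and per_intent_limit are hypothetical names.

```python
import re
from typing import Callable, List


def decompose(query: str) -> List[str]:
    # Naive stand-in for a planner: split on "and", ";" and "?".
    parts = re.split(r"\s+and\s+|[;?]", query)
    return [p.strip() for p in parts if p.strip()]


def recall(
    query: str,
    search: Callable[[str], List[str]],
    per_intent_limit: int = 2,
) -> List[str]:
    """Run one narrow search per sub-query, capping and de-duplicating hits."""
    seen = set()
    results: List[str] = []
    for intent in decompose(query):
        for hit in search(intent)[:per_intent_limit]:
            if hit not in seen:
                seen.add(hit)
                results.append(hit)
    return results


print(decompose("Which database did we pick and who owns the deploy script?"))
# ['Which database did we pick', 'who owns the deploy script']
```

Each sub-query gets its own small result cap, so one broad intent cannot flood the context at the expense of the others.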
5. Tiered storage as budget filter
If you have already built a tiered memory architecture (the subject of last week’s piece), you get budget filtering as a side effect. supermemory’s three-tier model means that hot, frequently-accessed material lives in a tier that returns compact, high-signal results. Cold material is in a tier that is not queried by default. Hindsight’s observation tier works the same way: raw observations are not injected into context directly; they are promoted to higher tiers before they become retrieval candidates.
The trade-off is recall completeness. Material that has not been promoted may be relevant but will not surface in a standard retrieval pass. This is the same trade-off as compaction, but the failure mode is different: instead of losing information through summarisation, you lose it through demotion.
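The filtering effect and its recall cost can both be seen in a few lines. This is an illustrative sketch, not supermemory's or Hindsight's implementation: the tier names, the Memory type and the default tier set are assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Sequence, Tuple


class Tier(Enum):
    HOT = "hot"
    WARM = "warm"
    COLD = "cold"


@dataclass(frozen=True)
class Memory:
    text: str
    tier: Tier


# Hypothetical default: cold material is skipped unless explicitly requested.
DEFAULT_TIERS: Tuple[Tier, ...] = (Tier.HOT, Tier.WARM)


def search(
    memories: Sequence[Memory],
    query: str,
    tiers: Sequence[Tier] = DEFAULT_TIERS,
) -> List[Memory]:
    """Substring search restricted to the given tiers."""
    needle = query.lower()
    return [m for m in memories if m.tier in tiers and needle in m.text.lower()]
```

Passing tiers=tuple(Tier) is the escape hatch: the cold record still exists and can be reached, but only when the caller deliberately pays for it.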
6. Self-guiding tool responses
The least discussed mechanism, and one of the more interesting ones. Rather than leaving the agent to decide what to do after a tool call, the tool response itself includes a hint about what to do next. GitNexus appends a ---\n**Next:** block to tool responses, suggesting follow-up actions. mem9 decorates source turns with structured metadata that guides the agent’s next retrieval step.
The effect is that the agent spends fewer tokens on planning between tool calls. The tool response carries enough structure to make the next step obvious. The trade-off is prompt-engineering effort: writing good self-guiding responses requires knowing in advance what the agent is likely to need next, which is not always possible.
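A minimal sketch of the pattern, using the same separator-plus-Next layout described above. The helper names, the hint wording and the get_by_id tool the hint refers to are hypothetical, not GitNexus's or mem9's actual output.

```python
from typing import List


def with_next_hint(body: str, hint: str) -> str:
    """Append a trailing hint block telling the agent what to do next."""
    return f"{body.rstrip()}\n\n---\n**Next:** {hint}"


def search_tool(query: str, hit_ids: List[str]) -> str:
    """Format a search result so the follow-up action is spelled out."""
    if not hit_ids:
        return with_next_hint(
            f"No matches for {query!r}.",
            "Try a broader query or a different keyword.",
        )
    body = "\n".join(f"- {hid}" for hid in hit_ids)
    return with_next_hint(
        body,
        f"Call get_by_id on {hit_ids[0]} to read the top match in full.",
    )


print(search_tool("auth", ["rec_7", "rec_9"]))
```

The hint costs a dozen or so tokens per response. The bet is that it saves more than that in planning the agent would otherwise do on its own.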
The Tolaria limit case
Tolaria is worth looking at separately because it represents the logical endpoint of budget discipline taken to its extreme. ADR-0009 documents the decision to remove embeddings entirely from the system. Tolaria uses substring-only search. No vector index, no semantic retrieval, no embedding calls.
The reasoning is direct: the cheapest token is the one you never retrieve in the first place. Embedding-based retrieval returns semantically similar results, which means it returns results the agent did not explicitly ask for. Some of those results are useful. Many are not. All of them cost tokens.
Tolaria’s position is that the cost of irrelevant-but-similar results, compounded across a session, exceeds the benefit of semantic recall for its use case. Whether that trade-off holds for your system depends on what your system is for. For systems where queries are precise and structured (code navigation, document lookup by identifier), Tolaria’s position is defensible. For systems where queries are vague and exploratory, removing embeddings breaks recall in ways that are hard to recover from.
The value of the Tolaria case is not that you should copy it. It is that it makes the cost of semantic retrieval visible in a way that most systems do not.
The case against compaction-only systems
Several of the 19 systems rely on compaction as their primary or only budget mechanism. The failure modes are worth naming.
The first is that summarisation loses details that were not judged relevant at compaction time but become relevant later. This is not a hypothetical: it is the standard failure mode of any lossy compression scheme applied to information whose future relevance is unknown.
The second is that compaction is a hot-path cost. MemoryOS paying 20+ LLM calls per interaction is not unusual for compaction-heavy systems. At scale, that cost is not negligible.
The third, and most subtle, is that compaction without an escape hatch is slow forgetting. If the only way to reduce context size is to summarise, and summaries are lossy, then the system is continuously discarding information with no way to recover it. Two-step retrieval, tiered storage, and result-preview truncation all preserve the original record. Compaction does not.
None of this means compaction is wrong. It means compaction alone is not enough.
Recency weighting and the persistent queue
Two mechanisms that do not fit neatly into the six categories above are worth noting.
graymatter uses reciprocal rank fusion (RRF) with recency at half-weight. This is not a budget mechanism in the strict sense, but it functions as one: by down-weighting older material in retrieval rankings, it reduces the probability that stale, low-signal records crowd out recent, high-signal ones. The effect is soft tiering through ranking weights rather than explicit tier promotion.
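One plausible reading of "recency at half-weight" is standard weighted reciprocal rank fusion, where each ranking contributes weight / (k + rank) to a record's score. The sketch below assumes that reading; graymatter's exact formula is not given here, and the function name, the constant k = 60 and the tie-break by identifier are illustrative choices.

```python
from collections import defaultdict
from typing import Dict, List


def weighted_rrf(
    rankings: Dict[str, List[str]],
    weights: Dict[str, float],
    k: int = 60,
) -> List[str]:
    """Fuse several ranked lists; each contributes weight / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for name, ranking in rankings.items():
        weight = weights.get(name, 1.0)
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first; ties broken by identifier for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


rankings = {
    "relevance": ["a", "b", "c"],  # best semantic match first
    "recency": ["c", "b", "a"],    # newest first
}
print(weighted_rrf(rankings, {"relevance": 1.0, "recency": 0.5}))
# ['a', 'b', 'c']
```

At half-weight, recency nudges the ordering without overriding relevance: the most relevant record still wins here even though it is the oldest.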
llm-wiki’s 540-line ingest queue state machine takes a different approach. The queue serialises ingest operations and applies a four-signal relevance ranker before anything enters the memory store. Budget control happens at write time rather than read time. Material that does not clear the relevance threshold is not stored, which means it cannot be retrieved and cannot consume context. This is indirect budget control, but it is durable: the savings compound across every future session.
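Write-time gating reduces to a short loop. The sketch below is not llm-wiki's state machine, and the four signals are invented placeholders (the actual four signals are not named here); it only shows the shape of averaging several signals and refusing to store anything under a threshold.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence

Signal = Callable[[str], float]  # each signal returns a score in [0, 1]

# Placeholder signals, invented for illustration only.
SIGNALS: Sequence[Signal] = (
    lambda s: 1.0 if len(s.split()) >= 4 else 0.0,           # long enough to matter
    lambda s: 1.0 if any(c.isdigit() for c in s) else 0.0,   # contains a concrete figure
    lambda s: 0.0 if s.lower().startswith(("ok", "thanks")) else 1.0,  # not chit-chat
    lambda s: 1.0 if "decided" in s.lower() else 0.0,        # records a decision
)


def drain(
    queue: Deque[str],
    store: List[str],
    signals: Sequence[Signal] = SIGNALS,
    threshold: float = 0.5,  # hypothetical cut-off
) -> int:
    """Process the queue one item at a time; return how many items were rejected."""
    rejected = 0
    while queue:
        item = queue.popleft()
        score = sum(sig(item) for sig in signals) / len(signals)
        if score >= threshold:
            store.append(item)
        else:
            rejected += 1  # never stored, so it can never consume context later
    return rejected
```

A rejected item is unrecoverable, so the threshold deserves the same caution as a compaction prompt: it decides in advance what the system will never be able to recall.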
What the well-designed systems have in common
Looking across the 19 systems, the ones that handle context budgets well share a few properties.
They treat retrieval as a two-step operation rather than a one-step injection. They return previews before full records. They preserve original records rather than replacing them with summaries. They give operators control over retrieval volume through explicit parameters rather than hardcoded defaults. And they think about budget at write time as well as read time.
The ones that handle it poorly tend to rely on a single mechanism, usually compaction, and treat the context window as a buffer to be filled rather than a resource to be managed.
The closing position from the research is simple. Bigger windows demand more discipline, not less. Not because filling them is wrong in principle, but because filling them with the wrong material costs more than leaving the room empty.
Next week: the shift from memory-as-injection to memory-as-tools, how the 19 systems handle the boundary between what gets pushed into context automatically and what the agent has to ask for explicitly.