<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>s-batman</title><description>Personal posts of Steven Batchelor-Manning - context engineering, LLM inference, memory systems.</description><link>https://blog.sbatman.com/</link><language>en-gb</language><copyright>© 2026 Steven Batchelor-Manning</copyright><managingEditor>steven@sbatman.com (Steven Batchelor-Manning)</managingEditor><webMaster>steven@sbatman.com (Steven Batchelor-Manning)</webMaster><lastBuildDate>Tue, 30 Jun 2026 19:09:14 GMT</lastBuildDate><ttl>60</ttl><item><title>You bought a 1M context window. You got 50x less than you paid for.</title><link>https://blog.sbatman.com/posts/2026-06-14-context-weight/</link><guid isPermaLink="true">https://blog.sbatman.com/posts/2026-06-14-context-weight/</guid><description>The advertised context window is 2 to 8 times larger than the effective context for multi-hop work, and 50 to 100 times larger than the effective context for reasoning. The number on the slide is the size of the door. The number that does work is the size of the room.</description><pubDate>Sun, 14 Jun 2026 00:00:00 GMT</pubDate><content:encoded>&lt;img src=&quot;/posts/context-weight/hero.png&quot; alt=&quot;You bought a 1M context window. You got 50x less than you paid for. - hero image.&quot; class=&quot;hero-banner-post&quot; /&gt;

Every vendor&apos;s headline context number is a lie. Not a small lie. A 50 to 100 times lie, depending on what the model is being asked to do.

The architecture accepts the input. The model does not read it.

The headline is uncomfortable. Vendors quote window sizes that are 2 to 8 times larger than what the model can actually use for multi-hop retrieval, and 50 to 100 times larger than what it can use for reasoning. Both numbers come from the same body of benchmark work. Neither is a guess. The same vendor&apos;s newer model on the same multi-needle benchmark can score four times higher than the model it replaces, at the same advertised window. A 10M-token window can lose to a 2M-token window on comprehension of a single book.

This article opens a three-part run on context: weight, cost, and management. Subsequent articles cover what the work-window actually is, what it costs you per active user, and how the systems being shipped in 2026 organise themselves around the gap.

## What advertised context actually means

Every frontier vendor publishes a context window number. Gemini 3 Pro: 1M. Claude Opus 4.6: 1M. GPT-5: 400K. Llama 4 Scout: 10M, the largest of any production model. These numbers are real in one narrow sense. The architecture accepts that many tokens as input. The model does not break when you hand it a prompt that long.

What the architecture accepts and what the model can use are two different quantities. The first is the size of the door. The second is how much of what&apos;s in the room the model can actually see when it&apos;s asked to do work.

The way to find the second number is to test the model on tasks that require using material from across the window, then watch where the score falls off. That&apos;s what the recent wave of long-context benchmarks is doing. MRCR v2 puts eight needles in the haystack and asks the model to recall them in order. NoLiMa asks the model to reason across passages where the keyword overlap has been deliberately stripped out, so retrieval by similarity can&apos;t carry it. HELMET tests downstream task performance at 128K. Fiction.LiveBench gives the model full books and asks comprehension questions that only work if the model tracked what was in the middle.

Each of these is a different lens. None of them is the vendor&apos;s needle-in-a-haystack test, and that&apos;s the point.

## What the numbers actually look like

The per-model picture, as of mid 2026, is uneven in a way that should embarrass the field.

Anthropic&apos;s MRCR v2 8-needle test at 1M tokens shows Sonnet 4.5 at 18.5 percent and Opus 4.6 at 76 percent. Same vendor, same benchmark, same advertised window. The newer generation is over four times better at the task the window was sold to do. If the window were the thing that mattered, those two numbers would be close. They are not.

Llama 4 Scout advertises 10M tokens and scores 15.6 percent on Fiction.LiveBench at 128K. Gemini 2.5 Pro on the same test scores 90.6 percent. Scout has 80 times the advertised window of older Gemini generations and a fraction of their effective context on the harder tests. The ratio of advertised to effective context, on this benchmark, is the worst of any model shipping in 2026.

The ofox.ai benchmark set, which is the most cited practitioner-facing comparison right now, shows the same spread. Gemini 3.1 Pro Deep Think hits 99 percent on NIAH-2 single-needle at 1M. Most of the other models cluster much lower on the harder tests at the same length. The single-needle number is what vendors put in slides. The harder tests are what production agents hit when the user pastes in a 400-page document and asks a question about page 312.

The summarising claim, drawn from the same source material: advertised context windows are typically 2 to 8 times the effective context for multi-hop work, and 50 to 100 times the effective context for reasoning tasks. Both ends of that range are real. Both come from the same benchmark families.

## Why the framing matters more than the number

Different vendors describe the same problem in different ways, and the framing they pick tells you how seriously they&apos;re taking it.

Anthropic uses the phrase context rot. Their September 2025 article on effective context engineering for agents put the term into mainstream engineering discourse. The framing is front-footed. The vendor is naming a problem they say their newer models handle better, and pointing to the difference between advertised and effective as the gap they&apos;re closing. Sonnet 4.5 to Opus 4.6 is the proof point.

DeepMind prefers effective context. Same underlying phenomenon, more neutral language. They publish effective-length numbers on specific tests rather than claiming the window is fully usable.

OpenAI leans on needle-in-a-haystack. NIAH is the most generous of the long-context tests. It puts a single isolated fact in a haystack and asks the model to recall it. The model doesn&apos;t have to use information across the window, only from one position. Vendor benchmark numbers that look like 96 percent recall at 1M are usually NIAH-2 single-needle. They don&apos;t predict how the same model will do on a comprehension question that spans the document.

Meta advertises a 10M window on Scout and provides no specific effective-length numbers. The marketing is the message.

The asymmetry is worth holding. When the vendor is naming the gap, the gap is being worked on. When the vendor is showing only the most generous benchmark, the gap is being hidden.

## What the test results actually say

MRCR v2 multi-needle is the test most often cited as the honest one. It asks the model to recall eight pieces of information from across the window and reproduce them in order. The order requirement is what kills naive retrieval. Even a model that finds every needle can fail the test if it can&apos;t recover the sequence.

NoLiMa is the reasoning test. It strips literal keyword overlap from the question and from the supporting passages, so the model has to do semantic inference rather than pattern-matching. At 64K context, NoLiMa scores are noticeably lower than at 4K, even on the best models. The drop is the gap between retrieval and reasoning. It&apos;s the gap between finding the right passage and being able to use it once found.

HELMET tests downstream task performance at 128K. RAG, in-context learning, re-ranking, summarisation. The scores on HELMET are uniformly lower than the vendor headline numbers. The drop is the same shape as the drop on NoLiMa. More tokens, less useful per token.

Fiction.LiveBench is the test that&apos;s hardest to argue with. The model gets a real book. The questions are about events and relationships in the book. There&apos;s no clever prompting that fixes a model that lost track of what happened on page 40 by the time it gets to page 200.

Llama 4 Scout at 15.6 percent on Fiction.LiveBench at 128K is the data point that anchors the whole conversation. A 10M-token window doesn&apos;t help if the model can&apos;t answer questions about a book at one percent of that length.

The clearest single comparison across these tests is GPT-5.5 on NIAH-2 single-needle at 1M versus GPT-5.5 on MRCR v2 multi-needle at 128K. The first number is 96 percent. The second falls well short. Same model, same vendor, same marketing page. The single-needle number is what fills the slide. The multi-needle number is what fills the production incident log.

## What the Chroma study actually showed

The empirical work on the gap that matters most is the Chroma Research context rot study, published July 14 2025. It tested 18 models from four vendors on four experiment types. The headline is that there is no cliff. Degradation is monotonic from the shortest contexts onward. The rot is continuous, not a step.

The four experiments matter because they isolate different variables. The first looked at needle-question similarity and found that lower cosine similarity between the embedded question and the embedded needle predicts a faster degradation rate. The second looked at distractors and found they lower aggregate accuracy by about one percentage point across all models and haystacks combined. That number is small, and it surprised people who assumed distractors were the problem. The third looked at needle-haystack similarity and concluded it is not the controlling variable. The fourth looked at haystack structure and found that shuffling the haystack destroys performance more than adding distractor needles. The structure of the surrounding text matters more than the count of irrelevant chunks.

Put plainly: the model is not bad at ignoring irrelevant text. It&apos;s bad at using relevant text once there&apos;s enough of it. The failure is on the use side, not the filter side.

## Where the work actually sits

The result that most changes how I think about agent architecture is a side-finding from the same body of work. When the same model is run through different harnesses, the gap between harnesses is bigger than the gap between models on most tasks.

| Harness | Model | Score | Delta |
|---|---|---:|---:|
| Cursor | Claude Opus 4.6 | 93% | — |
| Claude Code | Claude Opus 4.6 | 77% | −16pp |

Cursor on Claude Opus 4.6 scores 93 percent. Claude Code on Claude Opus 4.6 scores 77 percent. Same model, same tasks, sixteen points apart. The harness, the system prompt, the tool surface, the way the context is sliced before it reaches the model, all of that is in that sixteen points.

The implication is uncomfortable. Most of what gets attributed to the model is the model plus the harness plus the system prompt plus the way the developer chose to fill the context. None of those is fixed by buying a bigger window. All of them are within the developer&apos;s control.

The pattern across all this is consistent enough to state plainly. The advertised number is the size of the door. The number that does work is the size of the room the model can see when it&apos;s actually doing the work. The two are different. The difference is not small. The difference is not closing on its own. Buying a bigger window does not close the gap. Three things do: building a better harness, slicing context more deliberately, and choosing what to put in front of the model based on the task.

Once the gap is real, three things follow for anyone building an agent in 2026.

The first is that the choice of window is downstream of the choice of test. A team that picks a model based on the NIAH-2 single-needle number is going to ship a system that breaks on the multi-needle and reasoning tests the model can&apos;t pass. The benchmark that gets cited in the procurement meeting is the benchmark that should be the most suspect, not the most reassuring.

The second is that the cost of context is not the cost of the input. It&apos;s the cost of every retrieval step, every turn, every rerank call, multiplied by whatever fraction of the window the model can actually attend to. Doubling the window doesn&apos;t double the useful work. It dilutes it.

The third is that the way the context is sliced before it reaches the model is part of the model, in every practical sense. The Cursor 93 percent versus Claude Code 77 percent isn&apos;t an edge case. It&apos;s the central case. Two teams, same model, sixteen points apart, all in the harness and the system prompt and the context curation. That&apos;s where the engineering goes from here.

That&apos;s the argument this series opens with. The rest of the run is about what each of those levers looks like in practice, what it costs, and how the 19 systems I&apos;ve been studying approach it.

If you&apos;ve been filling 200K-token windows and wondering why the agent loses the thread, the answer is almost never that you needed 400K. The shame&apos;s in still treating the advertised number as the engineering number after the benchmark families have made clear it isn&apos;t.</content:encoded><category>context</category><category>llm</category><category>memory</category><category>benchmarks</category><category>engineering</category><author>steven@sbatman.com (Steven Batchelor-Manning)</author></item><item><title>Your memory system does not need to decide what the agent sees. The agent does.</title><link>https://blog.sbatman.com/posts/2026-06-02-llm-memory-research-09/</link><guid isPermaLink="true">https://blog.sbatman.com/posts/2026-06-02-llm-memory-research-09/</guid><description>Nineteen agent-memory systems quietly reversed their biggest design choice: they stopped injecting context and gave the agent tools instead. Here is why.</description><pubDate>Tue, 02 Jun 2026 00:00:00 GMT</pubDate><content:encoded>&lt;img src=&quot;/posts/llm-memory-research/week-09/hero.png&quot; alt=&quot;Your memory system does not need to decide what the agent sees. The agent does. - hero image.&quot; class=&quot;hero-banner-post&quot; /&gt;


I&apos;ve been down the rabbit hole on how memory reaches the model for a while now. The assumption I kept running into, before this research, was that the hard part of agent memory was retrieval quality. Get the right chunks, the thinking goes, and the rest is plumbing.

That assumption is half right. Retrieval quality matters. But the mechanism that delivers those chunks to the model matters more, and the field has quietly reversed its position on it without most people noticing. The mature systems have all converged on the same shape: the agent is given memory tools, and the agent decides when and how to use them. The middleware is no longer a predictor of what the agent needs. It&apos;s a service the agent calls.

This is not a small change. It happened almost without comment. And if you&apos;re still building injection-style memory in 2026, you&apos;re working harder than you need to.

---

## What injection looks like

Early RAG was injective. The pattern was simple enough to fit on a slide:

1. Embed every document in the corpus.
2. On every user turn, embed the query.
3. Run a top-K nearest-neighbour search.
4. Concatenate the K results into the system prompt.
5. Send the resulting blob to the model.

The model never asks for context. The middleware decides what&apos;s relevant. The model sees the result as a fait accompli, prepended above the user&apos;s actual question. This is automatic context injection. It was the dominant shape of RAG from roughly 2023 to 2024 and it&apos;s still what most teams mean when they say &quot;we added RAG&quot; without further qualification.
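
The five steps above can be sketched in a few lines. This is a toy illustration rather than any particular system: the bag-of-words `embed` function stands in for a real embedding model, and `CORPUS`, `top_k`, and `build_prompt` are hypothetical names chosen for the example.

```python
import math
from collections import Counter

# Hypothetical corpus; a real system would hold embedded document chunks.
CORPUS = [
    "auth refactor design doc: move sessions to signed tokens",
    "standup notes: auth refactor has been deprioritised",
    "incident postmortem: cache eviction caused the outage",
]

def embed(text):
    # Stand-in for an embedding model: a bag-of-words count vector.
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return Counter(cleaned.split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Step 1: embed every document once.
INDEX = [(doc, embed(doc)) for doc in CORPUS]

def top_k(query, k):
    q = embed(query)  # step 2: embed the query
    ranked = sorted(INDEX, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]  # step 3: nearest neighbours

def build_prompt(query, k=2):
    chunks = top_k(query, k)
    context = "\n".join("- " + c for c in chunks)  # step 4: concatenate
    return "Context:\n" + context + "\n\nUser: " + query  # step 5: this blob is sent

print(build_prompt("Where are we on the auth refactor?"))
```

Note that `top_k("what time is it?", 3)` still returns three chunks. Nothing in the pipeline can decline to retrieve, which is the property the rest of this piece argues against.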

It works, up to a point. The point where it stops working is the point where the agent needs to do something the middleware didn&apos;t predict.

Feel the difference concretely. Suppose the user has a memory store containing notes about a deployment incident from the prior week, a long-standing preference for terse responses, a half-finished design document about an authentication refactor, and a transcript of yesterday&apos;s standup mentioning that the auth work has been deprioritised. The user asks: &quot;Where are we on the auth refactor?&quot;

Under injection, the middleware embeds the query, searches, and concatenates the top-K chunks. The model gets back a chunk from the incident postmortem about an auth-related rollback, a chunk from the design doc, a chunk from the standup mentioning deprioritisation, and five other chunks that happened to share the word &quot;authentication&quot; with varying degrees of relevance. The model reads all eight. It answers. The user sees the answer. Nobody sees the eight chunks.

Under tools, the agent reads the question, decides it needs to check memory, calls a search tool with a query it composed itself, gets back identifiers and previews, reads the previews, decides which records are relevant, and fetches only those. The trace shows every step. The agent retrieved two records, not eight. It chose them. The cost reflects the choice.

The difference isn&apos;t retrieval quality. The same chunks exist in both traces. The difference is agency: who decided what the model sees.

---

## Four failure modes of injection

### 1. The irrelevant-chunk tax

Injection pays retrieval cost on every turn whether or not memory was relevant. If the user asks &quot;what time is it?&quot;, the middleware still embeds the query, still searches the store, still concatenates K chunks into the prompt. The model still processes them. None of that work was necessary. On a system doing thousands of agent turns per day, the waste is real and it compounds.

### 2. The wrong-K problem

Top-K retrieval returns the K chunks most similar to the query embedding. But similarity to the query isn&apos;t the same as relevance to the task. The middleware can&apos;t tell the difference because it doesn&apos;t understand the task. It understands the query string. The model could tell the difference, but by the time the model sees the chunks, they&apos;re already in the prompt. The model can ignore them, but it can&apos;t un-retrieve them. The attention cost is already paid.

### 3. The follow-up block

Injection is a one-shot operation. The middleware injects once per turn. If the model reads the injected chunks and realises it needs something else (a different memory item, a related document, a prior conversation that wasn&apos;t in the top-K), it has no way to get it. The retrieval layer has already fired and closed. The agent is stuck with what it was given.

This is the failure mode that hurts most in practice. The agent has the memory. It just can&apos;t get to it.

### 4. The opaque trace

When memory is injected, the conversation trace doesn&apos;t show what the agent looked at. It shows what the middleware decided to give the agent. If the agent produces a wrong answer because it was given the wrong chunks, debugging means reconstructing what the middleware retrieved and why. The trace is opaque to the very person who needs it most: the developer trying to fix the system.

---

## What tools look like

The tool-based pattern inverts the relationship. The agent is given named operations that read and sometimes write the memory store. The agent decides when to call them. The middleware becomes a service, not a gatekeeper.

Of the 19 systems I went through, the mature ones have all converged on this shape. Supermemory, Graymatter, OpenContext, Tolaria, second-brain, MemoryOS, GitNexus, and mem9 all expose memory as a tool surface rather than as automatic injection. The systems that do something closer to injection are the ones the field treats as the prior state of the art, not the current one.

The tool surfaces vary in size and shape, and the variation is itself instructive.

### The small end: Graymatter

Graymatter&apos;s MCP server exposes five tools. Three of them are the core: `Remember`, `Recall`, and `memory_reflect`. The last of these is the most expressive: it lets the agent update an existing fact, forget an existing fact, or link a fact to a knowledge-graph entity. The agent maintains its own memory mid-session rather than the memory layer maintaining itself.

The smallness is the pitch. The README is explicit: &quot;the small surface is its whole pitch.&quot; An agent author can hold the entire memory API in their head and the LLM can hold the entire tool description in its prompt. Anything more elaborate becomes another thing to teach.

### The mid-band: Supermemory and OpenContext

Supermemory&apos;s MCP server exposes four tools plus two resources and one prompt. The `memory` tool handles saves and deletes. The `recall` tool handles search. The aggressive tool description is worth noting: &quot;DO NOT USE ANY OTHER MEMORY TOOL ONLY USE THIS ONE.&quot; That&apos;s prompt-engineering the tool description itself, a hack, but a working one. Tool descriptions are themselves prompts, and the same care that goes into system prompts should go into tool descriptions.

OpenContext registers nine tools split into three groups: read, write, and metadata. The write group is deliberately incomplete. The MCP server can register an empty file but not write its body. The agent edits the file directly using its existing file-editing tool. This is the clearest example in the 19 systems of splitting the read and write paths intentionally, letting the memory layer own discovery and resolution while letting the agent&apos;s own tooling own mutation.

OpenContext also encodes cost governance in the tool surface itself. There&apos;s no `oc_index_build` MCP tool because building the index calls a paid embedding API. Indexing is strictly CLI-driven. The skill text the agent reads on install reinforces this: &quot;do NOT run it unless the user explicitly approves.&quot; Policy in the tool surface, not in out-of-band documentation.

### The narrow-by-design end: Tolaria

Tolaria&apos;s position is worth quoting: &quot;The agent has full shell access. These MCP tools provide Tolaria-specific capabilities that native tools can&apos;t replace.&quot; Six tools, all vault-aware reads and UI-steering actions. The agent owns the write path through its native filesystem tools. Tolaria owns the vault-aware read path and the UI surface. The asymmetry is intentional and it&apos;s what keeps the tool surface comprehensible to the LLM.

### The radical end: second-brain

The most extreme point on the spectrum. The agent isn&apos;t given pre-baked recall verbs. It&apos;s given the database: a read-only SELECT and PRAGMA tool over the SQLite memory store, with table-level scoping enforced at the SQLite C-API authorizer hook. The agent writes SQL. If it needs a different join, a different filter, a different projection, it just writes a different query.

second-brain also ships pre-baked recall tools for common cases: `hybrid_search`, `lexical_search`, `semantic_search`. These sit alongside the raw SQL tool. The agent can use the convenience verbs when they suffice and drop down to SQL when they don&apos;t. The tool surface doesn&apos;t need to be one or the other. It can offer both, and let the agent choose.
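
A read-only SQL tool with table-level scoping can be approximated with the authorizer hook that the Python standard library exposes through `sqlite3.Connection.set_authorizer`. This is a minimal sketch under assumptions, not the second-brain implementation: the table names, `ALLOWED_TABLES`, `open_store`, and `agent_query` are all hypothetical, and PRAGMA handling is left out for brevity.

```python
import sqlite3

ALLOWED_TABLES = {"memories"}  # hypothetical table-level scope

def authorizer(action, arg1, arg2, db_name, source):
    # Allow the SELECT statement itself.
    if action == sqlite3.SQLITE_SELECT:
        return sqlite3.SQLITE_OK
    # Allow column reads only on in-scope tables (arg1 is the table name).
    if action == sqlite3.SQLITE_READ and arg1 in ALLOWED_TABLES:
        return sqlite3.SQLITE_OK
    # Everything else (writes, DDL, out-of-scope tables) is refused.
    return sqlite3.SQLITE_DENY

def open_store():
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE memories (id INTEGER PRIMARY KEY, text TEXT)")
    con.execute("CREATE TABLE secrets (id INTEGER PRIMARY KEY, token TEXT)")
    con.execute("INSERT INTO memories (text) VALUES ('auth refactor deprioritised')")
    con.execute("INSERT INTO secrets (token) VALUES ('hunter2')")
    con.commit()
    con.set_authorizer(authorizer)  # from here on, the connection is scoped
    return con

def agent_query(con, sql):
    # The tool the agent calls: returns rows, or a refusal message.
    try:
        return con.execute(sql).fetchall()
    except sqlite3.DatabaseError as exc:
        return "refused: " + str(exc)

con = open_store()
print(agent_query(con, "SELECT text FROM memories"))
print(agent_query(con, "SELECT token FROM secrets"))
print(agent_query(con, "DELETE FROM memories"))
```

The check runs when SQLite compiles the statement, so the agent can compose any join or projection it likes over the permitted tables, while writes and out-of-scope reads never execute.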

### MemoryOS: three tools, hidden hierarchy

MemoryOS wraps its three-tier hierarchical store behind three tools: `add_memory`, `retrieve_memory`, and `get_user_profile`. The agent doesn&apos;t need to know about the short-term, mid-term, and long-term tiers. It calls `retrieve_memory(query)` and the system returns the best match across whichever tier owns the data. The complexity of the hierarchy stays inside the memory engine. The tool surface stays small.

The internal model can be as sophisticated as it needs to be (heat-gated promotion, dialogue-chain reconstruction, parallel two-stage retrieval) without any of that sophistication leaking into the agent&apos;s tool descriptions. If the engine wants to reorganise its tiers behind the scenes, the agent doesn&apos;t have to know.
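
The shape of that facade is easy to show. In the sketch below only the three tool names come from MemoryOS. The class name, the tier names, the word-overlap scoring, and the overflow rule are invented for illustration and are far simpler than the real engine.

```python
class TieredMemory:
    """Toy three-tier store; the tier logic here is invented for illustration."""

    def __init__(self):
        self._tiers = {"short": [], "mid": [], "long": []}
        self._profile = {}

    # --- the only three operations the agent sees ---
    def add_memory(self, text):
        self._tiers["short"].append(text)
        self._rebalance()

    def retrieve_memory(self, query):
        words = set(query.lower().split())
        candidates = [m for tier in self._tiers.values() for m in tier]
        # Score by shared words across every tier; a real engine does far more.
        return max(
            candidates,
            key=lambda m: len(words.intersection(m.lower().split())),
            default=None,
        )

    def get_user_profile(self):
        return dict(self._profile)

    # --- internal detail the agent never learns about ---
    def _rebalance(self, short_capacity=2):
        short = self._tiers["short"]
        overflow, kept = short[:-short_capacity], short[-short_capacity:]
        self._tiers["mid"].extend(overflow)  # oldest items move down a tier
        self._tiers["short"] = kept

memory = TieredMemory()
for note in ["user prefers terse responses", "auth refactor deprioritised", "deploy incident last week"]:
    memory.add_memory(note)
print(memory.retrieve_memory("terse responses please"))
```

The first note has already moved to the middle tier by the time it is retrieved, and the caller cannot tell.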

### mem9: same surface, many wrappers

mem9 publishes a REST API and surfaces it through plugins for multiple agent frameworks. The plugin shapes are all over the place: the Claude Code plugin is bash hooks that curl the REST API, and the Codex plugin is Node hooks with client-side conversation parsing. The underlying agent surface, though, is the same five or so operations: `store`, `search`, `get`, `update`, `remove`, plus an `ingest` for whole-conversation handoff.

The same tool surface drives very different agent integrations because the surface itself is small enough to wrap many ways. The Claude Code shell hooks are 80 lines because there isn&apos;t much to wrap. Memory logic stays in the server. Plugins stay thin.

---

## The inversion: oh-my-kiro

One system inverts the pattern in a way that&apos;s worth understanding. oh-my-kiro doesn&apos;t give the agent memory tools. Instead, it interposes on the agent&apos;s existing tool calls. When the agent calls a file-editing tool, oh-my-kiro&apos;s hook system intercepts the call, extracts the relevant context, and stores it in the memory layer. The agent never explicitly asks for memory. The memory layer learns from what the agent does.

This isn&apos;t injection. The middleware isn&apos;t predicting what the agent needs and prepending it. It&apos;s observing what the agent does and recording it. Injection is push. oh-my-kiro&apos;s hooks are pull-by-observation. Both have a place, and a mature agent system might well combine them: tools for active recall, hooks for passive capture.
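
Capture by interposition can be sketched as a wrapper around an existing tool. This is a generic illustration of the idea, not how oh-my-kiro implements its hooks: `capture`, `CAPTURED`, and `edit_file` are hypothetical names, and the list stands in for a real memory layer.

```python
import functools

CAPTURED = []  # stand-in for the memory layer

def capture(tool):
    """Wrap a tool so each call is recorded after it runs."""
    @functools.wraps(tool)
    def wrapper(*args, **kwargs):
        result = tool(*args, **kwargs)
        CAPTURED.append({"tool": tool.__name__, "args": args, "kwargs": kwargs})
        return result
    return wrapper

@capture
def edit_file(path, new_text):
    # Hypothetical file-editing tool; a real one would touch the filesystem.
    return "wrote " + str(len(new_text)) + " chars to " + path

edit_file("auth/session.py", "TOKEN_TTL = 900")
print(CAPTURED)
```

The agent calls `edit_file` exactly as before. The record of what it did accumulates without a single memory tool in its prompt.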

---

## The highest-leverage refinement

Across the tool-based systems, one pattern costs almost nothing to adopt: self-guiding tool responses.

Every tool response ends with a hint about what to do next. GitNexus does this explicitly: every tool response includes a `Next:` line suggesting the most likely follow-up call. The agent learns the API through use rather than through documentation. The trace becomes self-documenting. A developer reading it can see not just what the agent called but what the system suggested the agent call next.

GitNexus has seven or more tools grouped by process: hybrid search, 360-degree symbol context, impact analysis, even a Cypher escape hatch for graph queries. A surface that size could easily overwhelm an LLM that&apos;s never seen it before. The `Next:` hints are what make it navigable. The agent calls a tool, reads the response, and the response tells it what to consider next. No documentation required.

The cost is one line per tool response. The benefit is that the agent stops guessing about API shape and starts following the system&apos;s own understanding of what comes next.
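
A sketch of how little code that takes. The tool names and hint text below are hypothetical, not the actual GitNexus surface. The only detail taken from the source is that each response carries a `Next:` line.

```python
# Hypothetical map from each tool to the follow-up it suggests.
NEXT_HINTS = {
    "search": "call get(id) on the most relevant preview",
    "get": "call impact(id) to see what depends on this record",
}

def with_next_hint(tool_name, body):
    """Append a Next: line to a tool response when a hint is known."""
    hint = NEXT_HINTS.get(tool_name)
    if hint is None:
        return body
    return body + "\nNext: " + hint

print(with_next_hint("search", "3 matches: m1, m4, m9"))
```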

---

## The two-step rhythm

The natural structural fit for tool-based memory is two-step retrieval. Search returns identifiers and short previews. A separate GetByID call fetches the full record when needed. mem9&apos;s MemoryRepo interface is built around this pattern. MemoryOS wraps it behind a single `retrieve_memory` call but the internal implementation is the same: search first, then fetch.

The numbers make the case plainly. Ten matches at 1,500 tokens each is 15,000 tokens injected into context whether the agent uses them or not. Two-step retrieval returns 10 identifiers and short previews at roughly 450 tokens total, then fetches only the records the agent actually needs. Across 20 recall steps in a session, that difference compounds to around 200,000 tokens saved, a figure that works out if the agent goes on to fetch about three full records per step.

No architectural change to the memory store. No additional LLM calls. No information loss. It&apos;s a retrieval interface decision.
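
As an interface, the split is two functions. The sketch below assumes an in-memory dictionary and naive substring matching. `STORE`, `search`, and `get_by_id` are illustrative names rather than the mem9 API.

```python
STORE = {  # hypothetical records keyed by identifier
    "m1": "Design doc for the auth refactor. " * 40,
    "m2": "Standup: auth refactor deprioritised. " * 40,
    "m3": "Postmortem for the cache outage. " * 40,
}

def search(query, preview_chars=40):
    """Step one: identifiers and short previews only."""
    words = query.lower().split()
    return [
        {"id": key, "preview": text[:preview_chars]}
        for key, text in STORE.items()
        if any(w in text.lower() for w in words)
    ]

def get_by_id(record_id):
    """Step two: the full record, fetched only when the agent asks."""
    return STORE[record_id]

hits = search("auth")
full = get_by_id(hits[0]["id"])
print([h["id"] for h in hits], len(full))
```

The agent reads the previews, picks an identifier, and pays for a full record only at that point.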

---

## What the shift implies

If your memory system still auto-injects, you&apos;re working too hard. You&apos;re predicting what the agent needs without the agent&apos;s input. You&apos;re paying retrieval cost on every turn whether or not memory was relevant. You&apos;re blocking the agent from following up on partial results. You&apos;re making the trace opaque to your own debugging. You&apos;re doing the agent&apos;s job for it.

The agent should ask. Give it the tools and trust it to use them. Make the tools&apos; responses guide the next call. Make the search-then-read split the default rhythm. Then get out of the way and let the trace tell you, after the fact, what the agent actually needed, which is the question you should have been asking all along.</content:encoded><category>memory</category><category>llm</category><category>agents</category><category>architecture</category><category>tools</category><author>steven@sbatman.com (Steven Batchelor-Manning)</author></item><item><title>You are not running out of tokens. You are wasting them. Here is the difference.</title><link>https://blog.sbatman.com/posts/2026-05-30-llm-memory-research-08/</link><guid isPermaLink="true">https://blog.sbatman.com/posts/2026-05-30-llm-memory-research-08/</guid><description>Most agents are not running out of tokens. They are wasting them. Six mechanisms from 19 systems for keeping context budgets under control.</description><pubDate>Sat, 30 May 2026 00:00:00 GMT</pubDate><content:encoded>&lt;img src=&quot;/posts/llm-memory-research/week-08/hero.png&quot; alt=&quot;You are not running out of tokens. You are wasting them. Here is the difference. - hero image.&quot; class=&quot;hero-banner-post&quot; /&gt;


I have been down the rabbit hole on context budget management for a while now. The assumption I kept running into, before I started this research, was that longer context windows had more or less retired the problem. If you can fit 200,000 tokens in a single call, the argument goes, you stop worrying about what goes in.

That assumption is wrong. It is not even close to right.

The finding that keeps surfacing across the 19 systems I looked at is that bigger windows intensify the budget problem rather than dissolve it. A 200K-token window does not pay equal attention to all 200K tokens. Performance degrades long before the window fills. The degradation is non-uniform: material in the middle of a long context is reliably attended to less than material at the edges. And the agent&apos;s actual task occupies a fixed slice of the window regardless of how large the window is, which means everything else is overhead competing for the same attention budget.

The systems that handle this well have converged on six mechanisms. None of them are exotic. Several are embarrassingly simple. But the ones that skip them pay for it.

---

## The six mechanisms

### 1. Compaction passes

The most visible mechanism. You take a long conversation or a large memory segment, summarise it, and replace the original with the summary. MemoryOS does this at the segment level: its segment summariser fires when a conversation segment grows past a threshold, collapsing it to a compact representation before it can crowd out working context. The Karpathy-pattern wikis (purpose.md, overview.md) do a version of this at the knowledge level: the wiki is the compacted form of everything the agent has learned about a topic, maintained across sessions.

The trade-off is information loss. Compaction is a lossy operation by definition. The summary captures what the summariser judged relevant at the time of compaction. If the agent later needs a detail that was not judged relevant, it is gone. This is not a reason to avoid compaction, but it is a reason not to treat it as the only mechanism.

There is a second cost that is easy to miss. Compaction is not free at runtime. MemoryOS can pay 20 or more LLM calls in a single interaction to maintain its segment summaries. For systems with high interaction frequency, that is a real operational cost.
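
A threshold-triggered compaction pass, and the information loss that comes with it, can be shown in miniature. This is a sketch under assumptions: `summarise` is a stand-in that keeps the first sentence of each turn where a real system would make an LLM call, and the character threshold is arbitrary.

```python
def summarise(turns):
    # Stand-in for an LLM call: keep the first sentence of each turn.
    return " ".join(t.split(".")[0].strip() + "." for t in turns)

def compact(segment, max_chars=120, keep_recent=1):
    """Replace older turns with a summary once the segment outgrows max_chars."""
    total = sum(len(t) for t in segment)
    if total > max_chars:
        older, recent = segment[:-keep_recent], segment[-keep_recent:]
        return ["[summary] " + summarise(older)] + recent
    return segment  # still under the threshold: nothing to do

segment = [
    "We chose signed tokens. The TTL is 900 seconds.",
    "Rollout is staged. Canary first, then 10 percent.",
    "Next step is the migration script. Owner is unassigned.",
]
for line in compact(segment):
    print(line)
```

After compaction the TTL value is gone. Nothing in the summary records that it ever existed, which is the trade-off described above.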

### 2. Result-preview truncation

Rather than returning full memory content on every retrieval, return a short preview and let the agent decide whether to fetch the full record. supermemory exposes snippet-length controls that let callers tune how much text comes back per result. mem9 goes further: it decorates source turns with three environment variables (MEM9_SOURCE_TURN_MIN_SCORE, MEM9_SOURCE_TURN_PER_MEMORY_LIMIT, MEM9_SOURCE_TURN_TOTAL_LIMIT) that give operators precise control over how many source turns appear and at what minimum relevance score.

The trade-off is an extra tool call. If the agent needs the full content, it has to ask for it explicitly. For most retrieval patterns this is the right trade-off: the agent gets enough signal to decide whether the record is relevant before paying the token cost of reading it in full.

### 3. Two-step retrieval

A specific and important variant of preview truncation. Search returns identifiers and short previews. A separate GetByID call fetches the full record when needed. mem9&apos;s MemoryRepo interface is built around this pattern: search and fetch are distinct operations with distinct token footprints.

The numbers make the case plainly. Ten matches at 1,500 tokens each is 15,000 tokens injected into context whether the agent uses them or not. Two-step retrieval returns 10 identifiers and short previews at roughly 450 tokens total, then fetches only the records the agent actually needs. If the agent goes on to fetch about three full records per recall step, each step costs roughly 4,950 tokens instead of 15,000. Across 20 recall steps in a session, that difference compounds to around 200,000 tokens saved.

This is the cheapest discipline you can adopt. It requires no architectural change to the memory store, no additional LLM calls, and no information loss. It is a retrieval interface decision.
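
A minimal sketch of the two-step shape, assuming an in-memory store. The names `MemoryStore`, `search`, `get_by_id`, `rough_tokens` and `session_savings` are hypothetical and are not the actual mem9 `MemoryRepo` interface; substring matching stands in for whatever ranking a real store would use, and tokens are approximated by word counts.

```python
from dataclasses import dataclass


@dataclass
class Preview:
    record_id: str
    snippet: str


class MemoryStore:
    """Hypothetical store: search returns previews, get_by_id returns full text."""

    def __init__(self, records: dict[str, str], snippet_words: int = 12):
        self.records = records
        self.snippet_words = snippet_words

    def search(self, query: str, limit: int = 10) -> list[Preview]:
        # Step one: identifiers plus short previews only.
        hits = [rid for rid, text in self.records.items()
                if query.lower() in text.lower()]
        previews = []
        for rid in hits[:limit]:
            words = self.records[rid].split()
            previews.append(Preview(rid, " ".join(words[: self.snippet_words])))
        return previews

    def get_by_id(self, record_id: str) -> str:
        # Step two: the full record is only paid for when the agent asks.
        return self.records[record_id]


def rough_tokens(text: str) -> int:
    # Crude proxy: one token per whitespace-separated word.
    return len(text.split())


def session_savings(matches: int, full_tokens: int, preview_total: int,
                    fetched_per_step: int, steps: int) -> int:
    # Tokens avoided versus injecting every full match on every recall step.
    one_step = matches * full_tokens
    two_step = preview_total + fetched_per_step * full_tokens
    return steps * (one_step - two_step)
```

With the figures above, `session_savings(10, 1500, 450, 3, 20)` comes to 201,000 tokens, where the three fetches per step are an assumption. With no follow-up fetches at all the ceiling is 291,000.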

### 4. Decompose-then-recall

Rather than sending the full user query to the retrieval layer, decompose it into sub-queries first. SimpleMem&apos;s intent-aware retrieval planner breaks incoming queries into atomic retrieval intents before hitting the memory store. GitNexus does something similar with its query tool decomposition: complex queries are split into targeted sub-queries, each of which retrieves a focused slice of the memory graph.

The benefit is precision. A decomposed query retrieves less irrelevant material, which means less noise in context. The trade-off is latency: decomposition adds a planning step before retrieval begins. For interactive agents this matters. For batch or background agents it usually does not.

### 5. Tiered storage as budget filter

If you have already built a tiered memory architecture (the subject of last week&apos;s piece), you get budget filtering as a side effect. supermemory&apos;s three-tier model means that hot, frequently-accessed material lives in a tier that returns compact, high-signal results. Cold material is in a tier that is not queried by default. Hindsight&apos;s observation tier works the same way: raw observations are not injected into context directly; they are promoted to higher tiers before they become retrieval candidates.

The trade-off is recall completeness. Material that has not been promoted may be relevant but will not surface in a standard retrieval pass. This is the same trade-off as compaction, but the failure mode is different: instead of losing information through summarisation, you lose sight of it because it sits in a tier the default retrieval pass never queries. The record still exists; it just does not surface.

### 6. Self-guiding tool responses

The least discussed mechanism, and one of the more interesting ones. Rather than leaving the agent to decide what to do after a tool call, the tool response itself includes a hint about what to do next. GitNexus appends a `---\n**Next:**` block to tool responses, suggesting follow-up actions. mem9 decorates source turns with structured metadata that guides the agent&apos;s next retrieval step.

The effect is that the agent spends fewer tokens on planning between tool calls. The tool response carries enough structure to make the next step obvious. The trade-off is prompt-engineering effort: writing good self-guiding responses requires knowing in advance what the agent is likely to need next, which is not always possible.
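
As a rough illustration, the pattern can be as small as a formatter that appends a hint block to whatever the tool returns. Only the `---\n**Next:**` marker comes from GitNexus; the function name `with_next_hint` and the bullet layout under the marker are assumptions made for this sketch.

```python
def with_next_hint(result_text: str, suggestions: list[str]) -> str:
    """Append a follow-up hint block to a tool response."""
    if not suggestions:
        # Nothing useful to suggest: return the response unchanged.
        return result_text
    hint_lines = "\n".join(f"- {s}" for s in suggestions)
    return f"{result_text}\n\n---\n**Next:**\n{hint_lines}"
```

A search tool might call `with_next_hint(results, ["fetch the full record by id"])` so that the follow-up step arrives with the results rather than being planned from scratch.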

---

## The Tolaria limit case

Tolaria is worth looking at separately because it is budget discipline taken to its logical endpoint. ADR-0009 documents the decision to remove embeddings entirely from the system. Tolaria uses substring-only search. No vector index, no semantic retrieval, no embedding calls.

The reasoning is direct: the cheapest token is the one you never retrieve in the first place. Embedding-based retrieval returns semantically similar results, which means it returns results the agent did not explicitly ask for. Some of those results are useful. Many are not. All of them cost tokens.

Tolaria&apos;s position is that the cost of irrelevant-but-similar results, compounded across a session, exceeds the benefit of semantic recall for its use case. Whether that trade-off holds for your system depends on what your system is for. For systems where queries are precise and structured (code navigation, document lookup by identifier), Tolaria&apos;s position is defensible. For systems where queries are vague and exploratory, removing embeddings breaks recall in ways that are hard to recover from.

The value of the Tolaria case is not that you should copy it. It is that it makes the cost of semantic retrieval visible in a way that most systems do not.

---

## The case against compaction-only systems

Several of the 19 systems rely on compaction as their primary or only budget mechanism. The failure modes are worth naming.

The first is that summarisation loses details that were not judged relevant at compaction time but become relevant later. This is not a hypothetical: it is the standard failure mode of any lossy compression scheme applied to information whose future relevance is unknown.

The second is that compaction is a hot-path cost. MemoryOS paying 20+ LLM calls per interaction is not unusual for compaction-heavy systems. At scale, that cost is not negligible.

The third, and most subtle, is that compaction without an escape hatch is slow forgetting. If the only way to reduce context size is to summarise, and summaries are lossy, then the system is continuously discarding information with no way to recover it. Two-step retrieval, tiered storage, and result-preview truncation all preserve the original record. Compaction does not.
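
One way to give compaction an escape hatch is to archive the original before summarising and keep a pointer on the summary. The sketch below assumes that shape. `CompactingLog`, `compact` and `expand` are hypothetical names, and the summariser is passed in as a plain callable so that no particular LLM API is implied.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Summary:
    text: str
    source_id: str  # pointer back to the archived original


class CompactingLog:
    def __init__(self, summarise: Callable[[str], str]):
        self.summarise = summarise
        self.archive: dict[str, str] = {}  # originals, kept out of context
        self.context: list[Summary] = []   # what the agent actually sees

    def compact(self, segment_id: str, segment_text: str) -> Summary:
        # Archive first, then summarise: the summary replaces the original
        # in context, not in storage.
        self.archive[segment_id] = segment_text
        summary = Summary(self.summarise(segment_text), segment_id)
        self.context.append(summary)
        return summary

    def expand(self, summary: Summary) -> str:
        # Escape hatch: recover the detail the summariser dropped.
        return self.archive[summary.source_id]
```

The summary is still lossy, but the loss is now confined to what sits in context; the detail is one lookup away.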

None of this means compaction is wrong. It means compaction alone is not enough.

---

## Recency weighting and the persistent queue

Two mechanisms that do not fit neatly into the six categories above are worth noting.

graymatter uses RRF with the recency lane at half-weight. This is not a budget mechanism in the strict sense, but it functions as one: by down-weighting older material in retrieval rankings, it reduces the probability that stale, low-signal records crowd out recent, high-signal ones. The effect is soft tiering through ranking weights rather than explicit tier promotion.
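
A sketch of what a half-weighted recency lane could look like, assuming the weight simply multiplies that lane&apos;s reciprocal-rank contribution. I have not confirmed that graymatter applies the weight this way; `weighted_rrf` and the lane names are illustrative.

```python
def weighted_rrf(lanes: dict[str, list[str]],
                 weights: dict[str, float],
                 k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked ID lists, scaling each lane by its weight."""
    scores: dict[str, float] = {}
    for lane, ranked_ids in lanes.items():
        w = weights.get(lane, 1.0)  # lanes without an entry count in full
        for rank, item_id in enumerate(ranked_ids, start=1):
            scores[item_id] = scores.get(item_id, 0.0) + w / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

Called with `weights={"recency": 0.5}`, recency can still break a tie between two otherwise equal records, but it cannot outvote a relevance lane on its own.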

llm-wiki&apos;s 540-line ingest queue state machine takes a different approach. The queue serialises ingest operations and applies a four-signal relevance ranker before anything enters the memory store. Budget control happens at write time rather than read time. Material that does not clear the relevance threshold is not stored, which means it cannot be retrieved and cannot consume context. This is indirect budget control, but it is durable: the savings compound across every future session.

---

## What the well-designed systems have in common

Looking across the 19 systems, the ones that handle context budgets well share a few properties.

They treat retrieval as a two-step operation rather than a one-step injection. They return previews before full records. They preserve original records rather than replacing them with summaries. They give operators control over retrieval volume through explicit parameters rather than hardcoded defaults. And they think about budget at write time as well as read time.

The ones that handle it poorly tend to rely on a single mechanism, usually compaction, and treat the context window as a buffer to be filled rather than a resource to be managed.

The closing position from the research is simple. Bigger windows demand more discipline, not less. Not because filling them is wrong in principle, but because filling them with the wrong material costs more than leaving the room empty.

---

*Next week: the shift from memory-as-injection to memory-as-tools, how the 19 systems handle the boundary between what gets pushed into context automatically and what the agent has to ask for explicitly.*</content:encoded><category>memory</category><category>llm</category><category>agents</category><category>architecture</category><category>tokens</category><author>steven@sbatman.com (Steven Batchelor-Manning)</author></item><item><title>Storage is cheap. Attention is expensive. Are you using the system that exploits the difference?</title><link>https://blog.sbatman.com/posts/2026-05-26-llm-memory-research-07/</link><guid isPermaLink="true">https://blog.sbatman.com/posts/2026-05-26-llm-memory-research-07/</guid><description>A flat memory store is the wrong default. How 7 of 19 agent-memory systems converged on tiered storage, what the heat formula looks like, and the 7 steps to get there.</description><pubDate>Tue, 26 May 2026 00:00:00 GMT</pubDate><content:encoded>&lt;img src=&quot;/posts/llm-memory-research/week-07/hero.png&quot; alt=&quot;Storage is cheap. Attention is expensive. Are you using the system that exploits the difference? - hero image.&quot; class=&quot;hero-banner-post&quot; /&gt;


## The asymmetry that makes tiering pay back

If you change one thing about how you design agent memory, change this: stop treating storage cost and attention cost as if they&apos;re the same thing.

I&apos;ve been going deep on this across 19 systems, and the finding that keeps surfacing is that the ones that handle memory well aren&apos;t necessarily the ones with the most sophisticated retrieval. They&apos;re the ones that figured out which memories belong in the prompt at all. That&apos;s a different problem, and it has a different solution.

Storage is cheap. Attention is expensive. A 128k-context model reading 110k of mediocre context isn&apos;t, empirically, a better agent than the same model reading 8k of carefully selected context. The research on this is consistent: retrieval quality degrades as context fills with noise, and the degradation isn&apos;t linear. The model doesn&apos;t simply ignore the irrelevant material. It processes it, and the processing crowds out the signal.

Storage doesn&apos;t have this property. A fact sitting in a database costs nothing to keep. The cost only arrives when you retrieve it and load it into the prompt. Which means the question isn&apos;t &quot;should I store this?&quot; but &quot;should I retrieve this, and if so, when?&quot;

That reframing is where tiering comes from. Different memory items have different access patterns. Some things an agent needs every turn: the user&apos;s name, their current project, their stated preferences. Some things it needs often but not always: recent decisions, open questions, episodic facts from the last few sessions. Some things it should be able to find when needed but should never burden every turn: meeting notes from three months ago, completed tasks, raw transcripts, one-off reference material.

A flat store treats all three categories identically. Hot items pay the search cost of cold items. Cold items inflate the prompt with noise. There&apos;s no mechanism for items to graduate as they become more relevant, and no mechanism for items to age out as they become less relevant. The OS analogy is exact: computers have L1, L2, L3, RAM, SSD, and disk not because bytes are different but because frequency of access varies by orders of magnitude. Seven of the 19 systems I went through had already built explicit tiering before I started looking. Two of them are worth understanding in detail.

---

## MemoryOS as reference implementation

MemoryOS is the clearest implementation of tiered memory in the 19 systems. Three tiers, each with a distinct data shape, a distinct latency budget, and a distinct role.

The short-term tier is a Python deque with a maximum length of 10 QA pairs. No embeddings, no search index, no heat tracking. It&apos;s pure conversational raw material, the last ten exchanges, available at microsecond latency. When the deque fills, the oldest pair drains into the mid-term tier.
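
A sketch of the drain behaviour, under one assumption worth flagging: a `deque` built with `maxlen` discards its oldest element silently on overflow, so this version leaves the deque unbounded and enforces the capacity itself in order to hand the oldest pair on. `ShortTermTier` and the `drained` list (standing in for the mid-term tier) are hypothetical, and the actual MemoryOS code may handle the hand-off differently.

```python
from collections import deque


class ShortTermTier:
    def __init__(self, capacity: int = 10):
        self.capacity = capacity
        self.pairs: deque = deque()
        self.drained: list = []  # stand-in for the mid-term tier

    def add(self, question: str, answer: str) -> None:
        # Drain explicitly before appending so the oldest pair is passed
        # down a tier rather than dropped.
        if len(self.pairs) >= self.capacity:
            self.drained.append(self.pairs.popleft())
        self.pairs.append((question, answer))
```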

The mid-term tier holds up to 2000 sessions. Each session carries a summary, an embedding, a keyword set, and heat counters. The tier is indexed with Faiss and searched by cosine similarity. When the tier reaches capacity, the coldest sessions are evicted. When a session gets hot enough, it&apos;s promoted to the long-term tier.

The long-term tier is a 90-dimension psychology and alignment schema plus two knowledge-base deques, one for user facts and one for assistant facts, each capped at 100 entries. This is the persistent layer, the one that survives across sessions and carries the durable model of the user.

The heat formula that governs promotion is twelve lines of Python. Three signals: visit frequency, an LFU analogue; interaction depth, a proxy for topical engagement; and recency decay, exponential with a 24-hour half-life. A segment crosses the promotion threshold at 5.0. After promotion, the visit and interaction counters reset to zero, and heat collapses back to roughly 1.0.
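
To make the shape concrete, here is a sketch that assumes the three signals are combined as a plain weighted sum with unit coefficients. That assumption is consistent with the threshold and reset behaviour described above but is not copied from the MemoryOS source; the function and parameter names are mine.

```python
import math

HALF_LIFE_HOURS = 24.0
PROMOTION_THRESHOLD = 5.0


def heat(visits: int, interactions: int, hours_since_access: float,
         alpha: float = 1.0, beta: float = 1.0, gamma: float = 1.0) -> float:
    # Recency halves every HALF_LIFE_HOURS and equals 1.0 at zero age.
    recency = math.exp(-math.log(2) * hours_since_access / HALF_LIFE_HOURS)
    return alpha * visits + beta * interactions + gamma * recency


def should_promote(visits: int, interactions: int,
                   hours_since_access: float) -> bool:
    # Treating "crosses the threshold" as greater-or-equal is an assumption.
    return heat(visits, interactions, hours_since_access) >= PROMOTION_THRESHOLD
```

Under this reading, a freshly promoted segment with both counters reset scores `heat(0, 0, 0.0)`, which is 1.0: only the recency term remains.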

The design decision that&apos;s easy to miss: heat gates promotion, not retrieval. The retriever is purely semantic, cosine similarity over the mid-term embeddings. Heat is a background signal that decides whether a segment should graduate to the long-term tier. The two concerns are decoupled, and that decoupling matters more than the formula itself.

What MemoryOS leaves on the table is worth naming. The coefficients are hardcoded at 1.0; there&apos;s no mechanism for learning weights from actual usage patterns. There are no demotion paths; once something reaches the long-term tier, it stays. And the formula optimises for frequency over importance. A critical but rare fact (a partner&apos;s name, a medical condition, a hard constraint) may never cross the promotion threshold if it only surfaces once. Once the mid-term tier evicts the segment, the fact is gone.

---

## Hindsight&apos;s theory-derived tiers

Hindsight arrives at the same three-tier structure from a completely different starting point. Where MemoryOS draws on OS cache theory, Hindsight draws on cognitive science.

The three tiers are World, Experience, and Observations. World holds objective claims about the universe, ground truth, always-on, long-lived. Experience holds first-person actions of the system itself, the episodic record. Observations hold consolidated beliefs derived from World and Experience facts, carrying source memory IDs, a proof count, and a history field that tracks how the belief has evolved.

All three tiers live in the same database table, differentiated by a fact-type discriminator. Partial HNSW indexes are built per fact type. The schema is unified; the access patterns aren&apos;t.

Promotion in Hindsight isn&apos;t counter-driven. It&apos;s batched LLM-driven consolidation. When a new fact is written, it&apos;s enqueued into an async operations table. A background worker fetches the new facts alongside existing overlapping observations, builds a batch prompt, and asks the model for creates, updates, and deletes. Source memories are stamped with a consolidated-at timestamp to prevent reprocessing. Every new fact, regardless of how frequently it&apos;s been accessed, gets considered for promotion to the Observations tier.

That&apos;s the key difference from MemoryOS. By tying upper-tier promotion to consolidation rather than heat, Hindsight avoids the blind spot for critical-but-rare facts. A single consolidation pass considers every new fact. Frequency is irrelevant to whether something graduates.

Put plainly: two systems, different first principles, different implementation languages, different target use cases, and both land on three tiers with raw material at the bottom, a working layer in the middle, and a synthesised persistent layer at the top. Both use async promotion. Both carry provenance back to lower tiers. That&apos;s what convergent evolution looks like in software architecture, and it&apos;s the strongest signal I know of that a pattern is load-bearing.

---

## supermemory&apos;s genre-conditioned tiering

supermemory operates in the managed API deployment shape, which changes the implementation without changing the architecture. The three tiers are static profile, dynamic profile, and document and chunk store.

The static profile holds stable long-term facts, the hot tier. It&apos;s returned as a static array from the profile endpoint, cached at the edge, with a latency budget of roughly 50ms. The dynamic profile holds recent and episodic context, the warm tier. Many entries carry a forgetAfter field that sets a TTL. The document and chunk store is the cold tier, queried via search endpoints when needed.

Tier assignment happens at write time. An extraction LLM classifies each incoming memory with an isStatic boolean and optionally a forgetAfter value. The classification is enforced by a closed extraction prompt, uniform across all consumers.
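
A sketch of routing on those two fields. The snake-case names mirror `isStatic` and `forgetAfter`, but the types are assumptions (in particular, treating `forgetAfter` as a Unix timestamp), and `assign_tier` and `is_expired` are hypothetical helpers rather than supermemory API calls.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExtractedMemory:
    text: str
    is_static: bool
    forget_after: Optional[float] = None  # expiry as a Unix timestamp, if any


def assign_tier(memory: ExtractedMemory) -> str:
    # Stable facts go to the hot profile; everything else extracted from a
    # write lands in the warm, expiring profile.
    return "static_profile" if memory.is_static else "dynamic_profile"


def is_expired(memory: ExtractedMemory, now: float) -> bool:
    # Entries without a TTL never expire on their own.
    return memory.forget_after is not None and now >= memory.forget_after
```

The point of doing this at write time is that the read path never has to classify anything: it only has to know which tier to ask.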

The managed API deployment shape enables three things that an in-process designer can&apos;t directly copy but should understand. Cold-tier data can sit on cheaper hardware, object storage for raw bytes, a standard relational store for metadata and chunks, with the hot profile cached separately at the edge. The hot tier gets its own endpoint with its own SLA, separate from the search path. And the extraction prompt is centralised, which means tier assignment is consistent in a way that per-agent classification rarely is.

The trade-offs are real. A remote hot tier is only fast if the network is fast. The agent can&apos;t override the engine&apos;s tier classification. The extraction prompt is a black box. But the architectural pattern, hot tier as its own endpoint, cold tier on cheaper hardware, tier assignment at write time, is worth copying even if the deployment shape isn&apos;t.

---

## Promotion vs. fixed typology

mem9 is the contrast case that clarifies what tiering isn&apos;t.

mem9 has a memory-type column with three values: pinned, insight, and digest. Pinned memories are assigned by explicit content-write paths, manually created, protected from LLM reconciliation. Insights are assigned by every LLM-extracted write, mutable, versionable, supersedable. Both participate in the same hybrid recall with the same RRF scoring. The type field is a write-protection flag, not a retrieval filter and not a tier signal.

This is typology, not tiering. The distinction matters because the two are easy to conflate. Typology describes governance: who can mutate this memory, under what conditions. Tiering describes access patterns: how frequently is this memory needed, and what storage representation best serves that frequency. A system can have both. supermemory&apos;s isStatic is a tier signal, while a separate isInference flag is closer to a governance class. But conflating them produces the most ambiguity in practice.

If you find yourself adding a type field to your memory rows, the question to ask is whether the field describes an access pattern or a governance class. If it&apos;s an access pattern, you&apos;re building tiering. If it&apos;s a governance class, you&apos;re building typology. Both are useful. They&apos;re not the same thing.

---

## What flat costs you

Four concrete things follow from running a flat memory store.

The hot path pays the cold path&apos;s search cost. Every retrieval scans the same index over the same items. The agent looking up the user&apos;s name pays the same search cost as the agent looking up a meeting note from six months ago. At small scale this is invisible. At scale it&apos;s a latency problem.

The cold path inflates the prompt with noise. Retrieval returns semantically near items, which includes standing context, superseded facts, and material that&apos;s factually correct but irrelevant to the current turn. The model processes all of it. The signal-to-noise ratio in the context window degrades as the store grows.

There&apos;s no mechanism for items to graduate. Memory isn&apos;t static. A fleeting episodic fact from early in a relationship may, over time, become a durable signal about the user&apos;s preferences or constraints. A flat store gives no machinery for noticing that transition. The item stays in the same representation it was written in, regardless of how its relevance has changed.

There&apos;s no mechanism for items to age out. Demotion isn&apos;t deletion. A flat store that wants to remove stale material must delete it. A tiered store can move it to a colder representation, still findable, no longer burdening the hot path. The flat store forces a binary choice that the tiered store doesn&apos;t.

---

## The closing position

18 of the 19 systems implement tiering, gesture at it, or have explicit recommendations for it. The exception is mem9, which has a typology layer solving a different problem and would benefit from tiering on top of it.

If you&apos;re building agent memory from scratch, the progression the 19 systems point to is this. Identify what the agent needs every turn: that&apos;s your hot tier. Identify what it needs often but not always: that&apos;s your warm tier. Identify what it should find when needed but never burden every turn: that&apos;s your cold tier. Pick a promotion mechanism: heat-based like MemoryOS, LLM-judgement like Hindsight, or extraction-time classification like supermemory. Pick a demotion mechanism: time decay, TTL, or freshness categories. Keep promotion and retrieval scoring separate at first; coupling is easier to add later than to unpick. Track provenance from upper tiers back to lower tiers, so you can always answer the question of where a synthesised belief came from.

The 19 systems aren&apos;t unanimous on much. On this, they are: a single flat memory store is the wrong default for any non-trivial agent memory system.

Storage is cheap. Attention is expensive. Build the system that exploits the difference.

---

*Next week: context budget management, how the 19 systems handle the constraint that the context window is finite, and what the ones that handle it well have in common.*</content:encoded><category>memory</category><category>llm</category><category>agents</category><category>architecture</category><category>storage</category><author>steven@sbatman.com (Steven Batchelor-Manning)</author></item><item><title>Almost every serious memory system made the same retrieval decision. Here&apos;s why.</title><link>https://blog.sbatman.com/posts/2026-05-21-llm-memory-research-06/</link><guid isPermaLink="true">https://blog.sbatman.com/posts/2026-05-21-llm-memory-research-06/</guid><description>Hybrid retrieval with RRF at k=60 is the consensus across 19 agent-memory systems. Here is what the reference implementation looks like and why the outlier matters.</description><pubDate>Thu, 21 May 2026 00:00:00 GMT</pubDate><content:encoded>&lt;img src=&quot;/posts/llm-memory-research/week-06/hero.png&quot; alt=&quot;Almost every serious memory system made the same retrieval decision. Here&apos;s why. - hero image.&quot; class=&quot;hero-banner-post&quot; /&gt;


Ask a flat-vector retrieval system to find the note that mentions the string `idx_memory_units_text_search`. The embedding model has no privileged representation for an arbitrary identifier. The tokeniser splits it into pieces, the encoder averages those pieces into a vector that looks much like every other identifier-shaped vector, and the note may or may not surface in the top fifty results. A keyword search returns it instantly. Now ask a keyword search to find notes about &quot;authentication&quot; when every relevant note uses the word &quot;login&quot;. Exact-match is exact-match by construction. Stemming and stop-word removal help at the margins; they do not bridge the semantic gap. The note is not found.

These two failure modes are not edge cases. They are the default failure modes of single-index retrieval, and they are why almost every system in the 19 I went deep on has either abandoned flat-vector search or wrapped it behind something else.

The something else is almost always the same thing. Hybrid retrieval -- running both lexical and semantic search and fusing their ranked lists -- has become the consensus architecture for serious memory systems. The fusion algorithm is Reciprocal Rank Fusion, almost universally at the constant `k=60`. The corpus disagrees about nearly everything else: what to extract from a conversation, when to forget, whether memory is files or rows, whether the agent should drive retrieval or be handed results. On the question of how to order a candidate list, it has converged so completely that `k=60`, inherited from a 2009 information-retrieval paper, has become a magic number copied without comment from one implementation to the next.

## Why the two lanes are not interchangeable

Dense retrieval and lexical retrieval fail in opposite directions, which is why combining them works. Dense retrieval handles semantic variation well -- &quot;login&quot; and &quot;authentication&quot; land near each other in the vector space -- but struggles with exact identifiers, rare tokens, and anything the embedding model has no privileged representation for. Lexical retrieval handles exact terms, identifiers, and rare strings well but cannot bridge synonyms or paraphrase. The failure modes are complementary. Running both and fusing the results covers the ground neither covers alone.

The fusion step matters because you cannot simply concatenate the two ranked lists. A document that ranks first in the vector lane and fifteenth in the keyword lane should score differently from one that ranks first in both. Reciprocal Rank Fusion handles this by converting each rank into a score of `1 / (k + rank)` and summing across lanes. The constant `k` controls how much weight goes to top-ranked items versus the rest of the list. At `k=60`, a rank-1 result scores `1/61` and a rank-60 result scores `1/120` -- a 2x difference. The algorithm is rank-based rather than score-based, which means it is robust to the different score distributions produced by different retrieval strategies. A cosine similarity of 0.87 and a TF-IDF score of 14.3 are not directly comparable; their ranks are.
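
The whole algorithm fits in a few lines. This is a generic sketch of Reciprocal Rank Fusion as described above, not any one system&apos;s implementation; the function name `rrf` is mine.

```python
def rrf(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of document IDs by summing 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranked in ranked_lists:
        # Ranks are 1-based, so the top item in a lane contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

Note that only positions go in. Neither the cosine similarity nor the keyword score is ever read, which is what makes the two lanes comparable.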

## graymatter as the reference implementation

If the corpus has a single canonical worked example of hybrid retrieval done cleanly, it is graymatter. The implementation is short, the choices are explicit, and the whole flow fits on a screen.

Three rankings are produced independently. Vector ranking: cosine similarity between the query embedding and each stored fact&apos;s embedding. Keyword ranking: a TF-IDF-style score -- term frequency multiplied by a log-IDF factor, summed across query terms, divided by term count to dampen long facts. Not full BM25, but in the same family. Recency ranking: an exponential decay score, `exp(-lambda * age_hours)`, with a default half-life of 30 days. Each ranking produces a map from fact ID to rank. The fusion step sums `1/(60+rank)` across all three lanes for each fact, sorts descending, and returns the top results.

Three lanes. One fusion. The recency lane is the interesting addition -- it means a fact that is semantically relevant and keyword-matched but six months old will score lower than a fresher fact with similar relevance. The decay is tunable; the discipline is that recency is a first-class signal rather than a post-hoc filter.
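
A sketch of the recency lane on its own, assuming the decay rate is derived from the half-life as `ln(2) / half_life`. The helper names `recency_score` and `to_ranks` are mine rather than graymatter&apos;s.

```python
import math

HALF_LIFE_HOURS = 30 * 24  # the 30-day default half-life


def recency_score(age_hours: float,
                  half_life_hours: float = HALF_LIFE_HOURS) -> float:
    # decay_rate plays the role of lambda in exp(-lambda * age_hours).
    decay_rate = math.log(2) / half_life_hours
    return math.exp(-decay_rate * age_hours)


def to_ranks(scores: dict[str, float]) -> dict[str, int]:
    # Convert raw lane scores into the fact-ID-to-rank map the fusion step
    # consumes; rank 1 is the highest score, ties broken by ID.
    ordered = sorted(scores, key=lambda fid: (-scores[fid], fid))
    return {fid: rank for rank, fid in enumerate(ordered, start=1)}
```

Because fusion only sees the rank, the exact decay curve matters less than the ordering it produces. Changing the half-life only changes the result when it reorders facts.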

## GitNexus: lanes compose without limit

GitNexus generalises the pattern. Where graymatter has three lanes, GitNexus has five separate full-text indexes -- one per content type -- merged by score summation, plus a BM25 lane and a dense vector lane fused with RRF. The `group_query` function runs cross-repository RRF, fusing results from multiple repositories into a single ranked list. The architecture documentation records `RRF_K=60` as the standard constant, with Elasticsearch and Pinecone running in parallel as the retrieval backends.

The lesson from GitNexus is that lanes compose. You do not need to redesign the fusion step when you add a new lane. You add the lane, produce a ranked list from it, and hand it to the same `1/(60+rank)` summation. The algorithm absorbs the new signal without modification. This is why the corpus has converged on it: it is not just correct, it is extensible.

## Hindsight: RRF as a stage in a longer pipeline

Hindsight is the most sophisticated retrieval architecture in the corpus. It runs four retrieval strategies in parallel -- dense vector, keyword, temporal, and graph-based -- fuses them with RRF, and then passes the fused list to a cross-encoder reranker. The cross-encoder reads the query and each candidate document together and produces a relevance score that is more accurate than any of the individual lane scores, at the cost of being more expensive to compute. Running it over the full candidate set would be prohibitive; running it over the top-N from RRF is tractable.

The pipeline is: four parallel lanes, RRF fusion, cross-encoder rerank, final ranked list. Each stage narrows the candidate set so the next stage can be more expensive and more accurate. RRF is not the end of the pipeline; it is the merge stage that makes the expensive final stage feasible.
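
The narrowing can be sketched as follows, with the cross-encoder replaced by an injected scoring callable so that no specific model or library is implied. `fuse_then_rerank`, `rerank_score` and `top_n` are hypothetical names, not Hindsight&apos;s.

```python
from typing import Callable


def fuse_then_rerank(query: str,
                     lanes: list[list[str]],
                     rerank_score: Callable[[str, str], float],
                     top_n: int = 20,
                     k: int = 60) -> list[str]:
    # Stage 1: cheap rank fusion across every lane.
    fused: dict[str, float] = {}
    for ranked in lanes:
        for rank, doc_id in enumerate(ranked, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Stage 2: keep only the top_n fused candidates.
    shortlist = sorted(fused, key=lambda d: (-fused[d], d))[:top_n]
    # Stage 3: the expensive scorer sees the shortlist, never the full set.
    return sorted(shortlist, key=lambda d: (-rerank_score(query, d), d))
```

The reranker can reorder the shortlist but cannot rescue a candidate that fusion dropped, so `top_n` is the knob that trades cost against recall.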

Hindsight also does something the other systems do not: it preserves per-strategy scores and per-strategy ranks on the result struct alongside the fused score. The trace machinery lets an operator inspect a problematic recall and see exactly which retriever found which item, what its rank was in each lane, what the RRF score was, and what the cross-encoder did. This is unusual. Most systems return only the final result list. Hindsight&apos;s debuggability advantage is largely a consequence of this one decision -- keeping the per-lane data rather than discarding it after fusion.

## The provenance gap most systems leave open

llm-wiki illustrates the cost of not keeping per-lane data. The fusion arithmetic is correct: `RRF_K = 60` is set at `src/lib/search.ts:53` with a comment citing the Cormack 2009 paper. But the result struct&apos;s `score` field is overwritten with the RRF score, discarding the original token score and vector score. The downstream rendering code receives a single number whose provenance has been lost. A user looking at a high-relevance result cannot ask whether it ranked highly because of semantic similarity or keyword match.

This is the same pattern that appeared in the provenance piece: the cost of not having the data is deferred but not avoided. When you want to add a cross-encoder reranker, it needs the per-lane signals. When you want to build a UI that explains why a result was retrieved, it needs the per-lane signals. When you want to A/B test a new lane, you need the old per-lane numbers to compare against. The fused score is a local optimum for ordering. The per-lane scores are the substrate everything else is built on.

The discipline is worth stating plainly: one score for ranking, one score per lane for explainability, and they live in different fields.
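
In code, that discipline is a result type with separate fields. The sketch below is illustrative only: `LaneHit`, `FusedResult` and `fuse_with_provenance` are names I have made up, not the structs used by Hindsight or llm-wiki.

```python
from dataclasses import dataclass, field


@dataclass
class LaneHit:
    rank: int
    raw_score: float


@dataclass
class FusedResult:
    doc_id: str
    fused_score: float = 0.0  # used only for ordering
    lanes: dict[str, LaneHit] = field(default_factory=dict)  # kept for explainability


def fuse_with_provenance(lane_results: dict[str, list[tuple[str, float]]],
                         k: int = 60) -> list[FusedResult]:
    results: dict[str, FusedResult] = {}
    for lane, hits in lane_results.items():
        for rank, (doc_id, raw) in enumerate(hits, start=1):
            res = results.setdefault(doc_id, FusedResult(doc_id))
            res.fused_score += 1.0 / (k + rank)
            # Record the lane data next to the fused score instead of
            # overwriting it.
            res.lanes[lane] = LaneHit(rank, raw)
    return sorted(results.values(), key=lambda r: (-r.fused_score, r.doc_id))
```

A renderer can then answer "why did this rank first?" by reading `lanes`, and a later reranker or A/B test has the per-lane numbers it needs.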

## The outlier worth taking seriously

The strongest dissent in the corpus is not a variation on hybrid retrieval -- it is Tolaria, which removed embeddings entirely.

Tolaria is a Markdown vault manager. Earlier in its life it shipped a semantic indexer; ADR-0009 removed it. The reasoning: the operational complexity of shipping a Go binary, code-signing it, auto-installing it, and surfacing its index status in the UI was not justified by the search-quality benefit for the specific workflow. The replacement is plain substring search over title and content, with title matches ranked above content matches.

Crucially, Tolaria does not pretend that substring search is as good as hybrid retrieval. ADR-0009 is explicit: the AI agent provides an alternative for exploratory and semantic queries. The semantic-retrieval intelligence is shifted out of the system entirely and into the agent&apos;s reasoning. The agent can read manifest files, reason about which folders are relevant, and read full notes. This works because Tolaria expects to be paired with a capable agent whose context window is the retrieval budget.

This is a real architectural position. The embedded-search-engine model assumes the system is responsible for finding relevant content. The agent-as-retriever model assumes the agent is responsible and the system&apos;s job is to surface structure the agent can navigate. The two models have different cost profiles, different deployment shapes, and different failure modes. Tolaria&apos;s position wins when the corpus is small enough to fit into the agent&apos;s context window in summary form, when the agent is capable enough to navigate structure intelligently, and when the operational cost of running an embedding pipeline is not justified by the query volume. It loses when the corpus is large, when queries are latency-sensitive, or when the agent cannot be trusted to navigate structure reliably.

The honest framing: hybrid retrieval is the right default for systems that do serious retrieval. Tolaria&apos;s position is the right default for systems where the agent is the retrieval engine. Knowing which one you are building is the first decision.

## What the consensus actually says

The corpus has converged on hybrid retrieval with RRF at `k=60` because the algorithm is correct, robust, and extensible. It is correct because it covers the complementary failure modes of dense and lexical retrieval. It is robust because it operates on ranks rather than raw scores, making it insensitive to the different score distributions produced by different retrieval strategies. It is extensible because lanes compose: adding a new signal means adding a new lane, not redesigning the fusion.

The variation in the corpus is in what the lanes are, how many there are, and what gets layered on top of the fused list. graymatter shows the clean three-lane reference. GitNexus shows that lanes compose without limit. Hindsight shows that RRF is the merge stage of a longer pipeline, with a cross-encoder reranker sitting above it. mem9 shows that a managed-API system can hide the embedding step behind the database and amortise the cost across tenants. Tolaria shows that the entire hybrid-retrieval edifice rests on an assumption -- that retrieval is the system&apos;s job -- and that the assumption is challengeable.

The single piece of practical advice from the corpus compressed into one sentence: use RRF at `k=60`, keep the per-lane scores on the result struct, and spend your engineering budget on the lanes rather than on the fusion. The algorithm has been right for long enough that you can trust it. The lanes are where the leverage is. The provenance is where the debugging is.

The next piece covers tiered storage -- the pattern that separates systems that keep everything in one flat store from the ones that have learned to match the storage medium to the access pattern. That piece is coming up.</content:encoded><category>memory</category><category>llm</category><category>agents</category><category>architecture</category><category>retrieval</category><author>steven@sbatman.com (Steven Batchelor-Manning)</author></item><item><title>A fact without provenance is an island. Why every memory must carry its origin.</title><link>https://blog.sbatman.com/posts/2026-05-18-llm-memory-research-05/</link><guid isPermaLink="true">https://blog.sbatman.com/posts/2026-05-18-llm-memory-research-05/</guid><description>A fact without provenance is an island. Six levels of agent-memory provenance across 19 systems - identity, source, confidence, versioning, causal, reciprocal.</description><pubDate>Mon, 18 May 2026 00:00:00 GMT</pubDate><content:encoded>&lt;img src=&quot;/posts/llm-memory-research/week-05/hero.png&quot; alt=&quot;A fact without provenance is an island. Why every memory must carry its origin. - hero image.&quot; class=&quot;hero-banner-post&quot; /&gt;


Sit down with a flat-RAG system that has been running in production for a year and try to ask it four questions. &quot;Why did you say that?&quot; The model produced a confident answer about the customer&apos;s contract renewal date, but the trail back to the source clause does not exist. &quot;Re-validate this claim -- the source changed.&quot; The contract was amended last week. Every fact derived from it is suspect. You cannot find them because they do not know which document they came from. &quot;Decide between these two contradicting facts.&quot; One says Alex works at Google, another says Stripe. Neither carries a confidence score or a source timestamp. Recency of the write has nothing to do with recency of the evidence. &quot;Attribute this hallucination.&quot; The model said the meeting was on Thursday. There was no meeting on Thursday. You cannot tell whether the LLM invented the date or whether bad data in memory misled it.

These four failures share a single cause. They are each a missing column. A source identifier, a capture timestamp, a confidence score, a response citation log. Each costs a few bytes at write time. The cost of not having them is unbounded: every claim is unauditable, every contradiction is unresolvable, every hallucination is untraceable.

Across 19 systems the pattern is consistent. The strongest implementations treat provenance the way a court treats evidence -- every claim arrives with an unbroken chain of custody or it does not arrive at all. The weakest carry none, and pay for it every time something goes wrong in production. This piece walks the six levels of provenance the corpus has surfaced, the three tiers of implementation maturity separating them, and the honest cost of building this right from the start.

## Six levels, one discipline

Reading 19 systems back to back, six distinct levels of provenance separate themselves out. They are not a hierarchy; they are orthogonal. A system can have source provenance with no causal provenance. It can have versioning without confidence scoring. The strongest implementations cover all six. None of the weakest cover any.

Identity answers &quot;which fact, exactly?&quot; OpenContext mints UUIDs for every piece of context at ingest time and exposes them as a citation scheme the agent can use directly in responses. The identifier survives renames, moves, and reorganisations because it is bound to the content, not the path. mem9 carries a stable memory ID on every row plus an explicit version counter with If-Match concurrency protection. Without identity provenance you cannot even name what you are talking about when things go wrong.

Source answers &quot;where did it come from?&quot; Hindsight records source type and identifier on every observation, so the system can answer which conversation or document produced a given fact. mem9 carries source, agent ID, and session ID on every row. Supermemory stores document IDs alongside memories. Without source provenance you cannot cascade updates when a source changes, because you do not know which facts depend on it.

Causal answers &quot;which agent step used this?&quot; Hindsight captures every retrieved-and-used fact in an observation tier with full retrieval context -- which retriever surfaced it, what rank it achieved, and whether the agent actually consumed it. Moraine treats every trace step as its own provenance record, making the agent&apos;s entire execution recoverable as a sequence of source-addressable events. Without causal provenance you cannot distinguish &quot;the system retrieved this fact&quot; from &quot;the agent used this fact to produce that response.&quot;

Capture confidence answers &quot;how sure were we when we wrote it down?&quot; Graphify marks every edge in its knowledge graph with three-level capture confidence: CONFIRMED for deterministic extractions, LIKELY for high-confidence LLM inferences, and AMBIGUOUS for uncertain claims. The AMBIGUOUS edges surface as &quot;knowledge gaps&quot; for human review rather than being silently treated as fact. Hindsight carries a confidence score on every observation that decays over time through its freshness lifecycle -- fresh observations are trusted, stale ones are down-weighted or retired. mem9 runs near-duplicate detection in shadow mode first, recording scores without acting on them until the engineer has calibrated the threshold from real data. Without capture confidence, uncertain facts and certain facts are retrieved with equal weight, and the agent cannot discriminate between a solid claim and an educated guess.

Versioned answers &quot;what did we believe before?&quot; Supermemory treats memory as a versioned DAG with typed edges -- updates, extends, derives -- giving every belief commit history. mem9 splits the write path: in-place mutation for human edits, append-and-archive for LLM-driven rewrites where the new content semantically replaces the old. Tolaria lets Git carry the version history entirely, treating one-line diffs as a first-class user-facing artefact. Without versioned provenance you cannot rewind to what the system believed on Tuesday, because old beliefs are deleted rather than archived.
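
The append-and-archive branch can be sketched as follows. This is a hypothetical in-memory model, not mem9&apos;s schema: `MemoryRow`, `VersionedStore`, and the `archived` flag are names invented for the example. The point it shows is that a rewrite adds a new version and marks the old one archived, so earlier beliefs stay queryable.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MemoryRow:
    memory_id: str
    content: str
    version: int = 1
    archived: bool = False


class VersionedStore:
    def __init__(self):
        self.rows = []  # every version ever written, oldest first

    def current(self, memory_id):
        live = [r for r in self.rows if r.memory_id == memory_id and not r.archived]
        return live[-1] if live else None

    def write(self, memory_id, content):
        old = self.current(memory_id)
        if old is None:
            row = MemoryRow(memory_id, content)
        else:
            # Archive the previous version instead of overwriting it.
            self.rows[self.rows.index(old)] = replace(old, archived=True)
            row = MemoryRow(memory_id, content, version=old.version + 1)
        self.rows.append(row)
        return row

    def history(self, memory_id):
        return [r for r in self.rows if r.memory_id == memory_id]


store = VersionedStore()
store.write("m1", "Alex works at Google")
store.write("m1", "Alex works at Stripe")
```

Reading `store.history("m1")` then answers what the system believed before the rewrite, which an in-place update cannot do.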

Reciprocal answers &quot;what other facts share this origin?&quot; llm-wiki weighs source overlap above direct linking in its four-signal relevance graph -- two pages that came from the same raw document are presumed more strongly related than two pages with a direct wikilink, because the LLM is unreliable at cross-linking but the sources frontmatter is mechanically maintained. EdgeQuake accumulates source IDs on entities and relationships across the corpus, so repeated mentions of the same entity from different sources strengthen its provenance weight. second-brain carries source as part of its lexical-index composite key, letting a single document be indexed from multiple pipelines independently. Without reciprocal provenance you cannot answer &quot;show me everything in this system that came from the same conversation&quot; or &quot;what else did we learn from that document?&quot;

The six are orthogonal but they reinforce each other. Source provenance is useless without identity (you need to name the fact before you can trace it). Versioning is weaker without confidence (you know the lineage of edits but not how sure the system was about any of them). Reciprocal queries depend on source being present first. The systems that carry all six are the ones whose memory does not silently rot.
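
The dependency of source and reciprocal queries on identity can be made concrete with a small sketch. Everything here is hypothetical (the `Fact` and `FactStore` names, the source ID strings); it only assumes that each fact gets a stable ID at ingest and records the source it came from.

```python
from collections import defaultdict
from dataclasses import dataclass
import uuid


@dataclass
class Fact:
    text: str
    source_id: str
    fact_id: str = ""

    def __post_init__(self):
        if not self.fact_id:
            self.fact_id = str(uuid.uuid4())  # identity assigned at ingest time


class FactStore:
    def __init__(self):
        self.facts = {}
        self.by_source = defaultdict(set)

    def add(self, fact):
        self.facts[fact.fact_id] = fact
        self.by_source[fact.source_id].add(fact.fact_id)
        return fact.fact_id

    def from_source(self, source_id):
        """Source query: every fact to re-validate if this source changes."""
        return [self.facts[i] for i in self.by_source.get(source_id, ())]

    def siblings(self, fact_id):
        """Reciprocal query: other facts that share this fact's origin."""
        source_id = self.facts[fact_id].source_id
        return [f for f in self.from_source(source_id) if f.fact_id != fact_id]


store = FactStore()
a = store.add(Fact("Renewal date is 1 March", "contract-v1"))
b = store.add(Fact("Notice period is 30 days", "contract-v1"))
c = store.add(Fact("Meeting moved to Friday", "chat-42"))
```

When the contract is amended, `store.from_source("contract-v1")` is the list of suspect facts. Without the `source_id` column there is nothing to query.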

## Three tiers of implementation maturity

The corpus separates into three tiers based on where provenance lives and what it can do at read time.

Tier 1 is no-provenance RAG, the starting point for most teams. Flat vector stores with content and an embedding. No source column, no confidence score, no version history. When a fact is retrieved, you get text and a similarity score. You cannot trace where it came from, how sure the system was about it, or whether it has been superseded. Every system that started here has moved toward Tier 2 over time.

Tier 2 carries provenance on the row. Source ID, confidence, version -- all present as columns alongside the fact. mem9 sits here with source, agent ID, session ID, and version on every row. Supermemory&apos;s versioned DAG is Tier 2 structure. Graphify&apos;s three-level edge confidence is Tier 2 discipline. The provenance is available for queries, but it does not automatically decorate retrieval results. You have to write the query that uses it.

Tier 3 decorates read-time results with provenance context without the caller having to ask for it. Hindsight is the reference here -- every retrieved fact arrives with its source type, confidence score, freshness state, and per-retriever ranking already attached. The agent consuming the result sees provenance as part of the fact, not as a separate lookup. mem9&apos;s source-turn decoration sits halfway between Tier 2 and Tier 3, grafting read-time context onto a Tier 2 schema.
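
The difference between the two tiers is easiest to see at the call site. The sketch below is an assumption-laden toy, not Hindsight&apos;s or mem9&apos;s API: the `Row` fields and both search functions are invented, and substring matching stands in for real retrieval. The Tier 2 search returns text and leaves provenance on the row; the Tier 3 search returns the same hits with a concise decoration already attached.

```python
from dataclasses import dataclass


@dataclass
class Row:  # provenance lives on the row in both tiers
    text: str
    source_type: str
    confidence: float
    freshness: str


def search_tier2(rows, query):
    """Returns bare text; the caller must look provenance up separately."""
    return [r.text for r in rows if query.lower() in r.text.lower()]


def search_tier3(rows, query):
    """Returns each hit with a concise provenance decoration already attached."""
    return [
        {
            "text": r.text,
            "source_type": r.source_type,
            "confidence": r.confidence,
            "freshness": r.freshness,
        }
        for r in rows
        if query.lower() in r.text.lower()
    ]


rows = [Row("Alex works at Stripe", "conversation", 0.8, "fresh")]
bare = search_tier2(rows, "stripe")
decorated = search_tier3(rows, "stripe")
```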

The migration is one-way. No system starts at Tier 3 and decides to remove provenance discipline. The columns prove their value as soon as they are present. If you are designing a memory system today, the question is not &quot;do I need provenance?&quot; but &quot;which tier do I want to start at, knowing that every tier above the one I pick is harder to reach later than to bake in now?&quot;

## The cautionary case: computed and discarded

Understand-Anything illustrates what happens when the substrate for confidence exists but the column to record it does not. The system distinguishes deterministic edges (resolved by a project scanner from source files) from inferred edges (guessed by an LLM during semantic analysis). That information is real and meaningful -- a structural import edge deserves higher confidence than a non-code inference. But both are stored as weight 0.7, identical in the graph. The confidence signal is computed at write time and discarded before persistence.

Adding a confidence field would be a small change with a large information-quality payoff. The system already knows which edges are solid and which are guesses. It just does not record the distinction where it matters -- on the row that gets retrieved later, when the agent needs to discriminate between them. This pattern appears across multiple systems in the corpus: information is available at write time, cheap to capture, and then lost because nobody added the column.
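
A sketch of what recording the distinction could look like. It borrows the three level names described for Graphify earlier in this piece and applies them to a generic edge; the `Edge` type, the `make_edge` rule, and the file names are hypothetical and are not Understand-Anything&apos;s code.

```python
from dataclasses import dataclass
from enum import Enum


class Confidence(Enum):
    CONFIRMED = "confirmed"   # deterministic extraction
    LIKELY = "likely"         # high-confidence LLM inference
    AMBIGUOUS = "ambiguous"   # uncertain claim, needs review


@dataclass
class Edge:
    source: str
    target: str
    weight: float
    confidence: Confidence


def make_edge(source, target, resolved_by_scanner, weight=0.7):
    # Hypothetical rule: scanner-resolved edges are CONFIRMED, LLM guesses are LIKELY.
    level = Confidence.CONFIRMED if resolved_by_scanner else Confidence.LIKELY
    return Edge(source, target, weight, level)


def knowledge_gaps(edges):
    """Edges to surface for human review rather than treat as fact."""
    return [e for e in edges if e.confidence is Confidence.AMBIGUOUS]


edges = [
    make_edge("app.py", "db.py", resolved_by_scanner=True),
    make_edge("app.py", "billing docs", resolved_by_scanner=False),
    Edge("db.py", "legacy notes", 0.7, Confidence.AMBIGUOUS),
]
```

All three edges still carry the same weight of 0.7, but a reader can now tell them apart.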

## The honest cost of building this right

Provenance costs bytes. A source ID, a confidence score, a version counter -- each adds a few fields per row. At scale that matters, but it matters far less than the cost of not having them when something goes wrong in production and you cannot trace why the system produced a wrong answer.

The compute cost is lower than expected. Confidence scoring typically requires one additional LLM call at write time (or a deterministic heuristic for structural facts), which is amortised across every subsequent read. Source tracking costs nothing beyond recording an identifier that already exists. Versioning costs one extra column or one append per write, not a full snapshot. Hindsight&apos;s observation tier adds storage proportional to the number of retrieved-and-used facts, but only records what the agent actually consumed, not everything it saw.

The operational cost is the real question. Tier 3 decoration means more data flowing on every retrieval, which increases context-window usage and response latency slightly. The systems that ship this handle it by keeping decorations concise -- a source type enum, a confidence float, a freshness state -- rather than full provenance trees inline. The agent gets enough to discriminate without drowning in metadata.

## What the strongest implementations share

The exemplars across the corpus deserve restating because no two solve the same slice of the problem:

OpenContext mints UUIDs that survive any filesystem reorganisation and exposes them as a citation scheme the agent can use directly. mem9 carries source, agent ID, session ID on every row, version on every update, and decorates search results with source-turn context governed by an explicit budget. Supermemory treats memory as a versioned DAG with typed edges, giving every belief commit history. Hindsight captures every retrieved-and-used fact in an observation tier with full source provenance, evolution history, and per-retriever ranking. Graphify marks every edge with three-level capture confidence, surfacing uncertain edges for human review rather than treating them as fact.

The unifying observation is plain: provenance is not metadata, it is part of the fact. The systems that treat it that way are the ones whose memory does not silently rot, whose contradictions can be adjudicated, whose hallucinations can be traced, and whose belief history can be rewound to any point in the past without having anticipated the need.

The systems that do not are the ones whose users learn eventually that the memory they were trusting was an island all along.

Provenance is the cheapest insurance you can buy at write time. The discipline is to buy it before you find out you needed it.

The next piece walks hybrid retrieval and RRF, the pattern that lets you combine vector and keyword signals without one drowning out the other -- which matters most when provenance tells you a fact is solid but relevance alone would bury it. That piece is coming up.</content:encoded><category>memory</category><category>llm</category><category>agents</category><category>architecture</category><category>provenance</category><author>steven@sbatman.com (Steven Batchelor-Manning)</author></item><item><title>Pay at write time, read for free. The one Agentic Memory move that compounds across every retrieval</title><link>https://blog.sbatman.com/posts/2026-05-14-llm-memory-research-04/</link><guid isPermaLink="true">https://blog.sbatman.com/posts/2026-05-14-llm-memory-research-04/</guid><description>The highest-ROI pattern across 19 agent-memory systems. Pay at write time, read for free. Six forms, why they compose, and the order that pays back fastest.</description><pubDate>Thu, 14 May 2026 00:00:00 GMT</pubDate><content:encoded>&lt;img src=&quot;/posts/llm-memory-research/week-04/hero.png&quot; alt=&quot;Pay at write time, read for free. The one Agentic Memory move that compounds across every retrieval - hero image.&quot; class=&quot;hero-banner-post&quot; /&gt;


If you change one thing about how you build agent memory in 2026, change how much of the work you do at write time. After 19 systems deep on this, no other architectural decision compounds the way this one does. The systems with the cleanest recall behaviour all spend disproportionate compute when a fact enters the store. The systems that don&apos;t are paying interest on that decision forever, on every read, by an agent that no longer has the original context to reason from.

The case is built on an asymmetry so obvious it tends to slide off the eye. Every fact is written once and read many times. In real agent traffic the ratio is typically two to four orders of magnitude. A second of LLM work paid at write time, divided across thousands of subsequent reads, is a rounding error per read. A second of work skipped at write time has to be re-derived on every read, by an agent navigating around the rough edge instead of past it. That arithmetic forces the design conclusion. The corpus has already done the forcing for the field.

Let&apos;s look at what &quot;write time&quot; actually means, the six forms the pattern takes, the convergence evidence that&apos;s hard to argue with, and the honest costs of going this route.

## What write time actually is

Write time is the moment a fact enters the persisted store. The user message that gets retained. The document dropped into the ingest queue. The clipped page that lands on disk. The tool result the agent decides to keep. The background sweep that promotes raw observations into synthesised beliefs.

What write time isn&apos;t is the user-perceived response. The agent has already replied from the previous state of memory; the new state will be visible to the next query. That asynchrony is what makes the cost tolerable. Write-time work runs behind the response, not in front of it. Hindsight is sharp about this discipline. Extraction and entity resolution run synchronously, but the consolidation pass that turns facts into observations is deliberately deferred to a background sweep so retain latency stays low. mem9&apos;s reconcile phase issues an LLM call for every batch of new facts, then hands the result to a background goroutine rather than to the agent&apos;s response. Supermemory&apos;s document-to-chunk-to-memory pipeline does the heavy LLM work entirely off the API request path.

The systems that mix write-time work into the response path suffer for it. The systems that hold the discipline are the ones that ship.

## The six forms the pattern takes

Across the 19 systems, write-time investment has surfaced in six recognisable forms. Most mature systems run several of them at once. The ones that run several are the ones with the cleanest behaviour under load.

Online dedup-and-synthesis is the first form, and the highest-leverage. When a fact arrives, the system queries the existing store for candidates that overlap, then issues a single batch LLM call that emits per-fact actions: add, update, delete, or no-op. SimpleMem&apos;s `add_memories` is the textbook version. mem9&apos;s reconcile is the same pattern at scale. The store never accumulates near-duplicates that have to be filtered or re-ranked on every later read. A subtler benefit is that synthesis surfaces contradictions that flat-write systems never even detect. When &quot;user likes React&quot; arrives followed later by &quot;user has switched to Vue&quot;, Hindsight&apos;s consolidation refines the observation to capture the journey rather than overwriting, so the memory records preference as it evolved, not just the latest state.
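
The control flow can be sketched without committing to any particular model. Everything below is hypothetical: the word-overlap candidate test, the action dictionaries, and `stub_decide`, which stands in for the single batch LLM call. It is not SimpleMem&apos;s `add_memories` or mem9&apos;s reconcile, only the same shape: find overlapping candidates, get one action per fact, apply the actions.

```python
import itertools

_ids = itertools.count(1)  # hypothetical ID source for newly added facts


def find_candidates(store, fact):
    """Hypothetical overlap test: shares at least two words with an existing fact."""
    words = set(fact.lower().split())
    return {
        fid: text
        for fid, text in store.items()
        if len(words.intersection(text.lower().split())) >= 2
    }


def reconcile(store, new_facts, decide):
    """decide stands in for the batch LLM call; it returns one action per fact."""
    batch = [(fact, find_candidates(store, fact)) for fact in new_facts]
    actions = decide(batch)
    for action in actions:
        op = action["op"]
        if op == "add":
            store[f"f{next(_ids)}"] = action["text"]
        elif op == "update":
            store[action["id"]] = action["text"]
        elif op == "delete":
            store.pop(action["id"], None)
        # "noop" leaves the store untouched
    return actions


def stub_decide(batch):
    # Stand-in policy: refine the first overlapping candidate, otherwise add.
    actions = []
    for fact, candidates in batch:
        if candidates:
            fid = sorted(candidates)[0]
            merged = f"{candidates[fid]}; later: {fact}"
            actions.append({"op": "update", "id": fid, "text": merged})
        else:
            actions.append({"op": "add", "text": fact})
    return actions


store = {}
reconcile(store, ["user prefers React for frontend"], stub_decide)
reconcile(store, ["user prefers Vue for frontend"], stub_decide)
```

After both calls the store holds one entry that records the change of preference, rather than two near-duplicates for the retriever to sort out later.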

Atomisation is the second form. Break a statement into the smallest individually retrievable propositions before embedding. A wall of text recovered as a single chunk is opaque to ranking. A paragraph atomised into half a dozen short, self-contained claims gives the retriever something to actually discriminate against. LLM-Wiki&apos;s two-step ingest is the cleanest expression: step one produces a structured analysis that names the entities, concepts, claims, and relations. Step two writes the actual wiki pages, each one functionally an atomised proposition. The Louvain community detection running over the graph only makes sense because the units are atoms. Run it over arbitrary chunks and the community structure means nothing.

Multi-step ingest is the third form. Once you accept that ingest doesn&apos;t have to be a single LLM pass, the pipelines fan out into richer compositions, deterministic where possible, LLM where necessary. Understand-Anything is the most extreme example in the corpus. Six of its nine agents follow the same internal structure: phase one writes and runs a deterministic helper script (Tree-sitter for structure, a Node script for fan-in and fan-out metrics), phase two reads the JSON output and applies LLM judgment. The LLM is explicitly told *&quot;Do NOT re-run file discovery commands or re-count lines, trust the script&apos;s results entirely&quot;*. OpenKB&apos;s compilation pipeline is four steps built around prompt-cache reuse, the cache amortising the document context across many fan-out calls. A multi-step pipeline that&apos;s naïvely implemented is brutally expensive. A multi-step pipeline that&apos;s built around the cache is cheaper per document than the single-shot equivalent.

Provenance metadata is the fourth form, and the simplest to implement. Every entry carries the source it came from, the timestamp it was captured at, optionally the confidence the system had at capture, and optionally the citations that justify it. Hindsight is the most rigorous expression: every fact carries the journalist&apos;s interrogation (what, when, where, who, why), plus typed kind and category, plus source memory ids, plus a consolidated-at timestamp. Supermemory carries provenance at every layer of its three-tier object model: memories are immutable nodes in a versioned DAG connected by typed updates, extends, and derives edges. mem9 carries source, agent id, and session id on every row, plus a versioning column that supports an append-and-archive transaction so the previous version is never lost.

Confidence scoring is the fifth form. Each fact gets a number or a state telling downstream readers how much to trust it. Hindsight uses a freshness lifecycle: observations move from fresh to confirmed as corroborating evidence arrives. Supermemory exposes a relative version distance from the primary memory so clients can render a temporal slider over a memory&apos;s history without re-querying. The hard part isn&apos;t producing the score; it&apos;s calibrating it. The deployment pattern that makes this shippable is shadow mode: ship the heuristic dark first, collect the score distribution against real traffic, and decide the threshold from the data, not from intuition.
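
Shadow mode is a small amount of code. The sketch below is hypothetical throughout: the `ShadowScorer` class, the Jaccard word-overlap heuristic, and the quantile rule for picking a threshold are all illustrative choices, not any system&apos;s implementation. It records a score for every pair it sees, never acts on it, and lets the threshold be read off the recorded distribution afterwards.

```python
class ShadowScorer:
    """Records a score for every candidate pair but never acts on it."""

    def __init__(self, score_fn):
        self.score_fn = score_fn
        self.observed = []

    def check(self, a, b):
        self.observed.append(self.score_fn(a, b))
        return False  # shadow mode: never report a duplicate, only log the score

    def threshold_at(self, quantile):
        """Pick the score at the given quantile of what real traffic produced."""
        if not self.observed:
            raise ValueError("no scores recorded yet")
        ranked = sorted(self.observed)
        index = min(len(ranked) - 1, int(quantile * len(ranked)))
        return ranked[index]


def jaccard(a, b):
    # Hypothetical similarity heuristic: word-set overlap.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa.intersection(wb)) / len(wa.union(wb))


scorer = ShadowScorer(jaccard)
pairs = [
    ("a b c d", "a b c d"),
    ("a b c d", "a b c e"),
    ("a b", "c d"),
    ("a b c", "a x y"),
]
decisions = [scorer.check(a, b) for a, b in pairs]
threshold = scorer.threshold_at(0.5)
```

Only once the recorded distribution looks sensible would `check` be switched to compare against `threshold` and start returning `True`.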

Type tagging is the sixth form. Every atom is labelled with what kind of thing it is: concept, entity, claim, relation, event, conversation. Tolaria&apos;s frontmatter-as-type convention shows that even a convention-only type system delivers most of the value; you don&apos;t need schema validation to get the benefits of typed retrieval. The label gives the retriever a second axis to filter on, which gives the agent a way to ask for &quot;the concepts on this topic&quot; rather than &quot;the chunks that match this query&quot;.

Six forms. They aren&apos;t a checklist where you pick three. They reinforce each other.

## Why the forms compose

A subtle observation that doesn&apos;t come out of any single system but is visible across the corpus: the forms are mutually reinforcing.

Atomisation is more useful when atoms are typed, because the type tells the retriever what kind of atom it has. Type tagging is more useful when atoms have provenance, because the type plus the source lets the retriever filter on both axes. Provenance is more useful when atoms have confidence, because confidence tells the retriever how much to trust the provenance. Confidence is more useful when the system performs online dedup, because dedup folds many low-confidence corroborating sources into a single high-confidence fact with multiple source ids. And the multi-step pipeline that does all of the above is more useful than the single-shot pipeline that does one of them, because the steps can hand structured intermediate results between each other rather than hand prose summaries.

Hindsight illustrates this composition the most cleanly. The atom is the extracted fact. The fact is typed by kind and category. It carries six-dimensional provenance via what, when, where, who, why, plus event date and mentioned-at. It carries confidence implicitly via the freshness state on its derived observations. It&apos;s reconciled into observations via a background batch consolidation pass. The whole thing runs through a multi-step ingest pipeline that separates extraction, entity resolution, embedding, and consolidation. All six forms compose, and the composition is what makes the downstream observation tier legible to the reflection agent at all.

The inverse is also visible. Skipping one form weakens the others. A typed corpus without atomisation gives you typed wall-of-text. An atomised corpus without types gives you a flood of equivalent units the retriever can&apos;t discriminate between. Provenance without confidence tells you where a fact came from but not how much to trust it. Dedup without provenance loses the audit trail of which sources fed which fact. The decision isn&apos;t whether to invest at write time, it&apos;s how many of the six forms to compose, and the corpus suggests the honest answer is &quot;all of them, eventually&quot;.

## The convergence is the evidence

Anyone can find a clean architecture in a single mature system. The interesting question is whether independent systems, starting from different assumptions, end up in the same place. On write-time investment, they do.

The Karpathy LLM Wiki pattern is the clearest convergence. Karpathy&apos;s original gist did the work in a single LLM pass per document. Three independent implementations in the corpus, LLM-Wiki, OpenKB, and Understand-Anything&apos;s knowledge-graph mode, all moved away from that single-pass design within their first major iteration. LLM-Wiki landed on the two-step chain-of-thought pattern explicitly motivated by the observation that single-shot generation forgets to link to existing content. OpenKB landed on a four-step pattern explicitly motivated by prompt-cache reuse. Understand-Anything landed on a six-of-nine two-phase pattern explicitly motivated by the observation that LLMs are slow at counting and wrong at line numbers. Three different starting points, same architectural conclusion. That kind of convergence is the strongest possible evidence that the pattern is load-bearing, not stylistic.

The negative evidence is just as sharp. Systems that started without write-time investment have, at some maturation point, added it. The shift goes one way. No mature memory system in the corpus has reverted from &quot;do work at write time&quot; to &quot;do work at read time&quot;. The systems that were designed for it from the start, Hindsight, Supermemory, Tolaria, carry their architecture forward gracefully. The systems that weren&apos;t are paying interest on the architectural debt forever.

Plainly: if you&apos;re starting a new memory system in 2026 and you skip write-time investment, you are choosing to repeat a journey the field has already finished and learned from.

## The honest costs

This isn&apos;t free. Anyone who&apos;s shipped a memory system will recognise the costs.

Latency goes up on the write path. Multi-step ingest pipelines take seconds, sometimes tens of seconds, per document. Supermemory&apos;s 10,000-docs-per-hour throughput is bottlenecked by extraction LLM cost at roughly three to five LLM calls per document. mem9&apos;s reconcile is a synchronous LLM call on the ingest path; even with batching it dominates the wall-clock cost of writing. OpenKB&apos;s multi-step compilation runs five LLM calls per document even with the prompt cache reused across all of them.

Token spend goes up. Online dedup costs tokens because the LLM needs to see the candidate existing facts. Atomisation costs tokens because the LLM has to be asked to split rather than summarise. Multi-step pipelines cost tokens at every step. Prompt-cache reuse blunts the marginal cost but doesn&apos;t eliminate it.

Engineering complexity goes up. A pipeline with five steps has more failure modes than a pipeline with one. mem9&apos;s extraction prompt has three fallback strategies for malformed JSON, including recovery from a known flattened-fact corruption pattern. Understand-Anything&apos;s merge script has explicit logic for recovering nodes the analysis script dropped, remapping unknown node types, restoring dropped dangling edges. Every multi-step pipeline accretes this kind of defensive code.

The trade-off is real. It&apos;s also, on every honest reckoning across the corpus, lopsided. The reads outnumber the writes by orders of magnitude. The user-perceived latency is on the read path, not the write path. The agent&apos;s reasoning budget is consumed at read time. The hallucinations are produced at read time. Every dimension along which &quot;less work&quot; sounds appealing turns out, on inspection, to be a dimension along which write-time investment buys read-time relief at favourable rates. A rule of thumb the corpus suggests: if a write-time pipeline costs three to five LLM calls per document, but the document will be read hundreds of times across its lifetime, the per-read amortised cost of the write work is far below the cost of running a single additional LLM call at read time to compensate for what wasn&apos;t done at write. The arithmetic is forgiving in a way that almost no other architectural decision in the field is.
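
The rule of thumb is plain division, worked through below. The figures are illustrative: five write-time calls is the top of the range quoted above, and 500 reads is an assumed stand-in for &quot;hundreds of times&quot;. The function names are hypothetical.

```python
def amortised_write_cost(write_calls, reads):
    """LLM calls of write-time work attributable to each read."""
    return write_calls / reads


def break_even_reads(write_calls, read_calls_saved=1.0):
    """Reads after which write-time work is cheaper than compensating on every read."""
    return write_calls / read_calls_saved


# Five LLM calls at write time, spread over an assumed 500 reads.
per_read = amortised_write_cost(write_calls=5, reads=500)

# If skipping the write work costs one extra LLM call per read,
# the write-time pipeline has paid for itself after this many reads.
payback = break_even_reads(write_calls=5)
```

Under those assumptions the write work costs a hundredth of an LLM call per read and pays for itself by the fifth read.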

## The adoption order the corpus implies

For a team retrofitting an existing flat-RAG system, the order that pays back fastest at each step is something like this. Provenance first: it&apos;s the cheapest to add and the prerequisite for almost everything else. Types second: also cheap, and they unlock faceted retrieval immediately. Multi-step ingest third: once provenance and types are in place, refactoring the single-shot extraction into two steps becomes tractable. Atomisation fourth: partially a consequence of multi-step ingest, but worth making explicit. Online dedup fifth: the most invasive form, because it requires the ingest pipeline to read the existing store before deciding what to write. Confidence scoring sixth: shipped in shadow mode first, calibrated from data, acted on later.

This isn&apos;t the only order that works. It&apos;s the one that pays back fastest at each step. Provenance enables debugging immediately. Types enable faceted retrieval immediately. Multi-step ingest enables better extraction immediately. Each step yields a visible improvement and lays groundwork for the next.

If you&apos;re picking up an existing memory system that isn&apos;t getting the read-time quality you want, the single most useful question is the one this corpus is built around: of every piece of work the system is doing on every read, could this have been paid once at write time instead? The answer will be &quot;yes&quot; more often than is comfortable.

The next piece is on confidence and provenance, the substrate that makes everything in this piece auditable rather than mysterious. If the write-time pattern in this piece is the engine, confidence and provenance are the instrumentation. Hindsight&apos;s freshness lifecycle, Supermemory&apos;s versioned DAG, and OpenContext&apos;s per-fact source trails are three different shapes of the same underlying argument, that a memory system without provenance is one you can&apos;t debug, and a memory system without confidence is one you can&apos;t calibrate. That piece is next.</content:encoded><category>memory</category><category>llm</category><category>agents</category><category>architecture</category><category>write-time</category><author>steven@sbatman.com (Steven Batchelor-Manning)</author></item><item><title>Six deployment shapes for agent memory. Did Supermemory get it right?</title><link>https://blog.sbatman.com/posts/2026-05-11-llm-memory-research-03/</link><guid isPermaLink="true">https://blog.sbatman.com/posts/2026-05-11-llm-memory-research-03/</guid><description>Six deployment shapes for agent memory. The architecture everyone debates isn&apos;t what decides how it feels to use. An honest look at Supermemory&apos;s bet.</description><pubDate>Mon, 11 May 2026 00:00:00 GMT</pubDate><content:encoded>&lt;img src=&quot;/posts/llm-memory-research/week-03/hero.png&quot; alt=&quot;Six deployment shapes for agent memory. Did Supermemory get it right? - hero image.&quot; class=&quot;hero-banner-post&quot; /&gt;


Most agent-memory comparisons argue about the wrong axis. After going 19 systems deep on this, what&apos;s clear is that the architecture everyone debates publicly isn&apos;t the thing that decides how a memory system actually feels to use. The deployment shape is. The same architecture in two different shapes is two completely different products.

Six shapes have surfaced across the field. Most teams pick one as primary. The loudest of those six right now is the managed API service, and the loudest exemplar of that shape is Supermemory, which has done a competent job of building exactly what the shape demands. Whether that&apos;s the right shape to have built for is a different question, and one I think the corpus implies an honest answer to.

Let&apos;s look at the six, then at where Supermemory sits in them.

## The question operators actually answer first

Most published comparisons of memory systems classify them by paradigm. Is the engine flat vector RAG, knowledge-graph augmented, progressive compression, multi-index hybrid, LLM-as-retriever, trace-as-memory, a Karpathy-style wiki, or filesystem-native? That&apos;s a useful question. The previous piece in this series was dedicated to it.

It&apos;s not the question an operator answers first when picking a memory system. The question they answer first is shaped like this. Do I want to call an HTTPS endpoint, link a library against my agent, point a desktop app at a folder, run a CLI from a script, or install a skill into the agent I already use?

That&apos;s the deployment-shape question, and it&apos;s orthogonal to paradigm. The same paradigm, multi-index hybrid retrieval, has been built and shipped four ways across the 19 systems. As a managed API service (Supermemory at api.supermemory.ai). As a self-hostable open-source server with a managed endpoint on the side (mem9 at api.mem9.ai). As an in-process Go library that doubles as an MCP server (graymatter). As a filesystem-native context store with an MCP front door (OpenContext). Four products. One paradigm. To the operator they feel like four different categories of thing.

That&apos;s the gap I want to dig into. Why deployment shape dominates the daily experience, what the six shapes actually look like, and whether the most prominent bet in the most prominent shape, Supermemory&apos;s bet on the managed API, is the bet operators picking now should be making.

## The six shapes the field has settled into

Six deployment shapes have emerged across the 19 systems. Each one forces a particular operational shape, a particular trust boundary, and a particular set of trade-offs the operator has to live with from day one.

Managed API service. A hosted endpoint the agent calls over HTTPS. The vendor runs the storage, the embedding pipeline, the upgrades, the on-call rotation. The operator sets a base URL and an API key. Supermemory and mem9&apos;s hosted endpoint sit here.

In-process library. A package the operator&apos;s code links against. The memory engine runs inside the same OS process as the agent. Storage is local, typically a single embedded database file. SimpleMem and graymatter sit here.

Filesystem-native plus MCP. The user&apos;s actual filesystem is the canonical store. Markdown files on disk are the artefact, a small index is a derivative, and an MCP server fronts the system to whichever agent the user is running. OpenContext, Tolaria, second-brain.

Desktop application. A complete user-facing program that includes the memory layer rather than exposing it as a separate component. The user launches the app, sees a UI, interacts with memory through the app&apos;s chrome. LLM-Wiki, Tolaria again, Memex.

CLI tool. A command-line program the operator runs from a shell or script. The artefact is a binary that takes arguments, performs work, returns a result. OpenKB, Graphify, GitNexus.

Skill or hook framework. Memory expressed as agent-side artefacts: skill definitions, slash commands, hooks that fire between turns. There&apos;s no separate process. The memory is in the shape of the agent&apos;s behaviour. oh-my-kiro, Understand-Anything.

Six shapes. Most teams pick one as primary. A small number genuinely span more than one, and those are the most interesting cases in the corpus.

## So, did Supermemory get it right?

Supermemory is the loudest exemplar of the managed API shape, and the most useful test of whether that shape is a good bet for the operator picking now. The honest answer is a qualified yes: yes for the audience they&apos;re built for, with caveats that the audience either doesn&apos;t care about or hasn&apos;t hit yet.

What Supermemory got right is real. The engineering is genuinely competent. The Cloudflare Workers + Durable Objects + Hyperdrive-PostgreSQL stack is well-chosen for the workload. The TypeScript and Python SDKs plus the integrations with Vercel AI SDK, Mastra, LangChain, LangGraph, OpenAI Agents SDK, Agno, VoltAgent, Cartesia, and Pipecat make Supermemory genuinely drop-in for an agent runtime that already exists. The Memory Router pattern, an OpenAI-compatible reverse proxy that injects memories into prompts and harvests memories from completions transparently, is the slickest integration story in the corpus. The connector ecosystem (Google Drive, Gmail, Notion, OneDrive, GitHub, web crawler) and the multi-modal ingestion (PDF, image OCR, video transcription, AST-aware code chunking) are weeks of work the operator doesn&apos;t have to do. If you&apos;re a team building a consumer product where memory is a feature and you don&apos;t want to operate it, Supermemory is the most credible answer in the corpus. None of that is in dispute.

What gives me pause is structural rather than technical. Three things, none of which Supermemory&apos;s team can fix without re-platforming.

The trust boundary places the user&apos;s most intimate context inside someone else&apos;s perimeter. Memory contains what the user has asked, what they&apos;ve been told, what they care about, and the agent&apos;s accumulated model of them. A managed API places that data inside the vendor&apos;s trust boundary, which is reasonable for a fraction of users and structurally unacceptable for another fraction. For privacy-sensitive deployments, regulated industries, or single-developer power users who refuse to ship their context off the local machine, Supermemory&apos;s shape isn&apos;t a trade-off to weigh; it&apos;s a non-starter. That&apos;s not a critique of Supermemory&apos;s posture, which is reassuring. It&apos;s a critique of the shape they bet on, which constrains who they can serve.

Costs scale with usage in ways the operator can&apos;t architecturally prevent. Supermemory&apos;s MCP error pathway includes &quot;402, Memory limit reached. Upgrade at supermemory.ai&quot;, and the platform tracks per-organisation document limits with overage billing. For an agent that calls memory on every turn (which is the recommended pattern for an agent that actually uses memory well), the per-call cost compounds quickly. The operator can budget but can&apos;t architecturally cap. An in-process library has no such failure mode: the cost is the cost of a few embedded-database reads and a few cosine distances, in the agent&apos;s own process, billed to nobody.

The engine is opaque. Supermemory&apos;s repo contains a 1,464-line zod-openapi schema documenting the wire contract in considerable detail: versioning, soft-deletion, the relation enum, the static/dynamic profile split. From the schemas you can reverse-engineer most of what the engine does. You can&apos;t reverse-engineer how. The embedding model, the chunking heuristics, the extraction prompts, the reranker: none of those are inspectable, and when the engine misbehaves on the operator&apos;s specific corpus the only recourse is escalating to support and waiting. That&apos;s the price of the genre, not a Supermemory-specific failing, but it&apos;s a price the corpus shows you don&apos;t have to pay if you pick a different shape.

Set against the other 18 systems, this looks less like Supermemory got it wrong and more like Supermemory committed hard to one shape on the matrix and is now constrained by the cell they&apos;re in. If the audience they&apos;re serving is comfortable with the trust boundary and the cost model, the engineering they&apos;ve built on top of those decisions is genuinely good. If the audience they&apos;re serving isn&apos;t, no amount of engineering rescues a shape mismatch. The most honest answer to the title&apos;s question is: Supermemory got their bet right for their audience, and the operator picking now should ask whether they&apos;re in that audience before defaulting to the loudest option in the room.

mem9 is worth holding up as the counter-example. It ships the same paradigm as Supermemory (multi-index hybrid) in the same shape (managed API at api.mem9.ai), and also ships as a self-hostable Go binary the operator can run on their own infrastructure. The operator who wants to start managed and migrate to self-hosted later doesn&apos;t have to re-platform; they change a base URL and a credential. That&apos;s a structurally less constrained bet than Supermemory&apos;s, and it&apos;s the bet I expect the second-generation systems in this corpus to default to.

## Why the shape often dominates the paradigm

The argument, plainly. Switching paradigm within a shape is a re-implementation. Switching shape is a re-platform.

If you start with Supermemory (managed API) and want to move to graymatter (in-process library), you&apos;re not just changing your memory engine. You&apos;re changing your data-residency posture, your billing model, your operational on-call shape, your trust boundary, and your dependency graph. These are infrastructure concerns, and infrastructure changes are expensive.

By contrast, swapping a flat-vector-RAG engine for a multi-index-hybrid engine within the same shape is mostly a matter of changing the call site and re-tuning the recall path. Same trust model, same operational shape, same billing posture. Different recipe inside.

The shape constrains the paradigm in practice too. A managed API is structurally suited to multi-index hybrid (the most popular paradigm in this column by a wide margin) and structurally awkward for filesystem-native (which presupposes the user&apos;s filesystem). A skill framework can&apos;t ship a heavyweight engine because it has no engine; the host agent does the work. The shape choice rules out a substantial fraction of the paradigm space before the operator gets to choose paradigm.

And the operational surface is the daily experience. Once a system is deployed, the operator interacts with the shape, not the paradigm. They manage subscriptions and API keys, or embedded database files and library versions, or a vault folder and an MCP configuration, or desktop application updates, or shell scripts and batch jobs, or skill files and hook scripts. Whichever shape they&apos;ve chosen, the daily texture of their work is shaped by it. The paradigm is a property of the engine that mostly matters at recall time. The shape is a property of everything else.

## Three trade-offs the shape forces, every time

The trade-offs that come with each shape compound across years. Three are worth naming because they show up in every shape and resolve differently in each.

Trust delegation. Where does the user&apos;s context live? A managed API places that context inside the vendor&apos;s trust boundary. Supermemory&apos;s marketing is reassuring, the encryption posture is reasonable, the Cloudflare-native architecture is competent. None of that is the same as running the engine on hardware you own. For some users this trade is fine, they already trust OpenAI with far more. For others it&apos;s a non-starter. An in-process library or a filesystem-native shape keeps the data on the operator&apos;s machine. A skill framework keeps the data wherever the host agent already keeps it. The trade-off resolves differently per shape, and you can&apos;t move it later without re-platforming.

Operational ownership. A managed API outsources operations to the vendor. An in-process library puts the operator on call. A filesystem-native shape splits the responsibility, the application owns its index, the user owns the vault, and reconciliation between them is the engineering. A desktop application owns its chrome and asks the user to handle the rest. A CLI is operational only when invoked. A skill framework piggybacks on the host agent&apos;s operations. None of these are wrong. They&apos;re different bets about who&apos;s awake at 3am when something breaks.

Distribution shape. A managed API ships through an `npm install` or a `pip install` plus an API key. An in-process library ships through the same channels but with no key. A filesystem-native system ships as an installer plus a folder. A desktop application ships as a code-signed platform binary with auto-update. A CLI ships through a package manager. A skill framework ships through the host agent&apos;s plugin protocol. The cost of a release is wildly different across the six. Supermemory pushes a Cloudflare Worker in seconds. Tolaria ships a calendar-versioned `YYYY.M.D` Tauri build to three operating systems with notarisation. Different shapes, different release cadences, different feedback loops with users.

These three trade-offs aren&apos;t abstract. They&apos;re the texture of working with a memory system day to day. The shape locks them in early, and the paradigm choice rides on top.

## The systems that span more than one shape

A handful of systems in the corpus genuinely occupy more than one shape, and they&apos;re the most instructive cases because they show what it actually costs to do so.

mem9 is the corpus&apos;s first system that&apos;s both a fully open self-hostable engine and a managed API endpoint at the same time. The README states the position with unusual precision. Switching between the hosted endpoint and the self-hosted server is &quot;a base-URL and credential change, not a plugin rewrite&quot;. Three architectural properties combine in mem9 that no other system in the corpus exhibits together: a managed externally-callable API, a multi-backend storage abstraction spanning TiDB, PostgreSQL, and a third backend called db9, and auto-provisioning of a fresh database per tenant. Each property is a direct consequence of being both self-hostable and managed. The cost is real: multi-backend abstraction, per-tenant provisioning, spend-limit middleware, control-plane and data-plane separation. The cost is also finite, and once paid the engine spans two cells of the matrix at once.
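
What a base-URL and credential change looks like from the calling side, as a minimal sketch. Everything here except the hosted hostname is an assumption for illustration: the `MemoryEndpoint` type, the `/search` path, the bearer-token header, and the localhost address are hypothetical and not taken from the mem9 codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryEndpoint:
    """The only two values that differ between hosted and self-hosted."""
    base_url: str
    api_key: str


def build_call(endpoint, path):
    """Return the URL and headers for a call (path and auth scheme are hypothetical)."""
    url = endpoint.base_url.rstrip("/") + "/" + path.lstrip("/")
    headers = {"Authorization": "Bearer " + endpoint.api_key}
    return url, headers


hosted = MemoryEndpoint(base_url="https://api.mem9.ai", api_key="hosted-key")
self_hosted = MemoryEndpoint(base_url="http://localhost:8080", api_key="local-key")

# The call site is identical; only the endpoint value changes.
hosted_url, hosted_headers = build_call(hosted, "/search")
local_url, local_headers = build_call(self_hosted, "/search")
```

The point of the sketch is that nothing downstream of `build_call` needs to know which deployment it is talking to.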

second-brain spans three shapes in a single Python codebase. It&apos;s filesystem-native (SQLite plus the user&apos;s folder), a desktop runtime (terminal REPL), and a skill framework (a Telegram bot acting as a hosted agent surface). One process, three surfaces, and a SQLite authorizer hook gating per-agent reads at the C layer to make multi-shape multi-agent isolation safe. It&apos;s not the cleanest system in the corpus, but it&apos;s the most ambitious about deployment-shape porosity.
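
Python&apos;s standard `sqlite3` module exposes SQLite&apos;s authorizer callback through `Connection.set_authorizer`, so the gating idea can be sketched in a few lines. This is not second-brain&apos;s code: the table names and the allow-list below are hypothetical.

```python
import sqlite3


def gate_reads(conn, allowed_tables):
    """Deny column reads on any table outside allowed_tables."""

    def authorizer(action, arg1, arg2, db_name, source):
        # For SQLITE_READ, arg1 is the table name and arg2 is the column name.
        if action == sqlite3.SQLITE_READ and arg1 not in allowed_tables:
            return sqlite3.SQLITE_DENY
        return sqlite3.SQLITE_OK

    conn.set_authorizer(authorizer)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE work_notes (body TEXT)")
conn.execute("CREATE TABLE private_notes (body TEXT)")
conn.execute("INSERT INTO work_notes VALUES ('standup at ten')")
conn.execute("INSERT INTO private_notes VALUES ('dentist on friday')")
conn.commit()

# Hypothetical agent that may only read work_notes.
gate_reads(conn, allowed_tables={"work_notes"})

work_rows = conn.execute("SELECT body FROM work_notes").fetchall()

try:
    conn.execute("SELECT body FROM private_notes").fetchall()
    private_read_blocked = False
except sqlite3.DatabaseError:
    private_read_blocked = True
```

SQLite consults the authorizer while compiling each statement, so the denied query fails before any row is read. That is what makes the check hard for agent-side code to bypass.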

graymatter ships a single ~10MB Go binary that becomes an in-process library, an MCP server, an HTTP server, a CLI, or a TUI dashboard depending on how you invoke it. Same engine, five surfaces. The library API is sixteen public symbols, of which three cover ninety-five percent of use. The whole production surface fits on one screen. This is the cleanest single-binary expression of the deployment-shape question I&apos;ve seen.

These multi-shape systems are still rare. But they&apos;re not anomalies. They&apos;re the leading edge of a porosity that&apos;s been latent in the field since the beginning, and they suggest the binary &quot;managed-or-library&quot; framing is probably the wrong one. A more useful question is, what is the shape of the engine, and which deployment surfaces are exposed?

## The honest checklist for picking now

If you&apos;re picking a memory system for a real project today, the corpus implies you should commit to shape before you commit to paradigm. The honest version of the choice is something like this.

Will the user&apos;s context leave the user&apos;s machine? If no, the managed API shape is out. This is a substantial fraction of single-developer power users and most regulated industries, and the constraint is binary, not negotiable.

Does the operator want to take operational responsibility for the engine? If no, the managed API is in and the in-process library and filesystem-native shapes are out, or significantly harder. The team&apos;s capacity to be on call is the cap.

Will the agent call memory on every turn? If yes, the CLI tool shape is out: latency is too high. The in-process library is the strongest candidate, since the recall path is microseconds plus embedding time inside the agent&apos;s own process.

Is the user a non-developer? If yes, the desktop application shape is the only one that fits without re-skilling them. CLIs and MCP configurations are non-starters for non-technical users.

Is the agent the host&apos;s agent (Claude Code, Cursor, Kiro CLI), and does the host expose a hook-and-skill surface? If yes, the skill or hook framework shape offers the lightest deployment, the host does most of the work.

Is the substrate the user&apos;s existing notes folder, and do they expect to keep editing it directly? If yes, the filesystem-native shape is the only one that respects the constraint. Bolting a database on top would break the user&apos;s relationship with their own files.

Once the shape is fixed, the paradigm question becomes manageable. There are still real choices to make inside any column, multi-index hybrid versus knowledge-graph augmented versus Karpathy LLM Wiki, but the choice is bounded by what fits inside the shape you&apos;ve already committed to. The corpus has 19 systems in 15 populated cells of the shape-by-paradigm matrix. The job is to pick the right cell and then pick from the systems in it, not to pick a paradigm in the abstract and discover later that no exemplar in your shape actually implements it.
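
The checklist above reads naturally as an elimination pass over the six shapes. A minimal sketch follows, with two simplifications that are assumptions rather than claims from the corpus: &quot;out, or significantly harder&quot; is treated as out, and the skill-framework question is left aside because it expresses a preference rather than an exclusion. The function and parameter names are hypothetical.

```python
SHAPES = frozenset({
    "managed_api", "in_process_library", "filesystem_native",
    "desktop_app", "cli_tool", "skill_framework",
})


def candidate_shapes(context_may_leave_machine, operator_runs_engine,
                     memory_on_every_turn, user_is_developer,
                     notes_folder_is_substrate):
    """Apply each checklist question as an elimination over the six shapes."""
    remaining = set(SHAPES)
    if not context_may_leave_machine:
        remaining.discard("managed_api")
    if not operator_runs_engine:
        remaining -= {"in_process_library", "filesystem_native"}
    if memory_on_every_turn:
        remaining.discard("cli_tool")
    if not user_is_developer:
        remaining = remaining.intersection({"desktop_app"})
    if notes_folder_is_substrate:
        remaining = remaining.intersection({"filesystem_native"})
    return remaining


# A privacy-sensitive developer whose agent recalls on every turn.
power_user = candidate_shapes(
    context_may_leave_machine=False,
    operator_runs_engine=True,
    memory_on_every_turn=True,
    user_is_developer=True,
    notes_folder_is_substrate=False,
)
```

An empty result means the constraints conflict as stated, which is itself useful to learn before choosing a paradigm.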

## Where this leaves the field

The corpus&apos;s existing porosity (mem9 spans two cells, second-brain spans three, graymatter blurs the library/CLI/MCP boundaries) suggests the boundaries between shapes are conventions rather than constraints. The systems built next will probably commit less to a single shape and more to a deployment surface that exposes multiple shapes from one engine, the way mem9 ships one Go server that runs both as `api.mem9.ai` and as `make build`.

That&apos;s not the canonical shape story this corpus inherited from the field&apos;s first generation, where almost every system was clearly one thing. It&apos;s plausibly the shape story of the second generation, where the engine is the same and the deployment surface is what changes.

For the operator picking now, the practical advice is short. Pick the shape because it fits the constraint that&apos;s hardest to move later: where the data lives, who&apos;s on call, who the user is. Pick the paradigm inside that shape from what&apos;s available there. Be prepared to discover that the most popular paradigm in the abstract isn&apos;t the most popular paradigm in your column, and that&apos;s fine: popularity in the abstract isn&apos;t a constraint your project actually has to satisfy.

The next piece is on write-time investment, the highest-ROI design decision across the 19 systems, and the one move that compounds at every subsequent read. It&apos;s the principle that pulls SimpleMem&apos;s online synthesis, Hindsight&apos;s async consolidation, and LLM-Wiki&apos;s two-step ingest into the same pattern. If you&apos;ve ever wondered why the systems with the cleanest recall behaviour all spend disproportionate compute at write time, that piece is the answer.</content:encoded><category>memory</category><category>llm</category><category>agents</category><category>architecture</category><category>deployment</category><author>steven@sbatman.com (Steven Batchelor-Manning)</author></item><item><title>Eight Agentic Memory Paradigms</title><link>https://blog.sbatman.com/posts/2026-05-07-llm-memory-research-02/</link><guid isPermaLink="true">https://blog.sbatman.com/posts/2026-05-07-llm-memory-research-02/</guid><description>19 agent-memory systems resolve into 8 architectural paradigms, not 8 flavours of one recipe. Eight different bets on what memory is. Pick the one that fits.</description><pubDate>Thu, 07 May 2026 00:00:00 GMT</pubDate><content:encoded>&lt;img src=&quot;/posts/llm-memory-research/week-02/hero.png&quot; alt=&quot;Eight Agentic Memory Paradigms - hero image.&quot; class=&quot;hero-banner-post&quot; /&gt;


The first piece in this series landed on the one thing 19 open-source agent-memory systems agree on, and it was a negative. Flat vector RAG, on its own, isn&apos;t enough. Every team that started there ended up adding something. The agreement ends there: on what to add, the teams diverge.

(Need a catch-up? The first piece is here: [LINK_TO_PIECE_1])

This piece is what&apos;s behind that &quot;what to add&quot; disagreement. Once you actually look at what the 19 added, the surface variety resolves into a small, sharp set of architectural commitments. Eight of them.

Eight paradigms. Not eight implementations of the same recipe. Eight different bets about what memory fundamentally is.

That distinction matters more than it sounds. Most of what reads as fragmentation across the agent-memory field is actually a small number of incompatible commitments dressed up in different vocabularies. Once you can name the eight, the field stops looking chaotic and starts looking like a design space with known edges.

Let&apos;s look at the eight, then at what they&apos;re really arguing about.

## The eight, in one sentence each

Flat vector RAG with structured extras (Flat-RAG). Embed everything, retrieve top-k, prepend to prompt. Then layer something on top, because nobody ships it bare. SimpleMem and early Memex are the textbook examples.

Knowledge-graph augmented (Graph). Memory is a typed graph of entities and typed edges. Recall is traversal, not similarity. Graphify, EdgeQuake, GitNexus.

Progressive compression (Prog-Compression). Memory is a hierarchy of increasingly summarised representations, with heat-gated promotion between tiers. The verbose original fades; the dense distillation persists. MemoryOS is the textbook implementation.

Multi-index hybrid search (MI-Hybrid). Several indexes running in parallel, fused at recall time, almost always with Reciprocal Rank Fusion at k=60. Hindsight, Supermemory, mem9, graymatter.

LLM-as-retriever (LLM-Retriever). Skip the vector store. Give the model a hierarchical map of the documents and let it navigate. Memex evolved into this. OpenKB ships it. Supermemory&apos;s rewrite mode runs on it.

Trace-as-memory (Trace). The agent&apos;s own execution history is the memory. Not the user&apos;s documents. Moraine is the purest case. Hindsight&apos;s observation tier is a hybrid component.

Karpathy LLM Wiki (Wiki). Plain Markdown, wikilinks, frontmatter, an index file as the catalogue, the user as the ultimate curator. Three independent teams rebuilt this in 2025. Understand-Anything reads it. OpenKB writes it. LLM-Wiki does both as a desktop app.

Filesystem-native context store (FS-Native). The file is the artefact. The database is a derivative cache. When in doubt, the disk wins. OpenContext, Tolaria, second-brain.

Eight commitments. Read them again as commitments rather than as feature lists. (Trace) says memory is what the agent did. (Prog-Compression) says memory is the densest faithful summary of what&apos;s been observed. Those aren&apos;t different settings on the same dial. They&apos;re different answers to &quot;what is the system trying to remember?&quot;
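
The fusion step named in the multi-index hybrid entry is small enough to show whole. This is a minimal sketch of standard Reciprocal Rank Fusion with k=60, not any particular system&apos;s implementation. The two input rankings are made-up document ids, and the alphabetical tie-break is an assumption added for determinism.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of ids: each list adds 1 / (k + rank) per id, rank from 1."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["a", "b", "c"]   # hypothetical result of a vector index
keyword_hits = ["b", "d", "a"]  # hypothetical result of a keyword index

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Only ranks enter the formula, never raw scores, which is why indexes with incomparable scoring scales can be fused without normalisation.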

## What the paradigms are actually arguing about

The temptation is to read the eight as a tier list and pick the one with the best benchmark. That&apos;s a category error.

Each paradigm is a commitment about three things at once. What kind of question the agent will ask. What shape the answer should take. Where the cost of getting from one to the other should be paid.

A coding agent asking &quot;what breaks if I change this function&apos;s return signature?&quot; wants structural intelligence. Cosine similarity over chunks of source code is a poor substitute for a typed graph of call edges. (Graph) is the natural fit, and a (Flat-RAG) system trying to answer the same question will lose to a graph system every time.

A research agent reading a 400-page report wants navigation. It wants to descend into the section that matters and ignore the rest. (LLM-Retriever) over a hierarchical map of the document beats (Flat-RAG)&apos;s top-k cosine search, because the question shape is &quot;find the right region, then read carefully&quot; rather than &quot;find the most similar 800-token window&quot;.

A personal assistant accumulating context over years wants compression. Without it the store grows linearly in tokens until retrieval drowns. (Prog-Compression)&apos;s tiered hierarchy is the answer. A (Flat-RAG) system at year three is a system whose retrieval is mostly stale chunks pretending to be relevant.

An observability-minded team operating agents in production wants the trace. (Trace) turns &quot;what did this agent do last Tuesday at 3pm&quot; into a literal query. Most systems with rich logging can&apos;t answer that question because the logs aren&apos;t memory; they&apos;re text dumps the agent itself can&apos;t see.

The paradigm-fits-question framing is the most useful single move you can make on this material. Once it clicks, the eight stop being competitors and start being tools, each with a clean problem shape it&apos;s right for and a set of problem shapes it&apos;s wrong for.

## Some of these don&apos;t compose

The mature systems in the corpus increasingly hybridise. Hindsight is primarily (MI-Hybrid) with a (Trace) observation tier and a (Graph) entity sub-component. Supermemory is (MI-Hybrid) with a versioned-DAG schema that&apos;s (Graph)-adjacent and an (LLM-Retriever) rewrite mode. LLM-Wiki is (Wiki) with (Graph) community detection and opt-in (MI-Hybrid) vector retrieval. OpenContext is (FS-Native) with (MI-Hybrid) retrieval mechanics layered on top.

That&apos;s the optimistic reading. The pessimistic reading, which is also the accurate one, is that composition isn&apos;t free.

Take (Prog-Compression) and (Trace) together. (Prog-Compression) says the verbose original fades; only the densest representation persists. (Trace) says the verbose trace is the memory; compressing it is destroying the artefact. You cannot make both your primary mode without an internal contradiction. You can run them as separate stores with separate query paths, which is what Hindsight does. But the choice of which gets first call on retrieval is an architectural decision that propagates through the whole stack.

Or take (Wiki) and (Flat-RAG) together. The (Wiki) paradigm makes a deliberate philosophical refusal: memory should be a persistent compounding artefact a human can read, not a re-derivation from chunks on every query. Bolting a vector store onto a wiki to &quot;improve retrieval&quot; doesn&apos;t compose them; it turns the wiki into chunked text the LLM happens to have access to and loses the compounding property that made the wiki worth building.

(Graph) and (FS-Native) compose more cleanly, because filesystem-native storage and graph-as-derivative-cache aren&apos;t fighting over the same role. The graph is a query optimisation; the file is the source of truth. But even there, the synthesis question (what does the graph do when the file changes?) is an open problem across every system in the corpus that&apos;s tried it.

Put plainly:

The eight paradigms compose where their commitments don&apos;t collide. Where the commitments collide, composition is a deliberate engineering choice that costs something on both sides.

The systems that have done this well have done it by being explicit about which paradigm is primary and which is secondary, and by accepting that the secondary&apos;s strengths are partly muted in the bargain.

## Five angles on why this matters

Hindsight is the clearest case for paradigm-as-strategy rather than paradigm-as-style. Its TEMPR retrieval system doesn&apos;t just run multiple indexes; it has a deliberate routing policy that decides which paradigm to lean on for a given query. Temporal queries go to the trace tier. Relational queries go to the entity tier. Semantic queries go to vector RRF. The routing is the system. Strip the routing out and you have four indexes that disagree about the answer.
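
The shape of such a routing policy can be sketched with a toy classifier. This is not Hindsight&apos;s TEMPR implementation: the keyword lists and the `route` function are hypothetical stand-ins for whatever classification a real router uses.

```python
import re

# Hypothetical cue words; a production router would use something richer.
TEMPORAL_CUES = re.compile(
    r"\b(yesterday|today|last|ago|when|before|after)\b", re.IGNORECASE)
RELATIONAL_CUES = re.compile(
    r"\b(who|owns|related|depends|between)\b", re.IGNORECASE)


def route(query):
    """Pick the tier to lean on first for a query."""
    if TEMPORAL_CUES.search(query):
        return "trace"
    if RELATIONAL_CUES.search(query):
        return "entity"
    return "vector_rrf"


temporal = route("What did the agent do last Tuesday?")
relational = route("Who owns the billing service?")
semantic = route("Notes about caching strategies")
```

The ordering of the checks is itself a policy decision: here a query with both temporal and relational cues goes to the trace tier.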

EdgeQuake makes the case from the cost side. Its multi-pass gleaning extractor catches 15-25% more entities than a single-pass extractor and dominates the system&apos;s compute bill by an order of magnitude over retrieval. (Graph) isn&apos;t expensive at recall time. It&apos;s expensive at write time, in a way that makes (Flat-RAG)&apos;s &quot;embed and forget&quot; feel almost free by comparison. The teams that picked (Graph) picked it knowing this. The teams that drift into it accidentally end up surprised by their LLM bill.

MemoryOS makes the case from the long-tail side. Its three-tier STM/MTM/LPM hierarchy with the heat-gated promotion formula isn&apos;t a clever optimisation on top of flat-RAG. It&apos;s a different commitment about what storage is for. STM holds raw turns. MTM holds segment summaries promoted by access pressure. LPM holds the user profile that survives indefinitely. A system without that hierarchy at year three is a system whose retrieval has quietly become noise.

Moraine makes the case from the observability side, and from the most extreme end of the spectrum. There&apos;s no embedding model. No vector database. No LLM in the operational dataplane at all. ClickHouse, materialised views, BM25 over the verbatim trace. The argument is that for its specific problem, &quot;what did the agent do, when, with what tokens, and what was the outcome&quot;, structured trace storage is the better trade than any retrieval-by-similarity mechanism. The paradigm fits the question.

Understand-Anything, OpenKB, and LLM-Wiki together make the case from the curation side. Three teams, working independently, built variants of the same Karpathy sketch in the same year. Plain Markdown. Wikilinks. An index file. The compounding artefact, with the human as the eventual arbiter of staleness and the LLM as the heavy lifter of synthesis. Convergent design across three independent teams is rare in this field. When it happens, it&apos;s worth taking seriously as a signal that the underlying commitment is sound.

Five cases, five paradigms, same underlying point. Each one is a deliberate answer to a question the other four were asking less directly.

## What to do if you&apos;re picking now

If you&apos;re building agent memory in 2026, the choice of paradigm is more consequential than the choice of database, the choice of embedding model, or the choice of retrieval algorithm. The paradigm sets which of those choices matter.

Characterise the questions your agent will ask before you characterise the documents it will read. The wrong paradigm with the right corpus retrieves the wrong things. The right paradigm with a worse corpus still works for the questions it was designed for. Map question shape to paradigm and pick a primary. Plan for one secondary if the question shape is genuinely mixed. Don&apos;t plan for three.

Be honest about which paradigm&apos;s weaknesses you&apos;re taking on. Knowledge-graph systems are the most expensive to build and the hardest to maintain when the source changes. Progressive compression loses information by design. Multi-index hybrid systems need an explainable fusion stage or they become folklore. Karpathy-wiki systems push curation work back onto the user. Filesystem-native systems push synthesis back onto the agent. Trace-as-memory captures actions but not facts. None of these is disqualifying. All of them are worth knowing about up front.

Take the deployment shape as seriously as the paradigm. A managed API and an in-process library implementing the same paradigm feel like completely different products to the team using them. The choice between them is often more consequential than the paradigm itself, because deployment dictates who owns the data, who runs the infrastructure, and what the trust model is. Pick the deployment shape first. Pick the paradigm inside it.

## Where the field is going

The longer-horizon read of the eight paradigms is that the equilibrium probably isn&apos;t eight. It&apos;s two or three composite paradigms that absorb three or four of the others each.

Three candidates are already visible in the corpus. Hybrid retrieval with provenance and an observation tier, which is (MI-Hybrid) plus (Trace) plus the write-time-investment habit from (Flat-RAG). Hindsight is the closest current expression. Karpathy wiki on a filesystem-native substrate, which is (Wiki) plus (FS-Native) with (LLM-Retriever) as a query mode. LLM-Wiki and OpenKB are moving towards this. Knowledge-graph systems for structurally rich domains, which is (Graph) with (MI-Hybrid) mechanics for fuzzy entry-point selection. EdgeQuake, GitNexus, and Graphify are converging on this shape.

The pure single-paradigm systems aren&apos;t going to disappear. SimpleMem-style (Flat-RAG) is the right answer when the budget is small and the question shape is genuinely flat. Moraine-style (Trace) is the right answer when the question is about what the agent did rather than what the world contains. The point isn&apos;t that the paradigms are converging. The point is that the loudest paradigm right now isn&apos;t necessarily the right paradigm in six months, and the problem has been the same all along.

If there&apos;s a single piece of advice that falls out of the eight, it&apos;s this. Pick the paradigm because it fits the problem. Not because it&apos;s the loudest paradigm in the room. Not because the benchmark on the dataset that doesn&apos;t match your domain says so. Not because the framework you already use happens to ship with one. The shape of the question your agent will ask is the only thing that should drive the answer.

Next up I&apos;m writing about the patterns that show up across most of the eight. Provenance. Hybrid retrieval. Write-time investment. The handful of moves that are universal even when the paradigm choices aren&apos;t.

If this was useful, share it with someone in your network who&apos;s building agent memory right now, or who&apos;s about to. The conversations I keep ending up in with engineers and tech leads working on this stuff are the best part of writing the series, and the only way more of them happen is if pieces like this find the right people.</content:encoded><category>memory</category><category>llm</category><category>agents</category><category>architecture</category><category>paradigms</category><author>steven@sbatman.com (Steven Batchelor-Manning)</author></item><item><title>What 19 agent-memory systems all agree on (and it&apos;s a negative)</title><link>https://blog.sbatman.com/posts/2026-05-02-llm-memory-research-01/</link><guid isPermaLink="true">https://blog.sbatman.com/posts/2026-05-02-llm-memory-research-01/</guid><description>Nineteen open-source agent-memory systems agree on one thing, and it&apos;s what not to do. Vector RAG isn&apos;t enough on its own. Here&apos;s what they all added.</description><pubDate>Sat, 02 May 2026 00:00:00 GMT</pubDate><content:encoded>&lt;img src=&quot;/posts/llm-memory-research/week-01/hero.png&quot; alt=&quot;What 19 agent-memory systems all agree on (and it&apos;s a negative) - hero image.&quot; class=&quot;hero-banner-post&quot; /&gt;


Over the last few months I&apos;ve continued down the rabbit hole of agentic memory systems, digging deep into open-source repos and research papers. It&apos;s clear the current solutions out there agree on one thing, and it&apos;s what not to do.

Vector RAG, on its own, is not enough.

Of the 19 systems, every single one that started with flat vector RAG ended up adding something on top of it. Every one. The additions are all over the place. Some added knowledge graphs. Some added trace logs. Some added a hierarchical wiki the LLM navigates by reading files. Three teams independently rebuilt Andrej Karpathy&apos;s late-2025 LLM wiki sketch and called it memory. Two teams ripped the vector store out partway through and replaced it with ripgrep over a structured folder tree. The variety of what people added is striking. The total agreement that something must be added is the most useful single finding I&apos;ve spotted.

That&apos;s the consensus. Let&apos;s look at why, and where it&apos;s coming from.

## The question every team is trying to answer

Digging through the systems, it&apos;s clear they&apos;re all working towards one question.

How do you give a language model usable, durable, agent-friendly memory beyond what fits in a context window?

That&apos;s the field. Everything else, the embeddings, the graphs, the wikis, is a particular bet on how to answer it. Each project&apos;s README, in its preferred dialect, is a restatement of that one sentence.

That&apos;s where the alignment stops. From there, the solutions diverge.

## What the disagreement actually looks like

Eight paradigms. Sharp boundaries. Real engineering trade-offs that pull in incompatible directions. Most teams take a primary position on one paradigm and a secondary position on another. A few hybridise across three. Nobody&apos;s converged on a winner.

The eight, one sentence each.

1. Flat vector RAG with structured extras. Embed everything, retrieve top-k, prepend to prompt, then add a layer or three of cleverness on top. SimpleMem and Memex started here.

2. Knowledge-graph augmented. Memory as a typed graph of entities and relationships, retrieved by traversal plus vector ranking. Graphify, EdgeQuake, GitNexus.

3. Progressive compression. Memory as a hierarchy of increasingly summarised representations, with heat-gated promotion between tiers. MemoryOS is the textbook implementation. SimpleMem hybridises into this from paradigm 1.

4. Multi-index hybrid search. Multiple indexes, one fusion stage, almost always Reciprocal Rank Fusion at k=60. Hindsight, Supermemory, Graymatter, OpenContext, mem9.

5. LLM-as-retriever. Forget the vector store, give the LLM a hierarchical map of the documents, have it navigate to the answer. Memex evolved into this. OpenKB ships it. Supermemory&apos;s rewrite mode runs on it.

6. Trace-as-memory. Memory is the agent&apos;s own execution history, not the user&apos;s data. Moraine in its purest form. Hindsight&apos;s observation tier as a hybrid component.

7. Karpathy LLM wiki. Plain Markdown, wikilinks, frontmatter, an index file as the catalogue, the user as the ultimate curator of staleness. Three independent implementations. Understand-Anything reads it. OpenKB writes it. LLM-Wiki does both as a desktop app.

8. Filesystem-native context store. The file is the artefact. The database is a derivative cache. When in doubt, the disk wins. OpenContext, Tolaria, second-brain.
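
The fusion stage named under multi-index hybrid search is worth seeing concretely, because it is only a few lines. Reciprocal Rank Fusion gives each document a score of 1 / (k + rank) from every index that returned it and sums the results. The sketch below uses k=60 as above; the function name and the document ids are made up for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids into one ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first; ties break on id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

Here `doc_b` comes out on top because it ranks highly in both lists, ahead of `doc_a`, which is first in one list but third in the other. RRF only looks at ranks, never at raw scores, which is why it can fuse indexes whose scores aren&apos;t comparable.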

The boundaries between these aren&apos;t stylistic. They&apos;re different commitments about what memory fundamentally is. Paradigm 6 says memory is what the agent did. Paradigm 3 says memory is the densest faithful summary of what&apos;s been observed. You can&apos;t make both your primary mode without an internal contradiction. You can compose them, and a few systems do, but composition forces a choice about which gets first call on retrieval, and that choice has architectural consequences that propagate through the whole stack.

That&apos;s the framing that, more than any other, explains why the field looks the way it does.

## The thing everyone agrees on

Of the 19 systems, every one that started with flat vector RAG as its primary mechanism has ended up adding something on top of it. The additions aren&apos;t all the same. The agreement that something must be added is universal. Put plainly:

No serious system believes that &quot;embed everything, retrieve top-k by cosine, prepend to prompt&quot; is sufficient as a memory architecture for an agent.

That sentence is worth being careful with. It&apos;s not a claim that vector search is useless. Vector search shows up in roughly three quarters of the 19 systems. It&apos;s the substrate. It&apos;s just no longer treated as the solution.

Hindsight is the clearest case. Its documentation enumerates four pain points of pure vector RAG and then architects the rest of the system around answering each one. Pure semantic similarity loses temporal signal. Facts get disconnected. Agents need to consolidate. Reasoning style is bank-specific. Hindsight&apos;s four headline subsystems map almost directly onto those four pain points. The system is, in the words of its own deep-dive, a complete answer to &quot;what does an agent memory system look like if you take the limitations of RAG seriously?&quot;

EdgeQuake makes the same case with a different illustration. Imagine the query &quot;How did Sarah Chen&apos;s research on neural networks influence the work of her colleagues at Quantum Dynamics Lab?&quot; A flat vector store returns disconnected chunks. One mentioning Sarah Chen. One about neural networks. One about Quantum Lab. The system can&apos;t follow the influence chain across documents because the chain was never indexed. What was indexed was the similarity of each chunk&apos;s embedding to the query&apos;s embedding. Similarity isn&apos;t relationship.
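
A toy graph shows the difference. The edges and the intermediate names below are invented for illustration and say nothing about EdgeQuake&apos;s actual schema. The point is only that once relationships are indexed as edges, the influence chain becomes a path a traversal can return.

```python
from collections import deque

# Hypothetical typed edges extracted at write time: (source, relation, target).
EDGES = [
    ("Sarah Chen", "authored", "Neural Pruning Paper"),
    ("Neural Pruning Paper", "cited_by", "Sparse Training Report"),
    ("Sparse Training Report", "authored_by", "Raj Patel"),
    ("Raj Patel", "member_of", "Quantum Dynamics Lab"),
]


def find_path(start: str, goal: str) -> list[str]:
    """Breadth-first search returning the chain of hops from start to goal."""
    adjacency: dict[str, list[tuple[str, str]]] = {}
    for src, rel, dst in EDGES:
        adjacency.setdefault(src, []).append((rel, dst))
    queue = deque([(start, [start])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for rel, nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [f"-{rel}-", nxt]))
    return []  # no chain exists between the two entities


path = find_path("Sarah Chen", "Quantum Dynamics Lab")
```

The returned path walks from the researcher through the paper, the report that cites it, and its author, to the lab. No single chunk contains that chain, so no amount of cosine similarity would have surfaced it.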

SimpleMem makes the case in benchmark numbers. Its paper compares the cost-versus-quality frontier of four memory approaches on LoCoMo. Pure vector retrieval is the bottom of the chart on F1 and the middle on cost. The systems that beat it all do something more than retrieve top-k. SimpleMem&apos;s contribution is intent-aware retrieval planning plus online write-time synthesis. A-Mem&apos;s is iterative LLM reasoning loops. Mem0&apos;s is per-write LLM extraction. Every method that beat naive vector RAG did so by adding a step.

Karpathy makes the case from the user&apos;s side rather than the agent&apos;s. RAG re-derives knowledge from raw chunks on every query, with no accumulation. A wiki, by contrast, is a persistent compounding artefact. Cross-references already there. Contradictions already flagged. Synthesis already reflecting everything that&apos;s been read. The maintenance burden, which historically killed personal wikis, is the part the LLM does for free. This isn&apos;t a paradigm-1 system that&apos;s added things. It&apos;s a deliberate refusal to be a paradigm-1 system in the first place.

Moraine makes the case from the most extreme end of &quot;the LLM is never invoked during ingestion or retrieval&quot;. There&apos;s no embedding model, no vector database, no LLM runtime, no message queue, no Python-managed daemon in the operational dataplane. Just ClickHouse, materialised views, and BM25 over the verbatim trace. Moraine isn&apos;t arguing that vector search is wrong. It&apos;s arguing that for its specific problem, BM25 over a normalised event table is the better trade.
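
For readers who haven&apos;t looked at BM25 closely, it&apos;s worth seeing how little machinery it needs. The sketch below is a plain-Python version of one common Okapi BM25 variant, run over a made-up event trace. It is not Moraine&apos;s implementation, which lives in ClickHouse; the `k1` and `b` defaults and the whitespace tokeniser are assumptions.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with Okapi BM25."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    avg_len = sum(len(toks) for toks in tokenised) / n_docs
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for toks in tokenised for term in set(toks))
    scores = []
    for toks in tokenised:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical verbatim trace rows, one event per line.
trace = [
    "read_file config.yaml ok",
    "run_tests suite failed",
    "write_file config.yaml ok",
]
scores = bm25_scores("run_tests failed", trace)
best = scores.index(max(scores))
```

The second row wins because it is the only one containing either query term. There&apos;s no model in the loop, and every score can be explained by pointing at the tokens that matched.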

Five teams, same negative argument from five angles. The argument isn&apos;t that flat vector RAG is useless. It&apos;s that on its own, it leaves too many of the relevant retrieval shapes unsupported. Temporal queries. Relationship queries. Multi-hop queries. The kinds of question that need provenance, navigation, or structured filters. The kinds of question that need the system to know something it never observed but should have inferred.

Even the systems that do run flat vector RAG as their primary mechanism stack things on top. SimpleMem layers intent-aware retrieval planning and online write-time synthesis on the cosine search. Memex, which started in the same camp, ended up replacing the vector store entirely with file-system-level tools and an LLM-as-search-engine pattern over structured P.A.R.A. organisation. The final shape of Memex isn&apos;t a vector RAG system that happens to use ripgrep. It&apos;s a system that abandoned vector RAG once the team had actually used it.

The pattern&apos;s consistent enough to state plainly.

Every one of the 19 systems that started with flat vector RAG ended up adding something. None of them judged the original recipe sufficient on its own.

That&apos;s the most unambiguous finding from the 19 deep-dives. It&apos;s the closest thing the field has to settled engineering. And the finding&apos;s negative rather than positive. The field knows what doesn&apos;t work. It doesn&apos;t yet know what does.

## Why no convergence

If everyone agrees that flat vector RAG isn&apos;t enough, why hasn&apos;t the field converged on a single replacement?

Because the engineering trade-offs are real and they pull in incompatible directions. Six of them are worth naming.

Write cost versus read cost. A team that pays heavily at write time, like SimpleMem with online synthesis or Hindsight with async consolidation or LLM-Wiki with two-step ingest, gets a cleaner store that reads cheaply. A team that pays at read time, like Moraine with BM25 over the verbatim trace or OpenKB with LLM tree-walks over PageIndex, gets a write path that scales linearly in raw bytes and a read path whose cost depends on query shape. The rough rule of thumb across the 19 is: if the data will be read more than five times, pay at write time; otherwise, pay at read time. The threshold&apos;s different for every team.
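
The rule of thumb is a break-even calculation underneath. A sketch with made-up cost units, where `write_extra`, `read_cheap`, and `read_expensive` are hypothetical numbers a team would have to measure for itself:

```python
import math


def break_even_reads(write_extra: float, read_cheap: float, read_expensive: float) -> int:
    """Smallest number of reads at which paying at write time costs no more.

    write_extra:    one-off extra cost of write-time synthesis
    read_cheap:     per-read cost against the cleaned store
    read_expensive: per-read cost against the raw store
    """
    saving_per_read = read_expensive - read_cheap
    if not saving_per_read > 0:
        raise ValueError("write-time work never pays back")
    return math.ceil(write_extra / saving_per_read)


# Example: synthesis costs 10 units, and each read is 2 units cheaper afterwards.
reads_needed = break_even_reads(write_extra=10.0, read_cheap=1.0, read_expensive=3.0)
```

With those numbers the two paths cost the same at five reads, and write-time investment wins on every read after that. Change the ratio and the threshold moves, which is why it&apos;s different for every team.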

Substrate ownership. Where does the data live? In a database the operator runs (PostgreSQL, SQLite, ClickHouse). In the user&apos;s filesystem (OpenContext, Tolaria, OpenKB). In a vendor&apos;s cloud (Supermemory, mem9&apos;s hosted endpoint). Or in the agent&apos;s harness logs on disk (Moraine). Each substrate has implications for backup, portability, and operational burden. The widest gap is between Tolaria&apos;s position, the file is the artefact and the database is a derivative cache, and Supermemory&apos;s position, in which the substrate is a closed PostgreSQL behind a Cloudflare Workers API the developer can never directly inspect. These aren&apos;t just different engineering decisions. They&apos;re different philosophies about who owns the user&apos;s memory.

Agent shape. The shape of the agent that consumes memory matters more than the abstract quality of the retrieval algorithm. A coding agent operating over a single repository wants structural intelligence: blast radius, call chains, community membership. Vector similarity is a poor substitute for &quot;what breaks if I change this function&apos;s return type&quot;. A research agent reading long PDFs wants hierarchical navigation. A personal-life-recording agent wants a condensed user profile that grows over time, plus the ability to navigate back to the original entries by date. The right paradigm depends on the shape of the question the agent will ask, not on any abstract quality of the retrieval algorithm.

Operator burden. An in-process library running against an embedded SQLite file imposes near-zero operational burden. A managed API service imposes near-zero configuration burden but high vendor-lock-in burden. A self-hosted PostgreSQL with pgvector and a worker process imposes meaningful infrastructure burden but offers full transparency. A filesystem-native store imposes near-zero burden of either kind but pushes synthesis work back onto the agent. Different teams have different operator profiles.

Update semantics. When a fact stops being true, what happens? Flat vector RAG handles updates badly. A new chunk supersedes an old one only if retrieval happens to favour it, and the old chunk lingers as a ghost. Knowledge-graph systems handle updates with explicit edge manipulation, but the maintenance question (what happens to derived edges when a source document changes?) is largely unsolved across the 19. Versioned-DAG systems handle updates by appending new nodes that point to old ones, which solves the lineage question but trades it for a garbage-collection one. Filesystem-native systems handle updates by letting the user edit the file, which is correct but pushes the synthesis question back onto the human. None of the 19 has a clean cross-paradigm answer to &quot;the user said this last month and now it isn&apos;t true&quot;. Every system has a partial answer it isn&apos;t quite happy with.
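
The append-only approach is the easiest of these to show. The sketch below is a minimal illustration of the idea, not any of the 19 systems&apos; code; the class and field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FactVersion:
    fact_id: int
    text: str
    supersedes: Optional[int] = None   # id of the version this one replaces


@dataclass
class VersionedStore:
    versions: dict[int, FactVersion] = field(default_factory=dict)
    superseded: set[int] = field(default_factory=set)

    def append(self, text: str, supersedes: Optional[int] = None) -> int:
        """Add a new version; nothing is ever overwritten or deleted."""
        fact_id = len(self.versions) + 1
        self.versions[fact_id] = FactVersion(fact_id, text, supersedes)
        if supersedes is not None:
            self.superseded.add(supersedes)
        return fact_id

    def current(self) -> list[str]:
        """Only versions that nothing has superseded are served at read time."""
        return [v.text for v in self.versions.values() if v.fact_id not in self.superseded]

    def lineage(self, fact_id: int) -> list[str]:
        """Walk back through older versions, newest first."""
        chain = []
        cursor: Optional[int] = fact_id
        while cursor is not None:
            version = self.versions[cursor]
            chain.append(version.text)
            cursor = version.supersedes
        return chain


store = VersionedStore()
old = store.append("User lives in Leeds")
new = store.append("User lives in York", supersedes=old)
```

Reads see only the York fact, and the lineage still reaches back to Leeds. The old version never leaves `versions`, though, which is the garbage-collection trade in miniature.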

Provenance. Does every fact in memory carry where it came from? The mature consensus is yes. Every system that started without provenance has added it. But adding provenance to a system that didn&apos;t have it from the start is genuinely hard. The chunk has an embedding. The embedding has a top-k rank in the result. What it doesn&apos;t have, unless someone built it, is a pointer back to the conversation turn or document section that produced it. Retrofitting this onto a flat vector RAG store is a meaningful project. The fact that every mature system has done it is the strongest evidence that it has to be done.
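
Building provenance in from the start mostly means making it impossible to write a fact without it. A minimal sketch, with hypothetical class names, field names, and example values:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provenance:
    source_uri: str     # document or conversation the fact came from
    locator: str        # turn number, section heading, or byte range
    extracted_at: str   # ISO 8601 timestamp of the write


@dataclass(frozen=True)
class Fact:
    text: str
    provenance: Provenance


class ProvenanceStore:
    """A store that refuses facts without a usable pointer back to the source."""

    def __init__(self) -> None:
        self._facts: list[Fact] = []

    def add(self, fact: Fact) -> None:
        if not fact.provenance.source_uri or not fact.provenance.locator:
            raise ValueError("fact rejected: missing provenance")
        self._facts.append(fact)

    def cite(self, needle: str) -> list[tuple[str, str, str]]:
        """Return (text, source_uri, locator) for every fact containing needle."""
        return [
            (f.text, f.provenance.source_uri, f.provenance.locator)
            for f in self._facts
            if needle.lower() in f.text.lower()
        ]


store = ProvenanceStore()
store.add(Fact("Deploys run on Fridays",
               Provenance("chat/2026-04-12", "turn 14", "2026-04-12T09:30:00Z")))
hits = store.cite("deploys")
```

Every hit carries its source and locator alongside the text, so the answer can be traced back to the turn that produced it. The check in `add` is the part that is painful to retrofit: an existing store is already full of facts that would fail it.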

These six trade-offs aren&apos;t solved problems. They&apos;re the live edges of the design space. Different teams resolve them differently because their constraints are different. That&apos;s why the field looks fragmented from the outside, and why on closer reading the apparent fragmentation is mostly informed eclecticism rather than confusion.

## What this means if you&apos;re building memory now

If you&apos;re building agent memory in 2026, the landscape gives you no winning blueprint. What it does give you is a clear set of known-bad starting points and a clear set of known-valuable additions, parameterised by the trade-offs above.

The practical advice that falls out of 19 deep-dives is short.

Don&apos;t ship flat vector RAG and call it a memory system. It&apos;s the right substrate for many systems but it isn&apos;t the answer on its own. Every team that started there ended up adding something. The question isn&apos;t whether you&apos;ll add something but what you&apos;ll add. The three most common additions are hybrid lexical plus semantic retrieval fused with RRF, provenance metadata travelling with every fact, and write-time synthesis (online dedup, atomisation, structured extraction). If your system has none of those, the weekend&apos;s project is to pick the most relevant one and ship it.

Match the paradigm to the problem. The right architecture depends on the shape of the question your agent will ask. A coding agent should consider knowledge-graph augmentation. A research agent reading long documents should consider LLM-as-retriever over hierarchical structure. A personal assistant accumulating context over years should consider progressive compression with tiered storage. There&apos;s no shame in hybridising. Most mature systems do. The shame&apos;s in not knowing which trade-off you&apos;re making.

Be honest about the weakness you&apos;re taking on. Knowledge-graph systems are the most expensive to build and the hardest to maintain under source change. Progressive-compression systems lose information by design. Multi-index hybrid systems need an explainable fusion stage. None of these is disqualifying. All of them are worth knowing about up front.

Expect to invest at write time. The highest-ROI design decision across the 19 is paying compute up front to make the store cleaner. SimpleMem&apos;s online synthesis. Hindsight&apos;s async consolidation. LLM-Wiki&apos;s two-step ingest. These aren&apos;t all the same mechanism but they&apos;re all the same principle. The work you do at write time pays back at every subsequent read.

Build for the agent&apos;s behaviour, not just for the agent&apos;s prompt. oh-my-kiro&apos;s central design claim, &quot;if a constraint can be expressed in code, it shouldn&apos;t be enforced with words&quot;, is more broadly true than its specific implementation. Memory systems that rely on the LLM to behave a certain way will see those behaviours decay across long sessions. The systems with lasting behavioural properties get them by making the substrate enforce them.

Take the deployment shape as seriously as the paradigm. A managed API and an in-process library can implement the same paradigm and feel like completely different products. The choice between them is often the most consequential decision the team makes. It dictates who owns the data, who runs the infrastructure, and what the trust model is. None of those is downstream of the retrieval algorithm.

## Where the field is

Not in convergence. Not in fragmentation. In what the literature on emerging engineering disciplines calls informed eclecticism. A community that&apos;s agreed on the problem, agreed on the constraints, agreed on what doesn&apos;t work, and is working through the design space without a single dominant solution emerging.

That&apos;s not a bad place to be. It&apos;s the necessary phase before convergence. Database systems went through it in the 1970s. Mobile OSes went through it in the late 2000s. Agent memory&apos;s in that phase right now. The eight paradigms across the 19 systems aren&apos;t all going to survive. Six months from now the scene will be different again, as new and refined approaches keep arriving and probing the solution space.

If the field has any single piece of received wisdom right now, it&apos;s the one this piece opened with. Flat vector RAG, on its own, isn&apos;t enough. Every team that started there ended up adding something. The agreement on what to add ends there. The agreement on the need to add is total, and that&apos;s the foundation everything else sits on.

Over the coming few weeks I will dig into the topics mentioned in this high-level overview and cover some of what I&apos;ve found on this magical mystery tour of agentic memory systems.</content:encoded><category>memory</category><category>llm</category><category>agents</category><category>architecture</category><category>paradigms</category><author>steven@sbatman.com (Steven Batchelor-Manning)</author></item></channel></rss>