
Pay at write time, read for free. The one Agentic Memory move that compounds across every retrieval

14 May 2026 · ~12 min read · by Steven Batchelor-Manning

Contents

What write time actually is
The six forms the pattern takes
Why the forms compose
The convergence is the evidence
The honest costs
The adoption order the corpus implies


If you change one thing about how you build agent memory in 2026, change how much of the work you do at write time. After a survey that went 19 systems deep, no other architectural decision compounds the way this one does. The systems with the cleanest recall behaviour all spend disproportionate compute when a fact enters the store. The systems that don’t are paying interest on that decision forever, on every read, through an agent that no longer has the original context to reason from.

The case is built on an asymmetry so obvious it tends to slide off the eye. Every fact is written once and read many times. In real agent traffic the read-to-write ratio is typically two to four orders of magnitude. A second of LLM work paid at write time, divided across thousands of subsequent reads, is a rounding error per read. A second of work skipped at write time has to be re-derived on every read, by an agent navigating around the rough edge instead of past it. That arithmetic forces the design conclusion. The corpus has already done the forcing for the field.

Let’s look at what “write time” actually means, the six forms the pattern takes, the convergence evidence that’s hard to argue with, and the honest costs of going this route.

What write time actually is

Write time is the moment a fact enters the persisted store. The user message that gets retained. The document dropped into the ingest queue. The clipped page that lands on disk. The tool result the agent decides to keep. The background sweep that promotes raw observations into synthesised beliefs.

Write time is not the user-perceived response. The agent has already replied from the previous state of memory; the new state will be visible to the next query. That asynchrony is what makes the cost tolerable. Write-time work runs behind the response, not in front of it. Hindsight is sharp about this discipline: extraction and entity resolution run synchronously, but the consolidation pass that turns facts into observations is deliberately deferred to a background sweep so retain latency stays low. mem9’s reconcile phase issues an LLM call for every batch of new facts, then hands the result to a background goroutine rather than to the agent’s response. Supermemory’s document-to-chunk-to-memory pipeline does the heavy LLM work entirely off the API request path.
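The shape of that discipline can be sketched in a few lines. This is a minimal illustration, not how Hindsight, mem9, or Supermemory implement it: the `DeferredMemory` class, its method names, and the lowercase-and-strip "processing" step are all hypothetical stand-ins for a real extraction or consolidation pass.

```python
import queue
import threading


class DeferredMemory:
    """Replies read the current store; new facts are processed off the reply path."""

    def __init__(self) -> None:
        self.store = []                 # facts that have finished write-time processing
        self._pending = queue.Queue()   # messages waiting for the background worker
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def respond(self, message: str) -> str:
        # Answer from the state of memory as it is right now.
        reply = f"answered with {len(self.store)} stored facts"
        # Hand the new message to the background worker and return immediately.
        self._pending.put(message)
        return reply

    def _drain(self) -> None:
        while True:
            message = self._pending.get()
            # Hypothetical stand-in for the expensive write-time step.
            self.store.append(message.strip().lower())
            self._pending.task_done()

    def wait_until_idle(self) -> None:
        """Block until every queued write has been applied (useful in tests)."""
        self._pending.join()
```

The first call to `respond` sees an empty store even though it has just been handed a new fact; the fact only becomes visible to a later call, which is the asynchrony described above.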

The systems that mix write-time work into the response path suffer for it. The systems that hold the discipline are the ones that ship.

The six forms the pattern takes

Across the 19 systems, write-time investment has surfaced in six recognisable forms. Most mature systems run several of them at once. The ones that run several are the ones with the cleanest behaviour under load.

Online dedup-and-synthesis is the first form, and the highest-leverage. When a fact arrives, the system queries the existing store for candidates that overlap, then issues a single batch LLM call that emits per-fact actions: add, update, delete, or no-op. SimpleMem’s add_memories is the textbook version. mem9’s reconcile is the same pattern at scale. The store never accumulates near-duplicates that have to be filtered or re-ranked on every later read. A subtler benefit is that synthesis surfaces contradictions that flat-write systems never even detect. When “user likes React” arrives followed later by “user has switched to Vue”, Hindsight’s consolidation refines the observation to capture the journey rather than overwriting, so the memory records preference as it evolved, not just the latest state.
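A rough sketch of the reconcile loop, under loud assumptions: `Decision`, `reconcile`, and `toy_decide` are hypothetical names, the shared-word overlap test stands in for a real candidate search, and `toy_decide` is a rule-based placeholder for the single batch LLM call. It is not SimpleMem’s or mem9’s code.

```python
from dataclasses import dataclass
from typing import Callable, Literal, Optional

Action = Literal["add", "update", "delete", "noop"]


@dataclass
class Decision:
    action: Action
    fact: str
    target_id: Optional[int] = None  # id of the existing fact the action applies to


def reconcile(
    store: dict[int, str],
    incoming: list[str],
    decide: Callable[[dict[int, str], list[str]], list[Decision]],
) -> dict[int, str]:
    """Apply one batch of per-fact decisions to a copy of the store."""
    # Only overlapping candidates are shown to the decider. The shared-word
    # test is a crude stand-in for a vector or keyword search.
    words = {w for fact in incoming for w in fact.lower().split()}
    candidates = {i: f for i, f in store.items() if words & set(f.lower().split())}

    updated = dict(store)
    next_id = max(store, default=0) + 1
    for d in decide(candidates, incoming):  # one batch call, many actions
        if d.action == "add":
            updated[next_id] = d.fact
            next_id += 1
        elif d.action == "update" and d.target_id in updated:
            updated[d.target_id] = d.fact
        elif d.action == "delete":
            updated.pop(d.target_id, None)
        # "noop" leaves the store untouched
    return updated


def toy_decide(candidates: dict[int, str], incoming: list[str]) -> list[Decision]:
    """Hypothetical placeholder for the LLM: exact duplicates are no-ops, the rest are adds."""
    known = set(candidates.values())
    return [Decision("noop" if fact in known else "add", fact) for fact in incoming]
```

A real decider would also emit `update` actions, for instance rewriting “user likes React” into a fact that records the later switch to Vue; the loop above applies such a decision without needing to know how it was reached.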

Atomisation is the second form. Break a statement into the smallest individually retrievable propositions before embedding. A wall of text recovered as a single chunk is opaque to ranking. A paragraph atomised into half a dozen short, self-contained claims gives the retriever something to actually discriminate between. LLM-Wiki’s two-step ingest is the cleanest expression: step one produces a structured analysis that names the entities, concepts, claims, and relations; step two writes the actual wiki pages, each one functionally an atomised proposition. The Louvain community detection running over the graph only makes sense because the units are atoms. Run it over arbitrary chunks and the community structure means nothing.
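To make the ranking point concrete, here is a deliberately naive sketch. The `Atom` type and both functions are hypothetical; the sentence splitter stands in for an LLM that would also rewrite each claim to be self-contained, and word overlap stands in for embedding similarity.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    doc_id: str
    index: int
    text: str


def atomise(doc_id: str, text: str) -> list[Atom]:
    """Naive stand-in for an LLM splitter: one atom per sentence.

    A real splitter would also resolve pronouns and other references so each
    claim stands alone; this sketch does not attempt that.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [Atom(doc_id, i, s) for i, s in enumerate(sentences)]


def best_atom(atoms: list[Atom], query: str) -> Atom:
    """Rank atoms by word overlap with the query (a toy scoring function)."""
    q = set(query.lower().split())
    return max(atoms, key=lambda a: len(q & set(re.findall(r"\w+", a.text.lower()))))
```

Scored as one chunk, a three-sentence paragraph matches or misses as a whole. Scored as atoms, the one sentence about deploys can outrank its neighbours for a query about deploys.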

Multi-step ingest is the third form. Once you accept that ingest doesn’t have to be a single LLM pass, the pipelines fan out into richer compositions, deterministic where possible, LLM where necessary. Understand-Anything is the most extreme example in the corpus. Six of its nine agents follow the same internal structure: phase one writes and runs a deterministic helper script (Tree-sitter for structure, a Node script for fan-in and fan-out metrics), phase two reads the JSON output and applies LLM judgment. The LLM is explicitly told “Do NOT re-run file discovery commands or re-count lines, trust the script’s results entirely”. OpenKB’s compilation pipeline is four steps built around prompt-cache reuse, the cache amortising the document context across many fan-out calls. A multi-step pipeline that’s naïvely implemented is brutally expensive. A multi-step pipeline that’s built around the cache is cheaper per document than the single-shot equivalent.
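The two-phase structure is easy to sketch. The version below is an illustration only: it uses Python’s standard `ast` module where Understand-Anything uses Tree-sitter and a Node script, and `toy_judge` is a hypothetical placeholder for the LLM judgment step. The property it tries to preserve is that the judge receives only the JSON and never re-measures anything.

```python
import ast
import json
from typing import Callable


def structural_facts(source: str) -> str:
    """Phase one: deterministic measurement of Python source, emitted as JSON."""
    tree = ast.parse(source)
    functions = [n.name for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]
    return json.dumps(
        {"line_count": len(source.splitlines()), "functions": sorted(functions)}
    )


def analyse(source: str, judge: Callable[[str], str]) -> dict:
    """Phase two: the judge sees only the JSON and must not re-count anything."""
    facts_json = structural_facts(source)
    return {"facts": json.loads(facts_json), "summary": judge(facts_json)}


def toy_judge(facts_json: str) -> str:
    """Hypothetical stand-in for the LLM call that interprets the measurements."""
    facts = json.loads(facts_json)
    return f"{len(facts['functions'])} functions in {facts['line_count']} lines"
```

Because the counts come from the parser, the summary cannot disagree with them, which is the failure the “trust the script’s results entirely” instruction is guarding against.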

Provenance metadata is the fourth form, and the simplest to implement. Every entry carries the source it came from, the timestamp it was captured at, optionally the confidence the system had at capture, and optionally the citations that justify it. Hindsight is the most rigorous expression: every fact carries the journalist’s interrogation (what, when, where, who, why), plus typed kind and category, plus source memory ids, plus a consolidated-at timestamp. Supermemory carries provenance at every layer of its three-tier object model: memories are immutable nodes in a versioned DAG connected by typed updates, extends, and derives edges. mem9 carries source, agent id, and session id on every row, plus a versioning column that supports an append-and-archive transaction so the previous version is never lost.
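As a sketch of how little this takes, the snippet below pairs a provenance-carrying record with an append-and-archive write. The field names and the `VersionedStore` class are assumptions made for illustration; they are not the schema of Hindsight, Supermemory, or mem9, and an in-memory dict stands in for a database transaction.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Entry:
    text: str
    source: str                          # where the fact came from
    captured_at: datetime                # when it was captured
    confidence: Optional[float] = None   # optional confidence at capture
    citations: tuple[str, ...] = ()      # optional supporting references
    version: int = 1


class VersionedStore:
    """Keeps the live entry per key and archives every superseded version."""

    def __init__(self) -> None:
        self.live: dict[str, Entry] = {}
        self.archive: list[Entry] = []

    def write(self, key: str, text: str, source: str, **meta) -> Entry:
        previous = self.live.get(key)
        if previous is not None:
            # Append-and-archive: the old version is moved aside, never dropped.
            self.archive.append(previous)
        entry = Entry(
            text=text,
            source=source,
            captured_at=datetime.now(timezone.utc),
            version=previous.version + 1 if previous else 1,
            **meta,
        )
        self.live[key] = entry
        return entry
```

With this in place, any answer the agent gives can be traced back to a source and a capture time, and an overwritten preference is still recoverable from the archive.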

Confidence scoring is the fifth form. Each fact gets a number or a state telling downstream readers how much to trust it. Hindsight uses a freshness lifecycle: observations move from fresh to confirmed as corroborating evidence arrives. Supermemory exposes a relative version distance from the primary memory so clients can render a temporal slider over a memory’s history without re-querying. The hard part isn’t producing the score; it’s calibrating it. The deployment pattern that makes this shippable is shadow mode: ship the heuristic dark first, collect the score distribution against real traffic, and decide the threshold from the data, not from intuition.
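Shadow mode is simple enough to sketch. Everything here is hypothetical: `ShadowScorer`, its methods, and the idea of choosing the threshold as "the score that would keep the top fraction of observed facts" are one plausible way to go from a logged distribution to a cut-off, not a description of any system in the corpus.

```python
from typing import Callable


class ShadowScorer:
    """Scores every fact and records the result without changing behaviour."""

    def __init__(self, score_fn: Callable[[str], float]) -> None:
        self.score_fn = score_fn
        self.observed: list[float] = []

    def observe(self, fact: str) -> float:
        """Score and log; callers must not branch on the result while in shadow mode."""
        score = self.score_fn(fact)
        self.observed.append(score)
        return score

    def threshold_at(self, keep_fraction: float) -> float:
        """Cut-off that would keep roughly `keep_fraction` of the observed facts."""
        if not self.observed:
            raise ValueError("no scores observed yet")
        ordered = sorted(self.observed, reverse=True)
        k = max(1, round(keep_fraction * len(ordered)))
        return ordered[k - 1]
```

The scorer runs against real traffic for a while, then `threshold_at` turns the collected distribution into a number that can be reviewed before anything starts acting on it.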

Type tagging is the sixth form. Every atom is labelled with what kind of thing it is: concept, entity, claim, relation, event, conversation. Tolaria’s frontmatter-as-type convention shows that even a convention-only type system delivers most of the value; you don’t need schema validation to get the benefits of typed retrieval. The label gives the retriever a second axis to filter on, which gives the agent a way to ask for “the concepts on this topic” rather than “the chunks that match this query”.
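A convention-only type system can be as small as the sketch below. It assumes notes that open with a `---` delimited frontmatter block containing a `type:` line; that layout and the function names are illustrative guesses, not Tolaria’s actual format. There is no schema and no validation: an unrecognised or missing type simply falls back to a default.

```python
def read_type(note: str, default: str = "untyped") -> str:
    """Return the `type:` value from a leading '---' frontmatter block, if any."""
    lines = note.splitlines()
    if not lines or lines[0].strip() != "---":
        return default
    for line in lines[1:]:
        if line.strip() == "---":  # end of frontmatter
            break
        key, sep, value = line.partition(":")
        if sep and key.strip() == "type":
            return value.strip() or default
    return default


def by_type(notes: list[str], wanted: str) -> list[str]:
    """The second retrieval axis: filter on the label before any ranking happens."""
    return [note for note in notes if read_type(note) == wanted]
```

Asking for `by_type(notes, "concept")` is the code-level version of “the concepts on this topic”.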

Six forms. They aren’t a checklist where you pick three. They reinforce each other.

Why the forms compose

A subtle observation that doesn’t come out of any single system but is visible across the corpus: the forms are mutually reinforcing.

Atomisation is more useful when atoms are typed, because the type tells the retriever what kind of atom it has. Type tagging is more useful when atoms have provenance, because the type plus the source lets the retriever filter on both axes. Provenance is more useful when atoms have confidence, because confidence tells the retriever how much to trust the provenance. Confidence is more useful when the system performs online dedup, because dedup folds many low-confidence corroborating sources into a single high-confidence fact with multiple source ids. And the multi-step pipeline that does all of the above is more useful than the single-shot pipeline that does one of them, because the steps can hand structured intermediate results between each other rather than hand prose summaries.
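One link in that chain, dedup folding corroborating sources into a single stronger fact, can be sketched directly. The `Fact` type and `fold` function are hypothetical, and the combination rule is an assumption: it treats sources as independent and combines their confidences as one minus the product of their complements. A real system would need to justify or replace that rule.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class Fact:
    text: str
    kind: str                    # the type tag
    source_ids: tuple[str, ...]  # provenance
    confidence: float            # confidence in [0, 1]


def fold(duplicates: list[Fact]) -> Fact:
    """Merge identical typed atoms into one fact with pooled sources and confidence."""
    first = duplicates[0]
    if any((f.text, f.kind) != (first.text, first.kind) for f in duplicates):
        raise ValueError("only identical typed atoms can be folded")
    sources = tuple(sorted({s for f in duplicates for s in f.source_ids}))
    # Assumed rule: independent sources, so the chance that all are wrong is the
    # product of the individual chances of being wrong.
    combined = 1.0 - prod(1.0 - f.confidence for f in duplicates)
    return Fact(first.text, first.kind, sources, combined)
```

The function only works because the inputs are atomised, typed, and carry provenance and confidence, which is the composition argument in miniature.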

Hindsight illustrates this composition the most cleanly. The atom is the extracted fact. The fact is typed by kind and category. It carries six-dimensional provenance via what, when, where, who, why, plus event date and mentioned-at. It carries confidence implicitly via the freshness state on its derived observations. It’s reconciled into observations via a background batch consolidation pass. The whole thing runs through a multi-step ingest pipeline that separates extraction, entity resolution, embedding, and consolidation. All six forms compose, and the composition is what makes the downstream observation tier legible to the reflection agent at all.

The inverse is also visible. Skipping one form weakens the others. A typed corpus without atomisation gives you typed wall-of-text. An atomised corpus without types gives you a flood of equivalent units the retriever can’t discriminate between. Provenance without confidence tells you where a fact came from but not how much to trust it. Dedup without provenance loses the audit trail of which sources fed which fact. The decision isn’t whether to invest at write time; it’s how many of the six forms to compose, and the corpus suggests the honest answer is “all of them, eventually”.

The convergence is the evidence

Anyone can find a clean architecture in a single mature system. The interesting question is whether independent systems, starting from different assumptions, end up in the same place. On write-time investment, they do.

The Karpathy LLM Wiki pattern is the clearest convergence. Karpathy’s original gist did the work in a single LLM pass per document. Three independent implementations in the corpus, LLM-Wiki, OpenKB, and Understand-Anything’s knowledge-graph mode, all moved away from that single-pass design within their first major iteration. LLM-Wiki landed on the two-step chain-of-thought pattern explicitly motivated by the observation that single-shot generation forgets to link to existing content. OpenKB landed on a four-step pattern explicitly motivated by prompt-cache reuse. Understand-Anything landed on a six-of-nine two-phase pattern explicitly motivated by the observation that LLMs are slow at counting and wrong at line numbers. Three different motivations, the same architectural conclusion. That kind of convergence is the strongest possible evidence that the pattern is load-bearing, not stylistic.

The negative evidence is just as sharp. Systems that started without write-time investment have, at some maturation point, added it. The shift goes one way. No mature memory system in the corpus has reverted from “do work at write time” to “do work at read time”. The systems that designed for it from the start, Hindsight, Supermemory, Tolaria, carry their architecture forward gracefully. The systems that didn’t are paying interest on the architectural debt forever.

Plainly: if you’re starting a new memory system in 2026 and you skip write-time investment, you are choosing to repeat a journey the field has already finished and learned from.

The honest costs

This isn’t free. Anyone who’s shipped a memory system will recognise the costs.

Latency goes up on the write path. Multi-step ingest pipelines take seconds, sometimes tens of seconds, per document. Supermemory’s 10,000-docs-per-hour throughput is bottlenecked by extraction LLM cost at roughly three to five LLM calls per document. mem9’s reconcile is a synchronous LLM call on the ingest path; even with batching it dominates the wall-clock cost of writing. OpenKB’s multi-step compilation runs five LLM calls per document even with the prompt cache reused across all of them.

Token spend goes up. Online dedup costs tokens because the LLM needs to see the candidate existing facts. Atomisation costs tokens because the LLM has to be asked to split rather than summarise. Multi-step pipelines cost tokens at every step. Prompt-cache reuse blunts the marginal cost but doesn’t eliminate it.

Engineering complexity goes up. A pipeline with five steps has more failure modes than a pipeline with one. mem9’s extraction prompt has three fallback strategies for malformed JSON, including recovery from a known flattened-fact corruption pattern. Understand-Anything’s merge script has explicit logic for recovering nodes the analysis script dropped, remapping unknown node types, restoring dropped dangling edges. Every multi-step pipeline accretes this kind of defensive code.

The trade-off is real. It’s also, on every honest reckoning across the corpus, lopsided. The reads outnumber the writes by orders of magnitude. The user-perceived latency is on the read path, not the write path. The agent’s reasoning budget is consumed at read time. The hallucinations are produced at read time. Every dimension along which “less work” sounds appealing turns out, on inspection, to be a dimension along which write-time investment buys read-time relief at favourable rates. A rule of thumb the corpus suggests: if a write-time pipeline costs three to five LLM calls per document, but the document will be read hundreds of times across its lifetime, the per-read amortised cost of the write work is far below the cost of running a single additional LLM call at read time to compensate for what wasn’t done at write. The arithmetic is forgiving in a way that almost no other architectural decision in the field is.
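The rule of thumb reduces to one division and one comparison. The sketch below uses the article’s own figures (up to five write-time calls per document, one compensating call per read); the function names are hypothetical, and counting cost in "LLM calls" ignores differences in call size and price.

```python
def amortised_write_cost(write_calls: float, lifetime_reads: int) -> float:
    """LLM calls of write-time work attributable to each read."""
    if lifetime_reads <= 0:
        raise ValueError("lifetime_reads must be positive")
    return write_calls / lifetime_reads


def write_time_wins(
    write_calls: float, lifetime_reads: int, extra_calls_per_read: float = 1.0
) -> bool:
    """True when paying once at write is cheaper than compensating on every read."""
    return write_calls < extra_calls_per_read * lifetime_reads


for reads in (3, 100, 500):
    per_read = amortised_write_cost(5, reads)
    print(f"{reads:>4} reads: {per_read:.3f} calls/read, "
          f"write-time wins: {write_time_wins(5, reads)}")
```

Under these assumptions a five-call pipeline is ahead from the sixth read onward, and at a few hundred reads the write-time share of each read is a small fraction of a single call.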

The adoption order the corpus implies

For a team retrofitting an existing flat-RAG system, the order that pays back fastest at each step is something like this. Provenance first: it’s the cheapest to add and the prerequisite for almost everything else. Types second: also cheap, and they unlock faceted retrieval immediately. Multi-step ingest third: once provenance and types are in place, refactoring the single-shot extraction into two steps becomes tractable. Atomisation fourth: partially a consequence of multi-step ingest, but worth making explicit. Online dedup fifth: the most invasive form, because it requires the ingest pipeline to read the existing store before deciding what to write. Confidence scoring sixth: shipped in shadow mode first, calibrated from data, acted on later.

This isn’t the only order that works. It’s the one that pays back fastest at each step. Provenance enables debugging immediately. Types enable faceted retrieval immediately. Multi-step ingest enables better extraction immediately. Each step yields a visible improvement and lays groundwork for the next.

If you’re picking up an existing memory system that isn’t getting the read-time quality you want, the single most useful question is the one this corpus is built around: of every piece of work the system is doing on every read, could this have been paid once at write time instead? The answer will be “yes” more often than is comfortable.

The next piece is on confidence and provenance, the substrate that makes everything in this piece auditable rather than mysterious. If the write-time pattern in this piece is the engine, confidence and provenance are the instrumentation. Hindsight’s freshness lifecycle, Supermemory’s versioned DAG, and OpenContext’s per-fact source trails are three different shapes of the same underlying argument: a memory system without provenance is one you can’t debug, and a memory system without confidence is one you can’t calibrate.

Tagged

#memory #llm #agents #architecture #write-time


The X Article covers the same ground in a different form. The site version is the canonical one; the X version exists for the conversation in the replies.