
Eight Agentic Memory Paradigms

~11 min read · by Steven Batchelor-Manning
Contents
  1. The eight, in one sentence each
  2. What the paradigms are actually arguing about
  3. Some of these don't compose
  4. Five angles on why this matters
  5. What to do if you're picking now
  6. Where the field is going

The first piece in this series landed on the one thing 19 open-source agent-memory systems agree on, and it was a negative: flat vector RAG, on its own, isn’t enough. Every team that started there ended up adding something. That is where the agreement ends. What to add is still contested.

(Need a catchup? The first piece is here: [LINK_TO_PIECE_1])

This piece is what’s behind that “what to add” disagreement. Once you actually look at what the 19 added, the surface variety resolves into a small, sharp set of architectural commitments. Eight of them.

Eight paradigms. Not eight implementations of the same recipe. Eight different bets about what memory fundamentally is.

That distinction matters more than it sounds. Most of what reads as fragmentation across the agent-memory field is actually a small number of incompatible commitments dressed up in different vocabularies. Once you can name the eight, the field stops looking chaotic and starts looking like a design space with known edges.

Let’s look at the eight, then at what they’re really arguing about.

The eight, in one sentence each

Flat vector RAG with structured extras (Flat-RAG). Embed everything, retrieve top-k, prepend to prompt. Then layer something on top, because nobody ships it bare. SimpleMem and early Memex are the textbook examples.

Knowledge-graph augmented (Graph). Memory is a typed graph of entities and typed edges. Recall is traversal, not similarity. Graphify, EdgeQuake, GitNexus.

Progressive compression (Prog-Compression). Memory is a hierarchy of increasingly summarised representations, with heat-gated promotion between tiers. The verbose original fades; the dense distillation persists. MemoryOS is the textbook implementation.

Multi-index hybrid search (MI-Hybrid). Several indexes running in parallel, fused at recall time, almost always with Reciprocal Rank Fusion at k=60. Hindsight, supermemory, mem9, graymatter.

LLM-as-retriever (LLM-Retriever). Skip the vector store. Give the model a hierarchical map of the documents and let it navigate. Memex evolved into this. OpenKB ships it. Supermemory’s rewrite mode runs on it.

Trace-as-memory (Trace). The agent’s own execution history is the memory. Not the user’s documents. Moraine is the purest case. Hindsight’s observation tier is a hybrid component.

Karpathy LLM Wiki (Wiki). Plain Markdown, wikilinks, frontmatter, an index file as the catalogue, the user as the ultimate curator. Three independent teams rebuilt this in 2025. Understand-Anything reads it. OpenKB writes it. LLM-Wiki does both as a desktop app.

Filesystem-native context store (FS-Native). The file is the artefact. The database is a derivative cache. When in doubt, the disk wins. OpenContext, Tolaria, second-brain.

Eight commitments. Read them again as commitments rather than as feature lists. (Trace) says memory is what the agent did. (Prog-Compression) says memory is the densest faithful summary of what’s been observed. Those aren’t different settings on the same dial. They’re different answers to “what is the system trying to remember?”
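The (MI-Hybrid) entry above names Reciprocal Rank Fusion at k=60 without showing it. A minimal sketch of the standard RRF scoring rule follows; the function name and the sample hit lists are illustrative assumptions, not code from any of the systems listed.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is stable.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical results from two indexes over the same store.
vector_hits = ["a", "b", "c"]
keyword_hits = ["c", "a", "d"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Documents that rank well in more than one index rise to the top, which is why the fusion stage needs no score normalisation across indexes.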

What the paradigms are actually arguing about

The temptation is to read the eight as a tier list and pick the one with the best benchmark. That’s a category error.

Each paradigm is a commitment about three things at once. What kind of question the agent will ask. What shape the answer should take. Where the cost of getting from one to the other should be paid.

A coding agent asking “what breaks if I change this function’s return signature?” wants structural intelligence. Cosine similarity over chunks of source code is a poor substitute for a typed graph of call edges. (Graph) is the natural fit, and a (Flat-RAG) system trying to answer the same question will lose to a graph system every time.
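To make “recall is traversal” concrete, here is a minimal sketch of the return-signature question as a reverse walk over call edges. The edge list and function names are invented for illustration, and real graph systems carry typed edges and far more metadata than this.

```python
from collections import defaultdict, deque

def impacted_callers(call_edges, changed):
    """Return every function that directly or transitively calls `changed`."""
    callers_of = defaultdict(set)
    for caller, callee in call_edges:
        callers_of[callee].add(caller)

    seen, queue = set(), deque([changed])
    while queue:
        fn = queue.popleft()
        for caller in callers_of[fn]:
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen

# Hypothetical (caller, callee) edges extracted from a codebase.
edges = [
    ("handler", "parse"),
    ("cli", "handler"),
    ("parse", "tokenize"),
    ("report", "format"),
]
impacted = impacted_callers(edges, "tokenize")
```

No similarity score appears anywhere in the answer. The result is exact because the structure is explicit.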

A research agent reading a 400-page report wants navigation. It wants to descend into the section that matters and ignore the rest. (LLM-Retriever) over a hierarchical map of the document beats (Flat-RAG)’s top-k cosine search, because the question shape is “find the right region, then read carefully” rather than “find the most similar 800-token window”.
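A rough sketch of that descent, assuming the document has already been turned into a tree of titled sections. The `choose` callback is where a model call would sit; `keyword_choose` is a deliberately crude stand-in so the example runs without one, and every name here is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Section:
    title: str
    text: str = ""
    children: List["Section"] = field(default_factory=list)

def navigate(root, question, choose):
    """Descend from the root to a leaf, asking `choose` which child to enter."""
    node, path = root, [root.title]
    while node.children:
        titles = [child.title for child in node.children]
        node = node.children[choose(question, titles)]
        path.append(node.title)
    return path, node.text

def keyword_choose(question, titles):
    # Stand-in for an LLM call: pick the title sharing most words with the question.
    words = set(question.lower().split())
    overlaps = [len(words & set(t.lower().split())) for t in titles]
    return overlaps.index(max(overlaps))

report = Section("Report", children=[
    Section("Market overview", children=[
        Section("Regional demand", "(text of the regional demand section)"),
        Section("Competitor landscape", "(text of the competitor section)"),
    ]),
    Section("Financial results", children=[
        Section("Revenue by segment", "(text of the revenue section)"),
        Section("Operating costs", "(text of the operating costs section)"),
    ]),
])
path, text = navigate(
    report, "what were operating costs in the financial results", keyword_choose
)
```

The retriever only ever sees titles until it reaches a leaf, so the cost scales with the depth of the map rather than the length of the document.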

A personal assistant accumulating context over years wants compression. Without it the store grows linearly in tokens until retrieval drowns. (Prog-Compression)’s tiered hierarchy is the answer. A (Flat-RAG) system at year three is a system whose retrieval is mostly stale chunks pretending to be relevant.

An observability-minded team operating agents in production wants the trace. (Trace) turns “what did this agent do last Tuesday at 3pm” into a literal query. Most systems with rich logging can’t answer that question because the logs aren’t memory; they’re text dumps the agent itself can’t see.
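Here is what “a literal query” can look like, as a minimal sketch. It uses SQLite from the Python standard library as a stand-in store; the table layout, column names, and sample rows are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trace (ts TEXT, agent TEXT, action TEXT, outcome TEXT)")
conn.executemany(
    "INSERT INTO trace VALUES (?, ?, ?, ?)",
    [
        # Invented sample events; ISO-8601 timestamps sort correctly as text.
        ("2026-01-06T14:58:10", "agent-1", "search_docs", "ok"),
        ("2026-01-06T15:02:41", "agent-1", "edit_file", "ok"),
        ("2026-01-06T15:03:05", "agent-1", "run_tests", "failed"),
        ("2026-01-07T09:15:00", "agent-1", "run_tests", "ok"),
    ],
)

def actions_between(conn, agent, start, end):
    """Return (ts, action, outcome) rows for one agent in [start, end)."""
    rows = conn.execute(
        "SELECT ts, action, outcome FROM trace "
        "WHERE agent = ? AND ts >= ? AND ts < ? ORDER BY ts",
        (agent, start, end),
    )
    return rows.fetchall()

# "What did this agent do last Tuesday at 3pm?" as a one-hour window.
window = actions_between(
    conn, "agent-1", "2026-01-06T15:00:00", "2026-01-06T16:00:00"
)
```

The difference from a log file is that the agent can issue this query itself, at recall time.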

The paradigm-fits-question framing is the most useful single move you can make on this material. Once it clicks, the eight stop being competitors and start being tools, each with a clean problem shape it’s right for and a set of problem shapes it’s wrong for.

Some of these don’t compose

The mature systems in the corpus increasingly hybridise. Hindsight is primarily (MI-Hybrid) with a (Trace) observation tier and a (Graph) entity sub-component. Supermemory is (MI-Hybrid) with a versioned-DAG schema that’s (Graph)-adjacent and an (LLM-Retriever) rewrite mode. LLM-Wiki is (Wiki) with (Graph) community detection and opt-in (MI-Hybrid) vector retrieval. OpenContext is (FS-Native) with (MI-Hybrid) retrieval mechanics layered on top.

That’s the optimistic reading. The pessimistic reading, which is also the accurate one, is that composition isn’t free.

Take (Prog-Compression) and (Trace) together. (Prog-Compression) says the verbose original fades; only the densest representation persists. (Trace) says the verbose trace is the memory; compressing it is destroying the artefact. You cannot make both your primary mode without an internal contradiction. You can run them as separate stores with separate query paths, which is what Hindsight does. But the choice of which gets first call on retrieval is an architectural decision that propagates through the whole stack.

Or take (Wiki) and (Flat-RAG) together. The (Wiki) paradigm makes a deliberate philosophical refusal: memory should be a persistent compounding artefact a human can read, not a re-derivation from chunks on every query. Bolting a vector store onto a wiki to “improve retrieval” doesn’t compose them; it turns the wiki into chunked text the LLM happens to have access to and loses the compounding property that made the wiki worth building.

(Graph) and (FS-Native) compose more cleanly, because filesystem-native storage and graph-as-derivative-cache aren’t fighting over the same role. The graph is a query optimisation; the file is the source of truth. But even there, the synthesis question (what does the graph do when the file changes?) is an open problem across every system in the corpus that’s tried it.
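The simplest version of “the disk wins” is a derived cache that re-extracts edges whenever a file’s content hash changes. The sketch below assumes wikilink-style `[[Target]]` references as the edge source and a hypothetical `DerivedGraph` class. It handles per-file re-derivation only and does nothing about the harder cross-file synthesis question.

```python
import hashlib
import re
import tempfile
from pathlib import Path

# Captures the target of [[Target]], [[Target|label]] or [[Target#heading]].
LINK = re.compile(r"\[\[([^\]|#]+)")

class DerivedGraph:
    """Edges derived from files; a file's entry is rebuilt when its hash changes."""

    def __init__(self):
        self._hashes = {}  # path -> content hash at last extraction
        self._edges = {}   # path -> set of link targets

    def edges_for(self, path):
        data = path.read_bytes()
        digest = hashlib.sha256(data).hexdigest()
        if self._hashes.get(path) != digest:
            # The file is the source of truth: drop stale edges and re-derive.
            text = data.decode("utf-8")
            self._edges[path] = {m.strip() for m in LINK.findall(text)}
            self._hashes[path] = digest
        return self._edges[path]

with tempfile.TemporaryDirectory() as tmp:
    note = Path(tmp) / "note.md"
    note.write_text("See [[Alpha]] and [[Beta|the beta page]].", encoding="utf-8")
    graph = DerivedGraph()
    first = graph.edges_for(note)
    note.write_text("Now only [[Gamma]].", encoding="utf-8")
    second = graph.edges_for(note)
```

Nothing in the cache is ever edited in place. It is thrown away and rebuilt from the file, which keeps the graph strictly a query optimisation.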

Put plainly:

The eight paradigms compose where their commitments don’t collide. Where the commitments collide, composition is a deliberate engineering choice that costs something on both sides.

The systems that have done this well have done it by being explicit about which paradigm is primary and which is secondary, and by accepting that the secondary’s strengths are partly muted in the bargain.

Five angles on why this matters

Hindsight is the clearest case for paradigm-as-strategy rather than paradigm-as-style. Its TEMPR retrieval system doesn’t just run multiple indexes; it has a deliberate routing policy that decides which paradigm to lean on for a given query. Temporal queries go to the trace tier. Relational queries go to the entity tier. Semantic queries go to vector RRF. The routing is the system. Strip the routing out and you have four indexes that disagree about the answer.
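To show the shape of a routing policy, and only the shape, here is a toy rule-based router. It is not Hindsight’s TEMPR implementation: the cue-word lists, tier names, and `route` function are assumptions chosen to mirror the three query types described above.

```python
import re

# Hypothetical cue words; a production router would use a stronger classifier.
TEMPORAL = re.compile(r"\b(when|yesterday|last|ago|before|after|since)\b", re.I)
RELATIONAL = re.compile(r"\b(who|related|depends|owns|between)\b", re.I)

def route(query):
    """Pick which store gets first call on a query."""
    if TEMPORAL.search(query):
        return "trace"    # temporal questions go to the execution history
    if RELATIONAL.search(query):
        return "entity"   # relational questions go to the entity graph
    return "vector"       # everything else falls through to fused semantic search

routes = [
    route("What did the agent do last Tuesday?"),
    route("Who owns the billing service?"),
    route("Summarise our notes on caching"),
]
```

The order of the checks is itself a policy decision: a query that is both temporal and relational goes to the trace tier here, and a different ordering would give a different system.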

EdgeQuake makes the case from the cost side. Its multi-pass gleaning extractor catches 15-25% more entities than a single-pass extractor and dominates the system’s compute bill by an order of magnitude over retrieval. (Graph) isn’t expensive at recall time. It’s expensive at write time, in a way that makes (Flat-RAG)’s “embed and forget” feel almost free by comparison. The teams that picked (Graph) picked it knowing this. The teams that drift into it accidentally end up surprised by their LLM bill.

MemoryOS makes the case from the long-tail side. Its three-tier STM/MTM/LPM hierarchy with the heat-gated promotion formula isn’t a clever optimisation on top of flat-RAG. It’s a different commitment about what storage is for. STM holds raw turns. MTM holds segment summaries promoted by access pressure. LPM holds the user profile that survives indefinitely. A system without that hierarchy at year three is a system whose retrieval has quietly become noise.
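A sketch of what heat-gated promotion can look like, under stated assumptions. The heat function below is a generic weighted sum of access count, segment length, and a recency decay. The weights, the threshold, and the field names are invented for illustration and should not be read as MemoryOS’s actual formula.

```python
import math
from dataclasses import dataclass

@dataclass
class Segment:
    summary: str
    visits: int                # how often retrieval touched this segment
    turns: int                 # how many dialogue turns it covers
    hours_since_access: float

def heat(seg, alpha=1.0, beta=1.0, gamma=1.0, tau=24.0):
    """Hypothetical heat score: access pressure plus size plus recency decay."""
    recency = math.exp(-seg.hours_since_access / tau)
    return alpha * seg.visits + beta * seg.turns + gamma * recency

def promote(mid_term, threshold=5.0):
    """Split mid-term segments into (promoted, kept) by a heat threshold."""
    promoted = [s for s in mid_term if heat(s) >= threshold]
    kept = [s for s in mid_term if heat(s) < threshold]
    return promoted, kept

mid_term = [
    Segment("prefers Python for scripts", visits=4, turns=3, hours_since_access=2.0),
    Segment("one-off weather question", visits=0, turns=1, hours_since_access=72.0),
]
promoted, kept = promote(mid_term)
```

Whatever the exact formula, the gate is what separates this from flat storage: segments have to earn their place in the long-lived tier through use.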

Moraine makes the case from the observability side, and from the most extreme end of the spectrum. There’s no embedding model. No vector database. No LLM in the operational dataplane at all. ClickHouse, materialised views, BM25 over the verbatim trace. The argument is that for its specific problem, “what did the agent do, when, with what tokens, and what was the outcome”, structured trace storage is the better trade than any retrieval-by-similarity mechanism. The paradigm fits the question.

Understand-Anything, OpenKB, and LLM-Wiki together make the case from the curation side. Three teams, working independently, built variants of the same Karpathy sketch in the same year. Plain Markdown. Wikilinks. An index file. The compounding artefact, with the human as the eventual arbiter of staleness and the LLM as the heavy lifter of synthesis. Convergent design across three independent teams is rare in this field. When it happens, it’s worth taking seriously as a signal that the underlying commitment is sound.

Five angles, five paradigms, same underlying point. Each one is a deliberate answer to a question the other four were asking less directly.

What to do if you’re picking now

If you’re building agent memory in 2026, the choice of paradigm is more consequential than the choice of database, the choice of embedding model, or the choice of retrieval algorithm. The paradigm sets which of those choices matter.

Characterise the questions your agent will ask before you characterise the documents it will read. The wrong paradigm with the right corpus retrieves the wrong things. The right paradigm with a worse corpus still works for the questions it was designed for. Map question shape to paradigm and pick a primary. Plan for one secondary if the question shape is genuinely mixed. Don’t plan for three.

Be honest about which paradigm’s weaknesses you’re taking on. Knowledge-graph systems are the most expensive to build and the hardest to maintain when the source changes. Progressive compression loses information by design. Multi-index hybrid systems need an explainable fusion stage or they become folklore. Karpathy-wiki systems push curation work back onto the user. Filesystem-native systems push synthesis back onto the agent. Trace-as-memory captures actions but not facts. None of these is disqualifying. All of them are worth knowing about up front.

Take the deployment shape as seriously as the paradigm. A managed API and an in-process library implementing the same paradigm feel like completely different products to the team using them. The choice between them is often more consequential than the paradigm itself, because deployment dictates who owns the data, who runs the infrastructure, and what the trust model is. Pick the deployment shape first. Pick the paradigm inside it.

Where the field is going

The longer-horizon read of the eight paradigms is that the equilibrium probably isn’t eight. It’s two or three composite paradigms that absorb three or four of the others each.

Three candidates are already visible in the corpus.

Hybrid retrieval with provenance and an observation tier, which is (MI-Hybrid) plus (Trace) plus the write-time-investment habit from (Flat-RAG). Hindsight is the closest current expression.

Karpathy wiki on a filesystem-native substrate, which is (Wiki) plus (FS-Native) with (LLM-Retriever) as a query mode. LLM-Wiki and OpenKB are moving towards this.

Knowledge-graph systems for structurally rich domains, which is (Graph) with (MI-Hybrid) mechanics for fuzzy entry-point selection. EdgeQuake, GitNexus, and Graphify are converging on this shape.

The pure single-paradigm systems aren’t going to disappear. SimpleMem-style (Flat-RAG) is the right answer when the budget is small and the question shape is genuinely flat. Moraine-style (Trace) is the right answer when the question is about what the agent did rather than what the world contains. The point isn’t that the paradigms are converging. The point is that the loudest paradigm right now isn’t necessarily the right paradigm in six months, and the problem has been the same all along.

If there’s a single piece of advice that falls out of the eight, it’s this. Pick the paradigm because it fits the problem. Not because it’s the loudest paradigm in the room. Not because the benchmark on the dataset that doesn’t match your domain says so. Not because the framework you already use happens to ship with one. The shape of the question your agent will ask is the only thing that should drive the answer.

Next up I’m writing about the patterns that show up across most of the eight. Provenance. Hybrid retrieval. Write-time investment. The handful of moves that are universal even when the paradigm choices aren’t.

If this was useful, share it with someone in your network who’s building agent memory right now, or who’s about to. The conversations I keep ending up in with engineers and tech leads working on this stuff are the best part of writing the series, and the only way more of them happen is if pieces like this find the right people.

The X Article covers the same ground in a different form. The site version is the canonical one; the X version exists for the conversation in the replies.