What 19 agent-memory systems all agree on (and it's a negative)

02 May 2026 · ~14 min read · by Steven Batchelor-Manning

Over the last few months I’ve continued down the rabbit hole of agentic memory systems, digging deep into open-source repos and research papers. It’s clear the current solutions out there agree on one thing, and it’s what not to do.

Vector RAG, on its own, is not enough.

Of the 19 systems, every single one that started with flat vector RAG ended up adding something on top of it. Every one. The additions are all over the place. Some added knowledge graphs. Some added trace logs. Some added a hierarchical wiki the LLM navigates by reading files. Three teams independently rebuilt Andrej Karpathy’s late 2025 LLM wiki sketch and called it memory. Two teams ripped the vector store out partway through and replaced it with ripgrep over a structured folder tree. The variety of what people added is striking. The total agreement that something must be added is the most useful single finding I’ve spotted.

That’s the consensus. Let’s look at why, and where it’s coming from.

The question every team is trying to answer

Digging through the systems, it’s clear they’re all working towards one question.

How do you give a language model usable, durable, agent-friendly memory beyond what fits in a context window?

That’s the field. Everything else, the embeddings, the graphs, the wikis, is a particular bet on how to answer it. Each project’s README, in its preferred dialect, is a restatement of that one sentence.

That’s where the alignment stops. From there, the myriad solutions begin.

What the disagreement actually looks like

Eight paradigms. Sharp boundaries. Real engineering trade-offs that pull in incompatible directions. Most teams take a primary position on one paradigm and a secondary position on another. A few hybridise across three. Nobody’s converged on a winner.

The eight, one sentence each.

1. Flat vector RAG with structured extras. Embed everything, retrieve top-k, prepend to prompt, then add a layer or three of cleverness on top. SimpleMem and Memex started here.

2. Knowledge-graph augmented. Memory as a typed graph of entities and relationships, retrieved by traversal plus vector ranking. Graphify, EdgeQuake, GitNexus.

3. Progressive compression. Memory as a hierarchy of increasingly summarised representations, with heat-gated promotion between tiers. MemoryOS is the textbook implementation. SimpleMem hybridises into this from paradigm 1.

4. Multi-index hybrid search. Multiple indexes, one fusion stage, almost always Reciprocal Rank Fusion at k=60. Hindsight, Supermemory, Graymatter, OpenContext, mem9.

5. LLM-as-retriever. Forget the vector store, give the LLM a hierarchical map of the documents, have it navigate to the answer. Memex evolved into this. OpenKB ships it. Supermemory’s rewrite mode runs on it.

6. Trace-as-memory. Memory is the agent’s own execution history, not the user’s data. Moraine in its purest form. Hindsight’s observation tier as a hybrid component.

7. Karpathy LLM wiki. Plain Markdown, wikilinks, frontmatter, an index file as the catalogue, the user as the ultimate curator of staleness. Three independent implementations. Understand-Anything reads it. OpenKB writes it. LLM-Wiki does both as a desktop app.

8. Filesystem-native context store. The file is the artefact. The database is a derivative cache. When in doubt, the disk wins. OpenContext, Tolaria, second-brain.
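
The fusion stage in the multi-index hybrid paradigm is small enough to sketch. What follows is a minimal Reciprocal Rank Fusion in Python, assuming each index hands back an ordered list of memory ids. The function name, the ids, and the two example rankings are my own illustration and aren’t taken from any of the 19 systems. Only the k=60 constant comes from the text above.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one list, best first.

    Each id scores 1 / (k + rank) per list it appears in, with ranks
    starting at 1. Ties are broken by id so the output is deterministic.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda item_id: (-scores[item_id], item_id))


# Hypothetical results from a lexical index and a vector index.
lexical = ["m3", "m1", "m7"]
semantic = ["m1", "m9", "m3"]
fused = reciprocal_rank_fusion([lexical, semantic])
print(fused)
```

Only ranks go into the fusion, never raw scores, so a BM25 score and a cosine similarity don’t need to be put on a common scale first. That’s probably a large part of why so many systems reach for it.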

The boundaries between these aren’t stylistic. They’re different commitments about what memory fundamentally is. Paradigm 6 says memory is what the agent did. Paradigm 3 says memory is the densest faithful summary of what’s been observed. You can’t make both your primary mode without an internal contradiction. You can compose them, and a few systems do, but composition forces a choice about which gets first call on retrieval, and that choice has architectural consequences that propagate through the whole stack.

That’s the framing that, more than any other, explains why the field looks the way it does.

The thing everyone agrees on

Of the 19 systems, every one that started with flat vector RAG as its primary mechanism has ended up adding something on top of it. The additions aren’t all the same. The agreement that something must be added is universal. Put plainly:

No serious system believes that “embed everything, retrieve top-k by cosine, prepend to prompt” is sufficient as a memory architecture for an agent.

That sentence is worth being careful with. It’s not a claim that vector search is useless. Vector search shows up in roughly three quarters of the 19 systems. It’s the substrate. It’s just no longer treated as the solution.

Hindsight is the clearest case. Its documentation enumerates four pain points of pure vector RAG and then architects the rest of the system around answering each one. Pure semantic similarity loses temporal signal. Facts get disconnected. Agents need to consolidate. Reasoning style is bank-specific. Hindsight’s four headline subsystems map almost directly onto those four pain points. The system is, in the words of its own deep-dive, a complete answer to “what does an agent memory system look like if you take the limitations of RAG seriously?”

EdgeQuake makes the same case with a different illustration. Imagine the query “How did Sarah Chen’s research on neural networks influence the work of her colleagues at Quantum Dynamics Lab?” A flat vector store returns disconnected chunks. One mentioning Sarah Chen. One about neural networks. One about Quantum Lab. The system can’t follow the influence chain across documents because the chain was never indexed. What was indexed was the similarity of each chunk’s embedding to the query’s embedding. Similarity isn’t relationship.
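
To make “similarity isn’t relationship” concrete, here’s a toy sketch of what an indexed influence chain buys you. It assumes the relationships were extracted into typed edges at write time. The edge list, the relation names, and the paper and colleague ids are all invented for illustration. This is not EdgeQuake’s schema or code.

```python
from collections import deque

# Hypothetical typed edges: (source, relation, target).
EDGES = [
    ("Sarah Chen", "authored", "paper_A"),
    ("paper_A", "cited_by", "paper_B"),
    ("paper_B", "authored_by", "colleague_1"),
    ("colleague_1", "member_of", "Quantum Dynamics Lab"),
]


def find_path(edges, start, goal):
    """Breadth-first search over directed edges.

    Returns the shortest chain as [node, relation, node, ...],
    or None if the goal can't be reached from the start.
    """
    adjacency = {}
    for src, rel, dst in edges:
        adjacency.setdefault(src, []).append((rel, dst))
    queue = deque([(start, [start])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for rel, nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [rel, nxt]))
    return None


path = find_path(EDGES, "Sarah Chen", "Quantum Dynamics Lab")
print(" -> ".join(path))
```

A top-k cosine search can surface each of those four facts separately. It has no way to return the chain, because the chain only exists once the edges do.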

SimpleMem makes the case in benchmark numbers. Its paper compares the cost-versus-quality frontier of four memory approaches on LoCoMo. Pure vector retrieval is the bottom of the chart on F1 and the middle on cost. The systems that beat it all do something more than retrieve top-k. SimpleMem’s contribution is intent-aware retrieval planning plus online write-time synthesis. A-Mem’s is iterative LLM reasoning loops. Mem0’s is per-write LLM extraction. Every method that beat naive vector RAG did so by adding a step.

Karpathy makes the case from the user’s side rather than the agent’s. RAG re-derives knowledge from raw chunks on every query, with no accumulation. A wiki, by contrast, is a persistent compounding artefact. Cross-references already there. Contradictions already flagged. Synthesis already reflecting everything that’s been read. The maintenance burden, which historically killed personal wikis, is the part the LLM does for free. This isn’t a paradigm-1 system that’s added things. It’s a deliberate refusal to be a paradigm-1 system in the first place.

Moraine makes the case from the most extreme end: the LLM is never invoked during ingestion or retrieval. There’s no embedding model, no vector database, no LLM runtime, no message queue, no Python-managed daemon in the operational dataplane. Just ClickHouse, materialised views, and BM25 over the verbatim trace. Moraine isn’t arguing that vector search is wrong. It’s arguing that for its specific problem, BM25 over a normalised event table is the better trade.
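
For a feel of what “BM25 over the verbatim trace” means, here’s a minimal Okapi BM25 scorer in plain Python. To be clear, this is not how Moraine does it. Moraine runs inside ClickHouse, and I’m only illustrating the ranking function. The trace lines are invented, the tokeniser is a naive whitespace split, and k1=1.5 and b=0.75 are commonly used defaults that I’ve assumed rather than taken from any of the systems.

```python
import math
from collections import Counter


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Return document indexes ordered best-first by BM25 score.

    Assumes docs is non-empty. Uses the non-negative IDF variant
    log(1 + (N - n + 0.5) / (n + 0.5)).
    """
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    avg_len = sum(len(tokens) for tokens in tokenised) / n_docs
    # Document frequency: in how many docs does each term appear?
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))

    scores = []
    for tokens in tokenised:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            n = doc_freq[term]
            idf = math.log(1 + (n_docs - n + 0.5) / (n + 0.5))
            tf = term_freq[term]
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    # sorted() is stable, so equal scores keep their original order.
    return sorted(range(n_docs), key=lambda i: -scores[i])


# Hypothetical agent trace events, stored verbatim.
trace = [
    "tool_call pytest tests/test_auth.py failed with ImportError",
    "assistant edited src/auth.py to fix import",
    "tool_call pytest tests/test_auth.py passed",
]
ranking = bm25_rank("pytest failed", trace)
print([trace[i] for i in ranking])
```

Nothing here needs a model at write time or read time, which is the whole point of that end of the design space.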

Five teams, same negative argument from five angles. The argument isn’t that flat vector RAG is useless. It’s that on its own, it leaves too many of the relevant retrieval shapes unsupported. Temporal queries. Relationship queries. Multi-hop queries. The kinds of question that need provenance, navigation, or structured filters. The kinds of question that need the system to know something it never observed but should have inferred.

Even the systems that do run flat vector RAG as their primary mechanism stack things on top. SimpleMem layers intent-aware retrieval planning and online write-time synthesis on the cosine search. Memex, which started in the same camp, ended up replacing the vector store entirely with file-system-level tools and an LLM-as-search-engine pattern over structured P.A.R.A. organisation. The final shape of Memex isn’t a vector RAG system that happens to use ripgrep. It’s a system that abandoned vector RAG once the team had actually used it.

The pattern’s consistent enough to state plainly.

Every one of the 19 systems that started with flat vector RAG ended up adding something. None of them judged the original recipe sufficient on its own.

That’s the most unambiguous finding from the 19 deep-dives. It’s the closest thing the field has to settled engineering. And the finding’s negative rather than positive. The field knows what doesn’t work. It doesn’t yet know what does.

Why no convergence

If everyone agrees that flat vector RAG isn’t enough, why hasn’t the field converged on a single replacement?

Because the engineering trade-offs are real and they pull in incompatible directions. Six of them are worth naming.

Write cost versus read cost. A team that pays heavily at write time, like SimpleMem with online synthesis or Hindsight with async consolidation or LLM-Wiki with two-step ingest, gets a cleaner store that reads cheaply. A team that pays at read time, like Moraine with BM25 over the verbatim trace or OpenKB with LLM tree-walks over PageIndex, gets a write path that scales linearly in raw bytes and a read path whose cost depends on query shape. The rough rule of thumb across the 19 is: if the data will be read more than five times, pay at write time; otherwise, pay at read time. The threshold’s different for every team.
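
The rule of thumb is really a break-even calculation, and it can be sketched in a few lines. This assumes a deliberately simple linear cost model: one write plus n reads, with per-operation costs in whatever unit you care about (tokens, milliseconds, money). The function name and the example numbers are hypothetical. I picked numbers that land on five reads purely to mirror the rule above, not because any system measured them.

```python
def break_even_reads(heavy_write, light_write, heavy_read, light_read):
    """Number of reads at which paying at write time starts to win.

    heavy_* are the costs of the write-time-synthesis design,
    light_* the costs of the pay-at-read-time design.
    Returns None when the heavy design never saves anything per read.
    """
    extra_write_cost = heavy_write - light_write
    saving_per_read = light_read - heavy_read
    if saving_per_read <= 0:
        return None
    return extra_write_cost / saving_per_read


# Hypothetical costs: synthesis makes a write 10 units dearer
# and each read 2 units cheaper.
threshold = break_even_reads(
    heavy_write=12.0, light_write=2.0, heavy_read=1.0, light_read=3.0
)
print(threshold)
```

Plug in your own costs and the threshold moves, which is exactly why it’s different for every team.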

Substrate ownership. Where does the data live? In a database the operator runs (PostgreSQL, SQLite, ClickHouse). In the user’s filesystem (OpenContext, Tolaria, OpenKB). In a vendor’s cloud (Supermemory, mem9’s hosted endpoint). Or in the agent’s harness logs on disk (Moraine). Each substrate has implications for backup, portability, and operational burden. The widest gap is between Tolaria’s position, the file is the artefact and the database is a derivative cache, and Supermemory’s position, in which the substrate is a closed PostgreSQL behind a Cloudflare Workers API the developer can never directly inspect. These aren’t just different engineering decisions. They’re different philosophies about who owns the user’s memory.

Agent shape. The shape of the agent that consumes memory matters more than the abstract quality of the retrieval algorithm. A coding agent operating over a single repository wants structural intelligence: blast radius, call chains, community membership. Vector similarity is a poor substitute for “what breaks if I change this function’s return type”. A research agent reading long PDFs wants hierarchical navigation. A personal-life-recording agent wants a condensed user profile that grows over time, plus the ability to navigate back to the original entries by date. The right paradigm depends on the shape of the question the agent will ask, not on any abstract quality of the retrieval algorithm.

Operator burden. An in-process library running against an embedded SQLite file imposes near-zero operational burden. A managed API service imposes near-zero configuration burden but high vendor-lock-in burden. A self-hosted PostgreSQL with pgvector and a worker process imposes meaningful infrastructure burden but offers full transparency. A filesystem-native store imposes near-zero burden of either kind but pushes synthesis work back onto the agent. Different teams have different operator profiles.

Update semantics. When a fact stops being true, what happens? Flat vector RAG handles updates badly. A new chunk supersedes an old one only if retrieval happens to favour it, and the old chunk lingers as a ghost. Knowledge-graph systems handle updates with explicit edge manipulation, but the maintenance question (what happens to derived edges when a source document changes?) is largely unsolved across the 19. Versioned-DAG systems handle updates by appending new nodes that point to old ones, which solves the lineage question but trades it for a garbage-collection one. Filesystem-native systems handle updates by letting the user edit the file, which is correct but pushes the synthesis question back onto the human. None of the 19 has a clean cross-paradigm answer to “the user said this last month and now it isn’t true”. Every system has a partial answer it isn’t quite happy with.
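
The versioned approach is the easiest of these to show in code. Below is a minimal append-only sketch in which each new version of a fact points at the one it supersedes. The class names, field names, and example values are all hypothetical. None of the 19 systems is being quoted here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FactVersion:
    fact_id: int
    key: str
    value: str
    supersedes: Optional[int] = None  # id of the version this one replaces


class VersionedFacts:
    """Append-only store: updates add a node, nothing is overwritten."""

    def __init__(self):
        self._versions = []  # every version ever written, indexed by fact_id
        self._head = {}      # key -> fact_id of the current version

    def assert_fact(self, key, value):
        fact = FactVersion(len(self._versions), key, value, self._head.get(key))
        self._versions.append(fact)
        self._head[key] = fact.fact_id
        return fact

    def current(self, key):
        head = self._head.get(key)
        return None if head is None else self._versions[head].value

    def lineage(self, key):
        """Walk the supersedes pointers from newest to oldest."""
        values = []
        cursor = self._head.get(key)
        while cursor is not None:
            fact = self._versions[cursor]
            values.append(fact.value)
            cursor = fact.supersedes
        return values


store = VersionedFacts()
store.assert_fact("user.city", "Leeds")    # what the user said last month
store.assert_fact("user.city", "Bristol")  # what is true now
print(store.current("user.city"), store.lineage("user.city"))
```

Retrieval reads only the head, so the old value can’t linger as a ghost. The cost is visible too: `_versions` only ever grows, which is the garbage-collection question in miniature.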

Provenance. Does every fact in memory carry where it came from? The mature consensus is yes. Every system that started without provenance has added it. But adding provenance to a system that didn’t have it from the start is genuinely hard. The chunk has an embedding. The embedding has a top-k rank in the result. What it doesn’t have, unless someone built it, is a pointer back to the conversation turn or document section that produced it. Retrofitting this onto a flat vector RAG store is a meaningful project. The fact that every mature system has done it is the strongest evidence that it has to be done.
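
The missing pointer is easy to picture as a data structure. Here’s a minimal sketch, assuming provenance is a source id plus a locator within that source. The field names, the empty-source check, and the example record are my own invention rather than any particular system’s schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provenance:
    source_id: str  # hypothetical: a conversation or document id
    locator: str    # hypothetical: a turn number or section heading

    def __post_init__(self):
        # Make the substrate refuse records with no source at all.
        if not self.source_id:
            raise ValueError("provenance needs a non-empty source_id")


@dataclass(frozen=True)
class MemoryRecord:
    text: str
    provenance: Provenance


def cite(record):
    """Render a fact together with the pointer back to its origin."""
    p = record.provenance
    return f"{record.text} [{p.source_id}#{p.locator}]"


record = MemoryRecord(
    text="User prefers dark mode",
    provenance=Provenance(source_id="conv-42", locator="turn-7"),
)
print(cite(record))
```

Designing this in on day one is a dozen lines. Retrofitting it means re-deriving the pointer for every chunk already in the store, which is where the meaningful project comes from.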

These six trade-offs aren’t solved problems. They’re the live edges of the design space. Different teams resolve them differently because their constraints are different. That’s why the field looks fragmented from the outside, and why on closer reading the apparent fragmentation is mostly informed eclecticism rather than confusion.

What this means if you’re building memory now

If you’re building agent memory in 2026, the landscape gives you no winning blueprint. What it does give you is a clear set of known-bad starting points and a clear set of known-valuable additions, parameterised by the trade-offs above.

The practical advice that falls out of 19 deep-dives is short.

Don’t ship flat vector RAG and call it a memory system. It’s the right substrate for many systems but it isn’t the answer on its own. Every team that started there ended up adding something. The question isn’t whether you’ll add something but what you’ll add. The three most common additions are hybrid lexical plus semantic retrieval fused with RRF, provenance metadata travelling with every fact, and write-time synthesis (online dedup, atomisation, structured extraction). If your system has none of those, the weekend’s project is to pick the most relevant one and ship it.

Match the paradigm to the problem. The right architecture depends on the shape of the question your agent will ask. A coding agent should consider knowledge-graph augmentation. A research agent reading long documents should consider LLM-as-retriever over hierarchical structure. A personal assistant accumulating context over years should consider progressive compression with tiered storage. There’s no shame in hybridising. Most mature systems do. The shame’s in not knowing which trade-off you’re making.

Be honest about the weakness you’re taking on. Knowledge-graph systems are the most expensive to build and the hardest to maintain under source change. Progressive-compression systems lose information by design. Multi-index hybrid systems need an explainable fusion stage. None of these is disqualifying. All of them are worth knowing about up front.

Expect to invest at write time. The highest-ROI design decision across the 19 is paying compute up front to make the store cleaner. SimpleMem’s online synthesis. Hindsight’s async consolidation. LLM-Wiki’s two-step ingest. These aren’t all the same mechanism but they’re all the same principle. The work you do at write time pays back at every subsequent read.

Build for the agent’s behaviour, not just for the agent’s prompt. oh-my-kiro’s central design claim, “if a constraint can be expressed in code, it shouldn’t be enforced with words”, is more broadly true than its specific implementation. Memory systems that rely on the LLM to behave a certain way will see those behaviours decay across long sessions. The systems with lasting behavioural properties get them by making the substrate enforce them.

Take the deployment shape as seriously as the paradigm. A managed API and an in-process library can implement the same paradigm and feel like completely different products. The choice between them is often the most consequential decision the team makes. It dictates who owns the data, who runs the infrastructure, and what the trust model is. None of those is downstream of the retrieval algorithm.

Where the field is

Not in convergence. Not in fragmentation. In what the literature on emerging engineering disciplines calls informed eclecticism. A community that’s agreed on the problem, agreed on the constraints, agreed on what doesn’t work, and is working through the design space without a single dominant solution emerging.

That’s not a bad place to be. It’s the necessary phase before convergence. Database systems went through it in the 1970s. Mobile OSes went through it in the late 2000s. Agent memory’s in that phase right now. The eight paradigms across the 19 systems aren’t all going to survive. Six months from now the scene will be different again, as new and refined approaches keep arriving and probing the solution space.

If the field has any single piece of received wisdom right now, it’s the one this piece opened with. Flat vector RAG, on its own, isn’t enough. Every team that started there ended up adding something. The agreement on what to add ends there. The agreement on the need to add is total, and that’s the foundation everything else sits on.

Over the coming few weeks I will dig into the topics mentioned in this high-level overview and cover some of what I’ve found on this magical mystery tour of agentic memory systems.

Tagged

#memory #llm #agents #architecture #paradigms

The X Article covers the same ground in a different form. The site version is the canonical one; the X version exists for the conversation in the replies.