Your memory system does not need to decide what the agent sees. The agent does.

02 June 2026 · ~11 min read · by Steven Batchelor-Manning

Contents

What injection looks like
Four failure modes of injection
What tools look like
The inversion: oh-my-kiro
The highest-leverage refinement
The two-step rhythm
What the shift implies

I’ve been down the rabbit hole on how memory reaches the model for a while now. The assumption I kept running into, before this research, was that the hard part of agent memory was retrieval quality. Get the right chunks, the thinking goes, and the rest is plumbing.

That assumption is half right. Retrieval quality matters. But the mechanism that delivers those chunks to the model matters more, and the field has quietly reversed its position on it without most people noticing. The mature systems have all converged on the same shape: the agent is given memory tools, and the agent decides when and how to use them. The middleware is no longer a predictor of what the agent needs. It’s a service the agent calls.

This is not a small change. It happened almost without comment. And if you’re still building injection-style memory in 2026, you’re working harder than you need to.

What injection looks like

Early RAG was injective. The pattern was simple enough to fit on a slide:

1. Embed every document in the corpus.
2. On every user turn, embed the query.
3. Run a top-K nearest-neighbour search.
4. Concatenate the K results into the system prompt.
5. Send the resulting blob to the model.

The model never asks for context. The middleware decides what’s relevant. The model sees the result as a fait accompli, prepended above the user’s actual question. This is automatic context injection. It was the dominant shape of RAG from roughly 2023 to 2024 and it’s still what most teams mean when they say “we added RAG” without further qualification.
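
A minimal sketch of that pipeline, for illustration only. The `embed` function is a toy bag-of-words stand-in for a real embedding model, and every name here is an assumption rather than any particular system's API.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_injected_prompt(corpus: list[str], query: str, k: int = 3) -> str:
    # Step 1: embed every document (a real system would do this once, up front).
    index = [(doc, embed(doc)) for doc in corpus]
    # Steps 2-3: embed the query and rank documents by similarity.
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    top_k = [doc for doc, _ in ranked[:k]]
    # Steps 4-5: concatenate the K results above the user's question.
    context = "\n".join(f"- {doc}" for doc in top_k)
    return f"Context:\n{context}\n\nUser: {query}"
```

Note that nothing in `build_injected_prompt` asks whether memory is needed at all: a query with no overlap with the corpus still comes back with K chunks attached.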

It works, up to a point. The point where it stops working is the point where the agent needs to do something the middleware didn’t predict.

Feel the difference concretely. Suppose the user has a memory store containing notes about a deployment incident from the prior week, a long-standing preference for terse responses, a half-finished design document about an authentication refactor, and a transcript of yesterday’s standup mentioning that the auth work has been deprioritised. The user asks: “Where are we on the auth refactor?”

Under injection, the middleware embeds the query, searches, and concatenates the top-K chunks, eight of them here. The model receives a chunk from the incident postmortem about an auth-related rollback, a chunk from the design doc, a chunk from the standup mentioning deprioritisation, and five other chunks that happened to share the word “authentication” with varying degrees of relevance. The model reads all eight. It answers. The user sees the answer. Nobody sees the eight chunks.

Under tools, the agent reads the question, decides it needs to check memory, calls a search tool with a query it composed itself, gets back identifiers and previews, reads the previews, decides which records are relevant, and fetches only those. The trace shows every step. The agent retrieved two records, not eight. It chose them. The cost reflects the choice.
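
The tool-driven version of the same scenario can be sketched as follows. The store contents paraphrase the example above. `MemoryTools` and `scripted_agent` are hypothetical names, and the scripted function is only a stand-in for decisions a real model would make.

```python
from dataclasses import dataclass, field

STORE = {
    "incident-1": "Postmortem: auth-related rollback during last week's deployment incident.",
    "pref-1": "User prefers terse responses.",
    "design-1": "Design doc (half-finished): authentication refactor plan.",
    "standup-1": "Standup: auth refactor work has been deprioritised.",
}


@dataclass
class MemoryTools:
    store: dict[str, str]
    trace: list[tuple[str, str]] = field(default_factory=list)

    def search(self, query: str) -> list[tuple[str, str]]:
        """Return (id, preview) pairs; every call is appended to the trace."""
        self.trace.append(("search", query))
        terms = query.lower().split()
        return [
            (rid, text[:60])
            for rid, text in self.store.items()
            if any(term in text.lower() for term in terms)
        ]

    def fetch(self, record_id: str) -> str:
        """Return one full record; the fetch is also recorded in the trace."""
        self.trace.append(("fetch", record_id))
        return self.store[record_id]


def scripted_agent(tools: MemoryTools) -> list[str]:
    """Stand-in for the model: compose a query, read previews, fetch a subset."""
    hits = tools.search("auth refactor")
    wanted = [rid for rid, preview in hits if "refactor" in preview.lower()]
    return [tools.fetch(rid) for rid in wanted]
```

After one run, `tools.trace` holds one search and two fetches, which is the whole record of what the agent looked at.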

The difference isn’t retrieval quality. The same chunks exist in both traces. The difference is agency: who decided what the model sees.

Four failure modes of injection

1. The irrelevant-chunk tax

Injection pays retrieval cost on every turn whether or not memory was relevant. If the user asks “what time is it?”, the middleware still embeds the query, still searches the store, still concatenates K chunks into the prompt. The model still processes them. None of that work was necessary. On a system doing thousands of agent turns per day, the waste is real and it compounds.

2. The wrong-K problem

Top-K retrieval returns the K chunks most similar to the query embedding. But similarity to the query isn’t the same as relevance to the task. The middleware can’t tell the difference because it doesn’t understand the task. It understands the query string. The model could tell the difference, but by the time the model sees the chunks, they’re already in the prompt. The model can ignore them, but it can’t un-retrieve them. The attention cost is already paid.

3. The follow-up block

Injection is a one-shot operation. The middleware injects once per turn. If the model reads the injected chunks and realises it needs something else (a different memory item, a related document, a prior conversation that wasn’t in the top-K), it has no way to get it. The retrieval layer has already fired and closed. The agent is stuck with what it was given.

This is the failure mode that hurts most in practice. The agent has the memory. It just can’t get to it.

4. The opaque trace

When memory is injected, the conversation trace doesn’t show what the agent looked at. It shows what the middleware decided to give the agent. If the agent produces a wrong answer because it was given the wrong chunks, debugging means reconstructing what the middleware retrieved and why. The trace is opaque to the very person who needs it most: the developer trying to fix the system.

What tools look like

The tool-based pattern inverts the relationship. The agent is given named operations that read and sometimes write the memory store. The agent decides when to call them. The middleware becomes a service, not a gatekeeper.
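
As a rough sketch of what "named operations" means in practice: a small registry of tools, each with a description the model reads and a handler the middleware runs on request. The tool names, descriptions, and parameter notation are all hypothetical and not taken from any of the systems discussed here.

```python
from typing import Any

MEMORY: dict[str, str] = {}  # stand-in for the memory store


def memory_save(key: str, text: str) -> dict[str, Any]:
    MEMORY[key] = text
    return {"saved": key}


def memory_search(query: str) -> dict[str, Any]:
    q = query.lower()
    return {"matches": [k for k, v in MEMORY.items() if q in v.lower()]}


# Hypothetical tool surface: the descriptions are what the model actually sees.
TOOLS: dict[str, dict[str, Any]] = {
    "memory_save": {
        "description": "Store a fact worth remembering across sessions.",
        "parameters": {"key": "string", "text": "string"},
        "handler": memory_save,
    },
    "memory_search": {
        "description": "Find stored facts whose text contains the query.",
        "parameters": {"query": "string"},
        "handler": memory_search,
    },
}


def call_tool(name: str, arguments: dict[str, Any]) -> dict[str, Any]:
    """The middleware as a service: it runs only what the agent asks for."""
    if name not in TOOLS:
        return {"error": f"unknown tool: {name}"}
    return TOOLS[name]["handler"](**arguments)
```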

Of the 19 systems I went through, the mature ones have all converged on this shape. Supermemory, Graymatter, OpenContext, Tolaria, second-brain, MemoryOS, GitNexus, and mem9 all expose memory as a tool surface rather than as automatic injection. The systems that do something closer to injection are the ones the field treats as the prior state of the art, not the current one.

The tool surfaces vary in size and shape, and the variation is itself instructive.

The small end: Graymatter

Graymatter’s MCP server exposes five tools. Three of them are the core: Remember, Recall, and memory_reflect. The last of these is the most expressive: it lets the agent update an existing fact, forget an existing fact, or link a fact to a knowledge-graph entity. The agent maintains its own memory mid-session rather than the memory layer maintaining itself.

The smallness is the pitch. The README is explicit: “the small surface is its whole pitch.” An agent author can hold the entire memory API in their head and the LLM can hold the entire tool description in its prompt. Anything more elaborate becomes another thing to teach.

The mid-band: Supermemory and OpenContext

Supermemory’s MCP server exposes four tools plus two resources and one prompt. The memory tool handles saves and deletes. The recall tool handles search. The aggressive tool description is worth noting: “DO NOT USE ANY OTHER MEMORY TOOL ONLY USE THIS ONE.” That’s prompt-engineering the tool description itself, a hack, but a working one. Tool descriptions are themselves prompts, and the same care that goes into system prompts should go into tool descriptions.

OpenContext registers nine tools split into three groups: read, write, and metadata. The write group is deliberately incomplete. The MCP server can register an empty file but not write its body. The agent edits the file directly using its existing file-editing tool. This is the clearest example in the 19 systems of splitting the read and write paths intentionally, letting the memory layer own discovery and resolution while letting the agent’s own tooling own mutation.
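
The split can be sketched like this. It is an illustration of the idea only: the function names and the in-memory manifest are assumptions, not OpenContext's actual tools or storage.

```python
import tempfile
from pathlib import Path


def register_document(root: Path, relative_path: str, manifest: list[str]) -> Path:
    """Memory-layer side: create an empty file and record it, nothing more."""
    path = root / relative_path
    path.parent.mkdir(parents=True, exist_ok=True)
    path.touch()
    manifest.append(relative_path)
    return path


def agent_write_body(path: Path, body: str) -> None:
    """Agent side: stand-in for the agent's own file-editing tool."""
    path.write_text(body, encoding="utf-8")


def demo() -> str:
    """Register a document, then let the 'agent' fill in the body."""
    with tempfile.TemporaryDirectory() as tmp:
        manifest: list[str] = []
        doc = register_document(Path(tmp), "notes/auth-refactor.md", manifest)
        agent_write_body(doc, "Auth refactor: deprioritised after standup.\n")
        return doc.read_text(encoding="utf-8")
```

The memory layer never sees the body pass through its own API; it only knows the document exists and where it lives.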

OpenContext also encodes cost governance in the tool surface itself. There’s no oc_index_build MCP tool because building the index calls a paid embedding API. Indexing is strictly CLI-driven. The skill text the agent reads on install reinforces this: “do NOT run it unless the user explicitly approves.” Policy in the tool surface, not in out-of-band documentation.
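
One way to picture policy living in the tool surface: the expensive operation exists, but it is simply never registered where the agent can reach it. The names below are hypothetical and the functions are placeholders.

```python
from typing import Callable


def search_index(query: str) -> str:
    return f"results for {query!r}"


def build_index() -> str:
    # Placeholder for the expensive step that would call a paid embedding API.
    return "index rebuilt"


# Only cheap operations are registered where the agent can reach them.
AGENT_TOOLS: dict[str, Callable[..., str]] = {"search_index": search_index}

# The paid operation is available only through the human-driven CLI.
CLI_COMMANDS: dict[str, Callable[..., str]] = {
    "search": search_index,
    "index-build": build_index,
}


def agent_call(name: str, *args: str) -> str:
    """Dispatch an agent request; anything unregistered is refused."""
    if name not in AGENT_TOOLS:
        raise PermissionError(f"{name} is not exposed to the agent")
    return AGENT_TOOLS[name](*args)
```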

The narrow-by-design end: Tolaria

Tolaria’s position is worth quoting: “The agent has full shell access. These MCP tools provide Tolaria-specific capabilities that native tools can’t replace.” Six tools, all vault-aware reads and UI-steering actions. The agent owns the write path through its native filesystem tools. Tolaria owns the vault-aware read path and the UI surface. The asymmetry is intentional and it’s what keeps the tool surface comprehensible to the LLM.

The radical end: second-brain

The most extreme point on the spectrum. The agent isn’t given pre-baked recall verbs. It’s given the database: a read-only SELECT and PRAGMA tool over the SQLite memory store, with table-level scoping enforced at the SQLite C-API authorizer hook. The agent writes SQL. If it needs a different join, a different filter, a different projection, it just writes a different query.
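
Python's standard `sqlite3` module exposes the same authorizer hook through `Connection.set_authorizer`, so the idea can be sketched in a few lines. This is not second-brain's code: the table names, the pragma allowlist, and the decision to refuse SQL function calls are all assumptions made to keep the example short.

```python
import sqlite3

ALLOWED_TABLES = {"memories"}      # hypothetical table-level scope
ALLOWED_PRAGMAS = {"table_info"}   # hypothetical read-only pragma allowlist


def authorizer(action, arg1, arg2, db_name, source):
    # Permit the SELECT statement itself.
    if action == sqlite3.SQLITE_SELECT:
        return sqlite3.SQLITE_OK
    # Column reads are allowed only on in-scope tables (arg1 is the table name).
    if action == sqlite3.SQLITE_READ and arg1 in ALLOWED_TABLES:
        return sqlite3.SQLITE_OK
    # Pragmas are allowed only by name (arg1 is the pragma name).
    if action == sqlite3.SQLITE_PRAGMA and arg1 in ALLOWED_PRAGMAS:
        return sqlite3.SQLITE_OK
    # Everything else (writes, other tables, function calls) is refused.
    return sqlite3.SQLITE_DENY


def make_scoped_connection() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    # Populate the store before the authorizer is installed.
    conn.executescript(
        """
        CREATE TABLE memories (id INTEGER PRIMARY KEY, text TEXT);
        CREATE TABLE secrets (id INTEGER PRIMARY KEY, token TEXT);
        INSERT INTO memories (text) VALUES ('auth refactor deprioritised');
        INSERT INTO secrets (token) VALUES ('do-not-leak');
        """
    )
    conn.set_authorizer(authorizer)
    return conn


def run_query(conn: sqlite3.Connection, sql: str) -> list[tuple]:
    """The single tool the agent gets: run SQL, subject to the authorizer."""
    return conn.execute(sql).fetchall()
```

A query against `memories` returns rows; a query against `secrets` or a `DELETE` raises `sqlite3.DatabaseError` before any data is touched.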

second-brain also ships pre-baked recall tools for common cases: hybrid_search, lexical_search, semantic_search. These sit alongside the raw SQL tool. The agent can use the convenience verbs when they suffice and drop down to SQL when they don’t. The tool surface doesn’t need to be one or the other. It can offer both, and let the agent choose.

MemoryOS: three tools, hidden hierarchy

MemoryOS wraps its three-tier hierarchical store behind three tools: add_memory, retrieve_memory, and get_user_profile. The agent doesn’t need to know about the short-term, mid-term, and long-term tiers. It calls retrieve_memory(query) and the system returns the best match across whichever tier owns the data. The complexity of the hierarchy stays inside the memory engine. The tool surface stays small.
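
A toy version of the facade makes the point. The three method names come from the description above, but the internals are invented for illustration: the capacity-overflow rule and word-overlap scoring are simple stand-ins, not MemoryOS's actual promotion or retrieval logic.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TieredMemory:
    short_term: list[str] = field(default_factory=list)
    mid_term: list[str] = field(default_factory=list)
    long_term: list[str] = field(default_factory=list)
    profile: dict[str, str] = field(default_factory=dict)
    short_term_capacity: int = 2  # assumed value, for illustration

    def add_memory(self, text: str) -> None:
        self.short_term.append(text)
        # Internal detail: overflow moves the oldest entry down a tier.
        while len(self.short_term) > self.short_term_capacity:
            self.mid_term.append(self.short_term.pop(0))

    def retrieve_memory(self, query: str) -> Optional[str]:
        """Best word-overlap match across all tiers; the caller never sees tiers."""
        terms = set(query.lower().split())
        best, best_score = None, 0
        for tier in (self.short_term, self.mid_term, self.long_term):
            for text in tier:
                score = len(terms & set(text.lower().split()))
                if score > best_score:
                    best, best_score = text, score
        return best

    def get_user_profile(self) -> dict[str, str]:
        return dict(self.profile)
```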

The internal model can be as sophisticated as it needs to be (heat-gated promotion, dialogue-chain reconstruction, parallel two-stage retrieval) without any of that sophistication leaking into the agent’s tool descriptions. If the engine wants to reorganise its tiers behind the scenes, the agent doesn’t have to know.

mem9: same surface, many wrappers

mem9 publishes a REST API and surfaces it through plugins for multiple agent frameworks. The plugin shapes are all over the place: the Claude Code plugin is bash hooks that curl the REST API, and the Codex plugin is Node hooks with client-side conversation parsing. But the underlying agent surface is the same five or so operations: store, search, get, update, remove, plus an ingest for whole-conversation handoff.

The same tool surface drives very different agent integrations because the surface itself is small enough to wrap many ways. The Claude Code shell hooks are 80 lines because there isn’t much to wrap. Memory logic stays in the server. Plugins stay thin.

The inversion: oh-my-kiro

One system inverts the pattern in a way that’s worth understanding. oh-my-kiro doesn’t give the agent memory tools. Instead, it interposes on the agent’s existing tool calls. When the agent calls a file-editing tool, oh-my-kiro’s hook system intercepts the call, extracts the relevant context, and stores it in the memory layer. The agent never explicitly asks for memory. The memory layer learns from what the agent does.

This isn’t injection. The middleware isn’t predicting what the agent needs and prepending it. It’s observing what the agent does and recording it. Injection is push. oh-my-kiro’s hooks are pull-by-observation. Both have a place, and a mature agent system might well combine them: tools for active recall, hooks for passive capture.
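
The capture-by-observation idea can be sketched with a plain decorator. This is not oh-my-kiro's hook mechanism; `observe`, `captured`, and `edit_file` are hypothetical names, and the list stands in for a real memory layer.

```python
import functools
from typing import Any, Callable

captured: list[dict[str, Any]] = []  # stand-in for the memory layer


def observe(tool_name: str) -> Callable:
    """Wrap an existing tool so each successful call is recorded."""
    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            result = fn(*args, **kwargs)
            # The agent did not ask for this; the hook records it in passing.
            captured.append({"tool": tool_name, "args": args, "kwargs": kwargs})
            return result
        return wrapper
    return decorator


@observe("edit_file")
def edit_file(path: str, new_text: str) -> str:
    """Placeholder for the agent's real file-editing tool."""
    return f"wrote {len(new_text)} chars to {path}"
```

The agent's tool behaves exactly as before; the memory layer gains a record as a side effect.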

The highest-leverage refinement

Across the tool-based systems, one pattern costs almost nothing to adopt: self-guiding tool responses.

Every tool response ends with a hint about what to do next. GitNexus does this explicitly: every tool response includes a Next: line suggesting the most likely follow-up call. The agent learns the API through use rather than through documentation. The trace becomes self-documenting. A developer reading it can see not just what the agent called but what the system suggested the agent call next.

GitNexus has seven or more tools grouped by process: hybrid search, 360-degree symbol context, impact analysis, even a Cypher escape hatch for graph queries. A surface that size could easily overwhelm an LLM that’s never seen it before. The Next: hints are what make it navigable. The agent calls a tool, reads the response, and the response tells it what to consider next. No documentation required.

The cost is one line per tool response. The benefit is that the agent stops guessing about API shape and starts following the system’s own understanding of what comes next.
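
A minimal sketch of the pattern, assuming a fixed hint per tool. The tool names and hint wording are made up for illustration and do not reproduce GitNexus's actual responses.

```python
# Hypothetical hints: one suggested follow-up per tool.
NEXT_HINTS: dict[str, str] = {
    "search": "Next: call get_by_id with one of the ids above to read the full record.",
    "get_by_id": "Next: call search again if this record points at related work.",
}


def with_next_hint(tool_name: str, body: str) -> str:
    """Append the tool's follow-up hint as the last line of its response."""
    hint = NEXT_HINTS.get(tool_name)
    return f"{body}\n{hint}" if hint else body
```

Because the hint travels in the response itself, it shows up in the trace alongside the result.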

The two-step rhythm

The natural structural fit for tool-based memory is two-step retrieval. Search returns identifiers and short previews. A separate GetByID call fetches the full record when needed. mem9’s MemoryRepo interface is built around this pattern. MemoryOS wraps it behind a single retrieve_memory call but the internal implementation is the same: search first, then fetch.

The numbers make the case plainly. Ten matches at 1,500 tokens each is 15,000 tokens injected into context whether the agent uses them or not. Two-step retrieval returns 10 identifiers and short previews at roughly 450 tokens total, then fetches only the records the agent actually needs. Across 20 recall steps in a session, that difference compounds to around 200,000 tokens saved, with the exact figure depending on how many full records the agent fetches at each step.
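
The arithmetic can be checked directly. The defaults below come from the figures above, except `fetched_per_step`, which is an assumption: the text does not say how many full records the agent fetches per step, and three is the value that lands near the quoted total.

```python
def tokens_saved(
    matches: int = 10,
    tokens_per_record: int = 1_500,
    preview_tokens_total: int = 450,
    fetched_per_step: int = 3,  # assumed; not stated in the text
    steps: int = 20,
) -> int:
    """Tokens saved by search-then-fetch versus injecting every match."""
    injected = matches * tokens_per_record
    two_step = preview_tokens_total + fetched_per_step * tokens_per_record
    return (injected - two_step) * steps
```

With these defaults the saving is (15,000 − 4,950) × 20 = 201,000 tokens. Fetching nothing at all gives the upper bound of 291,000.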

No architectural change to the memory store. No additional LLM calls. No information loss. It’s a retrieval interface decision.

What the shift implies

If your memory system still auto-injects, you’re working too hard. You’re predicting what the agent needs without the agent’s input. You’re paying retrieval cost on every turn whether or not memory was relevant. You’re blocking the agent from following up on partial results. You’re making the trace opaque to your own debugging. You’re doing the agent’s job for it.

The agent should ask. Give it the tools and trust it to use them. Make the tools’ responses guide the next call. Make the search-then-read split the default rhythm. Then get out of the way and let the trace tell you, after the fact, what the agent actually needed, which is the question you should have been asking all along.

Tagged

#memory #llm #agents #architecture #tools

The X Article covers the same ground in a different form. The site version is the canonical one; the X version exists for the conversation in the replies.