llm memory research
Almost every serious memory system made the same retrieval decision. Here's why.
Ask a flat-vector retrieval system to find the note that mentions the string idx_memory_units_text_search. The embedding model has no privileged representation for an arbitrary identifier. The tokeniser splits it into pieces, the encoder averages those pieces into a vector that looks much like every other identifier-shaped vector, and the note may or may not surface in the top fifty results. A keyword search returns it instantly.
Now ask a keyword search to find notes about “authentication” when every relevant note uses the word “login”. Exact-match is exact-match by construction. Stemming and stop-word removal help at the margins; they do not bridge the semantic gap. The note is not found.
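The lexical half of that contrast is easy to reproduce. The sketch below is a toy exact-match search over three invented notes; the note texts, the tokenising rule, and the function names are all assumptions made for illustration, not code from any system in the corpus.

```python
import re

# Hypothetical notes, invented for this example.
NOTES = {
    "n1": "Added idx_memory_units_text_search to speed up full-text queries.",
    "n2": "Users reported that login fails after a password reset.",
    "n3": "Rotated the session cookie secret on the login service.",
}

def tokenize(text: str) -> set[str]:
    # Lowercase, then keep runs of letters, digits, and underscores, so an
    # identifier like idx_memory_units_text_search survives as one token.
    return set(re.findall(r"[a-z0-9_]+", text.lower()))

def keyword_search(query: str, notes: dict[str, str]) -> list[str]:
    # Return the IDs of notes that contain every query token exactly.
    wanted = tokenize(query)
    return [note_id for note_id, body in notes.items() if wanted <= tokenize(body)]

print(keyword_search("idx_memory_units_text_search", NOTES))  # ['n1']
print(keyword_search("authentication", NOTES))                # []
print(keyword_search("login", NOTES))                         # ['n2', 'n3']
```

The identifier is found immediately, and the query for “authentication” returns nothing even though two notes are plainly about it.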
These two failure modes are not edge cases. They are the default failure modes of single-index retrieval, and they are why almost every one of the 19 systems I went deep on has either abandoned flat-vector search or wrapped it behind something else.
The something else is almost always the same thing. Hybrid retrieval — running both lexical and semantic search and fusing their ranked lists — has become the consensus architecture for serious memory systems. The fusion algorithm is Reciprocal Rank Fusion, almost universally at the constant k=60. The corpus disagrees about nearly everything else: what to extract from a conversation, when to forget, whether memory is files or rows, whether the agent should drive retrieval or be handed results. On the question of how to order a candidate list, it has converged so completely that k=60, inherited from a 2009 information-retrieval paper, has become a magic number copied without comment from one implementation to the next.
Why the two lanes are not interchangeable
Dense retrieval and lexical retrieval fail in opposite directions, which is why combining them works. Dense retrieval handles semantic variation well — “login” and “authentication” land near each other in the vector space — but struggles with exact identifiers, rare tokens, and anything the embedding model has no privileged representation for. Lexical retrieval handles exact terms, identifiers, and rare strings well but cannot bridge synonyms or paraphrase. The failure modes are complementary. Running both and fusing the results covers the ground neither covers alone.
The fusion step matters because you cannot simply concatenate the two ranked lists. A document that ranks first in the vector lane and fifteenth in the keyword lane should score differently from one that ranks first in both. Reciprocal Rank Fusion handles this by converting each rank into a score of 1 / (k + rank) and summing across lanes. The constant k controls how much weight goes to top-ranked items versus the rest of the list: the larger k is, the flatter the curve. At k=60, a rank-1 result scores 1/61 and a rank-60 result scores 1/120, a ratio of just under 2x (120/61 ≈ 1.97). The algorithm is rank-based rather than score-based, which means it is robust to the different score distributions produced by different retrieval strategies. A cosine similarity of 0.87 and a TF-IDF score of 14.3 are not directly comparable; their ranks are.
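The arithmetic fits in a few lines. This is a minimal sketch of the 1 / (k + rank) summation described above, assuming 1-based ranks and that a document missing from a lane simply contributes nothing for that lane; the function name and the toy lane contents are illustrative.

```python
from collections import defaultdict

RRF_K = 60  # the constant the corpus has converged on

def rrf_fuse(ranked_lists: list[list[str]], k: int = RRF_K) -> list[tuple[str, float]]:
    # Each ranked list holds document IDs, best first.
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

vector_lane = ["a", "b", "c"]
keyword_lane = ["b", "d", "a"]
fused = rrf_fuse([vector_lane, keyword_lane])
print([doc_id for doc_id, _ in fused])  # ['b', 'a', 'd', 'c']
```

Document b wins because it sits near the top of both lanes (1/62 + 1/61), ahead of a, which is first in one lane but third in the other (1/61 + 1/63).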
graymatter as the reference implementation
If the corpus has a single canonical worked example of hybrid retrieval done cleanly, it is graymatter. The implementation is short, the choices are explicit, and the whole flow fits on a screen.
Three rankings are produced independently. Vector ranking: cosine similarity between the query embedding and each stored fact’s embedding. Keyword ranking: a TF-IDF-style score, computed as term frequency multiplied by a log-IDF factor, summed across query terms, and divided by term count to dampen long facts. Not full BM25, but in the same family. Recency ranking: an exponential decay score, exp(-lambda × age_hours), with a default half-life of 30 days (so lambda = ln 2 / 720 per hour). Each ranking produces a map from fact ID to rank. The fusion step sums 1/(60+rank) across all three lanes for each fact, sorts descending, and returns the top results.
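The three lane scores can be sketched as follows. This is not graymatter's code: the function names are invented, and the exact IDF formula and the choice to divide by the fact's own term count are assumptions that fit the prose description.

```python
import math

HALF_LIFE_HOURS = 30 * 24                      # 30-day half-life, in hours
DECAY_LAMBDA = math.log(2) / HALF_LIFE_HOURS   # so the score halves every 720 hours

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def keyword_score(query_terms: list[str], fact_terms: list[str],
                  doc_freq: dict[str, int], n_docs: int) -> float:
    # TF multiplied by a log-IDF factor, summed over query terms, then divided
    # by the fact's term count (assumed reading of "term count").
    if not fact_terms:
        return 0.0
    total = 0.0
    for term in query_terms:
        tf = fact_terms.count(term)
        if tf:
            idf = math.log(1 + n_docs / doc_freq.get(term, 1))  # assumed IDF form
            total += tf * idf
    return total / len(fact_terms)

def recency_score(age_hours: float) -> float:
    return math.exp(-DECAY_LAMBDA * age_hours)

def to_ranks(scores: dict[str, float]) -> dict[str, int]:
    # Convert one lane's raw scores into a map from fact ID to 1-based rank.
    ordered = sorted(scores, key=lambda fact_id: scores[fact_id], reverse=True)
    return {fact_id: rank for rank, fact_id in enumerate(ordered, start=1)}
```

Each lane's scores go through to_ranks before fusion, so the very different scales of the three functions never meet.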
Three lanes. One fusion. The recency lane is the interesting addition — it means a fact that is semantically relevant and keyword-matched but six months old will score lower than a fresher fact with similar relevance. The decay is tunable; the discipline is that recency is a first-class signal rather than a post-hoc filter.
GitNexus: lanes compose without limit
GitNexus generalises the pattern. Where graymatter has three lanes, GitNexus has five separate full-text indexes — one per content type — merged by score summation, plus a BM25 lane and a dense vector lane fused with RRF. The group_query function runs cross-repository RRF, fusing results from multiple repositories into a single ranked list. The architecture documentation records RRF_K=60 as the standard constant, with Elasticsearch and Pinecone running in parallel as the retrieval backends.
The lesson from GitNexus is that lanes compose. You do not need to redesign the fusion step when you add a new lane. You add the lane, produce a ranked list from it, and hand it to the same 1/(60+rank) summation. The algorithm absorbs the new signal without modification. This is why the corpus has converged on it: it is not just correct, it is extensible.
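A small sketch makes the composability concrete. The fusion function below takes any number of named lanes; the lane names and contents are invented, and the alphabetical tie-break is an assumption added only to keep the output deterministic.

```python
def rrf_fuse(lanes: dict[str, list[str]], k: int = 60) -> list[str]:
    # lanes maps a lane name to its ranked list of document IDs, best first.
    scores: dict[str, float] = {}
    for ranking in lanes.values():
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

lanes = {
    "vector": ["a", "b", "c"],
    "keyword": ["a", "d", "b"],
}
before = rrf_fuse(lanes)
print(before)  # ['a', 'b', 'd', 'c']

# Adding a lane is one more ranked list; rrf_fuse itself is untouched.
lanes["recency"] = ["d", "c", "b"]
after = rrf_fuse(lanes)
print(after)   # ['b', 'a', 'd', 'c']
```

The new lane changes the ordering, since b now appears in all three lists, but nothing in the fusion code had to know what the lane measures.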
Hindsight: RRF as a stage in a longer pipeline
Hindsight is the most sophisticated retrieval architecture in the corpus. It runs four retrieval strategies in parallel — dense vector, keyword, temporal, and graph-based — fuses them with RRF, and then passes the fused list to a cross-encoder reranker. The cross-encoder reads the query and each candidate document together and produces a relevance score that is more accurate than any of the individual lane scores, at the cost of being more expensive to compute. Running it over the full candidate set would be prohibitive; running it over the top-N from RRF is tractable.
The pipeline is: four parallel lanes, RRF fusion, cross-encoder rerank, final ranked list. Each stage narrows the candidate set so the next stage can be more expensive and more accurate. RRF is not the end of the pipeline; it is the merge stage that makes the expensive final stage feasible.
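The shape of that pipeline can be sketched with the expensive stage stubbed out. Everything here is illustrative rather than Hindsight's implementation: the lanes run sequentially, and a word-overlap function stands in for the cross-encoder so the example runs without a model.

```python
from typing import Callable

def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

def retrieve(query: str,
             lanes: list[Callable[[str], list[str]]],
             rerank_score: Callable[[str, str], float],
             docs: dict[str, str],
             top_n: int = 20,
             final_k: int = 5) -> list[str]:
    ranked_lists = [lane(query) for lane in lanes]   # cheap stage, one list per lane
    shortlist = rrf_fuse(ranked_lists)[:top_n]       # merge, then narrow
    # Expensive stage: score only the shortlist, never the full candidate set.
    reranked = sorted(shortlist,
                      key=lambda doc_id: rerank_score(query, docs[doc_id]),
                      reverse=True)
    return reranked[:final_k]

# Toy data and a stand-in scorer (word overlap, NOT a real cross-encoder).
docs = {
    "a": "reset login password",
    "b": "login page layout",
    "c": "billing export",
    "d": "password reset email for login",
}
lanes = [lambda q: ["b", "a", "d", "c"], lambda q: ["a", "d", "b", "c"]]
scored_texts: list[str] = []

def overlap_scorer(query: str, text: str) -> float:
    scored_texts.append(text)
    query_terms = set(query.split())
    return len(query_terms & set(text.split())) / len(query_terms)

result = retrieve("password reset login", lanes, overlap_scorer, docs, top_n=3, final_k=2)
print(result)             # ['a', 'd']
print(len(scored_texts))  # 3
```

Document c never reaches the scorer: the fused list is cut to three before the expensive stage runs, which is the whole point of putting RRF in the middle.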
Hindsight also does something the other systems do not: it preserves per-strategy scores and per-strategy ranks on the result struct alongside the fused score. The trace machinery lets an operator inspect a problematic recall and see exactly which retriever found which item, what its rank was in each lane, what the RRF score was, and what the cross-encoder did. This is unusual. Most systems return only the final result list. Hindsight’s debuggability advantage is largely a consequence of this one decision — keeping the per-lane data rather than discarding it after fusion.
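Keeping the per-lane data costs very little at fusion time. The sketch below records each document's rank in every lane alongside the fused score; the type and field names are hypothetical and are not taken from Hindsight's result struct.

```python
from dataclasses import dataclass, field

@dataclass
class FusedResult:
    doc_id: str
    rrf_score: float = 0.0
    lane_ranks: dict[str, int] = field(default_factory=dict)  # lane name -> rank

def rrf_fuse_with_trace(lanes: dict[str, list[str]], k: int = 60) -> list[FusedResult]:
    results: dict[str, FusedResult] = {}
    for lane_name, ranking in lanes.items():
        for rank, doc_id in enumerate(ranking, start=1):
            result = results.setdefault(doc_id, FusedResult(doc_id))
            result.lane_ranks[lane_name] = rank   # remember who found it, and where
            result.rrf_score += 1.0 / (k + rank)
    return sorted(results.values(), key=lambda r: r.rrf_score, reverse=True)

trace = rrf_fuse_with_trace({
    "vector": ["a", "b"],
    "keyword": ["b", "c"],
    "temporal": ["c", "b"],
})
for result in trace:
    print(result.doc_id, round(result.rrf_score, 4), result.lane_ranks)
```

An operator reading this output can see that b won because all three lanes returned it, while a was found by the vector lane alone.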
The provenance gap most systems leave open
llm-wiki illustrates the cost of not keeping per-lane data. The fusion arithmetic is correct: RRF_K = 60 is set at src/lib/search.ts:53 with a comment citing the Cormack 2009 paper. But the result struct’s score field is overwritten with the RRF score, discarding the original token score and vector score. The downstream rendering code receives a single number whose provenance has been lost. A user looking at a high-relevance result cannot ask whether it ranked highly because of semantic similarity or keyword match.
This is the same pattern that appeared in the provenance piece: the cost of not having the data is deferred but not avoided. When you want to add a cross-encoder reranker, it needs the per-lane signals. When you want to build a UI that explains why a result was retrieved, it needs the per-lane signals. When you want to A/B test a new lane, you need the old per-lane numbers to compare against. The fused score is a local optimum for ordering. The per-lane scores are the substrate everything else is built on.
The discipline is worth stating plainly: one score for ranking, one score per lane for explainability, and they live in different fields.
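In type terms the discipline is one extra field. A minimal sketch, with hypothetical names (SearchHit, fused_score, lane_scores) and invented values:

```python
from dataclasses import dataclass, field

@dataclass
class SearchHit:
    doc_id: str
    fused_score: float                                          # ordering only
    lane_scores: dict[str, float] = field(default_factory=dict)  # raw score per lane

def explain(hit: SearchHit) -> str:
    # Render the provenance of a hit, if any was retained.
    if not hit.lane_scores:
        return f"{hit.doc_id}: no per-lane data retained"
    parts = ", ".join(f"{lane}={score:.2f}"
                      for lane, score in sorted(hit.lane_scores.items()))
    return f"{hit.doc_id}: fused={hit.fused_score:.4f} ({parts})"

hit = SearchHit("note-7", 1 / 61 + 1 / 75, {"vector": 0.87, "token": 14.3})
print(explain(hit))  # note-7: fused=0.0297 (token=14.30, vector=0.87)
```

Overwriting a single score field, by contrast, leaves explain with nothing to report.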
The outlier worth taking seriously
The strongest dissent in the corpus is not a variation on hybrid retrieval — it is Tolaria, which removed embeddings entirely.
Tolaria is a Markdown vault manager. Earlier in its life it shipped a semantic indexer; ADR-0009 removed it. The reasoning: the operational complexity of shipping a Go binary, code-signing it, auto-installing it, and surfacing its index status in the UI was not justified by the search-quality benefit for the specific workflow. The replacement is plain substring search over title and content, with title matches ranked above content matches.
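The replacement is simple enough to sketch in full. This is an illustration of the described behaviour, not Tolaria's code: the note shape is invented, and case-insensitive matching is an assumption.

```python
def substring_search(query: str, notes: list[dict[str, str]]) -> list[dict[str, str]]:
    # Title matches first, then content-only matches; input order is kept
    # within each group.
    needle = query.casefold()
    title_hits, content_hits = [], []
    for note in notes:
        if needle in note["title"].casefold():
            title_hits.append(note)
        elif needle in note["content"].casefold():
            content_hits.append(note)
    return title_hits + content_hits

vault = [
    {"title": "Weekly review", "content": "Follow up on the login bug."},
    {"title": "Login flow redesign", "content": "Sketches for the new screens."},
    {"title": "Groceries", "content": "Milk, eggs."},
]
print([note["title"] for note in substring_search("login", vault)])
# ['Login flow redesign', 'Weekly review']
```

There is no index to build, ship, or report on, which is the operational saving the ADR is buying.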
Crucially, Tolaria does not pretend that substring search is as good as hybrid retrieval. ADR-0009 is explicit: the AI agent provides an alternative for exploratory and semantic queries. The semantic-retrieval intelligence is shifted out of the system entirely and into the agent’s reasoning. The agent can read manifest files, reason about which folders are relevant, and read full notes. This works because Tolaria expects to be paired with a capable agent whose context window is the retrieval budget.
This is a real architectural position. The embedded-search-engine model assumes the system is responsible for finding relevant content. The agent-as-retriever model assumes the agent is responsible and the system’s job is to surface structure the agent can navigate. The two models have different cost profiles, different deployment shapes, and different failure modes. Tolaria’s position wins when the corpus is small enough to fit into the agent’s context window in summary form, when the agent is capable enough to navigate structure intelligently, and when the operational cost of running an embedding pipeline is not justified by the query volume. It loses when the corpus is large, when queries are latency-sensitive, or when the agent cannot be trusted to navigate structure reliably.
The honest framing: hybrid retrieval is the right default for systems that do serious retrieval. Tolaria’s position is the right default for systems where the agent is the retrieval engine. Knowing which one you are building is the first decision.
What the consensus actually says
The corpus has converged on hybrid retrieval with RRF at k=60 because the algorithm is correct, robust, and extensible. It is correct because it covers the complementary failure modes of dense and lexical retrieval. It is robust because it operates on ranks rather than raw scores, making it insensitive to the different score distributions produced by different retrieval strategies. It is extensible because lanes compose: adding a new signal means adding a new lane, not redesigning the fusion.
The variation in the corpus is in what the lanes are, how many there are, and what gets layered on top of the fused list. graymatter shows the clean three-lane reference. GitNexus shows that lanes compose without limit. Hindsight shows that RRF is the merge stage of a longer pipeline, with a cross-encoder reranker sitting above it. mem9 shows that a managed-API system can hide the embedding step behind the database and amortise the cost across tenants. Tolaria shows that the entire hybrid-retrieval edifice rests on an assumption — that retrieval is the system’s job — and that the assumption is challengeable.
The single piece of practical advice from the corpus compressed into one sentence: use RRF at k=60, keep the per-lane scores on the result struct, and spend your engineering budget on the lanes rather than on the fusion. The algorithm has been right for long enough that you can trust it. The lanes are where the leverage is. The provenance is where the debugging is.
The next piece covers tiered storage: the pattern that separates systems that keep everything in one flat store from those that have learned to match the storage medium to the access pattern.
Share & discuss
The X Article covers the same ground in a different form. The site version is the canonical one; the X version exists for the conversation in the replies.