
A fact without provenance is an island. Why every memory must carry its origin.

~10 min read · by Steven Batchelor-Manning
Contents
  1. Six levels, one discipline
  2. Three tiers of implementation maturity
  3. The cautionary case: computed and discarded
  4. The honest cost of building this right
  5. What the strongest implementations share

Sit down with a flat-RAG system that has been running in production for a year and try asking it four questions.

“Why did you say that?” The model produced a confident answer about the customer’s contract renewal date, but the trail back to the source clause does not exist.

“Re-validate this claim: the source changed.” The contract was amended last week. Every fact derived from it is suspect, but you cannot find those facts because none of them records which document it came from.

“Decide between these two contradicting facts.” One says Alex works at Google, another says Stripe. Neither carries a confidence score or a source timestamp, and recency of the write has nothing to do with recency of the evidence.

“Attribute this hallucination.” The model said the meeting was on Thursday. There was no meeting on Thursday. You cannot tell whether the LLM invented the date or whether bad data in memory misled it.

These four failures share a single cause. Each is a missing column: a source identifier, a capture timestamp, a confidence score, a response citation log. Each costs a few bytes at write time. The cost of not having them is unbounded: every claim is unauditable, every contradiction is unresolvable, every hallucination is untraceable.
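As a minimal sketch of what those columns buy, assume a hypothetical `MemoryRow` that carries a source identifier, a capture timestamp, and a confidence score. None of the names below come from any system in the corpus, the sample rows are invented, and the policy (newest evidence first, then confidence) is an assumption chosen for illustration. With the columns present, the Google-versus-Stripe contradiction becomes a single comparison:

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class MemoryRow:
    """Hypothetical row shape: the fact plus its write-time provenance."""
    content: str
    source_id: str         # which document or conversation produced the fact
    captured_at: datetime  # when the evidence was observed, not when the row was written
    confidence: float      # 0.0 to 1.0, assigned at write time


def resolve_contradiction(a: MemoryRow, b: MemoryRow) -> MemoryRow:
    """Prefer newer evidence; break ties on confidence (an assumed policy)."""
    return max(a, b, key=lambda row: (row.captured_at, row.confidence))


# Illustrative values only.
google = MemoryRow("Alex works at Google", "chat:example-1",
                   datetime(2023, 3, 1, tzinfo=timezone.utc), 0.9)
stripe = MemoryRow("Alex works at Stripe", "email:example-2",
                   datetime(2024, 6, 1, tzinfo=timezone.utc), 0.8)
winner = resolve_contradiction(google, stripe)
```

Here `winner` is the Stripe row because its evidence is newer, regardless of which row happened to be written last.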

Across 19 systems the pattern is consistent. The strongest implementations treat provenance the way a court treats evidence — every claim arrives with an unbroken chain of custody or it does not arrive at all. The weakest carry none, and pay for it every time something goes wrong in production. This piece walks the six levels of provenance the corpus has surfaced, the three tiers of implementation maturity separating them, and the honest cost of building this right from the start.

Six levels, one discipline

Read 19 systems back to back and six distinct levels of provenance separate themselves out. They are not a hierarchy; they are orthogonal. A system can have source provenance with no causal provenance. It can have versioning without confidence scoring. The strongest implementations cover all six. The weakest cover none.

Identity answers “which fact, exactly?” OpenContext mints UUIDs for every piece of context at ingest time and exposes them as a citation scheme the agent can use directly in responses. The identifier survives renames, moves, and reorganisations because it is bound to the content, not the path. mem9 carries a stable memory ID on every row plus an explicit version counter with If-Match concurrency protection. Without identity provenance you cannot even name what you are talking about when things go wrong.
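The version-counter idea can be sketched as optimistic concurrency: a writer states the version it last read, and the store rejects the update if the row has moved on since. This is an in-memory illustration only. `MemoryStore`, `VersionConflict`, and the `if_match` parameter are hypothetical names and do not describe mem9's actual API.

```python
import uuid
from dataclasses import dataclass


class VersionConflict(Exception):
    """Raised when the caller's expected version is stale."""


@dataclass
class Memory:
    memory_id: str
    content: str
    version: int = 1


class MemoryStore:
    def __init__(self) -> None:
        self._rows: dict[str, Memory] = {}

    def create(self, content: str) -> Memory:
        # The ID is minted once and never derived from a path or title.
        row = Memory(memory_id=str(uuid.uuid4()), content=content)
        self._rows[row.memory_id] = row
        return row

    def get(self, memory_id: str) -> Memory:
        return self._rows[memory_id]

    def update(self, memory_id: str, content: str, if_match: int) -> Memory:
        row = self._rows[memory_id]
        if row.version != if_match:
            raise VersionConflict(
                f"expected version {if_match}, found {row.version}"
            )
        row.content = content
        row.version += 1
        return row
```

A second writer still holding version 1 after the first update gets a `VersionConflict` instead of silently overwriting the newer content.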

Source answers “where did it come from?” Hindsight records source type and identifier on every observation, so the system can answer which conversation or document produced a given fact. mem9 carries source, agent ID, and session ID on every row. Supermemory stores document IDs alongside memories. Without source provenance you cannot cascade updates when a source changes, because you do not know which facts depend on it.
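The cascade described above only needs a reverse index from source to facts. The following is a sketch under stated assumptions: `SourceIndex` and its methods are hypothetical, and "flag for review" stands in for whatever re-validation a real system would run.

```python
from collections import defaultdict


class SourceIndex:
    """Maps each source ID to the facts derived from it."""

    def __init__(self) -> None:
        self._by_source: dict[str, set[str]] = defaultdict(set)
        self.needs_review: set[str] = set()

    def record(self, fact_id: str, source_id: str) -> None:
        self._by_source[source_id].add(fact_id)

    def source_changed(self, source_id: str) -> set[str]:
        """Flag every fact derived from the changed source and return them."""
        affected = set(self._by_source.get(source_id, set()))
        self.needs_review |= affected
        return affected
```

When the contract is amended, `source_changed("contract")` returns exactly the facts that depend on it. Without the source column, that set cannot be computed at all.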

Causal answers “which agent step used this?” Hindsight captures every retrieved-and-used fact in an observation tier with full retrieval context — which retriever surfaced it, what rank it achieved, and whether the agent actually consumed it. Moraine treats every trace step as its own provenance record, making the agent’s entire execution recoverable as a sequence of source-addressable events. Without causal provenance you cannot distinguish “the system retrieved this fact” from “the agent used this fact to produce that response.”
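The distinction between "retrieved" and "used" can be made concrete with one record per retrieval event. The field names below are assumptions made for illustration and are not Hindsight's or Moraine's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    step_id: str    # the agent step that triggered retrieval
    fact_id: str
    retriever: str  # for example "vector" or "keyword"
    rank: int       # 1-based position in that retriever's results
    consumed: bool  # True only if the fact reached the response


def facts_retrieved(log: list[Observation], step_id: str) -> set[str]:
    """Every fact the step saw."""
    return {o.fact_id for o in log if o.step_id == step_id}


def facts_used(log: list[Observation], step_id: str) -> set[str]:
    """Only the facts the step actually consumed."""
    return {o.fact_id for o in log if o.step_id == step_id and o.consumed}
```

Attributing a hallucination then starts with `facts_used`: if the set contains the bad fact, memory misled the model; if it does not, the model invented the claim.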

Capture confidence answers “how sure were we when we wrote it down?” Graphify marks every edge in its knowledge graph with three-level capture confidence: CONFIRMED for deterministic extractions, LIKELY for high-confidence LLM inferences, and AMBIGUOUS for uncertain claims. The AMBIGUOUS edges surface as “knowledge gaps” for human review rather than being silently treated as fact. Hindsight carries a confidence score on every observation that decays over time through its freshness lifecycle — fresh observations are trusted, stale ones are down-weighted or retired. mem9 runs near-duplicate detection in shadow mode first, recording scores without acting on them until the engineer has calibrated the threshold from real data. Without capture confidence, uncertain facts and certain facts are retrieved with equal weight, and the agent cannot discriminate between a solid claim and an educated guess.
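A rough sketch of both ideas follows: a three-level label on each edge, and a numeric score that decays with age. The level names are taken from the description above; the enum itself, the `knowledge_gaps` helper, and the exponential half-life formula are assumptions for illustration, not how Graphify or Hindsight implement them.

```python
from enum import Enum


class CaptureConfidence(Enum):
    CONFIRMED = "confirmed"  # deterministic extraction
    LIKELY = "likely"        # high-confidence LLM inference
    AMBIGUOUS = "ambiguous"  # uncertain claim, needs human review


Edge = tuple[str, str]


def knowledge_gaps(edges: dict[Edge, CaptureConfidence]) -> list[Edge]:
    """Return the edges that should go to a human rather than be treated as fact."""
    return [edge for edge, level in edges.items()
            if level is CaptureConfidence.AMBIGUOUS]


def decayed_confidence(confidence: float, age_days: float,
                       half_life_days: float = 30.0) -> float:
    """Halve the score every `half_life_days` (an assumed decay curve)."""
    return confidence * 0.5 ** (age_days / half_life_days)
```

The 30-day half-life is a placeholder. In practice it would be calibrated from real data, in the same spirit as running a detector in shadow mode before acting on its scores.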

Versioned answers “what did we believe before?” Supermemory treats memory as a versioned DAG with typed edges — updates, extends, derives — giving every belief commit history. mem9 splits the write path: in-place mutation for human edits, append-and-archive for LLM-driven rewrites where the new content semantically replaces the old. Tolaria lets Git carry the version history entirely, treating one-line diffs as a first-class user-facing artefact. Without versioned provenance you cannot rewind to what the system believed on Tuesday, because old beliefs are deleted rather than archived.
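The append-and-archive path can be sketched as a list of versions, each with a validity window, which makes "what did we believe on Tuesday?" an ordinary lookup. `BeliefHistory` and its methods are hypothetical names, and this is not the storage layout of any system named above.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class BeliefVersion:
    content: str
    valid_from: datetime
    archived_at: Optional[datetime] = None  # None means still current


class BeliefHistory:
    def __init__(self) -> None:
        self._versions: list[BeliefVersion] = []

    def rewrite(self, content: str, at: datetime) -> None:
        """Archive the current version and append the new one; nothing is deleted."""
        if self._versions:
            self._versions[-1].archived_at = at
        self._versions.append(BeliefVersion(content, valid_from=at))

    def as_of(self, when: datetime) -> Optional[str]:
        """Return what was believed at `when`, or None if nothing was recorded yet."""
        for version in self._versions:
            still_valid = version.archived_at is None or when < version.archived_at
            if version.valid_from <= when and still_valid:
                return version.content
        return None
```

The cost is one extra row per rewrite rather than a full snapshot, which matches the append-per-write cost discussed later in the piece.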

Reciprocal answers “what other facts share this origin?” llm-wiki weighs source overlap above direct linking in its four-signal relevance graph — two pages that came from the same raw document are presumed more strongly related than two pages with a direct wikilink, because the LLM is unreliable at cross-linking but the sources frontmatter is mechanically maintained. EdgeQuake accumulates source IDs on entities and relationships across the corpus, so repeated mentions of the same entity from different sources strengthen its provenance weight. second-brain carries source as part of its lexical-index composite key, letting a single document be indexed from multiple pipelines independently. Without reciprocal provenance you cannot answer “show me everything in this system that came from the same conversation” or “what else did we learn from that document?”
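A reciprocal query reduces to set intersection once every fact lists its sources. The function below is an illustrative sketch with hypothetical names. It does not reproduce llm-wiki's four-signal graph or EdgeQuake's weighting.

```python
def shared_origin(fact_id: str,
                  sources_by_fact: dict[str, set[str]]) -> dict[str, set[str]]:
    """Map every other fact to the sources it shares with `fact_id`."""
    own = sources_by_fact.get(fact_id, set())
    return {
        other: own & sources
        for other, sources in sources_by_fact.items()
        if other != fact_id and own & sources
    }
```

The size of each returned set is a crude overlap score: two facts that share several sources are more plausibly related than two that share one.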

The six are orthogonal but they reinforce each other. Source provenance is useless without identity (you need to name the fact before you can trace it). Versioning is weaker without confidence (you know the lineage of edits but not how sure the system was about any of them). Reciprocal queries depend on source being present first. The systems that carry all six are the ones whose memory does not silently rot.

Three tiers of implementation maturity

The corpus separates into three tiers based on where provenance lives and what it can do at read time.

Tier 1 is no-provenance RAG, the starting point for most teams. Flat vector stores with content and an embedding. No source column, no confidence score, no version history. When a fact is retrieved, you get text and a similarity score. You cannot trace where it came from, how sure the system was about it, or whether it has been superseded. Every system that started here has moved toward Tier 2 over time.

Tier 2 carries provenance on the row. Source ID, confidence, version — all present as columns alongside the fact. mem9 sits here with source, agent ID, session ID, and version on every row. Supermemory’s versioned DAG is Tier 2 structure. Graphify’s three-level edge confidence is Tier 2 discipline. The provenance is available for queries, but it does not automatically decorate retrieval results. You have to write the query that uses it.

Tier 3 decorates read-time results with provenance context without the caller having to ask for it. Hindsight is the reference here — every retrieved fact arrives with its source type, confidence score, freshness state, and per-retriever ranking already attached. The agent consuming the result sees provenance as part of the fact, not as a separate lookup. mem9’s source-turn decoration sits halfway between Tier 2 and Tier 3, grafting read-time context onto a Tier 2 schema.
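The difference between Tier 2 and Tier 3 is where the join happens. In the sketch below, a Tier 2 caller would receive bare `Hit` objects and look provenance up separately, while `decorate` performs that join before results leave the retrieval layer. All class and field names are assumptions for illustration, not Hindsight's or mem9's interfaces.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Hit:
    fact_id: str
    text: str
    score: float


@dataclass(frozen=True)
class Provenance:
    source_type: str
    confidence: float
    freshness: str  # for example "fresh" or "stale"


@dataclass(frozen=True)
class DecoratedHit:
    hit: Hit
    provenance: Optional[Provenance]  # None makes a missing record visible


def decorate(hits: list[Hit],
             provenance_by_id: dict[str, Provenance]) -> list[DecoratedHit]:
    """Attach provenance to every hit so the caller never has to ask for it."""
    return [DecoratedHit(hit, provenance_by_id.get(hit.fact_id)) for hit in hits]
```

Returning `None` for a missing record, rather than dropping the hit, keeps the absence of provenance observable to the agent.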

The migration is one-way. No system starts at Tier 3 and decides to remove provenance discipline. The columns prove their value as soon as they are present. If you are designing a memory system today, the question is not “do I need provenance?” but “which tier do I want to start at, knowing that every tier above the one I pick is harder to reach later than to bake in now?”

The cautionary case: computed and discarded

Understand-Anything illustrates what happens when the substrate for confidence exists but the column to record it does not. The system distinguishes deterministic edges (resolved by a project scanner from source files) from inferred edges (guessed by an LLM during semantic analysis). That information is real and meaningful — a structural import edge deserves higher confidence than a non-code inference. But both are stored as weight 0.7, identical in the graph. The confidence signal is computed at write time and discarded before persistence.

Adding a confidence field would be a small change with a large information-quality payoff. The system already knows which edges are solid and which are guesses. It just does not record the distinction where it matters — on the row that gets retrieved later, when the agent needs to discriminate between them. This pattern appears across multiple systems in the corpus: information is available at write time, cheap to capture, and then lost because nobody added the column.
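To illustrate how small the change could be, the sketch below keeps the shared weight of 0.7 and adds one field recording how the edge was produced. The `Edge` shape, the `origin` field, and `make_edge` are hypothetical and are not taken from Understand-Anything's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    weight: float
    origin: str  # "deterministic" (scanner) or "inferred" (LLM)


def make_edge(source: str, target: str, deterministic: bool,
              weight: float = 0.7) -> Edge:
    """Persist the write-time distinction instead of discarding it."""
    origin = "deterministic" if deterministic else "inferred"
    return Edge(source, target, weight, origin)


def solid_edges(edges: list[Edge]) -> list[Edge]:
    """Read-time filter that is impossible when only the weight is stored."""
    return [edge for edge in edges if edge.origin == "deterministic"]
```

Both edges can keep an identical weight. The extra field is what lets a later reader tell them apart.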

The honest cost of building this right

Provenance costs bytes. A source ID, a confidence score, a version counter — each adds a few fields per row. At scale that matters, but it matters far less than the cost of not having them when something goes wrong in production and you cannot trace why the system produced a wrong answer.

The compute cost is lower than expected. Confidence scoring typically requires one additional LLM call at write time (or a deterministic heuristic for structural facts), which is amortised across every subsequent read. Source tracking costs nothing beyond recording an identifier that already exists. Versioning costs one extra column or one append per write, not a full snapshot. Hindsight’s observation tier adds storage proportional to the number of retrieved-and-used facts, but only records what the agent actually consumed, not everything it saw.

The operational cost is the real question. Tier 3 decoration means more data flowing on every retrieval, which increases context-window usage and response latency slightly. The systems that ship this handle it by keeping decorations concise — a source type enum, a confidence float, a freshness state — rather than full provenance trees inline. The agent gets enough to discriminate without drowning in metadata.
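One way to keep decorations concise is to render each hit as a single line with a short provenance suffix and stop when a character budget is spent. The format and the budget rule below are assumptions for illustration; no system in the corpus is claimed to use them.

```python
def render_hit(text: str, source_type: str, confidence: float, freshness: str) -> str:
    """Render one fact with a compact provenance suffix."""
    return f"{text} [source={source_type} conf={confidence:.2f} {freshness}]"


def render_context(hits: list[tuple[str, str, float, str]], max_chars: int) -> str:
    """Join rendered hits, in order, until the next line would exceed the budget."""
    lines: list[str] = []
    used = 0
    for hit in hits:
        line = render_hit(*hit)
        if used + len(line) > max_chars:
            break
        lines.append(line)
        used += len(line)
    return "\n".join(lines)
```

The suffix adds a few dozen characters per fact, which is the trade described above: enough for the agent to discriminate, far less than an inline provenance tree.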

What the strongest implementations share

The exemplars across the corpus deserve restating because no two solve the same slice of the problem:

OpenContext mints UUIDs that survive any filesystem reorganisation and exposes them as a citation scheme the agent can use directly. mem9 carries source, agent ID, session ID on every row, version on every update, and decorates search results with source-turn context governed by an explicit budget. Supermemory treats memory as a versioned DAG with typed edges, giving every belief commit history. Hindsight captures every retrieved-and-used fact in an observation tier with full source provenance, evolution history, and per-retriever ranking. Graphify marks every edge with three-level capture confidence, surfacing uncertain edges for human review rather than treating them as fact.

The unifying observation is plain: provenance is not metadata, it is part of the fact. The systems that treat it that way are the ones whose memory does not silently rot, whose contradictions can be adjudicated, whose hallucinations can be traced, and whose belief history can be rewound to any point in the past without having anticipated the need.

The systems that do not are the ones whose users learn eventually that the memory they were trusting was an island all along.

Provenance is the cheapest insurance you can buy at write time. The discipline is to buy it before you find out you needed it.

The next piece walks hybrid retrieval and reciprocal rank fusion (RRF), the pattern that lets you combine vector and keyword signals without one drowning out the other, which matters most when provenance tells you a fact is solid but relevance alone would bury it.

Share & discuss

The X Article covers the same ground in a different form. The site version is the canonical one; the X version exists for the conversation in the replies.