llm memory research

Storage is cheap. Attention is expensive. Are you using the system that exploits the difference?

~10 min read · by Steven Batchelor-Manning
Contents
  1. The asymmetry that makes tiering pay back
  2. MemoryOS as reference implementation
  3. Hindsight's theory-derived tiers
  4. supermemory's genre-conditioned tiering
  5. Promotion vs. fixed typology
  6. What flat costs you
  7. The closing position

The asymmetry that makes tiering pay back

If you change one thing about how you design agent memory, change this: stop treating storage cost and attention cost as if they’re the same thing.

I’ve been going deep on this across 19 systems, and the finding that keeps surfacing is that the ones that handle memory well aren’t necessarily the ones with the most sophisticated retrieval. They’re the ones that figured out which memories belong in the prompt at all. That’s a different problem, and it has a different solution.

Storage is cheap. Attention is expensive. A 128k-context model reading 110k of mediocre context isn’t, empirically, a better agent than the same model reading 8k of carefully selected context. The research on this is consistent: retrieval quality degrades as context fills with noise, and the degradation isn’t linear. The model doesn’t simply ignore the irrelevant material. It processes it, and the processing crowds out the signal.

Storage doesn’t have this property. A fact sitting in a database costs nothing to keep. The cost only arrives when you retrieve it and load it into the prompt. Which means the question isn’t “should I store this?” but “should I retrieve this, and if so, when?”

That reframing is where tiering comes from. Different memory items have different access patterns. Some things an agent needs every turn: the user’s name, their current project, their stated preferences. Some things it needs often but not always: recent decisions, open questions, episodic facts from the last few sessions. Some things it should be able to find when needed but should never burden every turn: meeting notes from three months ago, completed tasks, raw transcripts, one-off reference material.

A flat store treats all three categories identically. Hot items pay the search cost of cold items. Cold items inflate the prompt with noise. There’s no mechanism for items to graduate as they become more relevant, and no mechanism for items to age out as they become less relevant. The OS analogy is exact: computers have L1, L2, and L3 caches, RAM, SSD, and disk not because bytes are different but because frequency of access varies by orders of magnitude. Seven of the 19 systems I went through had already built explicit tiering before I started looking. Two of them are worth understanding in detail.
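To make the three access patterns concrete, here is a minimal sketch of a store that loads hot items every turn, searches warm items every turn, and touches cold items only when asked. The class name, the method names, and the word-overlap scoring are all assumptions of mine for illustration; none of the 19 systems is implemented this way.

```python
from dataclasses import dataclass, field


@dataclass
class TieredMemory:
    hot: dict[str, str] = field(default_factory=dict)  # loaded into every prompt
    warm: list[str] = field(default_factory=list)      # searched every turn, top-k only
    cold: list[str] = field(default_factory=list)      # searched only on explicit request

    def build_context(self, query: str, k: int = 2, include_cold: bool = False) -> list[str]:
        # Hot items are never searched; they are simply always present.
        context = [f"{key}: {value}" for key, value in self.hot.items()]
        context += self._search(self.warm, query, k)
        if include_cold:
            context += self._search(self.cold, query, k)
        return context

    @staticmethod
    def _search(items: list[str], query: str, k: int) -> list[str]:
        # Toy relevance: number of shared lowercase words (a stand-in for embeddings).
        query_words = set(query.lower().split())
        scored = [(len(query_words & set(item.lower().split())), item) for item in items]
        scored = [pair for pair in scored if pair[0] > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [item for _, item in scored[:k]]


memory = TieredMemory(
    hot={"name": "Ada", "project": "billing migration"},
    warm=["Decided to use Postgres for the billing migration"],
    cold=["Meeting notes from March about the office move"],
)
context = memory.build_context("what database for billing")
deep_context = memory.build_context("office move notes", include_cold=True)
```

The point of the shape is that the cold list can grow without changing what an ordinary turn costs: it only enters the prompt when the caller opts in.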


MemoryOS as reference implementation

MemoryOS is the clearest implementation of tiered memory in the 19 systems. Three tiers, each with a distinct data shape, a distinct latency budget, and a distinct role.

The short-term tier is a Python deque with a maximum length of 10 QA pairs. No embeddings, no search index, no heat tracking. It’s pure conversational raw material, the last ten exchanges, available at microsecond latency. When the deque fills, the oldest pair drains into the mid-term tier.
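A rough sketch of that drain behaviour, under my own naming (ShortTermMemory and mid_term_inbox are hypothetical, not MemoryOS identifiers). One detail worth knowing: a deque built with maxlen silently discards the oldest item on overflow, so this sketch pops explicitly in order to hand the evicted pair to the next tier.

```python
from collections import deque

SHORT_TERM_CAPACITY = 10  # ten QA pairs, as described above


class ShortTermMemory:
    def __init__(self, capacity: int = SHORT_TERM_CAPACITY):
        self.capacity = capacity
        self.pairs = deque()

    def add(self, question: str, answer: str):
        """Append a QA pair; return the oldest pair if one had to be drained, else None."""
        drained = None
        if len(self.pairs) >= self.capacity:
            drained = self.pairs.popleft()
        self.pairs.append((question, answer))
        return drained


short_term = ShortTermMemory()
mid_term_inbox = []  # stand-in for the mid-term tier's intake
for i in range(12):
    drained = short_term.add(f"q{i}", f"a{i}")
    if drained is not None:
        mid_term_inbox.append(drained)
```

After twelve exchanges the deque holds the ten most recent pairs and the two oldest have moved on rather than vanished.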

The mid-term tier holds up to 2000 sessions. Each session carries a summary, an embedding, a keyword set, and heat counters. The tier is indexed with Faiss and searched by cosine similarity. When the tier reaches capacity, the coldest sessions are evicted. When a session gets hot enough, it’s promoted to the long-term tier.
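Here is a small sketch of the two operations on this tier: similarity search and coldest-first eviction. It uses plain Python in place of Faiss so it runs anywhere, and the Session and MidTermTier names are my assumptions rather than the project’s own. Note that search ranks by cosine similarity alone; heat is only consulted when the tier overflows.

```python
import math
from dataclasses import dataclass


@dataclass
class Session:
    summary: str
    embedding: list[float]
    heat: float = 0.0


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class MidTermTier:
    def __init__(self, capacity: int = 2000):
        self.capacity = capacity
        self.sessions: list[Session] = []

    def add(self, session: Session) -> list[Session]:
        """Store a session, then evict the coldest ones until back under capacity."""
        self.sessions.append(session)
        evicted = []
        while len(self.sessions) > self.capacity:
            coldest = min(self.sessions, key=lambda s: s.heat)
            self.sessions.remove(coldest)
            evicted.append(coldest)
        return evicted

    def search(self, query_embedding: list[float], k: int = 3) -> list[Session]:
        # Ranking is purely semantic; heat plays no part in retrieval.
        ranked = sorted(
            self.sessions,
            key=lambda s: cosine(query_embedding, s.embedding),
            reverse=True,
        )
        return ranked[:k]


tier = MidTermTier(capacity=2)
tier.add(Session("pets", [1.0, 0.0], heat=3.0))
tier.add(Session("work", [0.0, 1.0], heat=0.5))
evicted = tier.add(Session("travel", [0.7, 0.7], heat=2.0))
best = tier.search([1.0, 0.0], k=1)
```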

The long-term tier combines a 90-dimension psychology and alignment schema with two knowledge-base deques, one for user facts and one for assistant facts, each capped at 100 entries. This is the persistent layer, the one that survives across sessions and carries the durable model of the user.

The heat formula that governs promotion is twelve lines of Python. Three signals: visit frequency, an LFU analogue; interaction depth, a proxy for topical engagement; and recency decay, exponential with a 24-hour half-life. A segment crosses the promotion threshold at 5.0. After promotion, the visit and interaction counters reset to zero, and heat collapses back to roughly 1.0.
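The formula as described can be sketched in a few lines. This is my reconstruction from the description above, not the MemoryOS source: the function and constant names are hypothetical, the three signals are assumed to combine as a weighted sum with all weights at 1.0, and the decay is written as a literal 24-hour half-life.

```python
ALPHA = BETA = GAMMA = 1.0     # weights for visits, interactions, recency
HALF_LIFE_HOURS = 24.0
PROMOTION_THRESHOLD = 5.0


def recency(last_visit_ts: float, now_ts: float) -> float:
    """Exponential decay: 1.0 when just visited, 0.5 after one half-life."""
    hours = max(0.0, (now_ts - last_visit_ts) / 3600.0)
    return 0.5 ** (hours / HALF_LIFE_HOURS)


def segment_heat(visits: int, interactions: int, last_visit_ts: float, now_ts: float) -> float:
    return ALPHA * visits + BETA * interactions + GAMMA * recency(last_visit_ts, now_ts)


def promote_if_hot(segment: dict, now_ts: float) -> bool:
    """Return True and reset the counters when the segment reaches the threshold."""
    heat = segment_heat(segment["visits"], segment["interactions"], segment["last_visit"], now_ts)
    if heat < PROMOTION_THRESHOLD:
        return False
    segment["visits"] = 0
    segment["interactions"] = 0
    return True


now = 1_000_000.0
segment = {"visits": 3, "interactions": 2, "last_visit": now}
heat_before = segment_heat(3, 2, now, now)
promoted = promote_if_hot(segment, now)
heat_after = segment_heat(segment["visits"], segment["interactions"], segment["last_visit"], now)
```

With both counters reset, only the recency term remains, which is why heat lands at roughly 1.0 immediately after promotion.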

The design decision that’s easy to miss: heat gates promotion, not retrieval. The retriever is purely semantic, cosine similarity over the mid-term embeddings. Heat is a background signal that decides whether a segment should graduate to the long-term tier. The two concerns are decoupled, and that decoupling matters more than the formula itself.

What MemoryOS leaves on the table is worth naming. The coefficients are hardcoded at 1.0; there’s no mechanism for learning weights from actual usage patterns. There are no demotion paths; once something reaches the long-term tier, it stays. And the formula optimises for frequency over importance. A critical but rare fact (a partner’s name, a medical condition, a hard constraint) may never cross the promotion threshold if it only surfaces once. Once the mid-term tier evicts the segment, the fact is gone.


Hindsight’s theory-derived tiers

Hindsight arrives at the same three-tier structure from a completely different starting point. Where MemoryOS draws on OS cache theory, Hindsight draws on cognitive science.

The three tiers are World, Experience, and Observations. World holds objective claims about the universe, ground truth, always-on, long-lived. Experience holds first-person actions of the system itself, the episodic record. Observations hold consolidated beliefs derived from World and Experience facts, carrying source memory IDs, a proof count, and a history field that tracks how the belief has evolved.

All three tiers live in the same database table, differentiated by a fact-type discriminator. Partial HNSW indexes are built per fact type. The schema is unified; the access patterns aren’t.
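The single-table, per-type-index idea can be illustrated with SQLite from the Python standard library. This is a sketch under loud assumptions: the table and column names are invented, and ordinary partial B-tree indexes stand in for the partial HNSW vector indexes, which SQLite does not provide.

```python
import sqlite3

FACT_TYPES = ("world", "experience", "observation")

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE memory_units (
        id INTEGER PRIMARY KEY,
        fact_type TEXT NOT NULL
            CHECK (fact_type IN ('world', 'experience', 'observation')),
        text TEXT NOT NULL,
        created_at REAL NOT NULL
    )
    """
)

# One partial index per fact type: each index covers only its own tier's rows.
# fact_type comes from the fixed tuple above, so interpolating it into DDL is safe.
for fact_type in FACT_TYPES:
    conn.execute(
        f"CREATE INDEX idx_{fact_type}_created ON memory_units (created_at) "
        f"WHERE fact_type = '{fact_type}'"
    )

conn.executemany(
    "INSERT INTO memory_units (fact_type, text, created_at) VALUES (?, ?, ?)",
    [
        ("world", "Lisbon is the capital of Portugal", 1.0),
        ("experience", "I booked the user a flight to Lisbon", 2.0),
        ("observation", "User prefers window seats", 3.0),
    ],
)

rows = conn.execute(
    "SELECT text FROM memory_units WHERE fact_type = 'observation' ORDER BY created_at"
).fetchall()
index_count = conn.execute(
    "SELECT COUNT(*) FROM sqlite_master WHERE type = 'index' AND name LIKE 'idx_%'"
).fetchone()[0]
```

One schema, three indexes: a query scoped to one fact type only ever walks the index built for that type.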

Promotion in Hindsight isn’t counter-driven. It’s batched LLM-driven consolidation. When a new fact is written, it’s enqueued into an async operations table. A background worker fetches the new facts alongside existing overlapping observations, builds a batch prompt, and asks the model for creates, updates, and deletes. Source memories are stamped with a consolidated-at timestamp to prevent reprocessing. Every new fact, regardless of how frequently it’s been accessed, gets considered for promotion to the Observations tier.
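A compact sketch of one consolidation pass, with a fake judge function standing in for the LLM call. Every name here (Fact, Observation, consolidate, the shape of the decision dict, and treating proof count as the number of source facts) is an assumption for illustration and not Hindsight’s actual code.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Fact:
    id: int
    text: str
    consolidated_at: Optional[float] = None


@dataclass
class Observation:
    id: int
    text: str
    source_ids: list[int]
    proof_count: int
    history: list[str] = field(default_factory=list)


def consolidate(facts, observations, judge, now):
    """Run one batch: every unconsolidated fact is considered, regardless of usage."""
    pending = [f for f in facts if f.consolidated_at is None]
    if not pending:
        return
    decision = judge(pending, observations)  # keys: creates, updates, deletes
    by_id = {o.id: o for o in observations}
    next_id = max(by_id, default=0) + 1
    for obs_id in decision.get("deletes", []):
        observations.remove(by_id.pop(obs_id))
    for update in decision.get("updates", []):
        obs = by_id[update["id"]]
        obs.history.append(obs.text)  # keep the previous wording of the belief
        obs.text = update["text"]
        obs.source_ids.extend(update["source_ids"])
        obs.proof_count = len(obs.source_ids)
    for new in decision.get("creates", []):
        sources = list(new["source_ids"])
        observations.append(Observation(next_id, new["text"], sources, len(sources)))
        next_id += 1
    for fact in pending:
        fact.consolidated_at = now  # stamp so the fact is not reprocessed


def fake_judge(pending, observations):
    # Stand-in for the model: one new observation citing every pending fact.
    return {
        "creates": [{"text": "User is vegetarian", "source_ids": [f.id for f in pending]}],
        "updates": [],
        "deletes": [],
    }


facts = [Fact(1, "Ordered the veggie burger"), Fact(2, "Said they do not eat meat")]
observations = []
consolidate(facts, observations, fake_judge, now=100.0)
consolidate(facts, observations, fake_judge, now=200.0)  # nothing pending, so a no-op
```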

That’s the key difference from MemoryOS. By tying upper-tier promotion to consolidation rather than heat, Hindsight avoids the blind spot for critical-but-rare facts. A single consolidation pass considers every new fact. Frequency is irrelevant to whether something graduates.

Put plainly: two systems, different first principles, different implementation languages, different target use cases, and both land on three tiers with raw material at the bottom, a working layer in the middle, and a synthesised persistent layer at the top. Both use async promotion. Both carry provenance back to lower tiers. That’s what convergent evolution looks like in software architecture, and it’s the strongest signal I know of that a pattern is load-bearing.


supermemory’s genre-conditioned tiering

supermemory operates in the managed API deployment shape, which changes the implementation without changing the architecture. The three tiers are static profile, dynamic profile, and document and chunk store.

The static profile holds stable long-term facts, the hot tier. It’s returned as a static array from the profile endpoint, cached at the edge, with a latency budget of roughly 50ms. The dynamic profile holds recent and episodic context, the warm tier. Many entries carry a forgetAfter field that sets a TTL. The document and chunk store is the cold tier, queried via search endpoints when needed.

Tier assignment happens at write time. An extraction LLM classifies each incoming memory with an isStatic boolean and optionally a forgetAfter value. The classification is enforced by a closed extraction prompt, uniform across all consumers.
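In rough terms, write-time assignment looks like the sketch below. The isStatic and forgetAfter field names come from the description above; the ExtractedMemory and Profile classes, and the shape of what read returns, are my own assumptions and not supermemory’s API. The extraction LLM is not modelled: the memories arrive already classified.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class ExtractedMemory:
    text: str
    isStatic: bool
    forgetAfter: Optional[datetime] = None


class Profile:
    def __init__(self):
        self.static: list[ExtractedMemory] = []
        self.dynamic: list[ExtractedMemory] = []

    def write(self, memory: ExtractedMemory) -> None:
        # The tier is decided once, at write time, by the classifier's flag.
        (self.static if memory.isStatic else self.dynamic).append(memory)

    def read(self, now: datetime) -> dict:
        # Dynamic entries past their forgetAfter time are filtered out on read.
        live = [m for m in self.dynamic if m.forgetAfter is None or m.forgetAfter > now]
        return {
            "static": [m.text for m in self.static],
            "dynamic": [m.text for m in live],
        }


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
profile = Profile()
profile.write(ExtractedMemory("Lives in Lisbon", isStatic=True))
profile.write(ExtractedMemory("Preparing a conference talk", False, now + timedelta(days=7)))
profile.write(ExtractedMemory("Had a cold last week", False, now - timedelta(days=1)))
snapshot = profile.read(now)
```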

The managed API deployment shape enables three things that an in-process designer can’t directly copy but should understand. Cold-tier data can sit on cheaper hardware, object storage for raw bytes, a standard relational store for metadata and chunks, with the hot profile cached separately at the edge. The hot tier gets its own endpoint with its own SLA, separate from the search path. And the extraction prompt is centralised, which means tier assignment is consistent in a way that per-agent classification rarely is.

The trade-offs are real. A remote hot tier is only fast if the network is fast. The agent can’t override the engine’s tier classification. The extraction prompt is a black box. But the architectural pattern, hot tier as its own endpoint, cold tier on cheaper hardware, tier assignment at write time, is worth copying even if the deployment shape isn’t.


Promotion vs. fixed typology

mem9 is the contrast case that clarifies what tiering isn’t.

mem9 has a memory-type column with three values: pinned, insight, and digest. The pinned type is assigned by explicit content-write paths: these memories are manually created and protected from LLM reconciliation. The insight type is assigned to every LLM-extracted write: these memories are mutable, versionable, and supersedable. Pinned and insight memories both participate in the same hybrid recall with the same RRF scoring. The type field is a write-protection flag, not a retrieval filter and not a tier signal.

This is typology, not tiering. The distinction matters because the two are easy to conflate. Typology describes governance: who can mutate this memory, and under what conditions. Tiering describes access patterns: how frequently this memory is needed, and what storage representation best serves that frequency. A system can have both. supermemory’s isStatic is a tier signal, while a separate isInference flag is closer to a governance class. Conflating the two is where most of the ambiguity in practice comes from.

If you find yourself adding a type field to your memory rows, the question to ask is whether the field describes an access pattern or a governance class. If it’s an access pattern, you’re building tiering. If it’s a governance class, you’re building typology. Both are useful. They’re not the same thing.
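One way to keep the two honest is to give them separate fields and let each answer only its own question. A minimal sketch, with names that are mine rather than mem9’s or supermemory’s:

```python
from dataclasses import dataclass
from enum import Enum


class Governance(Enum):  # typology: who may change this memory
    PINNED = "pinned"
    INSIGHT = "insight"


class Tier(Enum):  # tiering: how often this memory is needed
    HOT = "hot"
    WARM = "warm"
    COLD = "cold"


@dataclass
class MemoryRow:
    text: str
    governance: Governance
    tier: Tier


def can_llm_rewrite(row: MemoryRow) -> bool:
    # A governance question: it never looks at the tier.
    return row.governance is not Governance.PINNED


def load_every_turn(rows: list[MemoryRow]) -> list[MemoryRow]:
    # An access-pattern question: it never looks at the governance class.
    return [row for row in rows if row.tier is Tier.HOT]


rows = [
    MemoryRow("Signed contract terms from 2022", Governance.PINNED, Tier.COLD),
    MemoryRow("User prefers short answers", Governance.INSIGHT, Tier.HOT),
]
every_turn = load_every_turn(rows)
```

The first row is protected but rarely needed; the second is needed constantly but free to be revised. A single type column could not express both.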


What flat costs you

Four concrete things follow from running a flat memory store.

The hot path pays the cold path’s search cost. Every retrieval scans the same index over the same items. The agent looking up the user’s name pays the same search cost as the agent looking up a meeting note from six months ago. At small scale this is invisible. At scale it’s a latency problem.

The cold path inflates the prompt with noise. Retrieval returns semantically near items, which includes standing context, superseded facts, and material that’s factually correct but irrelevant to the current turn. The model processes all of it. The signal-to-noise ratio in the context window degrades as the store grows.

There’s no mechanism for items to graduate. Memory isn’t static. A fleeting episodic fact from early in a relationship may, over time, become a durable signal about the user’s preferences or constraints. A flat store gives no machinery for noticing that transition. The item stays in the same representation it was written in, regardless of how its relevance has changed.

There’s no mechanism for items to age out. Demotion isn’t deletion. A flat store that wants to remove stale material must delete it. A tiered store can move it to a colder representation, still findable, no longer burdening the hot path. The flat store forces a binary choice that the tiered store doesn’t.
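The difference between demoting and deleting fits in a few lines. In this sketch the two tiers are plain dictionaries and idleness is the only demotion signal; the function names and the idle threshold are assumptions for illustration.

```python
def age_out(hot: dict, cold: dict, last_used: dict, now: float, max_idle_seconds: float) -> None:
    """Move idle items from the hot tier to the cold tier; nothing is deleted."""
    for key in list(hot):
        if now - last_used.get(key, 0.0) > max_idle_seconds:
            cold[key] = hot.pop(key)


def lookup(key: str, hot: dict, cold: dict):
    """Return (value, tier) so callers can see where the item was found."""
    if key in hot:
        return hot[key], "hot"
    if key in cold:
        return cold[key], "cold"
    return None, "missing"


hot = {"current_project": "billing migration", "old_project": "logo redesign"}
cold = {}
last_used = {"current_project": 990.0, "old_project": 100.0}
age_out(hot, cold, last_used, now=1000.0, max_idle_seconds=500.0)
found = lookup("old_project", hot, cold)
```

The stale item leaves the hot path but stays findable, which is the option a flat store cannot offer.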


The closing position

18 of the 19 systems implement tiering, gesture at it, or have explicit recommendations for it. The exception is mem9, which has a typology layer solving a different problem and would benefit from tiering on top of it.

If you’re building agent memory from scratch, the progression the 19 systems point to is this. Identify what the agent needs every turn: that’s your hot tier. Identify what it needs often but not always: that’s your warm tier. Identify what it should be able to find when needed but should never carry on every turn: that’s your cold tier. Pick a promotion mechanism: heat-based like MemoryOS, LLM-judgement like Hindsight, or extraction-time classification like supermemory. Pick a demotion mechanism: time decay, TTL, or freshness categories. Keep promotion and retrieval scoring separate at first; coupling is easier to add later than to unpick. Track provenance from upper tiers back to lower tiers, so you can always answer the question of where a synthesised belief came from.

The 19 systems aren’t unanimous on much. On this, they are: a single flat memory store is the wrong default for any non-trivial agent memory system.

Storage is cheap. Attention is expensive. Build the system that exploits the difference.


Next week: context budget management, how the 19 systems handle the constraint that the context window is finite, and what the ones that handle it well have in common.

Share & discuss

The X Article covers the same ground in a different form. The site version is the canonical one; the X version exists for the conversation in the replies.