A logistics company deployed a customer support agent with what looked like a reasonable memory architecture: each session loaded the last 5,000 tokens of conversation history directly into the context window. The model was a 32k-token LLM. Sessions never approached that limit. On paper, no problem.
Six weeks in, operators started getting complaints. The agent was contradicting itself: confirming a pickup time early in a session, then denying any such confirmation forty turns later. The team’s first instinct was hallucination. It wasn’t. The relevant confirmation was 28,000 tokens back in a 30,000-token window — sitting squarely in the middle of a dense thread. The model was technically attending to the correct context. It just wasn’t attending to it reliably. Liu et al. documented this pattern systematically: performance on multi-document retrieval degrades significantly when relevant information appears in the middle of long contexts, even for models explicitly designed for long-context use [1]. The context window was not the problem. The position was.
This is the failure mode that in-context memory produces at scale — not overflow, but degradation. And it is only one of three distinct ways agent memory can silently break. Each approach to memory has a different weak point, and the weak points don’t overlap. That asymmetry is what matters for practitioners.
In-Context Memory: Fast Until It Isn’t
In-context memory is the simplest approach: the agent’s conversation history, retrieved documents, or scratchpad state all live inside the active context window. There is no external storage, no retrieval step, no write latency. The agent reasons over everything it needs in a single forward pass.
For short-lived sessions — under a few thousand tokens, contained within a single task — this is the right choice. It is fast, coherent, and requires no infrastructure. The agent can reference any part of its working memory with equal fidelity because the attention mechanism has full access to the whole context.
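The whole approach can be sketched in a few lines. This is a minimal illustration, not a production design: the whitespace tokenizer and the `MAX_TOKENS` constant are stand-ins for a real model's tokenizer and context limit.

```python
MAX_TOKENS = 32_000  # illustrative stand-in for the model's context limit

def count_tokens(text: str) -> int:
    # Stand-in for a real tokenizer; word count approximates token count.
    return len(text.split())

class InContextMemory:
    """All agent state lives in the prompt itself: no store, no retrieval."""

    def __init__(self):
        self.turns: list[str] = []

    def add_turn(self, role: str, text: str) -> None:
        self.turns.append(f"{role}: {text}")

    def render_prompt(self) -> str:
        # Everything the agent "remembers" is concatenated into one prompt.
        return "\n".join(self.turns)

    def utilization(self) -> float:
        return count_tokens(self.render_prompt()) / MAX_TOKENS

mem = InContextMemory()
mem.add_turn("user", "Please confirm pickup at 9am.")
mem.add_turn("agent", "Confirmed: pickup at 9am.")
# Short sessions barely touch the window; the whole history is attendable.
```

Tracking a utilization number like this is cheap and worth doing even for simple agents, because it tells you when you are drifting toward the degradation zone discussed below.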
The failure mode is position-dependent degradation, and it arrives before you expect it.
The “lost in the middle” phenomenon identified by Liu et al. is well documented [1], and if you have read our context-window-distortion-agents post you have seen it in detail. What matters here is how it fails in practice: the agent does not refuse to answer or return an error. It answers confidently, drawing on the wrong parts of its context. From the outside, this looks like hallucination or inconsistency. The actual cause is architectural.
There are two thresholds to track:
Utilization threshold. In-context memory begins degrading well before the context window is full. The logistics example above used roughly 94% of a 32k window. The degradation is not linear — it is concentrated in the middle. An agent with a 128k context window and 100k tokens of history is not necessarily more reliable than one with a 32k window and 28k tokens of history. The utilization ratio matters less than where the relevant tokens sit: the deeper they are buried in the middle of a long context, the less reliably the model attends to them.
Session boundary. In-context memory is stateless across sessions by definition. If your agent restarts, the context is gone. Workarounds — serializing the context to disk, loading it at session start — help only until the accumulated history exceeds what fits usefully in the window. At that point you are already in external store territory, but without the infrastructure to support it.
The combination of these two thresholds defines the envelope for in-context memory: short sessions, contained tasks, no cross-session continuity requirements. Outside that envelope, you are accumulating a reliability debt that will surface as intermittent, hard-to-reproduce failures.
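One common way to stretch that envelope is the "trim to recent + salient" strategy: keep the most recent turns verbatim and retain only marked-important turns from the older history. A minimal sketch, where the whitespace token count and the keyword-based salience test are illustrative stand-ins for a real tokenizer and a real salience classifier:

```python
def count_tokens(text: str) -> int:
    return len(text.split())  # stand-in for a real tokenizer

def trim_history(turns: list[str], budget: int, recent_k: int = 5) -> list[str]:
    recent = turns[-recent_k:]   # always keep the tail verbatim
    older = turns[:-recent_k]
    # Illustrative salience test: keep older turns that record commitments.
    salient = [t for t in older if "confirm" in t.lower()]
    kept = salient + recent
    # If still over budget, drop the oldest salient turns first.
    while kept and count_tokens("\n".join(kept)) > budget:
        kept.pop(0)
    return kept

history = [f"turn {i}: routine chatter" for i in range(40)]
history.insert(10, "agent: Confirmed pickup at 9am.")
kept = trim_history(history, budget=60)
# The early confirmation survives trimming even though it sat "in the middle".
```

The point of the sketch is the shape, not the heuristic: salience detection in production is usually a model call or a structured event log, but the recent-plus-salient partition is the same.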
Retrieval-Augmented Memory: Precise Until It Goes Stale
Retrieval-augmented memory decouples the storage of past interactions from the context window. The agent writes observations, summaries, or documents to a vector store at the end of each turn or session. When it needs to recall something, it queries the store and injects the top-k retrieved chunks into context. This breaks the hard ceiling of in-context memory and makes cross-session continuity tractable.
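The write-then-query cycle can be sketched with a toy bag-of-words "embedding" and cosine similarity standing in for a real embedding model and vector database; the class and method names are illustrative:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class VectorMemory:
    def __init__(self):
        self.store: list[tuple[Counter, str]] = []

    def write(self, observation: str) -> None:
        # Called at the end of each turn or session.
        self.store.append((embed(observation), observation))

    def recall(self, query: str, k: int = 2) -> list[str]:
        # Top-k by similarity; the results are injected into context.
        q = embed(query)
        ranked = sorted(self.store, key=lambda e: cosine(q, e[0]), reverse=True)
        return [text for _, text in ranked[:k]]

mem = VectorMemory()
mem.write("user prefers overnight shipping")
mem.write("order 1042 delivered to warehouse dock B")
top = mem.recall("what shipping method does the user prefer?", k=1)
```

Note what is absent from this sketch: there is no timestamp, no versioning, and no notion of one observation superseding another. Those absences are exactly where the failure modes below come from.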
For agents handling large, diverse knowledge bases — or for systems where the volume of past interactions genuinely exceeds what any context window can hold — retrieval is the right foundation. It scales horizontally in a way in-context memory cannot.
The failure modes are different in character from in-context degradation: they are not about position effects or window size. They are about what gets retrieved and what doesn’t.
The deep dive on retrieval failure modes lives in our agent-memory-retrieval-failure post. The comparative angle here is: how does retrieval fail relative to the alternatives, and when does that failure show up?
Temporal staleness. When a user updates their account type, their shipping preferences, or their stated goals, the new information creates new embeddings. It does not automatically displace the old ones. Similarity search is not temporal search. An agent querying for “user’s preferred shipping method” six months after a preference change may retrieve both the old and new vectors, or only the old ones, depending on how similar the queries are to each embedding. This failure is invisible at write time. It surfaces as confident, wrong answers.
Recall gaps at scale. The LoCoMo benchmark — a dataset of very long-term conversations spanning up to 35 sessions and averaging 9,000 tokens per conversation — found that standard RAG systems struggled specifically with temporal and causal reasoning across sessions [2]. The problem is structural: retrieval based on semantic similarity cannot reliably reconstruct causal chains. If the agent needs to reason about a sequence of events (“the user updated their address after the order was placed, but before it shipped”), chunk-level retrieval will likely return relevant individual facts without surfacing the ordering relationship between them.
Relevance gaps. As the vector store grows, the signal-to-noise ratio of retrieved chunks decreases. At small scale, top-k retrieval returns obviously relevant chunks. At large scale — thousands of past sessions, tens of thousands of stored observations — semantically similar but contextually irrelevant content starts competing with the right chunks for the top-k slots. The agent’s effective working memory degrades not because information was lost, but because it is buried.
MemoryArena, a recent benchmark evaluating agent memory across interdependent multi-session tasks, found that agents performing well on long-context benchmarks like LoCoMo consistently struggled in scenarios where memory had to guide future actions rather than answer factual questions [3]. This distinction matters: retrieval-augmented systems are well-optimized for factual recall. They are not well-optimized for procedural continuity — remembering that a subtask is in progress, or that a constraint was established three sessions ago and should block a current decision.
External Store: Consistent Until It Isn’t
External store memory — relational databases, key-value stores, graph databases — treats agent state as structured data rather than dense vectors or raw token sequences. The agent reads and writes explicit records: user preferences, task states, entity relationships, completed steps. Retrieval is deterministic: a lookup by key returns exactly the record you wrote, not an approximation based on embedding distance.
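That deterministic contract is worth seeing next to the retrieval sketch above. A minimal illustration using an in-memory SQLite database; the table and column names are made up for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_state (task_id TEXT PRIMARY KEY, status TEXT)"
)
conn.execute(
    "INSERT INTO task_state VALUES (?, ?)", ("pickup-1042", "confirmed")
)
conn.commit()

# A key lookup returns exactly the record that was written --
# not an approximation ranked by embedding distance.
row = conn.execute(
    "SELECT status FROM task_state WHERE task_id = ?", ("pickup-1042",)
).fetchone()
# row == ("confirmed",)
```

A missing key returns nothing at all rather than the nearest neighbor, which is simultaneously the strength (unambiguous answers) and the weakness (the cold-start problem discussed below).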
For long-running agents managing complex, structured state — multi-step workflows, entity relationships, cross-session task continuity — external stores provide guarantees that neither in-context nor retrieval-based approaches can match. If you need to know whether a task was completed, a database query is unambiguous. A similarity search is not.
Packer et al.’s MemGPT system illustrated why tiered memory management matters: the context window alone cannot provide the extended, reliable state that complex agents require, and a naive append-only external store introduces its own coordination problems [4]. The failure modes are distinct from the other two approaches:
Write latency and consistency under concurrency. External stores introduce synchronous write operations into the agent’s execution path. For single-threaded, single-session agents, this is a non-issue. For agents running concurrent sessions — parallel tool calls, multi-agent pipelines, or any architecture where multiple agent instances share state — writes can race. Two instances that simultaneously check whether a step has been completed may both conclude it hasn’t, both execute it, and produce duplicate or conflicting results. This is a standard distributed systems problem, but it is invisible in development environments where single-agent testing never triggers it.
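The standard mitigation is optimistic locking: claim the step with a conditional update so that exactly one of the racing instances proceeds. A sketch using SQLite, with an illustrative schema and step name:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE steps (step_id TEXT PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO steps VALUES ('charge-card', 'pending')")
conn.commit()

def try_claim(step_id: str) -> bool:
    # The WHERE clause makes check-and-set a single atomic statement:
    # a second caller matches zero rows and is refused, instead of
    # re-checking stale state and re-running the step.
    cur = conn.execute(
        "UPDATE steps SET status = 'running' "
        "WHERE step_id = ? AND status = 'pending'",
        (step_id,),
    )
    conn.commit()
    return cur.rowcount == 1

first = try_claim("charge-card")   # True: this instance runs the step
second = try_claim("charge-card")  # False: the race is refused
```

The same pattern translates to conditional writes in key-value stores and compare-and-swap operations generally; the essential move is that the read and the write happen in one operation the store itself serializes.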
Cold-start for new entities. External stores know nothing about entities that have not been explicitly written to them. An in-context agent can reason about a new entity introduced in the current session. A retrieval-augmented agent can infer properties about it from related stored observations. An external store agent will return a null lookup and must handle it explicitly. For agents that frequently encounter novel entities — new users, new document types, new task categories — this cold-start cost is structural, not incidental.
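Because the null case is structural, it deserves an explicit policy rather than ad hoc `None` checks scattered through the agent. One possible shape, sketched with an illustrative default profile and a plain dict standing in for the store:

```python
DEFAULT_PROFILE = {"shipping_preference": "ground", "account_type": "standard"}

# Plain dict standing in for the external store.
store: dict[str, dict] = {
    "user-17": {"shipping_preference": "overnight", "account_type": "premium"},
}

def load_profile(user_id: str) -> dict:
    record = store.get(user_id)
    if record is None:
        # New entity: fall back to defaults and write them back, so the
        # cold-start path runs at most once per entity.
        store[user_id] = dict(DEFAULT_PROFILE)
        return store[user_id]
    return record

known = load_profile("user-17")  # existing record, returned as written
fresh = load_profile("user-99")  # null lookup handled explicitly
```

Whether the right fallback is defaults, a clarifying question to the user, or an escalation is application-specific; the point is that the decision is made once, in one place, before deployment.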
Schema rigidity. Structured stores require schemas. Schemas require decisions about what to track, made at design time, before the agent is deployed. When the agent’s requirements evolve — new task types, new entity relationships, new constraints — the schema must evolve with them. This is not a theoretical concern. In practice, it means that agents built on external stores accumulate migration debt in proportion to how much the underlying task domain changes. In-context and retrieval-based systems don’t have this property: they degrade gracefully when new information arrives because they don’t enforce structure.
Comparison Table
| Approach | Primary failure mode | Scale threshold | Observable symptom | Mitigation |
|---|---|---|---|---|
| In-context | Position-dependent degradation | ~50–70% window utilization | Inconsistency within a session; confident wrong answers | Trim to recent + salient; use compression (see agent-memory-compression-strategies) |
| Retrieval-augmented | Temporal staleness; recall gaps at scale | >10k stored chunks; long-horizon tasks | Wrong answers that were once correct; missing causal chains | Versioned embeddings; recency-weighted retrieval; explicit invalidation |
| External store | Concurrency races; cold-start failures; schema drift | Concurrent sessions; novel entity volume | Duplicate actions; null-lookup failures; silent migration errors | Optimistic locking; null-handling policies; schema versioning |
Decision Framework
These are not interchangeable style preferences with symmetric tradeoffs. Each choice forecloses certain options. Here are the forks that actually matter:
1. Does the agent need to remember anything across session boundaries? If no: in-context memory is almost always the right choice. It is faster, simpler, and has no infrastructure footprint. Any retrieval or external store overhead is pure cost with no benefit. If yes: proceed to question 2.
2. Is the information the agent needs to recall primarily factual (user preferences, past events, document contents), or procedural (task state, step completion, workflow continuity)? Factual recall favors retrieval-augmented memory. Procedural continuity favors external store. Mixing both often means both are needed, with in-context memory serving as the working buffer for the current session.
3. Will multiple agent instances ever share state simultaneously? If yes, external store is mandatory for any state that requires consistency. Retrieval-augmented systems handle concurrent reads acceptably; concurrent writes to shared state without explicit locking will produce races. In-context memory is inherently per-session and has no concurrency issue.
4. How often does your agent encounter entities it has never seen before? High novelty volume (new users at session start, new document types, new task categories) penalizes external stores. Cold-start cost is not latency — it is correctness. An agent that cannot look up an entity must handle the null case explicitly, and most implementations don’t. In-context and retrieval-based approaches can infer properties from context without explicit records.
5. What is the expected volume of stored state over the agent’s lifetime? Under a few thousand observations: in-context or small retrieval store. Ten thousand to hundreds of thousands: retrieval store with explicit staleness management. Highly structured state that must remain queryable and consistent over millions of records: external store.
6. Are your failure modes more acceptable as intermittent inconsistency (in-context), confident wrong recall (retrieval), or silent null returns (external store)? This is not a rhetorical question. Different downstream applications have different error tolerances. A customer support agent can tolerate occasional retrieval misses. A financial workflow agent cannot tolerate duplicate step execution. Choose the failure mode that your application can detect and recover from.
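The first four forks can be sketched as a decision function. This is a deliberate simplification: the inputs are booleans where real systems have gradients, and real deployments often combine approaches, with in-context memory serving as the working buffer either way.

```python
def choose_memory(cross_session: bool,
                  procedural: bool,
                  concurrent_writers: bool,
                  high_novelty: bool) -> str:
    if not cross_session:
        return "in-context"           # fork 1: no continuity requirement
    if concurrent_writers:
        return "external-store"       # fork 3: shared state needs consistency
    if procedural and not high_novelty:
        return "external-store"       # forks 2 and 4: structured task state
    return "retrieval-augmented"      # factual recall, or novelty-heavy domains

choose_memory(cross_session=False, procedural=False,
              concurrent_writers=False, high_novelty=False)  # → "in-context"
```

Forks 5 and 6 then act as checks on the output: volume can push a retrieval answer toward an external store, and an unacceptable failure mode can veto the whole branch.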
What Most Practitioners Get Wrong
The default mistake is treating memory as an infrastructure choice rather than a correctness choice. Teams pick in-context memory because it requires no setup, or retrieval because it seems like “the scalable option,” or external store because it looks like a real database. None of these are reasons.
The actual question is: what does failure look like in production, and can I detect it?
In-context memory fails silently — consistent-looking answers that are wrong because the relevant context was in the middle. Retrieval-augmented memory fails confidently — correct-looking answers that are stale. External store memory fails structurally — correct behavior until a new entity appears or concurrent writes race.
Of these, retrieval-augmented failure is the hardest to catch in testing. Unit tests and integration tests almost always run with small, fresh vector stores. Staleness failures require months of production use to emerge. Concurrency failures in external stores require load testing that most teams skip. In-context degradation is at least detectable with long-context test cases, if you write them.
The practitioners who get this right do two things differently. First, they test the failure modes explicitly — not just happy paths. Second, they choose memory architectures based on the failure modes their application can tolerate, not the ones that look best on a benchmark. A system that scores well on LoCoMo’s factual recall tasks can still fail badly on the procedural continuity tasks that MemoryArena reveals [3]. The benchmarks are not measuring what your agent needs to do in production. You are.
Footnotes
[1] Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2023). Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172. https://arxiv.org/abs/2307.03172

[2] Maharana, A., et al. (2024). Evaluating Very Long-Term Conversational Memory of LLM Agents. arXiv:2402.17753. https://arxiv.org/abs/2402.17753

[3] He, Z., Wang, Y., Zhi, C., Hu, Y., Chen, T.-P., Yin, L., Chen, Z., Wu, T. A., Ouyang, S., Wang, Z., Pei, J., McAuley, J., Choi, Y., & Pentland, A. (2026). MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks. arXiv:2602.16313. https://arxiv.org/abs/2602.16313

[4] Packer, C., Wooders, S., Lin, K., Fang, V., Patil, S. G., Stoica, I., & Gonzalez, J. E. (2023). MemGPT: Towards LLMs as Operating Systems. arXiv:2310.08560. https://arxiv.org/abs/2310.08560