Context Engineering: The Skill That Separates Reliable Agents from Brittle Ones
In session 47 of a long-running research fleet, an agent tasked with synthesizing a competitor landscape report failed — not because it lacked the right information, but because the right information had been appended to its context in the wrong order. The instruction to focus on enterprise pricing was there. The scraped data was there. The constraint to exclude consumer products was buried 14,000 tokens into a 16,000-token window. The model produced a comprehensive consumer breakdown with enterprise pricing mentioned in passing, then marked the task complete.
The agent had perfect information and terrible context.
This failure pattern is common. It is also underappreciated because practitioners spend the majority of their engineering effort on two problems that look adjacent but are not the same: what the agent is told (prompt engineering) and where the agent stores things it needs later (memory management). Context engineering — the ongoing operational discipline of managing what is actually in the window during a live session, in what structure, at what compression level, and in what order — sits in the gap between them. Getting it wrong is the single most reliable way to build an agent that passes evals and fails in production.
What Context Engineering Actually Is
Prompt engineering produces the static instructions an agent starts with. It answers: what does this agent do, what is its persona, what are its constraints? These instructions are written once and evaluated at agent design time.
Memory management handles retrieval and storage across sessions. It answers: what should the agent remember across interactions, how do we embed and retrieve relevant facts, how do we prevent the memory store from becoming stale? Memory systems are architectural decisions about what lives outside the context window.
Retrieval-augmented generation (RAG) moves information from an external corpus into the context window at query time. It answers: which chunks of external knowledge are relevant to this specific query?
Context engineering is none of these. It is the discipline of managing what is in the active context window during a live, multi-step session. It answers: of everything the agent could have access to right now — system instructions, task specifications, tool outputs, prior turn history, retrieved facts, running state — what should be present, in what order, at what level of compression, and what should be removed or summarized to make room for what comes next?
The distinction matters because the failure modes are different. Prompt engineering failures show up immediately — the agent misbehaves from the first call. Memory failures show up across sessions — the agent forgets things it should know. Context engineering failures are insidious: the agent starts correctly, degrades mid-session, and by turn 30 is operating on stale facts, buried instructions, and a history so bloated it cannot effectively attend to any of it.
A useful mental model: system prompt is the agent’s standing orders. Memory is its filing cabinet. Context is its working desk — and context engineering is the discipline of keeping that desk organized under operational conditions.
Three Failure Modes That Break Context Reliability
1. Context Bloat: Dumping Everything In
The simplest and most common failure. An agent accumulates tool outputs, prior conversation turns, and retrieved documents without any compression policy. After 20 steps in a long-horizon task, the context window is full of everything that happened, not of what matters now.
The problem is not just token cost. Research on long-context LLM behavior consistently shows that models do not attend uniformly to content at different positions. Liu et al. [1] demonstrated that performance on multi-document question answering drops sharply when relevant content appears in the middle of a long context — a finding since replicated across tasks and model families. Bloated context does not just waste tokens; it actively degrades attention to the content that matters.
The ACON framework [2] — designed explicitly for long-horizon agent tasks — found that uncompressed agent context causes 26–54% unnecessary token overhead (measured in peak tokens), and that smaller models show up to a 46% performance improvement when operating on compressed histories rather than raw accumulated context. The failure is not model weakness. It is that models were not designed to operate as perfect-retrieval engines over arbitrarily long, unstructured histories.
A concrete bloat scenario: a coding agent that appends every tool response verbatim. After 30 file reads and 5 failed test runs, the context includes 8,000 tokens of prior file contents that have since been edited, making them both incorrect and attention-consuming.
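For this specific pattern, a scaffolding layer can prune superseded file reads before each call: keep only the most recent read of each file, since earlier copies are stale by definition. A minimal sketch — the `file_path` key on tool-output turns is an assumed convention for this illustration, not a standard field:

```python
def prune_stale_file_reads(turns: list[dict]) -> list[dict]:
    """Keep only the most recent read of each file; drop superseded copies.

    Turns that are not file reads (no 'file_path' key) pass through untouched,
    so the reasoning thread of the session is preserved.
    """
    latest_read: dict[str, int] = {}
    for i, turn in enumerate(turns):
        path = turn.get("file_path")
        if path is not None:
            latest_read[path] = i  # later reads overwrite earlier ones
    return [
        turn for i, turn in enumerate(turns)
        if turn.get("file_path") is None or latest_read[turn["file_path"]] == i
    ]
```

Applied after every tool call, this keeps the window's view of the codebase consistent with its current state instead of its history.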
2. Context Ordering Effects: Burying Critical Instructions
Most practitioners understand primacy and recency effects in broad strokes. What they underestimate is how dramatically ordering affects reliability in multi-turn agent sessions.
He et al. [3] found that prompt template formatting — including structural choices about where content appears and how it is organized — causes up to 40% performance variation in LLM task completion, with smaller models showing stronger sensitivity. This is not a finding about exotic prompting techniques. It is a finding about where in the input your content sits and how it is delimited.
The ordering problem compounds in agent sessions because the context is dynamic. The system prompt appears at the top. But as the session accumulates tool outputs and turn history, task-critical instructions — the constraints, the scope limits, the output format requirements — sink toward the middle of the window. By the time the agent is executing step 15 of a 20-step task, its most recent tool output is at the bottom (high attention) and its core task specification is somewhere in the middle (degraded attention).
The failure mode from the opening of this post is precisely this: a constraint buried in the middle of a long context, not re-anchored as the session grew.
3. Context Staleness: Acting on Outdated State
Context staleness occurs when the working context contains facts that were true at session start but have since changed — and the agent is not aware of the divergence.
This is not a memory retrieval failure. The stale fact is already in the context window. The problem is that nothing has replaced it or flagged it as superseded. A document status that was “draft” at turn 1 becomes “approved” at turn 12 after a tool call, but the agent’s prior reference to it (“the draft requires revision”) remains in the history. An agent reading its own prior turn outputs as ground truth will inherit its past errors.
Staleness failures are particularly dangerous in long-horizon planning tasks. The agent at step 18 may be operating on a plan it formulated at step 3, without registering that 6 of its earlier assumptions have been invalidated by subsequent tool results. Vasilopoulos [4], studying 283 development sessions of an agent operating on a 108,000-line codebase, documented repeated failure patterns where agents re-introduced previously resolved issues because their working context did not explicitly flag what had changed — only what had been done.
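One way to make staleness explicit is to track session facts in a small versioned store, so that a tool result that changes a fact marks the old value as superseded instead of letting it coexist silently with the new one. A minimal sketch, with all names hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class FactStore:
    """Tracks current facts and remembers which earlier values were superseded."""
    facts: dict[str, str] = field(default_factory=dict)
    superseded: list[tuple[str, str]] = field(default_factory=list)

    def update(self, key: str, value: str) -> None:
        # Record the old value before overwriting, so it can be flagged.
        if key in self.facts and self.facts[key] != value:
            self.superseded.append((key, self.facts[key]))
        self.facts[key] = value

    def staleness_banner(self) -> str:
        """A block to inject near the top of context, flagging what changed."""
        if not self.superseded:
            return ""
        lines = [f"- '{k}' is no longer '{v}'" for k, v in self.superseded]
        return "SUPERSEDED FACTS (do not rely on these):\n" + "\n".join(lines)
```

Injecting the banner on each call tells the model which of its own prior statements to distrust — the piece that plain history accumulation never provides.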
Production Patterns That Work
Rolling Compression Windows
Rather than letting context grow unbounded, implement a rolling compression policy that summarizes older turns while preserving anchor facts.
The approach:
- Define a compression trigger — e.g., when context exceeds 60% of the model’s context window.
- Summarize turns older than the most recent N turns into a structured state block.
- Preserve anchor facts explicitly: these are facts that were relevant at session start and must not be distorted by summarization (task goal, hard constraints, critical state).
The ACON framework [2] implements this through a two-phase approach: first compressing environment observations (tool outputs, retrieved content) and then compressing interaction history, with separate compression policies for each. The key insight is that observation compression and history compression have different requirements. Observations can often be reduced to their actionable conclusion. History compression must preserve the reasoning thread.
A minimal implementation in Python:
```python
def compress_context(turns: list[dict], anchor_facts: dict, max_tokens: int = 4000) -> list[dict]:
    """
    Compress older turns once context exceeds threshold.
    Preserves the last 5 turns verbatim; summarizes earlier ones.
    Anchor facts are injected as a structured block at the top.
    """
    if estimate_tokens(turns) < max_tokens * 0.6:
        return turns
    recent = turns[-5:]
    to_compress = turns[:-5]
    summary = summarize_turns(to_compress)  # LLM call: compress to key decisions/state changes
    return [
        {"role": "system", "content": format_anchor_facts(anchor_facts)},
        {"role": "system", "content": f"[Session summary to this point]: {summary}"},
        *recent,
    ]
```
The critical discipline: anchor facts are never summarized away. They are re-injected explicitly after each compression.
Structured Context Slots
Unstructured context accumulation is the root cause of most ordering failures. The fix is to define explicit slots that map to positions in the context window, with stable ordering enforced by the scaffolding layer, not left to natural turn accumulation.
A production slot schema:
```python
CONTEXT_SCHEMA = {
    "system_instructions": 0,   # Never moves. Agent identity, standing constraints.
    "task_specification": 1,    # Current task goal and success criteria.
    "current_state": 2,         # Mutable: updated as facts change mid-session.
    "recent_tool_outputs": 3,   # Last 3 tool results only; older ones compressed.
    "turn_history": 4,          # Summarized prior turns.
}
```
The scaffolding layer assembles the actual context window on each call from these slots in schema order, rather than by naive turn accumulation. This guarantees that task_specification is always in position 1 regardless of session length. It also makes context state inspectable: you can dump the current state of each slot independently, which matters for debugging.
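The assembly step itself can be sketched as follows, taking the schema as a parameter (the message shapes and slot contents are illustrative):

```python
def assemble_context(slots: dict[str, list[dict]], schema: dict[str, int]) -> list[dict]:
    """Rebuild the message list in schema order on every call.

    `slots` maps slot names (as in the schema) to lists of messages.
    Missing or empty slots are simply skipped, so ordering is stable
    regardless of which slots happen to be populated this turn.
    """
    messages: list[dict] = []
    for name in sorted(schema, key=schema.get):
        messages.extend(slots.get(name, []))
    return messages
```

Because the window is rebuilt from slots each turn rather than appended to, inspecting a slot in isolation is enough to debug a misbehaving call.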
State Anchoring: What Must Stay at Top and Bottom
Primacy and recency effects are real and exploitable. Use them deliberately.
Top of context (primacy position):
- Agent identity and standing constraints
- Current task goal and hard constraints (not just at session start — re-inject on each call if the session is long)
- Any constraint that, if ignored, causes an unrecoverable failure
Bottom of context (recency position):
- The most recent tool output
- The immediate next step instruction
- Any fact that must be in the model’s “working memory” for the next single action
Middle (compressible):
- Turn history
- Earlier tool outputs
- Background context that is useful but not critical to the next action
The key discipline: do not let task constraints drift to the middle. Re-anchor them at the top on each call once a session exceeds 10 turns.
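Re-anchoring can be sketched as a wrapper applied before every model call once the turn count passes a threshold. All names here are illustrative; the tag string is just a convention for finding a previously injected copy:

```python
def reanchor(messages: list[dict], constraints: str, turn_count: int,
             threshold: int = 10) -> list[dict]:
    """Re-inject hard constraints near the top once a session grows long.

    Any previously injected copy is dropped first, so constraints never
    accumulate or drift into the middle of the window.
    """
    ANCHOR_TAG = "[HARD CONSTRAINTS]"
    if turn_count <= threshold:
        return messages
    kept = [m for m in messages if not m["content"].startswith(ANCHOR_TAG)]
    anchor = {"role": "system", "content": f"{ANCHOR_TAG} {constraints}"}
    # System identity stays in position 0; constraints go right after it.
    return [kept[0], anchor, *kept[1:]]
```

The remove-then-insert step is what makes this safe to apply on every call: the operation is idempotent, so the anchor stays in one place no matter how long the session runs.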
Context Versioning: Reset vs. Patch
Long-running agent sessions face a decision point that has no equivalent in prompt engineering: when has the context accumulated enough drift that it is cheaper and safer to reset than to patch?
A context patch is a targeted injection — adding a correction or update to the current context. A context reset is starting a fresh session with a structured state handoff: summarize what was accomplished, extract the current ground truth, and begin from a clean window.
Signs that reset beats patch:
- More than 3 anchor facts have been superseded and the old versions remain in the history
- The agent is referencing its own prior outputs as ground truth for claims that have since been invalidated
- Context is above 70% capacity and compression has already been applied twice
A state handoff template:
```python
def build_handoff_context(session: Session) -> str:
    return f"""
## Session Handoff
**Task**: {session.original_task}
**Completed steps**: {session.completed_steps_summary}
**Current ground truth**:
{format_facts(session.current_facts)}
**Outstanding constraints**:
{format_constraints(session.active_constraints)}
**Next action**: {session.next_step}
"""
```
This structure forces the scaffolding layer to be explicit about what it is carrying forward. Anything not in the handoff is intentionally discarded.
Comparison Table
| Intervention | Problem It Solves | Implementation Cost | When to Apply |
|---|---|---|---|
| Rolling compression windows | Context bloat; token overflow; degraded attention on long histories | Medium — requires a summarization call and compression trigger logic | Sessions exceeding 15–20 turns; any long-horizon task |
| Structured context slots | Ordering instability; critical instructions drifting to middle of window | Low — scaffolding layer change, no model calls | All multi-turn agents; apply from the start |
| State anchoring (top/bottom) | Primacy/recency neglect; constraint drift in long sessions | Low — ordering discipline in scaffolding | Any session > 10 turns; especially when hard constraints exist |
| Context reset with handoff | Accumulated staleness; multiple superseded anchor facts; compounding errors | Medium — requires structured state extraction | When compression has been applied 2+ times; when stale facts outnumber current ones |
| Observation compression | Tool output bloat; verbose API responses flooding context | Low-Medium — requires per-tool summarization rules | Any agent using tool calls with verbose outputs (web fetch, file reads, API responses) |
Hard Conclusion
Most practitioners treat context as a container for relevant information. The implicit model is: put in what the agent needs, it will find what it needs within that content, and the task will proceed correctly. This model is wrong, and the mismatch is the dominant cause of agent reliability failures in production.
Context is not a container. It is the agent’s working memory — and like working memory, its reliability depends not just on what is present but on how it is organized, what is foregrounded, and what has been deliberately cleared. A human expert given 50 relevant documents scattered on a desk will perform worse than one given the same documents with the 5 most critical ones on top, the stale drafts removed, and the hard constraints pinned in view.
The research bears this out. He et al. [3] found 40% performance variance from structural formatting choices alone, before any content change. Kang et al. [2] found that agents operating on compressed, well-structured context outperform agents with full uncompressed histories by up to 46% on long-horizon tasks. The content is the same. The structure is different. The outcome diverges sharply.
The practitioners who build reliable agents have internalized this: context engineering is a first-class operational discipline, not a concern to address when things go wrong. They define their context schema before they write their first tool. They set compression triggers before they see their first overflow. They test context state at each turn, not just task outcome at session end.
Practitioners who skip this step build agents that pass evals — which typically run short sessions with clean context — and fail in production, where sessions are long, context accumulates, and the distance between what the agent is attending to and what actually matters grows with every turn.
The difference between a reliable agent and a brittle one is usually not the model, the tools, or the memory system. It is the 200 lines of scaffolding that nobody reviewed because context engineering does not have a benchmark yet.
Footnotes
1. Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2023). Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics. https://arxiv.org/abs/2307.03172
2. Kang, M., Chen, W.-N., Han, D., Inan, H. A., Wutschitz, L., Chen, Y., Sim, R., & Rajmohan, S. (2025). ACON: Optimizing Context Compression for Long-horizon LLM Agents. https://arxiv.org/abs/2510.00615
3. He, J., Rungta, M., Koleczek, D., Sekhon, A., Wang, F. X., & Hasan, S. (2024). Does Prompt Formatting Have Any Impact on LLM Performance? NAACL 2025. https://arxiv.org/abs/2411.10541
4. Vasilopoulos, A. (2026). Codified Context: Infrastructure for AI Agents in a Complex Codebase. https://arxiv.org/abs/2602.20478