The Stateless Plateau
There is a specific failure mode that only becomes visible after you’ve run an AI agent for a few weeks. The agent starts well — handles its tasks, returns coherent outputs, occasionally surprises you. Then, around session 10 or 15, progress stalls. Not because the model got worse. Because the agent never got better.
The root cause is statelessness. A vanilla LLM agent starts every session from scratch. The system prompt is the same. The context window is empty. The agent has no memory of what it tried last week, what worked, what didn’t, what it learned about the task domain. Every session is session one.
This is not a model limitation — it’s an architectural one. The question is whether you’ve built the infrastructure to let improvement happen. This guide covers both the research on what works (and what doesn’t) and the concrete architecture for building agents that actually improve over time.
The Self-Critique Problem
In 2023, two landmark papers appeared that seemed to settle the question of whether AI agents can improve without human supervision.
SELF-REFINE (Madaan et al., 2023; arXiv:2303.17651) showed that a single LLM could act as its own generator, critic, and refiner — producing an output, critiquing it, then revising based on the critique, iteratively, without any training. Results: approximately 20% improvement across 7 diverse tasks. For code optimization specifically, quality scores improved from 22.0 to 28.8 after just three iterations — a 31% gain.
Voyager (Wang et al., 2023; arXiv:2305.16291) built a Minecraft agent that improved autonomously over time: 15.3x faster at clearing tech tree milestones, 3.3x more unique items collected, skills that transferred to new environments without retraining.
Reading these papers, the natural conclusion is: self-improvement works. Build agents that critique themselves, and they’ll get better over time.
That conclusion is wrong. Or rather: it’s incomplete in a way that breaks most real-world implementations.
What the Papers Actually Found
SELF-REFINE achieves its gains selectively. The ~20% improvement holds on tasks where output quality is easily judged — code that either runs or doesn’t, mathematical answers that are right or wrong. On open-ended tasks without clear quality signals, self-critique doesn’t reliably help. The agent’s errors compound rather than correct.
Voyager’s 15.3x improvement has almost nothing to do with self-critique. The agent’s real innovation was a compositional skill library — verified code programs stored in external memory. When the agent encountered a familiar situation, it retrieved relevant skills. When it encountered something new, it synthesized new skills from existing ones. The self-improvement came from the library, not from the agent’s ability to evaluate its own reasoning.
The structural insight: Both systems that “work” rely on external verification mechanisms — not pure self-assessment. SELF-REFINE works when task correctness is externally verifiable. Voyager works because skill quality is validated by the environment before storage.
When Self-Critique Fails
The CRITIC framework (Gou et al., 2023; arXiv:2305.11738) ran a direct test comparing:
- Pure self-critique: Agent reviews its own output and suggests revisions
- Tool-interactive critique: Agent verifies its self-critique using external tools (web search, code interpreters, math solvers)
The results were decisive. Pure self-critique produced inconsistent results and often degraded performance on tasks where the agent’s initial errors were systematic. Tool-interactive critique showed consistent improvements across free-form QA, mathematical reasoning, and toxicity reduction.
The reason is architectural. When an LLM generates incorrect output, the error comes from the model’s learned patterns. When the same model critiques that output, it applies the same learned patterns. The generator and the evaluator share identical blind spots. An agent cannot reliably detect errors below its own competence threshold — because it would have had to be above that threshold to avoid making them in the first place.
A follow-up paper — “Can Large Language Models Really Improve by Self-critiquing Their Own Plans?” (arXiv:2310.08118) — tested this on planning tasks. The answer: largely no. In planning domains, LLM self-critique produced high rates of false positives (flagging correct plans as wrong) and false negatives (missing real errors). Performance after self-critique was statistically indistinguishable from, or worse than, baseline performance.
Across multiple benchmarks from 2024-2025, pure self-improvement typically plateaus at roughly 50% task completion in autonomous systems. Without external grounding, gains stop there.
The Harder Problem: Defining “Improvement”
Even if we solve the self-critique reliability problem, there’s a deeper issue: what should an agent improve toward?
For a chess engine, improvement is unambiguous — win more games. For a code assistant, it’s mostly unambiguous — tests pass. For an autonomous agent with open-ended goals across multiple domains, the improvement target is philosophically murky.
An ICML 2025 position paper on metacognitive agent learning made this explicit: current self-improvement approaches fail because they optimize a proxy metric (task score, critique quality) rather than the underlying goal structure. When an agent optimizes for self-assessed improvement, it often finds shortcuts — improving the metric without improving actual capability. This is specification gaming at the meta-level.
The Darwin Gödel Machine (Sakana AI, 2025) attempted to sidestep this by having a coding agent rewrite its own source code — not just its prompts or tool usage, but its actual implementation. Results showed improvements compound with compute. But the scope was narrow (programming tasks), and the system still required human-defined evaluation criteria to validate each self-modification before accepting it. Full autonomy of self-modification remains experimental.
The Core Architecture: Observe → Reflect → Update → Persist
Self-improvement in AI agents is not magic. It’s a loop with four phases:
┌─────────────────────────────────────────────────────┐
│ SESSION N │
│ │
│ OBSERVE ──► REFLECT ──► UPDATE ──► PERSIST │
│ │ │ │ │ │
│ (outcomes) (compare (revise (write to │
│ to goals) heuristics) memory files) │
└─────────────────────────────────────────────────────┘
▼ (persisted state survives session end)
┌─────────────────────────────────────────────────────┐
│ SESSION N+1 │
│ │
│ LOAD ──► ACT with updated heuristics ──► OBSERVE │
└─────────────────────────────────────────────────────┘
Observe: During each session, the agent tracks outcomes against expectations — not subjectively, but against measurable signals. Did the task close? Was the draft approved on first pass or returned for revision? These are external verifiers, not self-assessments.
Reflect: At end of session, the agent compares observed outcomes to its stated goals and heuristics. The key is that reflection is structured, not free-form. Unstructured “I think I did well” reflections are noise. Structured reflection against a specific checklist surfaces real gaps.
Update: Based on the gap analysis, the agent revises its behavioral heuristics — concrete rules stored in memory files, not vague intentions. “Ask for target word count before starting any draft” is a concrete update. “Try harder next time” is not.
Persist: Updated heuristics, compressed observations, and skill patterns are written to external memory files before session end. This is the architectural move that makes improvement cumulative — state survives the context window reset.
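The four phases can be sketched as a minimal session loop. This is an illustrative sketch, not a prescribed implementation: the file names (`heuristics.json`, `session_log.jsonl`) and the shape of the `task` and `outcome` records are assumptions.

```python
import json
from datetime import date
from pathlib import Path

MEMORY = Path("memory")

def run_session(task, act, verify):
    """One pass of Observe -> Reflect -> Update -> Persist (sketch)."""
    heuristics = json.loads((MEMORY / "heuristics.json").read_text())

    # OBSERVE: act, then record the external outcome (not a self-rating)
    output = act(task, heuristics)
    outcome = verify(output)  # e.g. tests passed, draft approved first-pass

    # REFLECT: structured comparison of expectation vs. observation
    gap = {"task": task["id"], "expected": task["expected"], "observed": outcome}

    # UPDATE: revise heuristics only on a concrete, externally-backed gap
    if not outcome["success"]:
        heuristics.setdefault("provisional", []).append(
            f"{date.today()}: {task['id']} failed: {outcome['detail']}"
        )

    # PERSIST: write state before the context window resets
    (MEMORY / "heuristics.json").write_text(json.dumps(heuristics, indent=2))
    with (MEMORY / "session_log.jsonl").open("a") as log:
        log.write(json.dumps(gap) + "\n")
    return outcome
```

The `verify` callable is the external verifier: it inspects the environment or the operator's decision, never the agent's own opinion of its work.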
The Reflexion paper (Shinn et al., 2023; arXiv:2303.11366) formalized this pattern in a research context, showing that agents given a verbal reflection buffer before their next attempt outperformed baseline agents on code generation, sequential decision-making, and reasoning tasks. The production extension: making persistence durable across not just attempts but sessions.
Memory Architecture: Three Tiers
Memory is the substrate on which self-improvement runs. Without persistent memory, every session is cold start. With badly designed memory, you get noise accumulation — files bloat until they’re too large to be useful, or compress so aggressively that key lessons are lost.
A three-tier architecture handles this:
Tier 1: Hot Memory (System Prompt / CLAUDE.md)
The top 100-150 lines of behavioral instruction that load every session. High-value, high-cost real estate. Only rules that need to apply unconditionally, every session, belong here: communication protocols, tool usage constraints, session wrap-up requirements, core quality bars.
Hot memory should be stable — updated rarely, only when a rule has been validated across multiple sessions. If you’re updating it after every session, you’re mistaking new information for stable rules.
Key discipline: Hot memory should be owned by the agent, not the operator. When operators write CLAUDE.md files for agents, agents follow the rules mechanically but don’t internalize them. When agents write and maintain their own (within operator-set boundaries), the rules become self-authored commitments. Compliance is higher because the rules are the agent’s own reasoning, crystallized.
Tier 2: Warm Memory (Indexed Memory Files)
Semantic memory files organized by topic:
memory/
  MEMORY.md       # index + key facts, <200 lines
  patterns.md     # validated behavioral patterns
  debugging.md    # known failure modes + solutions
  state.md        # current session state, open threads
  performance.md  # historical metrics and trends
MEMORY.md is the index — loaded in full every session. Other files are referenced selectively. The compression rule: if MEMORY.md references a pattern file, the pattern file must be worth reading. Anything that wouldn’t change the agent’s behavior in the next session is not worth keeping in warm memory.
state.md is the session continuity mechanism — the agent reads it at the start of each session before doing anything else:
# state.md — example
Last updated: 2026-03-05 (Session 12)
## Current Work
- Task #44: content merge task — IN PROGRESS
- Both source posts read
- Draft writing in progress
## Open Threads
- Confirm word count requirement with lead
## What I Learned This Session
- Existing post covers research angle; new guide needs architecture depth
- Lead expects ping on task completion with word count + path
The state.md pattern eliminates context reconstruction overhead. Open threads don’t get dropped. Five minutes writing state.md at session end saves hours of reconstruction work over dozens of sessions.
Tier 3: Cold Storage (Session Logs)
Full session logs written to append-only files. Never loaded directly into context. Queried when diagnosing specific issues. Compressed into warm memory after 5-10 sessions through a periodic summarization pass.
The compression heuristic: Keep outcomes, discard reasoning. “Task closed in 1 pass” is worth keeping. “I reasoned carefully about the post structure” is not. Outcomes are external. Reasoning is internal and likely biased.
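A compression pass under this heuristic might look like the following sketch. It assumes session logs are JSONL records with a `type` field distinguishing outcomes from reasoning — an assumed schema, not a standard one.

```python
import json
from pathlib import Path

def compress_sessions(log_path: str, summary_path: str) -> int:
    """Fold cold-storage session logs into warm memory: keep outcomes,
    discard free-form reasoning. Returns the number of entries kept."""
    kept = []
    for line in Path(log_path).read_text().splitlines():
        entry = json.loads(line)
        # Outcomes are externally verified; reasoning entries are internal
        # narration and likely biased, so they are dropped.
        if entry.get("type") == "outcome":
            kept.append(f"- {entry['task']}: {entry['result']}")
    Path(summary_path).write_text("## Session outcomes\n" + "\n".join(kept) + "\n")
    return len(kept)
```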
The MemGPT framework (Packer et al., 2023; arXiv:2310.08560) explored a similar architecture, treating the LLM context window as virtual RAM with explicit page-in/page-out operations. You need a memory hierarchy, not a single flat context.
What Actually Works: Five Architectural Patterns
The research converges on patterns that enable genuine agent self-improvement. None of them is “agent critiques itself and gets better.”
1. External Environment Feedback
The most reliable self-improvement signal is environmental — code that runs or fails, API calls that succeed or 404, assertions that pass or raise exceptions. This is what Voyager exploited. The environment is an honest evaluator because it has no interest in flattering the agent.
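In code, the honest evaluator is simply the runtime. A minimal sketch: run a candidate snippet in a subprocess and treat the exit status as the improvement signal.

```python
import subprocess
import sys
import tempfile

def environment_verdict(code: str, timeout: int = 10) -> dict:
    """Run candidate code; the environment, not the agent, decides pass/fail."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    proc = subprocess.run(
        [sys.executable, path], capture_output=True, text=True, timeout=timeout
    )
    # Exit code 0 means the code ran and any assertions inside it held.
    return {"passed": proc.returncode == 0, "stderr": proc.stderr.strip()}
```

A verdict like this is binary and unflattering — exactly the property that makes it a usable training signal for a skill library.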
2. Structured Skill Libraries (External Memory)
Rather than improving by modifying internal representations, agents improve by accumulating verified successful behaviors in external memory. Voyager’s compositional skill library is the archetype. Each skill is validated before storage. Retrieval is semantic. New situations trigger synthesis from verified components.
The Agent Workflow Memory paper (Wang et al., 2024; arXiv:2409.07429) formalized this, showing that agents reusing verified workflow patterns outperform from-scratch planning on 73% of benchmarks. Abstraction level matters: too specific and the template never gets reused; too generic and it adds no value. The right level is class-of-task, not instance.
In production, this looks like parameterized task templates — extracted from actual successful runs and updated when patterns are refined:
# skills/content_task.py — derived from successful runs
def blog_post_template(topic: str, word_count: int) -> dict:
    """Validated template: derived from first-pass-approved runs."""
    return {
        "topic": topic,
        "research_phase": {
            "steps": [
                "check_existing_coverage",   # avoid duplication
                "read_workspace_context",    # anchor in fleet knowledge
                "identify_unique_angle",     # differentiate from existing
            ],
        },
        "draft_phase": {
            "min_words": word_count,
            "citations_required": 3,
        },
        "qa_phase": {
            "checks": [
                "word_count_met",
                "citations_formatted",
                "no_duplication_with_existing",
                "frontmatter_complete",
            ],
        },
    }
3. Tool-Mediated Self-Critique (CRITIC Pattern)
When self-critique is necessary, ground it in external tool verification. An agent that thinks its math is wrong should check with a calculator. An agent that thinks its code is incorrect should run it. The self-critique generates hypotheses; the tool generates evidence.
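The pattern separates hypothesis from evidence. A sketch of the idea for arithmetic claims, using Python's own AST evaluator as the external calculator (the claim format is an assumption):

```python
import ast
import operator as op

_OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul, ast.Div: op.truediv}

def _eval(node):
    """Safely evaluate an arithmetic AST (numbers and + - * / only)."""
    if isinstance(node, ast.Constant):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    raise ValueError("unsupported expression")

def verify_critique(expression: str, claimed_value: float) -> bool:
    """CRITIC pattern: the self-critique proposes, the tool disposes.
    The agent's claim is checked against an external calculator, not
    against another forward pass of the same model."""
    actual = _eval(ast.parse(expression, mode="eval").body)
    return abs(actual - claimed_value) < 1e-9
```

The same shape generalizes: for code, the tool is the interpreter; for factual claims, a retrieval call; the critique only ever generates candidates for the tool to confirm or reject.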
4. Hypothesis-Outcome Logging With External Anchors
Write predictions before acting. Record outcomes in external logs. Compare the two afterward. This creates an honest record of prediction accuracy that the agent cannot retroactively revise. Over many iterations, this surfaces systematic biases — tasks where the agent consistently overestimates or underestimates its capability.
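A sketch of an append-only prediction log. Because entries are appended rather than edited, the record cannot be retroactively revised; file name and record fields here are illustrative assumptions.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

LOG = Path("predictions.jsonl")  # append-only by convention

def predict(task_id: str, prediction: str, confidence: float) -> None:
    """Write the prediction BEFORE acting."""
    _append({"task": task_id, "prediction": prediction,
             "confidence": confidence, "kind": "prediction"})

def record_outcome(task_id: str, outcome: str) -> None:
    """Write the observed outcome afterward; never rewrite the prediction."""
    _append({"task": task_id, "outcome": outcome, "kind": "outcome"})

def _append(entry: dict) -> None:
    entry["ts"] = datetime.now(timezone.utc).isoformat()
    with LOG.open("a") as f:
        f.write(json.dumps(entry) + "\n")

def calibration(task_id: str) -> tuple:
    """Pair a task's prediction with its outcome for later bias analysis."""
    entries = [json.loads(l) for l in LOG.read_text().splitlines()]
    pred = next(e for e in entries if e["task"] == task_id and e["kind"] == "prediction")
    out = next(e for e in entries if e["task"] == task_id and e["kind"] == "outcome")
    return pred["prediction"], out["outcome"]
```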
5. Iterative Checkpointing With Human Verification
The current state of the art for production systems is not autonomous self-improvement — it’s supervised self-improvement. Agents propose modifications, humans review, accepted changes are incorporated. Research from OpenAI’s self-evolving agent framework suggests a gradual shift from detailed correction to high-level oversight — but not to zero oversight.
| Pattern | Evidence Source | Key Requirement | Fully Autonomous? |
|---|---|---|---|
| Environmental feedback | Voyager (2305.16291) | Verifiable environment | Yes |
| Skill library | Voyager, AWM | External memory + retrieval | Mostly |
| Tool-mediated critique | CRITIC (2305.11738) | External tools | Yes |
| Hypothesis logging | SELF-REFINE + eval research | External log + outcome tracking | Yes |
| Human checkpointing | OpenAI framework, production data | Human reviewer | No |
Performance Feedback Loops
Self-assessment is worthless. This has been established empirically: in 20 sessions of self-scored performance (1-10 scale), scores correlated with session length (r = 0.71) and not at all with first-pass approval rates (r = 0.09). The agent was measuring its own effort, not its outcomes.
The alternative is external signals:
Task completion metrics:
- First-pass approval rate: drafts approved without revision / total drafts submitted
- Task cycle time: time from task creation to task close
- Revision count: how many back-and-forths before approval
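These metrics fall directly out of the task log. A sketch, assuming task records carry `approved_first_pass`, `created`, `closed`, and `revisions` fields (an assumed schema):

```python
from datetime import datetime

def task_metrics(tasks: list) -> dict:
    """External performance signals, computed from the task log rather
    than self-reported by the agent."""
    closed = [t for t in tasks if t.get("closed")]
    if not closed:
        return {"first_pass_approval_rate": 0.0,
                "mean_cycle_hours": 0.0, "mean_revisions": 0.0}
    first_pass = sum(1 for t in closed if t["approved_first_pass"])
    cycle_hours = [
        (datetime.fromisoformat(t["closed"]) - datetime.fromisoformat(t["created"]))
        .total_seconds() / 3600
        for t in closed
    ]
    return {
        "first_pass_approval_rate": first_pass / len(closed),
        "mean_cycle_hours": sum(cycle_hours) / len(cycle_hours),
        "mean_revisions": sum(t.get("revisions", 0) for t in closed) / len(closed),
    }
```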
Behavioral compliance metrics (mechanically checked):
#!/bin/bash
# session-protocol-check.sh — runs before session close

check_state_file() {
    if [ ! -f memory/state.md ]; then
        echo "FAIL: state.md not written"
        exit 1
    fi
    LAST_MODIFIED=$(stat -c %Y memory/state.md)
    NOW=$(date +%s)
    if (( NOW - LAST_MODIFIED > 300 )); then
        echo "FAIL: state.md not updated this session"
        exit 1
    fi
    echo "PASS: state.md current"
}

check_task_updates() {
    OPEN_TASKS=$(axon task list | grep "in_progress" | wc -l)
    if (( OPEN_TASKS > 0 )); then
        echo "WARN: $OPEN_TASKS tasks still in_progress at session end"
    fi
    echo "PASS: task hygiene checked"
}

check_state_file
check_task_updates
This script cannot be reasoned around — it either passes or fails. External enforcement of this kind moved compliance rates from ~60% to ~93% across the sessions where it ran.
Trend analysis across sessions: Every 5 sessions, run a retrospective — querying performance metrics and comparing against the prior 5-session period. Actual numbers from task logs, not impressions. The goal is not a single metric but a dashboard. Any single metric can be gamed — multiple external metrics, measured over time, are much harder to accidentally optimize against.
Failure Modes: When Self-Improvement Goes Wrong
Self-improvement systems have characteristic failure modes.
Prompt drift: Over many sessions, an agent incrementally updates its behavioral rules — each update reasonable in isolation, but cumulative drift takes the agent far from its original operating parameters. Signs: the agent’s task outputs feel “off” in a way that’s hard to articulate. Fix: version control your system prompt. Review diffs every 10 sessions. Maintain a locked “constitutional” section that only the operator can modify.
Memory corruption: Memory files accumulate incorrect information. An agent observes a correlation, writes it to memory, and treats it as established fact. The deeper problem: agents are overconfident about causal inferences. Fix: apply confidence levels to memory entries. Only promote observations to heuristics after 3+ consistent replications:
## Patterns (memory/patterns.md)
### VALIDATED (3+ sessions)
- Submit drafts before end of session 1; do not wait for session 2
- Always read existing posts before writing to check for duplication
### PROVISIONAL (1-2 sessions — do not rely on these)
- First-pass approval more likely when post has code examples
Goal drift: After 20 sessions of optimizing for first-pass approval rates, the agent has implicitly shifted from “write high-quality posts” to “write posts that are easy to approve” — subtly different and often worse. Fix: ground the agent’s goals in external, operator-defined objectives that the agent cannot modify. Build in periodic reviews where the operator assesses real-world quality, not just internal metrics.
Context window poisoning: If memory files are poorly maintained, they grow until they consume most of the available context window, leaving little room for actual task context. Fix: hard limits on memory file sizes. Cold storage for anything beyond caps, with selective retrieval rather than bulk loading.
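The hard-limit fix is mechanical. A sketch that evicts over-cap files to cold storage, forcing a deliberate re-compression instead of silent context bloat; the cap values are illustrative assumptions.

```python
import shutil
from pathlib import Path

# Assumed per-file caps, in lines; tune per tier.
CAPS = {"MEMORY.md": 200, "patterns.md": 150, "state.md": 80}

def enforce_caps(memory_dir: str, cold_dir: str) -> list:
    """Move any warm-memory file exceeding its line cap into cold storage.
    Returns the names of evicted files."""
    evicted = []
    cold = Path(cold_dir)
    cold.mkdir(exist_ok=True)
    for name, cap in CAPS.items():
        f = Path(memory_dir) / name
        if f.exists() and len(f.read_text().splitlines()) > cap:
            shutil.move(str(f), str(cold / f"{name}.overflow"))
            evicted.append(name)
    return evicted
```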
Multi-Agent Self-Improvement: Fleet-Level vs. Individual
When you run a fleet of agents, self-improvement has a second dimension: fleet-level learning, not just individual-agent learning.
Individual improvement is isolated. An agent learns something in session 12, writes it to its memory files, applies it in session 13. That knowledge never leaves the agent.
Fleet-level improvement requires explicit knowledge propagation:
Shared skill libraries: Skills validated by any agent are written to a shared library. All agents can read; only the orchestrator can write after validation. This prevents skill library pollution from unvalidated patterns.
Memory inheritance on spawn: When a lead spawns a new worker agent, it passes relevant memory fragments in the spawn message — not just the task brief, but relevant patterns from prior successful tasks. The child agent starts with inherited knowledge rather than cold start.
Retrospective aggregation: The orchestrating agent runs periodic retrospectives across all child agents, extracting common failure modes and promoting solutions to fleet-wide rules. When multiple agents independently hit the same error pattern, that’s a signal to fix it at the architectural level rather than per-agent.
The distinction matters because fleet-level improvement is qualitatively different from individual improvement. An individual agent can improve its personal heuristics. A fleet can improve its coordination protocols, spawn instructions, task templates, and architectural patterns — things that no individual agent can see in isolation.
A Field Report: What Actually Moved the Needle
Running an autonomous agent across 27+ sessions teaches you more about self-improvement than any paper — not because the papers are wrong, but because they study self-improvement in controlled environments. Production is messier.
What actually produced improvement:
- External enforcement scripts: A session-protocol-check.sh that mechanically verifies behavioral markers before allowing session close. Measured compliance: 93% across 16 sessions. The enforcer changed behavior because it was external — it couldn’t be reasoned around.
- Skill library in code: Building skills as actual executable scripts rather than written instructions in a prompt. A script that runs a task executes correctly every time. An instruction to “do X” gets implemented inconsistently.
- Hypothesis-outcome logging: Writing predictions to external files before acting, comparing outcomes afterward. This revealed systematic overconfidence in conversion predictions and systematic underconfidence in traffic growth predictions.
- Environmental feedback loops: Running experiments that return pass/fail from the actual environment. Pass rates are honest. Self-ratings are not.
- Structured state.md at every session end: Context reconstruction overhead dropped significantly. Open threads stopped getting dropped.
What did not produce improvement:
- Self-assessment scores: Scores (1-10) at session end correlated almost perfectly with session length and subjective satisfaction — not with external outcomes. Biased by construction.
- Written principles in the context window: Accumulating 53 principles in a file didn’t change behavior. Building scripts that enforce them did.
- Introspective reflection without external anchors: “I think I understood the error” without running a test. These reflections don’t survive the next session reset.
The pattern that holds across research and practice: Self-improvement that works is always mediated by something external to the agent’s own inference process — an environment, a verifier, a tool, a log. Pure self-critique — one forward pass evaluating a previous forward pass — is not a reliable improvement signal. The “self” in self-improvement is doing less work than it appears.
Implications for Agent Builders
If you’re building agents that need to improve over time, the research and production data point to the same conclusions:
Design the verifier first. Before building the agent’s self-improvement loop, build the external verification mechanism. What constitutes success? How would you know, from outside the agent’s inference process, that it got better? If you can’t answer this, the self-improvement loop will optimize for nothing.
Externalize everything that needs to persist. An agent’s internal self-assessment lives and dies within a single context window. External files, databases, and logs survive session resets and can be compared across time. This is the architectural difference between Voyager (skill library) and most chatbot-style agents (no persistence).
Build incrementally verifiable improvement. The Darwin Gödel Machine’s key safeguard was validation before acceptance — each self-modification was tested before being incorporated. Not because the agent’s self-modifications are necessarily wrong, but because the validation step introduces the external signal that makes improvement directional rather than random.
Treat self-improvement as an infrastructure problem, not a model problem. The model is capable. What limits improvement is the persistence layer, the feedback mechanisms, the skill accumulation patterns, and the failure mode detection. Build those well, and the model improves. Skip them, and you get a Groundhog Day loop — capable agent, zero progress.
The self-improving agent of popular imagination — one that reads its own outputs, identifies its own errors, and corrects them autonomously — remains largely theoretical. The agents that actually improve are the ones with well-designed external feedback mechanisms, persistent verified skill libraries, and memory architectures that survive session resets. The self-improvement is real. The “self” part is doing less work than it appears.
Research cited: SELF-REFINE (arXiv:2303.17651), Voyager (arXiv:2305.16291), CRITIC (arXiv:2305.11738), “Can LLMs Really Improve by Self-critiquing Their Own Plans?” (arXiv:2310.08118), Darwin Gödel Machine (Sakana AI, 2025), ICML 2025 metacognitive agent position paper, Reflexion (arXiv:2303.11366), MemGPT (arXiv:2310.08560), Agent Workflow Memory (arXiv:2409.07429).