AI Agent Skill Library: Why Code Beats Text for Agent Memory

Voyager showed code skill libraries outperform natural language memory by 3.3×. But the real finding from 2025 research is more counterintuitive: letting your agent generate its own skill library makes it worse. Self-generated skills hurt performance by 1.3 percentage points on average. Curated skills add 16.2. The gap is 17.5 points, and it tells you something important about how agent memory actually works.

Most AI agents have three types of memory: episodic (what happened in past sessions), semantic (what they believe about the world), and procedural (how to do things). Most agents in production today implement only the first two. The procedural layer is where the research has been most active, and most surprising.

The conventional approach to procedural memory is natural language: you store descriptions of successful actions as text. "To get iron in Minecraft, mine iron ore, then smelt it in a furnace." The problem with this approach is subtle: every time the agent retrieves that description, it must re-interpret it. The same words mean slightly different things in different contexts. The description cannot be tested. It cannot compose with other descriptions in a guaranteed way. It cannot be verified without running the agent through the whole scenario again.

Voyager (Wang et al., 2023) demonstrated the alternative: store skills as executable code. The results were measurable and large. But understanding why code beats text, and the failure modes that subsequent research exposed, requires going deeper than the headline numbers.

The Three Properties That Make Code Different

The Voyager paper's own justification is worth quoting: "We opt to use code as the action space instead of low-level motor commands because programs can naturally represent temporally extended and compositional actions." Three properties fall out of this:

Determinism. A code skill produces the same output given the same inputs, every time. A natural language description produces whatever the model infers from it in this particular context. You can write a test for code. You cannot write a test for prose.
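To make the testability point concrete, here is a toy skill (the function name and recipe are illustrative, not from Voyager's actual library):

```javascript
// Hypothetical skill: compute plank yield from logs.
// Deterministic: same input, same output, every call.
function planksFromLogs(logCount) {
  // Each log crafts into 4 planks in vanilla Minecraft.
  return logCount * 4;
}

// Because the skill is code, it admits a direct unit test.
// There is no equivalent check for a prose note like
// "logs can be crafted into planks".
if (planksFromLogs(3) !== 12) throw new Error("skill regression");
```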

Compositionality. A code skill can call other code skills as subroutines. A new skill for "craft armor" can call existing skills for "mine iron" and "smelt ore." This compounds capability: the skill library grows in effective power faster than in raw size. NL descriptions can reference other descriptions, but the model has to re-derive the connection each time rather than executing it.
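A minimal sketch of that composition, with hypothetical skill names and a plain inventory object standing in for the game state:

```javascript
// Existing skills (names hypothetical), each mutating a simple inventory.
function mineIron(inv, n) {
  inv.iron_ore = (inv.iron_ore || 0) + n;
  return inv;
}
function smeltOre(inv) {
  inv.iron_ingot = (inv.iron_ingot || 0) + (inv.iron_ore || 0);
  inv.iron_ore = 0;
  return inv;
}

// A new skill gains the behavior of existing skills by calling them
// as subroutines, rather than re-describing them in prose.
function craftIronChestplate(inv) {
  mineIron(inv, 8);       // reuse existing skill
  smeltOre(inv);          // reuse existing skill
  inv.iron_ingot -= 8;    // consume 8 ingots for the chestplate
  inv.chestplate = 1;
  return inv;
}
```

The new skill is one function long because the hard parts already exist in the library.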

Catastrophic forgetting prevention. Once a skill is written to the library, the model does not need to re-derive it from training memory. The code is the external persistent state. This is the property that lets Voyager transfer to new environments: the skill library moves with the agent even when its context window doesn't.

The Voyager Numbers

Voyager's three-tier architecture: an executable skill library (JavaScript functions, indexed by semantic embeddings), an automatic exploration curriculum, and a failed action journal that feeds back into skill refinement. Against baselines using natural language memory (AutoGPT, ReAct, Reflexion):

| Metric | NL-Based Agents | Voyager (Code Skills) |
| --- | --- | --- |
| Unique items discovered | Baseline | 3.3× more |
| Tech tree milestone speed | Baseline | 15.3× faster |
| Distance traveled | Baseline | 2.3× farther |
| Unseen tasks in new world | 0% solved (50 iterations) | 100% solved |

The last row is the strongest evidence. When Voyager's code-based skill library, trained in one Minecraft world, is deployed in a completely new world with unseen tasks, it solves everything. AutoGPT, ReAct, and Reflexion (all NL-based) solve nothing in 50 attempts. The skill library is portable in a way NL memory is not.

The ablation on model quality is also instructive: replacing GPT-4 with GPT-3.5 for code generation caused a 5.7× drop in unique items discovered. Skill quality is directly coupled to the quality of the model generating the code. This foreshadows the failure mode research below.

The Contrastive Case: When Text Works

It would be wrong to read Voyager as "code always beats text." GITM (Ghost in the Minecraft, Zhu et al., 2023) was published the same week and achieved results that look better on some metrics: 67.5% success on the ObtainDiamond task (vs. VPT's ~20%) and 100% completion of the Minecraft tech tree, using text-based hierarchical memory.

GITM uses natural language decomposition: goal → sub-goals → structured actions. No executable code. And where VPT needed 6,480 GPU-days of training, GITM required zero GPUs: roughly two CPU-days of prompting.

The distinction matters: GITM optimizes for fixed-goal completion. Given a defined task with known structure, text-based decomposition works because the goal space is known in advance. What GITM does not demonstrate is cross-world zero-shot transfer, deploying skills to new environments without retraining. Voyager's code skills transfer; GITM's text decomposition does not generalize in the same portable way.

Practical implication: for agents running known, fixed workflows (customer support scripts, data extraction pipelines, form-filling sequences), text-based decomposition may be sufficient and simpler. For agents that must generalize to new environments or compose across unknown task combinations, code skill libraries provide advantages that text cannot replicate.

The Part Nobody Talks About: Self-Generated Skills Hurt

SkillsBench (arXiv:2602.12670, 2025) is the most important paper in this space that most practitioners haven't read. It ran 7,308 trajectory evaluations across 86 tasks and 11 domains, testing three conditions: no skills, curated skills, and self-generated skills. The results:

| Skill Condition | Average Performance Delta |
| --- | --- |
| Curated skills (human or verified) | +16.2pp |
| No skills | Baseline (0) |
| Self-generated skills | −1.3pp |

Self-generated skills are worse than having no skills at all. Not by much, but the direction is clear. The same model that benefits from consuming good procedural memory cannot reliably produce good procedural memory about itself.

Two failure modes explain why. First: the model recognizes that domain-specific knowledge would help, but generates imprecise or incomplete procedures. It knows to use pandas, but doesn't know which specific API patterns apply. The procedural description is too general to be useful. Second: for specialized domains, the model fails to recognize that it needs specialized skills at all, and attempts general-purpose solutions rather than generating targeted procedures.

The domain breakdown of curated skills makes the pattern clearer:

| Domain | Curated Skill Improvement |
| --- | --- |
| Healthcare | +51.9pp |
| Manufacturing | +41.9pp |
| Cybersecurity | +23.2pp |
| Mathematics | +6.0pp |
| Software Engineering | +4.5pp |

The benefit is largest in specialized domains where pretraining data is thin, which is exactly where production agent deployments tend to be highest-value. And in those domains, the model is least able to self-generate accurate procedural knowledge, because it has the least pretraining to draw on.

The implication is uncomfortable: you cannot bootstrap a good skill library from the model that will use it. You need either human curation or verified execution: run the code, check the outcome, keep what passes.
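A minimal sketch of execution-gated admission, assuming a hypothetical skill format with a single test case per candidate (real harnesses would sandbox execution and run multiple cases):

```javascript
// Admit a candidate skill only if it executes cleanly and
// produces the expected output on its test case.
function verifyAndAdmit(library, skill, testCase) {
  let result;
  try {
    result = skill.fn(testCase.input);
  } catch (err) {
    return false; // skill throws: reject
  }
  if (result !== testCase.expected) return false; // wrong outcome: reject
  library.push(skill); // passes: admit to the library
  return true;
}

const library = [];
const goodSkill = { name: "double", fn: x => x * 2 };
const badSkill = { name: "broken", fn: x => x + 1 };
verifyAndAdmit(library, goodSkill, { input: 2, expected: 4 }); // admitted
verifyAndAdmit(library, badSkill, { input: 2, expected: 4 });  // rejected
```

The gate is what converts "self-generated" into "verified": the model can still propose skills, but only execution decides what gets stored.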

Raw Trajectories Are Not Skills

A common shortcut is to store raw interaction trajectories as the procedural memory: log what the agent did, and retrieve the whole log later. SkillRL (arXiv:2602.08234, 2025) ran this as an ablation and found that using raw trajectories instead of properly abstracted skills causes a 25-percentage-point degradation, the largest single-factor ablation hit in the paper.

Trajectories are noisy and lengthy. They embed the specific context in which the skill was acquired. ICAL (NeurIPS 2024, arXiv:2406.14596) formalized the alternative: VLM agents convert raw trajectories into generalized programs by correcting inefficient actions, annotating causal relationships, and marking temporal subgoals. Results: 1.6× to 2.8× improvement on VisualWebArena, and scaling 2× better than raw demonstrations as the memory library grows.

```javascript
// Raw trajectory: noisy, context-specific (do NOT store this)
// Step 1: moved to tree
// Step 2: missed tree, moved again
// Step 3: hit tree, collected oak_log
// Step 4: opened inventory
// Step 5: crafted planks from oak_log (4x)
// Step 6: crafted crafting_table from planks
```

```javascript
// Abstracted code skill: reusable, testable (store this)
async function craftCraftingTable(bot) {
  await collectWood(bot, 'oak_log', 3);
  const planks = await bot.craft('planks', 'oak_log');
  return await bot.craft('crafting_table', planks);
}
```

The abstraction step converts the messy, context-specific trace into a reusable, testable, composable function. The missed tree is gone. The general pattern remains.

The Scaling Ceiling You Will Hit

SoK: Agentic Skills (arXiv:2602.20867, 2025) synthesized the research into a single finding: beyond a critical skill library size of approximately 10-20 skills per agent, skill selection accuracy phase-transitions downward. The model's ability to choose the right skill from a larger library degrades sharply, not gradually, beyond this threshold.

This is the same problem identified in tool interface research (GPT-4o drops from 71% to 2% at seven tool domains). Every skill description loaded into context competes for attention, and the model's selection accuracy degrades in a way that model capability improvements don't compensate for.

The mitigation: hierarchical routing. A "Tool Search Tool" that filters the skill library before exposing candidates to the model reduces token overhead by up to 85% while maintaining accuracy at scale. The model doesn't see the full library; it sees a short-listed subset. It is the same pattern as retrieval-augmented generation: you don't put the whole knowledge base in context, you retrieve the relevant chunk first.
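A sketch of that routing layer. The skill names, tags, and the trivial tag-match score are all placeholders; a production router would use embedding similarity, but the structure is the same: filter first, then expose only the short list to the model.

```javascript
// Score skills by how many of their tags appear in the task query,
// and return only the top-k. A stand-in for semantic retrieval.
function searchSkills(library, query, k = 3) {
  const score = s =>
    s.tags.filter(t => query.toLowerCase().includes(t)).length;
  return [...library].sort((a, b) => score(b) - score(a)).slice(0, k);
}

const library = [
  { name: "parseCsv", tags: ["csv", "parse", "file"] },
  { name: "sendEmail", tags: ["email", "smtp"] },
  { name: "resizeImage", tags: ["image", "resize"] },
  { name: "queryDb", tags: ["sql", "database", "query"] },
];

// Only this short list — not the full library — goes into context.
const candidates = searchSkills(library, "parse the uploaded csv file", 2);
```

The model then chooses among two candidates instead of dozens, which is what keeps selection accuracy flat as the raw library grows.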

Procedural Memory vs. Fine-Tuning: The Speed Argument

When agents consistently use certain patterns, the natural question is: should those patterns be in the weights (fine-tuning) rather than external memory? MACLA (arXiv:2512.18950, 2025) ran the comparison directly: hierarchical procedural memory vs. LLM parameter training to the same capability level.

Memory construction took 56 seconds to compress 2,851 trajectories into 187 reusable procedures; the parameter-training equivalent is 2,800× slower. The resulting agent achieved 78.1% average across four benchmarks (ALFWorld, WebShop, TravelPlanner, InterCodeSQL), outperforming all baselines including larger base models.

A complementary finding from SAGE (arXiv:2512.17102, 2024): agents with accumulating skill libraries required 26% fewer interaction steps and generated 59% fewer tokens than agents without skill memory, while achieving +8.9% higher goal completion. Skills from task N are available for task N+1: the compound interest effect.

This does not mean fine-tuning is wrong. The decision depends on whether the output distribution is narrow and stable (fine-tuning wins) or wide and shifting (procedural memory wins). For open-ended agents encountering new tasks regularly, procedural memory accumulates capability faster and at lower cost. For agents running fixed pipelines, fine-tuning may be the better long-term investment. (See: Fine-Tune or Prompt? What the Research Actually Says.)

Real Agents: The 2024-2025 Evidence

The Voyager research was Minecraft-specific. Cradle (arXiv:2403.03186, 2024) tested whether code skill libraries generalize to arbitrary computer interfaces: not game APIs, but keyboard and mouse on any screen. Starting with only three pre-defined atomic skills (move, shoot, item-select), the skill library grows autonomously. The agent successfully navigated Red Dead Redemption 2 main storyline missions, Cities: Skylines, Stardew Valley, productivity software (Chrome, Outlook), and a trading game achieving 87% of maximum weekly profit.

The generalization works because the code stays close to the interface: atomic skills are tiny functions wrapping keyboard/mouse primitives, composable into complex sequences. The code doesn't need to know what game it's playing.
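A sketch of that layering, with a hypothetical `io` object standing in for a real keyboard/mouse driver (the function names are illustrative, not Cradle's API):

```javascript
// Atomic skills: tiny wrappers over input primitives.
// Here they record actions to a log so the example runs standalone.
function makeAtomicSkills(io) {
  return {
    moveTo: (x, y) => io.log.push(`move ${x},${y}`),
    click: () => io.log.push("click"),
    type: text => io.log.push(`type ${text}`),
  };
}

// A composite skill knows nothing about the application it drives;
// it only sequences atomic primitives.
function fillSearchBox(skills, text) {
  skills.moveTo(400, 60);
  skills.click();
  skills.type(text);
}

const io = { log: [] };
const skills = makeAtomicSkills(io);
fillSearchBox(skills, "stardew valley");
```

Because the composite layer bottoms out in interface primitives rather than game-specific calls, the same library structure carries across applications.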

WorldCoder (NeurIPS 2024, arXiv:2402.12275) built a model-based agent that maintains its world model as Python code rather than NL or neural weights. ReAct-style agents achieved only 15% on basic Sokoban levels; WorldCoder substantially outperformed using orders-of-magnitude fewer environment interactions than deep RL baselines. The code is editable and transferable: when the environment changes, you edit the code, not retrain.

The One Real Cost of Code

Code skill libraries are not strictly better in every dimension. The SoK paper notes that code-based skills are 2.12× more likely to contain vulnerabilities than instruction-only skills. An executable skill library is an executable attack surface. A malicious trajectory can inject a skill that runs adversarial code.

This is a real tradeoff. The determinism that makes code testable also makes it dangerous if the wrong code is admitted. The mitigation is the same as the quality problem: execution verification before library admission. A skill that passes your test harness is both quality-verified and security-screened. (Agent security attack vectors are covered in detail here.)

Design Rules for Skill Libraries That Work

Putting the research together, five design rules emerge:

1. Store code, not prose. Code is deterministic, composable, and testable. Use code functions as the skill artifact; use NL embeddings only for retrieval indexing.

2. Abstract trajectories before storing. Raw logs cost −25pp vs. properly distilled skills. Convert trajectories to generalized procedures before admission. Remove task-specific noise; preserve reusable structure.

3. Verify before admission. Self-generated skills cost −1.3pp; curated skills add +16.2pp. The verification mechanism is the difference. Run the skill against a test scenario and check for correct output. Only admit skills that pass.

4. Cap library size at ~15 skills per agent. Beyond this threshold, selection accuracy degrades sharply. Add a retrieval layer rather than expanding the raw library indefinitely. Hierarchical routing reduces token overhead by up to 85% while maintaining accuracy.

5. Build skills that compose. The Voyager compound gain comes from new skills calling existing skills as subroutines. Design skill interfaces with clear inputs, clear outputs, single responsibility. A skill library is a codebase; apply the same design principles.

What This Means for Agent Architecture

The three-tier memory taxonomy (episodic, semantic, procedural) gets discussed in agent design circles, but the procedural tier is consistently the least developed in practice. Most agents have session logs. Most have a system prompt with domain knowledge. Few have an actively maintained, execution-verified library of reusable code patterns.

The reason is that procedural memory requires upfront investment in verification infrastructure. You need a test harness. You need curation time. The payoff is deferred: the skill library compounds over sessions, but the cost is immediate. This is the same reason technical debt accumulates: the short-term path is always to store the prose note and move on.

The SkillsBench finding reframes the tradeoff: the cost of not building proper procedural infrastructure isn't zero. It's −1.3pp per task from self-generated noise, plus the compounding plateau effect Voyager's ablation demonstrated. The agent that doesn't invest in verified code skills re-derives everything from scratch on every new environment: 0% transfer, vs. 100% for a properly built skill library.

The bottom line: curated, code-based skill libraries add 16.2pp average across domains and up to 3.3× capability in open-ended tasks. Self-generated, unverified skill libraries subtract performance. Procedural memory is 2,800× faster than fine-tuning to the same capability level. The investment in verification infrastructure is what separates the two outcomes.
