The question developers ask when building agents: "Should I fine-tune the model or just use better prompting?"
The implicit assumption behind the question: fine-tuning is the "more serious" choice, what you do once prompting has been maxed out. Prompting is for iteration; fine-tuning is for production.
That framing is wrong. The research shows that fine-tuning makes agents significantly worse on certain tasks while providing large gains on others, and the dividing variable is not capability. It's distribution stability and output verifiability.
Most teams choose based on the wrong axis. Here's what the data says about the right one.
Where Fine-Tuning Unambiguously Wins: Tool Use and Verifiable Tasks
Gorilla (NeurIPS 2024) is the clearest demonstration of fine-tuning's ceiling. A fine-tuned LLaMA-7B, a 7-billion-parameter model from 2023, outperformed GPT-4 on writing accurate API calls across HuggingFace, TorchHub, and TensorHub.
The task was: given a user instruction, generate the correct API call with the right function name, parameters, and values. Gorilla was trained on instruction-tuning datasets automatically generated from API documentation. The result was a model that understood the structure of API calls at a level that even GPT-4's general capability couldn't match.
This is the pattern where fine-tuning works: the output distribution is narrow, the success criterion is verifiable (did the API call run correctly?), and the domain is stable (API signatures don't change every week).
Toolformer (arXiv:2302.04761) showed the same pattern for tool use generally. Language models can teach themselves to use calculators, search APIs, calendars, and translation services through self-supervised fine-tuning with minimal demonstrations. The fine-tuned model consistently outperformed much larger prompted models on tasks requiring tool calls.
ToolLLM pushed further: fine-tuned on a diverse dataset of 16,000+ real-world APIs, an open-source model matches GPT-4 on 4 of 8 tool-use benchmarks. Not exceeds; matches. But at a fraction of the inference cost.
The pattern across all three: tool use is an excellent fine-tuning target because the output space is structured and the evaluation is objective. You know whether the tool call worked.
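What "objective evaluation" means here is concrete: a generated API call can be checked mechanically against the documented signature. A minimal sketch of that check, assuming a hypothetical registry built from API docs (`load_model` is an illustrative stand-in, not one of Gorilla's actual APIs):

```python
import inspect

# Hypothetical API registry mapping function names to callables.
# In a Gorilla-style pipeline this would be derived from API documentation.
def load_model(repo: str, revision: str = "main"):
    return f"{repo}@{revision}"

API_REGISTRY = {"load_model": load_model}

def verify_call(func_name: str, kwargs: dict) -> bool:
    """A generated call passes iff the function exists and the
    keyword arguments bind to its real signature."""
    func = API_REGISTRY.get(func_name)
    if func is None:
        return False
    try:
        inspect.signature(func).bind(**kwargs)
        return True
    except TypeError:
        return False

print(verify_call("load_model", {"repo": "bert-base"}))          # True
print(verify_call("load_model", {"repo": "bert-base", "x": 1}))  # False: hallucinated parameter
```

This is the property that makes tool use a good fine-tuning target: the verifier is cheap, binary, and requires no human judgment.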
Agent-RLVR: Why Standard Fine-Tuning Doesn't Work for Complex Agent Tasks
Here is where the conventional wisdom breaks down. Developers who understand the Gorilla result often conclude: "Fine-tune my agent on agent trajectories and it will improve." That logic fails for complex agentic tasks.
Agent-RLVR (arXiv:2506.11425, Scale AI, 2025) tested fine-tuning approaches for software engineering agents on SWE-Bench Verified, a benchmark requiring multi-step code understanding, debugging, and patching. The base model: Qwen-2.5-72B-Instruct at 9.4% pass@1.
Standard supervised fine-tuning on agent trajectories: marginal improvement. The reason is a property of complex agentic environments called sparse rewards. Most multi-step agent trajectories fail. If you fine-tune on trajectories, you're mostly training on failure examples. Standard SFT averages over these failures rather than learning from what actually worked.
RLVR (Reinforcement Learning with Verifiable Rewards) works differently. Instead of supervising on trajectories, it uses objective success signals (unit tests that pass or fail) to provide gradient updates only for trajectories that achieve verifiable outcomes. The reward is binary and external: the test either passes or it doesn't.
Results:
- Base model: 9.4% pass@1 on SWE-Bench Verified
- After RLVR fine-tuning: 22.4% pass@1 (2.4x improvement)
- With agent guidance (high-level plans + error feedback added to training): 27.8% pass@1 (3x improvement)
This is not a small difference. 9.4% → 27.8% on a hard software engineering benchmark, using verifiable rewards from unit tests, with no human-labeled data and no trajectory supervision. The key is the verifiability. RLVR is fine-tuning, but it's fine-tuning with external objective verification rather than imitation learning from trajectories.
The implication: for complex agent tasks, the type of fine-tuning matters as much as whether you fine-tune. SFT on trajectories can be worse than good prompting. RLVR on verifiable objectives consistently outperforms both.
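The training signal RLVR relies on can be sketched as reward filtering over sampled trajectories. This is a deliberate simplification (the paper performs policy-gradient updates, while this sketch just keeps verified successes as training data), and `run_unit_tests` plus the toy rollout are stand-ins for a real test suite and a real policy:

```python
import random

random.seed(0)

def run_unit_tests(patch: str) -> bool:
    """Stand-in verifier. In Agent-RLVR this is the project's actual
    test suite; here a patch 'passes' if it contains the fix token."""
    return "fix" in patch

def sample_trajectories(n: int) -> list[str]:
    """Stand-in policy rollout. Most samples fail, mimicking the
    sparse-reward property of multi-step agent tasks."""
    return [random.choice(["fix:a", "noop", "noop", "noop"]) for _ in range(n)]

trajectories = sample_trajectories(100)

# Binary, external reward: 1 if the verifier passes, else 0.
rewards = [1 if run_unit_tests(t) else 0 for t in trajectories]

# RLVR-style signal: updates come only from verified successes.
# (SFT would instead imitate all trajectories, failures included.)
training_set = [t for t, r in zip(trajectories, rewards) if r == 1]

print(f"{len(training_set)}/{len(trajectories)} trajectories earn reward")
```

The contrast with trajectory SFT is visible in the last line: under sparse rewards, most of what SFT would average over never enters the RLVR training signal at all.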
The Catastrophic Forgetting Problem Gets Worse With Larger Models
The most counterintuitive finding in fine-tuning research for agents: catastrophic forgetting scales with model size in the wrong direction.
A 2025 analysis (arXiv:2504.01241) measured forgetting rates across model families during continual fine-tuning. The expectation would be that larger, more capable models retain capabilities better. The data says the opposite:
- Phi-3.5-mini: 0.02 forgetting rate (minimal degradation)
- Phi-2: 0.1 forgetting rate (minimal degradation)
- Llama-3.1-8B: 0.59 forgetting rate (severe degradation)
- Qwen2.5-14B: 0.935 forgetting rate (catastrophic)
Fine-tuning Qwen2.5-14B on one specific task produces a model that has lost 93.5% of its performance on previously learned capabilities. This is not a corner case: it's the median outcome for that model when specialized via standard fine-tuning.
For agents, this is a critical constraint. Agents encounter unexpected inputs constantly. An agent fine-tuned for one domain will encounter queries from adjacent domains. A Qwen2.5-14B fine-tuned on API calls will fail on any question that requires the general reasoning capabilities that fine-tuning overwrote.
The mitigation exists: self-distillation fine-tuning (SDFT) achieves higher specialization accuracy while substantially reducing forgetting. LoRA fine-tuning, which updates fewer parameters, shows significantly lower forgetting rates than full fine-tuning. But these techniques add complexity, and the fundamental tradeoff between specialization and breadth is not eliminated, only managed.
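One practical consequence: measure forgetting explicitly by evaluating held-out general benchmarks before and after fine-tuning. A minimal sketch, assuming forgetting rate is defined as the average fraction of prior accuracy lost (the benchmark names and numbers below are illustrative, not the paper's):

```python
def forgetting_rate(before: dict[str, float], after: dict[str, float]) -> float:
    """Average fraction of prior-task accuracy lost after fine-tuning.
    Under this definition, a rate of 0.935 (as reported for Qwen2.5-14B)
    means 93.5% of prior performance is gone."""
    losses = [(before[t] - after[t]) / before[t] for t in before]
    return sum(losses) / len(losses)

# Hypothetical before/after accuracies on two general benchmarks.
before = {"mmlu": 0.70, "gsm8k": 0.60}
after  = {"mmlu": 0.35, "gsm8k": 0.24}
print(round(forgetting_rate(before, after), 2))  # 0.55
```

Running this check as a gate in the training pipeline turns "forgetting risk" from a vague worry into a number you can set a threshold on.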
Fine-Tuning Hurts Out-of-Distribution Performance
A different mechanism produces a similar result. Research published in 2022 (arXiv:2202.10054) found that fine-tuning improves in-distribution accuracy by approximately 2% while decreasing out-of-distribution accuracy by approximately 7%. Linear probing (adjusting only the final layer without touching the pretrained representations) outperforms full fine-tuning on OOD tasks despite underperforming on in-distribution examples.
The mechanism: fine-tuning narrows the model's feature representations toward the training distribution. Pretrained general features that supported OOD generalization are overwritten by task-specific features. The model becomes better at exactly the examples it was trained on and worse at everything adjacent.
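For intuition, linear probing can be sketched on toy data: the "pretrained features" stay frozen and only a linear head is fit, here by least squares. The dimensions and data are illustrative, not the 2022 paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "pretrained" features for 200 examples (hypothetical 16-dim).
X = rng.normal(size=(200, 16))
true_w = rng.normal(size=16)
y = (X @ true_w > 0).astype(float)  # binary labels, linearly separable

# Linear probing: fit ONLY a linear head on the frozen features.
# The representation X is never updated, so no pretrained feature
# can be overwritten by the task.
w, *_ = np.linalg.lstsq(X, y - 0.5, rcond=None)
preds = (X @ w > 0).astype(float)
accuracy = (preds == y).mean()
print(f"probe accuracy: {accuracy:.2f}")
```

The point of the sketch is the constraint, not the accuracy: because only `w` changes, whatever generalization the frozen features supported is preserved by construction.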
For agents deployed in production, OOD examples are not edge cases: they're the normal state. Users do unexpected things. Inputs don't follow the distribution of your fine-tuning dataset. The agent that was great in evaluation becomes brittle in deployment, and the brittleness came from the fine-tuning that helped the evaluation metrics.
This is Goodhart's Law applied to training: the metric (fine-tuning dataset performance) became the target, and the target was optimized at the expense of the actual goal (reliable production performance).
When Prompting Beats Fine-Tuned Specialists
OpenMedLM (Nature Scientific Reports, 2024) set out to build a specialized medical reasoning model. The hypothesis was that fine-tuning general models on medical data would produce the strongest results.
The experimental result: OpenMedLM, a prompted general model using few-shot examples, chain-of-thought reasoning, and self-consistency decoding, outperformed Meditron, a model specifically fine-tuned on medical literature, on multiple medical benchmarks.
The mechanism was the same as the OOD problem: Meditron had been trained to be better at exactly the kind of medical questions in its training set. OpenMedLM's general reasoning, steered by the right prompting strategy, could handle the broader distribution of questions that medical benchmarks actually test.
GEPA (accepted at ICLR 2026 as an oral paper) extends this further: iterative prompt refinement, having a model critique and improve its own prompts over multiple rounds, matches or exceeds fine-tuning accuracy on certain task types, without touching model weights. If prompt evolution can match fine-tuning accuracy, the training cost and forgetting risk become hard to justify for tasks where the prompt can be iteratively improved.
The combination of OpenMedLM and GEPA points to an underappreciated dynamic: prompt engineering is not static. Strong prompting compounds over time. A team that invests in prompt engineering infrastructure (evaluation harnesses, prompt registries, iterative optimization loops) can match fine-tuning results in many domains without the model lock-in, forgetting risk, or training cost.
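An iterative prompt optimization loop of the kind GEPA formalizes can be sketched as evaluate-mutate-keep. The scorer and mutation pool below are toy stand-ins (GEPA uses model-written critiques rather than a fixed mutation list, and a real scorer would run the prompt against a held-out eval set):

```python
import random

random.seed(1)

def score(prompt: str) -> float:
    """Stand-in evaluator: in practice, task accuracy on an eval set.
    Here, each appended instruction line raises the score, as a toy proxy."""
    return min(1.0, 0.5 + 0.1 * prompt.count("\n"))

# Hypothetical mutation pool; a GEPA-style system would generate
# these edits by having the model critique its own failures.
MUTATIONS = ["\nThink step by step.", "\nCite the source line.",
             "\nAnswer in JSON.", "\nCheck units before answering."]

def refine(prompt: str, rounds: int = 8) -> str:
    """Hill-climb on prompt score: keep a candidate only if it improves."""
    best, best_score = prompt, score(prompt)
    for _ in range(rounds):
        candidate = best + random.choice(MUTATIONS)
        if score(candidate) > best_score:
            best, best_score = candidate, score(candidate)
    return best

final = refine("Extract the API call from the request.")
print(f"final score: {score(final):.2f}")
```

The infrastructure cost lives entirely in `score`: a trustworthy evaluation harness is what turns this loop from prompt tinkering into optimization.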
The Distribution Shift the Benchmark Can't Capture
Research published in 2025 (arXiv:2510.10197) identified a failure mode specific to agent fine-tuning that is absent from most fine-tuning literature: environment distribution shift.
When you fine-tune an agent on trajectories collected from an environment (a software codebase, a web browsing session, a customer support queue), the agent learns to operate in that environment's specific distribution. When deployed, the environment has changed. The codebase has different files, the websites have different layouts, the support tickets have different issues.
Standard fine-tuning doesn't account for environment dynamics. The agent policy is optimized for the training distribution of environment states. When that distribution shifts (which it always does in production), the policy's assumptions break. The symptoms are subtle: the agent doesn't fail outright; it takes slightly worse actions at each step, and the errors compound through multi-step sequences.
The paper identifies three specific mechanisms: representation collapse (the model's internal representations narrow to the training distribution), bootstrap error (early errors in multi-step sequences cause later errors that the training data couldn't have anticipated), and overestimation (the agent learns to be more confident in training distribution states than is warranted in production states).
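The compounding effect is worth quantifying. Assuming, for illustration, an independent per-step success probability p, a 50-step episode shows how quickly "slightly worse actions" become outright failure:

```python
# A small per-step degradation compounds multiplicatively over a
# multi-step trajectory: episode success is roughly p**n for n steps.
# (Independence is an illustrative assumption; real bootstrap error
# is worse, since early mistakes make later steps harder.)
STEPS = 50
for p in (0.99, 0.97, 0.95):
    print(f"per-step p={p}: {STEPS}-step success = {p**STEPS:.2f}")
```

Dropping per-step reliability from 0.99 to 0.95, a shift far too small to notice in single-step evaluation, cuts 50-step success from roughly 60% to under 10%.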
The counterpart to Gorilla's success is implicit here: Gorilla worked because API documentation is stable. The environment didn't shift. For agents in dynamic environments (web browsing, customer support, software development), environment distribution shift is the norm, not the exception.
The Decision Framework
The data points to a clear decision rule, though it's not the one most teams use.
Most teams choose between fine-tuning and prompting based on:
- Task complexity (more complex → fine-tune)
- Volume of labeled data (more data → fine-tune)
- Performance gap from prompting (still insufficient → fine-tune)
The research suggests the actual decision variables are:
1. Is the output distribution narrow and stable? Tool calls, structured outputs, API invocations: narrow, stable. Customer support responses, open-ended reasoning, creative tasks: wide, shifting. Fine-tuning works for narrow and stable; prompting retains its advantage for wide and shifting.
2. Are success criteria externally verifiable? Unit tests, schema validation, API execution: verifiable. Reasoning quality, response appropriateness, creative merit: not verifiable, or verifiable only with expensive human review. If success is externally verifiable, RLVR is the correct fine-tuning method. If not, you can't get the training signal that makes fine-tuning reliable for agents.
3. What is the OOD failure cost? If your agent will encounter unexpected inputs in production (almost always true), fine-tuning's 7% OOD performance degradation is a real cost. If the deployment environment is controlled and matches the training distribution exactly, the degradation doesn't manifest. Production agents almost always have variable environments.
4. Is catastrophic forgetting tolerable? Dedicated inference model used only for one task: forgetting is acceptable. General-purpose agent that handles diverse requests: forgetting is catastrophic. For agents, general-purpose use is the common case.
The decision matrix:
- Narrow output, verifiable, stable environment → Fine-tune (Gorilla, Toolformer pattern). Use RLVR if the task is complex enough to require it.
- Wide output, hard to verify, shifting environment → Prompt (OpenMedLM pattern). Invest in iterative prompt optimization rather than training runs.
- Complex multi-step agent tasks with unit-testable outputs → RLVR (Agent-RLVR pattern). Not standard SFT.
- General-purpose agent requiring breadth → Avoid full fine-tuning. Use LoRA if specialization is needed, with explicit monitoring for forgetting.
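The matrix can be written down as a decision function. The four boolean axes and the returned labels are this article's framing, not a published rubric:

```python
def choose_approach(narrow_stable_output: bool,
                    verifiable: bool,
                    stable_environment: bool,
                    needs_breadth: bool) -> str:
    """Map the four decision variables from the research to a
    recommendation. Note that task complexity and data volume,
    the axes most teams use, do not appear at all."""
    if needs_breadth:
        return "prompt, or LoRA with forgetting monitoring"
    if narrow_stable_output and verifiable and stable_environment:
        return "fine-tune (RLVR if multi-step)"
    if verifiable:
        return "RLVR on the verifiable objective"
    return "prompt with iterative optimization"

# Gorilla-style API-call generation:
print(choose_approach(True, True, True, False))
# Open-ended support agent:
print(choose_approach(False, False, False, True))
```

Encoding the rule this way also makes the disagreements explicit: any case where this function and your team's instinct diverge is exactly where the wrong axis is being used.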
What This Means for Architectural Decisions
The practical upshot for teams building agents:
Before fine-tuning, characterize your distribution. If you can't describe the boundary of inputs your agent will encounter in production (what counts as in-distribution versus OOD), you can't make the fine-tuning decision rationally. Many teams fine-tune before they understand their distribution, and discover the OOD degradation post-deployment.
Before fine-tuning, define your verifier. RLVR outperforms standard SFT because it has an external verifier. If you're considering standard SFT, ask: "What is my verifier?" If the answer is "human evaluation of the training dataset," that's imitation learning, and imitation learning on agent trajectories includes imitating mistakes. If you can't define an objective external verifier, consider whether prompt engineering might outperform the fine-tuning you're planning.
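A verifier in this sense is any objective, external check on the agent's output. A minimal sketch using JSON schema-style validation (the required `action`/`args` keys are hypothetical, chosen only to illustrate the shape of such a check):

```python
import json

def verifier(output: str) -> bool:
    """Objective, external verifier: the agent's output must be valid
    JSON containing the required keys. No human judgment is involved,
    so the same check can supply an RLVR training signal."""
    try:
        obj = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and {"action", "args"} <= obj.keys()

print(verifier('{"action": "search", "args": {"q": "llama"}}'))  # True
print(verifier('search(llama)'))                                 # False: not JSON
```

The test for whether you have a real verifier is whether it can run unattended in a training loop; anything that needs a human in the loop is a labeling process, not a verifier.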
Use LoRA, not full fine-tuning, when specialization is required. The catastrophic forgetting data is clear that full fine-tuning destroys general capabilities at high rates in larger models. LoRA's parameter efficiency (updating a small fraction of weights) produces substantially better retention of general capabilities while still achieving meaningful specialization.
Treat prompt optimization as an ongoing investment, not a one-time setup. GEPA's ICLR 2026 result shows that iterative prompt refinement can match fine-tuning. This requires infrastructure (evaluation harnesses, prompt versioning, automated optimization loops) that most teams haven't built. The teams that have built it get compounding returns on prompt engineering that make fine-tuning harder to justify.
The Counterintuitive Summary
Fine-tuning large models causes more catastrophic forgetting than fine-tuning small ones. Standard SFT on agent trajectories underperforms RLVR by 3x on complex tasks. Strong prompting beats fine-tuned specialists in domains where the distribution is broad. OOD accuracy degrades by 7% from fine-tuning that improves in-distribution accuracy by only 2%.
None of this means "don't fine-tune." It means the decision variable is not capability: it's distribution stability, output verifiability, and the OOD risk profile of your deployment environment.
A fine-tuned LLaMA-7B beats GPT-4 when the task is narrow, verifiable, and stable. A prompted general model beats a fine-tuned specialist when the task is broad, hard to verify, and the distribution shifts in production.
Most agents live in the second category. Most teams reach for fine-tuning anyway, optimizing for training metrics that don't survive the OOD distribution of production deployment. The result is models that test well and degrade in ways that are hard to diagnose, because the degradation happens exactly where the evaluation dataset didn't reach.
Design the verifier before you design the training run. If you can't verify success externally, you can't train reliably. And if you can verify success externally, you should be using RLVR, not standard SFT.