How GEPA Works (High-Level)
Genetic/evolutionary prompt optimization
GEPA treats your prompt as a population of mutable transformations. Each generation, the optimizer scores every candidate on a validation slice, picks parents by how often they “win” on individual seeds, and then respawns children through LLM-guided mutations and crossovers. The population manager maintains both a hot population (recent candidates) and a Pareto archive of non-dominated transformations, so you always retain high-quality, diverse instructions.
Online vs offline modes
- Online mode intercepts live traffic through a proxy URL, registers each trial with the backend, substitutes the chosen candidate into every system message, and streams the rollouts back to GEPA for instant feedback. See the online Banking77 walkthrough for the proxy wiring and session lifecycle.
- Offline mode replays stored traces using a standard prompt-learning job. You configure seeds, budgets, and proposers in TOML, submit the job once, and GEPA evaluates every candidate batch against your dataset. The offline flow is ideal when you want reproducible experiments or have a fixed validation set.
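For orientation, an offline job skeleton might look like the sketch below. The section and key names (`[prompt_learning]`, `mode`, and so on) are illustrative assumptions, not the exact schema; see the offline quickstart for the authoritative field names.

```toml
# Hypothetical offline job skeleton, submitted once to the backend.
# Section and key names are illustrative; check the offline quickstart
# for the real schema.
[prompt_learning]
mode = "offline"            # replay stored traces; no proxy needed
rollout_budget = 200        # lightweight test budget

[prompt_learning.proposer]
proposer_mode = "synth"     # default proposer; richer modes covered below
```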
Pareto frontier selection across objectives
Every candidate is scored along multiple axes: accuracy is the headline metric, but latency (model choice + token usage) and cost (rollouts, inference pricing) are first-class citizens. GEPA maintains a Pareto front by comparing candidates on a shared `pareto_set` of seeds, so even if a prompt sacrifices some latency, it survives as long as it delivers accuracy gains and is not dominated on every objective by an existing solution.
Interceptor trial logging and scoring
The interceptor is the glue that lets GEPA experiment on live traffic. When a candidate is selected, the backend:
- Registers a trial ID and the transformation metadata (proposer template, mutation history, parent references).
- Captures the entire trace (input, substituted prompt, outputs, tool calls, verifier verdict).
- Sends the trace to the evaluation stack (verifier graphs and/or RLM scoring).
- Pushes the score into the Pareto archive so subsequent generations can compare candidates on the same footing.
Because the interceptor never ships optimized prompts to your Task App, your environment, secrets, and tooling remain untouched.
Candidate generation: mutations, crossovers, and proposers
GEPA generates new candidates via three levers:
- Mutations rewrite discrete parts of the instruction using templates that understand roles, heuristics, and constraints.
- Crossovers splice together transforms from two parents to compose novel instruction fragments.
- Proposer strategies set the quality envelope. The default `synth` proposer balances cost and diversity, but you can opt for `gepa-ai` or `dspy` proposers when you need richer reasoning. On the scoring side, advanced verifiers (RLM v1, RLM v2) act like agentic proposers: they evaluate traces with multi-step reasoning, so you can confidently iterate even when your traces exceed hundreds of thousands of tokens.
Configuration Deep-Dive
Seeds, archives, and minibatch gating
| Knob | What it controls | Default | Tips |
|---|---|---|---|
| `pareto_set_size` | Number of seeds used for Pareto dominance checks | 15–30 | Too small → noisy objectives; too large → high cost. Keep it aligned with your validation set size. |
| `feedback_seeds` | Seed pool used for minibatch gating and live feedback | remainder of your seed split | Focus on diverse failure cases so early gating catches regressions. |
| `minibatch_size` | Sample size for fast gating before full Pareto eval | 2–5 | Matches `feedback_seeds`. Smaller values save inference steps but increase noise. |
| `archive_size` | Retains the top N candidates across objectives | 32–128 | Enough capacity prevents premature convergence. Use `archive_multiplier` in defaults to keep it proportional to `pop_size`. |
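Putting those knobs together, a seeds/archive block might look like the following sketch. The key names come from the table above; the section nesting is an assumption.

```toml
# Hypothetical layout -- key names match the table above; the section
# name is illustrative.
[prompt_learning.gepa]
pareto_set_size = 20      # 15-30: align with your validation set size
minibatch_size  = 3       # 2-5: fast gating before full Pareto eval
archive_size    = 64      # 32-128: room to avoid premature convergence
```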
Verifier settings (RLM v1/v2, custom graphs)
GEPA relies on verifiers to translate traces into scalar rewards. The built-in rubrics (single prompt) cover most classification tasks. Switch to RLM v1 when traces get long or noisy, and go to RLM v2 when you need multi-agent, debate-style scoring. Custom verifiers (trained Graphs or rubric graphs tailored to your domain) drop in via `verifier_config` in the job payload. RLMs are slower and more expensive, so only enable them for Pareto scoring (`pareto_set`) while keeping minibatch gating on faster rubrics.
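As a sketch, wiring an RLM verifier for Pareto scoring while gating stays on a rubric might look like this. `verifier_config` is the documented entry point; the nested keys and `type` values are assumptions.

```toml
# Hypothetical verifier wiring -- "verifier_config" is the documented
# entry point; the nested keys and values below are illustrative.
[verifier_config.pareto]
type = "rlm_v1"           # multi-step reasoning for long, noisy traces

[verifier_config.minibatch]
type = "rubric"           # lightweight single-prompt rubric for gating
```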
Proposer selection and candidate quality
`proposer_effort` and `proposer_output_tokens` are the levers that trade compute for signal. `LOW_CONTEXT`/`LOW` effort levels run quickly (good for early exploration), while `MEDIUM`/`HIGH` invite more reasoning layers and longer outputs. `proposer_mode` also matters: `synth` is the default (balanced), `dspy` mirrors the DSPy proposer, and `gepa-ai` tracks the reference GEPA implementation. If you ship your own meta-model, wrap it with a custom proposer; GEPA injects trace summaries, tool usage stats, and failure hints into the proposer context so every mutation is grounded in feedback.
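For example, an exploration-phase proposer block could look like this sketch. The key names come from the text above; their placement under a `[prompt_learning.proposer]` table and the exact value spellings are assumptions.

```toml
# Hypothetical proposer block -- keys are named in the docs above;
# placement and value spellings are assumptions.
[prompt_learning.proposer]
proposer_mode   = "synth"        # default; "dspy" and "gepa-ai" also exist
proposer_effort = "LOW_CONTEXT"  # cheap early exploration; raise to MEDIUM/HIGH later
proposer_output_tokens = 1024    # cap mutation length (see budget notes below)
```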
Budget controls (rollouts, tokens, time)
- `rollout_budget` limits the total number of call attempts (e.g., 1,000 rollouts for Banking77, 200 for lightweight tests).
- `max_rollouts` guards against runaway sessions by capping the outer loop regardless of budget.
- `max_concurrent` throttles how many rollouts run in parallel; use lower values when your Task App or tunnel is resource constrained.
- Termination knobs like `max_cost_usd`, `max_trials`, and `eval_timeout` (`prompt_learning.termination_config` and evaluation configs) let you bound spend or runtime. Every evaluator ships `eval_timeout` (default 600s) to ensure hung requests fail fast. If you use a policy graph, set `policy.max_completion_tokens` or `verifier.max_prompt_tokens` to stay within your supported pricing tier.
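A combined budget block might read as follows. `prompt_learning.termination_config` is named above; the grouping of the other keys is an assumption.

```toml
# Hypothetical budget block -- key names come from the list above.
[prompt_learning]
rollout_budget = 1000       # Banking77-scale run; 200 for smoke tests
max_rollouts   = 1200       # hard cap on the outer loop
max_concurrent = 8          # lower this if the Task App is constrained

[prompt_learning.termination_config]
max_cost_usd = 25.0         # bound total spend
max_trials   = 500
eval_timeout = 600          # seconds; default per the docs
```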
Online vs offline configuration differences
Offline jobs are pure TOML: define seeds, proposer settings, budgets, and verifiers, submit to `/api/policy-optimization`, and watch GEPA iterate. Online jobs (`GepaOnlineSession`) add `mode: "online"`, `online_proposer_min_rollouts`, and proxy configuration so the backend can swap the candidate in-flight. Online mode also exposes `online_proposer_mode` to choose between inline proposers or deferred ones; offline relies on the normal proposer pipeline.
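The online-specific additions can be sketched as a small delta on top of the offline config. The `online_*` keys and `mode` are documented above; the proxy key names and URL are guesses for illustration only.

```toml
# Hypothetical online additions -- "mode", "online_proposer_min_rollouts",
# and "online_proposer_mode" are documented; proxy key names are guesses.
mode = "online"
online_proposer_min_rollouts = 10   # rollouts seen before proposing mutations
online_proposer_mode = "inline"     # or a deferred proposer

[proxy]
base_url = "https://<your-tunnel>.example.com"  # illustrative placeholder
```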
Common misconfigurations and how to avoid them
- Forgetting to wire the online proxy: GEPA reports `baseline` for every rollout if the backend never sees candidate registrations; check that the Tunnel URL and `register_trial_url` are reachable.
- Setting `pareto_set_size` > `rollout_budget`: you will never finish a Pareto evaluation if there are not enough rollouts left; keep the Pareto set smaller than your budget per generation.
- Running RLM scoring everywhere: RLM v1/v2 are powerful but slow. Use them for Pareto evaluations and keep gating on lightweight rubrics.
- Missing seeds split: `feedback_seeds` and `pareto_set` must cover disjoint sets so generation metrics do not leak into gating signals (see the sketch after this list).
- Ignoring proposer tokens: `proposer_output_tokens` should align with your model budget. If you see `proposal_too_long` errors, drop the effort level or cap tokens.
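For the seeds split in particular, a disjoint assignment looks like this sketch. The key grouping is an assumption; the disjointness of the two pools is the point.

```toml
# Hypothetical seed split -- the two pools must not overlap.
[prompt_learning.seeds]
pareto_set     = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]   # Pareto dominance checks
feedback_seeds = [10, 11, 12, 13, 14]             # minibatch gating only
```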
Example Applications
Banking77 classification
Banking77 is GEPA’s bread-and-butter classification demo. Start with the quickstart (`prompt-optimization-banking77`) to split seeds into Pareto/feedback slices, then scale `pareto_set_size` when tackling new distribution shifts (fraud, support, contact changes). A 20–30-seed Pareto set with a 3-seed minibatch gives you fast gating without sacrificing statistical power; results typically improve baseline accuracy from ~65% to 85%+ within 10–15 generations.
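Those numbers translate into a config sketch like the one below. The values mirror the recommendations above; the section names remain illustrative.

```toml
# Hypothetical Banking77 settings -- values follow the guidance above.
[prompt_learning]
rollout_budget = 1000     # matches the budget-controls example

[prompt_learning.gepa]
pareto_set_size = 25      # 20-30 seeds recommended above
minibatch_size  = 3       # fast gating, adequate statistical power
```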
Crafter/VLM agent optimization
When your agent executes in complex environments, trace evaluation matters more than raw labels. GEPA pairs a Crafter verifier (see the Crafter Verifier Cookbook) with RLM scoring: the verifier digests action traces, scores achievements/survival, and feeds that signal back into the Pareto archive. The proposer uses metadata about tool calls and environment state so mutations emphasize mission-critical directives (e.g., “collect energy before leaving the sandbox”).
Coding agents (Codex/OpenCode)
Coding agent jobs (see the GEPA Coding Agent quickstart) optimize `AGENTS.md`, skills definitions, and core system prompts simultaneously. The optimization objective is the test pass rate reported by the Task App, so GEPA learns how to prompt Codex or OpenCode to compile and pass `cargo test`. Budget controls (`rollout_budget`, `max_concurrent`) keep evaluation time reasonable, while `proposer_effort=MEDIUM` and `proposer_output_tokens=FAST` ensure mutation candidates remain precise enough to compile right away.