How GEPA Works (High-Level)

Genetic/evolutionary prompt optimization

GEPA treats your prompt as the seed of a population of mutable transformations. Each generation, the optimizer scores every candidate on a validation slice, picks parents by how often they “win” on individual seeds, and then spawns children through LLM-guided mutations and crossovers. The population manager maintains both a hot population (recent candidates) and a Pareto archive of non-dominated transformations, so you always retain high-quality, diverse instructions.

Online vs offline modes

  • Online mode intercepts live traffic through a proxy URL, registers each trial with the backend, substitutes the chosen candidate into every system message, and streams the rollouts back to GEPA for instant feedback. See the online Banking77 walkthrough for the proxy wiring and session lifecycle.
  • Offline mode replays stored traces using a standard prompt-learning job. You configure seeds, budgets, and proposers in TOML, submit the job once, and GEPA evaluates every candidate batch against your dataset (see the sketch after this list). The offline flow is ideal when you want reproducible experiments or have a fixed validation set.
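To make the offline shape concrete, here is a minimal sketch of such a job file. The knob names all appear in the configuration sections below; the [prompt_learning] nesting is inferred from the prompt_learning.termination_config reference on this page and may differ in your GEPA version.

```toml
# Minimal offline job -- illustrative sketch, not a canonical schema.
# All knob names are documented below; the nesting is an assumption.
[prompt_learning]
rollout_budget  = 200        # lightweight test budget (see budget controls)
proposer_mode   = "synth"    # default proposer; "dspy" and "gepa-ai" also exist
pareto_set_size = 20         # seeds used for Pareto dominance checks
minibatch_size  = 3          # fast gating sample before full Pareto eval
# mode defaults to offline; online jobs add mode = "online" (see below)
```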

Pareto frontier selection across objectives

Every candidate is scored along multiple axes: accuracy is the headline metric, but latency (model choice + token usage) and cost (rollouts, inference pricing) are first-class citizens. GEPA maintains a Pareto front by comparing candidates on a shared pareto_set of seeds, so a prompt that gives up a little latency can still survive, as long as its accuracy gains keep it from being dominated by an existing solution.
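For reference, the standard dominance test behind the archive: writing s_i(x) for candidate x's score on objective i (accuracy, negated latency, negated cost) over the shared pareto_set seeds, candidate a dominates candidate b when

```latex
a \succ b \iff \big(\forall i:\ s_i(a) \ge s_i(b)\big) \;\wedge\; \big(\exists j:\ s_j(a) > s_j(b)\big)
```

A candidate stays on the front exactly as long as no other candidate dominates it under this test.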

Interceptor trial logging and scoring

The interceptor is the glue that lets GEPA experiment on live traffic. When a candidate is selected, the backend:
  1. Registers a trial ID and the transformation metadata (proposer template, mutation history, parent references).
  2. Captures the entire trace (input, substituted prompt, outputs, tool calls, verifier verdict).
  3. Sends the trace to the evaluation stack (verifier graphs and/or RLM scoring).
  4. Pushes the score into the Pareto archive so subsequent generations can compare candidates on the same footing.

Because the interceptor never ships optimized prompts to your Task App, your environment, secrets, and tooling remain untouched.

Candidate generation: mutations, crossovers, and proposers

GEPA generates new candidates via three levers:
  • Mutations rewrite discrete parts of the instruction using templates that understand roles, heuristics, and constraints.
  • Crossovers splice together transforms from two parents to compose novel instruction fragments.
  • Proposer strategies set the quality envelope. The default synth proposer balances cost and diversity, but you can opt for gepa-ai or dspy proposers when you need richer reasoning. On the scoring side, advanced verifiers (RLM v1, RLM v2) act like agentic proposers: they evaluate traces with multi-step reasoning, so you can confidently iterate even when your traces exceed hundreds of thousands of tokens.

Configuration Deep-Dive

Seeds, archives, and minibatch gating

| Knob | What it controls | Default | Tips |
| --- | --- | --- | --- |
| pareto_set_size | Number of seeds used for Pareto dominance checks | 15–30 | Too small → noisy objectives; too large → high cost. Keep it aligned with your validation set size. |
| feedback_seeds | Seed pool used for minibatch gating and live feedback | Remainder of your seed split | Focus on diverse failure cases so early gating catches regressions. |
| minibatch_size | Sample size for fast gating before full Pareto eval | 2–5 | Sampled from feedback_seeds. Smaller values save inference steps but increase noise. |
| archive_size | Retains the top N candidates across objectives | 32–128 | Enough capacity prevents premature convergence. Use archive_multiplier in defaults to keep it proportional to pop_size. |
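A sketch of the sizing relationship called out in the last row. Only the knob names above are taken from this page; pop_size's placement and the nesting are illustrative:

```toml
# Illustrative sizing: keep the archive proportional to the population.
[prompt_learning]
pareto_set_size    = 25    # within the 15-30 guidance above
minibatch_size     = 3     # small gating sample drawn from feedback_seeds
pop_size           = 16    # population-size knob referenced above
archive_multiplier = 4     # keeps archive_size at 4 * pop_size = 64
```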

Verifier settings (RLM v1/v2, custom graphs)

GEPA relies on verifiers to translate traces into scalar rewards. The built-in rubrics (single prompt) cover most classification tasks. Switch to RLM v1 when traces get long or noisy, and to RLM v2 when you need multi-agent, debate-style scoring. Custom verifiers (trained Graphs or rubric graphs tailored to your domain) drop in via verifier_config in the job payload. RLMs are slower and costlier, so enable them only for Pareto scoring (pareto_set) while keeping minibatch gating on faster rubrics.
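A sketch of that split, assuming verifier_config accepts separate verifier choices for gating and Pareto scoring. The field names inside the table are hypothetical; only verifier_config and the verifier tiers come from this page:

```toml
# Hypothetical field names -- only "verifier_config" and the verifier
# tiers (rubric, RLM v1/v2) are grounded in this page.
[prompt_learning.verifier_config]
gating_verifier = "rubric"     # fast single-prompt rubric for minibatch gating
pareto_verifier = "rlm_v1"     # long/noisy traces; "rlm_v2" for debate-style scoring
```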

Proposer selection and candidate quality

proposer_effort and proposer_output_tokens are the levers that trade compute for signal. LOW_CONTEXT/LOW effort levels run quickly (good for early exploration), while MEDIUM/HIGH invite more reasoning layers and longer outputs. proposer_mode also matters: synth is the default (balanced), dspy mirrors the DSPy proposer, and gepa-ai tracks the reference GEPA implementation. If you ship your own meta-model, wrap it with a custom proposer; GEPA injects trace summaries, tool usage stats, and failure hints into the proposer context so every mutation is grounded in feedback.
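Putting those levers in one place (values are illustrative; the knob names are the ones from this section):

```toml
# Proposer levers: trade compute for signal.
[prompt_learning]
proposer_mode          = "synth"    # or "dspy" / "gepa-ai"
proposer_effort        = "MEDIUM"   # LOW_CONTEXT / LOW for early exploration
proposer_output_tokens = "FAST"     # preset used in the coding-agent example;
                                    # cap tokens instead if you hit proposal_too_long
```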

Budget controls (rollouts, tokens, time)

  • rollout_budget limits the total number of call attempts (e.g., 1,000 rollouts for Banking77, 200 for lightweight tests). max_rollouts guards against runaway sessions by capping the outer loop regardless of budget.
  • max_concurrent throttles how many rollouts run in parallel; use lower values when your Task App or tunnel is resource constrained.
  • Termination knobs like max_cost_usd, max_trials, and eval_timeout (prompt_learning.termination_config and evaluation configs) let you bound spend or runtime. Every evaluator ships eval_timeout (default 600s) to ensure hung requests fail fast. If you use a policy graph, set policy.max_completion_tokens or verifier.max_prompt_tokens to stay within your supported pricing tier. A consolidated sketch of these knobs follows this list.
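The sketch below uses illustrative values; the termination_config nesting follows the prompt_learning.termination_config reference above, while the top-level placement of [policy] and [verifier] is an assumption:

```toml
[prompt_learning]
rollout_budget = 1000      # total call attempts (Banking77-scale)
max_rollouts   = 1200      # hard cap on the outer loop
max_concurrent = 8         # lower if your Task App or tunnel is constrained

[prompt_learning.termination_config]
max_cost_usd = 25.0        # bound spend
max_trials   = 500         # bound trial count
eval_timeout = 600         # seconds; the default per the bullet above

# Placement of these two tables is assumed; the knob names are from above.
[policy]
max_completion_tokens = 2048

[verifier]
max_prompt_tokens = 8192
```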

Online vs offline configuration differences

Offline jobs are pure TOML—define seeds, proposer settings, budgets, and verifiers, submit to /api/policy-optimization, and watch GEPA iterate. Online jobs (GepaOnlineSession) add mode: "online", online_proposer_min_rollouts, and proxy configuration so the backend can swap the candidate in-flight. Online mode also exposes online_proposer_mode to choose between inline proposers or deferred ones; offline relies on the normal proposer pipeline.
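In TOML terms, an online job is the offline config plus the online keys named above (values are illustrative; the exact proxy wiring fields live in the online Banking77 walkthrough and are not reproduced here):

```toml
[prompt_learning]
mode = "online"                      # offline is the implicit default
online_proposer_min_rollouts = 10    # illustrative threshold
online_proposer_mode = "inline"      # or a deferred proposer, per this section
# Proxy/tunnel configuration goes alongside these keys; see the online
# Banking77 walkthrough for the exact fields.
```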

Common misconfigurations and how to avoid them

  • Forgetting to wire the online proxy: GEPA falls back to the baseline prompt on every rollout if the backend never sees candidate registrations; check that the tunnel URL and register_trial_url are reachable.
  • Setting pareto_set_size > rollout_budget: you will never finish a Pareto evaluation if there are not enough rollouts left; keep the Pareto set smaller than your budget per generation.
  • Running RLM scoring everywhere: RLM v1/v2 are powerful but slow. Use them for Pareto evaluations and keep gating on lightweight rubrics.
  • Missing seed split: feedback_seeds and pareto_set must cover disjoint sets so generation metrics do not leak into gating signals (see the sketch after this list).
  • Ignoring proposer tokens: proposer_output_tokens should align with your model budget. If you see proposal_too_long errors, drop the effort level or cap tokens.
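A sketch of a clean split. Whether seeds are passed as explicit lists or derived from a single split is version-dependent, so treat the shape as illustrative:

```toml
# Illustrative: Pareto seeds and feedback/gating seeds never overlap.
[prompt_learning]
pareto_set     = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]   # dominance checks only
feedback_seeds = [10, 11, 12, 13, 14, 15]         # minibatch gating pool
minibatch_size = 3                                # sampled from feedback_seeds
```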

Example Applications

Banking77 classification

Banking77 is GEPA’s bread-and-butter classification demo. Start with the quickstart (prompt-optimization-banking77) to split seeds into pareto/feedback slices, then scale pareto_set_size when tackling new distribution shifts (fraud, support, contact changes). A 20–30-seed Pareto set with a 3-seed minibatch gives you fast gating without sacrificing statistical power; results typically improve baseline accuracy from ~65% to 85%+ within 10–15 generations.
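Those numbers translate into a config along these lines (illustrative nesting, as throughout):

```toml
# Banking77-flavored values from this section.
[prompt_learning]
pareto_set_size = 25      # 20-30 seed Pareto set
minibatch_size  = 3       # 3-seed gating minibatch
rollout_budget  = 1000    # Banking77-scale budget (see budget controls)
```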

Crafter/VLM agent optimization

When your agent executes in complex environments, trace evaluation matters more than raw labels. GEPA pairs a Crafter verifier (see Crafter Verifier Cookbook) with RLM scoring: the verifier digests action traces, scores achievements/survival, and feeds that signal back into the Pareto archive. The proposer uses metadata about tool calls and environment state so mutations emphasize mission-critical directives (e.g., “collect energy before leaving the sandbox”).

Coding agents (Codex/OpenCode)

Coding agent jobs (see the GEPA Coding Agent quickstart) optimize AGENTS.md, skills definitions, and core system prompts simultaneously. The optimization objective is the test pass rate reported by the Task App, so GEPA learns how to prompt Codex or OpenCode to compile and pass cargo test. Budget controls (rollout_budget, max_concurrent) keep evaluation time reasonable, while proposer_effort=MEDIUM and proposer_output_tokens=FAST ensure mutation candidates remain precise enough to compile right away.
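A matching sketch for a coding-agent run, using only the knob names and values mentioned in this section (the budget figures are illustrative):

```toml
[prompt_learning]
rollout_budget         = 400        # test-suite rollouts are slow; keep modest
max_concurrent         = 4          # keep the Task App responsive
proposer_effort        = "MEDIUM"
proposer_output_tokens = "FAST"     # preset named in this section
```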