GEPA (Genetic-Pareto) is a reflective evolutionary optimizer that automatically improves your prompts through LLM-guided mutations and multi-objective selection. This walkthrough covers everything from setup to retrieving your optimized prompts. Reference: Agrawal et al. (2025). “GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning.” arXiv:2507.19457

Why GEPA?

GEPA outperforms GRPO by 10% on average (up to 20%) while using up to 35x fewer rollouts. Best for:
  • Classification tasks (Banking77, intent classification)
  • Multi-hop QA (HotpotQA)
  • Instruction-following tasks
  • When you want diverse prompt variants (Pareto front)
Typical results: 60-75% baseline accuracy → 85-90%+ after 15 generations

Prerequisites

Before starting, ensure you have:
# Required environment variables in .env
GROQ_API_KEY=gsk_...          # For policy model inference
SYNTH_API_KEY=sk_...          # For backend authentication
ENVIRONMENT_API_KEY=sk_env_... # Optional - auto-generated if not set
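A minimal sketch for loading and validating these variables at startup, assuming the python-dotenv package (an extra dependency, not required by the SDK):

import os
from dotenv import load_dotenv  # assumes: pip install python-dotenv

load_dotenv()  # reads .env from the current working directory

# Fail fast if the required keys are missing
for key in ("GROQ_API_KEY", "SYNTH_API_KEY"):
    if not os.environ.get(key):
        raise RuntimeError(f"Missing required environment variable: {key}")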
Install the Synth AI SDK:
pip install synth-ai

How GEPA Works

GEPA uses evolutionary principles to explore the prompt space. Understanding the algorithm helps you configure it effectively.

The Optimization Flow

  1. Initialize
    • Split seeds into pareto_seeds and feedback_seeds
    • Evaluate baseline transformation
    • Generate initial population via proposer
    • Evaluate & add to Pareto archive
  2. Evolve (for each generation)
    • For each child:
      • Select parent (instance-wise Pareto sampling)
      • Generate feedback from parent trace
      • Mutate via proposer (LLM-guided)
      • Minibatch gating (quick eval)
      • Full Pareto evaluation (if gating passed)
      • Update archive if non-dominated
  3. Terminate
    • Budget exhausted OR
    • Generation limit OR
    • No improvement for N generations
  4. Return the best transformation by accuracy (a minimal Python sketch of this loop follows below)
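As a rough Python sketch of the loop above (illustrative only; helper names such as sample_parent, propose_mutation, and evaluate_pareto are hypothetical, not SDK APIs):

# Illustrative sketch of the GEPA loop; all helpers are hypothetical.
archive = [baseline_transformation]
evaluate_pareto(baseline_transformation, pareto_seeds)

for generation in range(num_generations):
    for _ in range(children_per_generation):
        parent = sample_parent(archive)                     # instance-wise Pareto sampling
        feedback = collect_feedback(parent, feedback_seeds)
        child = propose_mutation(parent, feedback)          # LLM-guided proposer
        if minibatch_score(child) < minibatch_score(parent):
            continue                                        # gating failed: skip full eval
        evaluate_pareto(child, pareto_seeds)
        archive = update_archive(archive, child)            # keep if non-dominated
    if budget_exhausted() or generations_without_improvement() >= patience_generations:
        break

best = max(archive, key=accuracy)  # best transformation by accuracy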

Key Components

1. Pattern-Based Transformations

GEPA represents prompt changes as transformations that can be applied to your baseline:
# A transformation replaces text in your prompt
TextTransformation(
    old_text="You are a helpful assistant.",      # Original text
    new_text="You are a banking classification expert...",  # Optimized text
    apply_to_role="system"  # Only apply to system messages
)
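A rough sketch of how such a transformation could be applied to a chat-style prompt (apply_transformation here is illustrative, not an SDK function):

# Illustrative: rewrite matching messages for the targeted role.
def apply_transformation(messages, t):
    updated = []
    for msg in messages:
        content = msg["content"]
        if msg["role"] == t.apply_to_role and t.old_text in content:
            content = content.replace(t.old_text, t.new_text)
        updated.append({**msg, "content": content})
    return updated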

2. Pareto Archive

GEPA maintains a Pareto front of non-dominated solutions, balancing multiple objectives:
  • Accuracy (primary) – Task performance
  • Tool call rate – Function calling frequency (for agentic tasks)
Solutions are kept if they’re not dominated by any other solution across all objectives.
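For intuition, the archive update can be thought of as a standard non-domination check (a sketch, not the SDK's implementation; the objective names are assumptions):

# Sketch: A dominates B if it is at least as good on every objective and strictly better on one.
def dominates(a, b, objectives=("accuracy", "tool_call_rate")):
    at_least_as_good = all(a[o] >= b[o] for o in objectives)
    strictly_better = any(a[o] > b[o] for o in objectives)
    return at_least_as_good and strictly_better

def keep_in_archive(candidate, archive):
    return not any(dominates(other, candidate) for other in archive)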

3. Instance-Wise Parent Selection

Unlike traditional selection that uses aggregate scores, GEPA counts how many individual seeds each prompt “wins” on:
# Parent selection weights prompts by per-seed wins
wins = count_seeds_where_prompt_is_best(prompt)
selection_weight = (wins + ε) ** selection_pressure
This favors prompts that excel on specific example types, encouraging specialization.
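Concretely, parent sampling could look like the following (a sketch; per_seed_scores, epsilon, and selection_pressure are illustrative names, not documented settings):

import random

# Sketch: weight each archived prompt by the number of seeds it "wins" on.
def sample_parent(archive, per_seed_scores, epsilon=1.0, selection_pressure=2.0):
    weights = []
    for prompt in archive:
        wins = sum(
            1 for seed_scores in per_seed_scores.values()
            if seed_scores[prompt] == max(seed_scores.values())
        )
        weights.append((wins + epsilon) ** selection_pressure)
    return random.choices(archive, weights=weights, k=1)[0]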

4. LLM-Guided Mutations

The proposer (meta-model) generates new prompts by analyzing:
  • Current instruction (baseline)
  • Rollout examples (input/output/feedback for each seed)
  • Trace feedback (e.g., “model under-utilizes tools”)
  • Dataset and program context
The proposer uses instruction typology to structure outputs with: input descriptions, core task, premises, heuristics, constraints, rules, and output descriptions.
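As a rough illustration, the proposer's input can be assembled along these lines (the exact meta-prompt used by the backend is not shown here; field names are examples):

# Illustrative assembly of the proposer meta-prompt.
def build_proposer_prompt(baseline, rollout_examples, trace_feedback, context):
    examples = "\n\n".join(
        f"Input: {e['input']}\nOutput: {e['output']}\nFeedback: {e['feedback']}"
        for e in rollout_examples
    )
    return (
        f"Current instruction:\n{baseline}\n\n"
        f"Rollout examples:\n{examples}\n\n"
        f"Trace feedback: {trace_feedback}\n"
        f"Context: {context}\n\n"
        "Write an improved instruction structured as: input description, core task, "
        "premises, heuristics, constraints, rules, and output description."
    )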

Step 1: Create a LocalAPI

Your LocalAPI evaluates prompts by running rollouts and returning scores. See LocalAPI Guide for details. Example Banking77 LocalAPI structure:
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class RolloutRequest(BaseModel):
    seed: int
    run_id: str
    # ... other fields

@app.post("/rollout")
async def rollout(request: RolloutRequest):
    # 1. Load example for this seed
    example = load_example(request.seed)

    # 2. Call your LLM with the prompt (interceptor handles substitution)
    prediction = await call_llm(example.query)

    # 3. Score the prediction
    correct = prediction == example.expected_label

    return {
        "metrics": {"correct": correct},
        "outcome": 1.0 if correct else 0.0
    }
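
Before deploying, you can smoke-test the endpoint locally with FastAPI's test client (a sketch; it assumes load_example and call_llm are implemented and sends only the fields shown above):

from fastapi.testclient import TestClient

client = TestClient(app)
resp = client.post("/rollout", json={"seed": 0, "run_id": "local-test"})
print(resp.status_code, resp.json())  # expect 200 and a metrics/outcome payload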

Step 2: Deploy Your LocalAPI

The Synth AI backend needs to reach your LocalAPI over the internet to send rollout requests. Use the Python SDK’s InProcessTaskApp for seamless deployment with automatic tunneling:
import asyncio

from synth_ai.sdk import InProcessTaskApp

# Import your task app
from my_task_app import app

async def main():
    async with InProcessTaskApp(app=app) as task_app:
        print(f"Task app running at: {task_app.url}")
        # task_app.url is a SynthTunnel URL like https://st.usesynth.ai/s/rt_...

asyncio.run(main())
This:
  1. Starts your LocalAPI locally
  2. Creates a SynthTunnel URL (or another tunnel backend if configured)
  3. Returns the tunnel URL via task_app.url

Verify the Deployment

Check that your LocalAPI is accessible:
curl https://st.usesynth.ai/s/rt_.../health

Alternative Tunnel Backends

SynthTunnel is the default and recommended backend. If you need a Cloudflare tunnel instead:
# Cloudflare quick tunnel (ephemeral, requires cloudflared binary)
async with InProcessTaskApp(
    app=app,
    tunnel_backend="cloudflare_quick",
) as task_app:
    print(f"URL: {task_app.url}")

# Cloudflare managed lease (stable hostname)
async with InProcessTaskApp(
    app=app,
    tunnel_backend="cloudflare_managed_lease",
) as task_app:
    print(f"Stable URL: {task_app.url}")

Step 3: Create the Configuration

Create a TOML file defining your optimization parameters. The task_app_url should match the URL from Step 2 (stored in your .env as TASK_APP_URL):
[prompt_learning]
algorithm = "gepa"
task_app_url = "https://my-company.usesynth.ai"  # From TASK_APP_URL in .env
task_app_id = "banking77"

# Training seeds (used during optimization)
evaluation_seeds = [50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79]

# Validation seeds (held-out for final evaluation)
validation_seeds = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]

# Initial prompt template
[prompt_learning.initial_prompt]
messages = [
  { role = "system", content = "You are a banking intent classification assistant." },
  { role = "user", pattern = "Customer Query: {query}\n\nClassify this query into one of 77 banking intents." }
]

# GEPA-specific configuration
[prompt_learning.gepa]
num_generations = 15              # Evolutionary cycles to run
children_per_generation = 5       # Mutations per generation
pareto_set_size = 20              # Seeds for Pareto evaluation
minibatch_size = 3                # Seeds for quick gating
rollout_budget = 1000             # Total rollouts allowed
archive_size = 64                 # Max Pareto archive size

Configuration Parameters

Parameter | Description | Default | Recommended Range
num_generations | Evolutionary cycles | 10 | 5-20
children_per_generation | Mutations per generation | 5 | 3-10
pareto_set_size | Seeds for Pareto evaluation | 20 | 15-30
minibatch_size | Seeds for gating evaluation | 3 | 2-5
rollout_budget | Total rollouts allowed | 1000 | 200-2000
archive_size | Max Pareto archive size | 64 | 32-128
feedback_fraction | Fraction of seeds for feedback | 0.3 | 0.2-0.5
proposer_mode | Proposer type (synth, gepa-ai, dspy) | synth | -

Step 4: Launch the Optimization Job

import os
from synth_ai.sdk import PromptLearningClient

async def run_optimization():
    client = PromptLearningClient(api_key=os.environ["SYNTH_API_KEY"])

    # Create job from TOML config
    job = await client.create_job_from_toml("configs/prompt_learning/banking77_gepa.toml")
    print(f"Created job: {job['id']}")

    # Start the job
    await client.start_job(job["id"])

    # Poll until completion
    result = await client.poll_until_terminal(job["id"])
    print(f"Best score: {result['best_score']}")
    return result
The SDK will:
  1. Validate your TOML configuration
  2. Verify the task app is reachable
  3. Submit the job to Synth AI
  4. Poll for completion

Understanding the Output

During optimization, you’ll see progress updates:
[18:35:37]    0.0s  Status: running
[18:35:42]    5.2s  Status: running | Best: 0.500
[18:35:48]   11.4s  Status: running | Best: 0.625
[18:35:54]   17.6s  Status: running | Best: 0.750
[18:36:00]   23.8s  Status: running | Best: 0.875
...
[18:38:50]  175.9s  Status: succeeded | Best: 0.875
Your LocalAPI logs will show rollout requests:
[TASK_APP] INBOUND_ROLLOUT: run_id=prompt-learning-74-5bec8a6f seed=74
[TASK_APP] PREDICTION: expected=card_arrival predicted=card_delivery_estimate correct=False
[BANKING77_ROLLOUT] run_id=prompt-learning-74-5bec8a6f reward=0.0

Step 5: Understanding the Optimization Process

Generation-by-Generation Progress

Generation | What Happens | Expected Accuracy
0 (baseline) | Evaluate initial prompt | 60-75%
1-3 | Explore diverse mutations | 70-80%
5-10 | Convergence begins | 80-85%
10-15 | Fine-tuning best solutions | 85-90%+

How Mutations Are Generated

The proposer receives:
  1. Baseline instruction: Your current system prompt
  2. Rollout examples: Input/output pairs with feedback (correct/incorrect, error messages)
  3. Trace statistics: Tool call rate, trajectory length, etc.
  4. Feedback hints: Rule-based suggestions like “model under-utilizes tools”
It generates a new instruction following instruction typology:
[Input Description]
You will be given a customer banking query.

[Core Task Description]
Your task is to classify the query into one of 77 banking intents.

[Premises]
Banking queries often contain domain-specific terminology.
Multiple intents may seem applicable; choose the most specific.

[Heuristics]
Look for keywords indicating the customer's primary need.
Consider the emotional tone to distinguish complaints from inquiries.

[Constraints]
Avoid defaulting to generic intents when specific ones apply.

[Rules]
Output only the intent name, nothing else.

[Output Description]
Return exactly one intent from the predefined list.

Minibatch Gating

Before full evaluation, GEPA performs a quick check:
  1. Evaluate child on a small minibatch (3 seeds)
  2. Compare to parent’s score on the same seeds
  3. If child is worse → skip full evaluation (saves budget)
  4. If child is promising → proceed to full Pareto evaluation
This saves significant compute by filtering out poor mutations early.
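In pseudocode, the gate amounts to the following (a sketch; evaluate_on_seeds is a hypothetical helper):

# Sketch of minibatch gating before a full Pareto evaluation.
def passes_gate(child, parent, minibatch_seeds):
    child_score = evaluate_on_seeds(child, minibatch_seeds)    # e.g. 3 seeds
    parent_score = evaluate_on_seeds(parent, minibatch_seeds)  # same seeds for a fair comparison
    return child_score >= parent_score  # only promising children get full evaluation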

Step 6: Retrieve Optimized Prompts

After completion, fetch your results using the Python SDK:
import os
# Result-retrieval helpers used below (import path assumed; adjust to your SDK version)
from synth_ai.sdk import get_prompts, get_prompt_text, get_scoring_summary

API_KEY = os.environ["SYNTH_API_KEY"]
JOB_ID = "pl_abc123"  # From the job submission output

# Get all results
results = get_prompts(job_id=JOB_ID, api_key=API_KEY)
print(f"Best Score: {results['best_score']:.3f}")

# Get top 5 prompts from Pareto front
for rank in range(1, 6):
    prompt = get_prompt_text(job_id=JOB_ID, api_key=API_KEY, rank=rank)
    print(f"Rank {rank}: {len(prompt)} chars")
    print(prompt[:200] + "...")

# Get scoring summary
summary = get_scoring_summary(job_id=JOB_ID, api_key=API_KEY)
print(f"Train={summary['best_train_accuracy']:.3f}")
print(f"Validation={summary.get('best_validation_accuracy', 0.0):.3f}")
print(f"Candidates Tried={summary['num_candidates_tried']}")

Understanding the Pareto Front

GEPA returns multiple prompts representing different trade-offs:
Rank | Accuracy | Token Count | Trade-off
1 | 92% | 450 | Highest accuracy
2 | 90% | 280 | Good accuracy, shorter
3 | 88% | 150 | Efficient, still performant
Choose based on your latency/cost requirements.
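For example, you might pick the most accurate prompt that fits a token budget (a sketch; the accuracy and token_count fields mirror the table above and are assumptions about the result shape):

# Sketch: choose the most accurate Pareto-front prompt under a token budget.
def pick_prompt(pareto_front, max_tokens=300):
    eligible = [p for p in pareto_front if p["token_count"] <= max_tokens]
    if not eligible:
        eligible = pareto_front  # no candidate fits; fall back to the full front
    return max(eligible, key=lambda p: p["accuracy"])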

Step 7: Use the Optimized Prompt

Replace your baseline prompt with the optimized version:
# Before: baseline prompt
system_prompt = "You are a banking intent classification assistant."

# After: optimized prompt (rank 1 from GEPA)
system_prompt = get_prompt_text(job_id=JOB_ID, api_key=API_KEY, rank=1)

# Use in your application (async OpenAI client)
from openai import AsyncOpenAI

oai_client = AsyncOpenAI()
response = await oai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Customer Query: {query}"}
    ]
)

In-Process Optimization

For development and testing, run everything from a single Python script:
import asyncio
import os

from synth_ai.sdk import InProcessTaskApp, PromptLearningClient

async def main():
    # Start LocalAPI in-process (handles tunneling automatically)
    async with InProcessTaskApp(app=my_task_app) as task_app:
        # LocalAPI is now accessible via tunnel; reference this in your config's task_app_url
        task_app_url = task_app.url

        # Submit optimization job
        client = PromptLearningClient(api_key=os.environ["SYNTH_API_KEY"])
        job = await client.create_job(config=my_config)
        await client.start_job(job["id"])

        # Poll until complete
        result = await client.poll_until_terminal(job["id"])
        print(f"Best score: {result['best_score']}")

asyncio.run(main())
See In-Process Task App Walkthrough for a complete example.

Termination Conditions

GEPA stops when any condition is met:
Condition | Description | Configuration
rollout_budget | Total rollouts exhausted | rollout_budget = 1000
max_spend_usd | USD budget limit | max_spend_usd = 5.0
num_generations | Generation limit reached | num_generations = 15
patience_generations | No improvement for N generations | patience_generations = 5

Supported Models

See Supported Models for Prompt Optimization for the full list of policy models.