- GEPA: Agrawal et al. (2025). “GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning.” arXiv:2507.19457
- MIPRO: Opsahl-Ong et al. (2024). “Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs.” arXiv:2406.11695
Quick Start Checklist
1. Build a prompt evaluation task app
Define a task app that evaluates prompt performance on your task (classification accuracy, QA correctness, etc.). → Read: Task App requirements
2. Deploy and verify the service
Smoke-test locally, then deploy to Modal or your host of choice once health checks pass. → Read: Deploying task apps
3. Author the prompt optimization config
Capture algorithm choice (GEPA or MIPRO), initial prompt template, training/validation seeds, and optimization parameters in TOML. → Read: Prompt optimization configs
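A rough sketch of such a config is shown below; the section and key names are illustrative placeholders rather than the authoritative schema, so consult the config reference for the exact fields.

```toml
# Illustrative sketch only -- section and key names are placeholders, not the real schema.
[prompt_optimization]
algorithm = "gepa"                                  # or "mipro"
task_app_url = "https://my-task-app.example.com"

[prompt_optimization.prompt]
template = "You are a banking assistant. Classify the intent of: {query}"

[prompt_optimization.data]
train_seeds = [0, 1, 2, 3, 4, 5, 6, 7]
validation_seeds = [100, 101, 102, 103]

[prompt_optimization.gepa]
generations = 12
population_size = 8
mutation_model = "openai/gpt-oss-120b"
```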
4. Launch the optimization job
Run `uvx synth-ai train --config config.toml` to create the job and stream status/metrics. → Read: Launch training jobs
5. Query and evaluate results
Use the Python API or REST endpoints to retrieve optimized prompts and evaluate them on held-out validation sets. → Read: Querying results
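As a rough illustration of this step, the snippet below polls a job over REST with `requests`; the base URL, endpoint path, and response field names are assumptions made for the sketch, not the documented API.

```python
# Illustrative sketch only: the endpoint path and response fields are placeholders,
# not the documented API -- see "Querying results" for the real interface.
import os
import requests

BASE_URL = os.environ.get("SYNTH_BACKEND_URL", "https://api.example.com")
API_KEY = os.environ["SYNTH_API_KEY"]
JOB_ID = "job_123"  # returned when the job was launched

resp = requests.get(
    f"{BASE_URL}/prompt-optimization/jobs/{JOB_ID}",
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
resp.raise_for_status()
job = resp.json()

# Inspect the best prompt found so far and its validation score.
print(job["status"])
print(job["best_prompt"])
print(job["best_score"])
```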
Algorithm Overview
GEPA (Genetic-Pareto)
Best for: Broad exploration, diverse prompt variants, classification tasks
Reference: Agrawal et al. (2025)
GEPA uses evolutionary principles to explore the prompt space (a conceptual sketch follows the list below):
- Population-based search with multiple prompt variants
- LLM-guided mutations for intelligent prompt modifications
- Pareto optimization balancing performance and prompt length
- Multi-stage support for pipeline optimization
- Maintains a Pareto front of non-dominated solutions
- Supports both template mode and pattern-based transformations
- Module-aware evolution for multi-stage pipelines
- Reflective feedback from execution traces
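The sketch below shows the general shape of such a loop: mutate a population of prompts with an LLM (stubbed here), score them, and keep a Pareto front over score and prompt length. It is a conceptual illustration only, not the backend implementation.

```python
# Conceptual sketch of a GEPA-style evolutionary loop -- not the backend implementation.
import random

HINTS = ["Answer with a single label.", "Think step by step.", "Be concise."]

def mutate(prompt: str) -> str:
    # Stand-in for an LLM-guided mutation of the instruction text.
    return prompt + " " + random.choice(HINTS)

def score(prompt: str) -> float:
    # Stand-in for evaluating the prompt against the task app on training seeds.
    return random.random()

def pareto_front(population: list[str]) -> list[str]:
    # Keep prompts that are not dominated on (score, prompt length): another prompt
    # dominates if it scores at least as well and is no longer, with one strict improvement.
    scored = [(p, score(p), len(p)) for p in population]
    front = []
    for p, s, n in scored:
        dominated = any(
            s2 >= s and n2 <= n and (s2 > s or n2 < n) for _, s2, n2 in scored
        )
        if not dominated:
            front.append(p)
    return front

population = ["Classify the customer's banking intent."]
for generation in range(5):
    children = [mutate(p) for p in population for _ in range(3)]
    population = pareto_front(population + children)
    print(f"generation {generation}: {len(population)} prompts on the front")
```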
MIPRO (Multi-prompt Instruction Proposal Optimizer)
Best for: Efficient optimization, task-specific improvements, faster convergence
Reference: Opsahl-Ong et al. (2024)
MIPRO uses meta-learning to propose better instructions (a conceptual sketch follows the list below):
- Meta-LLM (e.g., GPT-4o-mini) generates instruction variants
- TPE (Tree-structured Parzen Estimator) guides Bayesian search
- Bootstrap phase collects task-specific few-shot examples from high-scoring seeds
- Reference corpus (up to 50k tokens) enriches meta-prompts
- System spec integration for constraint-aware optimization
- Program-aware instruction proposals
- Multi-stage pipeline support with LCS-based stage detection
- Token budget tracking and cost optimization
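The sketch below illustrates the propose-and-search idea using Optuna's TPE sampler to pick among candidate instructions scored on mini-batches. The candidate list, the scoring stub, and the use of Optuna are illustrative assumptions, not the backend implementation.

```python
# Conceptual sketch of MIPRO-style instruction selection -- not the backend implementation.
# Requires `pip install optuna`.
import random
import optuna

# In the real flow, a meta-LLM (e.g. gpt-4o-mini) proposes candidates from the task
# description, bootstrapped examples, and the reference corpus.
candidate_instructions = [
    "Classify the banking query into one of the 77 intents.",
    "Read the customer message and return the single best intent label.",
    "You are a banking intent classifier. Output only the label.",
]

def evaluate_on_minibatch(instruction: str) -> float:
    # Stand-in: run the task app on a small batch of seeds with this instruction
    # substituted and return the mean score.
    return random.random()

def objective(trial: optuna.Trial) -> float:
    instruction = trial.suggest_categorical("instruction", candidate_instructions)
    return evaluate_on_minibatch(instruction)

study = optuna.create_study(direction="maximize", sampler=optuna.samplers.TPESampler())
study.optimize(objective, n_trials=20)
print(study.best_params["instruction"])
```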
Architecture: Inference Interception
🚨 Critical: Both algorithms use an interceptor pattern that ensures optimized prompts never reach task apps. All prompt modifications happen in the backend via an inference interceptor that substitutes the optimized prompt before the request reaches the LLM (a conceptual sketch follows the list below).
- Task apps remain unchanged during optimization
- Prompt optimization logic stays in the backend
- Secure, correct prompt substitution
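The sketch below shows the general idea of such an interceptor: swap the system message for the current optimized prompt before forwarding the request. The request shape and stage metadata are assumptions for illustration, not the actual backend schema.

```python
# Conceptual sketch of the interceptor pattern -- the real backend component and
# request schema are internal to the platform.
def intercept(request: dict, optimized_prompts: dict[str, str]) -> dict:
    """Substitute the optimized system prompt before the request reaches the LLM.

    `request` is an OpenAI-style chat completion payload produced by the task app;
    `optimized_prompts` maps a pipeline stage name to its current best prompt.
    """
    stage = request.get("metadata", {}).get("stage", "default")
    if stage in optimized_prompts:
        messages = request["messages"]
        # Replace the system message the task app sent with the optimized one.
        request["messages"] = [
            {"role": "system", "content": optimized_prompts[stage]}
        ] + [m for m in messages if m["role"] != "system"]
    return request
```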
Supported Models
Policy Models (Task Execution)
Both GEPA and MIPRO support policy models from:
- OpenAI: `gpt-4o`, `gpt-4o-mini`, `gpt-4.1`, `gpt-4.1-mini`, `gpt-4.1-nano`, `gpt-5`, `gpt-5-mini`, `gpt-5-nano`
- Groq: `gpt-oss-20b`, `gpt-oss-120b`, `llama-3.3-70b-versatile`, `qwen-32b`, `qwen3-32b`
- Google: `gemini-2.5-pro`, `gemini-2.5-pro-gt200k`, `gemini-2.5-flash`, `gemini-2.5-flash-lite`
Mutation Models (GEPA Only)
Used to generate prompt mutations:
- Common choices: `openai/gpt-oss-120b`, `llama-3.3-70b-versatile`
- Nano models are rejected (too small for generation tasks)
Meta Models (MIPRO Only)
Used to generate instruction proposals:
- Common choices: `gpt-4o-mini`, `gpt-4.1-mini` (most common default)
- Nano models are rejected (too small for generation tasks)
`gpt-5-pro` is explicitly rejected for all model types (too expensive: $120 per 1M tokens).
See Supported Models for complete details.
When to Use Each Algorithm
| Aspect | GEPA | MIPRO |
|---|---|---|
| Search Method | Genetic evolution | Meta-LLM + TPE |
| Exploration | Broad, diverse variants | Focused, efficient |
| Computational Cost | Lower (fewer LLM calls) | Higher (meta-model calls) |
| Convergence | 10-15 generations | 10-20 iterations |
| Best For | Classification, multi-hop QA | Task-specific optimization |
| Evaluation Budget | ~1000 rollouts | ~96 rollouts |
Choose GEPA when:
- You want diverse prompt variants (Pareto front)
- You have a large evaluation budget (1000+ rollouts)
- You need broad exploration of the prompt space
Choose MIPRO when:
- You want faster convergence with fewer evaluations
- You have clear task structure (can bootstrap with examples)
- You need efficient optimization (mini-batch evaluation)
Multi-Stage Pipeline Support
Both algorithms support optimizing prompts for multi-stage pipelines (e.g., Banking77 classifier → calibrator), as illustrated by the sketch after this list:
- LCS-based stage detection automatically identifies which stage is being called
- Per-stage optimization evolves separate instructions for each pipeline module
- Unified evaluation tracks end-to-end performance across all stages
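The sketch below illustrates the stage-detection idea: compare an incoming prompt against each stage's template and pick the closest match. It uses difflib's `SequenceMatcher` as a stand-in for the backend's LCS-based matching, and the stage names and templates are illustrative.

```python
# Illustrative stage detection -- difflib's SequenceMatcher (a longest-matching-
# subsequence heuristic) stands in for the backend's LCS-based matching.
from difflib import SequenceMatcher

STAGE_TEMPLATES = {
    "classifier": "Classify the banking query into one of the 77 intents.",
    "calibrator": "Given the predicted intent and its confidence, decide whether to accept it.",
}

def detect_stage(incoming_prompt: str) -> str:
    # Return the stage whose template is most similar to the incoming prompt.
    return max(
        STAGE_TEMPLATES,
        key=lambda stage: SequenceMatcher(None, STAGE_TEMPLATES[stage], incoming_prompt).ratio(),
    )

print(detect_stage("Classify the banking query: 'my card was declined'"))
```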
Next Steps
- Algorithm Comparison – Detailed comparison of GEPA vs MIPRO
- System Specifications – How specs guide optimization
- Configuration Reference – Complete parameter documentation
- Training Guide – Step-by-step training instructions
- Banking77 Example – Complete walkthrough