Prompt optimization automatically improves prompts for classification, reasoning, and instruction-following tasks. Synth supports two state-of-the-art algorithms: GEPA (Genetic Evolution of Prompt Architectures), which evolves a population of prompt variants, and MIPRO (Meta-Instruction PROposer), which proposes instructions with a meta-LLM and Bayesian (TPE) search. References:
  • GEPA: Agrawal et al. (2025). “GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning.” arXiv:2507.19457
  • MIPRO: Opsahl-Ong et al. (2024). “Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs.” arXiv:2406.11695

Quick Start Checklist

1. Build a prompt evaluation task app

Define a task app that evaluates prompt performance on your task (classification accuracy, QA correctness, etc.). → Read: Task App requirements
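The exact request/response contract is described on the Task App requirements page. As a rough sketch of the evaluation logic a task app wraps (the function names and data shapes here are illustrative assumptions, not the Synth task-app API), a classification task app boils down to scoring the policy model's predictions against gold labels:

  # Hypothetical sketch of the scoring logic inside a prompt-evaluation task app.
  # `call_policy_model` and the Example shape are illustrative assumptions,
  # not the Synth task-app API.
  from dataclasses import dataclass

  @dataclass
  class Example:
      text: str
      label: str

  def evaluate_seed(example: Example, call_policy_model) -> float:
      """Return 1.0 if the policy model's prediction matches the gold label."""
      prediction = call_policy_model(example.text).strip().lower()
      return 1.0 if prediction == example.label.lower() else 0.0

  def evaluate_batch(examples: list[Example], call_policy_model) -> float:
      """Classification accuracy over a batch of seeds."""
      scores = [evaluate_seed(ex, call_policy_model) for ex in examples]
      return sum(scores) / len(scores) if scores else 0.0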

2. Deploy and verify the service

Smoke-test locally, then deploy to Modal or your host of choice once health checks pass.
→ Read: Deploying task apps
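Before launching a job it is worth confirming the deployed service actually responds. A minimal smoke test might look like the following (the /health path and API-key header are assumptions for illustration; check the deployment docs for the real contract):

  # Minimal smoke test against a deployed task app.
  # The /health route and X-API-Key header are illustrative assumptions.
  import os
  import urllib.request

  def check_health(base_url: str) -> bool:
      req = urllib.request.Request(
          f"{base_url}/health",
          headers={"X-API-Key": os.environ.get("TASK_APP_API_KEY", "")},
      )
      with urllib.request.urlopen(req, timeout=10) as resp:
          return resp.status == 200

  if __name__ == "__main__":
      print("healthy:", check_health("https://your-task-app.modal.run"))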

3. Author the prompt optimization config

Capture algorithm choice (GEPA or MIPRO), initial prompt template, training/validation seeds, and optimization parameters in TOML.
→ Read: Prompt optimization configs
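The full schema lives on the Prompt optimization configs page. The field names below are placeholders meant only to show the kind of information the TOML captures and how you might sanity-check it before launching a job:

  # Sanity-check a prompt-optimization TOML before launching a job.
  # The field names (algorithm, policy_model, ...) are illustrative placeholders;
  # consult the config reference for the real schema.
  import tomllib  # Python 3.11+ standard library

  REQUIRED = ["algorithm", "policy_model", "initial_prompt", "training_seeds"]

  with open("config.toml", "rb") as f:
      config = tomllib.load(f)

  missing = [key for key in REQUIRED if key not in config]
  if missing:
      raise ValueError(f"config.toml is missing fields: {missing}")
  if config["algorithm"] not in ("gepa", "mipro"):
      raise ValueError("algorithm must be 'gepa' or 'mipro'")
  print("config looks plausible:", config["algorithm"])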

4. Launch the optimization job

Run uvx synth-ai train --config config.toml to create the job and stream status/metrics.
→ Read: Launch training jobs

5. Query and evaluate results

Use the Python API or REST endpoints to retrieve optimized prompts and evaluate them on held-out validation sets.
→ Read: Querying results
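As an illustration of the REST side (the endpoint path, response fields, and environment variables here are assumptions, not the documented API; see Querying results for the real interface), polling a job and pulling out its best prompt could look like:

  # Hypothetical sketch of polling a prompt-optimization job over REST.
  # Base URL, endpoint path, and response fields are illustrative assumptions.
  import json
  import os
  import time
  import urllib.request

  BASE_URL = os.environ.get("SYNTH_API_URL", "https://api.example.com")
  API_KEY = os.environ.get("SYNTH_API_KEY", "")

  def get_job(job_id: str) -> dict:
      req = urllib.request.Request(
          f"{BASE_URL}/prompt-optimization/jobs/{job_id}",
          headers={"Authorization": f"Bearer {API_KEY}"},
      )
      with urllib.request.urlopen(req, timeout=30) as resp:
          return json.load(resp)

  def wait_for_best_prompt(job_id: str, poll_seconds: int = 30) -> str:
      while True:
          job = get_job(job_id)
          if job.get("status") in ("succeeded", "failed"):
              return job.get("best_prompt", "")
          time.sleep(poll_seconds)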

Algorithm Overview

GEPA (Genetic Evolution of Prompt Architectures)

Best for: Broad exploration, diverse prompt variants, classification tasks
Reference: Agrawal et al. (2025)
GEPA uses evolutionary principles to explore the prompt space:
  • Population-based search with multiple prompt variants
  • LLM-guided mutations for intelligent prompt modifications
  • Pareto optimization balancing performance and prompt length
  • Multi-stage support for pipeline optimization
Typical results: Improves accuracy from 60-75% (baseline) to 85-90%+ over 15 generations.
Key features (a toy Pareto-front sketch follows this list):
  • Maintains a Pareto front of non-dominated solutions
  • Supports both template mode and pattern-based transformations
  • Module-aware evolution for multi-stage pipelines
  • Reflective feedback from execution traces
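The Pareto front mentioned above is the set of prompt variants that no other variant beats on both accuracy and prompt length. A toy computation of such a front (illustrative only, not Synth's GEPA implementation):

  # Toy Pareto-front computation over (accuracy, prompt length) pairs.
  # Illustrates the idea of keeping non-dominated variants; not Synth's GEPA code.
  def pareto_front(variants: list[dict]) -> list[dict]:
      """Keep variants not dominated by another (higher accuracy AND shorter prompt)."""
      front = []
      for v in variants:
          dominated = any(
              other["accuracy"] >= v["accuracy"]
              and other["length"] <= v["length"]
              and (other["accuracy"] > v["accuracy"] or other["length"] < v["length"])
              for other in variants
          )
          if not dominated:
              front.append(v)
      return front

  variants = [
      {"prompt": "A", "accuracy": 0.88, "length": 420},
      {"prompt": "B", "accuracy": 0.85, "length": 210},
      {"prompt": "C", "accuracy": 0.80, "length": 400},  # dominated by B
  ]
  print(pareto_front(variants))  # A and B survive; C is dropped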

MIPRO (Meta-Instruction PROposer)

Best for: Efficient optimization, task-specific improvements, faster convergence
Reference: Opsahl-Ong et al. (2024)
MIPRO uses meta-learning to propose better instructions:
  • Meta-LLM (e.g., GPT-4o-mini) generates instruction variants
  • TPE (Tree-structured Parzen Estimator) guides Bayesian search
  • Bootstrap phase collects few-shot examples from high-scoring seeds
  • Reference corpus (up to 50k tokens) enriches meta-prompts
  • System spec integration for constraint-aware optimization
Typical results: Achieves similar accuracy gains with fewer evaluations (~96 rollouts vs. ~1000 for GEPA).
Key features (a toy bootstrap sketch follows this list):
  • Bootstrap phase initializes with task-specific examples
  • Program-aware instruction proposals
  • Multi-stage pipeline support with LCS-based stage detection
  • Token budget tracking and cost optimization
  • System spec integration for constraint-aware optimization
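The bootstrap phase can be pictured as filtering rollouts down to the highest-scoring ones and reusing them as few-shot demonstrations in the meta-prompt. A toy sketch (illustrative only, not the actual MIPRO implementation):

  # Toy bootstrap phase: keep the best-scoring rollouts as few-shot demonstrations
  # for the meta-model's instruction proposals. Illustrative only.
  def bootstrap_demonstrations(
      rollouts: list[dict], min_score: float = 0.9, k: int = 4
  ) -> list[dict]:
      """Pick up to k high-scoring (input, output) pairs to use as few-shot examples."""
      good = [r for r in rollouts if r["score"] >= min_score]
      good.sort(key=lambda r: r["score"], reverse=True)
      return [{"input": r["input"], "output": r["output"]} for r in good[:k]]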

Architecture: Inference Interception

🚨 Critical: Both algorithms use an interceptor pattern that ensures optimized prompts never reach task apps. All prompt modifications happen in the backend via an inference interceptor that substitutes prompts before they reach the LLM.
✅ CORRECT FLOW:
Backend → register_prompt → Interceptor → substitutes → LLM

❌ WRONG FLOW:
Backend → prompt_template in payload → Task App (NEVER DO THIS)
This separation ensures:
  • Task apps remain unchanged during optimization
  • Prompt optimization logic stays in the backend
  • Secure, correct prompt substitution
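As a mental model (not the backend's actual code), the interceptor can be thought of as a thin wrapper that swaps the registered optimized prompt into the request immediately before the inference call, so the task app's own request never changes:

  # Mental-model sketch of the inference interceptor. Purely illustrative;
  # the real interceptor lives in the Synth backend.
  REGISTERED_PROMPTS: dict[str, str] = {}  # prompt_id -> optimized system prompt

  def register_prompt(prompt_id: str, optimized_prompt: str) -> None:
      REGISTERED_PROMPTS[prompt_id] = optimized_prompt

  def intercept_and_call(prompt_id: str, messages: list[dict], call_llm) -> str:
      optimized = REGISTERED_PROMPTS.get(prompt_id)
      if optimized is not None:
          # Substitute the optimized system prompt; the rest of the payload is untouched.
          messages = [{"role": "system", "content": optimized}] + [
              m for m in messages if m["role"] != "system"
          ]
      return call_llm(messages)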

Supported Models

Policy Models (Task Execution)

Both GEPA and MIPRO support policy models from:
  • OpenAI: gpt-4o, gpt-4o-mini, gpt-4.1, gpt-4.1-mini, gpt-4.1-nano, gpt-5, gpt-5-mini, gpt-5-nano
  • Groq: gpt-oss-20b, gpt-oss-120b, llama-3.3-70b-versatile, qwen-32b, qwen3-32b
  • Google: gemini-2.5-pro, gemini-2.5-pro-gt200k, gemini-2.5-flash, gemini-2.5-flash-lite

Mutation Models (GEPA Only)

Used to generate prompt mutations:
  • Common choices: openai/gpt-oss-120b, llama-3.3-70b-versatile
  • Nano models are rejected (too small for generation tasks)

Meta Models (MIPRO Only)

Used to generate instruction proposals:
  • Common choices: gpt-4o-mini, gpt-4.1-mini (most common default)
  • Nano models are rejected (too small for generation tasks)
Note: gpt-5-pro is explicitly rejected for all model types (too expensive: 15/15/120 per 1M tokens). See Supported Models for complete details.

When to Use Each Algorithm

Aspect             | GEPA                         | MIPRO
Search Method      | Genetic evolution            | Meta-LLM + TPE
Exploration        | Broad, diverse variants      | Focused, efficient
Computational Cost | Lower (fewer LLM calls)      | Higher (meta-model calls)
Convergence        | 10-15 generations            | 10-20 iterations
Best For           | Classification, multi-hop QA | Task-specific optimization
Evaluation Budget  | ~1000 rollouts               | ~96 rollouts
Choose GEPA if:
  • You want diverse prompt variants (Pareto front)
  • You have a large evaluation budget (1000+ rollouts)
  • You need broad exploration of the prompt space
Choose MIPRO if:
  • You want faster convergence with fewer evaluations
  • You have clear task structure (can bootstrap with examples)
  • You need efficient optimization (mini-batch evaluation)

Multi-Stage Pipeline Support

Both algorithms support optimizing prompts for multi-stage pipelines (e.g., Banking77 classifier → calibrator):
  • LCS-based stage detection automatically identifies which stage is being called (a toy sketch follows this list)
  • Per-stage optimization evolves separate instructions for each pipeline module
  • Unified evaluation tracks end-to-end performance across all stages
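As a rough picture of LCS-based stage detection (a toy sketch, not the production matcher), the incoming prompt can be compared against each stage's registered template and assigned to the stage with the longest common subsequence of tokens:

  # Toy LCS-based stage detection. Illustrative only; the production matcher
  # is part of the Synth backend.
  def lcs_length(a: list[str], b: list[str]) -> int:
      """Length of the longest common subsequence of two token lists."""
      dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
      for i, tok_a in enumerate(a):
          for j, tok_b in enumerate(b):
              dp[i + 1][j + 1] = dp[i][j] + 1 if tok_a == tok_b else max(dp[i][j + 1], dp[i + 1][j])
      return dp[len(a)][len(b)]

  def detect_stage(prompt: str, stage_templates: dict[str, str]) -> str:
      tokens = prompt.split()
      return max(
          stage_templates,
          key=lambda stage: lcs_length(tokens, stage_templates[stage].split()),
      )

  templates = {
      "classifier": "Classify the banking query into one of 77 intents ...",
      "calibrator": "Given the predicted intent and its confidence, calibrate ...",
  }
  print(detect_stage("Classify the banking query: card declined at ATM", templates))  # classifier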

Next Steps