Verifier optimization tunes LLM-based evaluators—their rubrics, criteria weights, and evaluation prompts—to produce scores that correlate with ground truth. This cookbook covers two patterns:
| Pattern | Domain | Challenge | Ground Truth |
| --- | --- | --- | --- |
| Data-Heavy | Code evaluation | Large artifacts, many test cases | Deterministic tests |
| Criteria-Heavy | Visual evaluation | Multi-dimensional assessment | Human ratings |

Data-Heavy: Code Evaluation

When evaluating code, deterministic signals exist (compilation, tests) but don’t capture all quality dimensions. Verifier optimization tunes the rubric so scores correlate with these signals while also capturing harder-to-measure qualities.
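
In pseudocode, the optimization loop is simple: GEPA mutates the verifier's rubric and prompt, scores each candidate by agreement with the deterministic signals, and keeps the winner. The sketch below is schematic; `propose_mutation` and `fitness` stand in for synth-ai internals and are not its actual API:

```python
# Schematic GEPA-style loop for verifier optimization.
# `propose_mutation` and `fitness` are illustrative callables, not synth-ai's API.
def optimize_verifier(seed, propose_mutation, fitness, generations=10):
    best, best_fit = seed, fitness(seed)
    for _ in range(generations):
        candidate = propose_mutation(best)  # LLM rewrites the rubric/prompt
        fit = fitness(candidate)            # correlation with ground truth
        if fit > best_fit:
            best, best_fit = candidate, fit
    return best
```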

The Challenge

Deterministic tests provide partial ground truth:
| Signal | Tests Provide | Verifier Adds |
| --- | --- | --- |
| Correctness | ✅ Pass/fail | Pattern adherence to reference |
| Completeness | ❌ May miss stubs | Detects `todo!()` placeholders |
| Code quality | ❌ Not measured | Idiomatic patterns, readability |
| Architecture | ❌ Not measured | Engine pattern compliance |
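
To make the ground-truth side concrete, here is one way the deterministic signals could be collected for a Rust artifact. The `collect_signals` helper and its output parsing are illustrative assumptions, not part of synth-ai:

```python
# Sketch: collect deterministic ground-truth signals from a Rust crate.
# The helper and its parsing are simplified for illustration.
import re
import subprocess

def collect_signals(crate_dir: str) -> dict:
    # Compilation signal: does `cargo build` exit cleanly?
    build = subprocess.run(
        ["cargo", "build"], cwd=crate_dir, capture_output=True, text=True
    )
    compile_success = build.returncode == 0

    # Test signal: parse the summary line,
    # e.g. "test result: ok. 12 passed; 1 failed; ..."
    test = subprocess.run(
        ["cargo", "test"], cwd=crate_dir, capture_output=True, text=True
    )
    m = re.search(r"(\d+) passed; (\d+) failed", test.stdout)
    passed, failed = (int(m.group(1)), int(m.group(2))) if m else (0, 0)
    total = passed + failed
    return {
        "compile_success": compile_success,
        "test_pass_rate": passed / total if total else 0.0,
    }
```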

What Gets Optimized

GEPA evolves the verifier’s rubric and evaluation prompt:
```python
# Baseline rubric - hand-written criteria
baseline_rubric = Rubric(
    criteria=[
        Criterion(id="compilation", weight=1.0),
        Criterion(id="correctness", weight=1.0),
        Criterion(id="completeness", weight=1.0),
    ]
)

# After optimization - learned weights and refined descriptions
optimized_rubric = Rubric(
    criteria=[
        Criterion(id="compilation", weight=2.0,
                  description="Code compiles without errors or warnings"),
        Criterion(id="correctness_vs_gold", weight=3.0,
                  description="Implementation matches reference behavior exactly"),
        Criterion(id="completeness", weight=2.5,
                  description="All todo!() markers replaced with implementations"),
        Criterion(id="pattern_adherence", weight=1.5,
                  description="Follows engine architecture patterns"),
        Criterion(id="code_quality", weight=1.0,
                  description="Uses idiomatic Rust constructs"),
    ]
)
```
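
For intuition, a weighted rubric typically reduces to a weighted average over per-criterion scores. The sketch below assumes the verifier LLM has already returned a 0-1 score per criterion; `score_rubric` is illustrative, not synth-ai's actual scoring API:

```python
# Sketch: combine per-criterion scores (0-1, as judged by the verifier LLM)
# into a single rubric score using the learned weights.
def score_rubric(rubric, criterion_scores: dict[str, float]) -> float:
    total_weight = sum(c.weight for c in rubric.criteria)
    weighted = sum(
        c.weight * criterion_scores.get(c.id, 0.0) for c in rubric.criteria
    )
    return weighted / total_weight  # normalized back to 0-1

# Example: with the optimized weights above, correctness_vs_gold dominates.
# score_rubric(optimized_rubric, {"compilation": 1.0, "correctness_vs_gold": 0.8})
```

Normalizing by the total weight keeps scores comparable across candidate rubrics whose weights differ.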


Configuration

verifier_optimization.toml
```toml
[optimization]
algorithm = "gepa"
target = "verifier"

[optimization.verifier]
optimize_rubric = true
optimize_evaluation_prompt = true
optimize_criteria_weights = true

[ground_truth]
source = "deterministic_tests"
signals = ["compile_success", "test_pass_rate"]

[evaluation]
metric = "correlation"
```
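
The `metric = "correlation"` setting means each candidate verifier is scored by how well its outputs track the deterministic signals. A minimal sketch of that fitness function, using Pearson correlation from the standard library (Python 3.10+):

```python
# Sketch: fitness for a candidate verifier = correlation between its
# rubric scores and the deterministic ground truth (test pass rates).
from statistics import correlation  # Pearson r; Python 3.10+

def verifier_fitness(verifier_scores: list[float],
                     pass_rates: list[float]) -> float:
    # GEPA keeps verifier candidates whose scores rank artifacts
    # the same way the deterministic tests do.
    return correlation(verifier_scores, pass_rates)
```

A candidate that rates broken code highly gets a low fitness and is discarded, even if its scores look plausible in isolation.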

Data-Heavy Characteristics

| Factor | Impact on Verifier Optimization |
| --- | --- |
| Many test cases | More ground truth for correlation |
| Large artifacts | Verifier must handle 200-300KB code (see sketch below) |
| Expensive rollouts | Amortize cost across verifier candidates |
| 5+ criteria | More weights to optimize |
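
Large artifacts are the main practical constraint: a 200-300KB codebase rarely fits in one evaluation call, so the verifier prompt typically includes only the most relevant slices. One way to budget the artifact is sketched below; the selection heuristic is an assumption for illustration, not synth-ai's implementation:

```python
# Sketch: trim a large code artifact to a character budget before it is
# embedded in the verifier's evaluation prompt. Heuristic only.
def select_for_verifier(files: dict[str, str], budget_chars: int = 60_000) -> str:
    # Prioritize files the rubric cares about: unfinished code sorts first.
    ranked = sorted(files.items(), key=lambda kv: "todo!()" not in kv[1])
    chunks, used = [], 0
    for path, text in ranked:
        snippet = text[: budget_chars - used]
        chunks.append(f"// FILE: {path}\n{snippet}")
        used += len(snippet)
        if used >= budget_chars:
            break
    return "\n\n".join(chunks)
```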

Run It

View on GitHub
```bash
pip install synth-ai
curl -LO https://raw.githubusercontent.com/synth-laboratories/synth-ai/main/demos/engine_bench/run_gepa_minimal.py
curl -LO https://raw.githubusercontent.com/synth-laboratories/synth-ai/main/demos/engine_bench/localapi_engine_bench.py
curl -LO https://raw.githubusercontent.com/synth-laboratories/synth-ai/main/demos/engine_bench/enginebench_gepa_minimal.toml
export OPENAI_API_KEY="your-openai-key"
python run_gepa_minimal.py
```

Criteria-Heavy: Visual Evaluation

Visual evaluation has no deterministic ground truth—only human judgment. Verifier optimization tunes criteria to match human ratings across multiple dimensions.

The Challenge

Visual fidelity requires multi-dimensional assessment:
| Criterion | What to Measure | Baseline Score |
| --- | --- | --- |
| Color Scheme | Background, text, accent colors | 2.4/10 |
| Typography | Font sizes, weights, hierarchy | 3.8/10 |
| Layout | Spacing, margins, positioning | 4.0/10 |
| Visual Elements | Icons, images, decorations | 2.8/10 |
| Overall | Would it pass for the original? | 3.2/10 |

What Gets Optimized

GEPA evolves the verifier’s evaluation prompt and criteria definitions:
```python
# Baseline - generic evaluation prompt
baseline_prompt = "Rate how similar the generated image is to the original."

# After optimization - specific, calibrated evaluation
optimized_prompt = """Evaluate visual fidelity on these criteria (0-10 each):

1. COLOR SCHEME & BRANDING:
   - Exact color matches for backgrounds (#F8F8F8, not dark)
   - Brand colors preserved (green CTAs, purple accents)

2. TYPOGRAPHY & TEXT STYLING:
   - Heading sizes 2-3x body text
   - Sans-serif fonts, clear hierarchy

3. LAYOUT & SPACING:
   - Wide margins (not edge-to-edge)
   - Generous padding between sections

4. VISUAL ELEMENTS:
   - Correct icons and logos
   - Gradients and shadows match

5. OVERALL:
   - Would someone mistake it for the real site?

Deduct points for: dark themes when original is light,
wrong font weights, missing whitespace."""
```
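
Once optimized, the prompt is sent to the vision backend together with both screenshots. A minimal sketch using the google-generativeai SDK and the `optimized_prompt` above; pairing the original and generated images in a single call is an assumption about the demo's setup:

```python
# Sketch: run the optimized evaluation prompt through a vision model.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="your-gemini-key")
model = genai.GenerativeModel("gemini-2.5-flash")

def evaluate_fidelity(original_png: str, generated_png: str) -> str:
    response = model.generate_content([
        optimized_prompt,
        "ORIGINAL:", Image.open(original_png),
        "GENERATED:", Image.open(generated_png),
    ])
    return response.text  # per-criterion 0-10 scores, as the prompt requests
```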


Configuration

visual_verifier_optimization.toml
```toml
[optimization]
algorithm = "gepa"
target = "verifier"

[optimization.verifier]
optimize_evaluation_prompt = true
optimize_criteria_definitions = true
backend_model = "gemini-2.5-flash"

[ground_truth]
source = "human_ratings"
dimensions = ["color", "typography", "layout", "elements", "overall"]

[evaluation]
metric = "correlation"
```
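
With human ratings as ground truth, the fitness signal is the same correlation idea applied per dimension and then averaged. A sketch, assuming ratings arrive as one dict of dimension scores per sample (the data shape is illustrative):

```python
# Sketch: average per-dimension correlation between verifier scores and
# human ratings. `DIMENSIONS` mirrors the [ground_truth] config above.
from statistics import correlation

DIMENSIONS = ["color", "typography", "layout", "elements", "overall"]

def multi_dim_fitness(verifier: list[dict], human: list[dict]) -> float:
    per_dim = [
        correlation(
            [v[d] for v in verifier],
            [h[d] for h in human],
        )
        for d in DIMENSIONS
    ]
    return sum(per_dim) / len(per_dim)
```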

Criteria-Heavy Characteristics

| Factor | Impact on Verifier Optimization |
| --- | --- |
| 5 evaluation dimensions | Each criterion needs calibration |
| Subjective ground truth | Human ratings have variance |
| Vision model required | Multimodal prompt optimization |
| No deterministic signals | Entirely dependent on verifier quality |

Run It

View on GitHub
```bash
pip install synth-ai
curl -LO https://raw.githubusercontent.com/synth-laboratories/synth-ai/main/demos/web-design/run_demo.py
curl -LO https://raw.githubusercontent.com/synth-laboratories/synth-ai/main/demos/web-design/gepa_config.toml
export GEMINI_API_KEY="your-gemini-key"
python run_demo.py
```

Comparison

| Aspect | Data-Heavy (Code) | Criteria-Heavy (Visual) |
| --- | --- | --- |
| Ground truth | Deterministic tests | Human ratings |
| Optimized components | Rubric weights, criterion descriptions | Evaluation prompt, criteria definitions |
| Verifier model | gpt-5-mini | gemini-2.5-flash (vision) |
| Correlation target | Compile + test pass rate | Human similarity scores |
| Main challenge | Many criteria to weight | Subjective dimensions to calibrate |

When to Use Each Pattern

Data-Heavy:
- Ground truth from deterministic signals (tests, validators)
- Many evaluation criteria to weight
- Large artifacts requiring structured evaluation
- Correlation with existing metrics

Criteria-Heavy:
- Subjective or creative evaluation
- No deterministic ground truth
- Multi-dimensional quality assessment
- Vision or multimodal evaluation