After a prompt optimization job completes, use the Python SDK helpers (the only supported interface) to fetch optimized prompts, inspect Pareto fronts, and re-run validation seeds against your container. GEPA supports multiple optimization modes: prompt (LLM prompts), context (retrieval/context), and optimize_anything (code, DSL configs, etc.). Result structure varies by mode; use the extraction patterns below for mode-agnostic access.

Querying Results (Python SDK)

import os
from synth_ai import SynthClient

JOB_ID = "offline_job_id_here"

client = SynthClient(api_key=os.environ["SYNTH_API_KEY"])
job = client.optimization.offline.get(job_id=JOB_ID)

print("status:", job.status())
print("events:", job.events(limit=100))
print("artifacts:", job.artifacts())

Optimization Modes

GEPA supports three optimization modes; artifact structure varies by mode:
  • prompt: messages with role/pattern/content. Typical use: LLM system prompts.
  • context: similar to prompt, with retrieval/context config. Typical use: RAG, context injection.
  • optimize_anything: dsl_config, candidate_code, solver_code, etc. Typical use: code, configs, arbitrary artifacts.
Use list_candidates_typed(include="artifact_payload") for mode-agnostic access; each PolicyCandidate has candidate_content (pre-extracted text) and artifact_kind (e.g. prompt, dsl_config).
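As a sketch of the mode-agnostic pattern, the helper below prefers the pre-extracted candidate_content and falls back to the raw payload. The candidate dicts are illustrative stand-ins, not the SDK's exact PolicyCandidate shape.

```python
# Hypothetical candidate payloads; the field names (candidate_content,
# artifact_kind, artifact_payload) follow the docs above, but the exact
# shapes here are assumptions for illustration.
candidates = [
    {"artifact_kind": "prompt",
     "candidate_content": "You are a careful assistant.",
     "artifact_payload": {"messages": [{"role": "system", "content": "..."}]}},
    {"artifact_kind": "dsl_config",
     "candidate_content": "solver:\n  max_depth: 4",
     "artifact_payload": {"dsl_config": {"solver": {"max_depth": 4}}}},
]

def extract_text(candidate: dict) -> str:
    """Mode-agnostic: prefer pre-extracted text, fall back to the raw payload."""
    text = candidate.get("candidate_content")
    if text:
        return text
    return str(candidate.get("artifact_payload", ""))

for cand in candidates:
    print(cand["artifact_kind"], "->", extract_text(cand)[:40])
```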

Understanding Results

Score Types

Prompt learning jobs track two types of scores:
  • prompt_best_train_score: Best accuracy on training seeds (used during optimization)
  • prompt_best_validation_score: Best accuracy on validation seeds (held-out evaluation)
The validation score provides an unbiased estimate of generalization performance.
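The gap between the two scores is a quick overfitting signal; a minimal illustration (the score values here are made up):

```python
def generalization_gap(train_score: float, validation_score: float) -> float:
    """Positive gap suggests the prompt overfits the training seeds."""
    return train_score - validation_score

# Example: strong train accuracy but weaker held-out accuracy
gap = generalization_gap(train_score=0.90, validation_score=0.82)
print(f"gap: {gap:.2f}")  # a large gap warrants more validation seeds
```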

Pareto Front

GEPA maintains a Pareto front of candidates balancing objectives (e.g. accuracy, token count, tool call rate). The structure depends on optimization mode (prompt, context, optimize_anything). Query multiple ranks via list_candidates_typed; each PolicyCandidate has candidate_content (pre-extracted text) and artifact_payload (full artifact):
# Canonical v1 currently exposes job status/events/artifacts for evaluation.
# Use `job.artifacts()` plus your verifier outputs to inspect winners.
print(job.artifacts())
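To make the Pareto idea concrete, here is a minimal dominance filter over (accuracy, prompt tokens) pairs. The tuples are illustrative values, not an SDK type:

```python
def pareto_front(candidates):
    """Keep candidates not dominated by any other.

    One candidate dominates another if it has >= accuracy and <= tokens,
    with at least one strict inequality.
    """
    front = []
    for i, (acc_i, tok_i) in enumerate(candidates):
        dominated = any(
            acc_j >= acc_i and tok_j <= tok_i and (acc_j > acc_i or tok_j < tok_i)
            for j, (acc_j, tok_j) in enumerate(candidates) if j != i
        )
        if not dominated:
            front.append((acc_i, tok_i))
    return front

# (accuracy, prompt tokens): the 0.80/900 candidate is dominated by 0.85/600
print(pareto_front([(0.85, 600), (0.80, 900), (0.90, 1200)]))
```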

Validation Evaluation

After optimization, run held-out seeds against your container using the SDK or direct HTTP calls:
import asyncio

import httpx

# container_url, environment_api_key, and optimized_prompt come from your
# deployment config and the optimization results queried above.
async def validate_seed(seed: int) -> float:
    # Send a rollout directly to your container
    async with httpx.AsyncClient() as client:
        response = await client.post(
            f"{container_url}/rollout",
            json={
                "trace_correlation_id": f"validation-{seed}",
                "env": {"seed": seed},  # Held-out seed
                "policy": {"config": {"prompt_template": optimized_prompt}},
            },
            headers={"X-API-Key": environment_api_key},
        )
        result = response.json()
        print(f"Reward: {result['reward_info']['outcome_reward']}")
        return result["reward_info"]["outcome_reward"]

asyncio.run(validate_seed(100))
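Aggregating held-out rollouts into a single validation score can be as simple as averaging outcome_reward (the response shape follows the snippet above; the sample results are made up):

```python
def mean_validation_reward(results: list[dict]) -> float:
    """Average outcome_reward across held-out rollouts (0.0 if empty)."""
    rewards = [r["reward_info"]["outcome_reward"] for r in results]
    return sum(rewards) / len(rewards) if rewards else 0.0

# Illustrative rollout responses for three held-out seeds
results = [
    {"reward_info": {"outcome_reward": 1.0}},
    {"reward_info": {"outcome_reward": 0.0}},
    {"reward_info": {"outcome_reward": 1.0}},
]
print(f"validation reward: {mean_validation_reward(results):.2f}")  # 0.67
```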

Expected Performance

GEPA typically improves accuracy over generations:
  • Generation 1 (baseline): 60-75% typical accuracy (initial random/baseline prompts)
  • Generation 5: 75-80% (early optimization gains)
  • Generation 10: 80-85% (convergence begins)
  • Generation 15 (final): 85-90%+ (optimized prompts on the Pareto front)

Next Steps