Config Sections

[algorithm]
[services]
[policy]        # or legacy [model]
[rollout]
[training]
[evaluation]
[compute]       # optional but recommended
[judge]         # optional (rubric/judge settings)
[tags]          # optional metadata

Field Requirements (from synth_ai/train/configs/rl.py)

  • [algorithm]
    • type must be "online"
    • method must be one of "policy_gradient", "ppo", or "gspo"
    • variety (string identifying the training variant)
  • [services]
    • task_url (required; judge_url optional) – matches RLServicesConfig
  • [policy] (preferred) or legacy [model]
    • trainer_mode and label are required
    • Either source or base must be set, but not both (see ModelConfig validator); both forms are sketched after this list
  • [rollout]
    • env_name
    • policy_name
    • max_turns
    • episodes_per_batch
    • max_concurrent_rollouts
  • [training]
    • num_epochs
    • iterations_per_epoch
    • max_turns
    • batch_size
    • group_size
    • learning_rate
    • Optional: gradient_accumulation_steps, weight_sync, lora, rewards (per RLTrainingConfig); sketched after the Sample TOML
  • [evaluation]
    • instances
    • every_n_iters
    • seeds (list of ints)
  • [compute] (optional but strongly recommended)
    • Standard fields from ComputeConfig (e.g., gpu_type, gpu_count, nodes, topology.reference_placement); also sketched after the Sample TOML
  • [judge] / [rubric]
    • Optional fields for judge/rubric weighting; see JudgeConfig if you need blended rewards
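
Because either source or base must be set (never both), the model section can take two shapes. Below is a minimal sketch of both, using placeholder identifiers rather than real defaults; an actual config contains only one of these tables:

# Preferred form: [policy] resuming from a source checkpoint (identifier is illustrative)
[policy]
trainer_mode = "ppo"
label = "my-policy"
source = "ft:my-checkpoint"

# Legacy form: [model] starting from a base model instead of a source (model id is illustrative)
[model]
trainer_mode = "ppo"
label = "my-policy"
base = "Qwen/Qwen2.5-7B-Instruct"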

Sample TOML

[algorithm]
type = "online"
method = "gspo"
variety = "default"

[services]
task_url = "https://my-task-app.modal.run"

[policy]
trainer_mode = "ppo"
label = "gpt-4o-mini"
source = "gpt-4o-mini"

[rollout]
env_name = "crafter"
policy_name = "policy-gspo"
max_turns = 32
episodes_per_batch = 4
max_concurrent_rollouts = 16

[training]
num_epochs = 50
iterations_per_epoch = 200
max_turns = 32
batch_size = 64
group_size = 4
learning_rate = 3e-4

[evaluation]
instances = 32
every_n_iters = 20
seeds = [1, 2, 3, 4]

[compute]
gpu_type = "A100"
gpu_count = 4
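
The sample above omits several optional pieces. The sketch below shows where they attach; the concrete values and the tag key are placeholders, and the nested shapes of weight_sync, lora, rewards, and [judge] come from RLTrainingConfig and JudgeConfig rather than being spelled out here:

[services]
task_url = "https://my-task-app.modal.run"
judge_url = "https://my-judge.modal.run"      # optional judge service (illustrative URL)

[training]
# ...required fields as in the sample above...
gradient_accumulation_steps = 2               # optional (illustrative value)
# weight_sync, lora, and rewards are optional sub-tables; see RLTrainingConfig

[compute]
gpu_type = "A100"
gpu_count = 4
nodes = 1                                     # optional (illustrative value)

[compute.topology]
reference_placement = "..."                   # see ComputeConfig for accepted values

[tags]
experiment = "crafter-gspo"                   # free-form metadata (illustrative key/value)

# [judge] / [rubric]: optional weighting fields; see JudgeConfig for blended rewards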