On-Policy RL

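This guide walks through the v1 RL example under synth-ai/examples/rl/, run against your own backend. Start each step from the directory that contains your synth-ai checkout; the cd commands and script paths below are relative to it.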

0) Prereqs

Backend URL and API key (the commands below read them from PROD_BACKEND_URL and SYNTH_API_KEY)
Modal CLI logged in
uv installed
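
The prerequisites above can be checked before starting with a small preflight sketch. The helper names require_tools and require_vars are hypothetical and not part of the example scripts. The check only confirms that a CLI is on PATH and that a variable is non-empty; it cannot tell whether the Modal CLI is actually logged in.

```shell
# Return non-zero if any named CLI tool is not on PATH.
require_tools() {
  rt_status=0
  for rt_tool in "$@"; do
    if ! command -v "$rt_tool" >/dev/null 2>&1; then
      echo "missing tool: $rt_tool"
      rt_status=1
    fi
  done
  return "$rt_status"
}

# Return non-zero if any named environment variable is unset or empty.
require_vars() {
  rv_status=0
  for rv_name in "$@"; do
    # Indirect lookup that also works in plain POSIX sh.
    eval "rv_value=\${$rv_name:-}"
    if [ -z "$rv_value" ]; then
      echo "missing variable: $rv_name"
      rv_status=1
    fi
  done
  return "$rv_status"
}
```

For this guide, a call such as require_tools modal uv && require_vars PROD_BACKEND_URL SYNTH_API_KEY would cover the tools and variables that the later commands use.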

1) Create the Task App (environment service)

Deploy the example Task App to Modal (uses Crafter helpers):

cd synth-ai/examples/rl
bash deploy_task_app.sh

The script prints the service URL. Export it:

export TASK_APP_BASE_URL="https://<your-modal-app>.modal.run"
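
Because the URL is pasted by hand, a quick sanity check before exporting it can catch a missing scheme or a stray trailing slash. The sketch below is an optional convenience, not something the example scripts require; normalize_base_url is a hypothetical helper name, and the assumption that the Task App URL should be https without a trailing slash is inferred from the placeholder above.

```shell
# Print the URL without trailing slashes, or fail if it is not https.
normalize_base_url() {
  nb_url=$1
  # Strip any trailing slashes one at a time.
  while [ "${nb_url%/}" != "$nb_url" ]; do
    nb_url=${nb_url%/}
  done
  case "$nb_url" in
    https://?*)
      printf '%s\n' "$nb_url"
      ;;
    *)
      echo "expected an https:// URL, got: $1" >&2
      return 1
      ;;
  esac
}
```

One way to use it is to wrap the pasted value, for example export TASK_APP_BASE_URL="$(normalize_base_url "<pasted URL>")", so later steps always see the same form.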

2) Mint & upload ENVIRONMENT_API_KEY

Use the helper to mint and store the key as a Modal secret used by the Task App:

cd synth-ai/examples/rl
bash update_task_app_environment_api.sh

This creates or updates the crafter-environment-sdk Modal secret, which holds ENVIRONMENT_API_KEY and optional pass-throughs.

3) Health & wiring check

Run diagnostics against the backend and the Task App to verify headers, auth, and routes:

uv run python synth-ai/examples/rl/check.py \
  --backend-url "$PROD_BACKEND_URL" \
  --api-key "$SYNTH_API_KEY" \
  --task-app-url "$TASK_APP_BASE_URL"

4) OpenAI in the Task App (smoke test)

Call OpenAI from inside the Task App to ensure outbound provider access works:

uv run python synth-ai/examples/rl/openai_in_task_app.py \
  --model gpt-5-nano --num-rollouts 2 --max-steps-each 7 --timeout-seconds 1200

5) Run full on-policy RL (backend‑orchestrated)

Kick off a full RL job via the backend, which uses the trainer ID server-side. TRAINER_ID must be set in your shell along with the variables from the earlier steps:

uv run python synth-ai/examples/rl/run_rl_job.py \
  --backend-url "$PROD_BACKEND_URL" \
  --api-key "$SYNTH_API_KEY" \
  --task-app-url "$TASK_APP_BASE_URL" \
  --trainer-id "$TRAINER_ID" \
  --model "Qwen/Qwen3-0.6B" \
  --batch-size 2 --group-size 4 --stream-seconds 0

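6) Inference with RL weights

Point RL_WEIGHTS_PATH at the checkpoint archive from the finished job and request a completion from the RL-trained model. Replace the elided job ID in the path and the job_<id> placeholder with your own values: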

RL_WEIGHTS_PATH="models/Qwen-Qwen3-0.6B/rl-job_.../checkpoint-epoch-1.tar.gz" \
uv run python synth-ai/examples/rl/hello_rl_completion.py \
  --model "rl:Qwen/Qwen3-0.6B:job_<id>:checkpoint-epoch-1"
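
The --model value above appears to follow the pattern rl:<base model>:<job id>:checkpoint-epoch-<n>. That pattern is inferred from this single example rather than from a documented format, so treat the sketch below as a convenience for avoiding typos. rl_model_id is a hypothetical helper name.

```shell
# Build an RL model identifier from a base model, a job ID, and an epoch.
# Pattern inferred from the example command, not from a documented spec.
rl_model_id() {
  if [ "$#" -ne 3 ]; then
    echo "usage: rl_model_id <base-model> <job-id> <epoch>" >&2
    return 1
  fi
  printf 'rl:%s:%s:checkpoint-epoch-%s\n' "$1" "$2" "$3"
}
```

With a made-up job ID, rl_model_id "Qwen/Qwen3-0.6B" "job_abc123" 1 prints rl:Qwen/Qwen3-0.6B:job_abc123:checkpoint-epoch-1, which can then be passed to --model.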

Refs: the example scripts live in the synth-laboratories/synth-ai repository.
