Changelog #0244 – Week of December 15, 2025

TL;DR

Unified Graphs API for running optimized graphs and verifiers
Graph Gen workflows for dataset-in, graph-out training
Verifier graph training + scoring endpoints
RLM graphs for massive-context inference
Live monitoring dashboard for graph execution
Expanded model provider support
VLM judge support for multi-modal evaluation
Hosted SWE agent judges
Clearer end-to-end onboarding paths across paid products

Graphs API: One Inference Surface

The Graphs API is now the single entry point for running optimized graphs and built-in zero-shot graphs.

What this unlocks

Single endpoint: Run GraphGen and graph-evolve outputs through a unified API.
Consistent validation: Input schemas and non-blocking output validation surface warnings without breaking inference.
Unified UX: One way to run policy graphs, verifier graphs, and RLM graphs.
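To make "non-blocking output validation" concrete, here is a minimal sketch of the general pattern: check an output against expected field types, collect warnings, and still return the output. The `ValidatedOutput` type and `validate_output` helper are hypothetical names used for illustration only; they are not part of the Graphs API.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ValidatedOutput:
    """Graph output plus any validation warnings (hypothetical shape)."""
    output: dict[str, Any]
    warnings: list[str] = field(default_factory=list)


def validate_output(output: dict[str, Any], expected: dict[str, type]) -> ValidatedOutput:
    """Check output fields against expected types; never raises."""
    warnings = []
    for name, expected_type in expected.items():
        if name not in output:
            warnings.append(f"missing field: {name}")
        elif not isinstance(output[name], expected_type):
            actual = type(output[name]).__name__
            warnings.append(
                f"field {name}: expected {expected_type.__name__}, got {actual}"
            )
    # The output is passed through unchanged, even when warnings exist.
    return ValidatedOutput(output=output, warnings=warnings)


result = validate_output({"answer": 42}, {"answer": str, "confidence": float})
print(result.output)
print(result.warnings)
```

The caller always gets the output back and can decide whether to log, alert on, or ignore the warnings.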

Graph Gen: Workflows From Datasets

Graph Gen is the dataset-in, graph-out product surface for building reliable LLM workflows.

Highlights

Built-in judging: Rubric, contrastive, and gold-examples modes.
Graph types: Policy, verifier, and RLM graphs from the same dataset format.
Production inference: Run optimized graphs via the Graphs API.

Verifier Graphs

Verifier graphs let you score traces with calibrated, structured rewards.

What’s new

Graph judge endpoint: Submit a trace and get back a score, reasoning, and event/outcome rewards.
Training path: Use Graph Gen with verifier datasets to train custom judges.
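As a rough sketch of how a client might hold a judge result, the snippet below parses a payload into a typed record. The field names (`score`, `reasoning`, `event_rewards`, `outcome_reward`) and the payload layout are assumptions made for illustration; check the docs for the actual response schema.

```python
from dataclasses import dataclass


@dataclass
class JudgeResult:
    """Hypothetical container for a verifier graph's verdict on one trace."""
    score: float
    reasoning: str
    event_rewards: list[float]  # assumed: one reward per trace event
    outcome_reward: float       # assumed: one reward for the final outcome


def parse_judge_result(payload: dict) -> JudgeResult:
    """Convert a decoded JSON payload (assumed layout) into a JudgeResult."""
    return JudgeResult(
        score=float(payload["score"]),
        reasoning=str(payload.get("reasoning", "")),
        event_rewards=[float(r) for r in payload.get("event_rewards", [])],
        outcome_reward=float(payload.get("outcome_reward", 0.0)),
    )


# Example payload, invented for this sketch.
example_payload = {
    "score": 0.8,
    "reasoning": "The agent completed the task but skipped a check.",
    "event_rewards": [1, 0.5, 0],
    "outcome_reward": 1,
}
verdict = parse_judge_result(example_payload)
print(verdict.score, verdict.event_rewards, verdict.outcome_reward)
```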

RLM Graphs (Massive Context)

RLM graphs handle large contexts by materializing content and searching locally instead of stuffing prompts.
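The changelog does not describe the mechanism in detail, so the following is only a toy illustration of the general idea: write the documents to local storage, search them, and pass just the matching snippets forward instead of the full text. The `materialize` and `search` helpers and the sample documents are hypothetical and are not the RLM implementation.

```python
import tempfile
from pathlib import Path


def materialize(documents: dict[str, str], root: Path) -> None:
    """Write each document to its own file under root."""
    for name, text in documents.items():
        (root / name).write_text(text, encoding="utf-8")


def search(root: Path, query: str, window: int = 40) -> list[tuple[str, str]]:
    """Return (filename, snippet) for the first case-insensitive match per file."""
    hits = []
    needle = query.lower()
    for path in sorted(root.iterdir()):
        text = path.read_text(encoding="utf-8")
        pos = text.lower().find(needle)
        if pos != -1:
            start = max(0, pos - window)
            end = pos + len(query) + window
            hits.append((path.name, text[start:end]))
    return hits


# Two large documents; only one contains the fact we need.
documents = {
    "billing.txt": "Invoices are issued monthly. " * 1000 + "Refunds take 5 days.",
    "support.txt": "Contact support by email. " * 1000,
}

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    materialize(documents, root)
    snippets = search(root, "refunds")

# Only the small matching snippet would go into a prompt.
prompt_context = "\n".join(f"[{name}] {snippet}" for name, snippet in snippets)
print(len(prompt_context), "characters instead of",
      sum(len(t) for t in documents.values()))
```

A real system would likely use a proper index or a tool-driven search loop; the point here is only that the prompt carries a short snippet, not the whole corpus.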

Use cases

Multi-document QA
Codebase analysis
Large trace evaluation

UX: End-to-End Paid Product Flows

We’ve tightened the end-to-end onboarding story across paid products:

Task app requirements are explicit (dataset, rubric/judge config, on-demand execution).
Clearer first-run paths for Graph Gen, GEPA, GSPO, SFT, and verifier training.
Unified language for how users set up task apps, rollouts, and inference.

Monitoring: Graph Execution

A new live monitoring dashboard provides real-time visibility into graph execution, latency, and failures.

Judges: VLM + SWE

VLM judge support: Multi-modal evaluation for traces with image inputs.
Hosted SWE agent judges: Hosted judge graphs tuned for software engineering workflows.

Provider Support

Expanded model provider coverage for graph execution and judging (see provider lists in docs).

Documentation

Workflows overview: /product/workflows
Judging in Graph Gen: /product/workflows/judging
RLM graphs: /product/workflows/rlm
Graphs overview: /sdk/graphs/overview
