Why multi-turn training?
Single-turn algorithms like GSPO generate one completion per prompt and evaluate it with a grader. Multi-turn training goes further: the policy model responds, optionally calls tools, the environment processes the response and evaluates it, and optionally generates the next user message. This cycle repeats until the conversation terminates. The entire trajectory contributes to the training signal. Harmony supports multi-turn training through ENVGRPO and ENVGSPO, the environment-based variants of GRPO and GSPO respectively. Both use the same Environment and EnvironmentFactory abstractions described in this cookbook.
Multi-turn training is useful when your task involves:
- Back-and-forth dialogue where context accumulates over turns
- Dynamic tool calling based on conversation state
- Evaluation that depends on how the full conversation unfolds
The Environment
The central abstraction in multi-turn training is the Environment. It orchestrates each turn of the conversation: processes the agent's output, executes tool calls if any, runs graders, generates follow-up user messages, and decides whether to continue or terminate. Everything lives inside the environment: tools, grading, user simulation, and termination.
A user simulator (an LLM that generates follow-up messages on behalf of the user) is one common pattern, but it’s not required. The environment’s react_to() method has full control over what happens next, and how the next user message is produced is entirely up to the implementation.
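For instance, a user simulator can be just another inference model prompted to play the user. The sketch below is illustrative only: how the thread is rendered into a prompt and the generate() call are assumptions, not a documented Harmony API, and the real call may be async.

```python
def simulate_user(user_sim_model, thread) -> str:
    # Condition an LLM on the conversation so far and ask it to reply
    # as the user would. Rendering the thread with str() is an assumption.
    prompt = (
        "You are the user in this support conversation. "
        "Write the user's next message.\n\n" + str(thread)
    )
    return user_sim_model.generate(prompt)  # assumed call; may be async in practice
```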
Tools
Tools are functions the agent can call to query information or take actions. Each tool receives arguments from the agent's output, performs some operation, and returns a string result that gets appended to the conversation as a tool turn. Tool implementations are entirely up to you: they could query a database, call an API, search documents, or perform calculations.
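As a minimal sketch, a tool can be an ordinary function returning a string; the function name, arguments, and lookup table below are hypothetical, since Harmony does not prescribe any particular shape for tool implementations.

```python
def lookup_order_status(order_id: str) -> str:
    """Hypothetical tool: look up an order and return a string result.

    The returned string is appended to the conversation as a tool turn.
    """
    # In a real environment this might query a database or call an API.
    orders = {"A-1001": "shipped", "A-1002": "processing"}
    status = orders.get(order_id)
    if status is None:
        return f"No order found with id {order_id}"
    return f"Order {order_id} is {status}"
```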
Subclassing Environment
The environment subclasses Environment from adaptive_harmony.environment and overrides react_to(). This method is called after every policy model completion. It receives the full conversation thread and returns either:
- (next_thread, None): continue the conversation (optionally with per-turn grades)
- (None, grades): terminate and return final grades
Everything happens inside react_to(): response parsing, tool execution, grading, generating the next user message (if any), and termination.
Note: _parse(), _execute_tool(), and _should_end() are not part of the Harmony API; they represent logic you implement yourself based on your use case. The only method the framework requires is react_to().
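Putting the pieces together, a subclass might look like the following sketch. Only the Environment import and the react_to() return contract come from the API described above; the helper methods, the grade format, and the thread manipulation (thread.tool(...), thread.user(...)) are illustrative assumptions, and the real react_to() signature may be async.

```python
from adaptive_harmony.environment import Environment


class SupportAgentEnv(Environment):
    """Illustrative environment for a tool-using support agent."""

    def __init__(self, max_turns: int = 8):
        self.max_turns = max_turns
        self.turns_taken = 0

    def react_to(self, thread):
        # Called after every policy model completion with the full thread.
        self.turns_taken += 1

        # Parse the latest agent message (your own logic, not Harmony API).
        parsed = self._parse(thread)

        # Execute a requested tool call and append the result as a tool
        # turn (thread.tool(...) is an assumed helper, not a documented API).
        if parsed.tool_call is not None:
            result = self._execute_tool(parsed.tool_call)
            thread = thread.tool(result)

        # Terminate with final grades, or continue with the next user turn.
        if self._should_end(parsed) or self.turns_taken >= self.max_turns:
            grades = self._final_grades(thread)  # grade shape depends on your graders
            return None, grades

        # e.g. a user simulator LLM produces the follow-up message.
        next_user_message = self._next_user_message(thread)
        return thread.user(next_user_message), None  # thread.user(...) assumed
```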
Example: grading strategies
There is no single rule for how to structure grading in multi-turn training. Here are two common approaches:
- Per-turn grading: evaluate each agent response individually. For example, a grader might check whether the agent used the right tool, or whether the response was relevant to the user's latest message. These run on every turn and provide immediate signal.
- Trajectory-level grading: evaluate the conversation as a whole after it ends. For example, checking whether the user's goal was ultimately achieved, or whether the conversation was concise. These run once and capture holistic quality.
You can combine both. For instance, a per-turn BinaryJudgeGrader checking tool correctness alongside a trajectory-level RangeJudgeGrader scoring overall helpfulness, aggregated with a CombinedGrader.
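A hypothetical wiring of that combination is sketched below. The grader class names come from this cookbook, but the import path and every constructor argument are assumptions; check the grader reference for the real signatures.

```python
from adaptive_harmony.graders import (  # import path is an assumption
    BinaryJudgeGrader,
    CombinedGrader,
    RangeJudgeGrader,
)

# Per-turn pass/fail check on tool usage (arguments are illustrative).
tool_grader = BinaryJudgeGrader(
    model=judge_model,  # an LLM judge spawned elsewhere in the recipe
    criteria="Did the agent call the correct tool for the user's request?",
)

# Trajectory-level 1-5 score on overall quality (arguments are illustrative).
helpfulness_grader = RangeJudgeGrader(
    model=judge_model,
    criteria="How helpful and concise was the conversation overall?",
    min_score=1,
    max_score=5,
)

grader = CombinedGrader([tool_grader, helpfulness_grader])
```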
SFT warm-up
When the agent needs to produce a specific output format (e.g., structured responses with tool calls), an SFT phase on example conversations can bootstrap format compliance before RL refines quality. Without this, the environment may not be able to parse responses and training stalls.
EnvironmentFactory and ENVGRPO recipe
ENVGRPO creates a fresh environment for each training trajectory via an EnvironmentFactory. The factory subclasses EnvironmentFactory and implements create_environment(metadata), which returns a new environment instance. This ensures state (turn counts, tool call history) is not shared across parallel rollouts.
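A sketch reusing the SupportAgentEnv from above; the import path is an assumption, while EnvironmentFactory and create_environment(metadata) come from the description above.

```python
from adaptive_harmony.environment import EnvironmentFactory  # path assumed


class SupportAgentEnvFactory(EnvironmentFactory):
    def create_environment(self, metadata):
        # A brand-new instance per rollout keeps per-trajectory state
        # (turn counts, tool call history) isolated across parallel rollouts.
        return SupportAgentEnv(max_turns=8)
```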
A complete recipe spawns models, builds graders, creates the factory, and calls ENVGRPO(...).run(). If the environment uses a user simulator or other inference models internally, the factory is responsible for wiring them in.
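In skeleton form, that might look as follows. The text above confirms only the ENVGRPO import and the (...).run() call; every constructor parameter name here is an assumption.

```python
from adaptive_harmony.common import ENVGRPO

# policy_model, judge_model, and grader are spawned/built earlier in the recipe.
factory = SupportAgentEnvFactory()

ENVGRPO(
    policy=policy_model,          # assumed parameter name
    environment_factory=factory,  # assumed parameter name
    grader=grader,                # assumed parameter name
).run()
```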
To use ENVGSPO instead of ENVGRPO, replace from adaptive_harmony.common import ENVGRPO with from adaptive_harmony.common import ENVGSPO and swap the class in the trainer call. The Environment and EnvironmentFactory interfaces are identical for both algorithms.
Key takeaways
- Multi-turn training captures conversational dynamics: the training signal comes from entire conversations, not isolated completions
- Subclass Environment and override react_to(): this is where you wire together response parsing, tool execution, grading, and termination. It receives a StringThread and returns either (next_thread, None) to continue or (None, grades) to terminate
- EnvironmentFactory creates fresh environments per trajectory: ENVGRPO/ENVGSPO run many rollouts in parallel, and each needs its own state
- User simulation is one common pattern: an LLM generating follow-up messages on behalf of the user removes the need for real user data during training, but how the next user message is produced is entirely up to the environment implementation
- Grading strategies are flexible: you can grade per-turn, per-trajectory, or both. There is no single right approach; it depends on what behaviors you want to reinforce
- ENVGRPO and ENVGSPO: both algorithms support environment-based multi-turn training with the same Environment and EnvironmentFactory interfaces

