Why multi-turn training?
Single-turn algorithms like GSPO generate one completion per prompt and evaluate it with a grader. Multi-turn training goes further: the policy model responds, optionally calls tools, the environment processes the response and evaluates it, and optionally generates the next user message. This cycle repeats until the conversation terminates. The entire trajectory contributes to the training signal. Harmony supports multi-turn training through ENVGRPO and ENVGSPO, the environment-based variants of GRPO and GSPO respectively. Both use the same Environment and EnvironmentFactory abstractions described in this cookbook.
Multi-turn training is useful when your task involves:
- Back-and-forth dialogue where context accumulates over turns
- Dynamic tool calling based on conversation state
- Evaluation that depends on how the full conversation unfolds
The Environment
The central abstraction in multi-turn training is the Environment. It orchestrates each turn of the conversation: processes the agent's output, executes tool calls if any, runs graders, generates follow-up user messages, and decides whether to continue or terminate. Everything lives inside the environment: tools, grading, user simulation, and termination.
A user simulator (an LLM that generates follow-up messages on behalf of the user) is one common pattern, but it’s not required. The environment’s react_to() method has full control over what happens next, and how the next user message is produced is entirely up to the implementation.
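For instance, a user simulator can be just another inference model prompted to play the user. The sketch below is illustrative only: how the thread is rendered into a prompt and the generate() call are assumptions, not a documented Harmony API, and the real call may be async.

```python
def simulate_user(user_sim_model, thread) -> str:
    # Condition an LLM on the conversation so far and ask it to reply
    # as the user would. Rendering the thread with str() is an assumption.
    prompt = (
        "You are the user in this support conversation. "
        "Write the user's next message.\n\n" + str(thread)
    )
    return user_sim_model.generate(prompt)  # assumed call; may be async in practice
```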
Tools
Tools are functions the agent can call to query information or take actions. Each tool receives arguments from the agent's output, performs some operation, and returns a string result that gets appended to the conversation as a tool turn. Tool implementations are entirely up to you: they could query a database, call an API, search documents, or perform calculations.
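As a minimal sketch, a tool can be an ordinary function returning a string; the function name, arguments, and lookup table below are hypothetical, since Harmony does not prescribe any particular shape for tool implementations.

```python
def lookup_order_status(order_id: str) -> str:
    """Hypothetical tool: look up an order and return a string result.

    The returned string is appended to the conversation as a tool turn.
    """
    # In a real environment this might query a database or call an API.
    orders = {"A-1001": "shipped", "A-1002": "processing"}
    status = orders.get(order_id)
    if status is None:
        return f"No order found with id {order_id}"
    return f"Order {order_id} is {status}"
```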
Subclassing Environment
The environment subclasses Environment from adaptive_harmony.environment and overrides react_to(). This method is called after every policy model completion. It receives the full conversation thread and returns either:
- (next_thread, None): continue the conversation (optionally with per-turn grades)
- (None, grades): terminate and return final grades
Everything happens inside react_to(): response parsing, tool execution, grading, generating the next user message (if any), and termination.
Note: _parse(), _execute_tool(), and _should_end() are not part of the Harmony API; they represent logic you implement yourself based on your use case. The only method the framework requires is react_to().
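Putting the pieces together, a subclass might look like the following sketch. Only the Environment import and the react_to() return contract come from the API described above; the helper methods, the grade format, and the thread manipulation (thread.tool(...), thread.user(...)) are illustrative assumptions, and the real react_to() signature may be async.

```python
from adaptive_harmony.environment import Environment


class SupportAgentEnv(Environment):
    """Illustrative environment for a tool-using support agent."""

    def __init__(self, max_turns: int = 8):
        self.max_turns = max_turns
        self.turns_taken = 0

    def react_to(self, thread):
        # Called after every policy model completion with the full thread.
        self.turns_taken += 1

        # Parse the latest agent message (your own logic, not Harmony API).
        parsed = self._parse(thread)

        # Execute a requested tool call and append the result as a tool
        # turn (thread.tool(...) is an assumed helper, not a documented API).
        if parsed.tool_call is not None:
            result = self._execute_tool(parsed.tool_call)
            thread = thread.tool(result)

        # Terminate with final grades, or continue with the next user turn.
        if self._should_end(parsed) or self.turns_taken >= self.max_turns:
            grades = self._final_grades(thread)  # grade shape depends on your graders
            return None, grades

        # e.g. a user simulator LLM produces the follow-up message.
        next_user_message = self._next_user_message(thread)
        return thread.user(next_user_message), None  # thread.user(...) assumed
```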
Example: grading strategies
There is no single rule for how to structure grading in multi-turn training. Here are two common approaches:
- Per-turn grading: evaluate each agent response individually. For example, a grader might check whether the agent used the right tool, or whether the response was relevant to the user's latest message. These run on every turn and provide immediate signal.
- Trajectory-level grading: evaluate the conversation as a whole after it ends. For example, checking whether the user's goal was ultimately achieved, or whether the conversation was concise. These run once and capture holistic quality.
You can combine both. For instance, a per-turn BinaryJudgeGrader checking tool correctness alongside a trajectory-level RangeJudgeGrader scoring overall helpfulness, aggregated with a CombinedGrader.
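A hypothetical wiring of that combination is sketched below. The grader class names come from this cookbook, but the import path and every constructor argument are assumptions; check the grader reference for the real signatures.

```python
from adaptive_harmony.graders import (  # import path is an assumption
    BinaryJudgeGrader,
    CombinedGrader,
    RangeJudgeGrader,
)

# Per-turn pass/fail check on tool usage (arguments are illustrative).
tool_grader = BinaryJudgeGrader(
    model=judge_model,  # an LLM judge spawned elsewhere in the recipe
    criteria="Did the agent call the correct tool for the user's request?",
)

# Trajectory-level 1-5 score on overall quality (arguments are illustrative).
helpfulness_grader = RangeJudgeGrader(
    model=judge_model,
    criteria="How helpful and concise was the conversation overall?",
    min_score=1,
    max_score=5,
)

grader = CombinedGrader([tool_grader, helpfulness_grader])
```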
SFT warm-up
When the agent needs to produce a specific output format (e.g., structured responses with tool calls), an SFT phase on example conversations can bootstrap format compliance before RL refines quality. Without this, the environment may not be able to parse responses and training stalls.
EnvironmentFactory and ENVGRPO recipe
ENVGRPO creates a fresh environment for each training trajectory via an EnvironmentFactory. The factory subclasses EnvironmentFactory and implements create_environment(metadata), which returns a new environment instance. This ensures state (turn counts, tool call history) is not shared across parallel rollouts.
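A sketch reusing the SupportAgentEnv from above; the import path is an assumption, while EnvironmentFactory and create_environment(metadata) come from the description above.

```python
from adaptive_harmony.environment import EnvironmentFactory  # path assumed


class SupportAgentEnvFactory(EnvironmentFactory):
    def create_environment(self, metadata):
        # A brand-new instance per rollout keeps per-trajectory state
        # (turn counts, tool call history) isolated across parallel rollouts.
        return SupportAgentEnv(max_turns=8)
```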
A complete recipe spawns models, builds graders, creates the factory, and calls ENVGRPO(...).run(). If the environment uses a user simulator or other inference models internally, the factory is responsible for wiring them in.
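In skeleton form, that might look as follows. The text above confirms only the ENVGRPO import and the (...).run() call; every constructor parameter name here is an assumption.

```python
from adaptive_harmony.common import ENVGRPO

# policy_model, judge_model, and grader are spawned/built earlier in the recipe.
factory = SupportAgentEnvFactory()

ENVGRPO(
    policy=policy_model,          # assumed parameter name
    environment_factory=factory,  # assumed parameter name
    grader=grader,                # assumed parameter name
).run()
```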
To use ENVGSPO instead of ENVGRPO, replace from adaptive_harmony.common import ENVGRPO with from adaptive_harmony.common import ENVGSPO and swap the class in the trainer call. The Environment and EnvironmentFactory interfaces are identical for both algorithms.
Key takeaways
- Multi-turn training captures conversational dynamics: the training signal comes from entire conversations, not isolated completions
- Subclass Environment and override react_to(): this is where you wire together response parsing, tool execution, grading, and termination. It receives a StringThread and returns either (next_thread, None) to continue or (None, grades) to terminate
- EnvironmentFactory creates fresh environments per trajectory: ENVGRPO/ENVGSPO run many rollouts in parallel, and each needs its own state
- User simulation is one common pattern: an LLM generating follow-up messages on behalf of the user removes the need for real user data during training, but how the next user message is produced is entirely up to the environment implementation
- Grading strategies are flexible: you can grade per-turn, per-trajectory, or both. There is no single right approach; it depends on what behaviors you want to reinforce
- ENVGRPO and ENVGSPO: both algorithms support environment-based multi-turn training with the same Environment and EnvironmentFactory interfaces

