Writing a training recipe requires three main components:
  1. A model to train - the base model you want to fine-tune
  2. A training dataset - data to train the model on
  3. Graders - for RL methods, the functions that will provide a reward to the model

Training recipe structure

1. Create an input configuration

First, create an input configuration that defines all the parameters that can be passed to your recipe.
from typing import Annotated, List
from pydantic import Field
from adaptive_harmony.runtime import (
    InputConfig,
    AdaptiveModel,
    AdaptiveDataset,
    AdaptiveGrader
)

class TrainingConfig(InputConfig):
    model_to_train: Annotated[
        AdaptiveModel,
        Field(
            description="Base model to fine-tune",
            title="Model to Train",
            json_schema_extra={"adaptive_model_kind": "trainable"},
        ),
    ]
    output_model_key: Annotated[
        str,
        Field(
            description="New key for the resulting fine-tuned model",
            title="Output Model Key",
        ),
    ]
    dataset: Annotated[
        AdaptiveDataset,
        Field(
            description="Dataset to be used for training.",
            title="Training Dataset",
        ),
    ]
    graders: Annotated[
        List[AdaptiveGrader],
        Field(
            description="The graders, or reward functions, to train with. If several graders are provided, the student model will see an aggregate of their grades as an average. The student will learn to maximize that grade.",
            title="Graders",
        ),
    ]



2. Load a model to train

To train a model, you need to use the spawn_train method, which creates a trainable instance of your model. We set the model's tp (tensor parallelism) degree to the world_size (the number of GPUs the job is running on), which we read from the recipe context ctx.
from adaptive_harmony.runtime import recipe_main, RecipeContext

@recipe_main
async def training_recipe(config: TrainingConfig, ctx: RecipeContext):
    client = ctx.client    
    # Spawn the model for training
    model = await client.model(config.model_to_train.path).tp(ctx.world_size).spawn_train("main", 4096)

3. Load a dataset

You can load datasets from Adaptive or external sources:
from adaptive_harmony.core.dataset import load_adaptive_dataset

# Load from Adaptive
dataset = load_adaptive_dataset(config.dataset.file)
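If you also want a held-out validation split, and assuming load_adaptive_dataset returns a plain list of samples, you can split it with standard Python (the 95/5 ratio and fixed seed below are only examples):
import random

samples = list(dataset)      # assumption: the loaded dataset is list-like
random.seed(42)              # fixed seed so the split is reproducible
random.shuffle(samples)

split_index = int(0.95 * len(samples))
train_dataset = samples[:split_index]
validation_dataset = samples[split_index:]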

4. Set up graders

For reinforcement learning training, you'll need reward functions. Below, we show the common case where your reward strategy has several components and you want a final reward that aggregates all of your grader scores. You can use the CombinedGrader class to do this aggregation.
from adaptive_harmony.graders import Grader, CombinedGrader

graders = [Grader.from_config(grader, client) for grader in config.graders]
combined_grader = CombinedGrader(graders)
combined_grader.setup()
The setup method should always be called on a grader before it is used: it spawns your AI judges if any of your graders are AI judge graders (this is why you also pass the client to graders). If you use a CombinedGrader, calling setup on it will call the setup method on every child grader.
You do not need to rely solely on graders you import from Adaptive; you can write your own custom grader within your recipe and pass it to CombinedGrader along with the graders you've created and registered in the platform, as in the sketch below.
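For illustration, a custom grader could look like the following sketch. The Grader base class comes from the import above, but the grade method name, its argument, and the float return type are assumptions about the interface rather than confirmed API; adapt them to the actual Grader signature in adaptive_harmony.graders.
class ConcisenessGrader(Grader):
    """Hypothetical grader that rewards shorter completions."""

    async def grade(self, completion: str) -> float:  # assumed hook name and signature
        # 1.0 for very short answers, decaying linearly to 0.0 at ~2000 characters
        return max(0.0, 1.0 - len(completion) / 2000)

# Combine it with the graders loaded from the platform, then call setup as above
combined_grader = CombinedGrader(graders + [ConcisenessGrader()])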

5. Call your training class

Everything is now set up to call and run your training class. The GRPO class will run the main training loop.
from adaptive_harmony.common.grpo import GRPO
from adaptive_harmony.metric_logger import get_prod_logger

logger = get_prod_logger()

# Run training
await GRPO(
    dataset=dataset,
    model=model,
    grader=combined_grader,
    logger=logger,
).run()

6. Save your trained model

After training, save your model using the model.save() method. If you want to guarantee there are no model key collisions (in case you've saved a model with the same key in the past), you can use the save_model_safely function instead.
from adaptive_harmony.runtime import save_model_safely

# Save with a user-defined key
await model.save(config.output_model_key)

# Save safely; if a model with the same key already exists in the model registry,
# a timestamp will be appended to the end of your user-defined key
save_model_safely(
    training_model=model,
    model_key=config.output_model_key
)

Training classes

Harmony provides four main training algorithms, each suited for different use cases:

SFT (Supervised Fine-Tuning)

SFT is the foundational training method: it teaches a model to follow instructions using demonstration data.
from adaptive_harmony.common.sft import SFT

# Run SFT training
sft = SFT(
    dataset=dataset,
    model=model,
    lr=1e-5,
    samples_per_batch=512,
    max_grad_norm=1.0
)
await sft.run()
Key Parameters:
  • dataset: List of StringThread objects containing training examples
  • model: Training model instance
  • lr: Learning rate
  • samples_per_batch: Batch size
  • max_grad_norm: Gradient clipping norm

PPO (Proximal Policy Optimization)

PPO is a reinforcement learning algorithm that uses a reward function to improve model behavior; it estimates token-level advantages during training with a separate value model.
from adaptive_harmony.common.ppo import PPO

# Run PPO training
await PPO(
    dataset=dataset,
    model=policy_model,
    value_model=value_model,
    grader=grader,
    lr_policy=0.75e-6,
    lr_value=1e-6,
    samples_per_batch=128,
    max_grad_norm=1.0,
    clip_range=0.1,
    kl_beta=0.01,
    max_num_ppo_steps=100
).run()
Key Parameters:
  • dataset: List of StringThread prompts
  • model: Policy model to train (passed as policy_model in the example above)
  • value_model: Value model for advantage estimation
  • grader: Grader that returns reward values
  • lr_policy: Policy learning rate
  • lr_value: Value learning rate
  • kl_beta: KL divergence penalty coefficient
  • samples_per_batch: Number of samples that compose a single PPO step
  • max_num_ppo_steps: Total number of PPO training steps to take
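The example above assumes a value_model has already been spawned; there is no dedicated snippet for this in the steps above. Based on the spawn_train call shown earlier, a value model can plausibly be created the same way (the "value" instance name and reusing the base checkpoint are assumptions):
# Reuse the trainable model spawned earlier as the policy model
policy_model = model

# Spawn a second trainable instance from the same base checkpoint to act as the value model;
# the "value" instance name mirrors the "main" name used earlier
value_model = await client.model(config.model_to_train.path).tp(ctx.world_size).spawn_train("value", 4096)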

GRPO (Group Relative Policy Optimization)

GRPO generates a group of several completions per prompt and uses relative ranking for training. The advantage for each completion is calculated as its individual reward minus the group's mean reward. It is important that samples_per_batch is larger than samples_per_mini_batch, so that there is sample diversity within each gradient update.
from adaptive_harmony.common.grpo import GRPO

# Run GRPO training
await GRPO(
    dataset=dataset,
    model=model,
    grader=grader,
    lr=7.5e-7,
    samples_per_batch=512,
    samples_per_mini_batch=64,
    max_grad_norm=1.0,
    clip_range=0.1,
    kl_beta=0.01,
    max_num_grpo_steps=100,
    completions_per_sample=4
).run()
Key Parameters:
  • dataset: List of StringThread prompts
  • model: Training model
  • grader: Grader that returns reward values
  • completions_per_sample: Number of completions per prompt
  • lr: Learning rate
  • kl_beta: KL divergence penalty coefficient
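To make the group-relative advantage concrete, here is a standalone illustration in plain Python; it is not part of the harmony API, just the arithmetic described above:
# One prompt, completions_per_sample = 4, with the grader's reward for each completion
rewards = [0.9, 0.4, 0.7, 0.2]
group_mean = sum(rewards) / len(rewards)          # 0.55

# advantage = individual reward minus the group mean
advantages = [r - group_mean for r in rewards]    # [0.35, -0.15, 0.15, -0.35] (up to float rounding)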

DPO (Direct Preference Optimization)

DPO trains models using preference data (preferred vs non-preferred responses) without explicit reward modeling.
from adaptive_harmony.common.dpo import DPO

# Run DPO training
await DPO(
    dataset=dataset,
    model=model,
    lr=1e-4,
    samples_per_batch=32,
    max_grad_norm=1.0,
    beta=0.1
).run()
Key Parameters:
  • dataset: List of tuples containing (preferred_response, dispreferred_response)
  • model: Training model
  • lr: Learning rate
  • samples_per_batch: Batch size
  • beta: DPO beta parameter
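As a rough sketch of the expected dataset shape, each entry pairs a preferred response with a dispreferred response to the same prompt. The make_thread helper below is hypothetical and only illustrates the pairing; build the actual sample objects the same way you build them for the other trainers:
raw_records = [
    {
        "prompt": "Summarize the quarterly report.",
        "chosen": "A short, accurate summary.",
        "rejected": "An off-topic answer.",
    },
]

# make_thread is a hypothetical helper that turns a prompt/response pair
# into the sample type your dataset uses
dataset = [
    (make_thread(r["prompt"], r["chosen"]), make_thread(r["prompt"], r["rejected"]))
    for r in raw_records
]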