Writing a training recipe requires three main components:
  1. A model to train - the base model you want to fine-tune
  2. A training dataset - data to train the model on
  3. Graders - for RL methods, the functions that will provide a reward to the model

Training recipe structure

1. Create an input configuration

First, create an input configuration that defines all the parameters that can be passed to your recipe.
from typing import Annotated, List, Set
from pydantic import Field
from adaptive_harmony.runtime import InputConfig
from adaptive_harmony.parameters import Model, Dataset, Grader, model_kinds

class TrainingConfig(InputConfig):
    model_to_train: Annotated[
        Model[model_kinds.Trainable],  # Restrict to trainable models only
        Field(
            description="Base model to fine-tune",
            title="Model to Train",
        ),
    ]
    output_model_key: Annotated[
        str,
        Field(
            description="New key for the resulting fine-tuned model",
            title="Output Model Key",
        ),
    ]
    dataset: Annotated[
        Dataset,
        Field(
            description="Dataset to be used for training.",
            title="Training Dataset",
        ),
    ]
    graders: Annotated[
        Set[Grader],
        Field(
            description="The graders, or reward functions, to train with. If several graders are provided, the student model will see an aggregate of their grades as an average. The student will learn to maximize that grade.",
            title="Graders",
        ),
    ]

2. Load a model to train

To train a model, use the spawn_train method (see details here). This method creates a trainable instance of your model. Below, we set the model's tensor-parallel (tp) degree to the world_size (the number of GPUs the job is running on), read from the recipe context ctx. Beware that you cannot TP across node boundaries, so this only works if the world size does not exceed the number of GPUs available on a single node.
from adaptive_harmony.runtime import recipe_main, RecipeContext

@recipe_main
async def training_recipe(config: TrainingConfig, ctx: RecipeContext):
    # Spawn the model for training
    model_builder = await config.model_to_train.to_builder(ctx, tp=ctx.world_size)
    model = await model_builder.spawn_train("main", 4096)

3. Load a dataset

You can load datasets from Adaptive or external sources:
# Load from Adaptive
dataset = await config.dataset.load(ctx)
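For an external source, you can load the raw data with plain Python and then convert it into the format the training classes expect (StringThread objects, see the parameter lists below). The following is only a sketch: the file name and record layout are assumptions, and the StringThread conversion itself is not shown because it depends on your SDK version.
import json

# Sketch only: read records from a local JSONL file (an "external source").
# Assumption: each line is a JSON object holding one training example.
with open("prompts.jsonl") as f:
    records = [json.loads(line) for line in f]

# Each raw record still has to be converted into the StringThread format that the
# training classes expect; consult the dataset documentation for the exact helpers.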

4. Set up Graders

For reinforcement learning training, you’ll need reward functions. Below, we show the common case where your reward strategy has several components and you want a final reward that aggregates all of the grader scores. You can use the CombinedGrader class to do this aggregation.
from adaptive_harmony.graders.combined_grader import CombinedGrader

graders = [await grader.load(ctx) for grader in config.graders]
combined_grader = CombinedGrader("combined", graders)
You do not need to rely solely on graders imported from Adaptive: you can also write your own custom grader inside your recipe and pass it to CombinedGrader alongside graders you have created and registered in the platform, as sketched below.
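The exact interface a custom grader must implement is not spelled out in this guide, so treat the following as an illustrative sketch of the idea (something that maps a completion to a float reward) rather than the library's actual base class or method names:
# Hypothetical example -- the real grader interface in adaptive_harmony may differ.
class LengthPenaltyGrader:
    """Toy grader that rewards shorter completions (illustrative only)."""

    def __init__(self, max_chars: int = 2000):
        self.max_chars = max_chars

    async def grade(self, completion: str) -> float:
        # 1.0 for an empty completion, decaying linearly to 0.0 at max_chars.
        return max(0.0, 1.0 - len(completion) / self.max_chars)

# If the platform's grader interface matches, an instance could be added to the
# `graders` list before building the CombinedGrader; otherwise adapt it to the
# actual base class first.
graders.append(LengthPenaltyGrader())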

5. Call your training class

Everything is now set up to call and run your training class. The GRPO class will run the main training loop.
from adaptive_harmony.common.grpo import GRPO
from adaptive_harmony.metric_logger import get_prod_logger

logger = get_prod_logger()

# Run training
await GRPO(
    dataset=dataset,
    model=model,
    grader=combined_grader,
    logger=logger
).run()
See Log job metrics for more information about tracking training metrics.

6. Save your trained model

After training, save your model using the model.save() method.
# Save with a custom name
await model.save(config.output_model_key)

Training classes

Harmony provides five main training algorithms, each suited for different use cases:

SFT (Supervised Fine-Tuning)

SFT is the foundation training method that teaches a model to follow instructions using demonstration data.
from adaptive_harmony.common.sft import SFT
from adaptive_harmony.metric_logger import get_prod_logger

logger = get_prod_logger()

# Run SFT training
await SFT(
    dataset=dataset,
    model=model,
    logger=logger,
    lr=1e-5,
    samples_per_batch=512,
    max_grad_norm=1.0,
    epochs=1
).run()
Key Parameters:
  • dataset: List of StringThread objects containing training examples
  • model: Training model instance
  • logger: Metric logger for tracking training metrics. See Log job metrics for available loggers. (default: StdoutLogger())
  • stage_notifier: Progress notifier (default: JobNotifier().stage_notifier("SFT Training"))
  • callbacks: List of training callbacks (default: [])
  • lr: Learning rate (default: 1e-5)
  • lr_scheduler: Optional learning rate scheduler function
  • samples_per_batch: Number of samples per training batch (default: 512)
  • max_grad_norm: Gradient clipping norm (default: 1.0)
  • epochs: Number of training epochs (default: 1)
  • weight_decay: Weight decay coefficient (default: 0)
  • skip_nan_gradients: Skip batches with NaN gradients (default: False)
  • restart_from_checkpoint: Path to checkpoint to resume from
  • checkpoint_frequency: How often to checkpoint (default: 0.2)

PPO (Proximal Policy Optimization)

PPO is a reinforcement learning algorithm that uses a reward function to improve model behavior, estimating token-level advantages during training with a separate value model (a numeric sketch of the advantage computation follows the parameter list below).
from adaptive_harmony.common.ppo import PPO
from adaptive_harmony.metric_logger import get_prod_logger

logger = get_prod_logger()

# Run PPO training
await PPO(
    dataset=dataset,
    model=model,
    value_model=value_model,
    grader=grader,
    logger=logger,
    lr_policy=0.75e-6,
    lr_value=1e-6,
    samples_per_batch=128,
    samples_per_mini_batch=128,
    max_grad_norm=1.0,
    clip_range=0.1,
    kl_beta=0.01,
    max_num_ppo_steps=100
).run()
Key Parameters:
  • dataset: List of StringThread prompts
  • model: Policy model for training
  • value_model: Value model for advantage estimation (must be a scalar model)
  • grader: Grader that returns reward values
  • logger: Metric logger (default: StdoutLogger())
  • stage_notifier: Progress notifier (default: JobNotifier().stage_notifier("PPO Training"))
  • callbacks: List of training callbacks (default: [])
  • max_num_ppo_steps: Total number of PPO training steps (default: None for unlimited)
  • value_only_fraction: Fraction of training for value-only updates (default: 0.25)
  • lr_policy: Policy learning rate (default: 0.75e-6)
  • lr_scheduler_policy: Optional policy learning rate scheduler
  • lr_scheduler_value: Optional value learning rate scheduler
  • lr_value: Value learning rate (default: 1e-6)
  • samples_per_batch: Number of samples per PPO step (default: 128)
  • samples_per_mini_batch: Mini-batch size for gradient updates (default: 128)
  • mini_epochs_per_batch: Number of epochs per batch (default: 1)
  • max_grad_norm: Gradient clipping norm (default: 1.0)
  • clip_range: PPO clipping range (default: 0.1)
  • kl_beta: KL divergence penalty coefficient (default: 0.1)
  • gae_lambda: GAE lambda parameter for advantage estimation (default: 0.95)
  • gae_gamma: GAE gamma discount factor (default: 1.0)
  • weight_decay: Weight decay coefficient (default: 0)
  • skip_nan_gradients: Skip batches with NaN gradients (default: False)
  • restart_from_checkpoint: Path to checkpoint to resume from
  • checkpoint_frequency: How often to checkpoint (default: 0.2)
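The gae_lambda and gae_gamma parameters above control generalized advantage estimation (GAE), which combines the value model's per-token predictions with the reward to produce per-token advantages. The snippet below is a self-contained numeric sketch of the standard GAE recursion, not code from the library:
# Standard GAE recursion: delta_t = r_t + gamma * V_{t+1} - V_t,
# A_t = delta_t + gamma * lambda * A_{t+1}, computed backwards over the sequence.
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    advantages = [0.0] * len(rewards)
    next_value, running = 0.0, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages

# Example: a 4-token completion where only the final token carries the grader's reward.
print(gae_advantages(rewards=[0.0, 0.0, 0.0, 1.0], values=[0.2, 0.3, 0.4, 0.6]))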

GRPO (Group Relative Policy Optimization)

GRPO generates a group of several completions per prompt and uses their relative ranking for training. The advantage for each completion is calculated as its individual reward minus the group’s mean reward (see the sketch after the parameter list below). It is important that samples_per_batch be larger than samples_per_mini_batch, so that there is sample diversity in each gradient update.
from adaptive_harmony.common.grpo import GRPO
from adaptive_harmony.metric_logger import get_prod_logger

logger = get_prod_logger()

# Run GRPO training
await GRPO(
    dataset=dataset,
    model=model,
    grader=grader,
    logger=logger,
    lr=7.5e-7,
    samples_per_batch=128,
    samples_per_mini_batch=128,
    max_grad_norm=1.0,
    clip_range=0.1,
    kl_beta=0.01,
    max_num_grpo_steps=100,
    completions_per_sample=8
).run()
Key Parameters:
  • dataset: List of StringThread prompts
  • model: Training model
  • grader: Grader that returns reward values
  • logger: Metric logger (default: StdoutLogger())
  • stage_notifier: Progress notifier (default: JobNotifier().stage_notifier("GRPO Training"))
  • callbacks: List of training callbacks (default: [])
  • max_num_grpo_steps: Total number of GRPO training steps (default: None for unlimited)
  • completions_per_sample: Number of completions generated per prompt (default: 8)
  • lr: Learning rate (default: 7.5e-7)
  • lr_scheduler: Optional learning rate scheduler function
  • samples_per_batch: Number of samples per batch (default: 128)
  • samples_per_mini_batch: Mini-batch size for gradient updates (default: 128)
  • mini_epochs_per_batch: Number of epochs per batch (default: 1)
  • max_grad_norm: Gradient clipping norm (default: 1.0)
  • clip_range: Clipping range for policy updates (default: 0.1)
  • kl_beta: KL divergence penalty coefficient (default: 0.01)
  • weight_decay: Weight decay coefficient (default: 0.0)
  • skip_nan_gradients: Skip batches with NaN gradients (default: False)
  • restart_from_checkpoint: Path to checkpoint to resume from
  • checkpoint_frequency: How often to checkpoint (default: 0.2)
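To make the group-relative advantage concrete, here is a small numeric sketch (plain Python, independent of the library):
# completions_per_sample rewards for a single prompt, e.g. from the combined grader
group_rewards = [0.9, 0.4, 0.7, 0.2]

# Each completion's advantage is its reward minus the group's mean reward.
mean_reward = sum(group_rewards) / len(group_rewards)   # 0.55
advantages = [r - mean_reward for r in group_rewards]   # roughly [0.35, -0.15, 0.15, -0.35]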

GSPO (Group Sequence Policy Optimization)

GSPO is a variant of GRPO that computes advantages at the sequence level rather than per-token, making it more aligned with the goal of most training runs (where token-level advantages are not meaningful at the task level) and generally more effective. Like GRPO, it generates multiple completions per prompt for relative ranking, but optimizes the entire sequence as a unit. GSPO is recommended over GRPO for most use cases.
from adaptive_harmony.common.gspo import GSPO
from adaptive_harmony.metric_logger import get_prod_logger

logger = get_prod_logger()

# Run GSPO training
await GSPO(
    dataset=dataset,
    model=model,
    grader=grader,
    logger=logger,
    lr=7.5e-7,
    samples_per_batch=128,
    samples_per_mini_batch=128,
    max_grad_norm=1.0,
    clip_range=0.01,
    kl_beta=0.01,
    max_num_gspo_steps=100,
    completions_per_sample=8
).run()
Key Parameters:
  • dataset: List of StringThread prompts
  • model: Training model
  • grader: Grader that returns reward values
  • logger: Metric logger (default: StdoutLogger())
  • stage_notifier: Progress notifier (default: JobNotifier().stage_notifier("GSPO Training"))
  • callbacks: List of training callbacks (default: [])
  • max_num_gspo_steps: Total number of GSPO training steps (default: None for unlimited)
  • completions_per_sample: Number of completions per prompt (default: 8, typically higher than GRPO)
  • lr: Learning rate (default: 7.5e-7)
  • lr_scheduler: Optional learning rate scheduler function
  • samples_per_batch: Number of samples per batch (default: 128)
  • samples_per_mini_batch: Mini-batch size for gradient updates (default: 128)
  • mini_epochs_per_batch: Number of epochs per batch (default: 1)
  • max_grad_norm: Gradient clipping norm (default: 1.0)
  • clip_range: Clipping range (default: 0.01, lower than GRPO’s 0.1)
  • kl_beta: KL divergence penalty coefficient (default: 0.01)
  • weight_decay: Weight decay coefficient (default: 0.0)
  • skip_nan_gradients: Skip batches with NaN gradients (default: False)
  • restart_from_checkpoint: Path to checkpoint to resume from
  • checkpoint_frequency: How often to checkpoint (default: 0.2)

DPO (Direct Preference Optimization)

DPO trains models using preference data (preferred vs non-preferred responses) without explicit reward modeling.
from adaptive_harmony.common.dpo import DPO
from adaptive_harmony.metric_logger import get_prod_logger

logger = get_prod_logger()

# Run DPO training
await DPO(
    dataset=dataset,
    model=model,
    logger=logger,
    lr=1e-6,
    samples_per_batch=32,
    max_grad_norm=1.0,
    kl_beta=0.1,
    epochs=1
).run()
Key Parameters:
  • dataset: List of tuples containing (preferred_response, dispreferred_response) as StringThread pairs
  • model: Training model
  • logger: Metric logger (default: StdoutLogger())
  • stage_notifier: Progress notifier (default: JobNotifier().stage_notifier("DPO Training"))
  • callbacks: List of training callbacks (default: [])
  • lr: Learning rate (default: 1e-6)
  • lr_scheduler: Optional learning rate scheduler function
  • samples_per_batch: Number of samples per batch (default: 32)
  • max_grad_norm: Gradient clipping norm (default: 1.0)
  • kl_beta: DPO beta parameter controlling KL divergence from reference model (default: 0.1)
  • epochs: Number of training epochs (default: 1)
  • skip_nan_gradients: Skip batches with NaN gradients (default: False)
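The kl_beta parameter above is the beta in the DPO objective, which measures how much more strongly the policy prefers the chosen response over the rejected one compared with the reference model. Below is a self-contained numeric sketch of the per-pair loss, not code from the library:
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Implicit "rewards" are beta-scaled log-probability ratios against the reference model.
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) is small when the policy separates the pair
    # more strongly than the reference does.
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# Example with made-up sequence log-probabilities.
print(dpo_loss(-12.0, -15.0, -12.5, -14.0, beta=0.1))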

Checkpointing

All training classes support automatic checkpointing to save your model periodically during training. This is useful for long-running jobs where you want to recover from failures or inspect intermediate model states.

Using checkpoint_frequency

Set the checkpoint_frequency parameter to control how often checkpoints are saved. The frequency is measured as a fraction of training completion (0.0 to 1.0).
from adaptive_harmony.common.gspo import GSPO

await GSPO(
    dataset=train_dataset,
    model=policy_model,
    grader=safety_grader,
    logger=logger,
    checkpoint_frequency=0.2  # Save every 20% of training
).run()
How it works:
  • Checkpoints are saved at regular intervals based on training progress
  • Default frequency is 0.2 (every 20% of training)
  • Models are saved with names like {model_name}-{progress}, e.g., policy-0.2, policy-0.4, policy-0.6
  • The final trained model must still be saved explicitly using model.save() after training completes
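If a job is interrupted, you can resume from one of these intermediate saves with the restart_from_checkpoint parameter. The value below is an assumption about the checkpoint identifier format; confirm how checkpoints are exposed in your environment before relying on it:
from adaptive_harmony.common.gspo import GSPO

await GSPO(
    dataset=train_dataset,
    model=policy_model,
    grader=safety_grader,
    logger=logger,
    checkpoint_frequency=0.2,
    restart_from_checkpoint="policy-0.4",  # assumed identifier; verify the expected path/key format
).run()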

Best practices

  • Set appropriate frequency: For long jobs, use checkpoint_frequency=0.1 or 0.2. For shorter jobs, you can use higher values like 0.25 or 0.5.
  • Storage considerations: Checkpoints can be large for big models. Make sure you have sufficient storage for multiple checkpoints.
  • Cleanup: Remove intermediate checkpoints after training completes if you only need the final model.