- A model to train - the base model you want to fine-tune
- A training dataset - data to train the model on
- Graders - for RL methods, the functions that will provide a reward to the model
Training recipe structure
1. Create an input configuration
First, create an input configuration that holds all the parameters you want to allow to be passed to your recipe.
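The exact configuration mechanism is platform-specific; the sketch below uses a plain dataclass with hypothetical field names purely to illustrate the kind of parameters a recipe might expose.

```python
from dataclasses import dataclass

@dataclass
class InputConfig:
    # Hypothetical fields -- expose whatever parameters your recipe should accept.
    model: str = "llama-3.1-8b-instruct"   # base model to fine-tune (illustrative name)
    dataset: str = "my-training-dataset"   # dataset key or path (illustrative)
    lr: float = 7.5e-7                     # learning rate passed through to the trainer
    samples_per_batch: int = 128
    completions_per_sample: int = 8
```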
2. Load a model to train
To train a model, use the spawn_train method (see details here). This method creates a trainable instance of your model.
We set the world_size (the number of GPUs the job is running on) as the TP degree of the model by reading it from the recipe context ctx (beware: you cannot TP across node boundaries, so this only works if your world size does not exceed the number of GPUs available on each node). A sketch is shown below.
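The snippet below is only a sketch: how the recipe context exposes spawn_train and world_size, and the exact argument names, are assumptions about the SDK rather than its confirmed API.

```python
# Assumes `ctx` is the recipe context passed to your recipe and `config` is the
# input configuration defined above (both are in scope inside the recipe).
model = ctx.spawn_train(      # creates a trainable instance of the base model (assumed call form)
    config.model,             # base model to fine-tune
    tp=ctx.world_size,        # tensor-parallel degree = number of GPUs for this job
)
```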
3. Load a dataset
You can load datasets from Adaptive or from external sources, for example:
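How datasets are fetched from Adaptive is platform-specific, so the sketch below only illustrates the external-source case: building a list of training threads from a local JSONL file. The StringThread construction hinted at in the comment is an assumption about its interface.

```python
import json

# Illustrative only: read prompts from a local JSONL file, one record per line.
dataset = []
with open("prompts.jsonl") as f:
    for line in f:
        record = json.loads(line)
        # In the real SDK each entry would be a StringThread built from the record,
        # e.g. StringThread(messages=record["messages"]) -- constructor is an assumption.
        dataset.append(record["messages"])
```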
4. Set up Graders
For reinforcement learning training, you'll need reward functions. Below, we show the very common case where your reward strategy has several components and you want a final reward obtained by aggregating all your grader scores. You can use the CombinedGrader class to do this aggregation.
You do not need to rely solely on Graders you import from Adaptive; you can write your own custom grader within your recipe and pass it to CombinedGrader along with other graders you've created and registered in the platform, as in the sketch below.
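The custom grader below is purely illustrative: the callable form of the grader and the CombinedGrader constructor arguments are assumptions about the grader interface, not its confirmed API.

```python
# Hypothetical custom grader: rewards shorter completions (illustrative logic only).
def length_penalty_grader(completion: str) -> float:
    # Scale the reward down linearly once a completion exceeds 512 characters.
    return max(0.0, 1.0 - len(completion) / 512)

# Aggregate the custom grader with a grader registered in the platform.
# `platform_grader` is loaded elsewhere; the `graders`/`weights` arguments are assumptions.
grader = CombinedGrader(
    graders=[platform_grader, length_penalty_grader],
    weights=[0.8, 0.2],   # relative contribution of each grader to the final reward
)
```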
5. Call your training class
Everything is now set up to call and run your training class. The GRPO class will run the main training loop; a minimal sketch follows.
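The call below uses parameter names documented in the GRPO section further down; how the loop is started (the run() call) is an assumption.

```python
# Sketch of the main training loop using GRPO (run-method name is an assumption).
trainer = GRPO(
    dataset=dataset,              # list of StringThread prompts
    model=model,                  # trainable model from spawn_train
    grader=grader,                # combined reward function
    completions_per_sample=8,     # group size used for relative advantages
    samples_per_batch=128,
    samples_per_mini_batch=64,    # kept smaller than samples_per_batch for sample diversity
    lr=7.5e-7,
)
trainer.run()
```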
See Log job metrics for more information about tracking training metrics.
6. Save your trained model
After training, save your model using the model.save() method, for example:
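The call below is a sketch; whether model.save() takes a model name argument is an assumption.

```python
# Persist the trained model so it can be reused (the name argument is illustrative).
model.save("my-finetuned-model")
```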
Training classes
Harmony provides five main training algorithms, each suited for different use cases.
SFT (Supervised Fine-Tuning)
SFT is the foundation training method that teaches a model to follow instructions using demonstration data.
- dataset: List of StringThread objects containing training examples
- model: Training model instance
- logger: Metric logger for tracking training metrics. See Log job metrics for available loggers. (default: StdoutLogger())
- stage_notifier: Progress notifier (default: JobNotifier().stage_notifier("SFT Training"))
- callbacks: List of training callbacks (default: [])
- lr: Learning rate (default: 1e-5)
- lr_scheduler: Optional learning rate scheduler function
- samples_per_batch: Number of samples per training batch (default: 512)
- max_grad_norm: Gradient clipping norm (default: 1.0)
- epochs: Number of training epochs (default: 1)
- weight_decay: Weight decay coefficient (default: 0)
- skip_nan_gradients: Skip batches with NaN gradients (default: False)
- restart_from_checkpoint: Path to checkpoint to resume from
- checkpoint_frequency: How often to checkpoint (default: 0.2)
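As a rough illustration of how these parameters fit together, here is a hedged SFT instantiation; the way the trainer is started (run()) is an assumption.

```python
# Sketch: supervised fine-tuning on demonstration data.
sft = SFT(
    dataset=dataset,            # list of StringThread training examples
    model=model,                # trainable model instance
    lr=1e-5,
    samples_per_batch=512,
    epochs=1,
    checkpoint_frequency=0.2,   # checkpoint every 20% of training
)
sft.run()                       # start-of-training call is an assumption
```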
PPO (Proximal Policy Optimization)
PPO is a reinforcement learning algorithm that uses a reward function to improve model behavior, and estimates token-level advantages during training using a separate value model.
- dataset: List of StringThread prompts
- model: Policy model for training
- value_model: Value model for advantage estimation (must be a scalar model)
- grader: Grader that returns reward values
- logger: Metric logger (default: StdoutLogger())
- stage_notifier: Progress notifier (default: JobNotifier().stage_notifier("PPO Training"))
- callbacks: List of training callbacks (default: [])
- max_num_ppo_steps: Total number of PPO training steps (default: None for unlimited)
- value_only_fraction: Fraction of training for value-only updates (default: 0.25)
- lr_policy: Policy learning rate (default: 0.75e-6)
- lr_scheduler_policy: Optional policy learning rate scheduler
- lr_scheduler_value: Optional value learning rate scheduler
- lr_value: Value learning rate (default: 1e-6)
- samples_per_batch: Number of samples per PPO step (default: 128)
- samples_per_mini_batch: Mini-batch size for gradient updates (default: 128)
- mini_epochs_per_batch: Number of epochs per batch (default: 1)
- max_grad_norm: Gradient clipping norm (default: 1.0)
- clip_range: PPO clipping range (default: 0.1)
- kl_beta: KL divergence penalty coefficient (default: 0.1)
- gae_lambda: GAE lambda parameter for advantage estimation (default: 0.95)
- gae_gamma: GAE gamma discount factor (default: 1.0)
- weight_decay: Weight decay coefficient (default: 0)
- skip_nan_gradients: Skip batches with NaN gradients (default: False)
- restart_from_checkpoint: Path to checkpoint to resume from
- checkpoint_frequency: How often to checkpoint (default: 0.2)
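The sketch below highlights what is distinctive about PPO here, the separate scalar value model; how that value model is created is left out, and the run() call is an assumption.

```python
# Sketch: PPO needs both a policy model and a scalar value model.
# `value_model` is assumed to have been spawned elsewhere as a scalar model.
ppo = PPO(
    dataset=dataset,
    model=model,                 # policy model
    value_model=value_model,     # scalar model used for advantage estimation
    grader=grader,
    lr_policy=0.75e-6,
    lr_value=1e-6,
    value_only_fraction=0.25,    # spend the first quarter of training on value-only updates
    kl_beta=0.1,
)
ppo.run()
```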
GRPO (Group Relative Policy Optimization)
GRPO generates a group of several completions per prompt and uses relative ranking for training. The advantage for each completion is calculated as its individual reward minus the group's mean reward. It is important that samples_per_batch is larger than samples_per_mini_batch, so that there is sample diversity for each gradient update.
- dataset: List of StringThread prompts
- model: Training model
- grader: Grader that returns reward values
- logger: Metric logger (default: StdoutLogger())
- stage_notifier: Progress notifier (default: JobNotifier().stage_notifier("GRPO Training"))
- callbacks: List of training callbacks (default: [])
- max_num_grpo_steps: Total number of GRPO training steps (default: None for unlimited)
- completions_per_sample: Number of completions generated per prompt (default: 8)
- lr: Learning rate (default: 7.5e-7)
- lr_scheduler: Optional learning rate scheduler function
- samples_per_batch: Number of samples per batch (default: 128)
- samples_per_mini_batch: Mini-batch size for gradient updates (default: 128)
- mini_epochs_per_batch: Number of epochs per batch (default: 1)
- max_grad_norm: Gradient clipping norm (default: 1.0)
- clip_range: Clipping range for policy updates (default: 0.1)
- kl_beta: KL divergence penalty coefficient (default: 0.01)
- weight_decay: Weight decay coefficient (default: 0.0)
- skip_nan_gradients: Skip batches with NaN gradients (default: False)
- restart_from_checkpoint: Path to checkpoint to resume from
- checkpoint_frequency: How often to checkpoint (default: 0.2)
GSPO (Group Sequence Policy Optimization)
GSPO is a variant of GRPO that computes advantages at the sequence level rather than per-token, making it more aligned with the goal of most training runs (where token-level advantages are not meaningful at the task level) and generally more effective. Like GRPO, it generates multiple completions per prompt for relative ranking, but optimizes the entire sequence as a unit. GSPO is recommended over GRPO for most use cases.
- dataset: List of StringThread prompts
- model: Training model
- grader: Grader that returns reward values
- logger: Metric logger (default: StdoutLogger())
- stage_notifier: Progress notifier (default: JobNotifier().stage_notifier("GSPO Training"))
- callbacks: List of training callbacks (default: [])
- max_num_gspo_steps: Total number of GSPO training steps (default: None for unlimited)
- completions_per_sample: Number of completions per prompt (default: 8, typically higher than GRPO)
- lr: Learning rate (default: 7.5e-7)
- lr_scheduler: Optional learning rate scheduler function
- samples_per_batch: Number of samples per batch (default: 128)
- samples_per_mini_batch: Mini-batch size for gradient updates (default: 128)
- mini_epochs_per_batch: Number of epochs per batch (default: 1)
- max_grad_norm: Gradient clipping norm (default: 1.0)
- clip_range: Clipping range (default: 0.01, lower than GRPO's 0.1)
- kl_beta: KL divergence penalty coefficient (default: 0.01)
- weight_decay: Weight decay coefficient (default: 0.0)
- skip_nan_gradients: Skip batches with NaN gradients (default: False)
- restart_from_checkpoint: Path to checkpoint to resume from
- checkpoint_frequency: How often to checkpoint (default: 0.2)
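Since GSPO shares its parameters with GRPO, switching is mostly a matter of swapping the class. The sketch below assumes the same constructor and run() style as the GRPO example in the recipe walkthrough above.

```python
# Sketch: same setup as GRPO, but sequence-level advantages and a tighter clip_range.
gspo = GSPO(
    dataset=dataset,
    model=model,
    grader=grader,
    completions_per_sample=8,   # often set higher than for GRPO
    clip_range=0.01,            # GSPO default, lower than GRPO's 0.1
)
gspo.run()
```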
DPO (Direct Preference Optimization)
DPO trains models using preference data (preferred vs non-preferred responses) without explicit reward modeling.
- dataset: List of tuples containing (preferred_response, dispreferred_response) as StringThread pairs
- model: Training model
- logger: Metric logger (default: StdoutLogger())
- stage_notifier: Progress notifier (default: JobNotifier().stage_notifier("DPO Training"))
- callbacks: List of training callbacks (default: [])
- lr: Learning rate (default: 1e-6)
- lr_scheduler: Optional learning rate scheduler function
- samples_per_batch: Number of samples per batch (default: 32)
- max_grad_norm: Gradient clipping norm (default: 1.0)
- kl_beta: DPO beta parameter controlling KL divergence from reference model (default: 0.1)
- epochs: Number of training epochs (default: 1)
- skip_nan_gradients: Skip batches with NaN gradients (default: False)
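The main difference for DPO is the dataset shape: a list of (preferred, dispreferred) StringThread pairs rather than prompts plus a grader. The sketch below assumes preference_pairs has already been built in that form and that the trainer is started with run().

```python
# Sketch: DPO takes (preferred, dispreferred) pairs instead of a grader.
dpo = DPO(
    dataset=preference_pairs,   # list of (preferred_response, dispreferred_response) StringThread tuples
    model=model,
    lr=1e-6,
    kl_beta=0.1,                # strength of the pull toward the reference model
    samples_per_batch=32,
)
dpo.run()
```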
Checkpointing
All training classes support automatic checkpointing to save your model periodically during training. This is useful for long-running jobs where you want to recover from failures or inspect intermediate model states.
Using checkpoint_frequency
Set the checkpoint_frequency parameter to control how often checkpoints are saved. The frequency is measured as a fraction of training completion (0.0 to 1.0); see the sketch after the list below.
- Checkpoints are saved at regular intervals based on training progress
- Default frequency is 0.2 (every 20% of training)
- Models are saved with names like {model_name}-{progress}, e.g., policy-0.2, policy-0.4, policy-0.6
- The final trained model must still be saved explicitly using model.save() after training completes
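As a brief illustration, the snippet below sets checkpoint_frequency and resumes from an earlier checkpoint; the checkpoint path and the run()/save() call forms are assumptions.

```python
# Sketch: checkpoint every 10% of training and resume from a previous run.
sft = SFT(
    dataset=dataset,
    model=model,
    checkpoint_frequency=0.1,                          # save at 10%, 20%, ... of training progress
    restart_from_checkpoint="checkpoints/policy-0.4",  # hypothetical path to an earlier checkpoint
)
sft.run()
model.save("my-finetuned-model")                       # the final model must still be saved explicitly
```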
Best practices
- Set appropriate frequency: For long jobs, use checkpoint_frequency=0.1 or 0.2. For shorter jobs, you can use higher values like 0.25 or 0.5.
- Storage considerations: Checkpoints can be large for big models. Make sure you have sufficient storage for multiple checkpoints.
- Cleanup: Remove intermediate checkpoints after training completes if you only need the final model.

