- A model to train - the base model you want to fine-tune
- A training dataset - data to train the model on
- Graders - for RL methods, the functions that will provide a reward to the model
Training recipe structure
1. Create an input configuration
First, create an input configuration that holds all the parameters you want to allow callers to pass to your recipe.
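For instance, an input configuration might look roughly like the sketch below. The `RecipeInput` name, the use of a plain dataclass, and the specific fields are assumptions for illustration, not the exact recipe-configuration API.

```python
# Hypothetical sketch: field names and the dataclass base are illustrative only.
from dataclasses import dataclass


@dataclass
class RecipeInput:
    model_key: str               # base model to fine-tune
    dataset_key: str             # training dataset to pull from Adaptive
    lr: float = 1e-6             # learning rate exposed to recipe callers
    samples_per_batch: int = 32  # batch size per training step
    kl_beta: float = 0.1         # KL penalty used by the RL training classes
```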
2. Load a model to train
To train a model, use the `spawn_train` method (see details here). This method creates a trainable instance of your model. We set the `world_size` (the number of GPUs the job is running on) as the `tp` degree of the model by grabbing it from the recipe context `ctx`.
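A minimal sketch of this step follows. Only `spawn_train`, `ctx.world_size`, and the `tp` argument come from the description above; the object that owns `spawn_train`, its exact signature, and the `RecipeInput` type from the earlier sketch are assumptions.

```python
# Sketch: the owning object and signature of spawn_train are assumed;
# imports are omitted because they depend on your Harmony SDK version.
def load_model(client, ctx, config: RecipeInput):
    # Create a trainable instance of the base model, using every GPU of the
    # job (world_size) as the tensor-parallel (tp) degree.
    model = client.spawn_train(
        config.model_key,
        tp=ctx.world_size,
    )
    return model
```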
3. Load a dataset
You can load datasets from Adaptive or external sources:
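Two hedged examples: pulling a dataset registered in Adaptive by key, and reading raw records from a local file to turn into `StringThread` examples. The `load_dataset` call and the JSONL layout are placeholders, not confirmed API.

```python
import json

# Option A (sketch): load a dataset already registered on the platform.
# The exact loading call is a placeholder; see the dataset docs for your SDK.
dataset = client.load_dataset(config.dataset_key)

# Option B (sketch): read raw records from a local JSONL file and convert
# each one to a StringThread (conversion omitted: the StringThread
# constructor depends on the SDK version).
with open("train.jsonl") as f:
    records = [json.loads(line) for line in f]
```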
4. Set up Graders
For reinforcement learning training, you'll need reward functions. Below, we show the very common case where your reward strategy has several components and you want to produce a final reward by aggregating all of your grader scores. You can use the `CombinedGrader` class to do this aggregation.
The `setup` method should always be called on a grader before it is used; it makes sure to spawn your AI judges if you have AI-judge graders (this is why you also pass the `client` to graders). If you have a `CombinedGrader`, calling `setup` on it will call the `setup` method on every child grader.
You do not need to rely solely on graders you import from Adaptive; you can write your own custom grader within your recipe and pass it to `CombinedGrader` along with other graders you've created and registered in the platform.
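A hedged sketch of this step: a custom grader defined inside the recipe is combined with a platform grader through `CombinedGrader`, then `setup` is called once with the `client`. The custom grader's base class, method name, and signature, as well as the `CombinedGrader` constructor arguments, are assumptions.

```python
# Sketch: base class, method names, and constructor arguments are assumed;
# imports are omitted because they depend on your Harmony SDK version.
class LengthPenaltyGrader(Grader):
    """Custom grader written inside the recipe: penalizes long completions."""

    def grade(self, completion: str) -> float:
        # Reward 1.0 for short completions, decaying linearly to 0.0.
        return max(0.0, 1.0 - len(completion) / 2000)


grader = CombinedGrader(
    graders=[
        judge_grader,            # an AI-judge grader registered on the platform
        LengthPenaltyGrader(),   # the custom grader defined above
    ],
)

# setup() must run before the grader is used; for AI-judge graders this is
# what spawns the judges, which is why the client is passed in. Calling it
# on a CombinedGrader also sets up every child grader.
grader.setup(client)
```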
5. Call your training class
Everything is now set up to call and run your training class. The `GRPO` class will run the main training loop.
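Putting the pieces together, the call might look roughly like this. The constructor arguments mirror the GRPO parameters listed under Training classes below; the values and the `run()` method that launches the loop are assumptions.

```python
# Sketch: values are illustrative and run() is an assumed launch method.
trainer = GRPO(
    dataset=dataset,
    model=model,
    grader=grader,
    completions_per_sample=8,
    lr=config.lr,
    kl_beta=config.kl_beta,
)
trainer.run()
```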
6. Save your trained model
After training, save your model using the `model.save()` method. If you want to guarantee there are no model key collisions (in case you've saved a model with the same key in the past), you can use the `save_model_safely` method instead.
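For example (sketch): the argument taken by `model.save()` and the exact signature of `save_model_safely` are assumptions.

```python
# Plain save: may collide with a model previously saved under the same key.
model.save("my-finetuned-model")

# Collision-safe alternative (sketch: exact signature assumed).
save_model_safely(model, "my-finetuned-model")
```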
Training classes
Harmony provides four main training algorithms, each suited for different use cases:
SFT (Supervised Fine-Tuning)
SFT is the foundation training method that teaches a model to follow instructions using demonstration data.
- `dataset`: List of `StringThread` objects containing training examples
- `model`: Training model instance
- `lr`: Learning rate
- `samples_per_batch`: Batch size
- `max_grad_norm`: Gradient clipping norm
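A hedged instantiation sketch using the parameters above; the values and the `sft_examples` variable are illustrative only.

```python
# Sketch: values are illustrative; launching the loop works as in the
# recipe walkthrough above.
sft = SFT(
    dataset=sft_examples,     # list of StringThread training examples
    model=model,
    lr=1e-5,
    samples_per_batch=16,
    max_grad_norm=1.0,
)
```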
PPO (Proximal Policy Optimization)
PPO is a reinforcement learning algorithm that uses a reward function to improve model behavior. It estimates token-level advantages during training with a separate value model.
- `dataset`: List of `StringThread` prompts
- `policy_model`: Policy model for training
- `value_model`: Value model for advantage estimation
- `grader`: Grader that returns reward values
- `lr_policy`: Policy learning rate
- `lr_value`: Value learning rate
- `kl_beta`: KL divergence penalty coefficient
- `samples_per_batch`: Number of samples that compose a single PPO step
- `max_num_ppo_steps`: Total number of PPO training steps to take
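A hedged instantiation sketch using the parameters above; the values and the `prompts` variable are illustrative only.

```python
# Sketch: values are illustrative; a separate value model estimates
# token-level advantages.
ppo = PPO(
    dataset=prompts,              # list of StringThread prompts
    policy_model=policy_model,
    value_model=value_model,
    grader=grader,
    lr_policy=1e-6,
    lr_value=1e-5,
    kl_beta=0.1,
    samples_per_batch=64,         # samples composing one PPO step
    max_num_ppo_steps=100,
)
```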
GRPO (Group Relative Policy Optimization)
GRPO generates a group of several completions per prompt and uses relative ranking for training. The advantage for each completion is calculated as its individual reward minus the group's mean reward. It is important that `samples_per_batch` is larger than `samples_per_mini_batch`, so that there is sample diversity for each gradient update.
- `dataset`: List of `StringThread` prompts
- `model`: Training model
- `grader`: Grader that returns reward values
- `completions_per_sample`: Number of completions per prompt
- `lr`: Learning rate
- `kl_beta`: KL divergence penalty coefficient
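The group-relative advantage described above is simply each completion's reward minus the mean reward of its group; the following standalone snippet (not Harmony code) illustrates the computation.

```python
# Standalone illustration of the group-relative advantage (not Harmony code).
def group_relative_advantages(rewards: list[float]) -> list[float]:
    mean_reward = sum(rewards) / len(rewards)
    return [r - mean_reward for r in rewards]


# Rewards for one prompt's group of four completions.
print(group_relative_advantages([1.0, 0.5, 0.0, 0.5]))  # [0.5, 0.0, -0.5, 0.0]
```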
DPO (Direct Preference Optimization)
DPO trains models using preference data (preferred vs. non-preferred responses) without explicit reward modeling.
- `dataset`: List of tuples containing (preferred_response, dispreferred_response)
- `model`: Training model
- `lr`: Learning rate
- `samples_per_batch`: Batch size
- `beta`: DPO beta parameter
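A hedged instantiation sketch using the parameters above; the values and the `preference_pairs` variable are illustrative only.

```python
# Sketch: each dataset item pairs a preferred and a non-preferred response.
dpo = DPO(
    dataset=preference_pairs,  # list of (preferred_response, dispreferred_response) tuples
    model=model,
    lr=5e-7,
    samples_per_batch=32,
    beta=0.1,
)
```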