In this page, we assume reasonable familiarity with reinforcement learning with LLMs.
Training primitives
TrainingModel
This is the main class through which you interact with model weights. The Load models page explains in detail how to load a model, but to recap:
- Choose the tp parameter (tensor parallelism) to be the smallest value that still fits the model in your GPUs.
- Choose the max_seq_len parameter to be larger than your expected rollout size in tokens, while still fitting in your GPUs.
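As a back-of-the-envelope illustration of "smallest tp that fits" (a deliberately simplified memory model for illustration, not how Harmony actually plans placement):

```python
def smallest_tp(model_bytes: float, gpu_bytes: float,
                overhead_frac: float = 0.3, max_tp: int = 8) -> int:
    """Return the smallest power-of-two tensor-parallel degree whose per-GPU
    weight shard, plus an activation/KV-cache overhead margin, fits in one GPU.
    Simplified memory model for illustration only."""
    tp = 1
    while tp <= max_tp:
        if (model_bytes / tp) * (1 + overhead_frac) <= gpu_bytes:
            return tp
        tp *= 2
    raise ValueError("model does not fit even at max_tp")

GiB = 1024**3
# A 14 GiB checkpoint (roughly a 7B model in bf16) fits an 80 GiB GPU at tp=1;
# a 140 GiB checkpoint (roughly a 70B model in bf16) needs tp=4.
print(smallest_tp(14 * GiB, 80 * GiB))   # -> 1
print(smallest_tp(140 * GiB, 80 * GiB))  # -> 4
```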
Generating text
Once a TrainingModel has been loaded, you can use it directly to make async inference calls. The inputs and outputs are StringThread objects, corresponding to chats. Details about creation methods are given in the thread page.
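The call pattern looks roughly like the following sketch, which uses stand-in classes so it runs on its own; the real classes and method signatures belong to adaptive_harmony and may differ:

```python
import asyncio

# Stand-ins that mimic the call pattern only; the real StringThread and
# TrainingModel live in adaptive_harmony, and the method names here are
# assumptions for illustration.
class StringThread:
    def __init__(self, turns):
        self.turns = list(turns)  # e.g. [("user", "..."), ("assistant", "...")]

class FakeTrainingModel:
    async def generate(self, thread: StringThread) -> StringThread:
        # The real call is awaitable, so many rollouts can run concurrently.
        await asyncio.sleep(0)  # pretend to do inference
        return StringThread(thread.turns + [("assistant", "Hello!")])

async def main():
    model = FakeTrainingModel()
    prompts = [StringThread([("user", f"question {i}")]) for i in range(4)]
    # Fire all generations concurrently and gather the completed threads.
    return await asyncio.gather(*(model.generate(p) for p in prompts))

threads = asyncio.run(main())
print(len(threads), threads[0].turns[-1])  # 4 ('assistant', 'Hello!')
```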
Generating tokens
When training with RL, you need the raw tokens to be able to compute log-probabilities and losses. Harmony provides tokenized versions of threads, called adaptive_harmony.TokenizedThread, as well as utils for tokenization and detokenization:
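As a toy illustration of the tokenize/detokenize round trip (a character-level stand-in, not the model's real tokenizer or Harmony's actual utils):

```python
# Character-level stand-in tokenizer. Harmony's real utilities operate on
# threads with the model's own tokenizer; only the round-trip idea is shown.
def tokenize(text: str) -> list[int]:
    return [ord(c) for c in text]

def detokenize(tokens: list[int]) -> str:
    return "".join(chr(t) for t in tokens)

tokens = tokenize("hello")
assert detokenize(tokens) == "hello"  # lossless round trip
print(tokens)  # [104, 101, 108, 108, 111]
```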
Logprobs
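To make concrete what the returned log-probabilities represent, here is the toy arithmetic they support (no Harmony call involved): the log-probability of a whole completion is the sum of its per-token logprobs.

```python
import math

# Per-token probabilities of a completion under the model (toy numbers).
token_probs = [0.5, 0.25, 0.8]
token_logprobs = [math.log(p) for p in token_probs]

# Summing per-token logprobs scores the whole sequence:
sequence_logprob = sum(token_logprobs)
assert math.isclose(sequence_logprob, math.log(0.5 * 0.25 * 0.8))
print(sequence_logprob)  # log(0.1), about -2.3026
```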
Use the model to get per-token log-probabilities for a TokenizedThread; these are the quantities the RL losses below are computed from.
Training: backward
For training, we have implemented the losses of the most common algorithms (cross-entropy for SFT, PPO, GRPO, DPO…) inside Harmony. You can do a backward step for each of these losses by applying the corresponding trainer function on the TrainingModel, for instance:
- train_ppo
- train_grpo
- train_gspo
- train_dpo
- train_tangents
See the harmony_client reference for the definitions of the parameters of these methods.
Note that these methods only compute the gradient contribution for the given samples; they do not change the model parameters (that is done by calling optim_step).
The train_tangents method is special in that it lets you pass arbitrary losses, so you can implement any RL algorithm not covered by our libraries. See the Bring your own loss functions section at the end of this page for instructions on how to use it.
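To make the accumulate-then-step contract concrete, here is a toy stand-in (not the adaptive_harmony API): per-sample train_* calls only accumulate gradient contributions, and parameters move only when optim_step is called.

```python
# Toy stand-in for the backward / optim_step contract. The real trainer
# methods (train_ppo, train_grpo, ...) and AdamW live inside Harmony;
# plain SGD on a single scalar stands in for them here.
class ToyTrainer:
    def __init__(self, param: float = 0.0, lr: float = 0.5):
        self.param = param
        self.lr = lr
        self._grad_accum = 0.0

    def train_toy_loss(self, grad_contribution: float) -> None:
        # Analogous to a train_* call: accumulate only, no parameter update.
        self._grad_accum += grad_contribution

    def optim_step(self) -> None:
        # One gradient-descent step for the whole accumulated batch.
        self.param -= self.lr * self._grad_accum
        self._grad_accum = 0.0

trainer = ToyTrainer()
for g in [1.0, 2.0, 3.0]:
    trainer.train_toy_loss(g)
assert trainer.param == 0.0   # backward calls alone change nothing
trainer.optim_step()
assert trainer.param == -3.0  # the update lands once, at optim_step
```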
Training: optim step
Once you have computed the backward pass for all samples in your batch, you need to perform a gradient descent step. We use the AdamW optimizer under the hood.
Putting it all together: simplified GRPO walkthrough
We will see how all these building blocks interact by implementing GRPO, one of the most common RL methods for LLMs. Please refer to the original paper for an in-depth explanation of the algorithm. For simplicity, we only illustrate the single-turn variant of GRPO here; see adaptive_harmony.common.env_grpo for our multi-turn variant.
Init: defining useful variables
We will need to use the following objects in the GRPO class:
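As a hypothetical sketch of that state (the attribute names are assumptions, apart from completions_per_sample, which the walkthrough below relies on):

```python
from dataclasses import dataclass
from typing import Any, Callable

# Hypothetical sketch of what a single-turn GRPO loop needs to hold; this is
# not the real adaptive_harmony.common implementation.
@dataclass
class GRPO:
    model: Any                       # the TrainingModel being trained
    reward_fn: Callable[[Any], float]  # scores one completed thread
    completions_per_sample: int = 8  # group size G from the GRPO paper
    batch_size: int = 32             # samples per optim_step

grpo = GRPO(model=None, reward_fn=lambda thread: 0.0)
print(grpo.completions_per_sample)  # 8
```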
Generate data
The first phase of the algorithm is to generate self.completions_per_sample completions from a single prompt.
We use a helper dataclass for sample collection:
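A sketch of such a dataclass follows; the field names are assumptions chosen to match what the GRPO loss needs, not the exact helper from adaptive_harmony.common:

```python
from dataclasses import dataclass, field

# Hypothetical rollout-collection helper: one instance per sampled completion.
@dataclass
class Sample:
    prompt: str                  # the original user prompt
    completion: str              # one of the G sampled completions
    tokens: list = field(default_factory=list)    # completion token ids
    logprobs: list = field(default_factory=list)  # per-token logprobs
    reward: float = 0.0          # scalar reward from the reward function
    advantage: float = 0.0       # group-normalized, filled in later

s = Sample(prompt="2+2?", completion="4", reward=1.0)
print(s.reward, s.advantage)  # 1.0 0.0
```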
Run the algorithm
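The numerical core of the step is the group-normalized advantage from the GRPO paper: each completion's reward is standardized within its group of G completions. A self-contained sketch of that arithmetic (the subsequent trainer calls are only indicated in comments, with hypothetical call sites):

```python
import statistics

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO advantage: A_i = (r_i - mean(r)) / std(r), computed within one
    group of G completions of the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    if std == 0.0:
        return [0.0 for _ in rewards]  # identical rewards carry no signal
    return [(r - mean) / std for r in rewards]

# Four completions of one prompt, scored 0/1 by a verifier:
advs = group_advantages([1.0, 0.0, 1.0, 0.0])
print(advs)  # [1.0, -1.0, 1.0, -1.0]
# Each (sample, advantage) pair would then be fed to train_grpo, and the
# batch finished with a single optim_step (hypothetical call pattern).
```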
Once the data generation part is written, the entire algorithm is quite straightforward.
Bring your own loss functions
We also offer a way to train with arbitrary loss functions with train_tangents, by specifying gradient values (tangents) at each token. There are two options:
- If you have a closed-form solution for the gradient of your loss with respect to the model logprobs, you can simply compute this value with your numerical library of choice and pass it to train_tangents.
- Otherwise, you can use autodiff software to compute the gradient of your loss function with respect to the logprobs.
Say you have a loss_fn written in torch. You would do the following for a given sample (TokenizedThread):
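A minimal sketch of that autodiff route, assuming torch is available: make the per-token logprobs a leaf tensor, run loss_fn, and read the per-token gradient off .grad; those values are the tangents to pass to train_tangents (exact signature per the harmony_client reference).

```python
import torch

# Per-token logprobs for one sample, as a leaf tensor we can differentiate
# with respect to. In practice these values would come from the model.
logprobs = torch.tensor([-0.5, -1.2, -0.3], requires_grad=True)
advantage = 2.0

def loss_fn(lp: torch.Tensor) -> torch.Tensor:
    # A simple policy-gradient style loss; substitute your own.
    return -(advantage * lp).sum()

loss_fn(logprobs).backward()
tangents = logprobs.grad  # d(loss)/d(logprob) per token
print(tangents)  # tensor([-2., -2., -2.])
# These per-token values are what train_tangents consumes.
```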

