Loading a model
To load a model in a custom recipe, use the harmony client and pass the model key, prefixed with model_registry://, to the asynchronous model method. This method returns a ModelBuilder object, on which you can call several methods to configure how your model will be spawned (see the sketch after this list):
- .tp() to set the tensor parallelism (TP) of the model (explained in a later section of this page).
- with_adapter() to enable lightweight adapter training instead of full-parameter fine-tuning (only use this if you are training the model).
- into_scoring_model() to convert the model into a scoring model (only use this if you are training the model to predict a scalar value, e.g. when training a value model for PPO or a reward model).
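Putting these together, here is a minimal sketch of loading and configuring a model. The recipe function scaffolding, the client variable name (harmony), the model key, and the assumption that builder methods are chainable are illustrative, not confirmed API details; only model(), tp(), with_adapter(), and into_scoring_model() are named on this page.

```python
# Hypothetical recipe scaffolding around the documented ModelBuilder calls.
async def my_recipe(harmony):
    # model() is async and returns a ModelBuilder
    builder = await harmony.model("model_registry://my-base-model")
    builder = builder.tp(2)           # split the model across 2 GPUs
    builder = builder.with_adapter()  # adapter training instead of full fine-tuning
```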
Spawn methods
ModelBuilder also exposes spawn_train and spawn_inference methods.
Adaptive Engine unifies training and inference. Instead of requiring different frameworks or runtimes for training and inference, you can simply spawn models meant for training or inference with spawn_train and spawn_inference. A model spawned with spawn_train requires more GPU memory upfront, since Adaptive makes sure that enough memory is available at spawn time to fit the required max_batch_size during training (model activations, optimizer state, etc.). If you are not training a given model in your recipe (for example, you are spawning a judge model), always spawn it with spawn_inference to reduce GPU memory pressure.
The max_batch_size parameter defines the maximum number of tokens that can be allocated in a single training batch, i.e. a mini-batch that the model processes in parallel; the batch size corresponding to each optimization step is user-defined and independent from this parameter. It also limits the maximum sequence length the model can train on. In the worst case, for a dataset of samples whose length is close to max_batch_size, the model will train on a single sample at a time. Any sequences longer than max_batch_size are simply dropped by the training classes, which also reconcile the desired optimization-step batch size in number of samples. For example, with max_batch_size = 4096, a 5,000-token sequence is dropped, while shorter sequences are grouped so that no mini-batch exceeds 4,096 tokens.
spawn_train returns a TrainingModel, and spawn_inference returns an InferenceModel; they are both async methods.
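As an illustration, here is a hedged sketch of spawning one model for training and another for inference. Whether max_batch_size is passed as a keyword to spawn_train is an assumption, as are the variable names and model keys; the spawn methods themselves and the memory behavior are from the paragraph above.

```python
async def my_recipe(harmony):
    # Model being trained: training memory (activations, optimizer state, ...)
    # is reserved upfront. Passing max_batch_size here is an assumption.
    trainable = await harmony.model("model_registry://my-base-model")
    policy = await trainable.tp(2).spawn_train(max_batch_size=4096)  # TrainingModel

    # Judge model used only for inference: spawn it with spawn_inference
    # to reduce GPU memory pressure.
    judge_builder = await harmony.model("model_registry://my-judge-model")
    judge = await judge_builder.spawn_inference()  # InferenceModel
```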
Tensor parallelism (tp)
Tensor parallelism (tp) determines how many GPUs a model is split across during execution.
Choosing the right tp value depends on your model size and available hardware: larger models typically require a higher tp to fit into memory, while smaller models may run efficiently with tp=1. Also, as explained above, a model that fits on 2 GPUs with tp=2 when spawned with spawn_inference might not fit if it is spawned with spawn_train.
Always ensure that the value you set for tp matches the number of devices you want to use and is supported by your infrastructure.
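As a rough illustration, you might derive tp from the number of visible GPUs. This heuristic and the torch-based device count are assumptions for the sketch, not Adaptive's policy:

```python
import torch

async def my_recipe(harmony):
    # Use every visible GPU for a large model; a small model may only need tp=1.
    tp = max(torch.cuda.device_count(), 1)
    builder = await harmony.model("model_registry://my-base-model")
    model = await builder.tp(tp).spawn_inference()
```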
Passing model parameters as config input
In custom recipes, you can pass models in the recipe config using the magic class AdaptiveModel. Adaptive will validate that the user-configured parameter model_to_train below is a valid model key in Adaptive Engine.
You can access the path used to deploy the model with AdaptiveModel().path.
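A minimal sketch of what such a config might look like. The import paths, the pydantic-style config base class, and the recipe signature are assumptions for illustration; AdaptiveModel, model_to_train, and .path come from this page.

```python
from pydantic import BaseModel

# Import path for AdaptiveModel is an assumption for this sketch.
from adaptive_harmony import AdaptiveModel

class RecipeConfig(BaseModel):
    # Adaptive validates that the configured value is a valid model key.
    model_to_train: AdaptiveModel

async def my_recipe(harmony, config: RecipeConfig):
    # .path resolves the configured model so it can be loaded and spawned.
    builder = await harmony.model(config.model_to_train.path)
    model = await builder.spawn_train(max_batch_size=4096)
```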