Loading models in your recipes
To load a model, pass a `model_registry://` path to the asynchronous `model` method. This method returns a `ModelBuilder` object, on which you can call several methods to configure how your model will be spawned:
- `.tp()` to set the tensor parallelism (TP) of the model (explained in a later section of this page).
- `.with_adapter()` to enable lightweight adapter training instead of full-parameter fine-tuning (only use this if you are training the model).
- `.into_scoring_model()` to convert the model into a scoring model (only use this if you are training the model to predict a scalar value, e.g. when training a value model for PPO or a reward model).
- `spawn_train` and `spawn_inference` to spawn the configured model for training or inference, respectively (see below and the sketch after this list).
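Put together, loading and configuring a model might look like the sketch below. This is illustrative rather than verbatim: the registry key is a placeholder, and fluent chaining of the `ModelBuilder` methods is an assumption.

```python
# Sketch: registry key and fluent chaining of ModelBuilder methods are assumptions.
builder = await model("model_registry://my-base-model")  # returns a ModelBuilder

# Configure the model before spawning it: 2-way tensor parallelism,
# lightweight adapter training instead of full-parameter fine-tuning.
trainable = await builder.tp(2).with_adapter().spawn_train()
```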
Adaptive Engine unifies training and inference. Instead of requiring different frameworks/runtimes for training and inference, you can simply spawn models meant for training or inference with `spawn_train` and `spawn_inference`. A model spawned with `spawn_train` will require more GPU memory upfront, since Adaptive makes sure that enough memory is available at spawn time to fit the required `max_batch_size` during training (model activations, optimizer state, etc.). If you are not training a given model in your recipe (for example, you are spawning a judge model), make sure to always spawn it with `spawn_inference` to reduce GPU memory pressure.
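For example, a recipe that fine-tunes one model while using a second model purely as a judge might spawn them as follows. Treat this as a sketch: the registry keys are placeholders, and passing `max_batch_size` as a keyword to `spawn_train` is an assumption.

```python
# Sketch: registry keys and the max_batch_size keyword are assumptions.
# spawn_train returns a TrainingModel and reserves training memory upfront.
policy = await (await model("model_registry://policy-model")).spawn_train(
    max_batch_size=8192
)

# The judge is never trained, so spawn it for inference only (an InferenceModel)
# to reduce GPU memory pressure.
judge = await (await model("model_registry://judge-model")).spawn_inference()
```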
The `max_batch_size` parameter defines the maximum number of tokens that can be allocated in a single training batch, i.e. a mini-batch that is processed by the model in parallel; the batch size corresponding to each optimization step is user-defined and independent from this parameter. It also limits the maximum sequence length that the model is able to train on: in the worst case, for a dataset of samples whose length is close to `max_batch_size`, the model will train on a single sample at a time. Any sequence longer than `max_batch_size` is simply dropped by the training classes, which also reconcile the desired optimization-step batch size (expressed in number of samples).
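To make the token-budget behavior concrete, here is a small self-contained sketch of the constraint. The greedy packing loop is purely illustrative; it is not Adaptive's actual batching code.

```python
# Illustrative only: how a token budget of max_batch_size shapes mini-batches.
max_batch_size = 8192  # maximum tokens per mini-batch

seq_lens = [1000, 9000, 3000, 4000, 5000]

batch, used = [], 0
for n in seq_lens:
    if n > max_batch_size:
        continue  # too long to ever fit: the sequence is dropped entirely
    if used + n > max_batch_size:
        break  # this mini-batch is full; later sequences go to the next one
    batch.append(n)
    used += n

print(batch, used)  # [1000, 3000, 4000] 8000 — the 9000-token sample was dropped
```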
`spawn_train` returns a `TrainingModel`, and `spawn_inference` returns an `InferenceModel`; they are both async methods.
Tensor parallelism (`tp`) determines how many GPUs a model is split across during execution. Choosing the right `tp` value depends on your model size and available hardware: larger models typically require a higher `tp` to fit into memory, while smaller models may run efficiently with `tp=1`. Also, as explained above, a given model that fits on 2 GPUs with `tp=2` when spawned with `spawn_inference` might not fit if it is spawned with `spawn_train`. Always ensure that the value you set for `tp` matches the number of devices you want to use and is supported by your infrastructure.
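As a rough sizing intuition: a 70B-parameter model stored in 16-bit precision needs about 140 GB for its weights alone, so it cannot fit on a single 80 GB GPU even for inference, while a 7B model (~14 GB of weights) usually can. A hedged sketch of setting `tp` accordingly (registry keys and `tp` values are placeholders, not recommendations):

```python
# Sketch: model keys and tp choices are illustrative placeholders.
small = await (await model("model_registry://7b-model")).tp(1).spawn_inference()
large = await (await model("model_registry://70b-model")).tp(4).spawn_inference()
```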
In recipe configs, model parameters are declared with the `AdaptiveModel` type. Adaptive will validate that the user-configured parameter for `model_to_train` in the example below is a valid model key in Adaptive Engine. You can access the path used to deploy the model with `AdaptiveModel().path`.
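A minimal sketch of what such a config and its use might look like. The config class, import paths, and method signatures here are assumptions for illustration, not the exact Adaptive API.

```python
from pydantic import BaseModel  # assumption: configs are Pydantic-style models

# Assumption: AdaptiveModel and model are importable from the recipes SDK;
# the exact import path depends on your Adaptive installation.
# from adaptive import AdaptiveModel, model

class RecipeConfig(BaseModel):
    # Validated by Adaptive to be an existing model key in Adaptive Engine.
    model_to_train: AdaptiveModel

async def run(config: RecipeConfig) -> None:
    # .path resolves to the registry path used to load the configured model.
    builder = await model(config.model_to_train.path)
    trainable = await builder.with_adapter().spawn_train()
```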