Loading a model

To load a model in a custom recipe, use the harmony client: pass the model key, prefixed with model_registry://, to the asynchronous model method. This method returns a ModelBuilder object, on which you can chain several methods to configure how the model will be spawned.
  • .tp() sets the tensor parallelism (TP) of the model (explained in a later section on this page)
  • .with_adapter() enables lightweight adapter training instead of full-parameter fine-tuning (only use this if you are training the model)
  • .into_scoring_model() converts the model into a scoring model (only use this if you are training the model to predict a scalar value, e.g. a value model for PPO or a reward model)
from adaptive_harmony.runtime import recipe_main, RecipeContext

@recipe_main
async def main(ctx: RecipeContext):
    client = ctx.client
    
    # Full parameter Llama 3.1 8B
    policy = client.model(
        path="model_registry://llama-3.1-8b-instruct",
        kv_cache_len=100_000,
        tokens_to_generate=2048,
    ).tp(1)

    # Adapter based Qwen 3 32B, turned into a scoring model (for use as reward model)
    reward_model = client.model(
        path="model_registry://qwen3-32b",
        kv_cache_len=100_000,
        tokens_to_generate=2048,
    ).tp(1).with_adapter().into_scoring_model()

Spawn methods

ModelBuilder also exposes spawn_train and spawn_inference methods. Adaptive Engine unifies training and inference: instead of requiring different frameworks or runtimes for each, you simply spawn models meant for training with spawn_train and models meant for inference with spawn_inference. Both are async methods; spawn_train returns a TrainingModel, and spawn_inference returns an InferenceModel.

A model spawned with spawn_train requires more GPU memory upfront, since Adaptive makes sure at spawn time that enough memory is available to fit the required max_batch_size during training (model activations, optimizer state, etc.). If you are not training a given model in your recipe (for example, you are spawning a judge model), always spawn it with spawn_inference to reduce GPU memory pressure.

The max_batch_size parameter defines the maximum number of tokens that can be allocated in a single training batch, i.e. a mini batch that the model processes in parallel; the batch size for each optimization step is user-defined and independent of this parameter. max_batch_size also limits the maximum sequence length the model can train on: in the worst case, for a dataset of samples whose lengths are close to max_batch_size, the model will train on a single sample at a time. Any sequences longer than max_batch_size are simply dropped by the training classes, which also reconcile the desired optimization-step batch size in number of samples.
from adaptive_harmony.runtime import recipe_main, RecipeContext

@recipe_main
async def main(ctx: RecipeContext):
    client = ctx.client
    train_model = (
        await client.model(
            path="model_registry://llama-3.1-8b-instruct",
            kv_cache_len=100_000,
            tokens_to_generate=2048,
        )
        .tp(1)
        .spawn_train(name="train_model", max_batch_size=4096)
    )

    inference_model = (
        await client.model(
            path="model_registry://llama-3.1-8b-instruct",
            kv_cache_len=100_000,
            tokens_to_generate=2048,
        )
        .tp(1)
        .spawn_inference(name="inf_model")
    )

Tensor parallelism (tp)

Tensor parallelism (tp) determines how many GPUs a model is split across during execution. The right tp value depends on your model size and available hardware: larger models typically require a higher tp to fit into memory, while smaller models may run efficiently with tp=1. Also, as explained above, a model that fits on 2 GPUs with tp=2 when spawned with spawn_inference might not fit when spawned with spawn_train. Always ensure that the tp value you set matches the number of devices you want to use and is supported by your infrastructure.
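As a rough way to pick a tp value, you can estimate the per-GPU memory taken by the weights alone. This back-of-the-envelope helper is not an adaptive_harmony API, just illustrative arithmetic; KV cache, activations, and (for spawn_train) optimizer state come on top of the figure it returns.

```python
def weight_mem_per_gpu_gb(n_params_billions: float, bytes_per_param: int, tp: int) -> float:
    """Approximate per-GPU memory (in GB) for model weights when the
    model is split across `tp` GPUs with tensor parallelism."""
    return n_params_billions * bytes_per_param / tp

# An 8B-parameter model in bf16 (2 bytes per parameter):
print(weight_mem_per_gpu_gb(8, 2, tp=1))  # → 16.0 (GB of weights on one GPU)
print(weight_mem_per_gpu_gb(8, 2, tp=2))  # → 8.0  (GB of weights per GPU)
```

If the estimate plus training overhead exceeds a single GPU's memory, that is a signal to raise tp or switch to adapter training.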

Vision-language models

VLMs spawn an image encoder alongside the text model. One encoder is created per backbone and shared across all adapters on that backbone.

Passing model parameters as config input

In custom recipes, you can pass models in the recipe config using the magic class AdaptiveModel. Adaptive will validate that the user-configured value for model_to_train below is a valid model key in Adaptive Engine. You can access the path used to deploy the model through the .path attribute (config.model_to_train.path in the example below).

Using default deployment parameters

When you spawn a model with client.model(), the model is configured with the default deployment parameters set in the Adaptive platform (KV cache length, tensor parallelism, tokens to generate). You can change these parameters globally for a model by visiting its model details page (click the model in the organizational model registry page) and editing the “Inference Configuration” setting in the right-hand menu. Inference defaults often won’t make sense for the more memory-intensive training regime, so you can override any default by passing parameters directly.

Overridable parameters:
  • kv_cache_len - KV cache length for the model
  • tokens_to_generate - Maximum tokens to generate per completion
  • .tp() - Tensor parallelism (number of GPUs to split the model across)
from adaptive_harmony.runtime import InputConfig, AdaptiveModel, recipe_main, RecipeContext

class MyConfig(InputConfig):
    model_to_train: AdaptiveModel
    tp: int
    max_seq_len: int
    train_adapter: bool

@recipe_main
async def main(config: MyConfig, ctx: RecipeContext):
    client = ctx.client

    model_builder = client.model(
        config.model_to_train.path,
        kv_cache_len=100_000,              # Override KV cache length
        tokens_to_generate=2048,           # Override max tokens to generate
    ).tp(config.tp)                        # Override tensor parallelism

    if config.train_adapter:
        model_builder = model_builder.with_adapter()

    await model_builder.spawn_train(name="model_trained", max_batch_size=config.max_seq_len)