Loading a model

To load a model in a custom recipe, use the Model class from adaptive_harmony.parameters. Call await model.to_builder(ctx) to get a ModelBuilder object, on which you can call several methods to configure how your model will be spawned. For example, use builder.with_adapter() to enable lightweight adapter training instead of full-parameter fine-tuning (only use this if you are training the model).
from adaptive_harmony.runtime import recipe_main, RecipeContext
from adaptive_harmony.parameters import Model

@recipe_main
async def main(ctx: RecipeContext):
    # Full parameter Llama 3.1 8B
    policy_builder = await Model(model_key="llama-3.1-8b-instruct").to_builder(
        ctx,
        tp=1,
        kv_cache_len=100_000,
        tokens_to_generate=2048,
    )

    # Adapter based Qwen 3 32B
    reward_builder = await Model(model_key="qwen3-32b").to_builder(
        ctx,
        tp=1,
        kv_cache_len=100_000,
        tokens_to_generate=2048,
    )
    reward_builder = reward_builder.with_adapter()

Spawn methods

ModelBuilder also exposes spawn_train and spawn_inference methods. Adaptive Engine unifies training and inference: instead of requiring different frameworks or runtimes for each, you simply spawn a model for training with spawn_train or for inference with spawn_inference. A model spawned with spawn_train requires more GPU memory upfront, since Adaptive ensures at spawn time that enough memory is available to fit the requested max_batch_size during training (model activations, optimizer state, etc.). If you are not training a given model in your recipe (for example, you are spawning a judge model), always spawn it with spawn_inference to reduce GPU memory pressure.

The max_batch_size parameter defines the maximum number of tokens that can be allocated in a single training batch, i.e. a mini-batch that is processed by the model in parallel. The batch size used for each optimization step is user-defined and independent of this parameter. max_batch_size also limits the maximum sequence length the model can train on: in the worst case, for a dataset of samples whose lengths are close to max_batch_size, the model trains on a single sample at a time. Any sequences longer than max_batch_size are simply dropped by the training classes, which also reconcile the desired optimization-step batch size in number of samples (see the illustrative sketch after the example below).

spawn_train returns a TrainingModel and spawn_inference returns an InferenceModel; both are async methods.
from adaptive_harmony.runtime import recipe_main, RecipeContext
from adaptive_harmony.parameters import Model

@recipe_main
async def main(ctx: RecipeContext):
    # Spawn for training
    train_builder = await Model(model_key="llama-3.1-8b-instruct").to_builder(
        ctx,
        tp=1,
        kv_cache_len=100_000,
        tokens_to_generate=2048,
    )
    train_model = await train_builder.spawn_train(name="train_model", max_batch_size=4096)

    # Spawn for inference
    inference_builder = await Model(model_key="llama-3.1-8b-instruct").to_builder(
        ctx,
        tp=1,
        kv_cache_len=100_000,
        tokens_to_generate=2048,
    )
    inference_model = await inference_builder.spawn_inference(name="inf_model")
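
To make the token budget concrete, below is a small illustrative sketch (not the actual Adaptive Harmony batching implementation) of how a max_batch_size token budget could bound mini-batch packing, and why over-long sequences are dropped:

# Illustrative only: not the library's real batching logic.
def pack_into_mini_batches(sample_lengths: list[int], max_batch_size: int) -> list[list[int]]:
    batches: list[list[int]] = []
    current: list[int] = []
    used = 0
    for length in sample_lengths:
        if length > max_batch_size:
            continue  # longer than the token budget: dropped from training
        if used + length > max_batch_size:
            batches.append(current)
            current, used = [], 0
        current.append(length)
        used += length
    if current:
        batches.append(current)
    return batches

# With max_batch_size=4096, the 5000-token sample is dropped and the rest are
# split so that no mini-batch exceeds the 4096-token budget.
print(pack_into_mini_batches([1500, 5000, 2000, 1800, 900], max_batch_size=4096))
# [[1500, 2000], [1800, 900]]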

Tensor parallelism (tp)

Tensor parallelism (tp) determines how many GPUs a model is split across during execution. The right tp value depends on your model size and available hardware: larger models typically require a higher tp to fit into memory, while smaller models may run efficiently with tp=1. Also, as explained above, a model that fits on 2 GPUs with tp=2 when spawned with spawn_inference might not fit when spawned with spawn_train. Always ensure that the value you set for tp matches the number of devices you want to use and is supported by your infrastructure.
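
For example, a model that is too large for a single GPU can be split across several devices by raising tp. The sketch below reuses the builder API shown above; the 70B model key is illustrative, so substitute one that exists in your deployment:

from adaptive_harmony.runtime import recipe_main, RecipeContext
from adaptive_harmony.parameters import Model

@recipe_main
async def main(ctx: RecipeContext):
    # Split a larger model across 4 GPUs (model key is illustrative)
    judge_builder = await Model(model_key="llama-3.1-70b-instruct").to_builder(
        ctx,
        tp=4,
        kv_cache_len=100_000,
        tokens_to_generate=2048,
    )
    # Not trained in this recipe, so spawn for inference to keep memory pressure low
    judge_model = await judge_builder.spawn_inference(name="judge_model")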

Passing model parameters as config input

In custom recipes, you can pass models in the recipe config using the Model class from adaptive_harmony.parameters. Adaptive will validate that the user-configured parameter for model_to_train below is a valid model key in Adaptive Engine.

Using default deployment parameters

When you call await model.to_builder(ctx), the model is configured with the default deployment parameters set in the Adaptive platform (KV cache length, tensor parallelism, tokens to generate). You can change these defaults globally for a model from its model details page (click the model in the organizational model registry page) by editing the “Inference Configuration” setting in the right-hand menu. Inference defaults often do not make sense for the more memory-intensive training regime, so any default can be overridden by passing parameters directly to to_builder().

Overridable parameters:
  • tp - Tensor parallelism (number of GPUs to split the model across)
  • kv_cache_len - KV cache length for the model
  • tokens_to_generate - Maximum tokens to generate per completion
from adaptive_harmony.runtime import InputConfig, recipe_main, RecipeContext
from adaptive_harmony.parameters import Model

class MyConfig(InputConfig):
    model_to_train: Model
    tp: int
    max_seq_len: int
    train_adapter: bool

@recipe_main
async def main(config: MyConfig, ctx: RecipeContext):
    # Override default deployment parameters by passing them to to_builder()
    model_builder = await config.model_to_train.to_builder(
        ctx,
        tp=config.tp,                      # Override tensor parallelism
        kv_cache_len=100_000,              # Override KV cache length
        tokens_to_generate=2048            # Override max tokens to generate
    )

    if config.train_adapter:
        model_builder = model_builder.with_adapter()

    # max_batch_size also bounds the longest sequence the model can train on
    await model_builder.spawn_train(name="model_trained", max_batch_size=config.max_seq_len)

Spawning in Jupyter notebooks

When developing interactively in Jupyter notebooks, you can spawn models directly using client.model() without creating a full RecipeContext. This is convenient for quick experimentation and prototyping. First, create a client using get_client:
from adaptive_harmony import get_client

# Create a client directly
client = await get_client(
    addr="wss://YOUR_ADAPTIVE_DEPLOYMENT_URL",
    num_gpus=2,
    api_key="YOUR_API_KEY",
    use_case="your-use-case",
)

# Spawn a model using client.model()
policy_builder = client.model(
    path="llama-3.1-8b-instruct",
    kv_cache_len=100_000,
    tokens_to_generate=2048,
).tp(1)

policy_model = await policy_builder.spawn_train(name="policy", max_batch_size=4096)

Note that, in recipe scripts, ctx.client provides access to the same client object. The Model().to_builder() approach shown above remains the recommended pattern for recipe scripts, as it integrates with the platform’s inference configuration system.
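
For completeness, here is a minimal sketch of the equivalent call through ctx.client inside a recipe; since parameters are specified explicitly here rather than pulled from the platform’s inference configuration, Model().to_builder() remains the preferred pattern:

from adaptive_harmony.runtime import recipe_main, RecipeContext

@recipe_main
async def main(ctx: RecipeContext):
    # ctx.client is the same client object returned by get_client()
    builder = ctx.client.model(
        path="llama-3.1-8b-instruct",
        kv_cache_len=100_000,
        tokens_to_generate=2048,
    ).tp(1)
    # Spawned for inference; all deployment parameters are passed explicitly above
    adhoc_model = await builder.spawn_inference(name="adhoc_model")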