Adapters
Adapters share the same backbone model, increasing batching opportunities and reducing memory. The default limit is 4 adapters per base model. Contact support to increase this.

KV-cache
The stateful KV-cache reuses computations from previous tokens and chat turns in the same session.

Tensor parallelism
TP (tensor parallelism) is the number of GPUs over which a model replica is partitioned.

- Higher TP → lower per-GPU memory → more room for KV-cache
- Higher TP → lower latency
- Adaptive Engine automatically creates replicas based on TP, available GPUs, and traffic
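The replica arithmetic behind that last point can be sketched as follows. This is an illustration of the relationship, not the engine's actual scheduler; the function name is ours:

```python
def max_replicas(available_gpus: int, tp: int) -> int:
    # Each replica is partitioned across `tp` GPUs, so the cluster
    # fits one replica per whole tp-sized group of GPUs.
    return available_gpus // tp

# On 8 GPUs: TP=2 leaves room for 4 replicas; TP=8 fits a single replica.
```

Raising TP trades replica count for lower per-replica latency and more per-GPU KV-cache headroom; lowering it multiplies replicas for throughput.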
Speculative decoding
Reduce inference latency without changing output quality. A small “draft” model proposes multiple tokens, the target model verifies them in one forward pass, and accepted tokens are kept. See Speculative Decoding Visualized for an in-depth explanation.

Train a draft model
The resulting draft model is named `<target>_accelerated` by default.
Deploy with speculative decoding
`num_draft_steps` sets how many tokens the draft proposes per step (1 to 8). The recipe logs acceptance rate during training: the fraction of draft tokens the target accepts. Higher acceptance rate means more speedup.
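As a toy illustration of the draft/verify loop and the acceptance rate (not the production implementation, which verifies all drafts in one forward pass): the target keeps the longest matching prefix of the draft's proposals, and the first mismatch is replaced by the target's own token, so output is identical to decoding with the target alone.

```python
def speculative_decode(draft, target, prompt, new_tokens, num_draft_steps=4):
    out = list(prompt)
    proposed = accepted = 0
    while len(out) - len(prompt) < new_tokens:
        drafts = []
        for _ in range(num_draft_steps):   # draft proposes k tokens
            drafts.append(draft(out + drafts))
        proposed += len(drafts)
        for tok in drafts:                 # target verifies each proposal
            if target(out) == tok:
                out.append(tok)
                accepted += 1
            else:
                out.append(target(out))    # keep the target's own token
                break
    return out[len(prompt):][:new_tokens], accepted / proposed

# Deterministic stand-ins: the draft disagrees on every 5th position.
target = lambda seq: len(seq)
draft = lambda seq: len(seq) + (len(seq) % 5 == 0)
tokens, rate = speculative_decode(draft, target, [0], 10)
```

Here half the proposed tokens are accepted (`rate == 0.5`), and `tokens` is exactly what the target would have produced on its own; a higher acceptance rate means more tokens survive per verification pass, hence more speedup.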
Timeouts
Three timeout controls for 408/timeout errors:

| Setting | Default | Description |
|---|---|---|
| `ADAPTIVE_SERVER__TIMEOUT_SECONDS` | 60s | HTTP request timeout in the control plane |
| `ADAPTIVE_HTTP_CLIENT__TIMEOUT_SECS` | 60s | Timeout from the control plane to its dependencies |
| `max_ttft` | 60s | Maximum time before the first token (includes queue time) |
Set `max_ttft` via the SDK:
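The exact SDK call isn't reproduced here. As a minimal sketch, assuming the parameter is passed in a request-options mapping: every name below except `max_ttft` is a placeholder, so consult the SDK reference for the real surface.

```python
# Hypothetical request sketch: only `max_ttft` comes from this page;
# the other field names are illustrative placeholders.
request = {
    "model": "my-model",   # placeholder deployment name
    "prompt": "Hello",
    "max_ttft": 30,        # seconds before the first token, queue time included
}
```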
Compute pools
Compute pools are isolated partitions of your cluster. Deploy models to multiple pools for reliability.

Additional tips
Models:

- Use FP8 quantization
- Use smaller models - they often outperform frontier models after fine-tuning
- Use a properly-sized database near your inference machines (permission checks happen on every inference call)

