
Tune inference throughput, latency, and GPU utilization.

Adapters

Adapters built on the same base model share its backbone weights at serving time, which increases batching opportunities and reduces memory usage. The default limit is 4 adapters per base model; contact support to increase it.
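For illustration, assuming adapters are deployed with the same adaptive.models.deploy call shown later on this page (the adapter model keys below are hypothetical), two adapters built on the same base model are served against one shared set of backbone weights:
# Hypothetical adapter keys; both were trained on the same base model, so the
# engine can batch their requests against one shared set of backbone weights.
adaptive.models.deploy(model="qwen-3-8b_support-adapter", wait=True)
adaptive.models.deploy(model="qwen-3-8b_billing-adapter", wait=True)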

KV-cache

The stateful KV-cache reuses computations from previous tokens and chat turns in the same session.
adaptive.models.update_compute_config(
    model="<model key>",
    compute_config={
        "kv_cache_len": 300000
    }
)
kv_cache_len sets the cache size in tokens. The combined memory of the model weights and the KV-cache must fit in GPU memory. A larger cache enables more batching and computation reuse; tune it iteratively based on your traffic patterns.
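As a rough sizing aid, the sketch below estimates KV-cache memory from kv_cache_len; the layer count, KV-head count, head dimension, and dtype are illustrative assumptions, not values taken from any Adaptive model.
# Rough KV-cache memory estimate. All model dimensions below are illustrative
# assumptions, not values read from Adaptive Engine.
kv_cache_len = 300_000       # tokens, matching the compute_config above
num_layers   = 32
num_kv_heads = 8             # e.g. grouped-query attention
head_dim     = 128
bytes_per_el = 2             # FP16/BF16

# Keys and values are both cached, hence the factor of 2.
kv_bytes = kv_cache_len * num_layers * num_kv_heads * head_dim * 2 * bytes_per_el
print(f"KV-cache: ~{kv_bytes / 1e9:.1f} GB")   # ~39.3 GB with these dimensions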

Tensor parallelism

TP (tensor parallelism) is the number of GPUs over which a model replica is partitioned.
adaptive.models.update_compute_config(
    model="<model key>",
    compute_config={
        "tp": 4
    }
)
Trade-offs:
  • Higher TP → lower per-GPU memory → more room for KV-cache
  • Higher TP → lower latency
  • Adaptive Engine automatically creates replicas based on TP, available GPUs, and traffic
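To make the first trade-off concrete, here is a back-of-the-envelope sketch; the parameter count, weight dtype, and GPU size are illustrative assumptions.
# Per-GPU weight memory shrinks with TP, leaving more headroom for the KV-cache.
# Parameter count, dtype, and GPU size below are illustrative assumptions.
params          = 8e9        # e.g. an 8B-parameter model
bytes_per_param = 2          # FP16/BF16 weights
gpu_memory_gb   = 80

for tp in (1, 2, 4):
    weights_gb = params * bytes_per_param / tp / 1e9
    print(f"TP={tp}: ~{weights_gb:.0f} GB of weights per GPU, "
          f"~{gpu_memory_gb - weights_gb:.0f} GB left for KV-cache and activations")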

Speculative decoding

Reduce inference latency without changing output quality. A small “draft” model proposes multiple tokens, the target model verifies them in one forward pass, and accepted tokens are kept. See Speculative Decoding Visualized for an in-depth explanation.
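The loop below is only an illustrative sketch of that protocol, not Adaptive Engine's implementation; draft_propose and target_verify are hypothetical helpers.
# Illustrative sketch of the speculative-decoding loop. draft_propose and
# target_verify are hypothetical stand-ins, not Adaptive Engine APIs.
def speculative_decode(prompt_tokens, num_draft_steps=4, max_new_tokens=256):
    tokens = list(prompt_tokens)
    generated = 0
    while generated < max_new_tokens:
        # 1. The small draft model cheaply proposes several candidate tokens.
        draft = draft_propose(tokens, n=num_draft_steps)
        # 2. The target model checks all candidates in one forward pass,
        #    accepts a prefix of them, and supplies one token of its own
        #    (a correction or bonus token), so output quality matches the target.
        num_accepted, target_token = target_verify(tokens, draft)
        tokens += draft[:num_accepted] + [target_token]
        generated += num_accepted + 1
    return tokens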

Train a draft model

adaptive.jobs.run(
    recipe_key="specdec_alignment",
    num_gpus=2,
    args={
        "target_model": "qwen-3-8b",
        "draft_model": "qwen-3-500m",
        "dataset": "my-prompts",
    },
)
The draft must share the target’s tokenizer (same model family). The output is an accelerated model named <target>_accelerated by default.

Deploy with speculative decoding

adaptive.models.deploy(
    model="qwen-3-8b_accelerated",
    num_draft_steps=4,
    wait=True,
)
num_draft_steps sets how many tokens the draft proposes per step (1 to 8). The recipe logs acceptance rate during training: the fraction of draft tokens the target accepts. Higher acceptance rate means more speedup.
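As a rough way to reason about the payoff, the sketch below computes the expected number of tokens emitted per target forward pass, assuming (purely for illustration) that each draft token is accepted independently with probability equal to the logged acceptance rate.
# Expected tokens emitted per target forward pass, assuming (for illustration)
# that each draft token is accepted independently with probability `acceptance`.
def expected_tokens_per_pass(acceptance: float, num_draft_steps: int) -> float:
    if acceptance >= 1.0:
        return num_draft_steps + 1
    return (1 - acceptance ** (num_draft_steps + 1)) / (1 - acceptance)

for rate in (0.6, 0.8, 0.9):
    print(f"acceptance={rate}: ~{expected_tokens_per_pass(rate, 4):.1f} tokens per pass")
Without speculation each pass emits exactly one token, so these values roughly approximate the per-pass speedup, ignoring the draft model's own cost.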

Timeouts

Three settings control timeouts (408/timeout errors):
Setting                              Default   Description
ADAPTIVE_SERVER__TIMEOUT_SECONDS     60s       HTTP request timeout in the control plane
ADAPTIVE_HTTP_CLIENT__TIMEOUT_SECS   60s       Timeout from the control plane to its dependencies
max_ttft                             60s       Max time before the first token (includes queue time)
Set max_ttft via SDK:
adaptive.models.update(
    model="<model key>",
    placement={
        "max_ttft_ms": 60000
    }
)

Compute pools

Compute pools are isolated partitions of your cluster. Deploy models to multiple pools for reliability.
adaptive.models.update(
    model="<model key>",
    placement={
        "compute_pools": ["pool-1", "pool-2"],
    }
)
Configure autoscaling via the Helm chart. See self-hosting.

Additional tips

Models:
  • Use FP8 quantization
  • Use smaller models; after fine-tuning, they often outperform frontier models
Database:
  • Use a properly sized database near your inference machines (permission checks happen on every inference call)

FAQ

Why does TP=1 use more per-GPU memory than TP=N?
With TP=N, the model is sharded across N GPUs, so each GPU holds 1/N of the weights.

What’s the connection between TP and KV-cache?
Higher TP frees more memory per GPU, leaving more room for KV-cache.

What if there’s not enough memory for model + KV-cache?
Model loading fails with an insufficient memory error.

Why is GPU utilization low?
If traffic reuses the KV-cache, Adaptive Engine returns cached results instead of recomputing. This means lower GPU usage, but also lower latency.