Tune inference throughput, latency, and GPU utilization.

Adapters

Adapters share the same backbone model, which increases batching opportunities and reduces memory usage. By default, up to 4 adapters can be served per base model; contact support to raise this limit.
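As a rough illustration of why sharing the backbone saves memory, the sketch below compares serving four fine-tuned variants as adapters on one base model against serving four full model copies. The parameter counts and BF16 precision are illustrative assumptions, not Adaptive Engine defaults.
# Back-of-the-envelope memory comparison: N fine-tuned variants served as
# adapters on one shared backbone vs. N full model copies.
# All sizes below are illustrative assumptions.
BYTES_PER_PARAM = 2            # BF16 weights
BACKBONE_PARAMS = 8e9          # assume an 8B-parameter base model
ADAPTER_PARAMS = 40e6          # assume ~40M params per adapter (rank/targets vary)
N_VARIANTS = 4                 # default adapter limit per base model

gib = 1024 ** 3
shared = (BACKBONE_PARAMS + N_VARIANTS * ADAPTER_PARAMS) * BYTES_PER_PARAM / gib
separate = N_VARIANTS * BACKBONE_PARAMS * BYTES_PER_PARAM / gib
print(f"shared backbone + {N_VARIANTS} adapters: ~{shared:.1f} GiB")
print(f"{N_VARIANTS} full copies: ~{separate:.1f} GiB")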

KV-Cache

The stateful KV-cache reuses computations from previous tokens and chat turns in the same session.
adaptive.models.update_compute_config(
    model="<model key>",
    compute_config={
        "kv_cache_len": 300000
    }
)
kv_cache_len sets the cache size in tokens. The combined memory of the model weights and the KV-cache must fit in GPU memory. A larger cache enables more batching and more computation reuse; tune it iteratively based on your traffic patterns.
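To pick a kv_cache_len, it helps to estimate how much GPU memory the cache will occupy. The sketch below uses the standard KV-cache sizing formula; the layer count, KV-head count, head dimension, and FP16 cache precision are assumptions for an 8B-class model with grouped-query attention, not values reported by Adaptive Engine.
# Rough KV-cache memory estimate. Architecture numbers are assumptions;
# adjust them for your model.
num_layers = 32
num_kv_heads = 8
head_dim = 128
bytes_per_value = 2          # FP16/BF16 cache entries
kv_cache_len = 300_000       # value from the compute_config above

# 2x for keys and values
cache_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * kv_cache_len
print(f"KV-cache: ~{cache_bytes / 1024**3:.1f} GiB")   # ~36.6 GiB for these settings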

Tensor Parallelism

TP (tensor parallelism) is the number of GPUs over which a model replica is partitioned.
adaptive.models.update_compute_config(
    model="<model key>",
    compute_config={
        "tp": 4
    }
)
Trade-offs:
  • Higher TP → lower per-GPU memory → more room for KV-cache (see the sketch after this list)
  • Higher TP → lower latency
  • Adaptive Engine automatically creates replicas based on TP, available GPUs, and traffic
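To make the memory trade-off concrete, here is a rough sketch of per-GPU weight memory at different TP settings. The 70B-parameter BF16 model and 80 GiB GPUs are illustrative assumptions, not Adaptive Engine defaults.
# Per-GPU weight memory at different TP settings, assuming a 70B-parameter
# model in BF16 (2 bytes/param) on 80 GiB GPUs. Numbers are illustrative.
params = 70e9
bytes_per_param = 2
gpu_mem_gib = 80

weight_gib = params * bytes_per_param / 1024**3
for tp in (1, 2, 4, 8):
    per_gpu = weight_gib / tp
    headroom = gpu_mem_gib - per_gpu   # negative means the model does not fit at this TP
    print(f"tp={tp}: ~{per_gpu:.0f} GiB weights/GPU, ~{headroom:.0f} GiB left for KV-cache")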

Timeouts

Three settings control request timeouts (408 errors):
  • ADAPTIVE_SERVER__TIMEOUT_SECONDS (default: 60s): HTTP request timeout in the control plane
  • ADAPTIVE_HTTP_CLIENT__TIMEOUT_SECS (default: 60s): timeout for calls from the control plane to its dependencies
  • max_ttft (default: 60s): maximum time before the first token is returned (includes queue time)
Set max_ttft via the SDK (note that max_ttft_ms is in milliseconds):
adaptive.models.update(
    model="<model key>",
    placement={
        "max_ttft_ms": 60000
    }
)
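If requests still time out under bursty load, a client-side retry with backoff is a common complement to raising these limits. This is a generic sketch, not part of the Adaptive SDK; send_request stands in for whatever call your client makes and is assumed to raise TimeoutError on a 408.
import random
import time

def call_with_retries(send_request, max_attempts=3, base_delay_s=1.0):
    # send_request is a placeholder for your actual inference call; it is
    # assumed to raise TimeoutError when max_ttft or an HTTP timeout is hit.
    for attempt in range(max_attempts):
        try:
            return send_request()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff with jitter before the next attempt.
            time.sleep(base_delay_s * 2 ** attempt + random.uniform(0, 0.5))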

Compute Pools

Compute pools are isolated partitions of your cluster. Deploying a model to multiple pools improves reliability: if one pool becomes unavailable, replicas in the other pools keep serving traffic.
adaptive.models.update(
    model="<model key>",
    placement={
        "compute_pools": ["pool-1", "pool-2"],
    }
)
Configure autoscaling via the Helm chart; see the self-hosting documentation.

Additional Tips

Models:
  • Use FP8 quantization (see the sketch after these tips)
  • Use smaller models: after fine-tuning, they often outperform frontier models
Database:
  • Use a properly-sized database near your inference machines (permission checks happen on every inference call)
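As a rough illustration of the FP8 tip above: halving the bytes per weight roughly halves weight memory, and the freed memory can go to the KV-cache. The 8B-parameter model and 80 GiB GPU below are illustrative assumptions.
# Illustrative only: how FP8 weights free GPU memory for the KV-cache.
params = 8e9                  # assume an 8B-parameter model
gpu_mem_gib = 80              # assume an 80 GiB GPU
for name, bytes_per_param in (("BF16", 2), ("FP8", 1)):
    weights_gib = params * bytes_per_param / 1024**3
    print(f"{name}: ~{weights_gib:.0f} GiB weights, "
          f"~{gpu_mem_gib - weights_gib:.0f} GiB left for KV-cache")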

FAQ

Why does TP=1 use more per-GPU memory than TP=N?
With TP=N, the model is sharded across N GPUs, so each GPU holds 1/N of the weights.

What's the connection between TP and KV-cache?
Higher TP frees more memory per GPU, leaving more room for the KV-cache.

What if there's not enough memory for the model plus KV-cache?
Model loading fails with an insufficient-memory error.

Why is GPU utilization low?
If traffic reuses the KV-cache, Adaptive Engine returns cached results instead of recomputing them, which means lower GPU usage and also lower latency.