Tune inference throughput, latency, and GPU utilization.

Adapters

Adapters share the same backbone model, which increases batching opportunities and reduces memory usage. By default, up to 4 adapters can be served per base model; contact support to raise this limit.
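As a rough illustration of why sharing the backbone saves memory, the sketch below compares serving four fine-tuned variants as adapters on one base model against serving four full model copies. The parameter counts and BF16 precision are illustrative assumptions, not Adaptive Engine defaults.
# Back-of-the-envelope memory comparison: N fine-tuned variants served as
# adapters on one shared backbone vs. N full model copies.
# All sizes below are illustrative assumptions.
BYTES_PER_PARAM = 2            # BF16 weights
BACKBONE_PARAMS = 8e9          # assume an 8B-parameter base model
ADAPTER_PARAMS = 40e6          # assume ~40M params per adapter (rank/targets vary)
N_VARIANTS = 4                 # default adapter limit per base model

gib = 1024 ** 3
shared = (BACKBONE_PARAMS + N_VARIANTS * ADAPTER_PARAMS) * BYTES_PER_PARAM / gib
separate = N_VARIANTS * BACKBONE_PARAMS * BYTES_PER_PARAM / gib
print(f"shared backbone + {N_VARIANTS} adapters: ~{shared:.1f} GiB")
print(f"{N_VARIANTS} full copies: ~{separate:.1f} GiB")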

KV-Cache

The stateful KV-cache reuses computations from previous tokens and chat turns in the same session.
adaptive.models.update_compute_config(
    model="<model key>",
    compute_config={
        "kv_cache_len": 300000
    }
)
kv_cache_len sets the cache size in tokens. The combined memory of the model weights and the KV-cache must fit in GPU memory. A larger cache enables more batching and more computation reuse; tune it iteratively based on your traffic patterns.
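To pick a kv_cache_len, it helps to estimate how much GPU memory the cache will occupy. The sketch below uses the standard KV-cache sizing formula; the layer count, KV-head count, head dimension, and FP16 cache precision are assumptions for an 8B-class model with grouped-query attention, not values reported by Adaptive Engine.
# Rough KV-cache memory estimate. Architecture numbers are assumptions;
# adjust them for your model.
num_layers = 32
num_kv_heads = 8
head_dim = 128
bytes_per_value = 2          # FP16/BF16 cache entries
kv_cache_len = 300_000       # value from the compute_config above

# 2x for keys and values
cache_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * kv_cache_len
print(f"KV-cache: ~{cache_bytes / 1024**3:.1f} GiB")   # ~36.6 GiB for these settings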

Tensor Parallelism

TP (tensor parallelism) is the number of GPUs over which a model replica is partitioned.
adaptive.models.update_compute_config(
    model="<model key>",
    compute_config={
        "tp": 4
    }
)
Trade-offs:
  • Higher TP → lower per-GPU memory → more room for KV-cache (see the sketch after this list)
  • Higher TP → lower latency
  • Adaptive Engine automatically creates replicas based on TP, available GPUs, and traffic
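To make the memory trade-off concrete, here is a rough sketch of per-GPU weight memory at different TP settings. The 70B-parameter BF16 model and 80 GiB GPUs are illustrative assumptions, not Adaptive Engine defaults.
# Per-GPU weight memory at different TP settings, assuming a 70B-parameter
# model in BF16 (2 bytes/param) on 80 GiB GPUs. Numbers are illustrative.
params = 70e9
bytes_per_param = 2
gpu_mem_gib = 80

weight_gib = params * bytes_per_param / 1024**3
for tp in (1, 2, 4, 8):
    per_gpu = weight_gib / tp
    headroom = gpu_mem_gib - per_gpu   # negative means the model does not fit at this TP
    print(f"tp={tp}: ~{per_gpu:.0f} GiB weights/GPU, ~{headroom:.0f} GiB left for KV-cache")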

Timeouts

Three settings control request timeouts (408 errors):
  • ADAPTIVE_SERVER__TIMEOUT_SECONDS (default: 60s): HTTP request timeout in the control plane
  • ADAPTIVE_HTTP_CLIENT__TIMEOUT_SECS (default: 60s): timeout for calls from the control plane to its dependencies
  • max_ttft (default: 60s): maximum time before the first token is returned (includes queue time)
Set max_ttft via the SDK (note that max_ttft_ms is in milliseconds):
adaptive.models.update(
    model="<model key>",
    placement={
        "max_ttft_ms": 60000
    }
)
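If requests still time out under bursty load, a client-side retry with backoff is a common complement to raising these limits. This is a generic sketch, not part of the Adaptive SDK; send_request stands in for whatever call your client makes and is assumed to raise TimeoutError on a 408.
import random
import time

def call_with_retries(send_request, max_attempts=3, base_delay_s=1.0):
    # send_request is a placeholder for your actual inference call; it is
    # assumed to raise TimeoutError when max_ttft or an HTTP timeout is hit.
    for attempt in range(max_attempts):
        try:
            return send_request()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff with jitter before the next attempt.
            time.sleep(base_delay_s * 2 ** attempt + random.uniform(0, 0.5))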

Compute Pools

Compute pools are isolated partitions of your cluster. Deploying a model to multiple pools improves reliability: if one pool becomes unavailable, replicas in the other pools keep serving traffic.
adaptive.models.update(
    model="<model key>",
    placement={
        "compute_pools": ["pool-1", "pool-2"],
    }
)
Configure autoscaling via the Helm chart; see the self-hosting documentation.

Additional Tips

Models:
  • Use FP8 quantization (see the sketch after these tips)
  • Use smaller models: after fine-tuning, they often outperform frontier models
Database:
  • Use a properly-sized database near your inference machines (permission checks happen on every inference call)
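As a rough illustration of the FP8 tip above: halving the bytes per weight roughly halves weight memory, and the freed memory can go to the KV-cache. The 8B-parameter model and 80 GiB GPU below are illustrative assumptions.
# Illustrative only: how FP8 weights free GPU memory for the KV-cache.
params = 8e9                  # assume an 8B-parameter model
gpu_mem_gib = 80              # assume an 80 GiB GPU
for name, bytes_per_param in (("BF16", 2), ("FP8", 1)):
    weights_gib = params * bytes_per_param / 1024**3
    print(f"{name}: ~{weights_gib:.0f} GiB weights, "
          f"~{gpu_mem_gib - weights_gib:.0f} GiB left for KV-cache")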

FAQ

Why does TP=1 use more per-GPU memory than TP=N?
With TP=N, the model is sharded across N GPUs, so each GPU holds 1/N of the weights.

What's the connection between TP and KV-cache?
Higher TP frees more memory per GPU, leaving more room for the KV-cache.

What if there's not enough memory for the model plus KV-cache?
Model loading fails with an insufficient-memory error.

Why is GPU utilization low?
If traffic reuses the KV-cache, Adaptive Engine returns cached results instead of recomputing them, which means lower GPU usage and also lower latency.