Adapters
Adapters share the same backbone model, which increases batching opportunities and reduces memory use. The default limit is 4 adapters per base model; contact support to increase it.

KV-Cache
The stateful KV-cache reuses computation from previous tokens and chat turns within the same session.

Tensor Parallelism
TP (tensor parallelism) is the number of GPUs across which a model replica is partitioned.

- Higher TP → lower per-GPU memory → more room for KV-cache
- Higher TP → lower latency
- Adaptive Engine automatically creates replicas based on TP, available GPUs, and traffic
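As a rough illustration of the replica math above, a sketch with made-up numbers (the `replica_count` helper is hypothetical, not part of any SDK):

```python
# Hypothetical illustration: how many model replicas fit on a node,
# given the number of available GPUs and the tensor-parallel degree (TP).
def replica_count(available_gpus: int, tp: int) -> int:
    """Each replica is partitioned across `tp` GPUs, so the node
    can host at most floor(available_gpus / tp) replicas."""
    return available_gpus // tp

# On an 8-GPU node: TP=2 leaves room for 4 replicas, TP=8 for 1.
print(replica_count(8, 2))  # 4
print(replica_count(8, 8))  # 1
```

In practice the engine performs this placement automatically, also accounting for traffic; the arithmetic just shows the memory/replica trade-off.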
Timeouts
Three timeout settings govern 408/timeout errors:

| Setting | Default | Description |
|---|---|---|
| `ADAPTIVE_SERVER__TIMEOUT_SECONDS` | 60s | HTTP request timeout in the control plane |
| `ADAPTIVE_HTTP_CLIENT__TIMEOUT_SECS` | 60s | Timeout from the control plane to its dependencies |
| `max_ttft` | 60s | Maximum time before the first token (includes queue time) |
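The first two settings look like environment variables; assuming they are read from the environment at startup, raising them might look like this (the 120-second values are illustrative, not recommendations):

```shell
# Hedged example: raising the documented timeouts via environment
# variables before starting the control plane.
export ADAPTIVE_SERVER__TIMEOUT_SECONDS=120
export ADAPTIVE_HTTP_CLIENT__TIMEOUT_SECS=120

echo "server timeout: ${ADAPTIVE_SERVER__TIMEOUT_SECONDS}s"
echo "client timeout: ${ADAPTIVE_HTTP_CLIENT__TIMEOUT_SECS}s"
```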
Setting `max_ttft` via the SDK:
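The original SDK snippet was not preserved here. The sketch below is only a hypothetical shape: the client name, method, and keyword argument are assumptions, not the SDK's real API; consult the SDK reference for the actual signature.

```python
# Hypothetical sketch only -- client and method names are assumed.
client = AdaptiveClient(api_key="...")        # assumed constructor
client.generate(
    model="my-model",
    prompt="Hello",
    max_ttft=120,  # seconds before the first token, including queue time
)
```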
Compute Pools
Compute pools are isolated partitions of your cluster. Deploy models to multiple pools for reliability.

Additional Tips
Models:

- Use FP8 quantization
- Use smaller models - they often outperform frontier models after fine-tuning
- Use a properly-sized database near your inference machines (permission checks happen on every inference call)

