Adapters
Adapters share the same backbone model, increasing batching opportunities and reducing memory. The default limit is 4 adapters per base model. Contact support to increase this.

KV-cache
The stateful KV-cache reuses computations from previous tokens and chat turns in the same session.

Tensor parallelism
TP (tensor parallelism) is the number of GPUs over which a model replica is partitioned.

- Higher TP → lower per-GPU memory → more room for KV-cache
- Higher TP → lower latency
- Adaptive Engine automatically creates replicas based on TP, available GPUs, and traffic
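The replica arithmetic behind that last point can be sketched as follows. This is an illustration of the relationship, not the engine's actual scheduler; the function name is ours:

```python
def max_replicas(available_gpus: int, tp: int) -> int:
    # Each replica is partitioned across `tp` GPUs, so the cluster
    # fits one replica per whole tp-sized group of GPUs.
    return available_gpus // tp

# On 8 GPUs: TP=2 leaves room for 4 replicas; TP=8 fits a single replica.
```

Raising TP trades replica count for lower per-replica latency and more per-GPU KV-cache headroom; lowering it multiplies replicas for throughput.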
Speculative decoding
Reduce inference latency without changing output quality. A small “draft” model proposes multiple tokens, the target model verifies them in one forward pass, and accepted tokens are kept. See Speculative Decoding Visualized for an in-depth explanation.

Train a draft model
The resulting draft model is named `<target>_accelerated` by default.
Deploy with speculative decoding
`num_draft_steps` sets how many tokens the draft proposes per step (1 to 8). The recipe logs acceptance rate during training: the fraction of draft tokens the target accepts. Higher acceptance rate means more speedup.
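As a toy illustration of the draft/verify loop and the acceptance rate (not the production implementation, which verifies all drafts in one forward pass): the target keeps the longest matching prefix of the draft's proposals, and the first mismatch is replaced by the target's own token, so output is identical to decoding with the target alone.

```python
def speculative_decode(draft, target, prompt, new_tokens, num_draft_steps=4):
    out = list(prompt)
    proposed = accepted = 0
    while len(out) - len(prompt) < new_tokens:
        drafts = []
        for _ in range(num_draft_steps):   # draft proposes k tokens
            drafts.append(draft(out + drafts))
        proposed += len(drafts)
        for tok in drafts:                 # target verifies each proposal
            if target(out) == tok:
                out.append(tok)
                accepted += 1
            else:
                out.append(target(out))    # keep the target's own token
                break
    return out[len(prompt):][:new_tokens], accepted / proposed

# Deterministic stand-ins: the draft disagrees on every 5th position.
target = lambda seq: len(seq)
draft = lambda seq: len(seq) + (len(seq) % 5 == 0)
tokens, rate = speculative_decode(draft, target, [0], 10)
```

Here half the proposed tokens are accepted (`rate == 0.5`), and `tokens` is exactly what the target would have produced on its own; a higher acceptance rate means more tokens survive per verification pass, hence more speedup.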
Timeouts
Three timeout controls for 408/timeout errors:

| Setting | Default | Description |
|---|---|---|
| `ADAPTIVE_SERVER__TIMEOUT_SECONDS` | 60s | HTTP request timeout in the control plane |
| `ADAPTIVE_HTTP_CLIENT__TIMEOUT_SECS` | 60s | Timeout from the control plane to its dependencies |
| `max_ttft` | 60s | Maximum time before the first token (includes queue time) |
Set `max_ttft` via the SDK:
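The exact SDK call isn't reproduced here. As a minimal sketch, assuming the parameter is passed in a request-options mapping: every name below except `max_ttft` is a placeholder, so consult the SDK reference for the real surface.

```python
# Hypothetical request sketch: only `max_ttft` comes from this page;
# the other field names are illustrative placeholders.
request = {
    "model": "my-model",   # placeholder deployment name
    "prompt": "Hello",
    "max_ttft": 30,        # seconds before the first token, queue time included
}
```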
Compute pools
Compute pools are isolated partitions of your cluster. Deploy models to multiple pools for reliability.

Additional tips
Models:

- Use FP8 quantization
- Use smaller models - they often outperform frontier models after fine-tuning
- Use a properly-sized database near your inference machines (permission checks happen on every inference call)

