> ## Documentation Index
> Fetch the complete documentation index at: https://docs.adaptive-ml.com/llms.txt
> Use this file to discover all available pages before exploring further.

# GPU management

> Compute pools, inference tuning, and GPU utilization

Tune inference throughput, latency, and GPU utilization.

## Adapters

Adapters share the same backbone model, increasing batching opportunities and reducing memory. The default limit is 4 adapters per base model. Contact support to increase this.

## KV-cache

The stateful KV-cache reuses computations from previous tokens and chat turns in the same session.

```python theme={null}
adaptive.models.update_compute_config(
    model="<model key>",
    compute_config={
        "kv_cache_len": 300000
    }
)
```

Set the cache size in tokens. The combined memory of model weights and KV-cache must fit in GPU memory.

**Tuning**: Larger cache enables more batching and computation reuse. Tune iteratively based on your traffic patterns.

## Tensor parallelism

TP (tensor parallelism) is the number of GPUs over which a model replica is partitioned.

```python theme={null}
adaptive.models.update_compute_config(
    model="<model key>",
    compute_config={
        "tp": 4
    }
)
```

**Trade-offs:**

* Higher TP → lower per-GPU memory → more room for KV-cache
* Higher TP → lower latency
* Adaptive Engine automatically creates replicas based on TP, available GPUs, and traffic

## Speculative decoding

Reduce inference latency without changing output quality. A small "draft" model proposes multiple tokens, the target model verifies them in one forward pass, and accepted tokens are kept. See [Speculative Decoding Visualized](https://www.adaptive-ml.com/post/speculative-decoding-visualized) for an in-depth explanation.

### Train a draft model

```python theme={null}
adaptive.jobs.run(
    recipe_key="specdec_alignment",
    num_gpus=2,
    args={
        "target_model": "qwen-3-8b",
        "draft_model": "qwen-3-500m",
        "dataset": "my-prompts",
    },
)
```

The draft must share the target's tokenizer (same model family). The output is an accelerated model named `<target>_accelerated` by default.

### Deploy with speculative decoding

```python theme={null}
adaptive.models.deploy(
    model="qwen-3-8b_accelerated",
    num_draft_steps=4,
    wait=True,
)
```

`num_draft_steps` sets how many tokens the draft proposes per step (1 to 8). The recipe logs **acceptance rate** during training: the fraction of draft tokens the target accepts. Higher acceptance rate means more speedup.

## Timeouts

Three timeout controls for 408/timeout errors:

| Setting                              | Default | Description                                       |
| ------------------------------------ | ------- | ------------------------------------------------- |
| `ADAPTIVE_SERVER__TIMEOUT_SECONDS`   | 60s     | HTTP request timeout in control plane             |
| `ADAPTIVE_HTTP_CLIENT__TIMEOUT_SECS` | 60s     | Control plane to dependencies timeout             |
| `max_ttft`                           | 60s     | Max time before first token (includes queue time) |

Set `max_ttft` via SDK:

```python theme={null}
adaptive.models.update(
    model="<model key>",
    placement={
        "max_ttft_ms": 60000
    }
)
```

## Compute pools

Compute pools are isolated partitions of your cluster. Deploy models to multiple pools for reliability.

```python theme={null}
adaptive.models.update(
    model="<model key>",
    placement={
        "compute_pools": ["pool-1", "pool-2"],
    }
)
```

Configure autoscaling via the Helm chart. See [self-hosting](/v0.14/deploy/self-hosting).

## Additional tips

**Models:**

* Use FP8 quantization
* Use smaller models - they often outperform frontier models after fine-tuning

**Database:**

* Use a properly-sized database near your inference machines (permission checks happen on every inference call)

## FAQ

**Why does TP=1 use more per-GPU memory than TP=N?**

With TP=N, the model is sharded across N GPUs, so each GPU holds 1/N of the weights.

**What's the connection between TP and KV-cache?**

Higher TP frees more memory per GPU, leaving more room for KV-cache.

**What if there's not enough memory for model + KV-cache?**

Model loading fails with an insufficient memory error.

**Why is GPU utilization low?**

If traffic reuses the KV-cache, Adaptive Engine returns cached results instead of recomputing. This means lower GPU usage but also lower latency.
