# GPU Management

> Compute pools, inference tuning, and GPU utilization

Tune inference throughput, latency, and GPU utilization.

## Adapters

Adapters share the same backbone model, which increases batching opportunities and reduces memory use. The default limit is 4 adapters per base model. Contact support to increase it.

## KV-Cache

The stateful KV-cache reuses computations from previous tokens and chat turns in the same session.

```python
adaptive.models.update_compute_config(
    model="<model key>",
    compute_config={
        "kv_cache_len": 300000
    }
)
```

`kv_cache_len` sets the cache size in tokens. The combined memory of model weights and KV-cache must fit in GPU memory.

**Tuning**: A larger cache enables more batching and more computation reuse. Tune iteratively based on your traffic patterns.
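To get a rough sense of how much GPU memory a given `kv_cache_len` might need, the sketch below uses the standard transformer estimate: one key vector and one value vector per token, per layer. The model shape (`NUM_LAYERS`, `NUM_KV_HEADS`, `HEAD_DIM`) and the 2-byte value size are illustrative assumptions, not values from Adaptive Engine. How the engine actually allocates the cache may differ, so treat the result as a starting point for tuning.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Estimate KV-cache memory for `num_tokens` cached tokens.

    Each token stores one key and one value vector per layer,
    hence the leading factor of 2.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * per_token


# Hypothetical model shape (assumption for illustration only).
NUM_LAYERS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128

cache_bytes = kv_cache_bytes(300_000, NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM)
cache_gib = cache_bytes / 2**30
print(f"{cache_gib:.1f} GiB")  # prints "36.6 GiB"
```

Under these assumptions, a 300,000-token cache needs about 36.6 GiB in addition to the model weights.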

## Tensor Parallelism

TP (tensor parallelism) is the number of GPUs over which a model replica is partitioned.

```python
adaptive.models.update_compute_config(
    model="<model key>",
    compute_config={
        "tp": 4
    }
)
```

**Trade-offs:**

* Higher TP → lower per-GPU memory → more room for KV-cache
* Higher TP → lower latency
* Adaptive Engine automatically creates replicas based on TP, available GPUs, and traffic
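The memory effect of TP can be sketched with simple arithmetic: each GPU holds roughly 1/TP of the weights, and the remainder is available for KV-cache. The function name and the 60 GiB / 80 GiB figures below are hypothetical. The estimate ignores activations and runtime overhead, so real headroom will be lower.

```python
def per_gpu_headroom_gib(weights_gib, gpu_memory_gib, tp):
    """Memory left on each GPU after it holds 1/tp of the model weights."""
    if tp < 1:
        raise ValueError("tp must be at least 1")
    return gpu_memory_gib - weights_gib / tp


# Hypothetical 60 GiB model on 80 GiB GPUs (assumed numbers).
for tp in (1, 2, 4):
    headroom = per_gpu_headroom_gib(weights_gib=60, gpu_memory_gib=80, tp=tp)
    print(f"tp={tp}: {headroom:.1f} GiB free per GPU")
```

With these numbers, per-GPU headroom grows from 20 GiB at TP=1 to 65 GiB at TP=4.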

## Timeouts

Three settings control when requests fail with 408 or timeout errors:

| Setting                              | Default | Description                                       |
| ------------------------------------ | ------- | ------------------------------------------------- |
| `ADAPTIVE_SERVER__TIMEOUT_SECONDS`   | 60s     | HTTP request timeout in control plane             |
| `ADAPTIVE_HTTP_CLIENT__TIMEOUT_SECS` | 60s     | Control plane to dependencies timeout             |
| `max_ttft`                           | 60s     | Max time before first token (includes queue time) |

Set `max_ttft` via the SDK. The SDK field is `max_ttft_ms`, expressed in milliseconds:

```python
adaptive.models.update(
    model="<model key>",
    placement={
        "max_ttft_ms": 60000
    }
)
```
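Because the table lists the default in seconds while the SDK field takes milliseconds, a small conversion helper can prevent unit mistakes. `ttft_placement` is a hypothetical helper, not part of the SDK. It only builds the `placement` dictionary shown above.

```python
def ttft_placement(max_ttft_seconds):
    """Build a placement dict with max_ttft expressed in milliseconds."""
    if max_ttft_seconds <= 0:
        raise ValueError("max_ttft_seconds must be positive")
    return {"max_ttft_ms": int(round(max_ttft_seconds * 1000))}


placement = ttft_placement(60)
print(placement)  # prints {'max_ttft_ms': 60000}
```

The resulting dictionary can be passed as the `placement` argument of `adaptive.models.update`.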

## Compute Pools

Compute pools are isolated partitions of your cluster. Deploy models to multiple pools for reliability.

```python
adaptive.models.update(
    model="<model key>",
    placement={
        "compute_pools": ["pool-1", "pool-2"],
    }
)
```

Configure autoscaling via the Helm chart. See [self-hosting](/v0.12/deploy/self-hosting).

## Additional Tips

**Models:**

* Use FP8 quantization
* Use smaller models; after fine-tuning, they often outperform frontier models

**Database:**

* Use a properly-sized database near your inference machines (permission checks happen on every inference call)

## FAQ

**Why does TP=1 use more per-GPU memory than TP=N?**

With TP=N, the model is sharded across N GPUs, so each GPU holds 1/N of the weights.

**What's the connection between TP and KV-cache?**

Higher TP frees more memory per GPU, leaving more room for KV-cache.

**What if there's not enough memory for model + KV-cache?**

Model loading fails with an insufficient memory error.

**Why is GPU utilization low?**

If traffic reuses the KV-cache, Adaptive Engine reuses cached computations instead of recomputing them. This lowers both GPU usage and latency.
