Using adapters
By default, Adaptive Engine trains and serves adapters, although users can also choose to run full-parameter trainings. Adapters share the same backbone, which increases batching opportunities and reduces the amount of weights that must be kept in memory across the cluster under load. The default number of adapters deployed in parallel is 4; contact Adaptive ML support to increase that limit.
Two adapters trained on a base model.
Intelligent stateful KV-cache
What is it?
Using a KV-cache allows some of the computations used to decode token N, namely the key-value outputs of the self-attention layers, to be reused when decoding token N+1. The stateful KV-cache extends that concept to reusing computations from previously seen prompts, such as previous chat turns, during the decoding phase of a subsequent user turn in the same session. Increasing the KV-cache size increases the likelihood of computation reuse, and also increases the number of sequences that can be batched together. Adaptive Engine is equipped with an intelligent stateful KV-cache that decides, based on GPU occupancy and the shape of incoming traffic, whether to keep incoming session prompts pinned to a session-assigned model replica, or to rebalance them to a fresh replica, which may return results faster even though it comes at the cost of extra computation.

Setting and tuning the Adaptive Engine stateful KV-cache
The Adaptive Engine stateful KV-cache can be set in the model registry:
Setting a model's KV-cache size in the Adaptive Engine model registry.

Adaptive SDK
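The KV-cache size can also be set programmatically. The snippet below is a minimal sketch only: the import path, client constructor, and the method and parameter names (update_compute_config, kv_cache_size_gb, the example endpoint URL) are assumptions, so refer to the Adaptive SDK reference for the exact API.

```python
# Hypothetical sketch: set a model's KV-cache size through the Adaptive SDK.
# Import path, client constructor, method, and parameter names are assumptions,
# not the documented API.
from adaptive_sdk import Adaptive

client = Adaptive(base_url="https://adaptive.example.com", api_key="<API_KEY>")

# Reserve a larger KV-cache for this model's inference replicas (assumed parameter name).
client.models.update_compute_config(
    model="my-model",
    kv_cache_size_gb=40,
)
```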
Tensor parallelism
Adaptive Engine allows you to customize the tensor parallelism (TP) of your models. TP is the number of GPUs over which a model replica is partitioned.

Changing the tensor parallelism
You can change the tensor parallelism in the Adaptive Engine model registry UI:
Setting a model's TP configuration in the Adaptive Engine model registry.

Adaptive SDK
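TP can also be changed programmatically. The snippet below is a minimal sketch; the client, method, and parameter names (update_compute_config, tensor_parallelism) are assumptions, so check the Adaptive SDK reference for the exact call.

```python
# Hypothetical sketch: change a model's tensor parallelism through the Adaptive SDK.
# Client, method, and parameter names are assumptions, not the documented API.
from adaptive_sdk import Adaptive

client = Adaptive(base_url="https://adaptive.example.com", api_key="<API_KEY>")

# Partition each replica of this model over 2 GPUs (assumed parameter name).
client.models.update_compute_config(
    model="my-model",
    tensor_parallelism=2,
)
```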
Optimizing tensor parallelism
It is recommended to try different TP configurations, while being aware of the following considerations:
- Increasing TP reduces the per-GPU memory footprint, which allows for a larger KV-cache and thus more batching and computation reuse.
- Increasing TP generally leads to lower latency.
Automated replication
Adaptive Engine automatically creates more inference replicas as needed and when possible, based on your TP configuration, available GPUs, and incoming traffic. You do not need to set or tune batching and data parallelism; Adaptive Engine does this automatically for you.
Timeouts
Adaptive Engine exposes 3 timeout controls. If you experience 408 or timeout errors, we recommend adjusting them:
- ADAPTIVE_SERVER__TIMEOUT_SECONDS (default 60s): timeout for the HTTP requests handled inside the control plane. Set this timeout as a control plane environment variable.
- ADAPTIVE_HTTP_CLIENT__TIMEOUT_SECS (default 60s): timeout for the HTTP requests made from the control plane to its dependencies, such as Harmony and external models. Set this timeout as a control plane environment variable.
- max_ttft (default 60s): how long a request can wait before generating its first token.
  - If this is exceeded, the request is dropped and a 408 error returned.
  - It includes queueing time.
  - It is used as a trigger metric when inference autoscaling is activated.
  - It can be set via the SDK, as shown below:
Adaptive SDK
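The snippet below is a minimal sketch of setting max_ttft. Only the max_ttft parameter name comes from this page; the client, the method, and the place where the value is attached (here, a deployment call) are assumptions to be checked against the Adaptive SDK reference.

```python
# Hypothetical sketch: set max_ttft when deploying a model through the Adaptive SDK.
# Only the max_ttft name comes from the documentation; the surrounding calls are assumptions.
from adaptive_sdk import Adaptive

client = Adaptive(base_url="https://adaptive.example.com", api_key="<API_KEY>")

# Drop requests that wait more than 120 seconds (including queueing time)
# for their first token; such requests return a 408 error.
client.models.deploy(
    model="my-model",
    max_ttft=120,
)
```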
Compute pools and autoscaling
Adaptive Engine allows the creation of compute pools, which are isolated partitions of your cluster or node that can be assigned training, evaluation, and inference tasks. You can adjust the number of compute pools hosting a model endpoint to increase its reliability and hardware footprint. Compute pools can be adjusted via the UI and the SDK. In the UI, this is done in the use case Overview section:
Choosing on which compute pool(s) to deploy and load balance a model endpoint.

Adaptive SDK
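Compute pool assignment can also be scripted. The snippet below is a minimal sketch; the client, method, and compute_pools parameter name are assumptions, so consult the Adaptive SDK reference for the exact API.

```python
# Hypothetical sketch: deploy a model endpoint onto specific compute pools via the Adaptive SDK.
# Client, method, and parameter names are assumptions, not the documented API.
from adaptive_sdk import Adaptive

client = Adaptive(base_url="https://adaptive.example.com", api_key="<API_KEY>")

# Host the endpoint on two pools to increase its reliability and hardware footprint.
client.models.deploy(
    model="my-model",
    compute_pools=["pool-a", "pool-b"],
)
```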
More tips
Models
To further reduce compute footprint and latency, consider using:
- fp8 quantization support in Adaptive Engine.
- Smaller LLMs, which we often see outperform frontier models after reinforcement fine-tuning on Adaptive Engine.
Database
Adaptive Engine implements a rigorous permission check that verifies permissions server-side for every inference call. To minimize database overhead, make sure to use a properly sized database located sufficiently close to your inference machines.

Summary table
Overview of inference optimization avenues and their impact.