Boost throughput, reduce latency, and shrink GPU footprint!
Two adapters trained on a base model.
Setting a model's KV-cache size in the Adaptive Engine model registry.
Setting a model's TP configuration in the Adaptive Engine model registry.
ADAPTIVE_SERVER__TIMEOUT_SECONDS (default 60s): timeout for HTTP requests handled inside the control plane. Set this timeout as a control plane environment variable.
ADAPTIVE_HTTP_CLIENT__TIMEOUT_SECS (default 60s): timeout for HTTP requests made from the control plane to its dependencies, such as Harmony and external models. Set this timeout as a control plane environment variable.
max_ttft (default 60s): how long a request can wait before generating its first token.
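Because the three timeouts apply to the same request path, it can help to sanity-check them together before deploying. The sketch below is only an illustration and not part of Adaptive Engine: the helper names `read_timeout` and `check_timeouts` are hypothetical. It assumes the environment variables hold a plain integer number of seconds. It also assumes that a timeout shorter than the layer it wraps is worth flagging; that ordering is an assumption, not a documented rule.

```python
import os

# Default documented above for all three settings.
DEFAULT_TIMEOUT_S = 60


def read_timeout(name, environ=None):
    """Return the timeout in seconds for `name`, or the 60s default if unset.

    Assumption: the value is a plain integer number of seconds.
    """
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return DEFAULT_TIMEOUT_S
    value = int(raw)
    if value <= 0:
        raise ValueError(f"{name} must be a positive number of seconds, got {raw!r}")
    return value


def check_timeouts(environ=None, max_ttft=DEFAULT_TIMEOUT_S):
    """Hypothetical helper: read both control plane timeouts and flag odd combinations.

    Assumption (not from the documentation): a request allowed to wait
    `max_ttft` seconds for its first token should not be cut off earlier by
    the HTTP client timeout, nor the client call by the server timeout.
    """
    server = read_timeout("ADAPTIVE_SERVER__TIMEOUT_SECONDS", environ)
    client = read_timeout("ADAPTIVE_HTTP_CLIENT__TIMEOUT_SECS", environ)
    warnings = []
    if client < max_ttft:
        warnings.append(
            f"HTTP client timeout ({client}s) is shorter than max_ttft ({max_ttft}s)"
        )
    if server < client:
        warnings.append(
            f"server timeout ({server}s) is shorter than HTTP client timeout ({client}s)"
        )
    return server, client, warnings


if __name__ == "__main__":
    server, client, warnings = check_timeouts()
    print(f"server={server}s client={client}s")
    for message in warnings:
        print("warning:", message)
```

Passing a dictionary as `environ` lets you try a candidate configuration without touching the real environment, for example `check_timeouts({"ADAPTIVE_HTTP_CLIENT__TIMEOUT_SECS": "30"})`.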
Choosing on which compute pool(s) to deploy and load balance a model endpoint.