Adaptive Engine can be easily configured and deployed on Kubernetes. This guide will walk you through the deployment process.

Requirements

  • A Kubernetes cluster.
  • Helm v3 or higher.
  • NVIDIA Device Plugin pre-installed (required for GPU resource discovery, see installation guide).
  • (Optional, if using external secrets) External Secrets Operator pre-installed (see installation guide).
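
If the NVIDIA Device Plugin is installed correctly, each GPU node advertises an nvidia.com/gpu resource. A quick way to confirm this (output varies by cluster):

# each GPU node should report a non-zero nvidia.com/gpu capacity
kubectl describe nodes | grep -i "nvidia.com/gpu"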

Adaptive Helm Chart

To install the Adaptive Helm chart on your Kubernetes cluster, follow these steps:

Verify tool setup and cluster access

# check if helm is installed
helm version
# make sure `kubectl` is configured correctly and you can access the cluster
kubectl get pods

Get the Helm Chart

Add the Adaptive helm repo and update it:

helm repo add adaptive https://raw.githubusercontent.com/adaptive-ml/adaptive-helm-chart/main/charts
helm repo update adaptive

Get the default values.yaml configuration file:

helm show values adaptive/adaptive > values.yaml

Modify values.yaml

Edit the values.yaml file to customize the Helm chart for your environment. Here are the relevant values you should modify:

Container registry information

Add the details of the Adaptive container registry you have subscribed to and been granted access to.

containerRegistry: <aws_account_id>.dkr.ecr.<region>.amazonaws.com
harmony:
  image:
    repository: adaptive-repository # Adaptive Repository you have been granted access to
    tag: harmony:latest # Harmony image tag

controlPlane:
  image:
    repository: adaptive-repository # Adaptive Repository you have been granted access to
    tag: control-plane:latest # Control plane image tag
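
If you want to sanity-check registry access from your workstation (assuming the AWS CLI and Docker are available locally and your IAM principal has pull permissions on the repository), you can log in to the registry:

aws ecr get-login-password --region <region> | \
    docker login --username AWS --password-stdin <aws_account_id>.dkr.ecr.<region>.amazonaws.com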

Resource limits

Adjust the resource limits based on your cluster’s capabilities and workload/model requirements. harmony.gpusPerReplica should be equal to, or a divisor of, the number of GPUs available on each node where Adaptive Harmony will be deployed. For example:

harmony:
    replicaCount: 1
    # Should be equal to, or a divisor of the # of GPUs on each node
    gpusPerReplica: 8
    resources:
        limits:
            cpu: 8
            memory: 64Gi
        requests:
            cpu: 8
            memory: 60Gi
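
For instance, on nodes with 8 GPUs you could instead run two Harmony replicas with 4 GPUs each (a divisor of 8), assuming the same chart keys as above:

harmony:
    replicaCount: 2
    gpusPerReplica: 4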

Configuration secrets

Add values for the required configuration secrets:

secrets:
  # S3 bucket for model registry
  modelRegistryUrl: "s3://bucket-name/model_registry"
  # Can use the same bucket as above, optionally with a different prefix
  sharedDirectoryUrl: "s3://bucket-name/shared"

  # Postgres database connection string
  dbUrl: "postgres://username:password@db_adress:5432/db_name"
  # Secret used to sign cookies. Must be the same on all servers of a cluster and >= 64 chars
  cookiesSecret: "change-me-secret-db40431e-c2fd-48a6-acd6-854232c2ed94-01dd4d01-dr7b-4315" # Must be >= 64 chars

  auth:
    oidc:
      providers:
        # Name of your OpenID provider, displayed in the UI
        - name: "Google"
          # Key of your provider, the callback url will be '<rootUrl>/api/v1/auth/login/<key>/callback'
          key: "google"
          issuer_url: "https://accounts.google.com" # openid connect issuer url
          client_id: "replace_client_id" # client id
          client_secret: "replace_client_secret" # client_secret, optional
          scopes: ["email", "profile"] # scopes required for auth, requires email and profile
          # true if your provider supports pkce (recommended)
          pkce: true
          # if true, user account will be created if it does not exist
          allow_sign_up: true
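
The cookiesSecret can be any sufficiently long random string. One way to generate a value of at least 64 characters (assuming openssl is available) is:

# prints a 96-character hex string
openssl rand -hex 48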

If you do not want to create Kubernetes secrets from values.yaml and prefer to integrate secrets stored in an external/cloud secrets manager, see Using external secrets.

Install the Helm chart

Deploy the Adaptive Helm chart:

helm install adaptive \
    adaptive/adaptive \
    --values ./values.yaml
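
Once the release is installed, you can check its status and confirm that the pods start up:

helm status adaptive
kubectl get pods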

Using external secrets

The Adaptive Helm chart supports integration with external secret stores through External Secrets Operator. The chart implements an example where secrets are hosted on AWS Secrets Manager.

To use secrets stored in an external/cloud secrets manager, you first need to install External Secrets Operator:

helm repo add external-secrets https://charts.external-secrets.io

helm install external-secrets \
    external-secrets/external-secrets \
    -n external-secrets \
    --create-namespace
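
You can confirm the operator is running before continuing:

kubectl get pods -n external-secrets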

Then, download the alternative values_external_secret.yaml file:

wget https://raw.githubusercontent.com/adaptive-ml/adaptive-helm-chart/main/charts/adaptive/values_external_secret.yaml

Customize the new values file, adding the details of your external secret’s name and properties. You can replace AWS Secrets Manager with your secrets manager of choice; refer to the External Secrets Operator documentation for the full list of supported providers.
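
For reference, External Secrets Operator reads from AWS Secrets Manager through a SecretStore resource along these lines (a generic, operator-level sketch rather than the exact manifest rendered by the chart; the store name, region, and service account are placeholders):

apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: aws-secrets-manager
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1
      auth:
        jwt:
          serviceAccountRef:
            name: adaptive-external-secrets # hypothetical service account with access to Secrets Manager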

Finally, deploy the Adaptive Helm chart using the new values file:

helm install adaptive \
    adaptive/adaptive \
    --values ./values_external_secret.yaml

Considerations for deployment on shared clusters

When deploying Adaptive Engine in a shared cluster where other workloads are running, there are a few best practices you can implement to enforce resource isolation:

Deploy Adaptive in a separate namespace

When installing the Adaptive Helm chart, you can do so in a separate namespace by passing the --namespace option. Example:

helm install adaptive \
  adaptive/adaptive \
  --values ./values.yaml \
  --namespace adaptive-engine

You can also pass the --create-namespace flag if the namespace does not exist yet.

Use Node Selectors to schedule Adaptive on specific GPU nodes

You can use the harmony.nodeSelector value in values.yaml to schedule Adaptive Harmony only on a specific node group. For example, if you are deploying Adaptive on an Amazon EKS cluster, you might add:

harmony:
  nodeSelector: 
    eks.amazonaws.com/nodegroup: p5-h100
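
To see which nodes carry that label (and therefore where Harmony can be scheduled), you can list nodes with the label shown as a column:

kubectl get nodes -L eks.amazonaws.com/nodegroup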

Dedicated GPU node tenancy

While the Adaptive control plane can run on any node with available CPU and memory resources, it is recommended that Harmony requests and takes ownership of all of the GPUs available on each GPU-enabled node. Even if you have already restricted Adaptive Harmony to a designated GPU node group using the instructions in the step above, you might want to guarantee that no other workloads can be scheduled on those nodes.

To dedicate a set of GPU nodes for Adaptive Harmony, you can use a combination of:

  1. Adding a taint to the GPU nodes
  2. Adding a corresponding toleration to Harmony in the values.yaml of the Adaptive Helm Chart

To add a taint to a node, you can first run kubectl get nodes -o name to see all the existing node names, and then taint them as exemplified below (replacing node_name):

kubectl taint nodes node_name dedicated=adaptive-engine:NoSchedule
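
You can verify the taint was applied with:

kubectl describe node node_name | grep -i taints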

You can then add a matching toleration to Harmony in the values.yaml file (harmony.tolerations) which will allow it to be scheduled on the tainted nodes:

harmony:
  tolerations:
  - key: dedicated
    operator: Equal
    value: adaptive-engine
    effect: NoSchedule

You can find more about taints and tolerations in the official Kubernetes documentation.

Advanced configuration

DB SSL/TLS configuration

See the official PostgreSQL documentation on SSL support for additional details.

Basic setting

If your PostgreSQL database supports TLS, you can enforce encrypted connections by adding the parameter sslmode=require to your PostgreSQL connection string in your Helm values.yaml file:

  dbUrl: "postgres://<user>:<password>@<host>/<db>?sslmode=require"

sslmode=require encrypts the connection but does not verify the server’s identity.

Server certificate verification

If you want the application to be able to verify the server certificate, you need to set the sslmode to verify-ca or verify-full.

  • verify-ca will verify the server certificate
  • verify-full will verify the server certificate and also that the server host name matches the name stored in the server certificate

For most secure environments, verify-full is recommended.

To allow server verification, you will need to provide the application with a root certificate. You can do this by following these steps:

  1. Download the db server certificate (for AWS RDS, refer to this page), for instance rds-ca-rsa2048-g1.pem

  2. Upload the .pem file to your Kubernetes cluster. As this is public information, it can be uploaded to a ConfigMap:

kubectl create configmap -n <namespace> db-ca --from-file=rds-ca-rsa2048-g1.pem

  3. Mount this file in the control plane container by editing the Helm values.yaml file:
...

volumes:
  - name: db-ca
    configMap:
      name: db-ca

volumeMounts:
  - name: db-ca
    mountPath: /mnt/db-ca/
    readOnly: true

  4. Refer to this certificate in the Postgres connection URL using the sslrootcert parameter:

  dbUrl: "postgres://<user>:<password>@<host>/<db>?sslmode=verify-full&sslrootcert=/mnt/db-ca/rds-ca-rsa2048-g1.pem"