This guide covers the AWS-specific infrastructure and Helm configuration needed to deploy Adaptive Engine on Amazon EKS with EC2 Capacity Blocks. It complements the general self-hosting guide and references the Adaptive Helm chart as the source of truth for all Helm values. This page assumes P-family instance types (p5, p5e, p5en, p6-b200, p6-b300, p6e-b200, p6e-b300).

What are EC2 Capacity Blocks?

EC2 Capacity Blocks (“capacity blocks”) are region-specific, instance-specific reservations that last between 1 and 180 days. A reservation can start immediately (subject to availability) or at a future date, and spans one or more EC2 instances. See the capacity blocks documentation for precise reservation mechanics and pricing. Capacity blocks are a GPU procurement mechanism unique to AWS, and are distinct from On-Demand Capacity Reservations (ODCR), Reserved Instances (RI), and EC2 Savings Plans (SP).
Adaptive ML recommends capacity blocks with Adaptive Engine for the following reasons:
  1. GPU availability — P-family instances have practically no on-demand capacity. Capacity blocks provide reserved GPU instances with predictable lead times.
  2. Cost — Capacity blocks are priced below on-demand rates, and in some cases below annual reservations.
  3. Elasticity — An Adaptive Engine cluster can vary its GPU count on a daily basis as capacity blocks are created or expire. This allows organizations to grow capacity with workload demand and reduce cost when GPUs are not needed.
Capacity blocks are not required to run Adaptive Engine on AWS. For low-volume inference of small models, g6 and g6e instances can be used and have satisfactory on-demand capacity.

Architecture overview

A production deployment on AWS uses the following services:
| Component | AWS service |
| --- | --- |
| Kubernetes | Amazon EKS |
| GPU VMs | EC2 Capacity Blocks |
| Postgres database | Amazon RDS (PostgreSQL) |
| Redis datastore | Amazon ElastiCache for Redis |
| Model registry | Amazon S3 |
| Secrets | AWS Secrets Manager |
| Logging | Amazon CloudWatch |
| DNS/TLS | ACM + your DNS provider |
| Ingress | AWS Load Balancer Controller |
Refer to the Adaptive Engine architecture section to see how these services interact.

Deployment checklist

Here are the steps to deploy an Adaptive Engine cluster on Amazon EKS with EC2 Capacity Blocks. Subsequent sections provide details and code snippets.
  1. Provision AWS infrastructure — Create VPC, EKS cluster, RDS, ElastiCache, and S3 bucket.
  2. Install cluster dependencies — Deploy the NVIDIA GPU Operator, AWS Load Balancer Controller, External Secrets Operator, and optionally the CloudWatch agent.
  3. Populate Secrets Manager — Store database connection details, S3 bucket paths, Redis authentication token, OIDC provider configuration, and cookies secret. Deploy the ClusterSecretStore and ExternalSecret resources.
  4. Configure Helm values and deploy the control plane — Customize values.yaml and install the Helm chart with replicaCount: 0 for Harmony (no GPUs yet). Log in at your domain and complete the OIDC authentication flow to verify that the control plane is online.
  5. Purchase capacity blocks — Reserve GPU instances via the EC2 console or CLI. Tag them for Karpenter discovery.
  6. Configure Karpenter node pools — Deploy EC2NodeClass and NodePool resources targeting your capacity block.
  7. Add compute pools and upgrade — Set Harmony compute pools in values.yaml with node selectors and tolerations pointing to your capacity block nodes. Run helm upgrade.
  8. Verify GPUs — Confirm Harmony pods are scheduled on capacity block nodes. Open the control plane UI and check that your GPUs appear in the compute pools section.

Prerequisites

Before starting, complete the general self-hosting prerequisites and choose an EKS cluster configuration. The snippets in this guide assume a standard EKS cluster with self-managed node provisioning (Karpenter), which gives you full control over node provisioning, instance selection, and load balancer configuration.

Placeholders

Code snippets in this guide use the following placeholders. Replace them with your values.
| Placeholder | Value |
| --- | --- |
| REGION | AWS region (e.g., us-east-1) |
| ACCOUNT_ID | Your 12-digit AWS account ID |
| MY_DEPLOYMENT | A name for your deployment, used as a Secrets Manager path prefix and capacity block tag (e.g., prod-adaptive) |
| MY_CLUSTER | Your EKS cluster name |
| MY_HOSTNAME | Your Adaptive Engine domain (e.g., adaptive.example.com) |
| MY_EKS_NODE_ROLE | IAM role name for EKS worker nodes |
| MY_BUCKET | Your S3 bucket name |
| ACM_CERTIFICATE_ARN | ARN of your ACM TLS certificate |
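To make later snippets copy-pasteable, the placeholders can be exported as shell variables first. The values below are illustrative only; replace them with your own.

```shell
# Illustrative placeholder values: replace with your own before running anything.
export REGION="us-east-1"
export ACCOUNT_ID="123456789012"
export MY_DEPLOYMENT="prod-adaptive"
export MY_CLUSTER="adaptive-eks"
export MY_HOSTNAME="adaptive.example.com"
export MY_BUCKET="adaptive-models"

# Derived value used by the container registry configuration below.
export ECR_REGISTRY="${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com"
```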

Helm configuration for EKS

Start from the base values.yaml and apply the overrides below. Deploy the control plane first without GPU compute pools to validate your infrastructure (OIDC, secrets, database, Redis) before purchasing capacity blocks.

Container registry (Amazon ECR)

containerRegistry: ACCOUNT_ID.dkr.ecr.REGION.amazonaws.com

harmony:
  image:
    repository: adaptive/harmony
    tag: "vX.Y.Z"
    pullPolicy: Always

controlPlane:
  image:
    repository: adaptive/control-plane
    tag: "vX.Y.Z"
    pullPolicy: Always
Ensure EKS nodes have an IAM instance profile with ecr:GetDownloadUrlForLayer, ecr:BatchGetImage, and ecr:GetAuthorizationToken permissions on the registry.
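A minimal node-role policy granting those permissions might look like the following sketch; the repository ARN pattern is an assumption, so scope it to your actual ECR repositories.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["ecr:GetAuthorizationToken"],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage"
      ],
      "Resource": "arn:aws:ecr:REGION:ACCOUNT_ID:repository/adaptive/*"
    }
  ]
}
```

Note that ecr:GetAuthorizationToken does not support resource-level scoping and must be granted on `*`.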

Control plane

controlPlane:
  replicaCount: 3
  rootUrl: "https://MY_HOSTNAME"
  resources:
    requests:
      cpu: 2
      memory: 8Gi
    limits:
      memory: 8Gi
  podDisruptionBudget:
    enabled: true
    minAvailable: 1
  extraEnvVars:
    NO_COLOR: "1"

Initial deployment (control plane only)

Deploy without compute pools to verify the infrastructure works end to end:
helm install adaptive oci://ghcr.io/adaptive-ml/adaptive \
  --values ./values.yaml \
  --namespace adaptive \
  --create-namespace
Once the control plane pods are running, confirm:
  • You can log in via your OIDC provider
  • Secrets are synced (check the Kubernetes secrets in the adaptive namespace)
  • The control plane connects to RDS and Redis (check pod logs for connection errors)
After validation, proceed to purchase capacity blocks and add compute pools.
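A quick way to check pod health and secret syncing from the command line (a sketch; the commands are skipped when kubectl is not on the PATH):

```shell
NAMESPACE="adaptive"

if command -v kubectl >/dev/null 2>&1; then
  # Control plane pods should be Running.
  kubectl get pods -n "$NAMESPACE" || true
  # ExternalSecrets should report a Ready/SecretSynced condition.
  kubectl get externalsecrets -n "$NAMESPACE" || true
fi
```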

AWS Secrets Manager

Store all sensitive configuration (database credentials, S3 paths, Redis authentication token, OIDC providers, and the cookies secret) in AWS Secrets Manager. The External Secrets Operator syncs these values into Kubernetes secrets.

IAM policy

The External Secrets Operator needs an IAM role with permission to read from Secrets Manager. Create the role using EKS Pod Identity and attach the following policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "secretsmanager:GetSecretValue",
        "secretsmanager:DescribeSecret"
      ],
      "Resource": [
        "arn:aws:secretsmanager:REGION:ACCOUNT_ID:secret:rds!db-*",
        "arn:aws:secretsmanager:REGION:ACCOUNT_ID:secret:MY_DEPLOYMENT/rds/connection-*",
        "arn:aws:secretsmanager:REGION:ACCOUNT_ID:secret:MY_DEPLOYMENT/s3/storage-*",
        "arn:aws:secretsmanager:REGION:ACCOUNT_ID:secret:MY_DEPLOYMENT/redis/auth-token-*",
        "arn:aws:secretsmanager:REGION:ACCOUNT_ID:secret:MY_DEPLOYMENT/oidc_secret-*",
        "arn:aws:secretsmanager:REGION:ACCOUNT_ID:secret:MY_DEPLOYMENT/cookies-secret-*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "secretsmanager:ListSecrets"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "secretsmanager:BatchGetSecretValue"
      ],
      "Resource": [
        "arn:aws:secretsmanager:REGION:ACCOUNT_ID:secret:rds!db-*",
        "arn:aws:secretsmanager:REGION:ACCOUNT_ID:secret:MY_DEPLOYMENT/rds/connection-*",
        "arn:aws:secretsmanager:REGION:ACCOUNT_ID:secret:MY_DEPLOYMENT/s3/storage-*",
        "arn:aws:secretsmanager:REGION:ACCOUNT_ID:secret:MY_DEPLOYMENT/redis/auth-token-*",
        "arn:aws:secretsmanager:REGION:ACCOUNT_ID:secret:MY_DEPLOYMENT/oidc_secret-*",
        "arn:aws:secretsmanager:REGION:ACCOUNT_ID:secret:MY_DEPLOYMENT/cookies-secret-*"
      ]
    }
  ]
}
Then associate the role with the external-secrets service account in the external-secrets namespace:
aws eks create-pod-identity-association \
  --cluster-name MY_CLUSTER \
  --namespace external-secrets \
  --service-account external-secrets \
  --role-arn arn:aws:iam::ACCOUNT_ID:role/MY_CLUSTER-external-secrets-role

Secret layout

Organize secrets under a deployment prefix in Secrets Manager:
| Secret path | Format | Contents |
| --- | --- | --- |
| rds!db-<UUID> | JSON | username, password — RDS-managed, supports automatic rotation |
| MY_DEPLOYMENT/rds/connection | JSON | endpoint, database_name |
| MY_DEPLOYMENT/s3/storage | JSON | model_registry (e.g., s3://my-bucket/model_registry), workdir (e.g., s3://my-bucket/shared) |
| MY_DEPLOYMENT/redis/auth-token | JSON | url (e.g., redis://:AUTH_TOKEN@ELASTICACHE_ENDPOINT:6379) |
| MY_DEPLOYMENT/oidc_secret | String | JSON array of OIDC provider configurations (see Authentication) |
| MY_DEPLOYMENT/cookies-secret | String | Random string, 64+ characters, used to sign session cookies |
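As an example, the cookies secret can be generated locally and pushed to Secrets Manager. This is a sketch: the aws call requires valid credentials and is skipped when the CLI is unavailable.

```shell
# Generate a random 64-character value for signing session cookies.
# base64 of 48 random bytes yields exactly 64 characters.
COOKIES_SECRET="$(head -c 48 /dev/urandom | base64 | tr -d '\n')"

# Store it under the deployment prefix (requires AWS credentials).
if command -v aws >/dev/null 2>&1; then
  aws secretsmanager create-secret \
    --name "MY_DEPLOYMENT/cookies-secret" \
    --secret-string "$COOKIES_SECRET" || true
fi
```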

ClusterSecretStore

Create a ClusterSecretStore so ExternalSecret resources in any namespace can pull from Secrets Manager:
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: aws-secrets-manager
spec:
  provider:
    aws:
      service: SecretsManager
      region: REGION
      auth:
        jwt:
          serviceAccountRef:
            name: external-secrets
            namespace: external-secrets
The serviceAccountRef must point to the service account associated with the IAM role created in the IAM policy section above (here, external-secrets in the external-secrets namespace).

Control plane ExternalSecret

This ExternalSecret assembles the control plane Kubernetes secret from multiple Secrets Manager entries. A refreshInterval of 3m ensures rotated database passwords are picked up within minutes.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: adaptive-controlplane
  namespace: adaptive
spec:
  refreshInterval: 3m
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: adaptive-controlplane
  data:
    - secretKey: dbUsername
      remoteRef:
        key: rds!db-XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
        property: username
    - secretKey: dbPassword
      remoteRef:
        key: rds!db-XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
        property: password
    - secretKey: dbHost
      remoteRef:
        key: MY_DEPLOYMENT/rds/connection
        property: endpoint
    - secretKey: dbName
      remoteRef:
        key: MY_DEPLOYMENT/rds/connection
        property: database_name
    - secretKey: cookiesSecret
      remoteRef:
        key: MY_DEPLOYMENT/cookies-secret
    - secretKey: oidcProviders
      remoteRef:
        key: MY_DEPLOYMENT/oidc_secret

Harmony ExternalSecret

S3 paths and Redis credentials change less frequently and refresh every hour:
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: adaptive-harmony
  namespace: adaptive
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: adaptive-harmony
  data:
    - secretKey: modelRegistryUrl
      remoteRef:
        key: MY_DEPLOYMENT/s3/storage
        property: model_registry
    - secretKey: sharedDirectoryUrl
      remoteRef:
        key: MY_DEPLOYMENT/s3/storage
        property: workdir

Redis ExternalSecret

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: adaptive-redis
  namespace: adaptive
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: adaptive-redis
  data:
    - secretKey: redisUrl
      remoteRef:
        key: MY_DEPLOYMENT/redis/auth-token
        property: url

RDS password rotation

The rds!db-<UUID> secret is managed by RDS and supports automatic password rotation. When rotation is enabled:
  1. RDS rotates the database password on a schedule you configure (e.g., every 30 days).
  2. RDS writes the new credentials to the rds!db-<UUID> secret in Secrets Manager.
  3. The External Secrets Operator polls every 3 minutes (refreshInterval: 3m) and updates the Kubernetes secret.
  4. Adaptive Engine picks up the new credentials on the next database connection attempt.
To ensure pods restart automatically when secrets change, deploy Stakater Reloader and add the following annotation to the control plane deployment in your values.yaml:
controlPlane:
  annotations:
    reloader.stakater.com/auto: "true"
Reloader watches for changes to secrets referenced by the deployment and triggers a rolling restart when they update. Enable rotation in the RDS console or via AWS CLI:
aws secretsmanager rotate-secret \
  --secret-id "rds!db-XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX" \
  --rotation-rules AutomaticallyAfterDays=30
Use refreshInterval: 3m (or shorter) for database credentials. A longer interval risks the application using stale credentials after a rotation event.

Updating secrets

To update a secret value (e.g., changing the S3 bucket or OIDC configuration):
# Update a JSON secret
aws secretsmanager put-secret-value \
  --secret-id "MY_DEPLOYMENT/s3/storage" \
  --secret-string '{"model_registry": "s3://new-bucket/model_registry", "workdir": "s3://new-bucket/shared"}'

# Update a plain string secret (e.g., OIDC providers)
aws secretsmanager put-secret-value \
  --secret-id "MY_DEPLOYMENT/oidc_secret" \
  --secret-string '[{"name": "Cognito", "key": "cognito", "issuer_url": "https://cognito-idp.REGION.amazonaws.com/USER_POOL_ID", "client_id": "CLIENT_ID", "client_secret": "CLIENT_SECRET", "scopes": ["email", "profile", "openid"], "pkce": true, "allow_sign_up": true}]'
The External Secrets Operator syncs changes on the next refresh cycle. To force an immediate sync, annotate the ExternalSecret:
kubectl annotate externalsecret EXTERNAL_SECRET_NAME \
  force-sync=$(date +%s) --overwrite -n adaptive

Database (RDS)

Disable the in-chart PostgreSQL. Database credentials are sourced from Secrets Manager via the control plane ExternalSecret. Do not place credentials in values.yaml. Recommended RDS settings:
| Setting | Value |
| --- | --- |
| Engine | PostgreSQL 17+ |
| Instance class | db.m8g.2xlarge (or larger for high-throughput workloads) |
| Multi-AZ | Enabled for production |
| Storage | gp3 with encryption enabled |
| Backup retention | 30 days |
| Cross-region replication | Recommended for disaster recovery |
| Password rotation | Enabled via Secrets Manager (see RDS password rotation) |
See database TLS configuration for certificate verification setup.
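Assuming the chart follows the same toggle convention as the Redis section below, the in-chart PostgreSQL is disabled with a values override like this (the exact key name is an assumption; confirm it against the chart's values.yaml):

```yaml
postgresql:
  enabled: false
```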

Cache (Amazon ElastiCache)

Disable the in-chart Redis. The ElastiCache endpoint and authentication token come from Secrets Manager via the Redis ExternalSecret.
redis:
  enabled: false
Recommended ElastiCache settings:
| Setting | Value |
| --- | --- |
| Engine | Redis 7 |
| Node type | cache.r7g.large (or larger) |
| Encryption in transit | Enabled |
| Authentication token | Enabled, stored in Secrets Manager |

Object storage (Amazon S3)

S3 bucket paths are sourced from Secrets Manager via the Harmony ExternalSecret. Grant the Harmony and control plane service accounts access to the bucket via IAM Roles for Service Accounts (IRSA) or EKS Pod Identity. The pods require the following S3 permissions on the bucket and its contents. Multipart upload actions are needed because Harmony uses multipart uploads for large model files.
{
  "Effect": "Allow",
  "Action": [
    "s3:AbortMultipartUpload",
    "s3:DeleteObject",
    "s3:GetBucketLocation",
    "s3:GetObject",
    "s3:GetObjectTagging",
    "s3:ListBucket",
    "s3:ListBucketMultipartUploads",
    "s3:ListMultipartUploadParts",
    "s3:PutObject",
    "s3:PutObjectTagging",
    "s3:UploadPart"
  ],
  "Resource": [
    "arn:aws:s3:::MY_BUCKET",
    "arn:aws:s3:::MY_BUCKET/*"
  ]
}

Logging (Amazon CloudWatch)

Forward logs from the control plane and Harmony pods to CloudWatch using the CloudWatch Observability add-on or a standalone CloudWatch agent deployed as a DaemonSet. The Adaptive Helm chart includes a built-in OpenTelemetry Collector (enabled by default). Configure it to export logs to your CloudWatch agent:
otelCollector:
  enabled: true
  exporters:
    otlphttp/cw:
      endpoint: "CLOUDWATCH_OTLP_ENDPOINT"
  pipelines:
    logs:
      exporters:
        - otlphttp/cw
Replace CLOUDWATCH_OTLP_ENDPOINT with your CloudWatch agent’s OTLP endpoint (e.g., http://cloudwatch-agent.amazon-cloudwatch.svc.cluster.local:4318). Set a log retention policy that meets your compliance requirements (e.g., 400 days).
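Retention can be set on the log group with the AWS CLI. The log group name below is a placeholder, and the call is skipped when the CLI is unavailable.

```shell
# Placeholder log group name; use the group your CloudWatch agent writes to.
LOG_GROUP="/adaptive/MY_DEPLOYMENT"

# 400 days here; choose a value that meets your compliance requirements.
if command -v aws >/dev/null 2>&1; then
  aws logs put-retention-policy \
    --log-group-name "$LOG_GROUP" \
    --retention-in-days 400 || true
fi
```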

Ingress (ALB)

Configure the AWS Load Balancer Controller ingress:
ingress:
  enabled: true
  className: alb
  hostname: MY_HOSTNAME
  tls:
    enabled: true
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS": 443}]'
    alb.ingress.kubernetes.io/certificate-arn: ACM_CERTIFICATE_ARN
    alb.ingress.kubernetes.io/ssl-policy: ELBSecurityPolicy-TLS13-1-2-2021-06
If you front the ALB with a CDN or VPN, restrict inbound traffic using security group prefix lists:
ingress:
  annotations:
    alb.ingress.kubernetes.io/security-group-prefixes: "pl-XXXXXXXXXXXXXXXX,pl-YYYYYYYYYYYYYYYY"

Authentication (Amazon Cognito)

If you use Amazon Cognito as your OIDC provider, store the provider configuration in Secrets Manager at MY_DEPLOYMENT/oidc_secret as a JSON array:
[
  {
    "name": "Cognito",
    "key": "cognito",
    "issuer_url": "https://cognito-idp.REGION.amazonaws.com/USER_POOL_ID",
    "client_id": "COGNITO_CLIENT_ID",
    "client_secret": "COGNITO_CLIENT_SECRET",
    "scopes": ["email", "profile", "openid"],
    "pkce": true,
    "allow_sign_up": true
  }
]
Amazon Cognito is optional; you can use any alternative OIDC provider, such as Microsoft Entra ID, Okta, Google, or Keycloak. Configure auth settings in values.yaml:
auth:
  default_role: read-only
  session:
    secure: true
    expiration_seconds: 86400
  admins:
    - "admin@example.com"
If allow_sign_up is true, any user in your Cognito user pool can access Adaptive Engine. Set it to false and create users via the Cognito CLI or SDK to restrict access.
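With sign-up disabled, users can be provisioned directly in the pool. The user pool ID and email below are placeholders, and the call is skipped when the CLI is unavailable.

```shell
USER_POOL_ID="REGION_XXXXXXXXX"  # placeholder Cognito user pool ID
NEW_USER="alice@example.com"     # placeholder email address

if command -v aws >/dev/null 2>&1; then
  aws cognito-idp admin-create-user \
    --user-pool-id "$USER_POOL_ID" \
    --username "$NEW_USER" \
    --user-attributes Name=email,Value="$NEW_USER" Name=email_verified,Value=true \
    || true
fi
```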

AMI requirements

EKS nodes that run GPU workloads must use an AMI with the NVIDIA drivers pre-installed. Use one of:
  • EKS-optimized accelerated AMI — the default for GPU node groups managed by EKS Auto or Karpenter. Includes NVIDIA drivers and the nvidia-container-toolkit. This is the recommended option.
  • Custom AMI — required only if you need a specific driver version or kernel configuration. Build from the EKS-optimized AMI and ensure CUDA 12.8+ with driver 570.172.08+.
If you use Karpenter, set the AMI family in your EC2NodeClass to AL2023 or Bottlerocket. Karpenter then selects the correct accelerated AMI automatically. Do not override the AMI ID unless you maintain a custom image pipeline.
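In Karpenter v1, the AMI family is selected in the EC2NodeClass via an alias. A fragment to merge into the EC2NodeClass shown later in this guide:

```yaml
spec:
  amiSelectorTerms:
    - alias: al2023@latest   # or bottlerocket@latest
```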

Capacity block reservation

Book a capacity block

Purchase capacity blocks through the AWS Console or CLI. Each reservation returns a Capacity Reservation ID (cr-XXXXXXXXXXXXXXXXX). Capacity block availability varies by region, instance type, and duration. There may be no offerings for your target configuration at any given time — check availability before attempting a purchase. Search available offerings:
aws ec2 describe-capacity-block-offerings \
  --instance-type p5.48xlarge \
  --instance-count 1 \
  --capacity-duration-hours 720
This returns a list of offerings with prices, start times, and a CapacityBlockOfferingId for each. Adjust --capacity-duration-hours (24–4320) and --instance-count to find available slots. If the response is empty, try a different duration or instance type. Purchase an offering:
aws ec2 purchase-capacity-block \
  --instance-type p5.48xlarge \
  --instance-count 1 \
  --instance-platform Linux/UNIX \
  --capacity-block-offering-id OFFERING_ID \
  --tag-specifications 'ResourceType=capacity-reservation,Tags=[{Key=deployment,Value=MY_DEPLOYMENT}]'
Replace OFFERING_ID with a CapacityBlockOfferingId from the search results. Tag reservations with a consistent key (e.g., deployment: MY_DEPLOYMENT) so Karpenter can discover them automatically.
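You can then confirm the reservation exists and carries the tag that Karpenter's selector will match on (skipped when the CLI is unavailable):

```shell
TAG_KEY="deployment"

# List capacity reservations carrying the deployment tag.
if command -v aws >/dev/null 2>&1; then
  aws ec2 describe-capacity-reservations \
    --filters "Name=tag:${TAG_KEY},Values=MY_DEPLOYMENT" \
    --query "CapacityReservations[].{Id:CapacityReservationId,Type:InstanceType,State:State}" \
    --output table || true
fi
```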

Karpenter integration

Create a dedicated EC2NodeClass and NodePool that target your capacity block reservations. EC2NodeClass — defines the launch template for capacity block nodes:
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-capacity-block
spec:
  role: MY_EKS_NODE_ROLE
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: MY_CLUSTER
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: MY_CLUSTER
  capacityReservationSelectorTerms:
    - tags:
        deployment: MY_DEPLOYMENT
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 200Gi
        volumeType: gp3
        iops: 3000
        throughput: 125
        encrypted: true
NodePool — schedules only onto reserved capacity:
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-capacity-block
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-capacity-block
      taints:
        - key: nvidia.com/gpu
          effect: NoSchedule
        - key: adaptive.com/gpu-capacity-block
          effect: NoSchedule
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["reserved"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      terminationGracePeriod: 24h
  disruption:
    budgets:
      - nodes: "0"
    consolidationPolicy: WhenEmpty
    consolidateAfter: 1h
For workloads that can overflow to on-demand instances when capacity block nodes are full, create a second node pool:
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-on-demand
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-default
      taints:
        - key: nvidia.com/gpu
          effect: NoSchedule
        - key: adaptive.com/on-demand
          effect: NoSchedule
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["p5.48xlarge", "p5e.48xlarge"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
  disruption:
    budgets:
      - nodes: "10%"
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30m

Compute pools targeting capacity blocks

Each compute pool maps to a set of GPU replicas. Use nodeSelector to pin pools to specific capacity block reservations. Add these to your values.yaml and run helm upgrade:
harmony:
  image:
    repository: adaptive/harmony
    tag: "vX.Y.Z"
    pullPolicy: Always
  extraEnvVars:
    HARMONY_SETTING_ALLOW_NCCL_TP: "1"
    NCCL_IGNORE_DISABLED_P2P: "1"
  computePool:
    - name: inference-pool-a
      gpusPerReplica: 2
      replicaCount: 1
      nodeSelector:
        eks.amazonaws.com/capacity-reservation-id: cr-XXXXXXXXXXXXXXXXX
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
        - key: adaptive.com/gpu-capacity-block
          operator: Exists
          effect: NoSchedule
      resources:
        requests:
          cpu: 45
          memory: 425Gi
        limits:
          cpu: 45
          memory: 425Gi

    - name: training-pool
      gpusPerReplica: 8
      replicaCount: 1
      nodeSelector:
        eks.amazonaws.com/capacity-reservation-id: cr-YYYYYYYYYYYYYYYYY
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
        - key: adaptive.com/gpu-capacity-block
          operator: Exists
          effect: NoSchedule
      resources:
        requests:
          cpu: 180
          memory: 1700Gi
        limits:
          cpu: 180
          memory: 1700Gi
If your Adaptive Engine cluster has multi-node compute pools, you must configure Elastic Fabric Adapter (EFA):
  • Add EFA device requests to your pods.
  • Use the harmony-efa image variant (provided by Adaptive), which includes the EFA libraries.
harmony:
  extraEnvVars:
    HARMONY_SETTING_ALLOW_NCCL_TP: "1"
    NCCL_IGNORE_DISABLED_P2P: "1"
    NCCL_PROTO: "LL,LL128,Simple"
  computePool:
    - name: efa-training-pool
      gpusPerReplica: 8
      replicaCount: 1
      image:
        repository: adaptive/harmony-efa
        tag: "vX.Y.Z"
      nodeSelector:
        eks.amazonaws.com/capacity-reservation-id: cr-YYYYYYYYYYYYYYYYY
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
        - key: adaptive.com/gpu-capacity-block
          operator: Exists
          effect: NoSchedule
      resources:
        requests:
          cpu: 180
          memory: 1700Gi
          vpc.amazonaws.com/efa: 32
        limits:
          cpu: 180
          memory: 1700Gi
          vpc.amazonaws.com/efa: 32
Keep roughly 10% of vCPU and memory free for the kubelet and system pods. For example, on p5.48xlarge (192 vCPUs, 2048 GiB), request 180 vCPUs and 1700 GiB per 8-GPU replica.

Capacity block lifecycle

Capacity blocks have fixed start and end times. Plan for reservation renewals:
  • Extend before rebooking — Capacity blocks can be extended if capacity allows. Check whether an existing block can be extended before purchasing a new one.
  • Before expiry — Purchase the next capacity block and update the deployment tag (or reservation ID in nodeSelector) to include the new reservation.
  • Zero-downtime transition — To avoid downtime when a capacity block expires, create a new compute pool backed by the replacement block and load-balance inference across both pools until the old block expires.
If a capacity block expires without a replacement, GPU pods enter Pending state. Maintain an on-demand fallback node pool (with replicaCount: 0 in Helm) that can be scaled up as a safety net.
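Such a fallback pool can sit in values.yaml at zero replicas until needed. A sketch, with illustrative names and sizes, pairing with the gpu-on-demand node pool above:

```yaml
harmony:
  computePool:
    - name: on-demand-fallback
      gpusPerReplica: 8
      replicaCount: 0   # scale up via helm upgrade if a capacity block lapses
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
        - key: adaptive.com/on-demand
          operator: Exists
          effect: NoSchedule
```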