GPU utilization-based horizontal autoscaling using a Prometheus custom metric. This guide provides complete steps for GPU autoscaling on AWS EKS.
The differences between CPU scaling and GPU scaling are shown below:
CPU scaling vs. GPU scaling

| | CPU | GPU | Description |
|---|---|---|---|
| Metric | Supported | Not supported | The NVIDIA DCGM exporter daemonset is required to collect GPU metrics because they are not collected through the Metrics Server by default. |
| HPA | Supported | Not supported | Horizontal Pod Autoscaling (HPA) for GPU works based on a Prometheus custom metric. |
| Fraction | Supported | Not supported | GPU resource fractions such as nvidia.com/gpu: 0.5 are not supported (see the example below). |
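As noted in the Fraction row above, a container has to request whole GPUs. A minimal resources snippet for a container spec (illustrative only):

```yaml
resources:
  limits:
    nvidia.com/gpu: 1   # whole GPUs only; fractional values such as 0.5 are not allowed
```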
- Collect the GPU metrics through the Data Center GPU Manager (DCGM) exporter and scale pods through HPA, which works based on a Prometheus custom metric.
- GPU cluster autoscaling with CA or Karpenter.
- Pod-level GPU autoscaling.
- One shared GPU node group: Two node groups are required for CPU and GPU, and the GPU node group has the accelerator: nvidia-gpu label (see the sketch after this list). Inference API applications run in one shared GPU node group so that a separate cluster does not have to be created per GPU application.
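A minimal sketch of how the GPU node group label can be defined, assuming an eksctl-managed node group (the node group name and instance type are illustrative; the actual configuration is in ref-eksctl):

```yaml
managedNodeGroups:
- name: gpu-ng                # illustrative name
  instanceType: g4dn.xlarge   # illustrative GPU instance type
  desiredCapacity: 2
  labels:
    accelerator: nvidia-gpu   # label used to schedule GPU workloads onto this node group
```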
| SW | Version |
|---|---|
| EKS | 1.21 |
| DCGM exporter | 2.6.5 |
| Prometheus | 2.34.0 |
| Prometheus Adapter | 3.2.2 |
| Grafana | 6.24.1 |
| CDK | 2.20.0 |
| NAME | CHART | APP VERSION |
|---|---|---|
| kube-prometheus-stack | 35.0.3 | 0.56.0 |
| prometheus-adapter | 3.2.2 | v0.9.1 |
EKS Blueprints is used to minimize the installation steps for the EKS cluster and add-ons.
Create a cluster with EKS Blueprints:
- VPC
- EKS cluster & nodegroup
- Cluster AutoScaler(CA) Addon
- AWS Load Balancer Controller Addon
- Kubernetes Dashboard
If you want to use an existing cluster or create a new cluster using eksctl, refer to the ref-eksctl/README.md page.
- Install Prometheus Stack
- Deploy NVIDIA DCGM exporter as daemonset
- Install Prometheus Adapter with custom metric configuration
- Create Grafana Dashboards
- Deploy inference API and GPU HPA
- AutoScaling Test
Six components are included in the kube-prometheus-stack chart:
- prometheus (prometheus-kube-prometheus-stack-prometheus-0)
- prometheus-operator
- alertmanager
- node-exporter
- kube-state-metrics
- grafana
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm upgrade --install --version=35.0.3 kube-prometheus-stack prometheus-community/kube-prometheus-stack \
--create-namespace --namespace monitoring \
--set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false
Port forward for http://localhost:9090/targets
kubectl port-forward svc/kube-prometheus-stack-prometheus 9090:9090 -n monitoring
kubectl apply -f dcgm-exporter.yaml
kubectl apply -f dcgm-exporter-karpenter.yaml
Deploy with a local YAML file instead of the Helm chart to use a ServiceMonitor for service discovery. Scrape configurations can be added in the additionalScrapeConfigs element when installing Prometheus, but we will use a ServiceMonitor to deploy the configuration per K8s Service.
kubectl get servicemonitor dcgm-exporter -o yaml
---
kind: Service
apiVersion: v1
metadata:
  name: "dcgm-exporter"
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.6.5"
spec:
  selector:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.6.5"
  ports:
  - name: "metrics"
    port: 9400
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: "dcgm-exporter"
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: "dcgm-exporter"
  endpoints:
  - port: "metrics"
Cluster Autoscaler (CA)

spec:
  nodeSelector:
    accelerator: nvidia-gpu
Karpenter
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: beta.kubernetes.io/instance-type
            operator: In
            values:
            - p2.xlarge
            - p2.4xlarge
            - p2.8xlarge
            - g4dn.xlarge
After deployment, you can see serviceMonitor/default/dcgm-exporter in the Status > Targets menu, like the following:
Port forward for 'http://localhost:9400/metrics':
kubectl port-forward svc/dcgm-exporter 9400:9400
Retrieve DCGM_FI_DEV_GPU_UTIL metric:
curl http://localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
Response example:
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-74f7fe3b-48f2-6d8b-3cb4-e70426fb669c",device="nvidia0",modelName="Tesla K80",Hostname="dcgm-exporter-cmhft",container="",namespace="",pod=""} 0
DCGM_FI_DEV_GPU_UTIL{gpu="1",UUID="GPU-f3a2185e-464d-c671-4057-0d056df64b6e",device="nvidia1",modelName="Tesla K80",Hostname="dcgm-exporter-cmhft",container="",namespace="",pod=""} 0
DCGM_FI_DEV_GPU_UTIL{gpu="2",UUID="GPU-6ae74b72-48d0-f09f-14e2-4e09ceebda63",device="nvidia2",modelName="Tesla K80",Hostname="dcgm-exporter-cmhft",container="",namespace="",pod=""} 0
Check the service name to configure the internal DNS value of the prometheus.url parameter:
kubectl get svc -lapp=kube-prometheus-stack-prometheus -n monitoring
prometheus.url format: http://<service-name>.<namespace>.svc.cluster.local
e.g.,
http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local
http://kube-prometheus-stack-prometheus.prometheus.svc.cluster.local
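For example, with the monitoring namespace above, the prometheus-adapter connection settings would look like the following sketch (the chart's prometheus.url and prometheus.port values; the custom metric rules are shown in the next snippet):

```yaml
prometheus:
  url: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local
  port: 9090
```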
Install prometheus-adapter:
helm install --version=3.2.2 prometheus-adapter prometheus-community/prometheus-adapter -f prometheus-adapter-values.yaml
prometheus-adapter-values.yaml
rules:
  custom:
  - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{exported_namespace!="",exported_container!="",exported_pod!=""}'
    name:
      as: "DCGM_FI_DEV_GPU_UTIL_AVG"
    resources:
      overrides:
        exported_namespace: {resource: "namespace"}
        exported_container: {resource: "service"}
        exported_pod: {resource: "pod"}
    metricsQuery: avg by (exported_namespace, exported_container) (round(avg_over_time(<<.Series>>[1m])))
The label override lets you retrieve the metric as DCGM_FI_DEV_GPU_UTIL_AVG{service="gpu-api"}, which is stored in Prometheus as DCGM_FI_DEV_GPU_UTIL{exported_container="gpu-api"}. For details about the prometheus-adapter rule, how it works, and how to check the /api/v1/query API logs, refer to the CustomMetric.md page.
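With the rule above, the adapter effectively evaluates a PromQL query like the following (with <<.Series>> replaced by DCGM_FI_DEV_GPU_UTIL); you can paste it into the Prometheus UI on port 9090 to preview the value the HPA will receive:

```promql
avg by (exported_namespace, exported_container) (
  round(avg_over_time(DCGM_FI_DEV_GPU_UTIL[1m]))
)
```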
IMPORTANT
You can retrieve the DCGM_FI_DEV_GPU_UTIL_AVG metric only after gpu-api is deployed (Step 5), because the metric is collected per K8s Service.
Retrieve DCGM metrics:
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq -r . | grep DCGM_FI_DEV_GPU_UTIL
"name": "services/DCGM_FI_DEV_GPU_UTIL",
"name": "namespaces/DCGM_FI_DEV_GPU_UTIL",
"name": "jobs.batch/DCGM_FI_DEV_GPU_UTIL",
"name": "pods/DCGM_FI_DEV_GPU_UTIL",
...
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/DCGM_FI_DEV_GPU_UTIL" | jq .
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/services/dcgm-exporter/DCGM_FI_DEV_GPU_UTIL" | jq .
If there is no value, connect to the DCGM exporter pod and check connectivity with wget http://<prometheus-url>:<port>, as in the sketch below. Refer to the CustomMetric.md page to check the API response.
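A minimal connectivity check, assuming the dcgm-exporter image provides wget and Prometheus runs in the monitoring namespace:

```sh
# Pick one dcgm-exporter pod and call the Prometheus HTTP API from inside it
DCGM_POD=$(kubectl get pod -l app.kubernetes.io/name=dcgm-exporter -o jsonpath='{.items[0].metadata.name}')
kubectl exec ${DCGM_POD} -- \
  wget -qO- "http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/api/v1/query?query=DCGM_FI_DEV_GPU_UTIL"
```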
Port forward for http://localhost:8081
kubectl port-forward svc/kube-prometheus-stack-grafana 8081:80 -n monitoring
Command to retrieve the Grafana admin password:
kubectl get secret --namespace monitoring kube-prometheus-stack-grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo
Import dashboards
- NVIDIA DCGM Exporter Dashboard ID: 12239
- 1 Kubernetes All-in-one Cluster Monitoring KR ID: 13770
The metrics-server installation is required only if the cluster was created with eksctl; it is included in EKS Blueprints.
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
helm upgrade --install metrics-server metrics-server/metrics-server -n monitoring
Create two repositories:
REGION=$(aws configure get default.region)
aws ecr create-repository --repository-name cpu-api --region ${REGION}
aws ecr create-repository --repository-name gpu-api --region ${REGION}
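If the build scripts below do not log in to ECR for you, you can authenticate Docker manually before pushing (standard ECR login; skip this if your build.sh already handles it):

```sh
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
aws ecr get-login-password --region ${REGION} | \
  docker login --username AWS --password-stdin ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com
```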
We will deploy Deployment, Service, HorizontalPodAutoscaler, and Ingress for cpu-api and gpu-api:
cd cpu-api
./build.sh
REGION=$(aws configure get default.region)
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
sed -e "s|<account-id>|${ACCOUNT_ID}|g" cpu-api-template.yaml | sed -e "s|<region>|${REGION}|g" > cpu-api.yaml
kubectl apply -f cpu-api.yaml
cd ../gpu-api
./build.sh
sed -e "s|<account-id>|${ACCOUNT_ID}|g" gpu-api-template.yaml | sed -e "s|<region>|${REGION}|g" > gpu-api.yaml
kubectl apply -f gpu-api.yaml
sed -e "s|<account-id>|${ACCOUNT_ID}|g" gpu-api2-template.yaml | sed -e "s|<region>|${REGION}|g" > gpu-api2.yaml
kubectl apply -f gpu-api2.yaml
Image size: 3.33 GB, image pull time: 39.50 s
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: gpu-api-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gpu-api # <-- service name
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Object
    object:
      metric:
        name: DCGM_FI_DEV_GPU_UTIL_AVG
      describedObject:
        kind: Service
        name: gpu-api # <-- service name
      target:
        type: Value
        value: '30'
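For an Object metric with a Value target, the HPA roughly computes desiredReplicas = ceil(currentReplicas × currentMetricValue / targetValue). For example, with 2 replicas, a measured DCGM_FI_DEV_GPU_UTIL_AVG of 75, and a target value of 30, the HPA scales to ceil(2 × 75 / 30) = 5 pods.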
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-api
  namespace: default
  annotations:
    app: 'gpu-api'
spec:
  replicas: 2
  selector:
    matchLabels:
      app: gpu-api
  template:
    metadata:
      labels:
        app: gpu-api
    spec:
      containers:
      - name: gpu-api # Set the container name and the service name to the same value
        image: 123456789.dkr.ecr.ap-northeast-2.amazonaws.com/gpu-api:latest
        imagePullPolicy: Always
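The custom metric is resolved per K8s Service, so gpu-api also needs a Service whose name matches the describedObject in the HPA above. A minimal sketch (the port numbers are assumptions; the actual manifest comes from gpu-api-template.yaml):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: gpu-api   # must match the HPA describedObject and the container name
  namespace: default
spec:
  selector:
    app: gpu-api
  ports:
  - name: http
    port: 80          # assumed service port
    targetPort: 8080  # assumed container port
```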
Retrieve custom metrics:
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq -r . | grep DCGM_FI_DEV_GPU_UTIL
"name": "services/DCGM_FI_DEV_GPU_UTIL",
"name": "namespaces/DCGM_FI_DEV_GPU_UTIL",
"name": "jobs.batch/DCGM_FI_DEV_GPU_UTIL",
"name": "pods/DCGM_FI_DEV_GPU_UTIL",
"name": "services/DCGM_FI_DEV_GPU_UTIL_AVG",
"name": "namespaces/DCGM_FI_DEV_GPU_UTIL_AVG",
"name": "jobs.batch/DCGM_FI_DEV_GPU_UTIL_AVG",
"name": "pods/DCGM_FI_DEV_GPU_UTIL_AVG",
...
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/DCGM_FI_DEV_GPU_UTIL_AVG" | jq .
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/services/gpu-api/DCGM_FI_DEV_GPU_UTIL_AVG" | jq .
{
  "kind": "MetricValueList",
  "apiVersion": "custom.metrics.k8s.io/v1beta1",
  "metadata": {
    "selfLink": "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/services/gpu-api/DCGM_FI_DEV_GPU_UTIL_AVG"
  },
  "items": [
    {
      "describedObject": {
        "kind": "Service",
        "namespace": "default",
        "name": "gpu-api",
        "apiVersion": "/v1"
      },
      "metricName": "DCGM_FI_DEV_GPU_UTIL_AVG",
      "timestamp": "2022-05-03T02:27:08Z",
      "value": "0",
      "selector": null
    }
  ]
}
# aws-load-balancer-controller logs
kubectl logs -f $(kubectl get po -n kube-system | egrep -o 'aws-load-balancer-controller-[A-Za-z0-9-]+') -n kube-system
kubectl get hpa cpu-api-hpa -w
kubectl get hpa gpu-api-hpa -w
kubectl get hpa gpu-api2-hpa -w
# kubectl get hpa
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
cpu-api-hpa Deployment/cpu-api 0%/50% 2 10 2 30s
gpu-api-hpa Deployment/gpu-api 0/20 2 10 2 30s
gpu-api2-hpa Deployment/gpu-api2 0/20 2 10 2 30s
cd test
bzt gpu-api-bzt.yaml
# $JMETER_HOME/bin
./jmeter -t gpu-api.jmx -n -j ../log/jmeter.log
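The actual load test definition is test/gpu-api-bzt.yaml; the following is only an illustrative Taurus (bzt) configuration, where the endpoint URL, concurrency, and durations are assumptions:

```yaml
execution:
- concurrency: 50     # assumed number of virtual users
  ramp-up: 1m
  hold-for: 10m
  scenario: gpu-api
scenarios:
  gpu-api:
    requests:
    - url: http://<ingress-alb-dns>/predict   # hypothetical inference endpoint
      method: GET
```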
kubectl describe hpa gpu-api-hpa
Name: gpu-api-hpa
Namespace: default
Labels: <none>
Annotations: app: gpu-api
CreationTimestamp: Sun, 10 Apr 2022 21:09:59 +0900
Reference: Deployment/gpu-api
Metrics: ( current / target )
"DCGM_FI_DEV_GPU_UTIL_AVG" on Service/gpu-api (target value): 11 / 20
Min replicas: 2
Max replicas: 12
Deployment pods: 8 current / 8 desired
Conditions:
Type Status Reason Message
---- ------ ------ -------
AbleToScale True ScaleDownStabilized recent recommendations were higher than current one, applying the highest recent recommendation
ScalingActive True ValidMetricFound the HPA was able to successfully calculate a replica count from Service metric DCGM_FI_DEV_GPU_UTIL_AVG
ScalingLimited False DesiredWithinRange the desired count is within the acceptable range
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulRescale 53m horizontal-pod-autoscaler New size: 5; reason: Service metric DCGM_FI_DEV_GPU_UTIL_AVG above target
Normal SuccessfulRescale 44m horizontal-pod-autoscaler New size: 6; reason: Service metric DCGM_FI_DEV_GPU_UTIL_AVG above target
Normal SuccessfulRescale 41m horizontal-pod-autoscaler New size: 8; reason: Service metric DCGM_FI_DEV_GPU_UTIL_AVG above target
kubectl describe deploy gpu-api
kubectl describe apiservices v1beta1.metrics.k8s.io
kubectl logs -n kube-system -l k8s-app=metrics-server
kubectl delete -f cpu-api/cpu-api.yaml
kubectl delete -f gpu-api/gpu-api.yaml
kubectl delete -f gpu-api/gpu-api2.yaml
kubectl delete -f dcgm-exporter.yaml
kubectl delete -f dcgm-exporter-karpenter.yaml
helm uninstall prometheus-adapter
helm uninstall kube-prometheus-stack -n monitoring
helm uninstall metrics-server -n monitoring
- You can check all event logs with the kubectl get events -w command.
- If you see "Error from server (NotFound): the server could not find the metric DCGM_FI_DEV_GPU_UTIL_AVG for services", gpu-api or your application should be deployed as a K8s Service.
- How to check PromQL logs? You can see the access log of the /api/v1/query API by setting logLevel: 6. Refer to the CustomMetric.md page.