The `nvidia.com/gpu` value should comply with the number specified in the Bare Metal flavor.

Add the Prometheus community Helm repository and export the default values of the kube-prometheus-stack chart:

```shell
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm inspect values prometheus-community/kube-prometheus-stack > values.yaml
```
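For reference, the `nvidia.com/gpu` value mentioned above is the GPU request in the workload's container spec. A minimal, hypothetical fragment could look like this; the count shown is a placeholder and should match the GPU count of the flavor:

```yaml
# Fragment of a container spec (hypothetical): request one GPU.
# The value should not exceed the number of GPUs in the Bare Metal flavor.
resources:
  limits:
    nvidia.com/gpu: 1
```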
In `values.yaml`, add a `job` that scrapes DCGM metrics from the `gpu-operator` namespace.
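As a sketch of such a job, assuming the kube-prometheus-stack chart's `additionalScrapeConfigs` field and the default `nvidia-dcgm-exporter` endpoint name, the addition to `values.yaml` could look roughly like this; the job name and scrape interval are assumptions:

```yaml
# values.yaml -- hypothetical scrape job for DCGM metrics
prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
      - job_name: gpu-metrics                    # assumed job name
        scrape_interval: 5s
        metrics_path: /metrics
        scheme: http
        kubernetes_sd_configs:
          - role: endpoints
            namespaces:
              names:
                - gpu-operator                   # namespace mentioned above
        relabel_configs:
          - source_labels: [__meta_kubernetes_endpoints_name]
            action: keep
            regex: nvidia-dcgm-exporter          # keep only DCGM exporter endpoints
```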
`DCGM_FI_DEV_GPU_UTIL` represents the real-time utilization of GPU cores. This is collected by the `nvidia-dcgm-exporter`, which is part of the GPU Operator setup. You may adjust the query expression or choose other metrics depending on your use case.
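For example, assuming KEDA's Prometheus scaler is what consumes this metric (as the `maxReplicaCount` setting mentioned below suggests), the query expression could average GPU utilization across the exporter's targets. In the sketch below, the server address and threshold are placeholders:

```yaml
# Hypothetical trigger section of a KEDA ScaledObject spec;
# serverAddress and threshold are placeholders.
triggers:
  - type: prometheus
    metadata:
      serverAddress: http://kube-prometheus-stack-prometheus.prometheus.svc:9090
      query: avg(DCGM_FI_DEV_GPU_UTIL)   # adjust the expression or metric as needed
      threshold: "80"                    # target average GPU utilization (%)
```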
This example uses the `DCGM_FI_DEV_GPU_UTIL` metric (GPU utilization). However, you can use any available metric from the DCGM exporter.

Watch the HPA with `k get hpa -w`, follow the events with `k get events -w`, and check the pods with `k get pods`. The number of replicas is capped by the `maxReplicaCount` set in Step 8.
Check the nodes with `k get nodes` and the HPA status with `k get hpa`. Scaling will not go beyond the `maxReplicaCount` limit.
If all GPUs are in use, and there are not enough available resources, Kubernetes will keep the pods in the `Pending` state.