Monitoring GPUs in Managed Service for Kubernetes clusters
You can set up your Managed Service for Kubernetes nodes with GPUs to collect GPU usage data and visualize it on a GPU dashboard.
You can use the dashboard to monitor current resource utilization, schedule quota increases, and quickly identify anomalies. It also helps the Nebius AI support team investigate any GPU issues you may be experiencing.
Start collecting metrics
You can set up the nodes either via the free Nebius AI Marketplace product or by configuring the tools manually.

Marketplace product: install the NVIDIA® GPU Operator. It contains the tools optimized for collecting GPU monitoring metrics.

Manual configuration: install the tools that collect and export the metrics (a verification sketch follows these steps):
- Prepare the Managed Service for Kubernetes cluster with GPUs.
- Create the custom metrics configuration file:

  ```bash
  cat <<EOF > dcgm-metrics.csv
  # Format
  # If line starts with a '#' it is considered a comment
  # DCGM FIELD, Prometheus metric type, help message

  # Clocks
  DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).
  DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).

  # Temperature
  DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
  DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).

  # Power
  DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
  DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).

  # PCIe
  DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.

  # Utilization (the sample period varies depending on the product)
  DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
  DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
  DCGM_FI_DEV_ENC_UTIL, gauge, Encoder utilization (in %).
  DCGM_FI_DEV_DEC_UTIL, gauge, Decoder utilization (in %).

  # Errors and violations
  DCGM_FI_DEV_XID_ERRORS, gauge, Value of the last XID error encountered.

  # Memory usage
  DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
  DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).

  # NVLink
  DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, counter, Total number of NVLink bandwidth counters for all lanes.

  # vGPU license status
  DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU license status.

  # Remapped rows
  DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors.
  DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for correctable errors.
  DCGM_FI_DEV_ROW_REMAP_FAILURE, gauge, Whether remapping of rows has failed.

  # DCP metrics
  DCGM_FI_PROF_PCIE_TX_BYTES, counter, The number of bytes of active PCIe tx data including both header and payload.
  DCGM_FI_PROF_PCIE_RX_BYTES, counter, The number of bytes of active PCIe rx data including both header and payload.
  DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, Ratio of time the graphics engine is active (in %).
  DCGM_FI_PROF_SM_ACTIVE, gauge, The ratio of cycles an SM has at least 1 warp assigned (in %).
  DCGM_FI_PROF_SM_OCCUPANCY, gauge, The ratio of number of warps resident on an SM (in %).
  DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active (in %).
  DCGM_FI_PROF_DRAM_ACTIVE, gauge, Ratio of cycles the device memory interface is active sending or receiving data (in %).
  DCGM_FI_PROF_PIPE_FP64_ACTIVE, gauge, Ratio of cycles the fp64 pipes are active (in %).
  DCGM_FI_PROF_PIPE_FP32_ACTIVE, gauge, Ratio of cycles the fp32 pipes are active (in %).
  DCGM_FI_PROF_PIPE_FP16_ACTIVE, gauge, Ratio of cycles the fp16 pipes are active (in %).

  # Datadog additional recommended fields
  DCGM_FI_DEV_COUNT, counter, Number of devices on the node.
  DCGM_FI_DEV_FAN_SPEED, gauge, Fan speed for the device in percent 0-100.
  DCGM_FI_DEV_SLOWDOWN_TEMP, gauge, Slowdown temperature for the device.
  DCGM_FI_DEV_POWER_MGMT_LIMIT, gauge, Current power limit for the device.
  DCGM_FI_DEV_PSTATE, gauge, Performance state (P-State) 0-15. 0=highest
  DCGM_FI_DEV_FB_TOTAL, gauge,
  DCGM_FI_DEV_FB_RESERVED, gauge,
  DCGM_FI_DEV_FB_USED_PERCENT, gauge,
  DCGM_FI_DEV_CLOCK_THROTTLE_REASONS, gauge, Current clock throttle reasons (bitmask of DCGM_CLOCKS_THROTTLE_REASON_*)
  DCGM_FI_PROCESS_NAME, label, The process name.
  DCGM_FI_CUDA_DRIVER_VERSION, label,
  DCGM_FI_DEV_NAME, label,
  DCGM_FI_DEV_MINOR_NUMBER, label,
  DCGM_FI_DRIVER_VERSION, label,
  DCGM_FI_DEV_BRAND, label,
  DCGM_FI_DEV_SERIAL, label,
  EOF
  ```
- Create a ConfigMap with the metrics configuration:

  ```bash
  kubectl create configmap metrics-config -n gpu-operator --from-file=dcgm-metrics.csv
  ```
- Update the `dcgm-exporter` configuration to use `dcgm-metrics.csv`:

  ```bash
  kubectl patch clusterpolicy/cluster-policy --type='json' \
    -p='[{"op": "add", "path": "/spec/dcgmExporter/config", "value": {"name": "metrics-config"}}]'

  kubectl patch clusterpolicy/cluster-policy --type='json' \
    -p='[{"op": "replace", "path": "/spec/dcgmExporter/env", "value": [
      {"name": "DCGM_EXPORTER_LISTEN", "value": ":9400"},
      {"name": "DCGM_EXPORTER_KUBERNETES", "value": "true"},
      {"name": "DCGM_EXPORTER_COLLECTORS", "value": "/etc/dcgm-exporter/dcgm-metrics.csv"}
    ]}]'
  ```
- Restart `dcgm-exporter`:

  ```bash
  kubectl get pods -n gpu-operator | grep nvidia-dcgm-exporter | awk '{ print $1 }' | xargs kubectl delete pod -n gpu-operator
  ```
- Create the DaemonSet configuration file that installs and enables the metrics collection agent:

  ```bash
  cat <<'EODAEMONSET' > unified_agent_daemonset.yaml
  apiVersion: apps/v1
  kind: DaemonSet
  metadata:
    name: unified-agent-installer
  spec:
    selector:
      matchLabels:
        name: unified-agent-installer
    template:
      metadata:
        labels:
          name: unified-agent-installer
      spec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: nvidia.com/gpu.count
                      operator: Exists
        hostNetwork: true
        hostPID: true
        hostIPC: true
        containers:
          - name: unified-agent-installer
            image: busybox
            securityContext:
              privileged: true
            volumeMounts:
              - mountPath: /host
                name: noderoot
            command: [ "chroot", "/host", "/bin/sh", "-c" ]
            args:
              - |
                MONITORING_API_ADDRESS="https://monitoring.api.nemax.nebius.cloud"
                folder_id="$(curl -H 'Metadata-Flavor:Google' -s http://169.254.169.254/computeMetadata/v1/instance/vendor/folder-id)"
                vm_id="$(curl -H 'Metadata-Flavor:Google' -s http://169.254.169.254/computeMetadata/v1/instance/id)"
                host="$(hostname)"
                k8s_cluster_id="$(grep -E "cluster: \w+" /etc/kubernetes/bootstrap-kubeconfig.conf | awk '{print $2}')"
                rm -rf /etc/unified_agent/config.yaml
                echo "Installing unified agent config"
                while true; do
                  dcgm_ip="$(/home/kubernetes/bin/crictl inspectp -o json $(/home/kubernetes/bin/crictl pods | grep "nvidia-dcgm-exporter" | awk '{print $1}') | jq -r '.status.network.ip')"
                  if [ ! -z "${dcgm_ip}" ]; then
                    break
                  fi
                  echo "waiting for nvidia-dcgm-exporter container to be ready"
                  sleep 5
                done
                mkdir -p /etc/unified_agent
                cat <<EOF > /etc/unified_agent/config.yaml
                storages:
                  - name: metrics_storage
                    plugin: fs
                    config:
                      directory: /var/lib/unified_agent/metrics_storage
                      max_partition_size: 100mb
                      max_segment_size: 10mb
                channels:
                  - name: metrics_channel
                    channel:
                      pipe:
                        - filter:
                            plugin: add_metric_labels
                            config:
                              labels:
                                vm_id: "${vm_id}"
                                host: "${host}"
                                k8s_cluster_id: "${k8s_cluster_id}"
                        - filter:
                            plugin: transform_metric_labels
                            config:
                              labels:
                                - gpu: "-"
                                - device: "-"
                                - modelName: "-"
                                - Hostname: "-"
                                - pod: "-"
                                - container: "-"
                                - namespace: "-"
                                - DCGM_FI_CUDA_DRIVER_VERSION: "-"
                                - DCGM_FI_DRIVER_VERSION: "-"
                                - DCGM_FI_PROCESS_NAME: "-"
                                - DCGM_FI_DEV_BRAND: "-"
                                - DCGM_FI_DEV_SERIAL: "-"
                                - DCGM_FI_DEV_MINOR_NUMBER: "-"
                                - DCGM_FI_DEV_NAME: "-"
                        - storage_ref:
                            name: metrics_storage
                      output:
                        plugin: yc_metrics
                        config:
                          url: "${MONITORING_API_ADDRESS}/monitoring/v2/data/write"
                          folder_id: "${folder_id}"
                          service: compute
                          iam:
                            cloud_meta: { }
                routes:
                  - input:
                      plugin: metrics_pull
                      config:
                        url: "http://${dcgm_ip}:9400/metrics"
                        format:
                          prometheus: {}
                        metric_name_label: name
                    channel:
                      channel_ref:
                        name: metrics_channel
                EOF
                if [ ! -f /usr/local/bin/unified_agent ]; then
                  echo "Installing unified agent binary"
                  s3_bucket_address="https://storage.il.nebius.cloud/yc-unified-agent"
                  ua_version="$(curl -s ${s3_bucket_address}/latest-version)"
                  curl -s "${s3_bucket_address}/releases/${ua_version}/unified_agent" -o unified_agent
                  chmod +x ./unified_agent
                  mkdir -p /usr/local/bin
                  mv unified_agent /usr/local/bin/unified_agent
                fi
                if [ ! -f /etc/systemd/system/unified_agent.service ]; then
                  echo "Installing unified agent.service"
                  cat <<'EOF' > /usr/local/bin/unified_agent_pre
                #!/usr/bin/env bash
                set -eu
                set -o pipefail
                dcgm_cur_ip="$(grep "9400/metrics" /etc/unified_agent/config.yaml | grep -Eo '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}')"
                dcgm_new_ip="$(/home/kubernetes/bin/crictl inspectp -o json $(/home/kubernetes/bin/crictl pods | grep "nvidia-dcgm-exporter" | awk '{print $1}') | jq -r '.status.network.ip')"
                if [ ! -z "${dcgm_new_ip}" ] && [ "${dcgm_cur_ip}" == "${dcgm_new_ip}" ]; then
                  exit 0
                fi
                sed -i "s/${dcgm_cur_ip}/${dcgm_new_ip}/g" /etc/unified_agent/config.yaml
                EOF
                  chmod +x /usr/local/bin/unified_agent_pre
                  cat <<EOF > /etc/systemd/system/unified_agent.service
                [Unit]
                Description=unified agent
                After=network.target

                [Service]
                Type=simple
                MemoryLimit=500M
                ExecStartPre=/usr/local/bin/unified_agent_pre
                ExecStart=/usr/local/bin/unified_agent --config /etc/unified_agent/config.yaml
                KillMode=process
                Restart=always
                RestartSec=2s

                [Install]
                WantedBy=multi-user.target
                EOF
                  systemctl daemon-reload
                  echo "Enabling and starting unified_agent.service"
                  systemctl enable unified_agent.service
                  systemctl start unified_agent.service
                fi
                echo "unified agent is installed"
                echo "metrics will be available at https://console.nebius.ai/folders/${folder_id}/compute/instance/${vm_id}/monitoring?tab=gpu"
                sleep infinity
        volumes:
          - name: noderoot
            hostPath:
              path: /
  EODAEMONSET
  ```
- Apply the configuration file:

  ```bash
  kubectl create -f unified_agent_daemonset.yaml
  ```
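Once the steps above are complete, you can spot-check each component with the following sketch. It is illustrative rather than part of the official setup: in particular, the `app=nvidia-dcgm-exporter` label selector is an assumption about how the GPU Operator labels the exporter pods, so adjust it if your operator version uses different labels.

```bash
# Check that the ConfigMap with the custom metric set exists.
kubectl get configmap metrics-config -n gpu-operator

# Check that the ClusterPolicy picked up the dcgm-exporter overrides.
kubectl get clusterpolicy cluster-policy -o jsonpath='{.spec.dcgmExporter.config.name}{"\n"}'

# Check that the exporter serves the custom metric set on port 9400
# (the label selector is an assumption and may vary between versions).
POD="$(kubectl get pods -n gpu-operator -l app=nvidia-dcgm-exporter -o jsonpath='{.items[0].metadata.name}')"
kubectl port-forward -n gpu-operator "$POD" 9400:9400 &
sleep 2
curl -s http://localhost:9400/metrics | grep -E '^DCGM_FI_DEV_(GPU_UTIL|FB_USED)'
kill %1

# Check that the installer DaemonSet is running on the GPU nodes.
kubectl get daemonset unified-agent-installer
kubectl logs -l name=unified-agent-installer --tail=5
```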
Explore the dashboard
The agent starts delivering metrics to the dashboard within 5–10 minutes after the VM is created or updated. After that, the GPU health status details are available in the management console: you can view them on the GPU tab of the Monitoring section on the Managed Service for Kubernetes cluster management page.
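If the dashboard stays empty after that window and you have SSH access to a GPU node, a quick sanity check is to confirm that the agent service installed by the DaemonSet is actually running. This is a minimal sketch, assuming the DaemonSet from the previous section registered the agent as `unified_agent.service`:

```bash
# On the GPU node: check the agent service installed by the DaemonSet.
systemctl status unified_agent.service --no-pager

# Recent agent logs usually reveal delivery problems (for example, IAM or network errors).
journalctl -u unified_agent.service --since "15 minutes ago" --no-pager | tail -n 20
```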
Metrics reference
Each chart on the GPU dashboard is built from one or more of the following NVIDIA DCGM metrics:
- GPU Utilization (`DCGM_FI_DEV_GPU_UTIL`): percentage of time a GPU spends executing tasks.
- Memory Utilization (`DCGM_FI_DEV_MEM_COPY_UTIL`): percentage of time the GPU memory was busy with read or write operations during the sample period.
- Free Frame Buffer in MB (`DCGM_FI_DEV_FB_FREE`): amount of free frame buffer memory.
- Used Frame Buffer in MB (`DCGM_FI_DEV_FB_USED`): amount of used frame buffer memory.
- Total Frame Buffer of the GPU in MB (`DCGM_FI_DEV_FB_TOTAL`): a constant; the total amount of frame buffer memory.
- Reserved Frame Buffer in MB (`DCGM_FI_DEV_FB_RESERVED`): a constant; the amount of frame buffer memory reserved for internal use by the hardware: drivers, firmware, etc.
- The number of bytes of active PCIe rx/tx (`DCGM_FI_PROF_PCIE_RX_BYTES`, `DCGM_FI_PROF_PCIE_TX_BYTES`): number of bytes a GPU received from (rx) or transmitted to (tx) its host VM and other devices over PCIe. Both the header and the payload of each PCIe packet are counted.
- SM Clock for the device (`DCGM_FI_DEV_SM_CLOCK`): frequency of the main GPU clock.
- Memory clock for the device (`DCGM_FI_DEV_MEM_CLOCK`): frequency of the GPU memory clock.
- Current clock throttle reasons (`DCGM_FI_DEV_CLOCK_THROTTLE_REASONS`): a bitmask of the possible reasons for GPU throttling. For example, if the GPU is throttling because it has overheated and slowed down, the chart shows 72: code 0x40 for overheating (`DCGM_CLOCKS_THROTTLE_REASON_HW_THERMAL`) plus code 0x8 for slowdown (`DCGM_CLOCKS_THROTTLE_REASON_HW_SLOWDOWN`) is 72 in decimal. See the decoding sketch after this list.
- Power usage for the device (`DCGM_FI_DEV_POWER_USAGE`): current power draw of the GPU in watts.
- Total energy consumption for the GPU since the driver was last reloaded (`DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION`): cumulative value in millijoules.
- Memory temperature for the device (`DCGM_FI_DEV_MEMORY_TEMP`): memory temperature in degrees Celsius.
- Current temperature readings for the device (`DCGM_FI_DEV_GPU_TEMP`): GPU core temperature in degrees Celsius.
- Current power limit for the device (`DCGM_FI_DEV_POWER_MGMT_LIMIT`): a constant; the power consumption limit above which the GPU is throttled.
- Slowdown temperature for the device (`DCGM_FI_DEV_SLOWDOWN_TEMP`): a constant; the temperature threshold above which the GPU is throttled until it cools down.
- Rows remapping (`DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS`, `DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS`, `DCGM_FI_DEV_ROW_REMAP_FAILURE`): metrics related to row remapping in GPU memory: the numbers of rows remapped for correctable and uncorrectable errors, and whether row remapping has failed. For more details, see the NVIDIA guide.
- XID errors (`DCGM_FI_DEV_XID_ERRORS`): the value of the last XID error encountered; an indication of a general GPU error to be used for further investigation and debugging. For the list of all possible errors, see the NVIDIA guide.
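To make the throttle-reasons chart easier to interpret, here is a minimal decoding sketch. It assumes the bit values of the `DCGM_CLOCKS_THROTTLE_REASON_*` constants match the NVML `nvmlClocksThrottleReasons` flags (0x1 through 0x100); double-check the mapping against your driver's documentation before relying on it.

```bash
#!/usr/bin/env bash
# Decode a decimal DCGM_FI_DEV_CLOCK_THROTTLE_REASONS reading (first argument)
# into the individual DCGM_CLOCKS_THROTTLE_REASON_* bits it is composed of.
value="${1:-72}"   # 72 = 0x40 (HW_THERMAL) + 0x8 (HW_SLOWDOWN), the example above

# Assumed bit-to-reason mapping, mirroring the NVML throttle reason flags.
declare -A reasons=(
  [1]="GPU_IDLE"    [2]="CLOCKS_SETTING"   [4]="SW_POWER_CAP"
  [8]="HW_SLOWDOWN" [16]="SYNC_BOOST"      [32]="SW_THERMAL"
  [64]="HW_THERMAL" [128]="HW_POWER_BRAKE" [256]="DISPLAY_CLOCKS"
)

for bit in 1 2 4 8 16 32 64 128 256; do
  if (( value & bit )); then
    printf 'DCGM_CLOCKS_THROTTLE_REASON_%s (0x%x)\n' "${reasons[$bit]}" "$bit"
  fi
done
```

Running the sketch with the argument 72 prints the `HW_SLOWDOWN` and `HW_THERMAL` reasons, matching the example in the list above.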