Monitoring GPUs in Managed Service for Kubernetes clusters
You can set up your Managed Service for Kubernetes nodes with GPUs to collect GPU usage data and visualize it on a GPU dashboard.
You can use the dashboard to monitor current resource utilization, schedule quota increases, and quickly identify anomalies. It also helps the Nebius AI support team investigate any GPU issues you may be experiencing.
Start collecting metrics
You can set up the nodes either via the free Nebius AI Marketplace product or by configuring the tools manually.

Marketplace product: install the NVIDIA® GPU Operator. It contains the tools optimized for collecting GPU monitoring metrics.

Manual configuration: install the tools that collect and export the metrics (a verification sketch follows these steps):
- Prepare the Managed Service for Kubernetes cluster with GPUs.
- Create the custom metrics configuration file:

  ```bash
  cat <<EOF > dcgm-metrics.csv
  # Format
  # If line starts with a '#' it is considered a comment
  # DCGM FIELD, Prometheus metric type, help message

  # Clocks
  DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).
  DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).

  # Temperature
  DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
  DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).

  # Power
  DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
  DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).

  # PCIe
  DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.

  # Utilization (the sample period varies depending on the product)
  DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
  DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
  DCGM_FI_DEV_ENC_UTIL, gauge, Encoder utilization (in %).
  DCGM_FI_DEV_DEC_UTIL, gauge, Decoder utilization (in %).

  # Errors and violations
  DCGM_FI_DEV_XID_ERRORS, gauge, Value of the last XID error encountered.

  # Memory usage
  DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
  DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).

  # NVLink
  DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, counter, Total number of NVLink bandwidth counters for all lanes.

  # vGPU license status
  DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU license status.

  # Remapped rows
  DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors.
  DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for correctable errors.
  DCGM_FI_DEV_ROW_REMAP_FAILURE, gauge, Whether remapping of rows has failed.

  # DCP metrics
  DCGM_FI_PROF_PCIE_TX_BYTES, counter, The number of bytes of active PCIe tx data including both header and payload.
  DCGM_FI_PROF_PCIE_RX_BYTES, counter, The number of bytes of active PCIe rx data including both header and payload.
  DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, Ratio of time the graphics engine is active (in %).
  DCGM_FI_PROF_SM_ACTIVE, gauge, The ratio of cycles an SM has at least 1 warp assigned (in %).
  DCGM_FI_PROF_SM_OCCUPANCY, gauge, The ratio of number of warps resident on an SM (in %).
  DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active (in %).
  DCGM_FI_PROF_DRAM_ACTIVE, gauge, Ratio of cycles the device memory interface is active sending or receiving data (in %).
  DCGM_FI_PROF_PIPE_FP64_ACTIVE, gauge, Ratio of cycles the fp64 pipes are active (in %).
  DCGM_FI_PROF_PIPE_FP32_ACTIVE, gauge, Ratio of cycles the fp32 pipes are active (in %).
  DCGM_FI_PROF_PIPE_FP16_ACTIVE, gauge, Ratio of cycles the fp16 pipes are active (in %).

  # Datadog additional recommended fields
  DCGM_FI_DEV_COUNT, counter, Number of devices on the node.
  DCGM_FI_DEV_FAN_SPEED, gauge, Fan speed for the device in percent 0-100.
  DCGM_FI_DEV_SLOWDOWN_TEMP, gauge, Slowdown temperature for the device.
  DCGM_FI_DEV_POWER_MGMT_LIMIT, gauge, Current power limit for the device.
  DCGM_FI_DEV_PSTATE, gauge, Performance state (P-State) 0-15. 0=highest
  DCGM_FI_DEV_FB_TOTAL, gauge,
  DCGM_FI_DEV_FB_RESERVED, gauge,
  DCGM_FI_DEV_FB_USED_PERCENT, gauge,
  DCGM_FI_DEV_CLOCK_THROTTLE_REASONS, gauge, Current clock throttle reasons (bitmask of DCGM_CLOCKS_THROTTLE_REASON_*)
  DCGM_FI_PROCESS_NAME, label, The process name.
  DCGM_FI_CUDA_DRIVER_VERSION, label,
  DCGM_FI_DEV_NAME, label,
  DCGM_FI_DEV_MINOR_NUMBER, label,
  DCGM_FI_DRIVER_VERSION, label,
  DCGM_FI_DEV_BRAND, label,
  DCGM_FI_DEV_SERIAL, label,
  EOF
  ```
- Create a ConfigMap with the metrics configuration:

  ```bash
  kubectl create configmap metrics-config -n gpu-operator --from-file=dcgm-metrics.csv
  ```
- Update the `dcgm-exporter` configuration to use `dcgm-metrics.csv`:

  ```bash
  kubectl patch clusterpolicy/cluster-policy --type='json' \
    -p='[{"op": "add", "path": "/spec/dcgmExporter/config", "value": {"name": "metrics-config"}}]'

  kubectl patch clusterpolicy/cluster-policy --type='json' \
    -p='[{"op": "replace", "path": "/spec/dcgmExporter/env", "value": [
      {"name": "DCGM_EXPORTER_LISTEN", "value": ":9400"},
      {"name": "DCGM_EXPORTER_KUBERNETES", "value": "true"},
      {"name": "DCGM_EXPORTER_COLLECTORS", "value": "/etc/dcgm-exporter/dcgm-metrics.csv"}
    ]}]'
  ```
- Restart `dcgm-exporter`:

  ```bash
  kubectl get pods -n gpu-operator | grep nvidia-dcgm-exporter | awk '{ print $1 }' | xargs kubectl delete pod -n gpu-operator
  ```
- Create the DaemonSet configuration file that installs and enables the metrics collection agent:

  ```bash
  cat <<'EODAEMONSET' > unified_agent_daemonset.yaml
  apiVersion: apps/v1
  kind: DaemonSet
  metadata:
    name: unified-agent-installer
  spec:
    selector:
      matchLabels:
        name: unified-agent-installer
    template:
      metadata:
        labels:
          name: unified-agent-installer
      spec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: nvidia.com/gpu.count
                      operator: Exists
        hostNetwork: true
        hostPID: true
        hostIPC: true
        containers:
          - name: unified-agent-installer
            image: busybox
            securityContext:
              privileged: true
            volumeMounts:
              - mountPath: /host
                name: noderoot
            command: [ "chroot", "/host", "/bin/sh", "-c" ]
            args:
              - |
                MONITORING_API_ADDRESS="https://monitoring.api.nemax.nebius.cloud"
                folder_id="$(curl -H 'Metadata-Flavor:Google' -s http://169.254.169.254/computeMetadata/v1/instance/vendor/folder-id)"
                vm_id="$(curl -H 'Metadata-Flavor:Google' -s http://169.254.169.254/computeMetadata/v1/instance/id)"
                host="$(hostname)"
                k8s_cluster_id="$(grep -E "cluster: \w+" /etc/kubernetes/bootstrap-kubeconfig.conf | awk '{print $2}')"
                rm -rf /etc/unified_agent/config.yaml
                echo "Installing unified agent config"
                while true; do
                  dcgm_ip="$(/home/kubernetes/bin/crictl inspectp -o json $(/home/kubernetes/bin/crictl pods | grep "nvidia-dcgm-exporter" | awk '{print $1}') | jq -r '.status.network.ip')"
                  if [ ! -z "${dcgm_ip}" ]; then
                    break
                  fi
                  echo "waiting for nvidia-dcgm-exporter container to be ready"
                  sleep 5
                done
                mkdir -p /etc/unified_agent
                cat <<EOF > /etc/unified_agent/config.yaml
                storages:
                  - name: metrics_storage
                    plugin: fs
                    config:
                      directory: /var/lib/unified_agent/metrics_storage
                      max_partition_size: 100mb
                      max_segment_size: 10mb
                channels:
                  - name: metrics_channel
                    channel:
                      pipe:
                        - filter:
                            plugin: add_metric_labels
                            config:
                              labels:
                                vm_id: "${vm_id}"
                                host: "${host}"
                                k8s_cluster_id: "${k8s_cluster_id}"
                        - filter:
                            plugin: transform_metric_labels
                            config:
                              labels:
                                - gpu: "-"
                                - device: "-"
                                - modelName: "-"
                                - Hostname: "-"
                                - pod: "-"
                                - container: "-"
                                - namespace: "-"
                                - DCGM_FI_CUDA_DRIVER_VERSION: "-"
                                - DCGM_FI_DRIVER_VERSION: "-"
                                - DCGM_FI_PROCESS_NAME: "-"
                                - DCGM_FI_DEV_BRAND: "-"
                                - DCGM_FI_DEV_SERIAL: "-"
                                - DCGM_FI_DEV_MINOR_NUMBER: "-"
                                - DCGM_FI_DEV_NAME: "-"
                        - storage_ref:
                            name: metrics_storage
                      output:
                        plugin: yc_metrics
                        config:
                          url: "${MONITORING_API_ADDRESS}/monitoring/v2/data/write"
                          folder_id: "${folder_id}"
                          service: compute
                          iam:
                            cloud_meta: { }
                routes:
                  - input:
                      plugin: metrics_pull
                      config:
                        url: "http://${dcgm_ip}:9400/metrics"
                        format:
                          prometheus: {}
                        metric_name_label: name
                    channel:
                      channel_ref:
                        name: metrics_channel
                EOF
                if [ ! -f /usr/local/bin/unified_agent ]; then
                  echo "Installing unified agent binary"
                  s3_bucket_address="https://storage.il.nebius.cloud/yc-unified-agent"
                  ua_version="$(curl -s ${s3_bucket_address}/latest-version)"
                  curl -s "${s3_bucket_address}/releases/${ua_version}/unified_agent" -o unified_agent
                  chmod +x ./unified_agent
                  mkdir -p /usr/local/bin
                  mv unified_agent /usr/local/bin/unified_agent
                fi
                if [ ! -f /etc/systemd/system/unified_agent.service ]; then
                  echo "Installing unified agent.service"
                  cat <<'EOF' > /usr/local/bin/unified_agent_pre
                #!/usr/bin/env bash
                set -eu
                set -o pipefail
                dcgm_cur_ip="$(grep "9400/metrics" /etc/unified_agent/config.yaml | grep -Eo '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}')"
                dcgm_new_ip="$(/home/kubernetes/bin/crictl inspectp -o json $(/home/kubernetes/bin/crictl pods | grep "nvidia-dcgm-exporter" | awk '{print $1}') | jq -r '.status.network.ip')"
                if [ ! -z "${dcgm_new_ip}" ] && [ "${dcgm_cur_ip}" == "${dcgm_new_ip}" ]; then
                  exit 0
                fi
                sed -i "s/${dcgm_cur_ip}/${dcgm_new_ip}/g" /etc/unified_agent/config.yaml
                EOF
                  chmod +x /usr/local/bin/unified_agent_pre
                  cat <<EOF > /etc/systemd/system/unified_agent.service
                [Unit]
                Description=unified agent
                After=network.target

                [Service]
                Type=simple
                MemoryLimit=500M
                ExecStartPre=/usr/local/bin/unified_agent_pre
                ExecStart=/usr/local/bin/unified_agent --config /etc/unified_agent/config.yaml
                KillMode=process
                Restart=always
                RestartSec=2s

                [Install]
                WantedBy=multi-user.target
                EOF
                  systemctl daemon-reload
                  echo "Enabling and starting unified_agent.service"
                  systemctl enable unified_agent.service
                  systemctl start unified_agent.service
                fi
                echo "unified agent is installed"
                echo "metrics will be available at https://console.nebius.ai/folders/${folder_id}/compute/instance/${vm_id}/monitoring?tab=gpu"
                sleep infinity
        volumes:
          - name: noderoot
            hostPath:
              path: /
  EODAEMONSET
  ```
- Apply the configuration file:

  ```bash
  kubectl create -f unified_agent_daemonset.yaml
  ```
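Once the steps above are complete, you can spot-check each component with the following sketch. It is illustrative rather than part of the official setup: in particular, the `app=nvidia-dcgm-exporter` label selector is an assumption about how the GPU Operator labels the exporter pods, so adjust it if your operator version uses different labels.

```bash
# Check that the ConfigMap with the custom metric set exists.
kubectl get configmap metrics-config -n gpu-operator

# Check that the ClusterPolicy picked up the dcgm-exporter overrides.
kubectl get clusterpolicy cluster-policy -o jsonpath='{.spec.dcgmExporter.config.name}{"\n"}'

# Check that the exporter serves the custom metric set on port 9400
# (the label selector is an assumption and may vary between versions).
POD="$(kubectl get pods -n gpu-operator -l app=nvidia-dcgm-exporter -o jsonpath='{.items[0].metadata.name}')"
kubectl port-forward -n gpu-operator "$POD" 9400:9400 &
sleep 2
curl -s http://localhost:9400/metrics | grep -E '^DCGM_FI_DEV_(GPU_UTIL|FB_USED)'
kill %1

# Check that the installer DaemonSet is running on the GPU nodes.
kubectl get daemonset unified-agent-installer
kubectl logs -l name=unified-agent-installer --tail=5
```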
Explore the dashboard
The agent starts delivering metrics to the dashboard within 5–10 minutes after the VM is created or updated. After that, the GPU health status details are available in the management console: you can view them on the GPU tab of the Monitoring section on the Managed Service for Kubernetes cluster management page.
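If the dashboard stays empty after that window and you have SSH access to a GPU node, a quick sanity check is to confirm that the agent service installed by the DaemonSet is actually running. This is a minimal sketch, assuming the DaemonSet from the previous section registered the agent as `unified_agent.service`:

```bash
# On the GPU node: check the agent service installed by the DaemonSet.
systemctl status unified_agent.service --no-pager

# Recent agent logs usually reveal delivery problems (for example, IAM or network errors).
journalctl -u unified_agent.service --since "15 minutes ago" --no-pager | tail -n 20
```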
Metrics reference
Each chart on the GPU dashboard is built from one or more of the following NVIDIA DCGM metrics:
- GPU Utilization (`DCGM_FI_DEV_GPU_UTIL`): percentage of time a GPU spends executing tasks.
- Memory Utilization (`DCGM_FI_DEV_MEM_COPY_UTIL`): percentage of time the GPU memory was busy with read or write operations during the sample period.
- Free Frame Buffer in MB (`DCGM_FI_DEV_FB_FREE`): amount of free frame buffer memory.
- Used Frame Buffer in MB (`DCGM_FI_DEV_FB_USED`): amount of used frame buffer memory.
- Total Frame Buffer of the GPU in MB (`DCGM_FI_DEV_FB_TOTAL`): a constant; the total amount of frame buffer memory.
- Reserved Frame Buffer in MB (`DCGM_FI_DEV_FB_RESERVED`): a constant; the amount of frame buffer memory reserved for internal use by the hardware: drivers, firmware, etc.
- The number of bytes of active PCIe rx/tx (`DCGM_FI_PROF_PCIE_RX_BYTES`, `DCGM_FI_PROF_PCIE_TX_BYTES`): number of bytes a GPU received from (rx) or transmitted to (tx) its host VM and other devices over PCIe. Both the header and the payload of each PCIe packet are counted.
- SM Clock for the device (`DCGM_FI_DEV_SM_CLOCK`): frequency of the main GPU clock.
- Memory clock for the device (`DCGM_FI_DEV_MEM_CLOCK`): frequency of the GPU memory clock.
- Current clock throttle reasons (`DCGM_FI_DEV_CLOCK_THROTTLE_REASONS`): a bitmask of the possible reasons for GPU throttling. For example, if the GPU is throttling because it has overheated and slowed down, the chart shows 72: code 0x40 for overheating (`DCGM_CLOCKS_THROTTLE_REASON_HW_THERMAL`) plus code 0x8 for slowdown (`DCGM_CLOCKS_THROTTLE_REASON_HW_SLOWDOWN`) is 72 in decimal. See the decoding sketch after this list.
- Power usage for the device (`DCGM_FI_DEV_POWER_USAGE`): current power draw of the GPU in watts.
- Total energy consumption for the GPU since the driver was last reloaded (`DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION`): cumulative value in millijoules.
- Memory temperature for the device (`DCGM_FI_DEV_MEMORY_TEMP`): memory temperature in degrees Celsius.
- Current temperature readings for the device (`DCGM_FI_DEV_GPU_TEMP`): GPU core temperature in degrees Celsius.
- Current power limit for the device (`DCGM_FI_DEV_POWER_MGMT_LIMIT`): a constant; the power consumption limit above which the GPU is throttled.
- Slowdown temperature for the device (`DCGM_FI_DEV_SLOWDOWN_TEMP`): a constant; the temperature threshold above which the GPU is throttled until it cools down.
- Rows remapping (`DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS`, `DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS`, `DCGM_FI_DEV_ROW_REMAP_FAILURE`): metrics related to row remapping in GPU memory: the numbers of rows remapped for correctable and uncorrectable errors, and whether row remapping has failed. For more details, see the NVIDIA guide.
- XID errors (`DCGM_FI_DEV_XID_ERRORS`): the value of the last XID error encountered; an indication of a general GPU error to be used for further investigation and debugging. For the list of all possible errors, see the NVIDIA guide.
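To make the throttle-reasons chart easier to interpret, here is a minimal decoding sketch. It assumes the bit values of the `DCGM_CLOCKS_THROTTLE_REASON_*` constants match the NVML `nvmlClocksThrottleReasons` flags (0x1 through 0x100); double-check the mapping against your driver's documentation before relying on it.

```bash
#!/usr/bin/env bash
# Decode a decimal DCGM_FI_DEV_CLOCK_THROTTLE_REASONS reading (first argument)
# into the individual DCGM_CLOCKS_THROTTLE_REASON_* bits it is composed of.
value="${1:-72}"   # 72 = 0x40 (HW_THERMAL) + 0x8 (HW_SLOWDOWN), the example above

# Assumed bit-to-reason mapping, mirroring the NVML throttle reason flags.
declare -A reasons=(
  [1]="GPU_IDLE"    [2]="CLOCKS_SETTING"   [4]="SW_POWER_CAP"
  [8]="HW_SLOWDOWN" [16]="SYNC_BOOST"      [32]="SW_THERMAL"
  [64]="HW_THERMAL" [128]="HW_POWER_BRAKE" [256]="DISPLAY_CLOCKS"
)

for bit in 1 2 4 8 16 32 64 128 256; do
  if (( value & bit )); then
    printf 'DCGM_CLOCKS_THROTTLE_REASON_%s (0x%x)\n' "${reasons[$bit]}" "$bit"
  fi
done
```

Running the sketch with the argument 72 prints the `HW_SLOWDOWN` and `HW_THERMAL` reasons, matching the example in the list above.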