Monitoring GPUs on Compute Cloud virtual machines
You can set up your Compute Cloud virtual machines with GPUs to collect GPU usage data and visualize it on a GPU dashboard.
You can use the dashboard to monitor current resource utilization, schedule quota increases, and quickly identify anomalies. It also helps the Nebius AI support team investigate any GPU issues you may be experiencing.
Start collecting metrics
Install the tools that will collect and export the metrics:
dcgm-exporter
: A container that exports VM GPU metrics on port 9400. For more information about the container, see the NVIDIA documentation.

unified-agent
: An agent that delivers the metrics to the Nebius AI monitoring system.
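Once both tools are running, dcgm-exporter serves the metrics in the Prometheus text exposition format on port 9400, and unified-agent forwards them to the monitoring system. If you want to consume the raw feed locally, here is a minimal Python sketch; the endpoint and the DCGM_FI_DEV_GPU_UTIL field come from this page, while the parsing simply assumes the standard one-sample-per-line exposition format:

```python
# Minimal sketch: poll the local dcgm-exporter endpoint and print per-GPU
# utilization. Assumes the exporter is already installed and listening on
# its default port 9400 (see the installation steps below).
import urllib.request

METRICS_URL = "http://localhost:9400/metrics"

def gpu_utilization():
    """Return (labels, value) pairs for every DCGM_FI_DEV_GPU_UTIL sample."""
    with urllib.request.urlopen(METRICS_URL, timeout=5) as resp:
        text = resp.read().decode("utf-8")
    samples = []
    for line in text.splitlines():
        # Skip "# HELP" / "# TYPE" comments and unrelated metrics.
        if line.startswith("DCGM_FI_DEV_GPU_UTIL"):
            labels, value = line.rsplit(" ", 1)
            samples.append((labels, float(value)))
    return samples

if __name__ == "__main__":
    for labels, value in gpu_utilization():
        print(f"{labels} -> {value:.0f}%")
```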
Install the tools on a new VM
- When creating the VM, select a VM platform with GPUs and a boot disk image with CUDA adapted for GPUs.
- In the Monitoring section, select Enable next to the Collecting GPU metrics field. If there is no such section in the console, select another platform or boot disk.
Install the tools on an existing VM
- Run the superuser shell:

  ```
  sudo -i
  ```

- Run the installation script:

  ```
  wget -O - https://monitoring.api.nemax.nebius.cloud/monitoring/v2/gpu-metrics-exporter/install.sh | bash
  ```

- Check the status of the installed services:

  ```
  sudo systemctl status docker.service
  sudo systemctl status nvidia-dcgm.service
  sudo systemctl status unified_agent.service
  ```

- Check the status of the container that exports the metrics:

  ```
  sudo docker ps --filter name=dcgm-exporter
  ```

- Check that the metrics are shown locally:

  ```
  curl localhost:9400/metrics
  ```
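If you would rather script the last verification step, here is a minimal Python sketch. The endpoint and the field names come from this page; the script itself is illustrative and is not part of the installer:

```python
# Illustrative health check for the metrics exporter: confirm the endpoint
# responds and that a few expected DCGM fields (from the reference below)
# are present in the output. Exits nonzero on failure.
import sys
import urllib.request

METRICS_URL = "http://localhost:9400/metrics"
EXPECTED = ("DCGM_FI_DEV_GPU_UTIL", "DCGM_FI_DEV_FB_USED", "DCGM_FI_DEV_GPU_TEMP")

try:
    with urllib.request.urlopen(METRICS_URL, timeout=5) as resp:
        body = resp.read().decode("utf-8")
except OSError as err:
    sys.exit(f"exporter not reachable at {METRICS_URL}: {err}")

missing = [name for name in EXPECTED if name not in body]
if missing:
    sys.exit(f"exporter is up, but metrics are missing: {', '.join(missing)}")

print("dcgm-exporter is serving all expected metrics")
```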
Explore the dashboard
GPU usage data becomes available in the management console 5–10 minutes after the tools are installed (either at VM creation or manually on an existing VM) or after the VM is updated. You can view the charts on the GPU tab of the Monitoring section of the VM management page.
Metrics reference
Each chart on the dashboard corresponds to one or more NVIDIA DCGM metrics:
- GPU Utilization (DCGM_FI_DEV_GPU_UTIL): Percentage of time a GPU spends executing tasks.
- Memory Utilization (DCGM_FI_DEV_MEM_COPY_UTIL): Percentage of time the GPU memory was in use (performing read or write operations) during the measurement period.
- Free Frame Buffer in MB (DCGM_FI_DEV_FB_FREE): Amount of free frame buffer memory.
- Used Frame Buffer in MB (DCGM_FI_DEV_FB_USED): Amount of used frame buffer memory.
- Total Frame Buffer of the GPU in MB (DCGM_FI_DEV_FB_TOTAL): A constant. The total amount of frame buffer memory.
- Reserved Frame Buffer in MB (DCGM_FI_DEV_FB_RESERVED): A constant. Amount of frame buffer memory reserved for internal use by the hardware: drivers, firmware, etc.
- The number of bytes of active PCIe rx/tx (DCGM_FI_PROF_PCIE_RX_BYTES, DCGM_FI_PROF_PCIE_TX_BYTES): Number of bytes a GPU received from (rx) or transmitted to (tx) its host VM and other devices over PCIe. Both the header and payload of each PCIe packet are counted.
- SM Clock for the device (DCGM_FI_DEV_SM_CLOCK): Frequency of the main GPU (SM) clock.
- Memory clock for the device (DCGM_FI_DEV_MEM_CLOCK): Frequency of the GPU memory clock.
- Current clock throttle reasons (DCGM_FI_DEV_CLOCK_THROTTLE_REASONS): A bitmask of possible reasons for GPU throttling. For example, if the GPU is throttling because it has overheated and slowed down, the chart shows 72: code 0x40 for overheating (DCGM_CLOCKS_THROTTLE_REASON_HW_THERMAL) + code 0x8 for hardware slowdown (DCGM_CLOCKS_THROTTLE_REASON_HW_SLOWDOWN) = 72 in decimal. A sketch for decoding such values follows this reference.
- Power usage for the device (DCGM_FI_DEV_POWER_USAGE): Current power consumption of the GPU, in watts.
- Total energy consumption for the GPU since the driver was last reloaded (DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION): Cumulative energy consumption of the GPU since the most recent driver reload, in millijoules.
- Memory temperature for the device (DCGM_FI_DEV_MEMORY_TEMP): Memory temperature in degrees Celsius.
- Current temperature readings for the device (DCGM_FI_DEV_GPU_TEMP): GPU core temperature in degrees Celsius.
- Current Power limit for the device (DCGM_FI_DEV_POWER_MGMT_LIMIT): A constant. Power consumption limit above which the GPU is throttled.
- Slowdown temperature for the device (DCGM_FI_DEV_SLOWDOWN_TEMP): A constant. Temperature threshold above which the GPU is throttled until it cools down.
- Rows remapping (DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS, DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, DCGM_FI_DEV_ROW_REMAP_FAILURE): Metrics related to row remapping in GPU memory: the numbers of rows remapped for CORRECTABLE and UNCORRECTABLE errors, and the number of failed row remappings (REMAP_FAILURE). For more details, see the NVIDIA guide.
- XID errors (DCGM_FI_DEV_XID_ERRORS): An indication of a general GPU error; use it as a starting point for further investigation and debugging. See the list of all possible errors in the NVIDIA guide.
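Because DCGM_FI_DEV_CLOCK_THROTTLE_REASONS is a bitmask, a single chart value can combine several reasons, as in the 72 = 0x40 + 0x8 example above. A minimal decoding sketch in Python, using the reason bit values from the NVIDIA documentation (the helper function and the mapping layout are illustrative, not part of any Nebius or NVIDIA API):

```python
# Decode a DCGM_FI_DEV_CLOCK_THROTTLE_REASONS value into reason names.
# Bit values follow NVIDIA's DCGM_CLOCKS_THROTTLE_REASON_* constants;
# the helper itself is illustrative, not part of any official API.
THROTTLE_REASONS = {
    0x1:   "GPU_IDLE",
    0x2:   "CLOCKS_SETTING",   # clocks limited by an applications clocks setting
    0x4:   "SW_POWER_CAP",
    0x8:   "HW_SLOWDOWN",
    0x10:  "SYNC_BOOST",
    0x20:  "SW_THERMAL",
    0x40:  "HW_THERMAL",
    0x80:  "HW_POWER_BRAKE",
    0x100: "DISPLAY_CLOCKS",
}

def decode_throttle_reasons(bitmask: int) -> list[str]:
    """Return the names of all throttle reasons set in the given bitmask."""
    return [name for bit, name in THROTTLE_REASONS.items() if bitmask & bit]

# The example from the reference above: 0x40 (HW_THERMAL) + 0x8 (HW_SLOWDOWN) = 72.
print(decode_throttle_reasons(72))  # ['HW_SLOWDOWN', 'HW_THERMAL']
```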