Monitoring GPUs on Compute Cloud virtual machines
You can set up your Compute Cloud virtual machines with GPUs to collect GPU usage data and visualize it on a GPU dashboard.
You can use the dashboard to monitor current resource utilization, schedule quota increases, and quickly identify anomalies. It also helps the Nebius AI support team investigate any GPU issues you may be experiencing.
Start collecting metrics
Install the tools that will collect and export the metrics:
dcgm-exporter
: A container that exports VM GPU metrics on port 9400. For more information about the container, see the NVIDIA documentation.

unified-agent
: An agent that delivers the metrics to the Nebius AI monitoring system.
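Once both tools are running, dcgm-exporter serves the metrics in the Prometheus text exposition format on port 9400, and unified-agent forwards them to the monitoring system. If you want to consume the raw feed locally, here is a minimal Python sketch; the endpoint and the DCGM_FI_DEV_GPU_UTIL field come from this page, while the parsing simply assumes the standard one-sample-per-line exposition format:

```python
# Minimal sketch: poll the local dcgm-exporter endpoint and print per-GPU
# utilization. Assumes the exporter is already installed and listening on
# its default port 9400 (see the installation steps below).
import urllib.request

METRICS_URL = "http://localhost:9400/metrics"

def gpu_utilization():
    """Return (labels, value) pairs for every DCGM_FI_DEV_GPU_UTIL sample."""
    with urllib.request.urlopen(METRICS_URL, timeout=5) as resp:
        text = resp.read().decode("utf-8")
    samples = []
    for line in text.splitlines():
        # Skip "# HELP" / "# TYPE" comments and unrelated metrics.
        if line.startswith("DCGM_FI_DEV_GPU_UTIL"):
            labels, value = line.rsplit(" ", 1)
            samples.append((labels, float(value)))
    return samples

if __name__ == "__main__":
    for labels, value in gpu_utilization():
        print(f"{labels} -> {value:.0f}%")
```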
Install the tools on a new VM
- When creating the VM, select a VM platform with GPUs and a boot disk image with CUDA adapted for GPUs.
- In the Monitoring section, select Enable next to the Collecting GPU metrics field. If there is no such section in the console, select another platform or boot disk.
Install the tools on an existing VM
- Run the superuser shell:

  ```
  sudo -i
  ```

- Run the installation script:

  ```
  wget -O - https://monitoring.api.nemax.nebius.cloud/monitoring/v2/gpu-metrics-exporter/install.sh | bash
  ```

- Check the status of the installed services:

  ```
  sudo systemctl status docker.service
  sudo systemctl status nvidia-dcgm.service
  sudo systemctl status unified_agent.service
  ```

- Check the status of the container that exports the metrics:

  ```
  sudo docker ps --filter name=dcgm-exporter
  ```

- Check that the metrics are shown locally:

  ```
  curl localhost:9400/metrics
  ```
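If you would rather script the last verification step, here is a minimal Python sketch. The endpoint and the field names come from this page; the script itself is illustrative and is not part of the installer:

```python
# Illustrative health check for the metrics exporter: confirm the endpoint
# responds and that a few expected DCGM fields (from the reference below)
# are present in the output. Exits nonzero on failure.
import sys
import urllib.request

METRICS_URL = "http://localhost:9400/metrics"
EXPECTED = ("DCGM_FI_DEV_GPU_UTIL", "DCGM_FI_DEV_FB_USED", "DCGM_FI_DEV_GPU_TEMP")

try:
    with urllib.request.urlopen(METRICS_URL, timeout=5) as resp:
        body = resp.read().decode("utf-8")
except OSError as err:
    sys.exit(f"exporter not reachable at {METRICS_URL}: {err}")

missing = [name for name in EXPECTED if name not in body]
if missing:
    sys.exit(f"exporter is up, but metrics are missing: {', '.join(missing)}")

print("dcgm-exporter is serving all expected metrics")
```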
Explore the dashboard
GPU usage data becomes available in the management console 5–10 minutes after the tools are installed (either at VM creation or manually on an existing VM) or after the VM is updated. You can view the charts on the GPU tab of the Monitoring section of the VM management page.
Metrics reference
Each chart on the dashboard corresponds to one or more NVIDIA DCGM metrics:
- GPU Utilization (DCGM_FI_DEV_GPU_UTIL): Percentage of time a GPU spends executing tasks.
- Memory Utilization (DCGM_FI_DEV_MEM_COPY_UTIL): Percentage of time the GPU memory was in use (performing read or write operations) during the measurement period.
- Free Frame Buffer in MB (DCGM_FI_DEV_FB_FREE): Amount of free frame buffer memory.
- Used Frame Buffer in MB (DCGM_FI_DEV_FB_USED): Amount of used frame buffer memory.
- Total Frame Buffer of the GPU in MB (DCGM_FI_DEV_FB_TOTAL): A constant. The total amount of frame buffer memory.
- Reserved Frame Buffer in MB (DCGM_FI_DEV_FB_RESERVED): A constant. Amount of frame buffer memory reserved for internal use by the hardware: drivers, firmware, etc.
- The number of bytes of active PCIe rx/tx (DCGM_FI_PROF_PCIE_RX_BYTES, DCGM_FI_PROF_PCIE_TX_BYTES): Number of bytes a GPU received from (rx) or transmitted to (tx) its host VM and other devices over PCIe. Both the header and payload of each PCIe packet are counted.
- SM Clock for the device (DCGM_FI_DEV_SM_CLOCK): Frequency of the main GPU (SM) clock.
- Memory clock for the device (DCGM_FI_DEV_MEM_CLOCK): Frequency of the GPU memory clock.
- Current clock throttle reasons (DCGM_FI_DEV_CLOCK_THROTTLE_REASONS): A bitmask of possible reasons for GPU throttling. For example, if the GPU is throttling because it has overheated and slowed down, the chart shows 72: code 0x40 for overheating (DCGM_CLOCKS_THROTTLE_REASON_HW_THERMAL) + code 0x8 for hardware slowdown (DCGM_CLOCKS_THROTTLE_REASON_HW_SLOWDOWN) = 72 in decimal. A sketch for decoding such values follows this reference.
- Power usage for the device (DCGM_FI_DEV_POWER_USAGE): Current power consumption of the GPU, in watts.
- Total energy consumption for the GPU since the driver was last reloaded (DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION): Cumulative energy consumption of the GPU since the most recent driver reload, in millijoules.
- Memory temperature for the device (DCGM_FI_DEV_MEMORY_TEMP): Memory temperature in degrees Celsius.
- Current temperature readings for the device (DCGM_FI_DEV_GPU_TEMP): GPU core temperature in degrees Celsius.
- Current Power limit for the device (DCGM_FI_DEV_POWER_MGMT_LIMIT): A constant. Power consumption limit above which the GPU is throttled.
- Slowdown temperature for the device (DCGM_FI_DEV_SLOWDOWN_TEMP): A constant. Temperature threshold above which the GPU is throttled until it cools down.
- Rows remapping (DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS, DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, DCGM_FI_DEV_ROW_REMAP_FAILURE): Metrics related to row remapping in GPU memory: the numbers of rows remapped for CORRECTABLE and UNCORRECTABLE errors, and the number of failed row remappings (REMAP_FAILURE). For more details, see the NVIDIA guide.
- XID errors (DCGM_FI_DEV_XID_ERRORS): An indication of a general GPU error; use it as a starting point for further investigation and debugging. See the list of all possible errors in the NVIDIA guide.
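Because DCGM_FI_DEV_CLOCK_THROTTLE_REASONS is a bitmask, a single chart value can combine several reasons, as in the 72 = 0x40 + 0x8 example above. A minimal decoding sketch in Python, using the reason bit values from the NVIDIA documentation (the helper function and the mapping layout are illustrative, not part of any Nebius or NVIDIA API):

```python
# Decode a DCGM_FI_DEV_CLOCK_THROTTLE_REASONS value into reason names.
# Bit values follow NVIDIA's DCGM_CLOCKS_THROTTLE_REASON_* constants;
# the helper itself is illustrative, not part of any official API.
THROTTLE_REASONS = {
    0x1:   "GPU_IDLE",
    0x2:   "CLOCKS_SETTING",   # clocks limited by an applications clocks setting
    0x4:   "SW_POWER_CAP",
    0x8:   "HW_SLOWDOWN",
    0x10:  "SYNC_BOOST",
    0x20:  "SW_THERMAL",
    0x40:  "HW_THERMAL",
    0x80:  "HW_POWER_BRAKE",
    0x100: "DISPLAY_CLOCKS",
}

def decode_throttle_reasons(bitmask: int) -> list[str]:
    """Return the names of all throttle reasons set in the given bitmask."""
    return [name for bit, name in THROTTLE_REASONS.items() if bitmask & bit]

# The example from the reference above: 0x40 (HW_THERMAL) + 0x8 (HW_SLOWDOWN) = 72.
print(decode_throttle_reasons(72))  # ['HW_SLOWDOWN', 'HW_THERMAL']
```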