Monitoring resource usage
Why does vCPU utilization displayed on the chart exceed 100%?
If you use cores with a guaranteed vCPU performance of, say, 5%, the monitoring system treats that 5% as 100% of the expected load. If there are no "neighbors" on the physical core, the VM may be allocated up to 100% of the physical core's performance, which is 20 times the expected maximum (20 × 5%). As a result, the chart can show values of up to 2000%.
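The arithmetic above can be sketched in a few lines (the variable names are illustrative, not part of any API):

```python
# Illustration: how a 2000% reading arises when the chart scales
# actual utilization against the guaranteed share.
guaranteed_pct = 5    # 5% guaranteed vCPU performance = "100%" on the chart
actual_pct = 100      # full physical core when there are no "neighbors"

displayed_percent = actual_pct / guaranteed_pct * 100
print(displayed_percent)  # 2000.0
```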
If the graphs show utilization above 100% for an extended period, we recommend increasing the guaranteed vCPU performance: "neighbors" may appear on the physical core at any time, and your actual share of the physical core's resources will then drop to the guaranteed 5% (about 100 MHz). In that case, the guest system may fail to cope with the load, and you may lose access to the VM.
How do I track vRAM use through monitoring?
The Compute Cloud service can't measure vRAM consumption inside the guest operating system: from the service's point of view, the VM's memory consumption is always the same, namely the full amount allocated when the VM starts.
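Because the platform only sees the allocated total, real memory usage can only be observed from inside the guest. One option on a Linux guest (a sketch; `free -h` gives a similar summary) is to read the kernel's own counters:

```shell
# Run inside the guest OS; the hypervisor only sees the total allocation.
# MemTotal and MemAvailable are reported by the kernel in kB.
grep -E 'MemTotal|MemAvailable' /proc/meminfo
```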
Why don't I see the GPU dashboard?
The metrics collection agent works with a limited set of platforms and operating systems. Make sure that your VM runs on a platform with GPUs and that the boot disk has Compute Unified Device Architecture (CUDA) installed.
- Find out your platform by running:

  nvidia-smi

- Find out your operating system by running:

  lsb_release -d

If the VM meets these requirements, contact support.
How do I report missing GPU metrics?
- Check that all the required services are running:

  sudo systemctl status docker.service
  sudo systemctl status nvidia-dcgm.service
  sudo systemctl status unified_agent.service

- Check the status of the metrics exporter container:

  sudo docker ps --filter name=dcgm-exporter

- Check that the metrics are available locally:

  curl localhost:9400/metrics

- Contact support and provide the outputs of these commands.
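A small helper script (a sketch; the file name and exact command set are illustrative) can bundle the outputs of these commands into one file to attach to the ticket:

```shell
# Illustrative helper: collect diagnostic output for a support ticket.
# "|| true" keeps the script going if a service or tool is missing;
# the error message is then recorded in the report instead.
{
  echo "== services =="
  systemctl status docker.service nvidia-dcgm.service unified_agent.service --no-pager || true
  echo "== exporter container =="
  docker ps --filter name=dcgm-exporter || true
  echo "== local metrics (first lines) =="
  curl -s localhost:9400/metrics | head -n 20
} > gpu-diagnostics.txt 2>&1
echo "Report written to gpu-diagnostics.txt"
```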