Applying the Nebius AI device topology to Compute Cloud VMs with GPUs
For Nebius AI virtual machines in GPU clusters, the device topology differs from the default bare-metal one. As a result, the NCCL tests
If you are using Slurm, set up a Slurm-managed cluster by following the dedicated tutorial. The configuration described there already includes the topology patch.
If you are using mpirun, apply the Nebius AI topology to your VMs or to the containers you run on them, so that the NCCL tests described in the dedicated guide run stably and workload performance improves. You can do this in one of these ways:
- Download the nccl-topo-h100-v1.xml file from the Nebius AI repository and set the path to the downloaded file via the NCCL_TOPO_FILE environment variable. For example:
  export NCCL_TOPO_FILE=/opt/nebius/nccl-topo-h100-v1.xml
  This is the preferred way because it does not depend on NCCL updates.
- Update NCCL to version 2.19.4 or higher on every VM and in every container.
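The two options above can be sketched as a pair of small shell helpers. This is only an illustration: the function names `set_nccl_topo` and `version_at_least` are hypothetical and not part of any Nebius AI or NCCL tooling, and the version check assumes GNU `sort -V` is available. How you obtain the installed NCCL version depends on your installation, so the sketch takes it as an argument.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not provided by Nebius AI or NCCL.

# Option 1: export NCCL_TOPO_FILE, but only if the topology file is readable,
# so a wrong path is caught before the job starts.
set_nccl_topo() {
    local topo_file="$1"
    if [ ! -r "$topo_file" ]; then
        echo "Topology file not readable: $topo_file" >&2
        return 1
    fi
    export NCCL_TOPO_FILE="$topo_file"
}

# Option 2: return 0 if version $1 is at least version $2.
# Relies on GNU sort -V (version sort); the smaller version sorts first.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}
```

For example, `set_nccl_topo /opt/nebius/nccl-topo-h100-v1.xml` sets the variable in the current shell, and `version_at_least "$installed" 2.19.4` tells you whether a given version string meets the minimum. If you launch jobs with Open MPI's mpirun, its `-x NCCL_TOPO_FILE` option forwards the variable to the launched processes. Other MPI launchers use different mechanisms.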