Applying the Nebius AI device topology to Compute Cloud VMs with GPUs

For Nebius AI virtual machines in GPU clusters, the device topology differs from the default bare-metal one. As a result, the NCCL tests from NVIDIA might be unstable in a multi-host environment.

If you are using Slurm, set up a Slurm-managed cluster by following the dedicated tutorial. The configuration described there already includes the topology patch.

If you are using mpirun, apply the Nebius AI topology to your VMs, or to the containers you run on them, so that the NCCL tests described in the dedicated guide run stably and workload performance improves. You can do this in one of these ways:

Set up the topology

Update NCCL version

To set up the topology, download the nccl-topo-h100-v1.xml file from the Nebius AI repository.
Then set the path to the downloaded file in the NCCL_TOPO_FILE environment variable. For example:
```
export NCCL_TOPO_FILE=/opt/nebius/nccl-topo-h100-v1.xml
```

This is the preferred method because it does not depend on NCCL updates.
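Because a wrong path in NCCL_TOPO_FILE is easy to miss, it can help to check the file before exporting the variable. The following is a minimal sketch, not part of the Nebius AI tooling: the helper name set_nccl_topo is hypothetical, and the default path is simply the example path used above.

```shell
#!/bin/sh
# Hypothetical helper: export NCCL_TOPO_FILE only if the file is readable.
# The default path is the example path from this guide; pass your own as $1.
set_nccl_topo() {
    topo_file="${1:-/opt/nebius/nccl-topo-h100-v1.xml}"
    if [ ! -r "$topo_file" ]; then
        echo "Topology file not found or not readable: $topo_file" >&2
        return 1
    fi
    export NCCL_TOPO_FILE="$topo_file"
    echo "NCCL_TOPO_FILE=$NCCL_TOPO_FILE"
}
```

Call set_nccl_topo in the shell that launches the job. If your mpirun comes from Open MPI (an assumption; other MPI implementations use different options), passing -x NCCL_TOPO_FILE forwards the variable to the processes on remote hosts. The file must then exist at the same path on every VM or in every container.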

To update the NCCL version, install NCCL 2.19.4 or later on every VM and in every container.
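To confirm that a given VM or container meets the 2.19.4 minimum, you can compare version strings in the shell. This is a minimal sketch under some assumptions: the helper name nccl_version_ok is hypothetical, it relies on sort -V from GNU coreutils, and you obtain the installed version string yourself (for example, from your package manager, or from the NCCL startup output when NCCL_DEBUG=VERSION is set).

```shell
#!/bin/sh
# Hypothetical helper: succeed if version $1 is at least version $2
# (default 2.19.4). Relies on GNU sort -V for version-aware ordering.
nccl_version_ok() {
    required="${2:-2.19.4}"
    # The lower of the two versions sorts first; if that is the required
    # version, the installed version is greater than or equal to it.
    lowest="$(printf '%s\n%s\n' "$1" "$required" | sort -V | head -n 1)"
    [ "$lowest" = "$required" ]
}
```

For example, nccl_version_ok "2.18.3" || echo "Update NCCL" prints the reminder, while nccl_version_ok "2.20.5" succeeds silently.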


See also