Testing inter-GPU connection for Compute Cloud VMs
This guide explains how to test the GPU interconnection between Compute Cloud VMs in a GPU cluster by running NCCL tests developed by NVIDIA.
For details about testing the GPU interconnection in a Managed Service for Kubernetes cluster with an attached GPU cluster, see Setting up and testing InfiniBand connection between GPUs in a Managed Service for Kubernetes cluster.
- Apply the Nebius AI topology to all VMs in the cluster.
- For each VM in the GPU cluster:
  - Install the Open MPI library on the VM:

    ```bash
    sudo apt-get install openmpi-bin
    ```
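    Optionally, you can confirm that the installation succeeded by checking that `mpirun` is available:

    ```bash
    # Prints the installed Open MPI version if the package is on PATH.
    mpirun --version
    ```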
- Choose one of the VMs as the main VM; you will run the tests from it. Build the tests on the main VM:
  - Clone the NVIDIA repository with the tests:

    ```bash
    git clone https://github.com/NVIDIA/nccl-tests
    ```

  - Build the tests with Open MPI:

    ```bash
    cd nccl-tests
    MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi MPI=1 make
    ```

  - Copy the built binary file, `all_reduce_perf`, to the same directory on the other VMs (for example, as in the sketch below).
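    A minimal way to distribute the binary over SSH, assuming access between the VMs is already configured (see the next step); the addresses are placeholders:

    ```bash
    # Hypothetical list of the other VMs' addresses; replace with your own.
    for host in <IP_address_2> <IP_address_3> <IP_address_4>; do
      ssh "$host" "mkdir -p ~/nccl-tests/build"
      scp ~/nccl-tests/build/all_reduce_perf "$host":~/nccl-tests/build/
    done
    ```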
- Set up SSH connectivity between the VMs in the cluster:
  - On the main VM, generate an SSH key pair without a passphrase:

    ```bash
    ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -N ""
    ```

  - Copy the generated pair, `~/.ssh/id_ed25519` and `~/.ssh/id_ed25519.pub`, to the same directory on each of the other VMs.
  - On all other VMs, add the public key from the pair to the list of authorized keys:

    ```bash
    cat ~/.ssh/id_ed25519.pub >> ~/.ssh/authorized_keys
    ```

  For more details, see the Open MPI documentation.
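  To verify the connectivity, you can check that the main VM reaches each of the other VMs without a password prompt (a sketch; the address is a placeholder):

  ```bash
  # Should print the remote VM's hostname without asking for a password.
  ssh -o BatchMode=yes <IP_address_2> hostname
  ```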
- Run the tests from the main VM with the `mpirun` command:

  ```bash
  mpirun --host <IP_address_1>:8,<IP_address_2>:8,<IP_address_3>:8,<IP_address_4>:8 \
    --allow-run-as-root -np 32 \
    -mca pml ucx \
    -x NCCL_TOPO_FILE=$NCCL_TOPO_FILE \
    ~/nccl-tests/build/all_reduce_perf -b 512M -e 8G -f 2 -g 1
  ```

  Where:
  - `<IP_address_1>` to `<IP_address_4>`: IP addresses of the VMs on which you want to run the test.
  - `:8`: Number of GPUs on the VM.
  - `-mca pml ucx`: Instruction for MPI communications to go through InfiniBand using UCX. To use Ethernet instead, replace this option with `-mca btl_tcp_if_include eth0`. This does not affect the InfiniBand data exchanges of the test itself.
  - `~/nccl-tests/build/all_reduce_perf`: Path to the test binary, which must be available at the same location on all VMs.
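  The command passes `$NCCL_TOPO_FILE` to the worker processes, so the variable must be set in the shell on the main VM. A minimal sketch, assuming the applied topology file is at the same path as in the Slurm script below; adjust the path if your setup differs:

  ```bash
  # Path to the NCCL topology file applied to the VMs (assumed location).
  export NCCL_TOPO_FILE=/etc/nccl-topo-h100-v1.xml
  ```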
- If the VMs are combined into a Slurm cluster, you can run the tests as a Slurm job instead. Create an `nccl.sbatch` file that contains the script with the topology patch:

  `nccl.sbatch`:

  ```bash
  #!/bin/bash
  ###
  #SBATCH --job-name=nccl_test
  #SBATCH --ntasks-per-node=8
  #SBATCH --gpus-per-node=8
  #SBATCH --time=10:00
  #SBATCH --output="%j.log"
  #SBATCH --exclusive

  # NCCL environment variables are documented at:
  # https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html

  #export NCCL_DEBUG=INFO
  export NCCL_SOCKET_IFNAME=eth0
  export NCCL_IB_HCA=mlx5
  export UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1
  export SHARP_COLL_ENABLE_PCI_RELAXED_ORDERING=1
  export NCCL_COLLNET_ENABLE=0
  export NCCL_TOPO_FILE=/etc/nccl-topo-h100-v1.xml

  # Relaxed ordering is fixed in NCCL 2.18.3+, but
  # in NCCL 2.18.1 and earlier it should be disabled
  # for H100s due to a bug. See:
  # https://docs.nvidia.com/deeplearning/nccl/archives/nccl_2181/release-notes/rel_2-18-1.html
  # export NCCL_IB_PCI_RELAXED_ORDERING=0

  # Log the assigned nodes
  echo "Using nodes: $SLURM_JOB_NODELIST"

  srun --container-image="cr.ai.nebius.cloud#examples/nccl-tests:latest" \
    --container-remap-root --no-container-mount-home --container-mounts=$NCCL_TOPO_FILE:$NCCL_TOPO_FILE \
    /opt/nccl_tests/build/all_reduce_perf -b 512M -e 8G -f 2 -g 1 $@
  ```
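  The job mounts `$NCCL_TOPO_FILE` into the container, so the topology file must exist on every node. A quick check, reusing the `<number_of_VMs>` placeholder from the next step:

  ```bash
  # Runs one task per node and lists the topology file on each of them.
  srun -N <number_of_VMs> ls -l /etc/nccl-topo-h100-v1.xml
  ```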
- Run the script:

  ```bash
  sbatch -n <number_of_VMs> nccl.sbatch
  ```

  To check the job status and get its logs, run:

  ```bash
  scontrol show job
  ```

  The path to the log file is shown in the `StdOut` field.
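  For example, to print only the log path of a specific job (the job ID is the one returned by `sbatch`):

  ```bash
  # Replace <job_ID> with the ID printed when the job was submitted.
  scontrol show job <job_ID> | grep StdOut
  ```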
In the output, check the average bus bandwidth. If its value is higher than 300 GB/s, the connection is stable.
Example:

```
...
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
   536870912     134217728     float     sum      -1   3674.4  146.11  283.09      0   3648.4  147.15  285.11      0
  1073741824     268435456     float     sum      -1   6411.6  167.47  324.47      0   6416.7  167.33  324.21      0
  2147483648     536870912     float     sum      -1    12735  168.62  326.71      0    12979  165.45  320.57      0
  4294967296    1073741824     float     sum      -1    25389  169.17  327.76      0    25598  167.79  325.09      0
  8589934592    2147483648     float     sum      -1    50979  168.50  326.47      0    50799  169.10  327.63      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 317.11
```
The average bus bandwidth is not equal to the InfiniBand bandwidth, because some of the NCCL operations the test measures go over NVLink. Nevertheless, it gives an accurate estimate of the interconnect quality.
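As a quick check, you can extract the average bus bandwidth from a Slurm job log and compare it with the 300 GB/s threshold mentioned above (a sketch; the log file name is a placeholder taken from the `StdOut` field, and with `mpirun` the same line is printed directly to the terminal):

```bash
# <job_ID>.log is the file produced by "#SBATCH --output" in nccl.sbatch.
busbw=$(grep "Avg bus bandwidth" <job_ID>.log | awk '{print $NF}')
awk -v bw="$busbw" 'BEGIN { exit !(bw > 300) }' \
  && echo "OK: average bus bandwidth ${busbw} GB/s" \
  || echo "Below threshold: ${busbw} GB/s"
```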