Testing inter-GPU connection for Compute Cloud VMs
This guide explains how to test the GPU interconnection between Compute Cloud VMs in a GPU cluster by running NCCL tests developed by NVIDIA.
For details about testing the GPU interconnection in a Managed Service for Kubernetes cluster with an attached GPU cluster, see Setting up and testing InfiniBand connection between GPUs in a Managed Service for Kubernetes cluster.
- Apply the Nebius AI topology to all VMs in the cluster.
- For each VM in the GPU cluster:
  - Install the Open MPI library on the VM:

    ```bash
    sudo apt-get install openmpi-bin
    ```
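    Optionally, you can confirm that the installation succeeded by checking that `mpirun` is available:

    ```bash
    # Prints the installed Open MPI version if the package is on PATH.
    mpirun --version
    ```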
- Choose one of the VMs as the main VM; you will run the tests from it. Build the tests on the main VM:
  - Clone the NVIDIA repository with the tests:

    ```bash
    git clone https://github.com/NVIDIA/nccl-tests
    ```

  - Build the tests with Open MPI:

    ```bash
    cd nccl-tests
    MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi MPI=1 make
    ```

  - Copy the built binary file, `all_reduce_perf`, to the same directory on the other VMs (for example, as in the sketch below).
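    A minimal way to distribute the binary over SSH, assuming access between the VMs is already configured (see the next step); the addresses are placeholders:

    ```bash
    # Hypothetical list of the other VMs' addresses; replace with your own.
    for host in <IP_address_2> <IP_address_3> <IP_address_4>; do
      ssh "$host" "mkdir -p ~/nccl-tests/build"
      scp ~/nccl-tests/build/all_reduce_perf "$host":~/nccl-tests/build/
    done
    ```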
- Set up SSH connectivity between the VMs in the cluster:
  - On the main VM, generate an SSH key pair without a passphrase:

    ```bash
    ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -N ""
    ```

  - Copy the generated pair, `~/.ssh/id_ed25519` and `~/.ssh/id_ed25519.pub`, to the same directory on each of the other VMs.
  - On all other VMs, add the public key from the pair to the list of authorized keys:

    ```bash
    cat ~/.ssh/id_ed25519.pub >> ~/.ssh/authorized_keys
    ```

  For more details, see the Open MPI documentation.
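  To verify the connectivity, you can check that the main VM reaches each of the other VMs without a password prompt (a sketch; the address is a placeholder):

  ```bash
  # Should print the remote VM's hostname without asking for a password.
  ssh -o BatchMode=yes <IP_address_2> hostname
  ```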
- Run the tests from the main VM with the `mpirun` command:

  ```bash
  mpirun --host <IP_address_1>:8,<IP_address_2>:8,<IP_address_3>:8,<IP_address_4>:8 \
    --allow-run-as-root -np 32 \
    -mca pml ucx \
    -x NCCL_TOPO_FILE=$NCCL_TOPO_FILE \
    ~/nccl-tests/build/all_reduce_perf -b 512M -e 8G -f 2 -g 1
  ```

  Where:
  - `<IP_address_1>` to `<IP_address_4>`: IP addresses of the VMs on which you want to run the test.
  - `:8`: Number of GPUs on the VM.
  - `-mca pml ucx`: Instruction for MPI communications to go through InfiniBand using UCX. To use Ethernet instead, replace this option with `-mca btl_tcp_if_include eth0`. This does not affect the InfiniBand data exchanges of the test itself.
  - `~/nccl-tests/build/all_reduce_perf`: Path to the test binary, which must be available at the same location on all VMs.
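  The command passes `$NCCL_TOPO_FILE` to the worker processes, so the variable must be set in the shell on the main VM. A minimal sketch, assuming the applied topology file is at the same path as in the Slurm script below; adjust the path if your setup differs:

  ```bash
  # Path to the NCCL topology file applied to the VMs (assumed location).
  export NCCL_TOPO_FILE=/etc/nccl-topo-h100-v1.xml
  ```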
- If the VMs are combined into a Slurm cluster, you can run the tests as a Slurm job instead. Create an `nccl.sbatch` file that contains the script with the topology patch:

  `nccl.sbatch`:

  ```bash
  #!/bin/bash
  ###
  #SBATCH --job-name=nccl_test
  #SBATCH --ntasks-per-node=8
  #SBATCH --gpus-per-node=8
  #SBATCH --time=10:00
  #SBATCH --output="%j.log"
  #SBATCH --exclusive

  # NCCL environment variables are documented at:
  # https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html

  #export NCCL_DEBUG=INFO
  export NCCL_SOCKET_IFNAME=eth0
  export NCCL_IB_HCA=mlx5
  export UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1
  export SHARP_COLL_ENABLE_PCI_RELAXED_ORDERING=1
  export NCCL_COLLNET_ENABLE=0
  export NCCL_TOPO_FILE=/etc/nccl-topo-h100-v1.xml

  # Relaxed ordering is fixed in NCCL 2.18.3+, but
  # in NCCL 2.18.1 and earlier it should be disabled
  # for H100s due to a bug. See:
  # https://docs.nvidia.com/deeplearning/nccl/archives/nccl_2181/release-notes/rel_2-18-1.html
  # export NCCL_IB_PCI_RELAXED_ORDERING=0

  # Log the assigned nodes
  echo "Using nodes: $SLURM_JOB_NODELIST"

  srun --container-image="cr.ai.nebius.cloud#examples/nccl-tests:latest" \
    --container-remap-root --no-container-mount-home --container-mounts=$NCCL_TOPO_FILE:$NCCL_TOPO_FILE \
    /opt/nccl_tests/build/all_reduce_perf -b 512M -e 8G -f 2 -g 1 $@
  ```
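  The job mounts `$NCCL_TOPO_FILE` into the container, so the topology file must exist on every node. A quick check, reusing the `<number_of_VMs>` placeholder from the next step:

  ```bash
  # Runs one task per node and lists the topology file on each of them.
  srun -N <number_of_VMs> ls -l /etc/nccl-topo-h100-v1.xml
  ```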
- Run the script:

  ```bash
  sbatch -n <number_of_VMs> nccl.sbatch
  ```

  To check the job status and get its logs, run:

  ```bash
  scontrol show job
  ```

  The path to the log file is shown in the `StdOut` field.
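  For example, to print only the log path of a specific job (the job ID is the one returned by `sbatch`):

  ```bash
  # Replace <job_ID> with the ID printed when the job was submitted.
  scontrol show job <job_ID> | grep StdOut
  ```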
In the output, check the average bus bandwidth. If its value is higher than 300 GB/s, the connection is stable.
Example:

```
...
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
   536870912     134217728     float     sum      -1   3674.4  146.11  283.09      0   3648.4  147.15  285.11      0
  1073741824     268435456     float     sum      -1   6411.6  167.47  324.47      0   6416.7  167.33  324.21      0
  2147483648     536870912     float     sum      -1    12735  168.62  326.71      0    12979  165.45  320.57      0
  4294967296    1073741824     float     sum      -1    25389  169.17  327.76      0    25598  167.79  325.09      0
  8589934592    2147483648     float     sum      -1    50979  168.50  326.47      0    50799  169.10  327.63      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 317.11
```
The average bus bandwidth is not equal to the InfiniBand bandwidth, because some of the NCCL operations the test measures go over NVLink. Nevertheless, it gives an accurate estimate of the interconnect quality.
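As a quick check, you can extract the average bus bandwidth from a Slurm job log and compare it with the 300 GB/s threshold mentioned above (a sketch; the log file name is a placeholder taken from the `StdOut` field, and with `mpirun` the same line is printed directly to the terminal):

```bash
# <job_ID>.log is the file produced by "#SBATCH --output" in nccl.sbatch.
busbw=$(grep "Avg bus bandwidth" <job_ID>.log | awk '{print $NF}')
awk -v bw="$busbw" 'BEGIN { exit !(bw > 300) }' \
  && echo "OK: average bus bandwidth ${busbw} GB/s" \
  || echo "Below threshold: ${busbw} GB/s"
```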