Setting up and testing InfiniBand connection between GPUs in a Managed Service for Kubernetes cluster
To boost the performance of high-performance computing (HPC) and AI workloads that you run in a Managed Service for Kubernetes cluster, you can interconnect the GPUs on its nodes directly using NVIDIA InfiniBand.
In this tutorial, you will create a Managed Service for Kubernetes cluster with GPUs interconnected using InfiniBand, install operators and drivers from NVIDIA on it, and run NVIDIA NCCL tests to check InfiniBand performance.
Getting started
- If you don't have the Nebius AI command line interface yet, install and initialize it.

  The folder specified in the CLI profile is used by default. You can specify a different folder using the `--folder-name` or `--folder-id` parameter.

- Create a Managed Service for Kubernetes cluster with any suitable configuration.

- Create a node group. Requirements:

  - Select a VM platform with GPUs that is supported by GPU clusters. See the list of supported platforms.
  - Select the GPU cluster that you created.
  - Select the Do not install GPU drivers checkbox to avoid conflicts between the pre-installed drivers and the operators you will install in this tutorial.

- Install kubectl and configure it to work with the created cluster. A quick connectivity check is sketched below.
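  To confirm that `kubectl` points at the new cluster and that the GPU nodes have registered, you can list the nodes (an optional sanity check, not part of the original steps):

  ```bash
  # Show the active kubectl context and the cluster's nodes
  kubectl config current-context
  kubectl get nodes -o wide
  ```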
Install NVIDIA operators
- Install Helm v3.7.0 or higher.

- Install the NVIDIA Network Operator:

  - Add the NVIDIA NGC chart repository to Helm:

    ```bash
    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
    helm repo update
    ```
  - Create `network-operator-values.yaml` with the following contents:

    ```yaml
    nfd:
      enabled: true
    # NicClusterPolicy CR values:
    deployCR: true
    ofedDriver:
      deploy: true
      version: 23.04-0.5.3.3.1
    rdmaSharedDevicePlugin:
      deploy: false
    secondaryNetwork:
      deploy: false
      multus:
        deploy: false
      ipoib:
        deploy: false
      ipamPlugin:
        deploy: false
    ```
  - Download and install the Helm chart:

    ```bash
    helm install network-operator nvidia/network-operator \
      -n nvidia-network-operator \
      --create-namespace \
      --version v23.7.0 \
      -f ./network-operator-values.yaml \
      --wait
    ```
- Download and install the NVIDIA GPU Operator Helm chart from Nebius AI Marketplace:

  ```bash
  helm pull oci://cr.ai.nebius.cloud/yc-marketplace/nebius/gpu-operator/chart/gpu-operator \
    --version v23.9.0

  helm install gpu-operator ./gpu-operator-v23.9.0.tgz \
    --namespace nvidia-gpu-operator \
    --create-namespace \
    --set driver.upgradePolicy.autoUpgrade=false \
    --set driver.rdma.enabled=true \
    --wait
  ```
  These commands install NVIDIA driver version 535.104.12 with GPUDirect RDMA support and the NVIDIA Container Toolkit.
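  To confirm that both charts were deployed, you can list the Helm releases in their namespaces (an optional check, not part of the original steps; the release names match the ones used above):

  ```bash
  # Both releases should show STATUS: deployed
  helm list -n nvidia-network-operator
  helm list -n nvidia-gpu-operator
  ```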
Check that the operators and drivers are installed
- Check that the operators' pods are running:

  - For the NVIDIA Network Operator:

    ```bash
    kubectl get pod -n nvidia-network-operator
    ```

  - For the NVIDIA GPU Operator:

    ```bash
    kubectl get pod -n nvidia-gpu-operator
    ```

  Note

  With 2 nodes, the pods take around 10–15 minutes to start running; the more nodes, the longer it takes.
- Check that the InfiniBand drivers are installed by the NVIDIA Network Operator:

  How to check the InfiniBand drivers

  - Check the logs of the `mofed-ubuntu20.04` pod:

    ```bash
    DRIVERS_POD_NAME="$(kubectl get pods --namespace nvidia-network-operator | grep mofed | head -1 | awk '{print $1}')"
    kubectl --namespace nvidia-network-operator logs "${DRIVERS_POD_NAME}"
    ```

    The logs should contain messages about successful driver installation, such as:

    ```text
    ...
    + ofed_info -s
    MLNX_OFED_LINUX-23.04-0.5.3.3:
    + [[ 0 -ne 0 ]]
    + mount_rootfs
    + echo 'Mounting Mellanox OFED driver container rootfs...'
    Mounting Mellanox OFED driver container rootfs...
    ...
    + set_driver_readiness
    + touch /.driver-ready
    + trap 'echo '\''Caught signal'\''; exit 1' HUP INT QUIT PIPE TERM
    + trap handle_signal EXIT
    + wait
    + sleep infinity
    ```
  - Check that the driver version installed on the nodes is 23.04-0.5.3:

    ```bash
    for node in $(kubectl get nodes -o name); do
      kubectl debug $node -it --image=busybox:1.27.2 -- cat /sys/module/mlx5_core/version
    done
    ```
- Check that the GPU drivers are installed by the NVIDIA GPU Operator:

  How to check the GPU drivers

  - Check the logs of the `nvidia-driver-daemonset` pod:

    ```bash
    DRIVERS_POD_NAME="$(kubectl get pods --namespace nvidia-gpu-operator | grep nvidia-driver-daemonset | head -1 | awk '{print $1}')"
    kubectl --namespace nvidia-gpu-operator logs "${DRIVERS_POD_NAME}"
    ```

    The logs should contain messages about successful driver installation, such as:

    ```text
    ...
    Installation of the kernel module for the NVIDIA Accelerated Graphics Driver for Linux-x86_64 (version: 535.104.12) is now complete.
    Parsing kernel module parameters...
    Loading ipmi and i2c_core kernel modules...
    Loading NVIDIA driver kernel modules...
    + modprobe nvidia
    + modprobe nvidia-uvm
    + modprobe nvidia-modeset
    + set +o xtrace -o nounset
    Mellanox device found at 0000:8c:00.0
    Loading NVIDIA Peer Memory kernel module...
    + modprobe nvidia-peermem
    + set +o xtrace -o nounset
    Starting NVIDIA persistence daemon...
    Mounting NVIDIA driver rootfs...
    Done, now waiting for signal
    ```
  - Check that the driver version installed on the nodes is 535.104.12:

    ```bash
    for node in $(kubectl get nodes -o name); do
      kubectl debug $node -it --image=busybox:1.27.2 -- cat /proc/driver/nvidia/version
    done
    ```
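- Optionally, confirm that each GPU node now advertises its GPUs as an allocatable Kubernetes resource (an extra check suggested here, not part of the original steps; the column name is arbitrary):

  ```bash
  # Each GPU node should report its GPUs (for example, 8) as allocatable
  kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
  ```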
Run NCCL tests
1. Install the Kubeflow Training Operator:

   ```bash
   kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.5.0"
   ```
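   Before moving on, you can check that the operator pod is up (an optional check; the standalone overlay deploys it into the `kubeflow` namespace):

   ```bash
   # The training-operator pod should be Running
   kubectl get pods -n kubeflow
   ```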
2. Create a namespace for your tests, for example, `nccl-test`:

   ```bash
   kubectl create ns nccl-test
   ```
3. Apply the device topology from Nebius AI in the `nccl-test` namespace. Use `topo-config` as the name of the topology `ConfigMap`, as instructed in the linked guide.
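   If you have the topology file locally, creating the `ConfigMap` might look like this (a sketch only: the file name is assumed to match the `subPath` used in the `MPIJob` below, and the authoritative steps are in the linked guide):

   ```bash
   # Hypothetical example: create the topology ConfigMap from a local file
   kubectl create configmap topo-config \
     --from-file=nccl-topo-h100-v1.xml \
     -n nccl-test
   ```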
4. Create `nccl-test.yaml` with an `MPIJob` for your tests. This example is for 2 nodes:

   ```yaml
   apiVersion: kubeflow.org/v1
   kind: MPIJob
   metadata:
     name: nccl-test-h100
   spec:
     slotsPerWorker: 8  # Number of GPUs on each node
     mpiReplicaSpecs:
       Launcher:
         replicas: 1
         template:
           spec:
             containers:
               - args:
                   # In `-np 16`, 16 is the total number of GPUs on all nodes
                   # (`.spec.slotsPerWorker` × `.spec.mpiReplicaSpecs.Worker.replicas`)
                   - 'mpirun -np 16 -bind-to none -x LD_LIBRARY_PATH -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_IB_HCA=mlx5 -x UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1 -x SHARP_COLL_ENABLE_PCI_RELAXED_ORDERING=1 -x NCCL_COLLNET_ENABLE=0 /opt/nccl_tests/build/all_reduce_perf -b 512M -e 8G -f 2 -g 1'
                 command:
                   - /bin/bash
                   - -c
                 env:
                   - name: OMPI_ALLOW_RUN_AS_ROOT
                     value: "1"
                   - name: OMPI_ALLOW_RUN_AS_ROOT_CONFIRM
                     value: "1"
                 image: cr.ai.nebius.cloud/examples/nccl-tests:latest
                 name: nccl
                 resources:
                   requests:
                     cpu: 2
                     memory: 1208Mi
                 securityContext:
                   privileged: true
             initContainers:
               - command:
                   - sh
                   - -c
                   - ulimit -Hl unlimited && ulimit -Sl unlimited
                 image: busybox:1.27.2
                 name: init-limit
                 securityContext:
                   privileged: true
       Worker:
         replicas: 2  # Number of nodes
         template:
           spec:
             automountServiceAccountToken: false
             containers:
               - env:
                   - name: NCCL_TOPO_FILE
                     value: /etc/nccl-topo-h100-v1.xml
                 image: cr.ai.nebius.cloud/examples/nccl-tests:latest
                 name: nccl
                 resources:
                   limits:
                     cpu: 150
                     memory: 1200G
                     nvidia.com/gpu: 8
                   requests:
                     cpu: 150
                     memory: 1200G
                     nvidia.com/gpu: 8
                 securityContext:
                   privileged: true
                 volumeMounts:
                   - mountPath: /dev/shm
                     name: dshm
                   - mountPath: /etc/nccl-topo-h100-v1.xml
                     name: config
                     subPath: nccl-topo-h100-v1.xml
             enableServiceLinks: false
             initContainers:
               - command:
                   - sh
                   - -c
                   - ulimit -Hl unlimited && ulimit -Sl unlimited
                 image: busybox:1.27.2
                 name: init-limit
                 securityContext:
                   privileged: true
             volumes:
               - emptyDir:
                   medium: Memory
                 name: dshm
               - configMap:
                   name: topo-config
                 name: config
     runPolicy:
       cleanPodPolicy: Running
   ```
   Note

   The example uses NCCL 2.19.4. Use this or a newer version in your tests and workloads. With older versions, you may encounter errors such as `ring N does not loop back to start`. For details, see the issue in the NCCL GitHub repository.
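   If you run the test on a different number of nodes, keep the `-np` value in the `mpirun` command equal to `slotsPerWorker` × `Worker.replicas`. For example, for 4 nodes with 8 GPUs each, set `Worker.replicas: 4` and pass `-np 32` (8 × 4 = 32) to `mpirun`.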
Deploy the
MPIJob
innccl-test
:kubectl apply -f nccl-test.yaml -n nccl-test
6. Check that the test pods are running:

   ```bash
   kubectl get pods -n nccl-test
   ```

   Result:

   ```text
   NAME                      READY   STATUS    RESTARTS   AGE
   nccl-test-h100-launcher   1/1     Running   0          24s
   nccl-test-h100-worker-0   1/1     Running   0          24s
   nccl-test-h100-worker-1   1/1     Running   0          24s
   ```
7. Check the test logs:

   ```bash
   kubectl logs -f nccl-test-h100-launcher -n nccl-test | grep -v "NCCL INFO"
   ```

   In the result, check the average bus bandwidth. If its value is higher than 300 GB/s, the connection is stable.

   Example:

   ```text
   ...
   #                                                              out-of-place                       in-place
   #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
   #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
      536870912     134217728     float     sum      -1   3674.4  146.11  283.09      0   3648.4  147.15  285.11      0
     1073741824     268435456     float     sum      -1   6411.6  167.47  324.47      0   6416.7  167.33  324.21      0
     2147483648     536870912     float     sum      -1    12735  168.62  326.71      0    12979  165.45  320.57      0
     4294967296    1073741824     float     sum      -1    25389  169.17  327.76      0    25598  167.79  325.09      0
     8589934592    2147483648     float     sum      -1    50979  168.50  326.47      0    50799  169.10  327.63      0
   # Out of bounds values : 0 OK
   # Avg bus bandwidth    : 317.11
   ```

   The average bus bandwidth is not equal to the InfiniBand bandwidth, because some of the NCCL operations it measures use NVLink. Nevertheless, it gives an accurate estimate of the quality of the InfiniBand connection.
8. Delete the `MPIJob`:

   ```bash
   kubectl delete -f nccl-test.yaml -n nccl-test
   ```

   Note

   You should delete the `MPIJob` even if you want to run another test. In this case, redeploy the `MPIJob` as described in steps 4–5.