Setting up and testing InfiniBand connection between GPUs in a Managed Service for Kubernetes cluster
To boost the performance of high-performance computing (HPC) and AI workloads that you run in a Managed Service for Kubernetes cluster, you can interconnect the GPUs on its nodes directly using NVIDIA InfiniBand.
In this tutorial, you will create a Managed Service for Kubernetes cluster with GPUs interconnected using InfiniBand, install operators and drivers from NVIDIA on it, and run NVIDIA NCCL tests to check InfiniBand performance.
Getting started
- If you don't have the Nebius AI command line interface yet, install and initialize it.

  The folder specified in the CLI profile is used by default. You can specify a different folder using the `--folder-name` or `--folder-id` parameter.

- Create a Managed Service for Kubernetes cluster with any suitable configuration.

- Create a node group. Requirements:

  - Select a VM platform with GPUs that is supported by GPU clusters. See the list of supported platforms.
  - Select the GPU cluster that you created.
  - Select the Do not install GPU drivers checkbox to avoid conflicts between the pre-installed drivers and the operators you will install in this tutorial.

- Install kubectl and configure it to work with the created cluster. A quick connectivity check is sketched below.
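  To confirm that `kubectl` points at the new cluster and that the GPU nodes have registered, you can list the nodes (an optional sanity check, not part of the original steps):

  ```bash
  # Show the active kubectl context and the cluster's nodes
  kubectl config current-context
  kubectl get nodes -o wide
  ```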
Install NVIDIA operators
- Install Helm v3.7.0 or higher.

- Install the NVIDIA Network Operator:

  - Add the NVIDIA NGC chart repository to Helm:

    ```bash
    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
    helm repo update
    ```
  - Create `network-operator-values.yaml` with the following contents:

    ```yaml
    nfd:
      enabled: true
    # NicClusterPolicy CR values:
    deployCR: true
    ofedDriver:
      deploy: true
      version: 23.04-0.5.3.3.1
    rdmaSharedDevicePlugin:
      deploy: false
    secondaryNetwork:
      deploy: false
      multus:
        deploy: false
      ipoib:
        deploy: false
      ipamPlugin:
        deploy: false
    ```
  - Download and install the Helm chart:

    ```bash
    helm install network-operator nvidia/network-operator \
      -n nvidia-network-operator \
      --create-namespace \
      --version v23.7.0 \
      -f ./network-operator-values.yaml \
      --wait
    ```
- Download and install the NVIDIA GPU Operator Helm chart from Nebius AI Marketplace:

  ```bash
  helm pull oci://cr.ai.nebius.cloud/yc-marketplace/nebius/gpu-operator/chart/gpu-operator \
    --version v23.9.0

  helm install gpu-operator ./gpu-operator-v23.9.0.tgz \
    --namespace nvidia-gpu-operator \
    --create-namespace \
    --set driver.upgradePolicy.autoUpgrade=false \
    --set driver.rdma.enabled=true \
    --wait
  ```
  These commands install NVIDIA driver version 535.104.12 with GPUDirect RDMA support and the NVIDIA Container Toolkit.
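  To confirm that both charts were deployed, you can list the Helm releases in their namespaces (an optional check, not part of the original steps; the release names match the ones used above):

  ```bash
  # Both releases should show STATUS: deployed
  helm list -n nvidia-network-operator
  helm list -n nvidia-gpu-operator
  ```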
Check that the operators and drivers are installed
- Check that the operators' pods are running:

  - For the NVIDIA Network Operator:

    ```bash
    kubectl get pod -n nvidia-network-operator
    ```

  - For the NVIDIA GPU Operator:

    ```bash
    kubectl get pod -n nvidia-gpu-operator
    ```

  Note

  With 2 nodes, the pods take around 10–15 minutes to start running; the more nodes, the longer it takes.
- Check that the InfiniBand drivers are installed by the NVIDIA Network Operator:

  How to check the InfiniBand drivers

  - Check the logs of the `mofed-ubuntu20.04` pod:

    ```bash
    DRIVERS_POD_NAME="$(kubectl get pods --namespace nvidia-network-operator | grep mofed | head -1 | awk '{print $1}')"
    kubectl --namespace nvidia-network-operator logs "${DRIVERS_POD_NAME}"
    ```

    The logs should contain messages about successful driver installation, such as:

    ```text
    ...
    + ofed_info -s
    MLNX_OFED_LINUX-23.04-0.5.3.3:
    + [[ 0 -ne 0 ]]
    + mount_rootfs
    + echo 'Mounting Mellanox OFED driver container rootfs...'
    Mounting Mellanox OFED driver container rootfs...
    ...
    + set_driver_readiness
    + touch /.driver-ready
    + trap 'echo '\''Caught signal'\''; exit 1' HUP INT QUIT PIPE TERM
    + trap handle_signal EXIT
    + wait
    + sleep infinity
    ```
  - Check that the driver version installed on the nodes is 23.04-0.5.3:

    ```bash
    for node in $(kubectl get nodes -o name); do
      kubectl debug $node -it --image=busybox:1.27.2 -- cat /sys/module/mlx5_core/version
    done
    ```
- Check that the GPU drivers are installed by the NVIDIA GPU Operator:

  How to check the GPU drivers

  - Check the logs of the `nvidia-driver-daemonset` pod:

    ```bash
    DRIVERS_POD_NAME="$(kubectl get pods --namespace nvidia-gpu-operator | grep nvidia-driver-daemonset | head -1 | awk '{print $1}')"
    kubectl --namespace nvidia-gpu-operator logs "${DRIVERS_POD_NAME}"
    ```

    The logs should contain messages about successful driver installation, such as:

    ```text
    ...
    Installation of the kernel module for the NVIDIA Accelerated Graphics Driver for Linux-x86_64 (version: 535.104.12) is now complete.
    Parsing kernel module parameters...
    Loading ipmi and i2c_core kernel modules...
    Loading NVIDIA driver kernel modules...
    + modprobe nvidia
    + modprobe nvidia-uvm
    + modprobe nvidia-modeset
    + set +o xtrace -o nounset
    Mellanox device found at 0000:8c:00.0
    Loading NVIDIA Peer Memory kernel module...
    + modprobe nvidia-peermem
    + set +o xtrace -o nounset
    Starting NVIDIA persistence daemon...
    Mounting NVIDIA driver rootfs...
    Done, now waiting for signal
    ```
  - Check that the driver version installed on the nodes is 535.104.12:

    ```bash
    for node in $(kubectl get nodes -o name); do
      kubectl debug $node -it --image=busybox:1.27.2 -- cat /proc/driver/nvidia/version
    done
    ```
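- Optionally, confirm that each GPU node now advertises its GPUs as an allocatable Kubernetes resource (an extra check suggested here, not part of the original steps; the column name is arbitrary):

  ```bash
  # Each GPU node should report its GPUs (for example, 8) as allocatable
  kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
  ```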
Run NCCL tests
1. Install the Kubeflow Training Operator:

   ```bash
   kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.5.0"
   ```
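   Before moving on, you can check that the operator pod is up (an optional check; the standalone overlay deploys it into the `kubeflow` namespace):

   ```bash
   # The training-operator pod should be Running
   kubectl get pods -n kubeflow
   ```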
2. Create a namespace for your tests, for example, `nccl-test`:

   ```bash
   kubectl create ns nccl-test
   ```
3. Apply the device topology from Nebius AI in the `nccl-test` namespace. Use `topo-config` as the name of the topology `ConfigMap`, as instructed in the linked guide.
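   If you have the topology file locally, creating the `ConfigMap` might look like this (a sketch only: the file name is assumed to match the `subPath` used in the `MPIJob` below, and the authoritative steps are in the linked guide):

   ```bash
   # Hypothetical example: create the topology ConfigMap from a local file
   kubectl create configmap topo-config \
     --from-file=nccl-topo-h100-v1.xml \
     -n nccl-test
   ```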
4. Create `nccl-test.yaml` with an `MPIJob` for your tests. This example is for 2 nodes:

   ```yaml
   apiVersion: kubeflow.org/v1
   kind: MPIJob
   metadata:
     name: nccl-test-h100
   spec:
     slotsPerWorker: 8  # Number of GPUs on each node
     mpiReplicaSpecs:
       Launcher:
         replicas: 1
         template:
           spec:
             containers:
               - args:
                   # In `-np 16`, 16 is the total number of GPUs on all nodes
                   # (`.spec.slotsPerWorker` × `.spec.mpiReplicaSpecs.Worker.replicas`)
                   - 'mpirun -np 16 -bind-to none -x LD_LIBRARY_PATH -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_IB_HCA=mlx5 -x UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1 -x SHARP_COLL_ENABLE_PCI_RELAXED_ORDERING=1 -x NCCL_COLLNET_ENABLE=0 /opt/nccl_tests/build/all_reduce_perf -b 512M -e 8G -f 2 -g 1'
                 command:
                   - /bin/bash
                   - -c
                 env:
                   - name: OMPI_ALLOW_RUN_AS_ROOT
                     value: "1"
                   - name: OMPI_ALLOW_RUN_AS_ROOT_CONFIRM
                     value: "1"
                 image: cr.ai.nebius.cloud/examples/nccl-tests:latest
                 name: nccl
                 resources:
                   requests:
                     cpu: 2
                     memory: 1208Mi
                 securityContext:
                   privileged: true
             initContainers:
               - command:
                   - sh
                   - -c
                   - ulimit -Hl unlimited && ulimit -Sl unlimited
                 image: busybox:1.27.2
                 name: init-limit
                 securityContext:
                   privileged: true
       Worker:
         replicas: 2  # Number of nodes
         template:
           spec:
             automountServiceAccountToken: false
             containers:
               - env:
                   - name: NCCL_TOPO_FILE
                     value: /etc/nccl-topo-h100-v1.xml
                 image: cr.ai.nebius.cloud/examples/nccl-tests:latest
                 name: nccl
                 resources:
                   limits:
                     cpu: 150
                     memory: 1200G
                     nvidia.com/gpu: 8
                   requests:
                     cpu: 150
                     memory: 1200G
                     nvidia.com/gpu: 8
                 securityContext:
                   privileged: true
                 volumeMounts:
                   - mountPath: /dev/shm
                     name: dshm
                   - mountPath: /etc/nccl-topo-h100-v1.xml
                     name: config
                     subPath: nccl-topo-h100-v1.xml
             enableServiceLinks: false
             initContainers:
               - command:
                   - sh
                   - -c
                   - ulimit -Hl unlimited && ulimit -Sl unlimited
                 image: busybox:1.27.2
                 name: init-limit
                 securityContext:
                   privileged: true
             volumes:
               - emptyDir:
                   medium: Memory
                 name: dshm
               - configMap:
                   name: topo-config
                 name: config
     runPolicy:
       cleanPodPolicy: Running
   ```
   Note

   The example uses NCCL 2.19.4. Use this or a newer version in your tests and workloads. With older versions, you may encounter errors such as `ring N does not loop back to start`. For details, see the issue in the NCCL GitHub repository.
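   If you run the test on a different number of nodes, keep the `-np` value in the `mpirun` command equal to `slotsPerWorker` × `Worker.replicas`. For example, for 4 nodes with 8 GPUs each, set `Worker.replicas: 4` and pass `-np 32` (8 × 4 = 32) to `mpirun`.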
Deploy the
MPIJob
innccl-test
:kubectl apply -f nccl-test.yaml -n nccl-test
6. Check that the test pods are running:

   ```bash
   kubectl get pods -n nccl-test
   ```

   Result:

   ```text
   NAME                      READY   STATUS    RESTARTS   AGE
   nccl-test-h100-launcher   1/1     Running   0          24s
   nccl-test-h100-worker-0   1/1     Running   0          24s
   nccl-test-h100-worker-1   1/1     Running   0          24s
   ```
7. Check the test logs:

   ```bash
   kubectl logs -f nccl-test-h100-launcher -n nccl-test | grep -v "NCCL INFO"
   ```

   In the result, check the average bus bandwidth. If its value is higher than 300 GB/s, the connection is stable.

   Example:

   ```text
   ...
   #                                                              out-of-place                       in-place
   #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
   #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
      536870912     134217728     float     sum      -1   3674.4  146.11  283.09      0   3648.4  147.15  285.11      0
     1073741824     268435456     float     sum      -1   6411.6  167.47  324.47      0   6416.7  167.33  324.21      0
     2147483648     536870912     float     sum      -1    12735  168.62  326.71      0    12979  165.45  320.57      0
     4294967296    1073741824     float     sum      -1    25389  169.17  327.76      0    25598  167.79  325.09      0
     8589934592    2147483648     float     sum      -1    50979  168.50  326.47      0    50799  169.10  327.63      0
   # Out of bounds values : 0 OK
   # Avg bus bandwidth    : 317.11
   ```

   The average bus bandwidth is not equal to the InfiniBand bandwidth, because some of the NCCL operations it measures use NVLink. Nevertheless, it gives an accurate estimate of the quality of the InfiniBand connection.
8. Delete the `MPIJob`:

   ```bash
   kubectl delete -f nccl-test.yaml -n nccl-test
   ```

   Note

   You should delete the `MPIJob` even if you want to run another test. In this case, redeploy the `MPIJob` as described in steps 4–5.