Applying the Nebius AI device topology to Managed Service for Kubernetes clusters with GPU nodes

For Nebius AI virtual machines in GPU clusters, the device topology differs from the default bare-metal one. As a result, the NCCL tests from NVIDIA might be unstable in a multi-host environment.

To run the NCCL tests reliably, as described in the dedicated tutorial, and to improve workload performance, apply the Nebius AI topology to your Managed Service for Kubernetes cluster.

Before applying the topology:

Create a GPU cluster.
Create a node group with the GPU cluster and a VM configuration supported in GPU clusters.

To apply the topology:

Download the nccl-topo-h100-v1.xml file from the Nebius AI repository.
Create a namespace for the topology. For example, name it nccl-test:
```
kubectl create namespace nccl-test
```
Warning

Create the Kubernetes resources that will use the topology in the same namespace. A ConfigMap is a namespaced resource, so pods can only mount it from their own namespace.

Create a ConfigMap resource with the topology:

```
kubectl create configmap topo-config \
  --from-file=nccl-topo-h100-v1.xml \
  -n nccl-test
```
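
The ConfigMap only stores the topology file. A workload still has to mount it and tell NCCL where to find it. The following is a minimal sketch of one way to do that, not the procedure from the tutorial: it writes a Pod manifest that mounts topo-config as a volume and sets the NCCL_TOPO_FILE environment variable to the mounted file. The pod name, container image, mount path, and manifest file name are hypothetical placeholders, and GPU resource requests are omitted for brevity.

```shell
#!/bin/sh
# Sketch: generate a Pod manifest that consumes the topo-config ConfigMap.
# Names from the steps above: nccl-test, topo-config, nccl-topo-h100-v1.xml.
# Everything else (pod name, image, mount path, output file) is an assumption.

NAMESPACE="nccl-test"
CONFIGMAP="topo-config"
TOPO_FILE="nccl-topo-h100-v1.xml"
MOUNT_DIR="/etc/nccl-topo"                               # hypothetical mount path
IMAGE="${IMAGE:-registry.example.com/nccl-tests:latest}" # hypothetical image
MANIFEST="nccl-topo-pod.yaml"                            # hypothetical file name

# --from-file stores the file under a key equal to its base name,
# so it appears as ${MOUNT_DIR}/${TOPO_FILE} inside the container.
cat > "$MANIFEST" <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: nccl-topo-example
  namespace: ${NAMESPACE}
spec:
  restartPolicy: Never
  containers:
    - name: nccl
      image: ${IMAGE}
      env:
        # NCCL reads this variable to load a topology XML file.
        - name: NCCL_TOPO_FILE
          value: ${MOUNT_DIR}/${TOPO_FILE}
      volumeMounts:
        - name: nccl-topo
          mountPath: ${MOUNT_DIR}
          readOnly: true
  volumes:
    - name: nccl-topo
      configMap:
        name: ${CONFIGMAP}
EOF

echo "Wrote ${MANIFEST}. Review it, then run: kubectl apply -f ${MANIFEST}"
```

Before applying the manifest, you can confirm that the ConfigMap exists and contains the file with `kubectl get configmap topo-config -n nccl-test -o yaml`. The script only writes the manifest and does not contact the cluster, so you can adapt the image and resource requests first.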

See also