Installing drivers on Managed Service for Kubernetes clusters with GPUs

Getting started
Install the GPU Operator
Check that drivers are installed correctly
Delete the resources you created

You can use Managed Service for Kubernetes node groups for workloads on GPUs without pre-installed drivers. Use the GPU Operator to select a suitable driver version.

To prepare your cluster and Managed Service for Kubernetes node group without pre-installed drivers for running workloads:

Install the GPU Operator
Check that drivers are installed correctly

If you no longer need the resources you created, delete them.

Getting started

If you don't have the Nebius AI command line interface yet, install and initialize it.

The folder specified in the CLI profile is used by default. You can specify a different folder using the --folder-name or --folder-id parameter.
Create a Managed Service for Kubernetes cluster with any suitable configuration.
Create a node group on a platform with a GPU, and enable Do not install GPU drivers.
Install kubectl and configure it to work with the created cluster.

Install the GPU Operator

Install Helm v3.7.0 or higher.

Download and install the NVIDIA GPU Operator Helm chart from Nebius AI Marketplace:

helm pull oci://cr.ai.nebius.cloud/yc-marketplace/nebius/nvidia-gpu-operator/chart/gpu-operator \
  --version v24.3.0
helm install gpu-operator ./gpu-operator-v24.3.0.tgz \
  --namespace nvidia-gpu-operator \
  --create-namespace \
  --set driver.upgradePolicy.autoUpgrade=false \
  --set driver.rdma.enabled=true \
  --wait

This command installs NVIDIA drivers version 535.161.08 with GPUDirect RDMA support and NVIDIA Container Toolkit.

Check that drivers are installed correctly

Get the nvidia-driver-daemonset pod logs:

DRIVERS_POD_NAME="$(kubectl get pods --namespace nvidia-gpu-operator | grep nvidia-driver-daemonset | awk '{print $1}')" && \
kubectl --namespace nvidia-gpu-operator logs "${DRIVERS_POD_NAME}"

They should contain a message saying that the driver has been installed successfully, e.g.:

Defaulted container "nvidia-driver-ctr" out of: nvidia-driver-ctr, k8s-driver-manager (init)
DRIVER_ARCH is x86_64
Creating directory NVIDIA-Linux-x86_64-535.161.08
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 535.161.08

...

Loading NVIDIA driver kernel modules...
+ modprobe nvidia
+ modprobe nvidia-uvm
+ modprobe nvidia-modeset

...

Done, now waiting for signal

Now, you can run GPU-based workloads by following the Running workloads with GPUs guide.

Delete the resources you created

To stop charges for the Kubernetes cluster you have created, delete it.