Installing drivers on Managed Service for Kubernetes clusters with GPUs
You can run GPU workloads on Managed Service for Kubernetes node groups created without pre-installed drivers. To install the drivers yourself, use the GPU Operator.
To prepare your Managed Service for Kubernetes cluster and node group without pre-installed drivers for running workloads:
1. Install the GPU Operator.
2. Check that the drivers are installed correctly.

If you no longer need the resources you created, delete them.
Getting started
- If you don't have the Nebius AI command line interface yet, install and initialize it.

  The folder specified in the CLI profile is used by default. You can specify a different folder using the --folder-name or --folder-id parameter.

- Create a Managed Service for Kubernetes cluster with any suitable configuration.
- Create a node group on a platform with a GPU, and enable the Do not install GPU drivers option.
- Install kubectl and configure it to work with the created cluster.
Install the GPU Operator
- Install Helm v3.7.0 or higher.
- Download and install the NVIDIA GPU Operator Helm chart from Nebius AI Marketplace:
```
helm pull oci://cr.ai.nebius.cloud/yc-marketplace/nebius/gpu-operator/chart/gpu-operator \
  --version v23.9.0
helm install gpu-operator ./gpu-operator-v23.9.0.tgz \
  --namespace nvidia-gpu-operator \
  --create-namespace \
  --set driver.upgradePolicy.autoUpgrade=false \
  --set driver.rdma.enabled=true \
  --wait
```
These commands install NVIDIA driver version 535.104.12 with GPUDirect RDMA support, as well as the NVIDIA Container Toolkit.
Check that drivers are installed correctly
Get the nvidia-driver-daemonset pod logs (the pods run in the nvidia-gpu-operator namespace created during installation):

```
DRIVERS_POD_NAME="$(kubectl get pods --namespace nvidia-gpu-operator | grep nvidia-driver-daemonset | awk '{print $1}')" && \
kubectl --namespace nvidia-gpu-operator logs "${DRIVERS_POD_NAME}"
They should contain a message saying that the driver has been installed successfully, e.g.:
```
Defaulted container "nvidia-driver-ctr" out of: nvidia-driver-ctr, k8s-driver-manager (init)
DRIVER_ARCH is x86_64
Creating directory NVIDIA-Linux-x86_64-535.54.03
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 535.54.03
...
Loading NVIDIA driver kernel modules...
+ modprobe nvidia
+ modprobe nvidia-uvm
+ modprobe nvidia-modeset
...
Done, now waiting for signal
```
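As an extra check, you can schedule a minimal test pod that runs nvidia-smi. The manifest below is a sketch: the pod name, file name, and CUDA image tag are illustrative, and it assumes the GPU Operator has registered the nvidia.com/gpu resource on your nodes.

```shell
# Write a minimal test pod manifest (pod name and image tag are illustrative).
cat <<'EOF' > gpu-test-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
    - name: gpu-test
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # request one GPU from the operator's device plugin
EOF
```

Apply the manifest with kubectl apply -f gpu-test-pod.yaml, then run kubectl logs gpu-test: the output should contain the familiar nvidia-smi table listing the node's GPU.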
Now, you can run GPU-based workloads by following the Running workloads with GPUs guide.
Delete the resources you created
To stop being charged for the Kubernetes cluster you created, delete it.