Setting up a cluster for training using Slurm
In this tutorial, you will set up a Slurm-managed cluster in Nebius AI. It also covers the optional steps of testing the cluster and deleting the created resources.
Costs
The cost of this infrastructure includes:
- Fees for continuously running VMs and disks (see Compute Cloud pricing).
- Fees for using public IP addresses and outgoing traffic (see Virtual Private Cloud pricing).
- If your framework relies on Slurm accounting to automate distributed training processes, fees for continuously running MySQL hosts and their disks (see Managed Service for MySQL pricing).
Prepare the environment
- If you do not have the Nebius AI command line interface yet, install and configure it.
- Install Terraform and configure the Nebius AI Terraform provider.
Warning
The Nebius AI Terraform provider is in beta and may be unstable. If you experience issues with it, contact support.
Create a cluster
On your local machine:
- Clone the nebius-architect-solution-library repository from GitHub and go to the slurm/slurm-standalone directory:

  ```shell
  git clone https://github.com/nebius/nebius-architect-solution-library.git
  cd ./nebius-architect-solution-library/slurm/slurm-standalone
  ```

- See the variables.tf file for the list of variables and their default values used in the configuration. Create a terraform.tfvars file using the example below and modify it with your values:

  ```hcl
  # (Required) Folder in Nebius AI to create a cluster
  folder_id = "<folder_id>"

  # (Required) SSH public key for cluster VMs
  sshkey = "<ssh_key>"

  # (Required) Number of VMs in the cluster
  cluster_nodes_count = 4

  # (Optional) Creates a MySQL cluster for Slurm accounting
  mysql_jobs_backend = true

  # (Optional) Platform type: gpu-h100, gpu-h100-b or gpu-h100-c
  platform_id = "gpu-h100-b"
  ```

  The VM platform (platform_id) must support GPU clusters.

  To keep the default value for a variable (except folder_id), remove its line from terraform.tfvars.

- Apply the configuration:

  - Initialize Terraform:

    ```shell
    terraform init
    ```

  - Check the Terraform file configuration:

    ```shell
    terraform validate
    ```

  - Check the list of cloud resources to be created:

    ```shell
    terraform plan
    ```

  - Create the resources:

    ```shell
    terraform apply
    ```

- This starts the cluster creation. After Terraform creates the VMs and other resources and exits, deployment scripts are still running on the created VMs. To follow the deployment process, we recommend checking the logs on the Slurm master:

  - Connect to the node-master node in the cluster over SSH as the user slurm:

    ```shell
    ssh -i <ssh-key-path> slurm@<node-master-ip>
    ```

  - Get the cloud-init logs:

    ```shell
    sudo tail -f /var/log/cloud-init-output.log
    ```

  - Check the cloud-init status:

    ```shell
    cloud-init status
    ```

    The deployment is finished when the status is done. You can also run cloud-init status --wait to block until the deployment scripts complete.
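The four Terraform steps above can be combined into a single script. The following is a sketch, not part of the repository; the plan file name tfplan is our choice, and saving the plan with -out lets you apply exactly the plan you reviewed:

```shell
#!/bin/bash
# Sketch: run the Terraform workflow end to end, stopping on the first error.
set -euo pipefail

terraform init
terraform validate

# Save the reviewed plan to a file, then apply exactly that plan,
# so nothing changes between review and apply.
terraform plan -out=tfplan
terraform apply tfplan
```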
The deployed cluster is prepared for Slurm workloads. The VMs in the cluster have special plugins installed:

- Enroot for multi-user containerized access to the cluster.
- Pyxis for running containerized tasks via srun.
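For example, a containerized task could be submitted through Pyxis roughly like this; the container image and node count are illustrative assumptions, not values from this tutorial:

```shell
# Sketch: run nvidia-smi on two nodes inside a container.
# --container-image is the Pyxis option that makes srun import the image
# with Enroot and execute the task inside it (the image name is a placeholder;
# this assumes Enroot injects the GPU driver utilities into the container).
srun -N 2 --container-image=ubuntu:22.04 nvidia-smi
```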
How to test the cluster
To check that Slurm is installed correctly, do the following for each VM in the cluster:
- Connect to the VM over SSH as the user slurm.

- Run sinfo. The availability status (AVAIL) should be up.

  Result:

  ```text
  PARTITION  AVAIL  TIMELIMIT  NODES  STATE  NODELIST
  debug*     up     infinite       2  idle   node[1-2]
  compute*   up     infinite       8  alloc  node[3-10]
  ...
  ```
- Create a script that prints the output of the NVIDIA System Management Interface command, for example, testjob.sh:

  ```shell
  cat << EOF > testjob.sh
  #!/bin/bash
  nvidia-smi > output.txt
  EOF
  ```
- Submit the script as a Slurm job:

  ```shell
  sbatch testjob.sh
  ```
- Check the job status:

  ```shell
  squeue -u slurm
  ```
- Once the job is completed, open output.txt. It should contain the nvidia-smi command output.
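The sinfo check in step 2 can be automated. Below is a sketch; check_partitions is a hypothetical helper, and the here-document stands in for real cluster output, which on the cluster you would generate with sinfo --noheader --format="%P %a":

```shell
# Fail (exit 1) if any partition reports an availability other than "up".
check_partitions() {
  awk '$2 != "up" { bad = 1; print "partition " $1 " is " $2 }
       END { exit bad }'
}

# Sample lines standing in for: sinfo --noheader --format="%P %a"
check_partitions <<'EOF' && echo "all partitions up"
debug* up
compute* up
EOF
```

On a healthy cluster, the helper exits 0 silently; any partition that is not up is printed and the exit status becomes 1, which makes the check easy to use in scripts.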
It is also recommended to run the NCCL tests using the Testing inter-GPU connection for Compute Cloud VMs guide.
How to delete the resources you created
While the cluster VMs and the Managed Service for MySQL cluster (if you created it) are running, you are charged for them. If you do not need them anymore, delete them.
To delete the VMs and the Managed Service for MySQL cluster, on your local machine, go to the directory with the Terraform configuration (see Create a cluster) and run the following command:

```shell
terraform destroy
```
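terraform destroy asks for confirmation before deleting anything. In unattended scripts, you can skip the prompt with a standard Terraform flag (use with care: the resources are deleted immediately):

```shell
terraform destroy -auto-approve
```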