Setting up a cluster for training using Slurm
In this tutorial, you will set up a Slurm-managed cluster in Nebius AI. It also covers the optional steps of testing the cluster and deleting the created resources.
Costs
The cost of this infrastructure includes:
- Fees for continuously running VMs and disks (see Compute Cloud pricing).
- Fees for using public IP addresses and outgoing traffic (see Virtual Private Cloud pricing).
- If your framework relies on Slurm accounting to automate distributed training processes, fees for continuously running MySQL hosts and their disks (see Managed Service for MySQL pricing).
Prepare the environment
- If you do not have the Nebius AI command line interface yet, install and configure it.
- Install Terraform and configure the Nebius AI Terraform provider.
Warning
The Nebius AI Terraform provider is in beta and may be unstable. If you experience issues with it, contact support.
Create a cluster
On your local machine:
- Clone the nebius-architect-solution-library repository from GitHub and go to the slurm/slurm-standalone directory:

  ```shell
  git clone https://github.com/nebius/nebius-architect-solution-library.git
  cd ./nebius-architect-solution-library/slurm/slurm-standalone
  ```

- See the variables.tf file for the list of variables and their default values used in the configuration. Create a terraform.tfvars file using the example below and modify it with your values:

  ```hcl
  # (Required) Folder in Nebius AI to create a cluster
  folder_id = "<folder_id>"

  # (Required) SSH public key for cluster VMs
  sshkey = "<ssh_key>"

  # (Required) Number of VMs in the cluster
  cluster_nodes_count = 4

  # (Optional) Creates a MySQL cluster for Slurm accounting
  mysql_jobs_backend = true

  # (Optional) Platform type: gpu-h100, gpu-h100-b or gpu-h100-c
  platform_id = "gpu-h100-b"
  ```

  The VM platform (platform_id) must support GPU clusters.

  To keep the default value for a variable (except folder_id), remove its line from terraform.tfvars.

- Apply the configuration:

  - Initialize Terraform:

    ```shell
    terraform init
    ```

  - Check the Terraform file configuration:

    ```shell
    terraform validate
    ```

  - Check the list of cloud resources to be created:

    ```shell
    terraform plan
    ```

  - Create the resources:

    ```shell
    terraform apply
    ```

- This starts the cluster creation. After Terraform creates the VMs and other resources and exits, deployment scripts are still running on the created VMs. To follow the deployment process, we recommend checking the logs on the Slurm master:

  - Connect to the node-master node in the cluster over SSH as the user slurm:

    ```shell
    ssh -i <ssh-key-path> slurm@<node-master-ip>
    ```

  - Get the cloud-init logs:

    ```shell
    sudo tail -f /var/log/cloud-init-output.log
    ```

  - Check the cloud-init status:

    ```shell
    cloud-init status
    ```

    The deployment is finished when the status is done. You can also run cloud-init status --wait to block until the deployment scripts complete.
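The four Terraform steps above can be combined into a single script. The following is a sketch, not part of the repository; the plan file name tfplan is our choice, and saving the plan with -out lets you apply exactly the plan you reviewed:

```shell
#!/bin/bash
# Sketch: run the Terraform workflow end to end, stopping on the first error.
set -euo pipefail

terraform init
terraform validate

# Save the reviewed plan to a file, then apply exactly that plan,
# so nothing changes between review and apply.
terraform plan -out=tfplan
terraform apply tfplan
```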
The deployed cluster is prepared for Slurm workloads. The VMs in the cluster have special plugins installed:

- Enroot for multi-user containerized access to the cluster.
- Pyxis for running containerized tasks via srun.
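For example, a containerized task could be submitted through Pyxis roughly like this; the container image and node count are illustrative assumptions, not values from this tutorial:

```shell
# Sketch: run nvidia-smi on two nodes inside a container.
# --container-image is the Pyxis option that makes srun import the image
# with Enroot and execute the task inside it (the image name is a placeholder;
# this assumes Enroot injects the GPU driver utilities into the container).
srun -N 2 --container-image=ubuntu:22.04 nvidia-smi
```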
How to test the cluster
To check that Slurm is installed correctly, do the following for each VM in the cluster:
- Connect to the VM over SSH as the user slurm.

- Run sinfo. The availability status (AVAIL) should be up.

  Result:

  ```text
  PARTITION  AVAIL  TIMELIMIT  NODES  STATE  NODELIST
  debug*     up     infinite       2  idle   node[1-2]
  compute*   up     infinite       8  alloc  node[3-10]
  ...
  ```
- Create a script that prints the output of the NVIDIA System Management Interface command, for example, testjob.sh:

  ```shell
  cat << EOF > testjob.sh
  #!/bin/bash
  nvidia-smi > output.txt
  EOF
  ```
- Submit the script as a Slurm job:

  ```shell
  sbatch testjob.sh
  ```
- Check the job status:

  ```shell
  squeue -u slurm
  ```
- Once the job is completed, open output.txt. It should contain the nvidia-smi command output.
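The sinfo check in step 2 can be automated. Below is a sketch; check_partitions is a hypothetical helper, and the here-document stands in for real cluster output, which on the cluster you would generate with sinfo --noheader --format="%P %a":

```shell
# Fail (exit 1) if any partition reports an availability other than "up".
check_partitions() {
  awk '$2 != "up" { bad = 1; print "partition " $1 " is " $2 }
       END { exit bad }'
}

# Sample lines standing in for: sinfo --noheader --format="%P %a"
check_partitions <<'EOF' && echo "all partitions up"
debug* up
compute* up
EOF
```

On a healthy cluster, the helper exits 0 silently; any partition that is not up is printed and the exit status becomes 1, which makes the check easy to use in scripts.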
It is also recommended to run the NCCL tests using the Testing inter-GPU connection for Compute Cloud VMs guide.
How to delete the resources you created
While the cluster VMs and the Managed Service for MySQL cluster (if you created it) are running, you are charged for them. If you do not need them anymore, delete them.
To delete the VMs and the Managed Service for MySQL cluster, on your local machine, go to the directory with the Terraform configuration (see Create a cluster) and run the following command:

```shell
terraform destroy
```
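terraform destroy asks for confirmation before deleting anything. In unattended scripts, you can skip the prompt with a standard Terraform flag (use with care: the resources are deleted immediately):

```shell
terraform destroy -auto-approve
```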