Slurm vs Kubernetes: Which to choose for your ML workloads

Scaling your machine learning workloads will eventually require resource orchestration. This article compares the two most popular options today, Slurm and Kubernetes, covering their design origins, ML adaptations and other factors to consider.

Challenges in large model training

  1. Distributed training sensitivity to hardware failures
    Distributed training is highly sensitive to hardware failures. If a GPU or its host fails during training, you risk losing all progress stored in the GPU’s memory. Therefore, checkpointing and other processes are necessary to minimize damage to the entire training process.
  2. Higher failure rates of GPU servers
    GPU servers tend to fail more frequently than standard servers. Each server typically hosts eight GPUs, increasing the number of potential failure points. To address this, NVIDIA has recently introduced additional tooling for maintenance issues, but it is still very important to be ready for those failures when they happen.
  3. Complexity of multi-cloud infrastructure management
    Setting up and managing a multi-cloud infrastructure can be challenging. Running training on one cloud platform, performing inference on another and running web servers on yet another can introduce numerous technical issues, including human error. Virtualization can help by abstracting different underlying infrastructure layers into a more uniform system.
  4. High costs of GPUs
    Training large models requires significant parallel computing power, typically provided by GPUs. Given the high cost of these resources, it is crucial to maximize their usage and minimize downtime between training jobs.

Requirements to meet when building your workflow

  • First of all, you need sufficient GPU compute power. A software platform will not provide that by itself, but it can help you get access to it.
  • Scale according to actual demand, ensuring you only pay for GPUs when needed. Conversely, you must be prepared for sudden spikes in demand, such as increased load during inference or additional training jobs.
  • Automate as much of your workflow as possible. Automation gives you, first, the necessary agility in your actual pipelines and, second, codified elements that let you manage and monitor your flow reliably, removing a lot of mid-pipeline human error. For consistency and for collecting metrics, this is mandatory.
  • Implement fault-tolerance technologies and practices, much of which is typically the responsibility of your cloud provider. This, again, can be a difficult task given the way GPUs and GPU hosts operate, and you of course want your platform to minimize the aforementioned issues by all available means.
  • Set up scheduling and orchestration, the core functions that your platform should provide.
  • Optimize resource efficiency, not just in terms of cost but also to achieve the highest performance from the network and hardware (or virtual hardware) that you’re paying for. Make sure you are not leaving paid-for GPUs sitting idle.
  • Last but not least, understand how to operate your platform. Both Slurm and Kubernetes are difficult to set up and equally challenging to operate and maintain.

Slurm specifics

“Slurm” originally stood for Simple Linux Utility for Resource Management, but that was over twenty years ago, in 2002. With two decades of active development and constant new feature additions, it is no longer quite as simple. Some might now call it the Sophisticated Linux Utility for Resource Management. Officially, the acronym and its meaning were dropped long ago. Slurm is just Slurm.

Slurm is an open-source HPC job scheduler used in over half of the Top 500 supercomputers worldwide (where our own ISEG sits in the 19th spot, of course). The primary caretaker and the only team that can give you official enterprise support for Slurm is SchedMD, formed in 2010, the same year that more advanced scheduling logic was added to Slurm.

If you are a research lab, university or a large corporation doing HPC, then Slurm is something you’re likely already familiar with. Compared to K8s, Slurm has a long history of running compute workloads for all kinds of institutions doing intense computation. The latest version (24.05) came out just days ago, and new releases are expected every six months.

Architecture

The architecture of Slurm is quite simple. A control plane (the controller daemon) runs all the scheduling algorithms; it is lightweight, can easily run on a single machine and can be duplicated for high availability. The worker nodes host the compute node daemons, and users manage the cluster and schedule jobs with the client commands.

Slurm lacks auto-scaling and is designed for a fixed scale. As such, the scheduling algorithm is designed to prioritize efficiency, allowing you to get the most out of your infrastructure. Slurm works well from an exceedingly small cluster to an extremely large one. Also, the control plane is notably lightweight, and you can even run it without a database if you do not need accounting.

Workloads in Slurm are organized into queues. Therefore, if you do not have enough resources to run a job immediately, the batch script waits in the queue for its turn. Only then can it access the compute power and worker nodes.

The jobs are generally shell scripts, and the compute nodes are essentially just remote shell hosts. With Slurm, you are running the same shell script on multiple different Linux machines. Slurm manages users across the cluster and their access to different resources, adding many policies on top of that.
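
To give a feel for this workflow, here is a minimal sketch of the client commands in action; the command names are standard Slurm tools, while the batch script name is a hypothetical placeholder.

```bash
# Inspect the cluster: partitions and node states, then your place in the queue.
sinfo
squeue -u "$USER"

# Run a command interactively across two nodes, one task per node.
srun --nodes=2 --ntasks-per-node=1 hostname

# Submit a batch script (train.sh is a hypothetical example) to the queue.
sbatch train.sh
```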

Plugins

Slurm is highly extensible with various plugins that can be a great asset in running ML workloads. For instance, you might want to use container-related plugins like Enroot and Pyxis, both open-source NVIDIA products. For GPU-driven ML workloads, the GRES (Generic Resource Scheduling) plugin is essential for scheduling training jobs according to GPU availability. For more specific requirements, you might also want to look at developing your own plugin.
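
As a rough illustration of how these plugins surface to the user, the sketch below requests GPUs through GRES and runs a container through Pyxis/Enroot; the container image and GPU count are placeholders, and the flags assume the plugins are installed on your cluster.

```bash
#!/bin/bash
#SBATCH --job-name=gres-demo
#SBATCH --nodes=1
#SBATCH --gres=gpu:8    # request eight GPUs on the node via the GRES plugin

# With Pyxis/Enroot installed, srun can pull and run a container image directly.
srun --container-image=nvcr.io/nvidia/pytorch:24.05-py3 nvidia-smi
```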

Deploying Slurm on Nebius AI

The team of Nebius AI cloud solution architects, of which I am a part, provides ML and MLOps engineers with a Terraform-based Slurm solution packaged with Pyxis and Enroot for container support. This solution deploys a Slurm cluster of VMs with GPUs and InfiniBand networking.

This solution is something we are actively developing and will continue to improve, so if you have any additional ideas or something you would want to see in a solution like this, please do not hesitate to reach out to us. Customer requests are a high priority on our development roadmap.

I have recently recorded a short video on deploying Slurm on our cloud:

Running a training job

Below, I oversimplify the process into five consecutive steps. To serve the topic of comparison between Slurm and Kubernetes, I will focus on the big picture. This explanation does not cover aspects like prep, checkpointing, storing your actual results, etc.

  1. Deploy your Slurm cluster and prepare it with all necessary dependencies. I recommend using environment management solutions like Conda, Linux environment modules or containers. With the latter, you of course get the benefit of managing these at runtime with more agility.
  2. Write your distributed training script using frameworks like PyTorch or TensorFlow for help.
  3. Write a Slurm batch script that specifies all the metadata Slurm uses to orchestrate and schedule training jobs. This should include at least these three things: the number of nodes needed, the number of GPUs required and any framework-specific variables (see the sketch after this list).
  4. Submit the batch script, and once it reaches its turn in the queue, monitor its completion.
  5. Review your results.
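
To illustrate steps 3 to 5, here is a minimal sketch of such a batch script, assuming a PyTorch script called train.py launched with torchrun; the node counts, GPU counts and port are hypothetical and will differ on your cluster.

```bash
#!/bin/bash
#SBATCH --job-name=llm-train
#SBATCH --nodes=2                  # number of nodes needed
#SBATCH --ntasks-per-node=1        # one launcher per node
#SBATCH --gpus-per-node=8          # number of GPUs required per node

# Framework-specific variables: tell torchrun where to rendezvous.
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500

srun torchrun \
    --nnodes="$SLURM_NNODES" \
    --nproc_per_node=8 \
    --rdzv_backend=c10d \
    --rdzv_endpoint="$MASTER_ADDR:$MASTER_PORT" \
    train.py
```

Submitted with sbatch, this script waits in the queue and, once running, starts one torchrun launcher per node; steps 4 and 5 are then just monitoring and reviewing the results.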

Kubernetes specifics

We talked a lot about Kubernetes in previous posts, especially in this one, so I will be brief. Kubernetes, often referred to as K8s, is an open-source container orchestration platform. It was originally developed by Google in 2014 specifically for managing containers in the cloud.

A year later, Google offered the platform to the Linux Foundation, a move that formed the Cloud Native Computing Foundation. After a couple more years, all the major cloud providers swiftly launched their own native Kubernetes services — the predecessors to what we now call a Managed Service for Kubernetes. Up until then — and for some time afterward — the market had seen competition for container orchestration. Now though, I can confidently say (just like my colleague Levon did in the previous article) that K8s has become the de-facto standard for container orchestration. Numerous organizations in the field actively develop it further; it is a very popular open-source project.

Architecture

The K8s control plane is a lot more complex than Slurm’s, mainly because it has far more responsibilities. K8s is designed to run continuous web applications as workloads. It maintains the overall state of the cluster and its workloads and checks whether it matches the expected state. If there is a discrepancy, K8s can self-heal back to the correct state: if a worker node crashes, K8s moves its workloads to other worker nodes. The design philosophy behind this architecture prioritizes high availability, unlike the resource efficiency that drives Slurm orchestration.

Autoscaling is a key feature of Kubernetes. K8s is primarily designed for a public cloud environment where additional compute power is always available. If you launch a workload that exceeds the current capacity of the cluster, the deployment either fails or scales the cluster up to meet the required resources.

Workload objects in Kubernetes are Deployments, StatefulSets, DaemonSets or Jobs. All of these describe how to launch a set of Pods to the cluster. A single Pod contains one or more containers but acts as the smallest schedulable unit, with a single IP address.

Networking is also virtualized. K8s has Services for internal connections combined with Ingresses and Gateway APIs for external access. This abstracts the location of the workload and makes it available from the same endpoint no matter where in the cluster it is located. With users and namespaces, you can control user access to specific workloads and resources.
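
To make these objects concrete, here is a minimal sketch of a Deployment paired with a Service, applied through a shell heredoc; the names and container image are hypothetical placeholders.

```bash
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app                  # hypothetical workload name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      containers:
      - name: server
        image: registry.example.com/demo-app:latest   # placeholder image
        ports:
        - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: demo-app                  # stable endpoint, regardless of Pod location
spec:
  selector:
    app: demo-app
  ports:
  - port: 80
    targetPort: 8080
EOF
```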

Deploying Kubernetes on Nebius AI

Our Managed Service for Kubernetes is designed for GPU workloads. You can read a lot more about it on the service page, in the documentation and in the previous post. Its deployment can be done in three different ways: via a web console, our CLI or Terraform, for which we’ve built useful templates. Here, I’d like to share a video guide in which my colleague Boris explains how to quickly start a K8s cluster:

To optimize your cluster for ML tasks, check out another video with Boris.

Running distributed training on Kubernetes

Again, I’m oversimplifying.

  1. Deploy your Kubernetes cluster and configure your kubectl client. On-prem, this can be difficult, while public cloud providers like us have made it easy.
  2. Write your distributed training script. Use frameworks like PyTorch or TensorFlow for help.
  3. Write a Dockerfile for your training script that includes all necessary dependencies.
  4. Build your container image and upload it to a container registry that is accessible from your K8s cluster.
  5. Write a ReplicaSet configuration for your training job (see the sketch after this list). Be sure to specify at least the following meta info:
  • Image you want to use.
  • Resources your job requires (CPU, memory, GPU).
  • Number of replicas you want to deploy.
  • Storage to attach for storing the results.
  6. Deploy the ReplicaSet and monitor the progress until it’s done.
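
A minimal sketch of steps 5 and 6, applying a hypothetical manifest through a heredoc; the image, resource numbers and volume claim are placeholders, and in practice many teams reach for a Job or a Kubeflow PyTorchJob rather than a bare ReplicaSet.

```bash
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: llm-train                       # hypothetical job name
spec:
  replicas: 2                           # number of replicas you want to deploy
  selector:
    matchLabels:
      app: llm-train
  template:
    metadata:
      labels:
        app: llm-train
    spec:
      containers:
      - name: trainer
        image: registry.example.com/llm-train:latest   # image you want to use
        resources:
          limits:                       # resources your job requires
            cpu: "16"
            memory: 128Gi
            nvidia.com/gpu: 8
        volumeMounts:
        - name: results                 # storage to attach for the results
          mountPath: /results
      volumes:
      - name: results
        persistentVolumeClaim:
          claimName: training-results   # hypothetical pre-created claim
EOF
```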

Having said that, this is not necessarily how most people use Kubernetes for machine learning. Instead, thanks to the high level of abstraction Kubernetes offers, they use tools like Kubeflow or Ray deployed on top of K8s with the help of templates and package management tools like Helm. These platforms offer a wide variety of tools to help you in model training. You can develop environments, create ML pipelines, use training operators to simplify model training and fine-tuning, and finally, even serve the trained model. Of course, NVIDIA also offers many of its own containerized applications and packaged GPU controllers, which make it much easier to run ML workloads on Kubernetes.
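
As an example of that last point, the NVIDIA GPU Operator is typically installed with Helm; a sketch, assuming Helm is already configured against your cluster (the chart version is left at its default here):

```bash
# Add NVIDIA's Helm repository and install the GPU Operator into its own namespace.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
    --namespace gpu-operator --create-namespace
```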

How to make the right choice between the two

Let us see how they compare in a few aspects.

Utilization and efficiency during scaling and scheduling

Slurm is built for fixed-size HPC scheduling. The workloads are managed in queues, and the underlying design prioritizes maximizing resource utilization. It does not really support auto-scaling, and tasks that cannot be run with current available compute will just wait in the queue.

GPUs are an expensive resource, so you might not even need auto-scaling and may instead focus on optimizing resource utilization, where Slurm might have the upper hand. Just be careful if there is a gap between training runs: you will want to scale down to minimize unnecessary costs.

K8s is built with a cloud-native philosophy where the resources are assumed to be very elastic. Your cluster can always accommodate the amount of load you want to run on top of it. There is no native job queue — however, there are other solutions available to build something close to a queue on top of K8s.

If you have a dynamic amount of load, the auto-scaling of Kubernetes might be the better choice. On top of that, if you are running inference, which by nature requires a flexible amount of capacity, you also might want to go with Kubernetes.

Checkpointing for ML jobs

When it comes to checkpointing, because the job state lives in GPU memory either way, there is no major difference between the two. Kubernetes, with its more abstracted architecture, could be either an advantage or a disadvantage here: you might find pre-packaged tooling to help, but things can also get complicated if you are not familiar with the platform. Slurm’s ability to distinguish between scripts, job steps and child processes might give it an advantage as well. In any case, neither platform offers a magical solution that makes the challenge of checkpointing much easier.
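
One common pattern on Slurm, for instance, is to have the scheduler send a warning signal shortly before the time limit and forward it to the training process so it can write a checkpoint; a rough sketch, assuming a hypothetical train.py that handles SIGUSR1 by checkpointing and exiting:

```bash
#!/bin/bash
#SBATCH --time=04:00:00
#SBATCH --signal=B:USR1@300    # signal the batch shell 300 seconds before the limit

# Launch the training step in the background so the shell can process the trap.
srun python train.py &
TRAIN_PID=$!

# Forward the warning signal; srun propagates it to the tasks of the job step.
trap 'kill -USR1 "$TRAIN_PID"' USR1

# The first wait returns if the trap fires; wait again so checkpointing can finish.
wait "$TRAIN_PID"
wait "$TRAIN_PID"
```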

Containers and runtime

The Slurm daemon runs the given instructions directly on the worker node’s operating system. Hence, you can use Slurm to run anything that can be launched from the Linux shell, which of course includes containers. There are also additional plugins to help: Pyxis and Enroot from NVIDIA, for example, offer a much more streamlined flow for running containers on Slurm. For MPI workloads, the benefit of orchestration algorithms optimized for resource utilization and speed goes beyond ease of use: MPI workloads can simply perform better on Slurm.

Kubernetes is a container orchestration platform. There is not much that cannot be run in containers, but some could still count it as a disadvantage that you cannot access the underlying worker node directly; for others, it is a welcome relief. Combined with an elevated level of network and resource abstraction, K8s offers an easily expandable platform. This is leveraged by the NVIDIA GPU Operator, Helm package management and the Kubeflow Training Operator. Speaking of which, you can also easily deploy ML platforms such as Kubeflow or Ray to orchestrate your ML pipelines.

There are clear benefits of running ML jobs in containers, such as:

  • Ensuring environment consistency throughout your pipelines. This can help a lot in making sure you are always using the correct versions of all the dependencies you are leveraging. With this consistency, you are also more capable of tracking how different environmental changes affect your training performance.
  • Accelerating the iterations of your platform by codifying the training environment and enabling a higher degree of MLOps. Containers are a key part of running effective training pipelines.
  • Enhancing job portability. A container image will keep the runtime environment identical no matter where you run it. You can easily and consistently move your training job from a local machine, on-prem data center or the cloud.

Understanding your team

While this is a very technical topic, there are clearly less technical considerations to weigh when selecting a training platform. You need to look at your own team, your organization and the surrounding community to see what requirements they might pose for the platform.

Do you have a single small team or do you need to provide controlled access to multiple teams? Slurm and Kubernetes both offer good security controls. You can make sure teams can only access their own allocated resources and projects. Platforms such as Kubeflow might make this easier to manage on Kubernetes.

Does your team already have experience with Slurm or Kubernetes? If you are already familiar with one platform and feel that there could be some benefit in switching to the other, be sure to consider the learning curve ahead of you. Getting competent on either platform will take a while. You could even end up running both platforms for a while, adding a good bit of complexity to your infrastructure. Neither platform eliminates the need for maintenance and operations.

If you do not have experience on either platform, consider if you have relevant experience. If you are already familiar with operating and managing Linux environments, the jump to Slurm might not be as challenging as learning containerisation and container orchestration with Kubernetes.

How you manage your budget can also play a part. If you are buying a fixed amount of compute for a set price, Slurm might be a good option to get the most out of that. If you can or want to be flexible with the budget and will be purchasing on-demand resources, Kubernetes can offer a better option as the more scalable platform.

Getting help from a community around you will be most effective if the platform you choose aligns with this community’s ideals. According to our market research, there are some patterns here: research labs and universities prefer Slurm, while AI startups and teams training generative models more often opt for Kubernetes. The whole community surrounding K8s is a lot larger but also more “business oriented” and doing a lot more than just ML.

Consider the other infrastructure requirements you might have. If you also need to run web applications, it might be a better option to streamline the overall architecture towards Kubernetes. For inference, Kubernetes might be an obvious choice.

Final remarks

Neither Slurm nor Kubernetes was originally designed for modern machine learning workloads. Large distributed training does not differ much from the high-performance computing that Slurm was built for, but long training jobs on expensive and highly sought-after GPUs can make Kubernetes more attractive thanks to its self-healing and auto-scaling. In the choice between Slurm and Kubernetes, you can also see two different communities migrating to the AI/ML space: the academics who have been running ML research on Slurm for a very long time, and the cloud-native technology companies much more familiar with Kubernetes.

There is also an ongoing discussion about combining Slurm and Kubernetes, a topic worthy of its own article. By combining the best aspects of both platforms, we could reach something more suitable for ML workloads than either one on their own. There are some experiments for this already available with more robust solutions on the way, but these will take a while to reach maturity and broader adoption.

If you are choosing between Slurm and Kubernetes, I hope this article helps you make a more informed and structured decision. I, of course, welcome you to contact us for a detailed discussion of your needs and how we can help.

author
Panu Koskela
Cloud Solutions Architect at Nebius AI