Soperator

A Slurm-based workload manager for ML and HPC clusters with a modern and simplified user experience.

Robust job scheduling

Slurm can schedule and orchestrate an immense number of jobs on thousands of compute nodes. Together with granular hardware control, it creates a powerful tool for solving the most complicated ML and HPC tasks.

Fault-tolerant training

Thanks to hardware health checks and Kubernetes high-availability mechanisms, Soperator ensures seamless and predictable ML training without disruptions caused by GPU failures.

Simplified environment management

A shared root filesystem provides a single file environment for all nodes of the cluster, allowing ML practitioners to focus on model development rather than on complicated package management.

Use cases

Distributed training of any scale

Soperator is a perfect solution for orchestrating highly intensive distributed training at scales of up to tens of thousands of GPU nodes.

Collaboration within the same cluster

The ability to run jobs in parallel and schedule jobs from different projects in sequence saves time and money when organizing the collaborative work of your ML team.
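
For illustration, here is a minimal sketch of how sequencing jobs from two projects could look from a login node. It assumes only the standard Slurm CLI (sbatch) is available on the cluster; the script names are hypothetical.

```python
import subprocess

def submit(script: str, *extra_args: str) -> str:
    """Submit a batch script with sbatch and return its job ID."""
    # --parsable makes sbatch print just the job ID, so it is easy to capture.
    result = subprocess.run(
        ["sbatch", "--parsable", *extra_args, script],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip().split(";")[0]

# Hypothetical scripts from two projects sharing the same cluster.
pretrain_id = submit("project_a_pretrain.sbatch")
# Queue project B's job so it starts only after project A's job succeeds.
finetune_id = submit("project_b_finetune.sbatch",
                     f"--dependency=afterok:{pretrain_id}")
print(f"queued {pretrain_id} -> {finetune_id}")
```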

How it works

Open source solution

At Nebius, we believe that only together can we create better technologies. That’s why we made this product open-source, providing ML enthusiasts and HPC practitioners with the opportunity to use this technology for their endeavors and improve it according to their needs.

Service features

Hardware health checks

The system constantly monitors the availability of every hardware unit within the cluster and reports if any issue occurs.

Shared root filesystem

All system files are shared across all cluster nodes, eliminating the need to manually keep every node of the cluster in an identical state.

Easy bootstrap

The solution is ready to go and can be deployed within 20-30 minutes. We also provide a Terraform recipe for our cloud that simplifies deployment even further.

Pre-installed GPU and network drivers

Soperator comes with all the NVIDIA GPU, InfiniBand and other drivers necessary for running an ML training cluster pre-installed.

Easy scaling

You can easily scale your GPU cluster up and down based on the upcoming workloads, new model development tasks or team expansion.

Advanced scheduling mechanism

Slurm allows you to split one large job into many steps, some executed sequentially and others in parallel.
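
As a rough sketch of what such a job could look like, the batch script below runs two steps in parallel and a third step after both finish. It is submitted from Python by piping the script to sbatch; the step commands and resource numbers are purely illustrative.

```python
import subprocess

# Illustrative batch script: two preprocessing steps run in parallel,
# then a training step runs once both of them have completed.
JOB_SCRIPT = """#!/bin/bash
#SBATCH --job-name=demo-pipeline
#SBATCH --nodes=2

# Parallel steps: each srun starts a separate job step inside the allocation.
srun --ntasks=1 echo "preprocess shard 0" &
srun --ntasks=1 echo "preprocess shard 1" &
wait  # block until both parallel steps complete

# Sequential step: starts only after the parallel steps above are done.
srun --ntasks=2 echo "train on both nodes"
"""

# sbatch reads the script from standard input when no file name is given.
subprocess.run(["sbatch"], input=JOB_SCRIPT, text=True, check=True)
```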

Granular hardware control

Slurm can distinguish hardware resources such as CPU sockets, CPU and GPU cores, and hyperthreads, allowing you to provision even the smallest compute units.
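
For example, a single step can request a precise slice of hardware through standard srun options; the values below are purely illustrative and assume NVIDIA tooling is present on the nodes.

```python
import subprocess

# Request exactly 4 tasks, each pinned to 8 physical CPU cores and 1 GPU.
subprocess.run(
    [
        "srun",
        "--ntasks=4",
        "--cpus-per-task=8",
        "--gpus-per-task=1",
        "--hint=nomultithread",  # one task per physical core, no hyperthreads
        "nvidia-smi", "-L",      # illustrative payload: list the visible GPUs
    ],
    check=True,
)
```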

Cluster accounting

Slurm accounting provides detailed statistics on cluster usage, job duration, errors and other system data.
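
As a small sketch, the snippet below pulls a week of accounting data with the standard sacct tool and sums job runtime per user; the time window and the aggregation are illustrative, not part of Soperator itself.

```python
import subprocess
from collections import defaultdict

# Ask Slurm accounting for the last week of jobs as pipe-separated records.
out = subprocess.run(
    ["sacct", "--allusers", "--starttime=now-7days", "--noheader",
     "--parsable2", "--format=User,JobID,Elapsed,State"],
    check=True, capture_output=True, text=True,
).stdout

def to_seconds(elapsed: str) -> int:
    """Convert sacct Elapsed values like '1-02:03:04' or '02:03:04' to seconds."""
    days, _, rest = elapsed.rpartition("-")
    parts = [int(x) for x in rest.split(":")]
    while len(parts) < 3:          # pad to hours:minutes:seconds
        parts.insert(0, 0)
    h, m, s = parts
    return (int(days) if days else 0) * 86400 + h * 3600 + m * 60 + s

usage = defaultdict(int)
for line in out.splitlines():
    user, job_id, elapsed, state = line.split("|")
    if user:  # job steps have an empty User field; count only top-level jobs
        usage[user] += to_seconds(elapsed)

for user, seconds in sorted(usage.items(), key=lambda kv: -kv[1]):
    print(f"{user}: {seconds / 3600:.1f} hours of job runtime")
```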

High performance filesystem

Delivers up to 100 GB/s of throughput and 1M IOPS for quick checkpoint restoration during large-scale training runs and efficient dataset streaming.

Terraform support

Configuring the cluster via Terraform simplifies the user experience even if your team doesn't have deep DevOps expertise.

Compare Soperator with Slurm and Kubernetes

The comparison covers Soperator, standalone Slurm and Managed Kubernetes across the following capabilities:

Hardware health checks
Shared root filesystem
Easy bootstrap
Pre-installed GPU and network drivers
Easy scaling
Advanced scheduling mechanism
Granular hardware control
Cluster accounting
High performance filesystem
Terraform support

Questions and answers about Soperator

What is Slurm?

Slurm is an open source, fault-tolerant and highly scalable cluster management and job scheduling system for large and small Linux clusters.

SchedMD

By partnering directly with SchedMD, the developer of the Slurm Workload Manager, Nebius AI provides exceptional support to Slurm users. SchedMD's robust Slurm workload manager streamlines job scheduling and resource allocation. Its scalability and reliability make it a versatile solution that can meet a variety of business needs.

* — Slurm is a registered trademark of SchedMD LLC.