Soperator

A Slurm-based workload manager for ML and HPC clusters with a modern and simplified user experience.

Robust job scheduling

Slurm can schedule and orchestrate an immense number of jobs on thousands of compute nodes. Together with granular hardware control, it creates a powerful tool for solving the most complicated ML and HPC tasks.

Fault-tolerant training

Thanks to hardware health checks and Kubernetes high-availability mechanisms, Soperator ensures seamless and predictable ML training without disruptions caused by GPU failures.

Simplified environment management

A shared root filesystem provides a single file environment for all nodes of the cluster, allowing ML practitioners to focus on model development rather than on complicated package management.

Use cases

Distributed training of any scale

Soperator is a perfect solution for orchestrating highly intensive distributed training at scales of up to tens of thousands of GPU nodes.

Collaboration within the same cluster

The ability to run jobs in parallel and schedule jobs from different projects in sequence saves time and money when organizing the collaborative work of your ML team.
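
For illustration, here is a minimal sketch of how sequencing jobs from two projects could look from a login node. It assumes only the standard Slurm CLI (sbatch) is available on the cluster; the script names are hypothetical.

```python
import subprocess

def submit(script: str, *extra_args: str) -> str:
    """Submit a batch script with sbatch and return its job ID."""
    # --parsable makes sbatch print just the job ID, so it is easy to capture.
    result = subprocess.run(
        ["sbatch", "--parsable", *extra_args, script],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip().split(";")[0]

# Hypothetical scripts from two projects sharing the same cluster.
pretrain_id = submit("project_a_pretrain.sbatch")
# Queue project B's job so it starts only after project A's job succeeds.
finetune_id = submit("project_b_finetune.sbatch",
                     f"--dependency=afterok:{pretrain_id}")
print(f"queued {pretrain_id} -> {finetune_id}")
```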

How it works

Open source solution

At Nebius, we believe that only together can we create better technologies. That’s why we made this product open-source, providing ML enthusiasts and HPC practitioners with the opportunity to use this technology for their endeavors and improve it according to their needs.

Service features

Hardware health checks

The system constantly monitors the availability of every hardware unit within the cluster and reports if any issue occurs.

Shared root filesystem

All system files are shared across all cluster nodes, eliminating the need to manually keep every node of the cluster in an identical state.

Easy bootstrap

The solution is ready to go and can be deployed within 20-30 minutes. We also provide a Terraform recipe for our cloud that simplifies deployment even further.

Pre-installed GPU and network drivers

Soperator comes with all the NVIDIA GPU, InfiniBand and other drivers necessary for running an ML training cluster pre-installed.

Easy scaling

You can easily scale your GPU cluster up and down based on the upcoming workloads, new model development tasks or team expansion.

Advanced scheduling mechanism

Slurm allows you to split one large job into many steps, some executed sequentially and others in parallel.
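
As a rough sketch of what such a job could look like, the batch script below runs two steps in parallel and a third step after both finish. It is submitted from Python by piping the script to sbatch; the step commands and resource numbers are purely illustrative.

```python
import subprocess

# Illustrative batch script: two preprocessing steps run in parallel,
# then a training step runs once both of them have completed.
JOB_SCRIPT = """#!/bin/bash
#SBATCH --job-name=demo-pipeline
#SBATCH --nodes=2

# Parallel steps: each srun starts a separate job step inside the allocation.
srun --ntasks=1 echo "preprocess shard 0" &
srun --ntasks=1 echo "preprocess shard 1" &
wait  # block until both parallel steps complete

# Sequential step: starts only after the parallel steps above are done.
srun --ntasks=2 echo "train on both nodes"
"""

# sbatch reads the script from standard input when no file name is given.
subprocess.run(["sbatch"], input=JOB_SCRIPT, text=True, check=True)
```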

Granular hardware control

Slurm can distinguish hardware resources such as CPU sockets, CPU and GPU cores, and hyperthreads, allowing you to provision even the smallest compute units.
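
For example, a single step can request a precise slice of hardware through standard srun options; the values below are purely illustrative and assume NVIDIA tooling is present on the nodes.

```python
import subprocess

# Request exactly 4 tasks, each pinned to 8 physical CPU cores and 1 GPU.
subprocess.run(
    [
        "srun",
        "--ntasks=4",
        "--cpus-per-task=8",
        "--gpus-per-task=1",
        "--hint=nomultithread",  # one task per physical core, no hyperthreads
        "nvidia-smi", "-L",      # illustrative payload: list the visible GPUs
    ],
    check=True,
)
```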

Cluster accounting

Slurm accounting provides detailed statistics on cluster usage, job duration, errors and other system data.
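
As a small sketch, the snippet below pulls a week of accounting data with the standard sacct tool and sums job runtime per user; the time window and the aggregation are illustrative, not part of Soperator itself.

```python
import subprocess
from collections import defaultdict

# Ask Slurm accounting for the last week of jobs as pipe-separated records.
out = subprocess.run(
    ["sacct", "--allusers", "--starttime=now-7days", "--noheader",
     "--parsable2", "--format=User,JobID,Elapsed,State"],
    check=True, capture_output=True, text=True,
).stdout

def to_seconds(elapsed: str) -> int:
    """Convert sacct Elapsed values like '1-02:03:04' or '02:03:04' to seconds."""
    days, _, rest = elapsed.rpartition("-")
    parts = [int(x) for x in rest.split(":")]
    while len(parts) < 3:          # pad to hours:minutes:seconds
        parts.insert(0, 0)
    h, m, s = parts
    return (int(days) if days else 0) * 86400 + h * 3600 + m * 60 + s

usage = defaultdict(int)
for line in out.splitlines():
    user, job_id, elapsed, state = line.split("|")
    if user:  # job steps have an empty User field; count only top-level jobs
        usage[user] += to_seconds(elapsed)

for user, seconds in sorted(usage.items(), key=lambda kv: -kv[1]):
    print(f"{user}: {seconds / 3600:.1f} hours of job runtime")
```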

High performance filesystem

Delivers up to 100 GB/s of throughput and 1M IOPS for quick checkpoint restoration during large-scale training runs and efficient dataset streaming.

Terraform support

Configuring the cluster via Terraform simplifies the user experience even if your team doesn't have deep DevOps expertise.

Compare Soperator with Slurm and Kubernetes

The comparison covers Soperator, standalone Slurm and Managed Kubernetes across the following capabilities:

Hardware health checks
Shared root filesystem
Easy bootstrap
Pre-installed GPU and network drivers
Easy scaling
Advanced scheduling mechanism
Granular hardware control
Cluster accounting
High performance filesystem
Terraform support

Questions and answers about Soperator

What is Slurm?

Slurm is an open source, fault-tolerant and highly scalable cluster management and job scheduling system for large and small Linux clusters.

SchedMD

By partnering directly with SchedMD, the developer of the Slurm Workload Manager, Nebius AI provides exceptional support to Slurm users. SchedMD's robust Slurm workload manager streamlines job scheduling and resource allocation. Its scalability and reliability make it a versatile solution that can meet a variety of business needs.

* — Slurm is a registered trademark of SchedMD LLC.