Criterion 1: Kubernetes as a product
At its core, K8s is an orchestrator, adept at autoscaling your fleet of machines exactly when needed. Whether it’s scaling application pods (the Horizontal Pod Autoscaler) or the nodes underneath them (the Cluster Autoscaler), its flexibility is unmatched.
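As a minimal sketch of the pod-level half, here’s a HorizontalPodAutoscaler that keeps a hypothetical inference-server Deployment between 2 and 10 replicas based on CPU load (the workload name and threshold are illustrative assumptions):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-server          # hypothetical Deployment to scale
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # add replicas once average CPU tops 70%
```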
From an infrastructure perspective, node autoscaling is the standout feature of Kubernetes. Instead of juggling individual virtual machines, users declare the cluster’s desired shape and let Kubernetes maintain it as workloads come and go. It’s a lifesaver, particularly when it comes to machine learning.
In other words, you only have to deploy a K8s cluster with multiple GPU-equipped nodes, and the system will orchestrate them from that point on. Using tools like Terraform, which we’ll discuss below, you can define your compute resource needs, set limits on the number of nodes, and then let K8s work its magic. It autonomously replaces crashed nodes, starts new nodes as needed, or shuts down excess active nodes to ensure the training cycle is uninterrupted.
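Here’s a rough sketch of the workload side of that arrangement, assuming a GPU node group whose bounds are set elsewhere (for example, via the upstream Cluster Autoscaler’s --nodes=1:16:gpu-pool flag, where the group name and limits are illustrative). The Job requests GPUs through nvidia.com/gpu, the resource name exposed by NVIDIA’s device plugin, and the autoscaler provisions or retires nodes to match; the job name and image are hypothetical:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: train-llm                 # hypothetical training job
spec:
  backoffLimit: 3                 # re-run the pod if its node crashes mid-training
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: trainer
          image: ghcr.io/example/trainer:latest  # illustrative image
          resources:
            limits:
              nvidia.com/gpu: 8   # a full 8-GPU node per pod; triggers scale-up
```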
This autonomy means ML engineers can focus on what they love most — training algorithms — without getting bogged down by infrastructure nuances.
Criterion 2: Native tooling
After deployment, an ML engineer’s work is far from done. From driver installation to tool and environment setup, the MLOps life is complicated. Take GPU driver installation, for example. It’s an extremely time-consuming, yet non-negotiable step; every node needs the right driver.
With Kubernetes, though, this process becomes a breeze. By installing a GPU operator from its Helm chart (NVIDIA’s GPU Operator, for example), engineers can instruct the K8s cluster to handle driver installation and testing for an entire zoo of newly deployed nodes. But what if there’s a need for driver modifications, say, to account for corner cases? In this scenario, the chart’s YAML values file serves as the universal command center. Set the driver version you need, and you’re good to go.
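A minimal sketch of such a values file, assuming NVIDIA’s GPU Operator chart (driver.version and the enabled flags are real chart values; the specific version pin is illustrative):

```yaml
# values.yaml for the NVIDIA GPU Operator Helm chart
# Apply with: helm install gpu-operator nvidia/gpu-operator -f values.yaml
driver:
  enabled: true           # let the operator manage the driver on every GPU node
  version: "535.104.05"   # illustrative pin; change it once to roll out cluster-wide
toolkit:
  enabled: true           # also install the NVIDIA container toolkit
```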
This centralized approach, combined with tools like Helm charts and cron jobs, makes software maintenance much less of a chore, for low-level and high-level software alike. Change something once, and it’s applied everywhere. Achieving such streamlined operations on regular VMs would require writing cumbersome scripts and applying them across your VM fleet, costing both time and money.
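For instance, a recurring maintenance task that would otherwise live in every VM’s crontab becomes a single cluster-wide object. A sketch, assuming a nightly scratch-space cleanup as the hypothetical task:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scratch-cleanup             # hypothetical nightly maintenance task
spec:
  schedule: "0 3 * * *"             # every day at 03:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: cleanup
              image: busybox:1.36
              command: ["sh", "-c", "find /scratch -type f -mtime +7 -delete"]
              volumeMounts:
                - name: scratch
                  mountPath: /scratch
          volumes:
            - name: scratch
              hostPath:
                path: /mnt/scratch  # illustrative host directory
```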
Today, Kubernetes has no real alternatives: it has established itself as the de facto industry standard. The days of hand-deploying standalone Docker containers are long gone; the shift to containers orchestrated by K8s is evident, and for good reason. Its long-standing presence means a wealth of knowledge is readily available, smoothing the learning curve and encouraging collaborative learning.
From self-managed to cloud provider-managed Kubernetes
Let’s say we all agree that Kubernetes is one of the most effective ways to orchestrate VM clusters for ML and related tasks. But while K8s automates what runs on a VM cluster, it doesn’t free ML practitioners from every infrastructure responsibility: Kubernetes itself needs setup, configuration, and continuous monitoring.
This is where managed Kubernetes comes in. In the Nebius AI implementation, the Managed Service for Kubernetes shifts all control plane maintenance to the platform side. Updates and patches, security-related or otherwise, are installed without any hands-on intervention. All you have to do is set the version and release channel; a single button click in the UI sets the updates in motion. We also provide out-of-the-box operational logging, so instead of wrestling with kubectl in the terminal, users can see all the logs in an easy-to-use GUI.