Nebius AI monthly digest, March 2024

March was a busy month for us: we opened access to managed databases and Container Registry, hosted a webinar on Slurm vs Kubernetes, published handy new guides in our documentation, and posted several ML-focused articles on the blog.

Managed databases and Container Registry are now available to all users

Our newest release enables you to build your ML workloads in a more secure, fault-tolerant, and ready-to-use way. These databases can store metadata for common ML tools and services. With a managed service, we as your cloud provider take responsibility for setup, maintenance, backups, and scaling, freeing up your time for building and optimizing models and services.

Managed Service for PostgreSQL
Manage clusters of the popular object-relational DBMS

Managed Service for MySQL®
Manage the most popular relational DBMS, renowned for its reliability

Container Registry
Store and distribute Docker images, ensuring they are safe and accessible

Managed Service for ClickHouse®
Manage the resource-efficient open-source database. Preview

Managed Service for Redis
Deploy and maintain fast in‑memory clusters of the NoSQL DBMS. Preview

Managed Service for OpenSearch
Deploy and maintain OpenSearch server clusters. Preview
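As one illustration of using a managed database as an ML metadata store, the sketch below builds a PostgreSQL connection URI that a tracker such as MLflow could use as its backend store. The host, user, and database names are placeholders, not real endpoints, and the MLflow command in the comment is only one possible consumer of such a URI.

```python
from urllib.parse import quote

def postgres_uri(user: str, password: str, host: str, db: str,
                 port: int = 5432) -> str:
    """Build a SQLAlchemy-style PostgreSQL URI, escaping the password."""
    return f"postgresql://{user}:{quote(password, safe='')}@{host}:{port}/{db}"

# Placeholder credentials and hostname for illustration only.
uri = postgres_uri("mlflow", "s3cr/et", "pg-cluster.example.internal", "mlflow_meta")

# An experiment tracker would then point at the managed cluster, e.g.:
#   mlflow server --backend-store-uri <uri> --default-artifact-root s3://...
```

Escaping the password matters because characters such as `/` or `@` would otherwise break URI parsing.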

Choosing the right way of building platforms: Slurm vs Kubernetes

In our latest webinar, Panu Koskela, our cloud solutions architect, compared Slurm and Kubernetes. He covered their architecture, design, original purposes, adaptations for ML, and key considerations when choosing between them.

Docs and blog

Monitoring GPUs on virtual machines
You can set up your VMs to collect GPU usage data and visualize it. Then you can use the dashboard to monitor current resource utilization, schedule quota increases, and quickly identify anomalies. See this new guide for details.
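The guide walks through the full dashboard setup; as a minimal sketch of the collection side, the snippet below polls `nvidia-smi` in CSV mode and parses per-GPU utilization and memory figures, the same data a dashboard would chart. The helper names are our own; only the `nvidia-smi` flags are standard.

```python
import csv
import io
import subprocess

# Fields queried from nvidia-smi; "nounits" strips "%" and "MiB" suffixes.
QUERY_FIELDS = "index,utilization.gpu,memory.used,memory.total"

def parse_gpu_stats(csv_text: str) -> list[dict]:
    """Turn nvidia-smi CSV output (noheader, nounits) into a list of dicts."""
    stats = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 4:
            continue  # skip blank or malformed lines
        index, util, mem_used, mem_total = (field.strip() for field in row)
        stats.append({
            "gpu": int(index),
            "util_pct": int(util),
            "mem_used_mib": int(mem_used),
            "mem_total_mib": int(mem_total),
        })
    return stats

def sample_gpu_stats() -> list[dict]:
    """Run nvidia-smi once; requires NVIDIA drivers installed on the VM."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_stats(out)
```

Running `sample_gpu_stats()` on a schedule and shipping the dicts to your metrics backend is enough to spot under-utilized GPUs or memory anomalies.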

Deploying GlusterFS distributed storage
Learn how to set up GlusterFS distributed storage on VMs. GlusterFS can scale to several petabytes and handle thousands of clients, allowing for efficient checkpointing in your model training setup.

Tips and tricks for performing large model checkpointing
To make your checkpointing processes even more efficient, we discuss several strategies, including asynchronous checkpointing, choosing the proper storage and format, tuning your code to the network parameters, and scheduling checkpoints with possible recomputation in mind.

Nebius AI is among the first cloud providers adopting NVIDIA Blackwell GPUs
With a keen eye on power usage effectiveness, Nebius AI is excited to be among the first cloud providers adopting NVIDIA B200 Tensor Core GPUs and offering our customers this advanced and energy-efficient technology.

Transformer alternatives in 2024
With this article, we are starting a new category on our blog dedicated to AI research. Expect these posts to be deeply technical and insightful. The first one explores possible alternatives to the Transformer, the key component of modern LLM architectures.

Nebius AI team