Nebius AI monthly digest, April 2024

Our main news of the past month is that Nebius AI has become available to everyone! We also appeared on the MLOps Community podcast and published several videos on setting up training, along with stories about how Nebius AI clients are building their models.

Videos on setting up training infrastructure

Fail fast & recover faster: infrastructure resilience of LLM training
Training an LLM in a multi-node setup is a complex and expensive process. Training failures can’t be eliminated, but downtime can be reduced. In this talk, Filipp Fisin, Senior ML Engineer at Nebius AI, provided an overview of the techniques for more resilient training that we find useful.
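The core recovery technique the talk builds on can be sketched in plain Python: checkpoint periodically, write atomically so a crash never corrupts the latest checkpoint, and resume from the last saved step instead of restarting from scratch. This is a minimal illustration with a simulated failure; the file layout, pickle format and the stand-in "training step" are all placeholders for what a real framework would do.

```python
import os
import pickle

def save_checkpoint(path, step, state):
    # Write to a temporary file first, then rename atomically, so a crash
    # mid-write never leaves a corrupt "latest" checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)

def load_checkpoint(path):
    # Resume point: if no checkpoint exists yet, start fresh from step 0.
    if not os.path.exists(path):
        return 0, {"loss": None}
    with open(path, "rb") as f:
        ckpt = pickle.load(f)
    return ckpt["step"], ckpt["state"]

def train(path, total_steps, checkpoint_every, fail_at=None):
    step, state = load_checkpoint(path)  # picks up where the last run stopped
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated node failure")
        state = {"loss": 1.0 / (step + 1)}  # stand-in for a real training step
        step += 1
        if step % checkpoint_every == 0:
            save_checkpoint(path, step, state)
    return step, state
```

If a run dies at step 57 with checkpoints every 10 steps, the next invocation of `train` resumes from step 50: only the work since the last checkpoint is lost, which is the downtime-reduction trade-off the talk discusses.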

How to deploy Slurm on Nebius AI
Our Cloud Solution Architect Panu Koskela is back to show you the essentials of running a Slurm cluster. Slurm is an open-source tool for managing resources and scheduling jobs in a computing environment.
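To give a flavor of what working with Slurm looks like, here is a small Python helper that renders a batch script for a multi-node GPU job. The job name, node counts and command are illustrative placeholders; the `#SBATCH` directives used (`--nodes`, `--ntasks-per-node`, `--gpus-per-node`, `--output`) are standard `sbatch` options.

```python
# Template for a Slurm batch script; values in braces are filled in below.
SBATCH_TEMPLATE = """\
#!/bin/bash
#SBATCH --job-name={job_name}
#SBATCH --nodes={nodes}
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node={gpus_per_node}
#SBATCH --output={job_name}-%j.out

srun {command}
"""

def render_sbatch(job_name, nodes, gpus_per_node, command):
    # Render the batch script text; in practice you would write this to a
    # file and submit it with `sbatch <file>` on the cluster's login node.
    return SBATCH_TEMPLATE.format(
        job_name=job_name,
        nodes=nodes,
        gpus_per_node=gpus_per_node,
        command=command,
    )

script = render_sbatch("llm-train", nodes=4, gpus_per_node=8,
                       command="python train.py")
print(script)
```

`srun` launches one task per node here, which is a common pattern when each task itself spawns per-GPU workers; the video covers how such resource requests map onto the cluster.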

MLOps Community podcast: handling multi-terabyte large model checkpoints
In the latest episode of the podcast, our ML Engineer Simon Karasik drew on his five years of experience in the field to give an introduction to LLM checkpointing. The audio is available across popular podcast platforms, and here’s the video.
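A key idea for multi-terabyte checkpoints is sharding: each rank in the training job saves only its own slice of the model state, so no single node has to serialize (or hold in memory) the full state. A minimal sketch of the pattern, with pickle files standing in for a real distributed checkpoint format:

```python
import os
import pickle

def save_sharded(ckpt_dir, rank, shard):
    # Each rank writes only its own shard of the state dict, in parallel
    # with the others, instead of funneling everything through rank 0.
    os.makedirs(ckpt_dir, exist_ok=True)
    with open(os.path.join(ckpt_dir, f"shard-{rank:05d}.pkl"), "wb") as f:
        pickle.dump(shard, f)

def load_merged(ckpt_dir):
    # Merge all shards back into one state dict, e.g. when exporting the
    # model or resuming on a different cluster topology.
    state = {}
    for name in sorted(os.listdir(ckpt_dir)):
        with open(os.path.join(ckpt_dir, name), "rb") as f:
            state.update(pickle.load(f))
    return state
```

Real frameworks offer this out of the box (for example, PyTorch's distributed checkpointing APIs), but the storage layout and the parallel-write idea are the same.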

What’s new on our docs and blog

Nebius AI is now open to everyone
Whether you are a company or an individual engineer, log in with your Google or GitHub account and start running your ML experiments. To make your journey easier, we have prepared pages in the docs on how to create an account, set up your billing (including, for example, linking a credit card) and sort out your taxes.

Demo: applying RAG with open tools
Retrieval-augmented generation is a technique that enhances language models by combining generative AI with a retrieval component. Check out a quick example of applying RAG in a real-world context.
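The RAG pattern itself fits in a few lines: retrieve the documents most relevant to the query, then prepend them as context to the prompt sent to the language model. In this toy sketch, word overlap stands in for the embedding-based vector search a real pipeline would use, and the prompt format is just an illustration:

```python
def retrieve(query, documents, k=1):
    # Score documents by word overlap with the query -- a stand-in for
    # embedding similarity search against a vector database.
    q = set(query.lower().split())
    scored = sorted(documents,
                    key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(query, documents, k=1):
    # Augment the prompt with retrieved context before calling the LLM,
    # so the model can ground its answer in the supplied passages.
    context = "\n".join(retrieve(query, documents, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
```

In production, `retrieve` would query a vector store such as Milvus and `build_prompt` would feed an LLM served behind an inference server; that end-to-end setup is exactly what the demo walks through.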

The natural next step after the demo is to build a production environment for applying RAG — the topic of our upcoming webinar, which will take place on May 16. We’ll build a pipeline powered by NVIDIA® H100 Tensor Core GPUs and discuss integration with Kubernetes, CUDA, Triton Server, TensorRT, Milvus, PyTorch and Llama 2.

Training a 20B foundation model: Recraft’s journey
Recraft, which recently raised a round led by Khosla Ventures and former GitHub CEO Nat Friedman, has built the first generative AI model made for designers. Featuring 20 billion parameters, the model was trained from scratch on Nebius AI. Here’s how.

How Unum partnered with us to preserve knowledge in compact models
In our field, effective partnerships that harness complementary strengths can drive significant breakthroughs. Such is the case with the collaboration between Nebius AI and Unum, an AI research lab known for developing compact and efficient models.

The first AI safety benchmark is here, with a contribution from Nebius AI
The AI Safety v0.5 Proof of Concept is the result of months of collaboration between industry professionals and researchers. The benchmark has been developed by MLCommons, an engineering consortium built on the philosophy of open collaboration to improve AI systems. Fedor Zhdanov, our Head of Applied AI, is part of the AI Safety working group developing this benchmark.

Marketplace releases

Kubeflow
Handle machine learning workflows on Kubernetes with this open-source toolkit.

NVIDIA Triton Inference Server
Run any AI model from multiple frameworks.

Ray Cluster
Orchestrate scalable distributed computing environments for a variety of large-scale AI workloads.

Author
Nebius AI team