MLOps Podcast: Handling Multi-Terabyte LLM Checkpoints

MLOps Community is the world’s largest community dedicated to addressing the unique technical and operational challenges of production machine learning systems.

The talk gives a gentle introduction to LLM checkpointing: why it is hard and how large the checkpoints get. It covers tips and tricks for saving and loading multi-terabyte checkpoints, as well as how to choose a cloud storage option for checkpointing.

Featuring Simon Karasik, Machine Learning Engineer at Nebius AI, and Demetrios Brinkmann, Chief Happiness Engineer at MLOps Community.

