MLOps Podcast: Handling Multi-Terabyte LLM Checkpoints

MLOps Community is the world’s largest community dedicated to addressing the unique technical and operational challenges of production machine learning systems.

The talk is a gentle introduction to LLM checkpointing: why it is hard and how large the checkpoints get. It covers tips and tricks for saving and loading multi-terabyte checkpoints, as well as how to choose a cloud storage option for checkpointing.

Featuring Simon Karasik, Machine Learning Engineer at Nebius AI, and Demetrios Brinkmann, Chief Happiness Engineer at MLOps Community.
