TractoAI is a modern way to tackle AI & Big Data challenges

TractoAI is your end-to-end solution for data preparation, exploration and distributed training, designed to work with large-scale ML and AI workloads. It stands behind the ongoing development of the Nebius 300B LLM, a project that requires careful processing of petabytes of data.

We see the challenges ML and AI professionals face

79% identify data preparation and generation* as the most common strategic task performed by AI teams.

35% report that changes in data accessibility* frequently challenge AI implementation.

30% view data volume and complexity* as one of the most challenging aspects of AI implementation.

That’s why we’ve built TractoAI, an end-to-end data platform powered by proven open-source technology

TractoAI helps ML and AI professionals overcome significant challenges, such as the lack of a cohesive ecosystem, poor integration among solutions, user-unfriendly tools, and unstable data processing and training runtimes.

And here is our product landscape

We provide tools for ML teams' essential needs

Data Management

Ingest, store, process, and analyze your data at any scale with our powerful computation engines.

AI and ML Training

Put your distributed training on the solid foundation of Cypress and in the hands of our resilient scheduler, and forget about cumbersome MLOps.


Let us take care of your data and workloads, and use our flexible administration tools for managing resources.

Think of flawless data management and get the best distributed training experience

Let us walk you through some of TractoAI’s key features that will ensure your ML and AI projects are not only well-handled and secure, but also highly efficient and reliable.

Store your data the way you want it

Upload your data, whether in tables or files, to Cypress, our resilient distributed file storage, for further processing or as a training dataset. Choose between HDD and NVMe storage, select the optimal compression codec, and manage older data with erasure coding.

At its core, the TractoAI engine is built for tables, moving away from loosely structured files. Our tables support trillions of rows, thousands of columns, and petabytes of data in any format (from plain numbers and JSON to vectors, pictures, and videos). Just provide a schema and annotate your datasets in Markdown.
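To make the idea of "just provide a schema" concrete, here is a minimal sketch in plain Python. It does not call any TractoAI or YTsaurus API: the column names, the type labels, and the `validate_row` helper are all hypothetical and exist only to illustrate how a schema describes the shape of each row before data is uploaded.

```python
# Hypothetical schema: the column names and type labels below are
# illustrative assumptions, not the actual TractoAI type system.
SCHEMA = [
    {"name": "doc_id", "type": "int64"},
    {"name": "text", "type": "string"},
    {"name": "embedding", "type": "list<double>"},
]

# Map each illustrative type label to a Python-side check.
_CHECKS = {
    "int64": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "string": lambda v: isinstance(v, str),
    "list<double>": lambda v: isinstance(v, list)
    and all(isinstance(x, float) for x in v),
}


def validate_row(row, schema=SCHEMA):
    """Return a list of problems found in one row; an empty list means it fits."""
    problems = []
    expected = {col["name"] for col in schema}
    # Check every declared column for presence and type.
    for col in schema:
        name, type_label = col["name"], col["type"]
        if name not in row:
            problems.append(f"missing column: {name}")
        elif not _CHECKS[type_label](row[name]):
            problems.append(f"wrong type for {name}: expected {type_label}")
    # Flag columns that the schema does not declare.
    for name in row:
        if name not in expected:
            problems.append(f"unexpected column: {name}")
    return problems
```

Checking rows locally like this is one possible way to catch mismatches early; the actual upload and schema syntax depend on the platform's own tooling.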

Modern solution built on a solid foundation

TractoAI is built on YTsaurus, a big data platform designed for distributed storage and processing.

  • It became open source in March 2023 under the Apache 2.0 license.
  • It represents over 200 person-years of development by a diverse team of 50+ engineers.
  • It serves as a flagship storage and data processing platform for Big Tech companies.

Questions and answers about TractoAI

Do you have any customers using your product?

Yes, we do! First of all, our own LLM team uses it to train a 300B model. While we can’t disclose detailed client information, recent notable workloads within TractoAI involved processing petabytes of data on tens of thousands of CPUs and using thousands of NVIDIA® H100 GPU-hours for data preparation. Please contact us to talk more about your case.

Start your journey with TractoAI

* Gartner, Explore Data-Centric AI Solutions to Streamline AI Development, 2023