# Setting up an Object Storage bucket as Data Version Control (DVC) storage
In this tutorial, you will learn how to set up version control for files used in machine learning tasks, such as datasets and models, and store them efficiently. The setup uses Data Version Control (DVC) with a Nebius AI Object Storage bucket as the storage backend.
## Background
This section provides useful background information. To get started with the tutorial straight away, go to Steps.
### Why Data Version Control
In machine learning tasks, each change to code or data affects future models. To be able to reproduce experiments, it makes sense to control versions of datasets and models as well as code, and preferably all in one place. But Git and similar version control systems for code struggle with large data files.
You would think the solution would be something like Git for data, and Data Version Control (DVC) is exactly that: an open-source tool that brings Git-style versioning to datasets, models, and other large files.
*Version timeline during experiments*
We will focus on data versioning in this tutorial, but DVC offers much more to data science and machine learning teams. You can use it to automate ML pipelines, run and manage experiments, and evaluate them using metrics and plots. Learn more about DVC use cases in the DVC documentation.
### Nebius AI and DVC
One of DVC's main features is that it does not store the data itself under Git. Instead, you specify a storage location in DVC settings: a DVC remote. It is like a Git remote, but for data in a shared storage. You and your teammates can use it to share the data with each other and reproduce previous experiments with their original data.
DVC supports remotes in various storage systems, like Amazon S3 and other compatible object storage services. In this tutorial, we will use Object Storage offered by Nebius AI: it is a fast and cost-efficient storage service, fully compatible with Amazon S3.
While you can install DVC anywhere, we suggest doing it on a Nebius AI virtual machine, next to your GPUs and other computing resources. This speeds up data operations and eliminates egress traffic costs. For example, create a Data Science Virtual Machine in just a couple of clicks, connect to it over SSH, and go from there.
## Steps
In this tutorial, you will:
- Create an Object Storage bucket that will later be made a DVC remote.
- Create a DVC project.
- Add the bucket as a DVC remote.
Also, the tutorial covers optional steps of using DVC with Object Storage and deleting the created resources.
## Costs
The cost of the infrastructure deployed in this tutorial includes:
- Fees for storing data and performing operations with it (see Object Storage pricing).
## Prepare the environment
1. Install the necessary tools (the example commands are for Ubuntu):

   - DVC (requires Python 3.8 or later; see the DVC installation instructions for other options):

     ```bash
     pip install dvc
     ```

   - Nebius AI CLI (`ncp`), to create and manage the Object Storage bucket and other Nebius AI resources. See the installation guide.

   - AWS CLI (fully supported by Object Storage), to use `aws s3` commands for checking that the bucket stores DVC-tracked data:

     ```bash
     sudo apt-get install unzip
     curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
     unzip awscliv2.zip
     sudo ./aws/install
     ```

   - `jq`, to extract IDs and tokens from JSON data returned by the Nebius AI CLI:

     ```bash
     sudo apt-get install jq
     ```
2. Create a service account that will manage objects in the bucket on your behalf, and get the account's ID:

   ```bash
   SA_ID=$(ncp iam service-account create dvc-test-sa --format json \
     | jq ".id" -r)
   ```
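You can try the `jq` filter from this step locally on sample output. The JSON below is a hypothetical stand-in for the CLI response, which may contain more fields:

```shell
# Hypothetical stand-in for the `ncp ... --format json` output
SAMPLE='{"id": "sa-abc123", "name": "dvc-test-sa"}'

# Extract the id as a raw string (-r strips the quotes),
# exactly as the tutorial's filter does
SA_ID=$(echo "$SAMPLE" | jq ".id" -r)
echo "$SA_ID"
```

Without `-r`, `jq` would print `"sa-abc123"` with quotes, which would break the variable's later use in commands.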
3. Add the service account to the `editors` group to grant it the necessary permissions:

   ```bash
   ORG_ID=$(ncp organization-manager organization list --format json \
     | jq ".[0].id" -r)
   ncp organization-manager group add-members editors \
     --subject-id $SA_ID --organization-id $ORG_ID
   ```
4. Create an access key that will be used for the service account's authentication:

   ```bash
   ACCESS_KEY=$(ncp iam access-key create \
     --service-account-name dvc-test-sa --format json)
   ```
5. Set up the AWS CLI:

   - Write the access key into `~/.aws/credentials`:

     ```bash
     cat <<EOF > ~/.aws/credentials
     [default]
     aws_access_key_id = $(echo $ACCESS_KEY | jq ".access_key.key_id" -r)
     aws_secret_access_key = $(echo $ACCESS_KEY | jq ".secret" -r)
     EOF
     ```

   - Write the Nebius AI region and endpoint settings into `~/.aws/config`:

     ```bash
     cat <<EOF > ~/.aws/config
     [default]
     region = eu-north1
     endpoint_url = https://storage.ai.nebius.cloud
     EOF
     ```
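The heredoc pattern used above can be sketched with placeholder values and a temporary file instead of the real `~/.aws/credentials` (the key values here are made up):

```shell
# Placeholder values; real keys come from `ncp iam access-key create`
KEY_ID="EXAMPLEKEYID"
SECRET="examplesecret"
TMP_CREDS=$(mktemp)

# An unquoted EOF marker lets $VAR and $(...) substitutions expand
# inside the heredoc, which is how the tutorial injects jq output
cat <<EOF > "$TMP_CREDS"
[default]
aws_access_key_id = $KEY_ID
aws_secret_access_key = $SECRET
EOF

grep aws_access_key_id "$TMP_CREDS"
```

If you quoted the marker (`<<'EOF'`), the substitutions would be written literally instead of being expanded.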
## Create an Object Storage bucket

Create a bucket named `dvc-test` that will serve as a DVC remote and store DVC-tracked data:

```bash
ncp storage bucket create --name dvc-test
```
## Create a DVC project
1. Initialize Git in a dedicated directory, for example `dvc-example`:

   ```bash
   mkdir dvc-example
   cd dvc-example
   git init
   ```
2. Initialize DVC in the same directory and commit the DVC internal files with Git:

   ```bash
   dvc init
   git commit -m "Initialize DVC"
   ```

   > **Note**
   >
   > `dvc init` creates a `.dvc` directory with DVC internal files and stages it for the Git commit. You do not need to run `git add .dvc`.
## Add the bucket as a DVC remote
DVC storage is managed through DVC remotes. To add the Object Storage bucket as a DVC remote:
1. Create an AWS S3-compatible remote. For example, if you created a bucket named `dvc-test` and want to store data under a "folder" named `dvcstore` (that is, the object keys will have `dvcstore/` as a prefix):

   ```bash
   dvc remote add -d nebius-storage s3://dvc-test/dvcstore
   ```
2. Set up the remote to connect to Nebius AI Object Storage:

   ```bash
   dvc remote modify nebius-storage endpointurl \
     https://storage.ai.nebius.cloud
   dvc remote modify nebius-storage region eu-north1
   ```

   These `dvc remote` commands modify the `.dvc/config` file. Alternatively, you can edit it manually.
3. Commit the updated configuration file with Git:

   ```bash
   git add .dvc/config
   git commit -m "Add Nebius AI remote"
   ```
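After these steps, `.dvc/config` should look roughly like the fragment below. This is a sketch: the section names follow DVC's config format, but the exact contents may vary between DVC versions:

```ini
[core]
    remote = nebius-storage
['remote "nebius-storage"']
    url = s3://dvc-test/dvcstore
    endpointurl = https://storage.ai.nebius.cloud
    region = eu-north1
```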
## How to use DVC with Object Storage
### Start tracking a file and upload it
1. Add data to the DVC project. For example:

   ```bash
   dvc get https://github.com/iterative/dataset-registry \
     get-started/data.xml -o data.xml
   dvc add data.xml
   ```

   This creates `data.xml.dvc`, which serves as a metafile for `data.xml`, and adds `data.xml` to the list of files ignored by Git.
2. Commit `data.xml.dvc` and `.gitignore` with Git:

   ```bash
   git add data.xml.dvc .gitignore
   git commit -m "Add raw data"
   ```
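The metafile is a small YAML document. A sketch of what `data.xml.dvc` may contain; the checksum and size here mirror the example bucket listing in this tutorial, and the exact set of fields can differ between DVC versions:

```yaml
outs:
- md5: 22a1a2931c8370d3aeedd7183606fd7f
  size: 14445097
  path: data.xml
```

Git tracks only this small metafile, while the data itself lives in the DVC cache and the remote.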
3. Push the data to the DVC remote:

   ```bash
   dvc push
   ```
4. Check that the data has been uploaded to the bucket:

   ```bash
   aws s3 ls s3://dvc-test/dvcstore --recursive
   ```

   You will see that the object key includes the file's MD5 checksum (the part after `md5/`, with the slash omitted):

   ```
   2024-05-10 14:51:48   14445097 dvcstore/files/md5/22/a1a2931c8370d3aeedd7183606fd7f
   ```

   You can check that your local file has the same checksum:

   ```bash
   md5sum data.xml
   ```
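The mapping from checksum to object key can be sketched locally: the first two hex characters of the MD5 become a directory, and the remaining 30 become the object name. The file content below is arbitrary:

```shell
# Make a throwaway file; the content does not matter
TMP_FILE=$(mktemp)
echo "example content" > "$TMP_FILE"

# md5sum prints "<checksum>  <path>"; keep only the checksum
MD5=$(md5sum "$TMP_FILE" | awk '{print $1}')

# Build a DVC-style object key from the checksum
KEY="dvcstore/files/md5/${MD5:0:2}/${MD5:2}"
echo "$KEY"
```

This layout spreads objects across prefixes, which is a common technique for keeping directory-like listings small.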
### Change the file and roll it back
1. Change the contents of `data.xml`. For example:

   ```bash
   echo "" > data.xml
   ```
2. Add the new version to DVC and Git and push it to the remote, as you did with the original file:

   ```bash
   dvc add data.xml
   git add data.xml.dvc
   git commit -m "Update raw data"
   dvc push
   ```

   > **Note**
   >
   > `data.xml` is already in `.gitignore`, so we are not adding it to the Git commit this time.

   There will now be two files in the bucket.
3. Check out the version of `data.xml.dvc` from the previous commit (`HEAD~1`) with Git, and update `data.xml` accordingly with DVC:

   ```bash
   git checkout HEAD~1 data.xml.dvc
   dvc checkout
   ```

   This rolls back the local data file to its previous version.
   > **Note**
   >
   > `dvc checkout` gets files from the local cache, which is enough here, before someone else changes the data and pushes it to the remote. To get data from the remote instead, use `dvc pull`.
## How to remove the resources you created
You are charged for storing objects in the bucket and performing operations on them, like uploading, downloading, or deleting them. If you do not need to use the bucket for DVC-tracked data anymore:
1. Pull the data from the DVC remote and remove the remote (your local data will not be affected):

   ```bash
   dvc pull
   dvc remote remove nebius-storage
   ```
2. Remove all objects from the bucket:

   ```bash
   aws s3 rm s3://dvc-test --recursive
   ```
3. Remove the bucket:

   ```bash
   ncp storage bucket remove dvc-test
   ```