# Setting up an Object Storage bucket as Data Version Control (DVC) storage
In this tutorial, you will learn how to set up version control for files used in machine learning tasks, such as datasets and models, and store them efficiently. The setup uses Data Version Control (DVC) with a Nebius AI Object Storage bucket as the storage backend.
## Background
This section provides useful background information. To get started with the tutorial straight away, go to Steps.
### Why Data Version Control
In machine learning tasks, each change to code or data affects future models. To be able to reproduce experiments, it makes sense to control versions of datasets and models as well as code, and preferably all in one place. But Git and similar version control systems for code struggle with large data files.
You would think the solution would be something like Git for data, and Data Version Control (DVC) is exactly that: an open-source tool that brings Git-style versioning to datasets, models, and other large files.
*Version timeline during experiments*
We will focus on data versioning in this tutorial, but DVC offers much more to data science and machine learning teams. You can use it to automate ML pipelines, run and manage experiments, and evaluate them using metrics and plots. Learn more about DVC use cases in the DVC documentation.
### Nebius AI and DVC
One of DVC's main features is that it does not store the data itself under Git. Instead, you specify a storage location in DVC settings: a DVC remote. It is like a Git remote, but for data in a shared storage. You and your teammates can use it to share the data with each other and reproduce previous experiments with their original data.
DVC supports remotes in various storage systems, like Amazon S3 and other compatible object storage services. In this tutorial, we will use Object Storage offered by Nebius AI: it is a fast and cost-efficient storage service, fully compatible with Amazon S3.
While you can install DVC anywhere, we suggest doing it on a Nebius AI virtual machine, next to your GPUs and other computing resources. This speeds up data operations and eliminates egress traffic costs. For example, create a Data Science Virtual Machine in just a couple of clicks, connect to it over SSH, and go from there.
## Steps
In this tutorial, you will:
- Create an Object Storage bucket that will later be made a DVC remote.
- Create a DVC project.
- Add the bucket as a DVC remote.
Also, the tutorial covers optional steps of using DVC with Object Storage and deleting the created resources.
## Costs
The cost of the infrastructure deployed in this tutorial includes:
- Fees for storing data and performing operations with it (see Object Storage pricing).
## Prepare the environment
1. Install the necessary tools (the example commands are for Ubuntu):

   - DVC (requires Python 3.8 or later; see the DVC installation instructions for other options):

     ```bash
     pip install dvc
     ```

   - Nebius AI CLI (`ncp`), to create and manage the Object Storage bucket and other Nebius AI resources. See the installation guide.

   - AWS CLI (fully supported by Object Storage), to use `aws s3` commands for checking that the bucket stores DVC-tracked data:

     ```bash
     sudo apt-get install unzip
     curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
     unzip awscliv2.zip
     sudo ./aws/install
     ```

   - `jq`, to extract IDs and tokens from JSON data returned by the Nebius AI CLI:

     ```bash
     sudo apt-get install jq
     ```
2. Create a service account that will manage objects in the bucket on your behalf, and get the account's ID:

   ```bash
   SA_ID=$(ncp iam service-account create dvc-test-sa --format json \
     | jq ".id" -r)
   ```
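You can try the `jq` filter from this step locally on sample output. The JSON below is a hypothetical stand-in for the CLI response, which may contain more fields:

```shell
# Hypothetical stand-in for the `ncp ... --format json` output
SAMPLE='{"id": "sa-abc123", "name": "dvc-test-sa"}'

# Extract the id as a raw string (-r strips the quotes),
# exactly as the tutorial's filter does
SA_ID=$(echo "$SAMPLE" | jq ".id" -r)
echo "$SA_ID"
```

Without `-r`, `jq` would print `"sa-abc123"` with quotes, which would break the variable's later use in commands.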
3. Add the service account to the `editors` group to grant it the necessary permissions:

   ```bash
   ORG_ID=$(ncp organization-manager organization list --format json \
     | jq ".[0].id" -r)
   ncp organization-manager group add-members editors \
     --subject-id $SA_ID --organization-id $ORG_ID
   ```
4. Create an access key that will be used for the service account's authentication:

   ```bash
   ACCESS_KEY=$(ncp iam access-key create \
     --service-account-name dvc-test-sa --format json)
   ```
5. Set up the AWS CLI:

   - Write the access key into `~/.aws/credentials`:

     ```bash
     cat <<EOF > ~/.aws/credentials
     [default]
     aws_access_key_id = $(echo $ACCESS_KEY | jq ".access_key.key_id" -r)
     aws_secret_access_key = $(echo $ACCESS_KEY | jq ".secret" -r)
     EOF
     ```

   - Write the Nebius AI region and endpoint settings into `~/.aws/config`:

     ```bash
     cat <<EOF > ~/.aws/config
     [default]
     region = eu-north1
     endpoint_url = https://storage.ai.nebius.cloud
     EOF
     ```
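The heredoc pattern used above can be sketched with placeholder values and a temporary file instead of the real `~/.aws/credentials` (the key values here are made up):

```shell
# Placeholder values; real keys come from `ncp iam access-key create`
KEY_ID="EXAMPLEKEYID"
SECRET="examplesecret"
TMP_CREDS=$(mktemp)

# An unquoted EOF marker lets $VAR and $(...) substitutions expand
# inside the heredoc, which is how the tutorial injects jq output
cat <<EOF > "$TMP_CREDS"
[default]
aws_access_key_id = $KEY_ID
aws_secret_access_key = $SECRET
EOF

grep aws_access_key_id "$TMP_CREDS"
```

If you quoted the marker (`<<'EOF'`), the substitutions would be written literally instead of being expanded.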
## Create an Object Storage bucket

Create a bucket named `dvc-test` that will serve as a DVC remote and store DVC-tracked data:

```bash
ncp storage bucket create --name dvc-test
```
## Create a DVC project
1. Initialize Git in a dedicated directory, for example `dvc-example`:

   ```bash
   mkdir dvc-example
   cd dvc-example
   git init
   ```
2. Initialize DVC in the same directory and commit the DVC internal files with Git:

   ```bash
   dvc init
   git commit -m "Initialize DVC"
   ```

   > **Note**
   >
   > `dvc init` creates a `.dvc` directory with DVC internal files and stages it for the Git commit. You do not need to run `git add .dvc`.
## Add the bucket as a DVC remote
DVC storage is managed through DVC remotes. To add the Object Storage bucket as a DVC remote:
1. Create an AWS S3-compatible remote. For example, if you created a bucket named `dvc-test` and want to store data under a "folder" named `dvcstore` (that is, the object keys will have `dvcstore/` as a prefix):

   ```bash
   dvc remote add -d nebius-storage s3://dvc-test/dvcstore
   ```
2. Set up the remote to connect to Nebius AI Object Storage:

   ```bash
   dvc remote modify nebius-storage endpointurl \
     https://storage.ai.nebius.cloud
   dvc remote modify nebius-storage region eu-north1
   ```

   These `dvc remote` commands modify the `.dvc/config` file. Alternatively, you can edit it manually.
3. Commit the updated configuration file with Git:

   ```bash
   git add .dvc/config
   git commit -m "Add Nebius AI remote"
   ```
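After these steps, `.dvc/config` should look roughly like the fragment below. This is a sketch: the section names follow DVC's config format, but the exact contents may vary between DVC versions:

```ini
[core]
    remote = nebius-storage
['remote "nebius-storage"']
    url = s3://dvc-test/dvcstore
    endpointurl = https://storage.ai.nebius.cloud
    region = eu-north1
```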
## How to use DVC with Object Storage
### Start tracking a file and upload it
1. Add data to the DVC project. For example:

   ```bash
   dvc get https://github.com/iterative/dataset-registry \
     get-started/data.xml -o data.xml
   dvc add data.xml
   ```

   This creates `data.xml.dvc`, which serves as a metafile for `data.xml`, and adds `data.xml` to the list of files ignored by Git.
2. Commit `data.xml.dvc` and `.gitignore` with Git:

   ```bash
   git add data.xml.dvc .gitignore
   git commit -m "Add raw data"
   ```
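The metafile is a small YAML document. A sketch of what `data.xml.dvc` may contain; the checksum and size here mirror the example bucket listing in this tutorial, and the exact set of fields can differ between DVC versions:

```yaml
outs:
- md5: 22a1a2931c8370d3aeedd7183606fd7f
  size: 14445097
  path: data.xml
```

Git tracks only this small metafile, while the data itself lives in the DVC cache and the remote.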
3. Push the data to the DVC remote:

   ```bash
   dvc push
   ```
4. Check that the data has been uploaded to the bucket:

   ```bash
   aws s3 ls s3://dvc-test/dvcstore --recursive
   ```

   You will see that the object key includes the file's MD5 checksum (the part after `md5/`, with the slash omitted):

   ```
   2024-05-10 14:51:48   14445097 dvcstore/files/md5/22/a1a2931c8370d3aeedd7183606fd7f
   ```

   You can check that your local file has the same checksum:

   ```bash
   md5sum data.xml
   ```
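The mapping from checksum to object key can be sketched locally: the first two hex characters of the MD5 become a directory, and the remaining 30 become the object name. The file content below is arbitrary:

```shell
# Make a throwaway file; the content does not matter
TMP_FILE=$(mktemp)
echo "example content" > "$TMP_FILE"

# md5sum prints "<checksum>  <path>"; keep only the checksum
MD5=$(md5sum "$TMP_FILE" | awk '{print $1}')

# Build a DVC-style object key from the checksum
KEY="dvcstore/files/md5/${MD5:0:2}/${MD5:2}"
echo "$KEY"
```

This layout spreads objects across prefixes, which is a common technique for keeping directory-like listings small.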
### Change the file and roll it back
1. Change the contents of `data.xml`. For example:

   ```bash
   echo "" > data.xml
   ```
2. Add the new version to DVC and Git and push it to the remote, as you did with the original file:

   ```bash
   dvc add data.xml
   git add data.xml.dvc
   git commit -m "Update raw data"
   dvc push
   ```

   > **Note**
   >
   > `data.xml` is already in `.gitignore`, so we are not adding it to the Git commit this time.

   There will now be two files in the bucket.
3. Check out the version of `data.xml.dvc` from the previous commit (`HEAD~1`) with Git, and update `data.xml` accordingly with DVC:

   ```bash
   git checkout HEAD~1 data.xml.dvc
   dvc checkout
   ```

   This rolls back the local data file to its previous version.
   > **Note**
   >
   > `dvc checkout` gets files from the local cache, which is enough here, before someone else changes the data and pushes it to the remote. To get data from the remote instead, use `dvc pull`.
## How to remove the resources you created
You are charged for storing objects in the bucket and performing operations on them, like uploading, downloading, or deleting them. If you do not need to use the bucket for DVC-tracked data anymore:
1. Pull the data from the DVC remote and remove the remote (your local data will not be affected):

   ```bash
   dvc pull
   dvc remote remove nebius-storage
   ```
2. Remove all objects from the bucket:

   ```bash
   aws s3 rm s3://dvc-test --recursive
   ```
3. Remove the bucket:

   ```bash
   ncp storage bucket remove dvc-test
   ```