Published in Analytics Vidhya

Versioning data and models in ML projects using DVC and AWS S3

In this blog, we will look in detail at how DVC can be used to version our data and models. The code for this blog is available here. For details on training a Named Entity Recognition (NER) model on the CoNLL-2003 dataset using TensorFlow 2.2.0, please read the blog here.

What is DVC?

Data Version Control, or DVC, is a data and ML experiment management tool that works very much like Git. It helps us track and save data and ML models. DVC saves information about the data in special metafiles that replace the data in the repository. These can be versioned with regular Git workflows (branches, pull requests, etc.). DVC uses a built-in cache to store the data and supports synchronizing it with remote storage options like AWS S3, Google Drive, Microsoft Azure, Google Cloud, etc.
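The idea behind DVC's cache is content addressing: a file's MD5 hash determines where it lives under .dvc/cache, so identical files are stored only once. The sketch below is purely illustrative (it is not DVC's actual code, and recent DVC versions nest the layout slightly differently), but it shows the scheme.

```shell
# Illustrative sketch of DVC-style content addressing (not DVC itself).
printf 'hello dvc' > sample.txt
hash=$(md5sum sample.txt | cut -d' ' -f1)
# The first two hex characters become a directory, the rest the file name.
echo ".dvc/cache/${hash:0:2}/${hash:2}"
rm sample.txt
```

Because the path is derived from the content, re-adding an unchanged file costs nothing: its cache entry already exists.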

Code and data flows in DVC

We will be storing our data (the data folder), embeddings and model outputs (the model_output folder) in an S3 bucket. I have created an S3 bucket named s3://dvcexample. Refer to this link to create an S3 bucket.

Let us get started with DVC. Install the dvc package using pip. To install using other options, refer to this link. Also install the boto3 and dvc[s3] packages.

pip install dvc
pip install boto3 "dvc[s3]"

After installing, initialise DVC inside the Git project:

dvc init

A .dvc/.gitignore and a .dvc/config file are created. Commit this change with

git commit -m "Initialize DVC"

The folder structure for this project looks like this

├── data
│ ├── train.txt
│ ├── valid.txt
│ ├── test.txt

├── embeddings
│ ├── glove.6B.100d.txt

├── model_output
│ ├── checkpoint
│ ├── embedding.pkl
│ ├── idx2Label.pkl
│ ├── model_weights.data-00000-of-00001
│ ├── model_weights.index
│ ├── word2Idx.pkl

We will have to track the data, embeddings and model_output directories. Let us first track the data folder with DVC.

dvc add data

DVC moves the file contents to its cache. It also creates a corresponding metafile named data.dvc to track the folder, using its path and hash to identify the cached data. It also adds the data folder to .gitignore to prevent it from being committed to the Git repository.
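For reference, the generated data.dvc metafile is a small, human-readable YAML file along these lines (the hash below is a placeholder; DVC appends .dir to the hash of a tracked directory):

```yaml
outs:
- md5: a1b2c3d4e5f67890a1b2c3d4e5f67890.dir
  path: data
```

This metafile is all that Git sees; the actual data stays in the cache and the remote.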

Commit the .dvc file in git.

git add data.dvc .gitignore
git commit -m "add data"

Similarly, track the embeddings and model_output folders with DVC using the commands shown below.

dvc add embeddings
git add embeddings.dvc
git commit -m "add embeddings"
dvc add model_output
git add model_output.dvc
git commit -m "add models"

Let us set the S3 bucket s3://dvcexample as remote storage.

dvc remote add -d myremote s3://dvcexample/ner

This command creates a ['remote "myremote"'] section in the DVC project's config file and, because of the -d flag, sets myremote as the default remote in the core section.

We also need to set the AWS access_key_id and secret_access_key for the S3 bucket:

dvc remote modify myremote access_key_id AWS_ACCESS_KEY_ID
dvc remote modify myremote secret_access_key AWS_SECRET_ACCESS_KEY

(Note: substitute your actual AWS access key ID for AWS_ACCESS_KEY_ID and your actual secret access key for AWS_SECRET_ACCESS_KEY. You need not use double quotes.)

The information about your remote storage is captured in .dvc/config. It contains the S3 bucket name, the access key ID and the secret access key. Please make sure you delete the access_key_id and secret_access_key before you push the .dvc/config file to Git. Alternatively, you can use credentialpath, which is described in detail below under the heading Using credentialpath.
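Another way to keep the secrets out of Git entirely (a sketch, using the same placeholder key names as above) is DVC's --local flag, which writes these settings to .dvc/config.local, a file DVC keeps out of the repository:

```shell
dvc remote modify --local myremote access_key_id AWS_ACCESS_KEY_ID
dvc remote modify --local myremote secret_access_key AWS_SECRET_ACCESS_KEY
```

With this, the committed .dvc/config never contains credentials, so there is nothing to delete before pushing.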

[core]
    remote = myremote
['remote "myremote"']
    url = s3://dvcexample/ner
    access_key_id = AWS_ACCESS_KEY_ID
    secret_access_key = AWS_SECRET_ACCESS_KEY

Now that we have configured the S3 remote storage, let's push the data, embeddings and model_output files:

dvc push

All the files in those three directories are now pushed to S3 storage.

Let us commit the .dvc/config file to Git. Make sure you delete the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY from the .dvc/config file first. Cool! Data and models are stored in remote storage, and DVC helps you version them.

Now let us push the code to Git. Assuming we are on the master branch:

git push origin master

Using credentialpath

Let us configure AWS using the AWS CLI. Make sure you have installed awscli:

pip install awscli

Then type aws configure in your terminal.

When you enter this command, the AWS CLI prompts you for four pieces of information: your access key ID, secret access key, AWS region and output format.

$ aws configure 
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Default region name [None]: us-west-2
Default output format [None]: json

The AWS CLI stores this information in a profile (a collection of settings) named default in the credentials file. The ~/.aws/credentials file holds the access key ID and secret access key, something like the one shown below.

[default]
aws_access_key_id=AKIAIOSFODNN7EXAMPLE
aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

Now that you have configured AWS, you can use credentialpath instead of embedding the keys. This sets up access to S3:

dvc remote modify myremote credentialpath ~/.aws/credentials

This actually edits the .dvc/config file as shown below.

[core]
    remote = myremote
['remote "myremote"']
    url = s3://dvcexample/ner
    credentialpath = /Users/user1/.aws/credentials

Getting data and models

All your teammates need to do is git clone the repository, set up a virtual environment and install the packages listed in requirements.txt. Remember, the committed .dvc/config does not contain the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY; replace them with your own credentials in the .dvc/config file.

dvc pull 

The above command pulls all the data, embeddings and model_output files. You are all set.
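Putting the steps above together, a teammate's full sequence looks like this (the repository URL and folder name are placeholders):

```shell
git clone https://github.com/<user>/<repo>.git   # placeholder URL
cd <repo>
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
dvc pull    # fetches data/, embeddings/ and model_output/ from S3
```

Only the small Git repository travels over git clone; the heavy files come down with dvc pull.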

Changes in data/ models

Suppose you change some hyperparameters in train.py and train a new model. After training, model_output contains new model files. All you need to do is:

dvc add model_output
git commit model_output.dvc -m 'model updates'
dvc push
git push origin master

Like git checkout, we can use dvc checkout to switch between different versions of our data.
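For example, to restore the data that an earlier commit pointed to (the commit hash is a placeholder):

```shell
git checkout <earlier-commit> data.dvc   # bring back the old metafile
dvc checkout data.dvc                    # rebuild data/ from the cache
```

Because the metafile records the hash, DVC can reconstruct exactly that version from its cache or the remote.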

References

https://dvc.org/
