4 Reasons Why Data Scientists Should Version Data

How to start data versioning using DVC

Fabiana Clemente
The Startup
4 min read · Sep 3, 2020


While working on a software project, it is very common, and in fact standard, to start versioning code right away, and the benefits are already obvious to the software community: every modification of the code is tracked in a repository. If a mistake is made, developers can always travel back in time and compare earlier versions of the code to solve the problem while minimizing disruption to the rest of the team. Code is a software project's most precious asset and for that reason must be protected at all costs!

Well, for Data Science projects, data can also be considered the crown jewels, so why don't we, as Data Scientists, treat it as the most precious thing on earth and put it under version control?

For those familiar with Git, you might be thinking, "Git cannot handle large files and directories, at least not with the same performance it offers for small code files. So how can I version control my data in the same old-fashioned way we version control code?". Well, this is now possible, and it's as easy as typing git clone and seeing the data files and ML model files appear in the workspace, and all this magic can be achieved with DVC.

Quick start with DVC

Data Version Control (DVC)

First things first, we have to get DVC installed on our machines. It's pretty straightforward and you can do it by following these steps.
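If you already have a Python environment available, one common way to install it is via pip (conda and Homebrew packages exist as well):

pip install dvc

Once installed, a project is initialized with git init followed by dvc init, which creates the .dvc/ directory where DVC keeps its configuration.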

As I've already mentioned, tools for data version control such as DVC make it possible to build large projects with reproducible pipelines. Using DVC, it's very simple to add datasets to a git repository, and by simple I mean as easy as typing the line below:

dvc add path/to/dataset
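Under the hood, dvc add creates a small .dvc metafile that points to the actual data and keeps the dataset itself out of Git. A typical next step, assuming the repository has already been set up with git init and dvc init, is to commit that metafile so this version of the dataset is recorded in Git history:

git add path/to/dataset.dvc path/to/.gitignore
git commit -m "Add raw dataset"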

Regardless of its size, the dataset is added to the repository. Assuming we also want to push the dataset to the cloud, that is possible with the command below:

dvc push path/to/dataset.dvc
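For the push to have somewhere to go, DVC first needs a remote configured in the project. Below is a minimal sketch using a hypothetical S3 bucket (the bucket name and path are placeholders; the same pattern works for the other supported storages):

dvc remote add -d storage s3://my-bucket/dvc-store
git add .dvc/config
git commit -m "Configure DVC remote storage"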

Out of the box, DVC supports many cloud storage services, such as S3, Google Storage, Azure Blobs, Google Drive, etc. And since the dataset was pushed to the cloud through the version control system, if I clone the project onto another machine, I'm able to download the data, or any other artifact, using the following command:

dvc pull
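Putting the pieces together, a teammate or a CI job can recreate the whole workspace, code and data included, with just two steps (the repository URL below is only a placeholder):

git clone https://github.com/<user>/<project>.git && cd <project>
dvc pull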

Well, now that you know how to get started with DVC, I suggest you go and explore the tool further, or similar ones. Version control should be your best friend as a Data Scientist, as it allows you not only to version datasets but also to create reproducible pipelines, keeping all your developments traceable.

If this hasn't convinced you yet, next I'll tell you why you should start version controlling your data!

Why should I start using data version control?

1. Save and reproduce all of your data experiments

As Data Scientists, we know that developing a Machine Learning model is not all about code, but also about data and the right parameters. Finding the perfect match often requires experimentation, which makes the process highly iterative and makes it extremely important to keep track of the changes made, as well as their impact on the end results. This becomes even more important in a complex environment where multiple data scientists are collaborating. In that sense, if we are able to take a snapshot of the data used to develop a certain version of the model and have it versioned, the process of iteration and model development becomes not only easier but also trackable.
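In practice, such a snapshot can be as simple as a Git tag: because the small .dvc metafiles live in Git, checking out an older commit brings back the references, and dvc checkout restores the matching data in the workspace. A minimal sketch, assuming a tag named v1.0 was created when that model version was trained:

git checkout v1.0
dvc checkout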

2. Debugging and testing

While playing around in Kaggle competitions, we often do not see the real challenges inherent in developing an ML-based solution for production systems. In fact, one of the biggest challenges is dealing with the variety of data sources and the amount of data we have available. Sometimes it can be a bit daunting to reproduce the results of an experiment if we are not even able to retrieve the exact dataset that was used. Data version control can ease these issues and make the process of developing machine learning solutions much simpler, more organized, and reproducible.

3. Compliance and auditing

Privacy regulations, such as the GDPR, already require companies and organizations to demonstrate compliance and keep a history of their data sources. The ability to track data versions, provided by version control tools, is the first step toward getting a company's data sources ready for compliance, and an essential step in maintaining strong and robust audit trails and risk management processes around data.

4. Align software and data science teams

Getting Data Science and Software teams to speak the same language can be quite challenging and often depends heavily on the profiles involved in the interactions between the teams. Bringing some of the good practices from software into data science processes can help not only to align the work between the teams involved, but also to accelerate the development and integration of the solutions.

Conclusions

Data science is hard to productize, and one of the main reasons is that there are too many mutable elements, such as data. The concept of versioning for data science applications can be interpreted in many ways, from model versioning to data versioning. This article aimed to cover the importance and benefits of versioning data for data science teams, but there are many more aspects we should pay attention to as Data Scientists. In the end, keeping an eye on continuous delivery principles is very important for the success of ML-based solutions!
