How to Version Control your Machine Learning task — I
What is Version Control?
A component of software configuration management, version control, also known as revision control or source control, is the management of changes to documents, computer programs, large web sites, and other collections of information. Changes are usually identified by a number or letter code, termed the “revision number”, “revision level”, or simply “revision”. For example, an initial set of files is “revision 1”. When the first change is made, the resulting set is “revision 2”, and so on. Each revision is associated with a timestamp and the person making the change. Revisions can be compared, restored, and with some types of files, merged.
Why Version Control?
An important question: why do we need version control? I work on a task on my local computer or in the cloud, and I deploy the model to my server once it is ready and only after I have finished testing it. So why do I need version control?
Now let’s look at a scenario: I work for a company like Botsupply and I have clients. I am the AI guy. I built a question-answering search using a TF-IDF based model and deployed it on my server. In the next phase, I made some changes to it, and my accuracy on my dummy data increased, so I deployed the new version on the server. Now, because of the complexity of the real test data, the performance decreases, and I want to go back to the previous version.
One way is to deploy the previous version again. The second, and better, solution is to use version control and revert to the previous version.
How to do version control?
1. One of the most popular tools for version control is Git. It is very popular, and nearly every programmer and data scientist knows how to use it today.
Git is really cool, but for a data scientist, keeping every folder synced in Git is hard. Model checkpoints and data files take up a lot of space unnecessarily. One way around this is to store the data sets on a cloud service such as Amazon S3, keep the reproducible code in Git, and generate the models on the fly. That seems like a good choice, but if several data sets are used with the same code, it can create confusion and, unless documented properly, may lead to data sets getting mixed up in the long run.
Also, if the data changes or is upgraded and the commits are not documented properly, the model might lose its context.
Results without contexts are more deadly than poison — Giovanni Toschi , Botsupply
If the files cannot be reproduced on the fly, git-annex might be an option.
git-annex allows managing files with git, without checking the file contents into git. While that may seem paradoxical… (git-annex.branchable.com)
2. The second option is to do everything in a sandbox environment, look at the results, and not commit the changes to production if they are not good. The IPython notebook (Jupyter Notebook) is a good way to do this. The code can be broken into smaller segments in separate cells, and the results can be inspected at every step, which makes it one of the best editors for machine learning.
The Jupyter Notebook is a web-based interactive computing platform. The notebook combines live code, equations… (jupyter.org)
3. The best option (in my opinion) is Data Version Control, or DVC. DVC is similar to Git in many ways (such as the command structure), but it also tracks the steps, the dependencies between the steps, the dependencies between the code and data files, and all the arguments the code was run with, so it combines version control for the code with version control for the data.
Read more about Data Version Control: Git for data science projects (dataversioncontrol.com)
It is hardly possible in real life to develop a good machine learning model in a single pass. ML modeling is an iterative process and it is extremely important to keep track of your steps, dependencies between the steps, dependencies between your code and data files and all code running arguments. — Dmitry Petrov , DVC
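The bookkeeping described in the quote above (code, data files, and running arguments) can be sketched by hand to show what is being tracked. This is only a minimal illustration using the Python standard library, not how DVC works internally; the function names, the `abc123` code version, the `max_features` argument, and the sample file are all hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_run_record(code_version: str, data_files: list, args: dict) -> dict:
    """Bundle code version, data fingerprints, and arguments into one record."""
    return {
        "code_version": code_version,
        "data": {str(p): file_sha256(p) for p in data_files},
        "args": args,
    }


# Demo with a throwaway file standing in for a real training set (hypothetical).
with tempfile.TemporaryDirectory() as tmp:
    data_path = Path(tmp) / "train.csv"
    data_path.write_text("question,answer\nhi,hello\n")
    before = build_run_record("abc123", [data_path], {"max_features": 5000})

    # Simulate a data upgrade: the fingerprint changes, so the record does too.
    data_path.write_text("question,answer\nhi,hello\nbye,goodbye\n")
    after = build_run_record("abc123", [data_path], {"max_features": 5000})

    data_changed = before["data"] != after["data"]
    print(json.dumps(after, indent=2))
    print("data changed:", data_changed)
```

Storing a record like this next to each model makes it possible to tell which data and arguments produced a given result, which is the context that gets lost when commits are not documented.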
What is DVC and 5 good reasons to use it?
DVC makes data science projects reproducible by automatically building a data dependency graph (DAG). Your code and its dependencies can be shared easily through Git, and your data through cloud storage (AWS S3, GCP), in a single DVC environment.
1. It is completely open source and can be installed with a single pip command:
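To make the idea of a data dependency graph concrete, here is a small sketch of how a pipeline can be modelled as a DAG and used to decide which steps must be re-run after a file changes. It is an illustration of the concept only, not DVC's implementation. The stage names and the files `data/raw.csv`, `data/test.csv`, and `data/metrics.json` are hypothetical, and it assumes Python 3.9+ for `graphlib`.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each stage lists the files it reads and writes.
STAGES = {
    "extract": {"deps": ["data/raw.csv"], "outs": ["data/train_matrix.p"]},
    "train": {"deps": ["data/train_matrix.p", "train_model.py"], "outs": ["data/model.p"]},
    "evaluate": {"deps": ["data/model.p", "data/test.csv"], "outs": ["data/metrics.json"]},
}


def stage_graph(stages: dict) -> dict:
    """Map each stage to the set of stages that produce one of its inputs."""
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    return {
        name: {producer[d] for d in s["deps"] if d in producer}
        for name, s in stages.items()
    }


def stages_to_rerun(stages: dict, changed_file: str) -> list:
    """Return, in execution order, the stages affected by a changed file."""
    graph = stage_graph(stages)
    order = list(TopologicalSorter(graph).static_order())
    stale = set()
    for name in order:
        # A stage is stale if it reads the changed file or depends on a stale stage.
        if changed_file in stages[name]["deps"] or graph[name] & stale:
            stale.add(name)
    return [name for name in order if name in stale]


print(stages_to_rerun(STAGES, "train_model.py"))  # ['train', 'evaluate']
print(stages_to_rerun(STAGES, "data/raw.csv"))    # ['extract', 'train', 'evaluate']
```

Changing only the training script leaves the extraction step untouched, while changing the raw data invalidates everything downstream of it.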
pip install dvc #pip3 for python3
2. The command structure is similar to Git's:
dvc run python train_model.py data/train_matrix.p data/model.p
3. It is language independent: machine learning processes written in any language can easily be turned into reproducible DVC pipelines.
4. Not only can DVC streamline your work into a single, reproducible environment, it also makes it easy to share this environment through Git, including the dependencies (DAG). This is an exciting collaboration feature that makes it possible to reproduce research results on different computers.
5. Data files can be shared through any cloud file-sharing service, such as AWS S3 or GCP Storage, since DVC does not push data files to Git repositories.
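The command in reason 2 passes an input path and an output path to an ordinary script. The article does not show `train_model.py`, so the following is only a hypothetical sketch of the shape such a script could take: it reads a pickled training set, fits a toy "model" (the mean feature vector per label, chosen purely for illustration), and writes the result to the output path.

```python
import pickle
import tempfile
from pathlib import Path


def train(rows):
    """Toy 'model': the mean feature vector for each label."""
    sums, counts = {}, {}
    for features, label in rows:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def main(argv):
    """Usage: python train_model.py <input_pickle> <output_pickle>"""
    in_path, out_path = Path(argv[1]), Path(argv[2])
    with in_path.open("rb") as handle:
        rows = pickle.load(handle)
    with out_path.open("wb") as handle:
        pickle.dump(train(rows), handle)


# Demo in a temporary directory instead of the real data/ folder.
with tempfile.TemporaryDirectory() as tmp:
    matrix_path = Path(tmp) / "train_matrix.p"
    model_path = Path(tmp) / "model.p"
    with matrix_path.open("wb") as handle:
        pickle.dump([([1.0, 3.0], "a"), ([3.0, 5.0], "a"), ([0.0, 2.0], "b")], handle)
    main(["train_model.py", str(matrix_path), str(model_path)])
    with model_path.open("rb") as handle:
        model = pickle.load(handle)
    print(model)
```

Keeping inputs and outputs as explicit file arguments means the whole step is described by its command line, which is what makes it straightforward to record and replay.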
To learn more about the installation and usage of DVC in detail, please check the following blog post:
Read Part II to see how DVC helped me with version control while working with Numerai, where data scientists use machine learning to make predictions that power Numerai’s hedge fund.