An introduction: Version Control for Data Science projects with DAGsHub

Samuel Theophilus
Jun 21
DAGsHub · The home for data science collaboration

Platforms like GitHub have long been the standard tools for version-controlling software projects. Machine learning projects, however, face a challenge that GitHub was not designed for: model and data version control.

GitHub enforces a strict file-size limit of 100 MB, which rules out version control for large datasets and model weights, so data scientists and ML engineers have had to improvise workarounds. The good news is that DAGsHub solves this challenge, enabling efficient version control for data science projects!

What is DAGsHub?

DAGsHub is a web platform for data version control and collaboration, built for data scientists and machine learning engineers. It is built on Git and DVC, an open-source command-line tool for data and pipeline versioning.

With DAGsHub, you use Git for exactly the same things you would in a regular code project, and DAGsHub adds visualization and automation features on top. This works because DVC layers data science and machine learning commands on top of Git, with a syntax deliberately similar to Git's, so it is not entirely unfamiliar: most Git commands have a direct DVC equivalent. Feel free to check out the DAGsHub FAQs for more info.

What is DVC?

DVC is an open-source version control tool for machine learning projects, designed to handle large files, datasets, machine learning models, and metrics. It works on top of Git, so it integrates easily with your existing Git code repositories. For more information on DVC, see DAGsHub's "What is DVC?" page.
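To make this concrete: when DVC tracks a large file, it commits only a small text pointer (a `.dvc` file) to Git, while the file's actual contents live in remote storage. A pointer file looks roughly like this (the hash, size, and path below are illustrative placeholders, not taken from the article):

```yaml
outs:
- md5: 1a2b3c4d5e6f7890abcd1234ef567890
  size: 104857600
  path: data/training_set.csv
```

Because the pointer is tiny, it sails under GitHub's 100 MB limit while still uniquely identifying the exact version of the data it stands for.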

Managing the Machine Learning Workflow

A typical ML project involves trying different algorithms, models, libraries, and datasets, so there is a need to manage and track the lifecycle of the project and the experiments it produces. DAGsHub lets users track experiments either with libraries such as MLflow or simply by working with metrics.csv and params.yml files.
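As a minimal sketch of the file-based route: the two files are plain text, with params.yml holding one key-value pair per line and metrics.csv holding one logged value per row. The column layout below mirrors the convention used by the DAGsHub logger (Name, Value, Timestamp, Step); the parameter names and values are hypothetical.

```python
import csv
import time

# Hypothetical experiment results (illustrative values only)
params = {"model": "RandomForest", "n_estimators": 100}
metrics = {"roc_auc_score": 0.91}

# params.yml: one "key: value" pair per line
with open("params.yml", "w") as f:
    for key, value in params.items():
        f.write(f"{key}: {value}\n")

# metrics.csv: header row, then one row per logged metric value.
# The column names follow the DAGsHub logger's convention -- treat
# them as an assumption, not a guarantee.
with open("metrics.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Name", "Value", "Timestamp", "Step"])
    for name, value in metrics.items():
        writer.writerow([name, value, int(time.time() * 1000), 1])
```

Committing these two files to a DAGsHub repo is enough for the platform to render them as an experiment, without any logging library installed.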

Preview of DAGsHub Repository

Getting Started

1. Create an Account

If you do not have a DAGsHub account, visit https://dagshub.com/ and sign up in a similar manner as you would on GitHub (you can sign up using your GitHub account to link your DAGsHub profile with your GitHub profile).

2. Create a New Repository or Mirror a GitHub Repository

CREATING A NEW DAGSHUB REPO: Visit this link to create a new repo (you must be logged in to your account).

MIRROR GITHUB REPO: If you want to work from GitHub and only track your models & dataset on DAGsHub, you can simply create a mirror repo.

3. Clone the Repository


Run Command:

git clone https://dagshub.com/<DAGsHub-user-name>/hello-world.git

Visit Link for more information.

4. Setup DVC

Run Command:

pip install dvc

dvc init

If you encounter errors while trying to execute, try again with this code:

pip install dvc --ignore-installed

dvc init

Now the next step is to configure DAGsHub as remote storage. This enables you to save large files on the cloud via DAGsHub.

dvc remote add origin --local https://dagshub.com/<DAGsHub-user-name>/hello-world.dvc

dvc remote modify origin --local auth basic

dvc remote modify origin --local user <DAGsHub-user-name>

dvc remote modify origin --local ask_password true

5. Configure & Track Files with DVC (Models & Data)

Git will track all .dvc files unless they are listed in the .gitignore file. Similarly, add files you don't want DVC to process to .dvcignore.
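As a concrete sketch of this step (the file paths below are hypothetical), you tell DVC to track your large files, commit the resulting .dvc pointer files with Git, and then push the actual contents to the DAGsHub remote configured above:

```shell
# Track a dataset and model weights with DVC (hypothetical paths)
dvc add data/raw_dataset.csv models/model.pkl

# DVC creates small .dvc pointer files and adds the originals to
# .gitignore; commit the pointers with Git as usual
git add data/raw_dataset.csv.dvc models/model.pkl.dvc .gitignore
git commit -m "Track dataset and model with DVC"

# Upload the actual file contents to the DAGsHub DVC remote
dvc push -r origin
```

After the push, the files appear in the repository's DVC storage on DAGsHub, while GitHub only ever sees the small pointer files.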

Visit Link for more information.

6. Create & Track Experiments

If you would like to know how to get started with MLflow for creating and tracking experiments, visit this link.

LOGGING WITH THE DAGSHUB LOGGER: the model will be tracked via metrics.csv and params.yml:

!pip install dagshub

import dagshub

with dagshub.dagshub_logger() as logger:
    # Model definition

    # log the model's parameters
    logger.log_hyperparams(model_class=type(model_obj).__name__)
    logger.log_hyperparams({'model': model_obj.get_params()})

    # Train the model

    # log the model's performance
    logger.log_metrics({'roc_auc_score': <roc_auc_score>})

Visit Link for more information.

LOGGING WITH MLFLOW:

!pip install mlflow

import mlflow
import os

REPO_NAME = 'nnitiwe-dev/Hello-DAGsHub'
USER_NAME = <DAGsHub username>

os.environ['MLFLOW_TRACKING_USERNAME'] = USER_NAME
os.environ['MLFLOW_TRACKING_PASSWORD'] = <DAGsHub access token or password>

mlflow.set_tracking_uri('https://dagshub.com/<DAGsHub Repository>.mlflow')

mlflow.keras.autolog()

run_id = None
with mlflow.start_run(experiment_id=<Experiment ID>) as run:
    run_id = run.info.run_id
    mlflow.log_params({'Name': <Experiment Name>,
                       'Group': <Group Name>,
                       'Contributor Name': <Contributor Name>,
                       'Model': <Model Name>,
                       'Extra Model Values': <Model Value>})


Model Training:

# MLflow metric logging
with mlflow.start_run(run_id=run_id) as run:
    mlflow.log_metric('<METRIC>', <metric_value>, step=i)

For more information on how to log and monitor experiments in real time with MLflow, check out these Colab examples by DAGsHub.
