MLFlow with DVC

Ashish Kumar
Published in TheCyPhy
Mar 29, 2023

Photo by Wexor Tmg on Unsplash: "I want a peaceful life like this turtle 🥲"

Hello Mortals 🤖, if you want to manage your dataset versioning like a pro, then this article is for you. MLFlow and DVC are both very popular tools, and using them together makes a lot of work easier, so let's do it 💪🏻.

First, make sure a Python environment is activated in your command prompt. Then initialize your git repository and DVC repository using the commands below.

git init
dvc init

Now we have to add remote storage for DVC. For tutorial purposes I will point it at a local directory, but it can just as well be any kind of cloud storage location, such as Amazon S3. Then commit the changes using git.

dvc remote add -d dvc-remote /tmp/dvc-storage
git add .
git commit -m "configure remote storage"

Then create a folder named "data" using the mkdir command and place the dataset file there. You can download the dataset from here; it is the wine quality dataset. Now add this file to DVC using the command below.

dvc add data/wine-quality.csv

Create a ".gitignore" file and add this csv file to it; this way we prevent the dataset itself from being uploaded to the git repository. (Recent DVC versions add this .gitignore entry for you automatically when you run "dvc add".)
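A quick aside on the file itself: it is semicolon-separated, which is why the training script later passes sep=";" to pandas. A minimal sketch with made-up rows (the column subset here is illustrative; the real file has more columns and about 4,900 rows):

```python
import io
import pandas as pd

# A few illustrative rows in the same shape as wine-quality.csv;
# note the semicolon separator instead of the usual comma.
sample = io.StringIO(
    "fixed acidity;volatile acidity;alcohol;quality\n"
    "7.0;0.27;8.8;6\n"
    "6.3;0.30;9.5;6\n"
    "8.1;0.28;10.1;5\n"
)

data = pd.read_csv(sample, sep=";")
print(data.shape)          # (3, 4)
print(list(data.columns))  # last column is the "quality" target
```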

Now let's commit our changes again and tag this dataset as v1. Tagging is mandatory; you will see why later. Then I will push my dataset to the remote storage. Later you can also pull the data back from the remote storage using the "dvc pull" command.

git add .
git commit -m "data: track"
git tag -a "v1" -m "raw data"
dvc push

# Note: "dvc push" identifies files by the MD5 hash of their contents, not by
# their file names, so two files with identical content are stored only once
# in the remote, even if they live in different directories.
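To make that content-addressing concrete, here is a stdlib-only sketch of the idea; the cache path shown in the comment follows DVC's classic (pre-3.0) layout, the rest is just Python's hashlib:

```python
import hashlib

# Two "files" with different names but identical bytes hash the same,
# so a content-addressed store like DVC keeps only one copy.
content_a = b"fixed acidity;volatile acidity;quality\n"
content_b = b"fixed acidity;volatile acidity;quality\n"

digest_a = hashlib.md5(content_a).hexdigest()
digest_b = hashlib.md5(content_b).hexdigest()

print(digest_a == digest_b)  # True: same content, same address

# In the classic cache layout the file lives under
# .dvc/cache/<first two hex chars>/<remaining 30 chars>
print(digest_a[:2] + "/" + digest_a[2:])
```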

Now let's see how DVC is useful. Assume you make some change to the dataset; a very simple one would be deleting 1,000 rows. To track this change in the dataset, use the commands below.

dvc add data/wine-quality.csv
dvc push
git add .
git commit -m "data: remove 1000 lines"
git tag -a "v2" -m "removed 1000 lines"

Now create a file named "mlflow-with-dvc.py" and paste in the code below.

import sys
import warnings
import logging
from urllib.parse import urlparse

import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

import mlflow
import mlflow.sklearn
import dvc.api

logging.basicConfig(level=logging.WARN)
logger = logging.getLogger(__name__)

# Get the dataset URL from DVC
path = "data/wine-quality.csv"  # the wine-quality.csv.dvc file is enough once "dvc push" has been done
repo = "/home/ashish/data_versioning/demo"  # the directory where "git init" was run
version = "v1"  # the git tag created earlier with: git tag -a "v1" -m "raw data"

data_url = dvc.api.get_url(
    path=path,
    repo=repo,
    rev=version,
)


def eval_metrics(actual, pred):
    rmse = np.sqrt(mean_squared_error(actual, pred))
    mae = mean_absolute_error(actual, pred)
    r2 = r2_score(actual, pred)
    return rmse, mae, r2


if __name__ == "__main__":
    warnings.filterwarnings("ignore")
    np.random.seed(40)

    # Read the wine-quality csv file from the remote repository
    data = pd.read_csv(data_url, sep=";")

    # Split the data into training and test sets. (0.75, 0.25) split.
    train, test = train_test_split(data)

    # The predicted column is "quality" which is a scalar from [3, 9]
    train_x = train.drop(["quality"], axis=1)
    test_x = test.drop(["quality"], axis=1)
    train_y = train[["quality"]]
    test_y = test[["quality"]]

    alpha = float(sys.argv[1]) if len(sys.argv) > 1 else 0.5
    l1_ratio = float(sys.argv[2]) if len(sys.argv) > 2 else 0.5

    with mlflow.start_run():

        # Log data params
        mlflow.log_param("data_url", data_url)
        mlflow.log_param("data_version", version)
        mlflow.log_param("input_rows", data.shape[0])
        mlflow.log_param("input_columns", data.shape[1])

        lr = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=42)
        lr.fit(train_x, train_y)

        predicted_qualities = lr.predict(test_x)

        (rmse, mae, r2) = eval_metrics(test_y, predicted_qualities)

        print("Elasticnet model (alpha=%f, l1_ratio=%f):" % (alpha, l1_ratio))
        print("  RMSE: %s" % rmse)
        print("  MAE: %s" % mae)
        print("  R2: %s" % r2)

        mlflow.log_param("alpha", alpha)
        mlflow.log_param("l1_ratio", l1_ratio)
        mlflow.log_metric("rmse", rmse)
        mlflow.log_metric("r2", r2)
        mlflow.log_metric("mae", mae)

        tracking_url_type_store = urlparse(mlflow.get_tracking_uri()).scheme

        # Model registry does not work with file store
        if tracking_url_type_store != "file":
            # Register the model.
            # There are other ways to use the Model Registry, depending on the
            # use case; please refer to the doc for more information:
            # https://mlflow.org/docs/latest/model-registry.html#api-workflow
            mlflow.sklearn.log_model(lr, "model", registered_model_name="ElasticnetWineModel")
        else:
            mlflow.sklearn.log_model(lr, "model")
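As a sanity check, the three numbers eval_metrics returns can be reproduced by hand in plain Python on a tiny toy example (the values below are picked purely for illustration):

```python
import math

# Toy example: three true "quality" labels and three predictions.
actual = [3.0, 4.0, 5.0]
pred = [3.0, 5.0, 5.0]

n = len(actual)
errors = [a - p for a, p in zip(actual, pred)]

rmse = math.sqrt(sum(e * e for e in errors) / n)  # root mean squared error
mae = sum(abs(e) for e in errors) / n             # mean absolute error

mean_actual = sum(actual) / n
ss_res = sum(e * e for e in errors)                    # residual sum of squares
ss_tot = sum((a - mean_actual) ** 2 for a in actual)   # total sum of squares
r2 = 1 - ss_res / ss_tot                               # coefficient of determination

print(rmse, mae, r2)  # ~0.577, ~0.333, 0.5
```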

To run the above code, type the command below.

python mlflow-with-dvc.py {alpha} {l1_ratio}
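The two arguments control regularization: alpha scales the overall penalty strength and l1_ratio blends the L1 and L2 terms. In scikit-learn's formulation the penalty added to the squared-error loss is alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2; a small sketch (elastic_net_penalty is a hypothetical helper for illustration, not part of sklearn):

```python
def elastic_net_penalty(w, alpha=0.5, l1_ratio=0.5):
    """Regularization term used by sklearn's ElasticNet (the full objective
    also includes a 1/(2n) squared-error data term)."""
    l1 = sum(abs(wi) for wi in w)       # ||w||_1
    l2 = sum(wi * wi for wi in w)       # ||w||_2^2
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1 - l1_ratio) * l2

w = [1.0, -2.0]
# l1_ratio=1.0 -> pure Lasso (L1 only); l1_ratio=0.0 -> pure Ridge (L2 only)
print(elastic_net_penalty(w, alpha=0.5, l1_ratio=1.0))  # 0.5 * 3 = 1.5
print(elastic_net_penalty(w, alpha=0.5, l1_ratio=0.0))  # 0.25 * 5 = 1.25
```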

Now you can see that just by changing the "version" variable in the code, you can choose which dataset version to use, and log that choice with MLFlow as well. For each version of the dataset you can record the scores you got. Suddenly dataset versioning seems so easy 😱.

I will be back with more interesting articles like this, till then Enjoy your day 😇.

Feel free to connect with me at linkedin.com/in/ashish — kumar for a Cup of Chat

