MLOps: Building a deep learning project from end to end (Part 1)

Prathmesh Patil · Analytics Vidhya · Apr 22, 2021
credits — valohai.com

For the past few years the software industry has been talking about 'DevOps', which has been adopted quickly across the industry because of its proven improvements over traditional approaches to the SDLC (Software Development Life Cycle). The popular 'agile' methodology has served us well for the past decade, but where agile focuses on bridging the gap between customers and the development team, DevOps focuses on bridging the gap between development and operations, and it does so through automated pipelines, also known as CI/CD (Continuous Integration and Continuous Delivery).

On the other hand, the data science industry has also experienced a boom, and adopting similar practices enhances it even further; hence 'MLOps' (Machine Learning Operations), a term proposed back in 2015.
In this article, we will briefly go through some basic concepts and then build a simple deep learning project completely from end to end, all the way to deployment. For simplicity the article is divided into two parts: part 1 covers loading the data, splitting it and training a model, while part 2 covers deployment.

Contents

  1. What is MLOps?
  2. Introduction to DVC
  3. Project implementation and design part-1

1. What is MLOps?

To explain it simply, MLOps is the aggregation of the data science/machine learning project life cycle with operations. It is based on DevOps, but with some additions such as CI/CD pipelines for models and model-retraining strategies. These practices exist to reduce the time it takes to build the complete project while assuring quality, a robust software experience and monitoring.
MLOps simply helps in building a more effective business solution to a given problem. To learn about it in detail, check this article out.

2. Introduction to DVC

DVC (Data Version Control) is an open-source tool for tracking and versioning data in machine learning projects, and it is used together with Git to create CI/CD pipelines. If you are familiar with Git, this tool is easy to pick up. It also supports remote storage back ends such as S3, Azure Blob Storage, Google Drive, Google Cloud Storage, HDFS, HTTP, network-attached storage or local disk for storing data.
Moreover, it integrates with big data tools such as HDFS, Hive and Apache Spark.
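Since this project keeps its data locally, we won't configure a remote here, but for reference the remote workflow looks roughly like this (the remote name and URL are placeholders):

# add a default remote, then sync the DVC-tracked data with it
dvc remote add -d storage s3://my-bucket/dvc-store
dvc push
dvc pull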

source-dvc.org

The DVC framework is quite big, and a separate article could be written on DVC itself covering its different features and functionality. Other than DVC there are many other tools for MLOps, such as MLflow, Databricks, and cloud platforms like AWS, Azure and GCP.
In this article we will use DVC, as it is a simple yet powerful tool to get started with.
The picture above shows how DVC helps to build the complete pipeline, all the way to deployment.

3. Project Implementation part-1

Introduction: we will build a simple classification model at the back end, create a web app at the front end, and finally deploy it on a cloud platform, Heroku.

3.1: Create an anaconda environment and install the requirements

Open a terminal or cmd on your OS and type the following command:

conda create -n 'env name' python==3.7

Once the environment is created, type:

conda activate 'env name'

You will see your environment name beside the path in the cmd/terminal.

1.1-After creating and activating the environment

Now download my repo from GitHub and install the packages listed in the requirements.txt file, which contains all the dependencies of the project.

pip install -r requirements.txt

Make sure your cmd/terminal is in the project's working directory; while the command executes you should see something like the following.

1.2-installing the packages

3.2 Understanding our project architecture
Before proceeding any further, let's go through the project architecture to get an idea of the pipelines.

1.3-Project Architecture

Here the pipeline starts with data collection; the data can come from any source, such as a cloud store or a live stream. In our case, we will use data that has already been collected.
In the next stage, we prepare the data, in particular splitting it into train and test sets according to a given ratio.
Then comes the model-building stage, where we train the model and save it.
Once the model is trained, we go through some metrics and plots to evaluate how it is performing.
Finally, after all of this, the model is deployed on Heroku behind a web app served through a Flask API.
DVC is responsible for keeping track of the data and the pipelines.
GitHub Actions helps build the pipeline and continuously integrate changes all the way to deployment.
Lastly, as always, the Git repository is the base repository where all the code is pushed.

3.3 Initializing Git and DVC and getting data (Stage-1)
In the root directory, we will create a local Git repo and initialize DVC.
Type the following commands in the terminal/cmd:

git init
dvc init

Once done, you will see two new files, as shown below.

1.4-after running the above two commands

Now, to start tracking the data, type the following commands:

dvc add Data_Set/Bulbasaur
dvc add Data_Set/Charmander
dvc add Data_Set/Squirtle

After running the above, you will see some new files in the Data_Set directory, as shown below.

1.5- .dvc files created after running the above 3 commands
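DVC keeps only small pointer (.dvc) files in Git while the actual images go into DVC's cache, so it is good practice to commit those pointer files right away. A minimal sketch, with paths matching the three dvc add commands above:

git add Data_Set/Bulbasaur.dvc Data_Set/Charmander.dvc Data_Set/Squirtle.dvc Data_Set/.gitignore
git commit -m "track raw data with DVC"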

3.4 Creating ‘parameters.yaml’ and other folders

Now we need some configuration files and folders before we can start building the pipelines. Look at the image below and type the commands in cmd/terminal.

1.6-commands for creating required files and folders

Note: if you are using Linux, use the command ‘touch’ instead of ‘type nul >>’. Also, don’t run ‘git init’ again, as we have already done that previously.
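In case the screenshot is hard to read, the commands look roughly like this on Windows (the file and folder names are taken from the rest of this article, so treat this as a sketch rather than an exact copy of the image):

mkdir src Data Data\preprocessed saved_models Reports
type nul >> parameters.yaml
type nul >> dvc.yaml
type nul >> src\getdata.py
type nul >> src\split.py
type nul >> src\model_train.py
type nul >> src\evaluate.py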

Finally, you will end up with a directory structure like this.

3.5 The config file and data preparation (Stage-2)

In the ‘parameters.yaml’ file, copy-paste the following parameters. For this you can use any text editor or IDE; I will be using VS Code.

base:
  project: Deep Learning using MlOPS
data_source:
  data_src: Data_Set
load_data:
  num_classes: 3
  raw_data: Data_Set
  preprocessed_data: Data\preprocessed
  full_p: \MlOPS-DeepLearning\Data_Set

These are the parameters in the ‘.yaml’ file:
1. project: the project name
2. data_src: where we get the data from; in our case it’s the ‘Data_Set’ folder
3. num_classes: the number of classes available for classification
4. preprocessed_data: the directory where the split data will be placed
5. full_p: the full path to the ‘Data_Set’ folder

Note: since this is the pipeline we are going to build, any changes should be made in ‘parameters.yaml’ only, unless we want to change the complete pipeline. For example, if you have more classes, say 6 or 7, change the value of ‘num_classes’ from 3 accordingly.

To read the config file, I have made a ‘getdata.py’ file, which loads the ‘.yaml’ file and returns the parsed parameters so the other scripts can look up specific values from it.

import os
import numpy as np
import shutil
import random
import yaml
import argparse


def get_data(config_file):
    config = read_params(config_file)
    return config


def read_params(config_file):
    with open(config_file) as conf:
        config = yaml.safe_load(conf)
    return config


if __name__ == '__main__':
    args = argparse.ArgumentParser()
    args.add_argument("--config", default='parameters.yaml')
    passed_args = args.parse_args()
    a = get_data(config_file=passed_args.config)
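Every other script in the pipeline imports get_data from this file to read its settings, for example:

from getdata import get_data

config = get_data('parameters.yaml')
print(config['load_data']['num_classes'])   # prints 3 with the config above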

The next step is to prepare the data (in our case, images) for training and testing, so the code below creates new folders following the layout data -> preprocessed -> train, test.
Each sub-directory of train and test will contain the 3 classes of images. The split ratio is also given in ‘parameters.yaml’, as described just below.
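One thing to note: ‘split.py’ below reads the ratio from a train_split section that is not shown in the snippet above, so make sure your ‘parameters.yaml’ also contains something like the following (80 is just an example value meaning an 80/20 train/test split, since the code divides it by 100):

train_split:
  split_ratio: 80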

import os
import numpy as np
import shutil
import random
import yaml
import argparse
from getdata import get_data


# method for creating the train/test class folders
def create_fold(config, img=None):
    config = get_data(config)
    dirr = config['load_data']['preprocessed_data']
    cla = config['load_data']['num_classes']
    print(dirr)
    print(cla)
    if os.path.exists(dirr + '/' + 'train' + '/' + 'class_0') and os.path.exists(dirr + '/' + 'test' + '/' + 'class_0'):
        print('train and test folders already exist...')
        print('skipping it!')
    else:
        os.mkdir(dirr + '/' + 'train')
        os.mkdir(dirr + '/' + 'test')
        for i in range(cla):
            os.makedirs(os.path.join(dirr + '/' + 'train', 'class_' + str(i)))
            os.makedirs(os.path.join(dirr + '/' + 'test', 'class_' + str(i)))


# method for splitting the images into train and test sets
def train_test_split(config):
    config = get_data(config)
    root_dir = config['data_source']['data_src']
    dest = config['load_data']['preprocessed_data']
    p = config['load_data']['full_p']
    cla = config['data_source']['data_src']
    cla = os.listdir(cla)
    # keep only the class folders, skipping the .dvc pointer files and .gitignore
    cla = [i for i in cla if not i.endswith('.dvc') and not i.startswith('.git')]
    print(cla)
    splitr = config['train_split']['split_ratio']
    print(splitr)
    for k in range(len(cla)):
        print(cla[k])
        per = len(os.listdir(os.path.join(root_dir, cla[k])))
        cnt = 0
        for j in os.listdir(os.path.join(root_dir, cla[k])):
            pat = os.path.join(p + '/' + cla[k], j)
            split_ratio = round((splitr / 100) * per)
            print(split_ratio)
            if cnt != split_ratio:
                # the first split_ratio percent of images go to train
                shutil.copy(pat, dest + '/' + 'train/class_' + str(k))
                cnt = cnt + 1
            else:
                # the remaining images go to test
                shutil.copy(pat, dest + '/' + 'test/class_' + str(k))
    print('done')


if __name__ == '__main__':
    args = argparse.ArgumentParser()
    args.add_argument("--config", default='parameters.yaml')
    passed_args = args.parse_args()

    create_fold(config=passed_args.config)
    train_test_split(config=passed_args.config)

Note: put the two scripts above (‘getdata.py’ and ‘split.py’) in the src folder; don’t keep them in the root folder.
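If you want to run this stage on its own before wiring up DVC, the command is the same one we will later put into ‘dvc.yaml’:

python src/split.py --config=parameters.yaml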

1.7-running ‘split.py’ file
1.7- subdirectories created.

The two screenshots above show the result of running ‘split.py’: each train subdirectory contains images as per the split ratio (and the same goes for the test folder).

3.6 Model Building & Training

Now it's about time to build and train the model, which in our case will be VGG19. You can use any other model of your choice. But even for building the model we need some parameters, which we supply through the ‘parameters.yaml’ file. Look at the snippet below.

model:
  name: ResNet50
  trainable: False
  train_path: Data\preprocessed\train
  test_path: Data\preprocessed\test
  image_size: [225,225]
  loss: 'categorical_crossentropy'
  optimizer: 'adam'
  metrics: ['accuracy']
  epochs: 8
  sav_dir: 'saved_models/trained.h5'
img_augment:
  rescale: 1./255
  shear_range: 0.2
  zoom_range: 0.2
  horizontal_flip: True
  vertical_flip: True
  batch_size: 18
  class_mode: 'categorical'

Every attribute in the config snippet above is a parameter for training the model and augmenting the images. You can change these values as you like. It’s time to build the model.

import numpy as np
from keras.applications.resnet import ResNet50
from keras_preprocessing.image import ImageDataGenerator
from keras.layers import Dense, Input, Flatten
from keras.models import Model
from glob import glob
import os
import argparse
from getdata import get_data
import matplotlib.pyplot as plt
from keras.applications.vgg19 import VGG19


def train_model(config_file):
    config = get_data(config_file)
    train = config['model']['trainable']
    if train == True:
        img_size = config['model']['image_size']
        trn_set = config['model']['train_path']
        te_set = config['model']['test_path']
        num_cls = config['load_data']['num_classes']
        rescale = config['img_augment']['rescale']
        shear_range = config['img_augment']['shear_range']
        zoom_range = config['img_augment']['zoom_range']
        verticalf = config['img_augment']['vertical_flip']
        horizontalf = config['img_augment']['horizontal_flip']
        batch = config['img_augment']['batch_size']
        class_mode = config['img_augment']['class_mode']
        loss = config['model']['loss']
        optimizer = config['model']['optimizer']
        metrics = config['model']['metrics']
        epochs = config['model']['epochs']

        # base model: VGG19 pre-trained on ImageNet, without its top classifier
        resnet = VGG19(input_shape=img_size + [3], weights='imagenet', include_top=False)
        for p in resnet.layers:
            p.trainable = False
        op = Flatten()(resnet.output)
        prediction = Dense(num_cls, activation='softmax')(op)
        mod = Model(inputs=resnet.input, outputs=prediction)
        print(mod.summary())
        img_size = tuple(img_size)
        mod.compile(loss=loss, optimizer=optimizer, metrics=metrics)

        train_gen = ImageDataGenerator(rescale=1./255,
                                       shear_range=shear_range,
                                       zoom_range=zoom_range,
                                       horizontal_flip=horizontalf,
                                       vertical_flip=verticalf,
                                       rotation_range=90)
        test_gen = ImageDataGenerator(rescale=1./255)

        train_set = train_gen.flow_from_directory(trn_set,
                                                  target_size=(225, 225),
                                                  batch_size=batch,
                                                  class_mode=class_mode)
        test_set = test_gen.flow_from_directory(te_set,
                                                target_size=(225, 225),
                                                batch_size=batch,
                                                class_mode=class_mode)
        history = mod.fit(train_set,
                          epochs=epochs,
                          validation_data=test_set,
                          steps_per_epoch=len(train_set),
                          validation_steps=len(test_set))

        # save the loss and accuracy curves to the Reports directory
        plt.plot(history.history['loss'], label='train_loss')
        plt.plot(history.history['val_loss'], label='val_loss')
        plt.legend()
        plt.savefig('Reports/train_v_loss')
        plt.figure()  # new figure so the accuracy curves don't overlay the loss plot
        plt.plot(history.history['accuracy'], label='accuracy')
        plt.plot(history.history['val_accuracy'], label='val_acc')
        plt.legend()
        plt.savefig('Reports/acc_v_vacc')
        mod.save('saved_models/trained.h5')
        print('model saved')

    else:
        print('Model not trained')


if __name__ == '__main__':
    args_parser = argparse.ArgumentParser()
    args_parser.add_argument('--config', default='parameters.yaml')
    passed_args = args_parser.parse_args()
    train_model(config_file=passed_args.config)

In the above code we train VGG19 on our data and save the model file to the ‘saved_models’ directory, while the accuracy and loss plots are saved in the ‘Reports’ directory.
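To trigger training manually, run the script with the config file, exactly as the ‘dvc.yaml’ pipeline will do later:

python src/model_train.py --config=parameters.yaml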

Note: if you don’t want to train the model again, change the ‘trainable’ parameter to ‘False’ inside the parameters.yaml file.

After training, you will see the following output.

1.8-After training

After this, we will push all the updates made so far to our GitHub repo. For that, go to GitHub, create a new repo and copy its URL, then run the following commands.

git remote add origin 'paste your github repo url'
git add .
git commit -m "model training done"
git push origin main
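Note (this depends on your local Git setup): on older Git versions the first local branch is called master rather than main, so the push above may fail because there is no local main branch yet. In that case rename the branch first and push again:

git branch -M main
git push origin main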

Note: optionally, you can keep pushing your changes to GitHub after every phase as well.

1.9-after pushing the updates to GitHub repo

Also, if you check the Reports directory, you will find two images: the plot of train accuracy vs. validation accuracy and the plot of train loss vs. validation loss.

2.0-Saved plots

3.7 Evaluation

Now that training is done, we will move on to evaluating the model's performance by generating a confusion matrix and a classification report, and saving them as an image and a CSV respectively in the ‘Reports’ directory.
Create a new file ‘evaluate.py’ in the src directory and paste the snippet below.

from keras.models import load_model
from sklearn.metrics import confusion_matrix, classification_report
import os
import numpy as np
import argparse
from getdata import get_data
from keras_preprocessing.image import ImageDataGenerator
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd


def m_evaluate(config_file):
    config = get_data(config_file)
    batch = config['img_augment']['batch_size']
    class_mode = config['img_augment']['class_mode']
    te_set = config['model']['test_path']
    model = load_model('saved_models/trained.h5')

    test_gen = ImageDataGenerator(rescale=1./255)
    test_set = test_gen.flow_from_directory(te_set,
                                            target_size=(225, 225),
                                            batch_size=batch,
                                            class_mode=class_mode,
                                            shuffle=False)  # keep order so predictions line up with test_set.classes
    label_map = test_set.class_indices
    print(label_map)

    Y_pred = model.predict_generator(test_set, len(test_set))
    y_pred = np.argmax(Y_pred, axis=1)

    print('Confusion Matrix')
    # rows of the confusion matrix are the actual labels, columns the predicted ones
    sns.heatmap(confusion_matrix(test_set.classes, y_pred), annot=True)
    plt.ylabel('Actual values, 0:Bulbasaur, 1:Charmander, 2:Squirtle')
    plt.xlabel('Predicted values, 0:Bulbasaur, 1:Charmander, 2:Squirtle')
    plt.savefig('Reports/Confusion Matrix')

    print('Classification Report')
    target_names = ['Bulbasaur', 'Charmander', 'Squirtle']
    df = pd.DataFrame(classification_report(test_set.classes, y_pred,
                                            target_names=target_names,
                                            output_dict=True)).T
    df['support'] = df.support.apply(int)
    df.to_csv('Reports/classification_report')
    print('Classification Report And Confusion Matrix Saved at Reports Directory')


if __name__ == '__main__':
    args_parser = argparse.ArgumentParser()
    args_parser.add_argument('--config', default='parameters.yaml')
    passed_args = args_parser.parse_args()
    m_evaluate(config_file=passed_args.config)
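Run it the same way as the other stages:

python src/evaluate.py --config=parameters.yaml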

After running this you will see 2 new files as shown below

2.1-confusion matrix and classification report

Now it’s time to push the updates to GitHub.

git add .
git commit -m "model evaluated and reports saved"
git push origin main

3.8 Building pipelines

Moving on, we will wire all the stages into a single pipeline so that they execute sequentially; for this we need to describe the stages in the ‘dvc.yaml’ file.
Open the ‘dvc.yaml’ file and paste the following. This file is responsible for connecting all the stages together into a complete pipeline.

stages:
  load_data:
    cmd: python src/split.py --config=parameters.yaml
    deps:
      - src/getdata.py
      - src/split.py
      - Data_Set/
    outs:
      - data/preprocessed:
          persist: true
  train_model:
    cmd: python src/model_train.py --config=parameters.yaml
    deps:
      - src/getdata.py
      - src/model_train.py
    outs:
      - saved_models:
          persist: true
      - Reports:
          persist: true
  evaluate:
    cmd: python src/evaluate.py --config=parameters.yaml
    deps:
      - src/getdata.py
      - src/evaluate.py

The ‘dvc.yaml’ file is used to build and execute the pipeline, in which:
1. stages: defines the stages of the pipeline
2. load_data, train_model, evaluate: are the names of the stages
3. cmd: runs the specific Python file for a stage
4. deps: lists the dependencies required to run that Python file
5. outs: specifies where the outputs of the executed script are saved
6. persist: keeps the outputs produced by the pipeline between runs
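Before executing anything, you can ask DVC to print the stage graph it builds from ‘dvc.yaml’; on recent DVC versions the command is:

dvc dag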

So, to run all the stages in the pipeline, simply run the command below in cmd/terminal:

dvc repro
2.2-after running the above command

After running the ‘dvc repro’ command, all the pipeline stages execute and a ‘dvc.lock’ file is generated, which tracks all the changes in the pipeline.

2.3-dvc.lock file

It also records a unique hash value (MD5) for every file in every stage of the pipeline.
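For reference, a ‘dvc.lock’ entry looks roughly like this (hashes shortened to placeholders; the exact fields depend on your DVC version):

schema: '2.0'
stages:
  load_data:
    cmd: python src/split.py --config=parameters.yaml
    deps:
    - path: src/split.py
      md5: 1a2b3c...
    outs:
    - path: data/preprocessed
      md5: 4d5e6f....dir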

So far we have built the pipeline for preparing the data, training the model and evaluating it. But we are not done yet: we still need to build a web app using Flask and deploy it to Heroku using continuous integration and deployment with the help of GitHub Actions.
As this article has already become pretty long, I will continue in the next part, i.e. part 2, where we will go ahead and deploy this model.
Here is my GitHub repo for the current project; you can go ahead and download it.

That being said, see you in my next article, i.e. part 2, which is a very interesting one! Here is the link for part 2 of my article. For any queries, you can contact me through LinkedIn.
