End to End Machine Learning Pipeline With MLOps Tools (MLFlow+DVC+Flask+Heroku+EvidentlyAI+Github Actions)
This article will show you how to automate the entire machine learning lifecycle with MLOps tools. First, a simple machine learning model will be trained using a churn dataset available on Kaggle. Then, the deployment will be done using Flask, GitHub Actions, and Heroku. Finally, production model monitoring will be done using EvidentlyAI. The tools below will be used throughout the project. Most of them are open source, so anyone can simply experiment with the code. The full codebase is available here in the GitHub repository. The main reference for this article is the 4-day MLOps course on Krish Naik's YouTube channel.
- Cookiecutter: Data science project structure
- Data version control (DVC): Version control for data assets and building the pipeline
- Github: For code version control
- GitHub Actions: To create the CI-CD pipeline
- MLFlow: For model registry
- Heroku: To deploy the application
- Flask: To create a web app
- EvidentlyAI: To evaluate and monitor ML models in production
- Pytest: To implement the unit tests
Step 1: Create the Environment
Create a working directory and change into the newly created folder. Then open a terminal and create a new Conda environment as shown below. I created the environment with Python 3.7 and named it churn_model. Then activate the new environment. If you do not have Anaconda installed on your machine, you can either install it or use a Python virtual environment instead. I used the Anaconda environment throughout this experiment.
conda create -n churn_model python=3.7 -y
conda activate churn_model
Step 2: Create the data science project structure with the cookiecutter
It is important to have a project structure when dealing with a data science project. We can use the cookiecutter template to organize the project. You will be asked to enter the required details when running cookiecutter. The details below were used for this project.
- project_name: churn_model
- description: End to End Machine learning pipeline with MLOps tools
- Select open_source_license: select MIT(option 1)
- s3_bucket / aws_profile [Optional]: just press Enter
- Select python_interpreter: python3 (option 1)
pip install cookiecutter
cookiecutter https://github.com/drivendata/cookiecutter-data-science
Step 3: Create a Github repo
Create a GitHub repo and push the current version to it. Here I did all the development in the main branch. Alternatively, you can create a separate branch for development.
git init
git add .
git commit -m "Adding cookiecutter template"
git remote add origin <your_github_repo>
git branch -M main
git push -u origin main
Step 4: Download the Dataset
Download a training dataset from Kaggle and put it in the external folder inside the data folder. Refer to this to download the train.csv from Kaggle. There are four folders inside the main data folder. We will only be using the external, raw, and processed folders in this project.
external: external files (e.g., train.csv from Kaggle)
raw: raw data for this project
processed: processed files created from the raw files
Step 5: Track the Dataset With DVC
Data Version Control (DVC) is Git-based data versioning, workflow, and experiment management software. We will use Git for code version control and DVC for data version control. You can find more information on DVC here. In this project, I will explain how DVC can be used. Check the external folder after executing the three commands below. You will see a new file named train.csv.dvc, which DVC uses to track the train.csv file.
Important: Remember to comment out the /data/ line in the .gitignore file before executing the commands, because we are now going to track the datasets with DVC, and it will create a separate .gitignore file inside the data folder.
pip install dvc
dvc init
dvc add data/external/train.csv
Step 6: Create the Source Code Inside the src Folder
All Python scripts related to the project are located in the src folder, which contains the data, features, visualization, models, and prediction folders. In this project, I will be using only the data, models, and prediction folders. A params.yaml file also needs to be created inside the main churn_model folder.
data: data-loading-related Python scripts (load_data.py, split_data.py)
models: model-related Python scripts (train_model.py, production_model_selection.py, model_monitor.py)
params.yaml: stores all the configurations related to this project
The external train.csv file is loaded into the raw folder by this script. Only six numerical features were used in this model for simplicity. The new CSV in the raw folder contains the six numerical features and the target column churn. The main focus of this project is to give more detail about the MLOps tools, so very little effort was spent on the modeling part.
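The loading script can be sketched as below. This is a minimal illustration only: the feature names are assumptions (the article does not list the six numerical columns), and the paths follow the folder layout described above.

```python
# load_data.py -- illustrative sketch; the column names below are
# assumptions, not the exact Kaggle schema.
import pandas as pd

# Hypothetical list of the six numerical features plus the target.
NUMERICAL_FEATURES = [
    "account_length", "number_vmail_messages", "total_day_calls",
    "total_day_charge", "total_eve_calls", "total_night_calls",
]
TARGET = "churn"

def load_and_select(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the six numerical features and the target column."""
    return df[NUMERICAL_FEATURES + [TARGET]].copy()

if __name__ == "__main__":
    raw = pd.read_csv("data/external/train.csv")
    load_and_select(raw).to_csv("data/raw/churn_train.csv", index=False)
```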
The objective of this Python script is to split train.csv in the raw folder and create new churn_train and churn_test files inside the processed folder.
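A sketch of the split script is shown below. The 80/20 split ratio and the random seed are assumptions; in the real project these values would be read from params.yaml.

```python
# split_data.py -- illustrative sketch of the raw -> processed split.
# test_size and seed are assumed values (read from params.yaml in practice).
import pandas as pd
from sklearn.model_selection import train_test_split

def split(df: pd.DataFrame, test_size: float = 0.2, seed: int = 42):
    """Split the raw dataset into train and test frames."""
    train_df, test_df = train_test_split(
        df, test_size=test_size, random_state=seed
    )
    return train_df, test_df

if __name__ == "__main__":
    raw = pd.read_csv("data/raw/churn_train.csv")
    train_df, test_df = split(raw)
    train_df.to_csv("data/processed/churn_train.csv", index=False)
    test_df.to_csv("data/processed/churn_test.csv", index=False)
```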
Model training will be done using this script. Model-related information is available in the params.yaml file, so we can experiment with several ML models by changing and adding parameters there. MLflow will be used to track model performance, which we can easily inspect through the MLflow dashboard.
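The core of the training script can be sketched as follows. The metric name and the tracking URI are assumptions (the URI matches the mlflow server command shown in Step 8); the random forest hyperparameters stand in for the values kept in params.yaml.

```python
# train_model.py -- minimal sketch. Metric names and the MLflow tracking
# URI are assumptions, not the exact project values.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def train_and_evaluate(X_train, y_train, X_test, y_test,
                       max_depth=5, n_estimators=100):
    """Train a random forest and return the fitted model and its accuracy."""
    model = RandomForestClassifier(
        max_depth=max_depth, n_estimators=n_estimators, random_state=42
    )
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    return model, acc

def log_to_mlflow(model, acc, params):
    """Log params, metric, and model to the tracking server from Step 8."""
    import mlflow  # imported lazily so the pure training part stays testable
    import mlflow.sklearn
    mlflow.set_tracking_uri("http://0.0.0.0:1234")  # assumed URI
    with mlflow.start_run():
        mlflow.log_params(params)
        mlflow.log_metric("accuracy", acc)
        mlflow.sklearn.log_model(model, "model")
```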
This script selects the best-performing model from the model registry and saves it in the models directory. In this project, the best model was selected using the accuracy score.
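The selection logic can be sketched like this. The experiment name is a guess, and the `mlflow.search_runs` call is an assumed way to pull the run history; the actual script may query the registry differently.

```python
# production_model_selection.py -- sketch of picking the best run by accuracy.
def select_best_run(runs):
    """runs: list of (run_id, accuracy) tuples -> run_id with highest accuracy."""
    best_id, _ = max(runs, key=lambda r: r[1])
    return best_id

def fetch_runs(experiment_name="churn_model"):
    """Assumed MLflow query; the experiment name is hypothetical."""
    import mlflow  # lazy import keeps select_best_run usable on its own
    df = mlflow.search_runs(experiment_names=[experiment_name])
    return list(zip(df["run_id"], df["metrics.accuracy"]))
```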
Step 7: Pipeline Creation
After creating the above files in src, it's time to write the model pipeline to execute the model. DVC will be used to create it. First, create the dvc.yaml file inside the churn_model folder.
Step 8: Pipeline Execution
The next step is to execute the model pipeline by running the dvc.yaml file. It contains four stages, and each stage contains at least three of the following fields:
1. cmd: command used to execute the script.
2. deps: specify the dependencies to execute the step.
3. outs: output from the step(Model files or datasets).
4. params: parameters used in the script.
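A dvc.yaml along these lines ties the stages together. The stage names mirror the four stages discussed in this article, but the exact script arguments, output paths, and parameter keys are assumptions.

```yaml
# dvc.yaml -- illustrative sketch; paths and parameter keys are assumed.
stages:
  raw_dataset_creation:
    cmd: python src/data/load_data.py --config=params.yaml
    deps:
      - src/data/load_data.py
      - data/external/train.csv
    outs:
      - data/raw/churn_train.csv
  split_data:
    cmd: python src/data/split_data.py --config=params.yaml
    deps:
      - src/data/split_data.py
      - data/raw/churn_train.csv
    outs:
      - data/processed/churn_train.csv
      - data/processed/churn_test.csv
  model_train:
    cmd: python src/models/train_model.py --config=params.yaml
    deps:
      - src/models/train_model.py
      - data/processed/churn_train.csv
      - data/processed/churn_test.csv
    params:
      - random_forest.max_depth
      - random_forest.n_estimators
  log_production_model:
    cmd: python src/models/production_model_selection.py --config=params.yaml
    deps:
      - src/models/production_model_selection.py
    outs:
      - models/model.joblib
```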
Execute the commands below to run the pipeline. Remember to start the MLflow server in another terminal.
* mlflow server --backend-store-uri sqlite:///mlflow.db --default-artifact-root ./artifacts --host 0.0.0.0 -p 1234
* dvc repro
The advantage of using DVC is that it executes a stage only if its dependencies have changed. For example, if we run the dvc repro command again, it will not execute any of the stages. But if we change one of the random forest parameters (max_depth or n_estimators) in params.yaml, it will execute the model_train stage and the stages after it, because those parameters are dependencies of the model_train stage. It will not execute the first two stages, since nothing changed in their dependencies. (Try several experiments by changing the random forest parameters.)
Now it’s time to check the mlflow dashboard.
Here I have done three experiments by changing the parameter values. It will store history with the parameter values and model results.
In the log_production_model stage, the best-performing model is automatically found using one of the accuracy measures (accuracy was used in this project). The best model is then saved inside the models folder (if you check the models folder, you will see the model file). This model will be used for the prediction service. We can use MLflow as a model registry, as shown above. You can try different ML models with different parameter combinations; the information related to every run will be stored.
Step 9: Web App With Flask
A simple web app will be created using Flask, a micro web framework written in Python. This web app will be used to consume the created model: a user enters the feature values into the form and, after submitting, the model predicts the outcome (churn or not).
Now it’s time to create the Python code related to the web app. Create the app.py file in the churn_model folder. The objective of this script is to predict the target from the request and send the response back to the frontend.
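A sketch of app.py is shown below. The template names (index.html, 404.html), the model path, and the validate_input helper are all hypothetical; the real app may structure things differently.

```python
# app.py -- minimal sketch of the prediction service. Template names,
# form handling, and the model path are assumptions.
import joblib

def validate_input(form_values):
    """Convert form strings to floats; raise ValueError for non-numeric input."""
    try:
        return [float(v) for v in form_values]
    except ValueError:
        raise ValueError("Please enter numerical values only")

def create_app(model_path="models/model.joblib"):
    from flask import Flask, render_template, request  # lazy import

    app = Flask(__name__)
    model = joblib.load(model_path)

    @app.route("/", methods=["GET", "POST"])
    def index():
        if request.method == "POST":
            try:
                features = validate_input(request.form.values())
                prediction = model.predict([features])[0]
                return render_template("index.html", response=prediction)
            except ValueError as err:
                return render_template("404.html", response=str(err))
        return render_template("index.html")

    return app

if __name__ == "__main__":
    create_app().run(host="0.0.0.0", port=5000, debug=True)
```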
Step 10: Unit Tests
Pytest will be used for simple testing. Create a separate folder named tests inside the main directory, then create the test_config.py and __init__.py files inside it. Here, a simple test checks whether the entered values are numerical; it raises an exception if someone enters a string instead of a number. It is important to remember that the names of all test functions must start with test_. After creating the unit tests, we can run them with the command below. We could also do this as frontend validation, but here I did it in the backend just to demonstrate pytest's capabilities. The test checks whether the required error message is returned when incorrect values (e.g., one or more non-numerical values) are entered into the form.
pytest -v
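The test file could look like the sketch below. The validate_input helper and its error message are assumptions carried over from the hypothetical app.py; the real tests may target the Flask route instead.

```python
# tests/test_config.py -- sketch of the unit tests described above.
# validate_input is a hypothetical helper, re-implemented here so the
# file is self-contained.
import pytest

def validate_input(form_values):
    try:
        return [float(v) for v in form_values]
    except ValueError:
        raise ValueError("Please enter numerical values only")

def test_numeric_values_pass():
    assert validate_input(["1", "2.5"]) == [1.0, 2.5]

def test_non_numeric_raises():
    with pytest.raises(ValueError, match="numerical values only"):
        validate_input(["1", "abc"])
```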
Step 11: Create an App in Heroku
Heroku will be used to deploy the application. Create an account in Heroku if you do not have one. After that follow the steps below to create the app and authorization token for the app.
* Go to https://dashboard.heroku.com/apps
* Click New and create a new app
* Give a name for the app and create it (I named it churnmodelprod)
* In the deployment method section, click Github.
* In the connect to Github section, enter the repo name and search. It will find the repo name for you. Then click connect.
* In the automatic deploys section, tick Wait for CI to pass before deploying and click the Enable Automatic Deploys button.
* Then go to Account settings → Applications → Authorizations → Create authorization.
* Enter a description in the box and click Create.
* Copy and save the authorization token for future use (We need this in the next step to create secrets).
Step 12: Create the CI-CD Pipeline Using GitHub Actions
The CI-CD pipeline will be created using GitHub Actions. Create a new file inside the .github/workflows folder and name it ci-cd.yaml. Please note that this file must have a .yaml (or .yml) extension.
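A workflow along these lines runs the tests and then deploys to Heroku. This is a sketch: the community action akhileshns/heroku-deploy and the placeholder email are assumptions, and the secret names match those created in the next step.

```yaml
# .github/workflows/ci-cd.yaml -- illustrative sketch; vet and pin the
# Heroku deploy action yourself before using it.
name: ci-cd
on:
  push:
    branches: [main]
jobs:
  ci:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.7"
      - run: pip install -r requirements.txt
      - run: pytest -v
  cd:
    needs: ci
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: akhileshns/heroku-deploy@v3.12.12
        with:
          heroku_api_key: ${{ secrets.HEROKU_API_TOKEN }}
          heroku_app_name: ${{ secrets.HEROKU_API_NAME }}
          heroku_email: your-email@example.com  # replace with your account email
```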
After implementing the CI-CD pipeline, we can easily propagate changes in the model or code to the frontend: we just push the code after making modifications, and the changes are reflected automatically. That is the advantage of a CI-CD pipeline for ML model development, because the ML context includes a model retraining step that is not part of the normal software development life cycle (retraining is discussed in Step 13).
Now we need to create two secrets inside GitHub as HEROKU_API_TOKEN and HEROKU_API_NAME to do the deployment.
* Select the repo and click the settings.
* Then click secrets in the left panel.
* Click New repository secret. Two secrets need to be created:
1. name: HEROKU_API_NAME |value: churnmodelprod
2. name: HEROKU_API_TOKEN |value: Authorization token saved in the last step
Finally, create a new file named Procfile. This file acts as the entry point to start the web app.
* web: gunicorn app:app
Now it’s time to push the final code to GitHub. It will automatically be deployed to Heroku using the Github actions. You can check the status of deployment by clicking the actions tab in the Github repo. So we have automated the entire machine learning life cycle. Therefore, whenever there are modifications to the code or model parameters, the changes will be reflected automatically in the front-end.
Now, it’s time to access the final URL from Heroku. For that, first, select the dashboard from the top right corner and select the new app (in my case it is churnmodelprod), and go to settings and check the domain section. You can get the URL from there.
Now just enter some random values into the form and check the output. It will show the result in the prediction section; that is, it will predict whether the customer will churn or not (yes/no).
Also, remember to check the output when entering non-numeric values into the form. The output is as below; that is, our validation is working successfully.
Step 13: Production Model Monitoring With EvidentlyAI
Now our model is in the production environment, but over time its performance may degrade. That is the common nature of any machine learning model, so we will need to retrain the model with new data, and therefore we need to continuously monitor model performance. EvidentlyAI is a very good tool for measuring model performance, and it can also be used to check for data drift or concept drift. It uses various statistical tests, such as the Kolmogorov-Smirnov and chi-squared tests, to measure drift.
Let's assume you receive a new dataset. Our objective is to check whether any data drift has happened in the new dataset. Since we do not have a new dataset, for demonstration purposes I extracted a 1,000-row sample from the original train dataset and put it inside the raw folder as train_new.csv. In a real-world scenario, we would have a totally new train dataset after some time. We can then check the data drift using the Python script below, which creates an HTML file inside the reports folder.
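A sketch of the monitoring script is below. The Evidently API shown (Dashboard with DataDriftTab and CatTargetDriftTab) matches the older 0.1.x releases and may differ in newer versions; treat the import paths and file names as assumptions.

```python
# model_monitor.py -- sketch of the drift check. Evidently's 0.1.x-style
# Dashboard API is assumed; newer releases use a different interface.
import pandas as pd

def sample_new_data(df: pd.DataFrame, n: int = 1000, seed: int = 42) -> pd.DataFrame:
    """Demo stand-in for fresh production data: a reproducible random sample."""
    return df.sample(n=min(n, len(df)), random_state=seed)

def build_drift_report(reference: pd.DataFrame, current: pd.DataFrame,
                       out_path="reports/data_and_target_drift_dashboard.html"):
    """Compare the reference and 'new' data and save an HTML drift report."""
    # Lazy import so sample_new_data works without Evidently installed.
    from evidently.dashboard import Dashboard
    from evidently.dashboard.tabs import DataDriftTab, CatTargetDriftTab

    dashboard = Dashboard(tabs=[DataDriftTab(), CatTargetDriftTab()])
    dashboard.calculate(reference, current)
    dashboard.save(out_path)

if __name__ == "__main__":
    reference = pd.read_csv("data/raw/churn_train.csv")
    current = sample_new_data(reference)
    current.to_csv("data/raw/train_new.csv", index=False)
    build_drift_report(reference, current)
```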
A new file named data_and_target_drift_dashboard.html is created inside the reports folder. There are three sections in this report.
1. Input features drift: This shows if there is any data drift in input features.
2. Target drift: This shows if there is any data drift in the target.
3. Target behavior by feature: This shows the target behavior according to each feature.
The tests below are used by EvidentlyAI to measure drift. All tests use a 95% confidence level by default.
Numerical variables: two-sample Kolmogorov-Smirnov test.
Categorical variables: chi-squared test
According to the above outputs, no data drift or target drift was detected. If there is drift in the input features, we may need to retrain the model. This tool has many useful functions for monitoring model performance; I used just two of them to generate the three outputs above. It is also possible to get the above results as a JSON file, and by using the values in that file we can automate retraining, so the automatic retraining process can be incorporated into the main pipeline easily.
References
1. 4-day MLOps course on Krish Naik’s YouTube channel (https://www.youtube.com/watch?v=1BSwYlJUxK0&list=PLZoTAELRMXVOk1pRcOCaG5xtXxgMalpIe)