Machine Learning Lifecycle Management

An awesome tutorial to manage and automate all the steps between gathering the data and the production-level deployment of the models.

Swarup Das
The Startup
Aug 9, 2020


Overview

Since data is the oil of the 21st century, people are always finding ways to use data science concepts to convert data into dollars. With that being said, we can all agree that Machine Learning is a fairly hot topic within this realm. We should keep in mind that developing, deploying, and improving ML models at scale is not at all in alignment with the traditional software development lifecycle. Continuous Delivery for Machine Learning (CD4ML) is the discipline of bringing Continuous Delivery principles and practices to Machine Learning applications. In this guide we will see how to manage and automate the numerous steps between gathering the data and deploying the machine learning model, using some amazing open source tools.

Prerequisites

This might not be a very beginner-friendly guide, but I will try to explain the crux, the workings, and best practices wherever necessary, along with code snippets. It is assumed that the reader has working knowledge of the tools we will be using throughout, so I will quickly recap what they are about.

1. Apache Airflow

There are multiple steps involved in an ML cycle, so there can be multiple points of failure. To manage such daunting pipelines, we use Airflow as our workflow management system. Airflow takes all your defined tasks and organises them into a DAG (directed acyclic graph). The Airflow scheduler then executes your tasks on an array of workers while following the specified dependencies. Airflow pipelines are defined in Python. It segregates your tasks into operators like PythonOperator, BashOperator, etc., and offers an amazing user interface from which you can trigger, stop, and track your workflows, along with a bunch of other options. Get a head start here.

2. MLflow

MLflow is an open source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry. MLflow currently offers four components: tracking, projects (a convention for organising and describing your code to let other data scientists run it), models, and the model registry. MLflow also offers a comprehensive user interface where you can see your model's performance, compare metrics, perform versioning, etc. Get a head start here.

Alright, Let's Go!

1. Configuring Airflow

From here on it's assumed you have working installations of Airflow and MLflow, along with Python 3 (libs: sklearn, xgboost, pandas, sqlalchemy, etc.). Select a directory for your Airflow workspace, open a terminal, and type the following:
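(The original screenshot of the command is not reproduced here. A minimal sketch, assuming an Airflow 1.10-era installation, which was current at the time of writing; the workspace path is a hypothetical placeholder.)

    export AIRFLOW_HOME=~/airflow_workspace   # hypothetical path; point at your chosen directory
    airflow version                           # running any airflow command initialises AIRFLOW_HOME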

If the command worked, you will see that Airflow has created airflow.cfg and some other files in the AIRFLOW_HOME directory. By default the cfg file sets Airflow up to work with SQLite, which is slow and does not support parallel task execution (so it is not recommended). You can change it to any database of your choice; in my case I changed the connection string to Postgres.

Also change the load_examples variable to False, as Airflow otherwise loads a bunch of irrelevant example DAGs into the UI, which we will see later.
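For reference, the two relevant lines in airflow.cfg would look something like this; the Postgres credentials and database name below are hypothetical placeholders:

    # airflow.cfg
    sql_alchemy_conn = postgresql+psycopg2://airflow_user:airflow_pass@localhost:5432/airflow_db
    load_examples = False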

Okay, now that we are done with the config, the next step is to run the following command.
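Assuming Airflow 1.10.x (on Airflow 2.x the equivalent is airflow db init):

    airflow initdb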

This creates all the tables for storing your task and user metadata, and Airflow is now good to go. Let's split our complete cycle into the following elementary steps:

  • Load the data from SQL or any other data source
  • Preprocess and feature engineering
  • Train multiple models (RF, XGB, SVM, etc.)
  • Evaluate and compare performance metrics
  • Deploy the model

Please feel free to improvise as per your needs :)

Let's now create a folder named dags inside AIRFLOW_HOME, and within the dags folder create subfolders like this:

Folder Structure
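(The screenshot is not reproduced here. Based on the description that follows, the layout would look roughly like this; the two flow scripts are created in later steps.)

    AIRFLOW_HOME/
    └── dags/
        ├── Load_Dump/
        ├── Feature_Engineering/
        ├── Models/
        ├── flow_1.py
        └── flow_2.py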

Organising the modules into such folders is necessary and is obviously good practice for isolating functionalities from each other. For example, Load_Dump contains all the modules responsible for loading the data from SQL and dumping it locally for the feature engineering flows to consume. Next, the Feature_Engineering folder holds all the functions that do your label encoding, impute missing values, aggregate variables, etc. The Models folder contains the code that takes the processed features from Feature_Engineering and trains various models on them.

2. Setting up your ML code to interact with MLflow

Please make sure the MLflow tracking server is up and running, or start it with the following command:
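A sketch of the command; the backend store URI is a hypothetical placeholder (the model registry used later requires a database-backed store, so a local SQLite file is used here):

    mlflow server \
        --backend-store-uri sqlite:///mlflow.db \
        --default-artifact-root ./mlruns \
        --host 0.0.0.0 \
        --port 5050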

The --default-artifact-root flag tells MLflow where to store artifacts like your model's pickle file or graphs (more on this later). I am also using the non-default port 5050 for my convenience. You can go to http://0.0.0.0:5050/ to check out the MLflow UI.

In the previous step we created separate folders for the different functionalities. Coming to the modelling part, a typical script loads the processed data, trains the model, and saves it in some format like a pickle file. For our use case we will use MLflow's API to log the parameters, metrics, and graphs to the MLflow UI. Here is a code chunk which shows how to accomplish this:

A snippet of XGBOOST_Training.py
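(The original screenshot is not reproduced here. A minimal sketch of what the training code might look like; the experiment name, file paths, and hyperparameters are illustrative assumptions.)

    # XGBOOST_Training.py -- illustrative sketch
    import mlflow
    import mlflow.xgboost
    import pandas as pd
    import xgboost as xgb
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split

    def run_Xgboost_flow():
        # Point the client at the tracking server started earlier
        mlflow.set_tracking_uri("http://0.0.0.0:5050")
        mlflow.set_experiment("ml_lifecycle_demo")  # hypothetical experiment name

        # Load the features produced by the Feature_Engineering step (path is illustrative)
        data = pd.read_csv("processed_features.csv")
        X, y = data.drop("target", axis=1), data["target"]
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

        params = {"max_depth": 5, "n_estimators": 100}  # illustrative hyperparameters

        with mlflow.start_run(run_name="xgboost"):
            model = xgb.XGBClassifier(**params)
            model.fit(X_train, y_train)
            preds = model.predict(X_test)

            # Log the params, metrics, and the trained model to the tracking server,
            # registering the model so it can later be staged for production
            mlflow.log_params(params)
            mlflow.log_metric("f1_score", f1_score(y_test, preds))
            mlflow.xgboost.log_model(model, artifact_path="model",
                                     registered_model_name="xgboost_model")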

After this code executes, go to the MLflow UI (the tracking server running on http://0.0.0.0:5050/) and you will see a lot of things. Select the name of your experiment and you can see the log of all the runs, along with all the metrics and parameters you have used.

MLFLOW UI

You can investigate further by clicking on a run, which takes you to another page where you can see your artifacts (the model pickle, the conda environment file, and the graphs/plots you logged in the training code), as shown below. You also have the option to download the shown items.

MLFLOW UI

Let's say we are comparing 2 models (XGBoost and RF) on the comparison screen, and we select XGBoost because of its better f1-score (shown below).

MLflow supports versioning and helps you keep track of all the experiments you have done. Now that you have decided you want this model deployed, you can go to MLflow's versioning screen, select the model version, and transition the stage of the model to Production.

MLFLOW UI

Does this mean the model is deployed? Unfortunately, no. It only means the specific model version is flagged for production. Let's see how we can use Airflow's abilities in conjunction with this.

3. Creating DAG Flows

i) The first workflow

Create a Python script flow_1.py that looks something like this:

A code screenshot from flow_1.py
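(The screenshot is not reproduced here. A sketch of what flow_1.py might contain, assuming Airflow 1.10-style imports; the module paths and DAG name are hypothetical, while the task and function names follow the ones mentioned in the article.)

    # flow_1.py -- illustrative sketch
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    from Load_Dump.loader import load_from_sql               # hypothetical module paths
    from Feature_Engineering.features import build_features
    from Models.RF_Training import run_Rf_flow
    from Models.XGBOOST_Training import run_Xgboost_flow

    default_args = {
        "owner": "airflow",
        "start_date": datetime(2020, 8, 1),
    }

    dag = DAG("ml_training_pipeline",            # hypothetical pipeline name
              default_args=default_args,
              concurrency=2,                     # max tasks running at once for this DAG
              schedule_interval="*/30 * * * *")  # trigger every 30 minutes

    load_data_sql = PythonOperator(
        task_id="load_data_sql", python_callable=load_from_sql, dag=dag)
    preprocess_feature_engineering = PythonOperator(
        task_id="preprocess_feature_engineering", python_callable=build_features, dag=dag)
    random_forest_training = PythonOperator(
        task_id="random_forest_training", python_callable=run_Rf_flow, dag=dag)
    xgboost_training = PythonOperator(
        task_id="xgboost_training", python_callable=run_Xgboost_flow, dag=dag)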

As you can see, I have specified the name of the pipeline along with a default set of arguments (start date, concurrency, etc.), which are self-explanatory. One important DAG argument is schedule_interval, which controls how often you want this workflow to be triggered. I have specified it as "*/30 * * * *", meaning I want this workflow to be triggered every 30 minutes. Next, you can see that I have defined 4 PythonOperators (one for loading the data, one for preprocessing and feature engineering, and two for model training). Here the python_callable argument takes the function that acts as the entry point of the specific task.

Please note: run_Rf_flow and run_Xgboost_flow are the ML model training functions that use MLflow's API for logging the metrics, params, and artifacts, as described in the previous step.

Now we need to arrange the above tasks in a certain order of execution. Clearly load_data_sql should run first, followed by preprocess_feature_engineering, after which random_forest_training and xgboost_training can run in parallel. Using the set_downstream function we set the order of execution and link the tasks. To achieve this order, please refer to the snapshot below:

A code screenshot from flow_1.py
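A sketch of that ordering, using the task variables from the previous snippet; set_downstream accepts either a single task or a list of tasks:

    load_data_sql.set_downstream(preprocess_feature_engineering)
    preprocess_feature_engineering.set_downstream(
        [random_forest_training, xgboost_training])  # these two run in parallel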

Now go back to the terminal and type the following command.
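Assuming Airflow 1.10.x again (the scheduler must also be running, e.g. in a separate terminal, for tasks to execute):

    airflow webserver -p 8080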

Usually it runs on localhost:8080. Click on the name of your DAG and switch to the graph view; you should see something like this:

A view from Airflow WebServer UI

This means Airflow has created the order of execution as expected.

ii) The second workflow

The first flow stops at logging both models' performance to the MLflow UI. The second workflow is triggered after we mark a model version's stage as "Production".

Create a script called flow_2.py, which should look something like this:

A code screenshot from flow_2.py
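(The screenshot is not reproduced here. A sketch of flow_2.py under the same assumptions; the DAG and task names are hypothetical, and the helper get_production_source is shown in the next snippet.)

    # flow_2.py -- illustrative sketch
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator
    from airflow.operators.bash_operator import BashOperator

    default_args = {"owner": "airflow", "start_date": datetime(2020, 8, 1)}

    dag = DAG("ml_deployment_pipeline",  # hypothetical pipeline name
              default_args=default_args,
              schedule_interval=None)    # triggered manually after staging a model

    # Operator 1: find the artifact source of the model version staged as
    # Production; the function's return value is pushed to XCom automatically
    search_production_model = PythonOperator(
        task_id="search_production_model",
        python_callable=get_production_source,  # defined in the next snippet
        dag=dag)

    # Operator 2: pull the source from XCom and serve that model on port 1234
    deploy_model = BashOperator(
        task_id="deploy_model",
        bash_command=("mlflow models serve "
                      "-m {{ ti.xcom_pull(task_ids='search_production_model') }} "
                      "-p 1234"),
        dag=dag)

    search_production_model.set_downstream(deploy_model)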

We can see that the target of this DAG is deploying the model to production. Operator 1 is a PythonOperator whose task is to search the artifacts within the MLflow tracking server and fetch the one whose stage is Production (we will come to that later). Operator 2 is a BashOperator which takes the artifact source returned by Operator 1. An interesting thing to note here: since it is crucial for the two tasks to communicate with each other, we use the XCom interface for cross-communication between the operators. Finally, the BashOperator uses MLflow's serve command to deploy the model on port 1234 of localhost.

If we restart the Airflow webserver, we can see the graph for the second DAG as well.

Coming to Operator 1, let's see how we extract the "source" of the model marked for production. The snippet is attached below:

A code screenshot from flow_2.py
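(A sketch of the search logic following the description below; the MlflowClient calls are part of MLflow's tracking API.)

    # flow_2.py (continued) -- illustrative sketch
    import mlflow
    from mlflow.tracking import MlflowClient

    def get_production_source():
        # Point the client at the same tracking server used during training
        mlflow.set_tracking_uri("http://0.0.0.0:5050")
        client = MlflowClient()

        # Loop through every registered model and its latest versions, returning
        # the artifact source of the first version staged as Production
        for registered_model in client.list_registered_models():
            for version in registered_model.latest_versions:
                if version.current_stage == "Production":
                    return version.source  # an artifact URI that `mlflow models serve` can load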

It's pretty straightforward: we set the tracking server's URI, loop through all the registered models, and check whether the model version's current_stage equals 'Production'.

4. Putting It All Together

In the previous steps we studied in detail how to use MLflow and Airflow together to automate the entire process. Basically, everything boils down to 3 steps now.

  • Step 1: Trigger the first DAG, which loads and processes the data, trains your models, and logs the metrics and artifacts to the MLflow UI.
  • Step 2: Go into the MLflow UI and study the models' performance. After comparing metrics and parameters, select a model version to be deployed to production and change its stage to 'Production'.
  • Step 3: Trigger the second DAG, which searches for the production artifact and deploys it.

To keep track of things, Airflow creates a folder named logs inside your dags folder. Each operator has its own log files, distinguished by the start_time timestamp. For example, these are the RF model training logs:

Log File from Airflow

The above-mentioned use case can be very useful for scenarios where you need to re-train the model on newly gathered data and deploy it, repeating the process daily, weekly, or at some pre-defined frequency.

Well, that's it from my side. If you have queries or want to discuss, please connect with me on LinkedIn: https://www.linkedin.com/in/swarupd
