Machine Learning LifeCycle Management
An awesome tutorial on managing and automating all the steps between gathering the data and deploying models to production.
Overview
Since data is the oil of the 21st century, people are always finding ways to use data science to convert data into dollars, and Machine Learning is a fairly hot topic within this realm. Keep in mind, though, that developing, deploying and improving ML models at scale does not fit neatly into the traditional software development lifecycle. Continuous Delivery for Machine Learning (CD4ML) is the discipline of bringing Continuous Delivery principles and practices to Machine Learning applications. In this guide we will see how to manage and automate the numerous steps between gathering the data and deploying the model, using some amazing open source tools.
Prerequisites
This might not be a very beginner-friendly guide, but I will try to explain the crux, the workings and the best practices wherever necessary, along with code snippets. It is assumed that the reader has a working knowledge of the tools we will be using throughout, so I will only quickly recap what they are about.
1. Apache Airflow
An ML cycle involves multiple steps, so there are multiple points of failure. To manage such daunting pipelines, we use Airflow as our workflow management system. Airflow takes all your defined tasks and builds a DAG (directed acyclic graph) from them. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Workflows are defined in Python, and each task is an instance of an operator such as PythonOperator or BashOperator. Airflow also offers a user interface from which you can trigger, stop and track your workflows, along with a bunch of other options. Get a head start here.
2. MLflow
MLflow is an open source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry. MLflow currently offers four components: Tracking, Projects (a convention for organising and describing your code so that other data scientists can run it), Models (packaging and deployment) and the Model Registry. MLflow also offers a comprehensive user interface where you can see your model performance, compare metrics, manage versions and so on. Get a head start here.
Alright Let’s Go!
1. Configuring Airflow
From here on, it is assumed that you have a working installation of Airflow and MLflow along with Python 3 (libraries: sklearn, xgboost, pandas, sqlalchemy etc.). Select a directory for your Airflow workspace, open a terminal and type the following:
(venv) $ cd /path/to/my/airflow/workspace
(venv) $ mkdir AIRFLOW_HOME
(venv) $ export AIRFLOW_HOME=`pwd`/AIRFLOW_HOME
(venv) $ airflow version
If the last command worked, Airflow will have created airflow.cfg and some other files in the AIRFLOW_HOME directory. By default, the cfg file sets Airflow to work with SQLite, which is slow and does not handle concurrent access, so tasks can only run one at a time (it is therefore not recommended). You can change it to any database of your choice; in my case I changed the connection string to Postgres:
sql_alchemy_conn = postgresql://username:password@127.0.0.1:5432/airflow
Also set the load_examples variable to False; otherwise Airflow loads a bunch of irrelevant example DAGs into the UI, which we will see later.
Now that we are done with the config, the next step is to run the following command (on Airflow 2.x, the equivalent is airflow db init).
(venv) $ airflow initdb
This creates all the tables that store your task and user metadata, and Airflow is now good to go. Now let’s split our complete cycle into the following elementary steps.
- Load the data from SQL or any other Data Source
- Preprocessing and feature engineering
- Training multiple models (RF, XGB, SVM, etc.)
- Evaluation and comparison of performance metrics
- Deployment of the Model
Please feel free to improvise as per your needs :)
Let’s now create a folder named dags inside AIRFLOW_HOME, and within the dags folder create one subfolder per stage, for example Load_Dump, Feature_Engineering and Models.
Organising the modules into such folders is good practice because it isolates the functionalities from each other. For example, let’s say that Load_Dump contains all the modules responsible for loading the data from SQL and dumping it locally for the feature engineering flows to consume. Next, the Feature_Engineering folder holds all the functions that do your label encoding, imputation of missing values, aggregation of variables and so on. The Models folder contains the code that takes the processed features from Feature_Engineering and trains various models on them.
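To make the folder roles concrete, here is a minimal sketch of what a module inside Load_Dump could look like. It uses only the standard library (sqlite3 and csv) as a stand-in for the pandas/sqlalchemy stack mentioned earlier, and the function name, table name and file paths are hypothetical.

```python
import csv
import sqlite3


def load_and_dump(db_path, table, out_csv):
    """Read every row of `table` from a SQLite file and dump it to a CSV file.

    Returns the number of data rows written. All names are illustrative.
    """
    connection = sqlite3.connect(db_path)
    try:
        # Table names cannot be bound as query parameters,
        # so only pass trusted values for `table`.
        cursor = connection.execute(f'SELECT * FROM "{table}"')
        header = [column[0] for column in cursor.description]
        rows = cursor.fetchall()
    finally:
        connection.close()

    # The feature engineering step can later pick up this local file.
    with open(out_csv, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)
    return len(rows)
```

A function like this can be handed to a PythonOperator as its entry point, with the feature engineering task reading the dumped file afterwards.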
2. Setting up your ML code to interact with MLflow
Please make sure the MLflow server is up and running, or start it with the following command:
(venv) $ mlflow server --backend-store-uri sqlite:///mlflow.db --default-artifact-root /Users/apple/PycharmProjects/data-load/mlflow_example/ml_data/ --host 0.0.0.0 --port 5050
The default-artifact-root option tells MLflow where to store your artifacts, such as your model’s pickle file or graphs (more on this later). I am also using a non-default port, 5050, for my convenience. You can go to “http://0.0.0.0:5050/” to check out the MLflow UI.
In the previous steps we created separate folders for the different functionalities. Coming to the modelling part, typical code loads the processed data, trains the model and saves it in some format such as a pickle file. For our use case, we will use MLflow’s API to log the parameters, metrics and graphs to the MLflow UI: the training code sets the tracking URI, opens a run and logs each parameter, metric and artifact inside it.
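A minimal sketch of such a training function is shown below, assuming a random forest classifier and a CSV of processed features. The file path, target column, experiment name and registered model name are all hypothetical; only the tracking URI comes from the server command above. The heavy imports sit inside the function so that Airflow can parse a DAG file referencing it quickly.

```python
TRACKING_URI = "http://0.0.0.0:5050"  # the MLflow server started above
RF_PARAMS = {"n_estimators": 100, "max_depth": 5, "random_state": 42}


def run_Rf_flow(features_path="processed_features.csv", target_column="target"):
    """Train a random forest and log params, metric and model to MLflow."""
    # Imported here rather than at module level to keep DAG parsing light.
    import mlflow
    import mlflow.sklearn
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split

    data = pd.read_csv(features_path)
    features = data.drop(columns=[target_column])
    target = data[target_column]
    x_train, x_test, y_train, y_test = train_test_split(
        features, target, test_size=0.2, random_state=42
    )

    mlflow.set_tracking_uri(TRACKING_URI)
    mlflow.set_experiment("model_comparison")  # hypothetical experiment name

    with mlflow.start_run(run_name="random_forest"):
        model = RandomForestClassifier(**RF_PARAMS)
        model.fit(x_train, y_train)
        score = f1_score(y_test, model.predict(x_test), average="weighted")

        mlflow.log_params(RF_PARAMS)
        mlflow.log_metric("f1_score", score)
        # Plots saved to disk can be attached with mlflow.log_artifact(path).
        mlflow.sklearn.log_model(
            model, "model", registered_model_name="rf_model"  # hypothetical name
        )
```

Passing registered_model_name creates a new version in the model registry on every run, which is what makes the stage transitions described later possible.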
Once the training code has executed, open the MLflow UI (tracking server running on http://0.0.0.0:5050/). Select the name of your experiment and you can see the log of all the runs along with the metrics and parameters you used.
You can investigate further by clicking on a run, which takes you to another page where you can see your artifacts (model pickle, conda environment file, graphs/plots that you logged in the training code). You also have the option to download these items.
Let’s say we are comparing two models (XGBoost and RF) on the comparison screen and we select XGBoost because of its better F1 score.
MLflow supports versioning and helps you keep track of all the experiments you have done. Now that you have decided you want this model deployed, you can go to the versioning screen of MLflow, select the model version and transition its stage to Production.
Does this mean the model is deployed? Unfortunately, no. It only means that the specific model version is flagged for production. Let’s see how we can use Airflow in conjunction with this.
3. Creating DAG Flows
i) The first workflow
Create a Python script, flow_1.py, that defines the DAG and its tasks.
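A minimal sketch of what flow_1.py could contain is shown below. It assumes Airflow 1.10-style import paths (matching the airflow initdb command used earlier); the dag_id, start date, retry settings and the stub callables are hypothetical, and in the real project the callables would be imported from the Load_Dump, Feature_Engineering and Models folders.

```python
from datetime import datetime, timedelta

from airflow import DAG
# Airflow 2.x moved this operator to airflow.operators.python
from airflow.operators.python_operator import PythonOperator


# Stub callables standing in for the real project modules.
def load_data():
    print("loading data from SQL and dumping it locally")


def preprocess():
    print("encoding labels, imputing missing values, aggregating")


def run_Rf_flow():
    print("training the random forest and logging to MLflow")


def run_Xgboost_flow():
    print("training XGBoost and logging to MLflow")


default_args = {
    "owner": "airflow",
    "start_date": datetime(2020, 1, 1),  # hypothetical start date
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    dag_id="ml_training_flow",          # hypothetical pipeline name
    default_args=default_args,
    schedule_interval="*/30 * * * *",   # every 30 minutes
    concurrency=2,                      # max_active_tasks from Airflow 2.2 on
    catchup=False,
)

load_data_sql = PythonOperator(
    task_id="load_data_sql", python_callable=load_data, dag=dag
)
preprocess_feature_engineering = PythonOperator(
    task_id="preprocess_feature_engineering", python_callable=preprocess, dag=dag
)
random_forest_training = PythonOperator(
    task_id="random_forest_training", python_callable=run_Rf_flow, dag=dag
)
xgboost_training = PythonOperator(
    task_id="xgboost_training", python_callable=run_Xgboost_flow, dag=dag
)
```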
The script specifies the name of the pipeline along with a default set of arguments (start date, concurrency etc.), which are self-explanatory. One important argument of DAG is schedule_interval, which sets how often you want the workflow to be triggered. I have specified it as “*/30 * * * *”, a cron expression meaning that the workflow is triggered every 30 minutes. The script then defines four Python operators (one for loading the data, one for preprocessing and feature engineering, and two for model training). The python_callable argument takes the function that acts as the entry point of the specific task.
Please note: run_Rf_flow and run_Xgboost_flow are the model training functions, which use MLflow’s API to log the metrics, params and artifacts as described in the previous step.
Now we need to arrange the above tasks in a certain order for execution. Clearly load_data_sql runs first, followed by preprocess_feature_engineering, after which random_forest_training and xgboost_training can run in parallel. The set_downstream function sets this order and links the tasks: load_data_sql gets preprocess_feature_engineering as its downstream task, which in turn gets both training tasks as its downstream tasks.
Now go back to the terminal and type the following command.
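The wiring can be sketched as follows. To keep the example runnable without an Airflow installation, it uses a tiny stand-in Task class (hypothetical, not part of Airflow) that only mimics set_downstream; the two set_downstream calls themselves are the same ones you would make on the real operators.

```python
class Task:
    """Tiny stand-in for an Airflow operator; it only tracks downstream links."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []

    def set_downstream(self, other):
        # Like the Airflow method, accept a single task or a list of tasks.
        others = other if isinstance(other, list) else [other]
        self.downstream.extend(others)


def execution_levels(root):
    """Group task ids by distance from the root task.

    Tasks in the same group can run in parallel. This simple walk is
    sufficient for a fan-out shaped flow like the one in this guide.
    """
    levels, current = [], [root]
    while current:
        levels.append(sorted(task.task_id for task in current))
        following = []
        for task in current:
            for child in task.downstream:
                if child not in following:
                    following.append(child)
        current = following
    return levels


load_data_sql = Task("load_data_sql")
preprocess_feature_engineering = Task("preprocess_feature_engineering")
random_forest_training = Task("random_forest_training")
xgboost_training = Task("xgboost_training")

# The same two calls apply to real Airflow operators.
load_data_sql.set_downstream(preprocess_feature_engineering)
preprocess_feature_engineering.set_downstream(
    [random_forest_training, xgboost_training]
)

print(execution_levels(load_data_sql))
```

The last group contains both training tasks, which is why Airflow is free to run them side by side.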
(venv) $ airflow webserver
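The webserver usually runs on localhost:8080. Click on the name of your DAG and switch to the graph view: you should see load_data_sql leading to preprocess_feature_engineering, which then fans out to random_forest_training and xgboost_training.
This means Airflow has created the order of execution as expected.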
ii) The second workflow
The first flow stops at logging both models’ performance to the MLflow UI. The second workflow is triggered after we mark the model version’s stage as “Production”.
Create a script called flow_2.py that defines a second DAG with two tasks.
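A minimal sketch of what flow_2.py could contain is shown below, again assuming Airflow 1.10-style import paths. The dag_id, task ids, start date and the stub callable (with its placeholder path) are hypothetical, and the manual schedule is an assumption based on this flow being triggered by hand after the stage transition.

```python
from datetime import datetime

from airflow import DAG
# Airflow 2.x moved these to airflow.operators.bash and airflow.operators.python
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator


def get_production_model_source():
    # Stub: the real callable asks the MLflow registry for the model
    # version staged as Production and returns its artifact source.
    return "/path/to/model/artifacts"  # placeholder path


dag = DAG(
    dag_id="ml_deployment_flow",  # hypothetical pipeline name
    default_args={"owner": "airflow", "start_date": datetime(2020, 1, 1)},
    schedule_interval=None,       # assumption: only triggered manually
    catchup=False,
)

# Operator 1: the value returned by the callable is pushed to XCom.
fetch_production_model = PythonOperator(
    task_id="fetch_production_model",
    python_callable=get_production_model_source,
    dag=dag,
)

# Operator 2: pulls the artifact source from XCom through the Jinja
# template and serves the model on port 1234. The serve command keeps
# running, so this task stays active for as long as the model is served.
serve_model = BashOperator(
    task_id="serve_model",
    bash_command=(
        "mlflow models serve "
        "-m {{ ti.xcom_pull(task_ids='fetch_production_model') }} "
        "-p 1234"
    ),
    dag=dag,
)

fetch_production_model.set_downstream(serve_model)
```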
The target of this DAG is to deploy the model into production. Operator 1 is a PythonOperator whose task is to search the MLflow tracking server for the model version staged as Production and return its artifact source (more on that shortly). Operator 2 is a BashOperator which takes the artifact source returned by operator 1. Since the two tasks must communicate with each other, we use Airflow’s XCom interface for cross-communication between the operators. Finally, the BashOperator uses MLflow’s serve command to deploy the model on port 1234 of localhost.
If we restart the Airflow webserver, we can see the graph for the second DAG as well.
Coming to operator 1, let’s see how we extract the “source” of the model marked for production.
It’s pretty straightforward: we set the tracking server’s URI, loop through all the registered models and check whether the ‘current_stage’ of a model version equals ‘Production’.
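A minimal sketch of that lookup, assuming the MLflow client API (MlflowClient, search_registered_models and the latest_versions, current_stage and source attributes). The function names are hypothetical, and the selection logic is kept separate from the client call so that it can be checked without a running server.

```python
def pick_production_source(registered_models):
    """Return the artifact source of the first version staged as Production.

    Expects objects shaped like MLflow's RegisteredModel, i.e. each has a
    `latest_versions` list whose items carry `current_stage` and `source`.
    Returns None when nothing is staged as Production.
    """
    for model in registered_models:
        for version in model.latest_versions:
            if version.current_stage == "Production":
                return version.source
    return None


def get_production_model_source(tracking_uri="http://0.0.0.0:5050"):
    """Entry point for the PythonOperator; the return value lands in XCom."""
    # Imported here so the DAG file stays cheap to parse.
    from mlflow.tracking import MlflowClient

    client = MlflowClient(tracking_uri=tracking_uri)
    return pick_production_source(client.search_registered_models())
```

Because Airflow pushes a callable's return value to XCom, the downstream BashOperator can pull the source without any extra code in this function.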
4. Putting It All Together
In the previous steps we studied in detail how to use MLflow and Airflow together to automate the entire process. Basically, everything boils down to three steps now.
- Step 1: Trigger the first DAG, which loads and processes the data, trains your models and logs the metrics and artifacts to the MLflow UI.
- Step 2: Go to the MLflow UI and study the models’ performance. After comparing metrics and parameters, select the model version to be deployed and change its stage to ‘Production’.
- Step 3: Trigger the second DAG, which finds the production artifact and deploys it.
To keep track of things, Airflow writes logs to a folder named logs (by default under AIRFLOW_HOME). Each task has its own log files, which are distinguished by the timestamp of the run. The random forest training task, for example, gets a separate log file for every run.
This setup is very useful in scenarios where you need to retrain the model on newly gathered data and redeploy it, repeating the process daily, weekly or at some other predefined frequency.
Well, that’s it from my side. If you have queries or want to discuss anything, please connect with me via LinkedIn.