Between Machine Learning PoC and Production

Ryo Koyajima / 小矢島 諒 · Published in The Startup · Feb 1, 2021 · 17 min read

(Figure: the final architecture described in this article)

The Japanese version is here: https://qiita.com/koyaaarr/items/259ad4f0d574497c5b08

Introduction

Machine learning Proof of Concept (PoC) projects are very popular these days thanks to the recent AI boom. If (very fortunately) the PoC produces good results, you may then want to put the system into production. However, while a lot of knowledge has been shared about exploratory data analysis and building predictive models, there is still little about how to put those models into practice, especially in production.

In this article, we will examine what is technically needed during the transition from PoC to production operations. I hope it will help you make your machine learning PoC not just a one-off experiment but something that creates value in production.

What is written in this article

  • How to proceed with data analysis in a PoC
  • How to proceed with the test operation of a machine learning PoC (the main topic of this article)
  • Architecture in each phase of PoC and test operation (the main topic of this article)
  • Additional things to consider for production operations

I will focus especially on test operations. During test operations, operations and analysis often proceed in parallel, and I will describe an example of how to update the system architecture while balancing the two.

What is not written in this article

  • Details on exploratory data analysis
  • Details on preprocessing and feature engineering
  • Details on building predictive models
  • Lower layers than middleware (databases and web servers)
  • Consulting skills to handle Machine Learning PoC

Consulting skills are very important in machine learning projects because of their inherent uncertainty, but they are not covered here as the focus is on the technology.

Systems assumed in this article

  • Use a relatively small dataset, less than 100 GB
  • Handle data that can be stored in memory, rather than data in the hundreds of millions of records
  • Batch learning and batch inference
  • Not perform online (real-time) learning and inference
  • System construction proceeds in parallel with data analysis
  • Have no concrete requirements at the beginning, so we build features as needed while proceeding

Data used in this article

We will use data from a previous Kaggle competition, “Home Credit Default Risk” in this article. This competition uses an individual’s credit information to predict whether or not they will default on their debt. There are records for each loan application in the data, and each record contains information about the applicant’s credit and the label indicating whether the person was able to repay the loan or defaulted on it.

In this article, we will assume that we are in the data analytics department of a loan lending company and want to use machine learning to automate credit decisions based on this credit information.

For the sake of explanation, we will split “application_train.csv”, one of the files available in this competition, as shown in the figure. The split data will be used under the following assumptions.

  • “initial.csv”: Past credit information, used in the PoC
  • “20201001.csv”: Credit information for October 2020. In the test operation, this data is handled as training data together with “initial.csv”.
  • “20201101.csv”: Credit information for November 2020. In the test operation, this data is handled as training data together with “initial.csv”.
  • “20201201.csv”: Credit information for December 2020. In the test operation, this data is handled as training data together with “initial.csv”.
  • “20210101.csv”: Credit information for January 2021. In the test operation, forecasting starts from this month.

The actual code for splitting the data is shown below.

split_data.ipynb
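The notebook itself is embedded in the repository; as a rough, hypothetical sketch of what such a split could look like (the file names follow the article, while the 80/5/5/5/5 proportions and the paths are assumptions of mine):

```python
import pandas as pd

# Load the original competition file (path is an assumption).
df = pd.read_csv("data/application_train.csv")

# Shuffle, then carve off one large "past" chunk and four monthly chunks.
# The 80/5/5/5/5 proportions are arbitrary and for illustration only.
df = df.sample(frac=1, random_state=0).reset_index(drop=True)
bounds = [0] + [int(len(df) * r) for r in (0.80, 0.85, 0.90, 0.95)] + [len(df)]
names = ["initial", "20201001", "20201101", "20201201", "20210101"]

for name, start, end in zip(names, bounds[:-1], bounds[1:]):
    df.iloc[start:end].to_csv(f"data/{name}.csv", index=False)
```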

Situation to be considered

In this article, for ease of explanation, we will assume the following project. The story is based on the author’s imagination, inspired by the “Home Credit Default Risk” data, and has nothing to do with any actual company or business. The author is a complete novice in the field of credit operations, so the description may differ greatly from actual practice.

As a data scientist, I am participating in a project to automate the credit approval process at a loan lending company. Credit judgments are currently made manually by the screening department, and we are considering whether machine learning can reduce man-hours and improve the accuracy of those judgments. Sample data has already been provided, and we are in the PoC stage. The sample data is a record of past loan defaults by borrowers. Based on this data, when someone applies for a new loan, we would like to predict whether or not that person will default so that we can decide whether or not to grant the loan.

Scope of the project in this article

A machine learning project usually goes through planning, PoC, test operation, and production operation. In this article, to focus on the technical points, I will cover the scope from PoC to test operation. In particular, I will divide the test operation into three phases, since many functions are required before moving to production. Since the author has little experience with production operations, I will only mention the points that should be considered for them.

Structure assumed in this article

In this article, I assume a minimal team, as we are starting the project small: a consultant who communicates with the business department (the credit judgment department) and a data scientist who does everything from data analysis to system construction. In reality, there would also be a manager as a supervisor, but they will not appear in this article. There is also a stakeholder in the business department.

PoC phase

Purpose of this phase

The purpose of this phase is to verify whether it is feasible to automate credit decisions. We will examine two main points: one is to validate the data, i.e. whether the provided data can actually be used in production (e.g. whether each column is available at prediction time and whether there are no problematic relationships between records), and the other is to determine how accurately defaults can be predicted by machine learning.

Architecture in this phase

In this phase, we work only with JupyterLab. MLflow is included for storing machine learning models, but (in my opinion) it is not necessary at the very beginning.

Data validation

If you are a data scientist, you probably want to start looking at the data right away, but first you need to validate it. If the data is flawed, any predictions built on it will likely be useless. Validation covers two main points.

The first is, for each record, when each column actually becomes available. All the columns may seem to be available at the time the data is handed to us, but that does not mean they become available at the same time in production. In the simplest example, the objective variable “whether the debtor has defaulted” becomes known later than the other columns.

The second point is whether there are relationships between the records. For example, if a person applied for a loan twice and the first application ends up in the training data while the second ends up in the test data, the model gains an unfair advantage (data leakage). In such a case, you should make sure that both records fall on the same side of the split, either training or test (see the sketch at the end of this subsection).

In addition to these points, it is also important to clarify the definition of the data by interviewing the business department about what each column means and what the unit of a record is (e.g. in this data, is it per person or per loan application?). A spreadsheet is a handy way to track these checks for each column.
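Returning to the record-relationship point, a minimal sketch of this idea with scikit-learn’s GroupShuffleSplit, which keeps all records of the same group on one side of the split (the grouping column “person_id” and the path are hypothetical; in the Home Credit data each row is a loan application):

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.read_csv("data/initial.csv")  # path is an assumption

# Keep both applications from the same person on the same side of the
# split so the model cannot exploit information leaking across records.
# "person_id" is a hypothetical grouping column for illustration.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=df["person_id"]))
train_df, test_df = df.iloc[train_idx], df.iloc[test_idx]
```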

Exploratory data analysis

Once the data has been validated (or in parallel with the validation), we can use JupyterLab to see what columns (features) are present by visualizing the sample data. This process helps you understand the data and informs feature engineering and model selection. It is also useful for finding problems in the data.

First, for each column, we will check the data type, percentage of missing values, etc.

eda.ipynb
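The notebook is embedded above only by name; a minimal sketch of this kind of per-column summary (the path is an assumption):

```python
import pandas as pd

df = pd.read_csv("data/initial.csv")  # path is an assumption

# Per-column overview: data type, missing-value ratio, and unique values.
summary = pd.DataFrame({
    "dtype": df.dtypes,
    "missing_ratio": df.isnull().mean(),
    "n_unique": df.nunique(),
})
print(summary.sort_values("missing_ratio", ascending=False).head(20))
```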

Next, we visualize each column to see its distribution: a histogram if the data type is numeric, a bar chart if it is a string.

eda.ipynb
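A rough sketch of such a loop (the data type decides histogram vs. bar chart; the paths are assumptions):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data/initial.csv")  # path is an assumption

for col in df.columns:
    fig, ax = plt.subplots(figsize=(6, 4))
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col].plot.hist(bins=50, ax=ax)        # numeric: histogram
    else:
        df[col].value_counts().plot.bar(ax=ax)   # string: bar chart
    ax.set_title(col)
    fig.tight_layout()
    fig.savefig(f"figures/{col}.png")            # output dir is an assumption
    plt.close(fig)
```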

Two of the output graphs are shown below as examples. Ideally, we would look at the distributions one by one, but we will skip that here.

(Figure: distribution of AMT_CREDIT)

(Figure: distribution of NAME_INCOME_TYPE)

Verification of prediction accuracy

From here, we will actually create a model and verify its prediction accuracy. We will use ROC AUC, the same evaluation metric used in “Home Credit Default Risk”. In a real project, we would discuss with the business department and agree in advance on which metric to use. Before building a prediction model by hand, we will first make a quick baseline with PyCaret. This lets us compare which features and models are effective and use them as a reference when actually creating the model.

eda.ipynb

In this article, we will compare the following models provided by PyCaret.

  • Logistic regression
  • Decision Trees
  • Random Forest
  • SVM
  • LightGBM
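A minimal sketch of how this comparison might look with PyCaret (assuming the training data is already loaded as train_df and that TARGET is the label column, as in the Home Credit data):

```python
from pycaret.classification import setup, compare_models

# Initialize the PyCaret experiment; "TARGET" is the Home Credit label column.
setup(data=train_df, target="TARGET", session_id=0)

# Compare only the five models listed above, ranked by AUC.
best_model = compare_models(
    include=["lr", "dt", "rf", "svm", "lightgbm"],
    sort="AUC",
)
```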

LightGBM appears to be superior when the evaluation metric is AUC; in general, LightGBM tends to do well in both accuracy and execution speed. Note that recall is low in every model because the data is imbalanced, with few positive examples. Depending on your business goals, you might build a model with higher recall so as to prevent more bad debts. In this article, we will not do any more detailed modeling and will use LightGBM.

Next, we will create and evaluate a LightGBM model in PyCaret to see which features are effective.

When there are many features, as in this data, reducing their number can improve both the accuracy and the stability of the model. A simple way to do that is to compute feature importance and exclude the features with low importance. Here, we simply keep the features with high importance. For columns that PyCaret preprocesses automatically, we use the original columns.
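A hypothetical sketch of this step with PyCaret and the fitted LightGBM estimator (the top-30 cutoff is an arbitrary choice of mine):

```python
import pandas as pd
from pycaret.classification import create_model, plot_model

# Train a LightGBM model on the data prepared by setup() above.
lgbm = create_model("lightgbm")

# Visualize feature importance (saved as an image with save=True).
plot_model(lgbm, plot="feature", save=True)

# Rank features directly from the fitted estimator and keep the top 30.
importance = pd.Series(
    lgbm.feature_importances_, index=lgbm.feature_name_
).sort_values(ascending=False)
selected_features = importance.head(30).index.tolist()
```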

Now, we will create the prediction model manually.

Preprocessing

For the sake of simplicity, preprocessing will only impute the missing values.

forecast.ipynb
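A minimal sketch of such an imputation step (median for numeric columns, a sentinel category for strings; the actual notebook may use a different rule):

```python
import pandas as pd

df = pd.read_csv("data/initial.csv")  # path is an assumption

# Impute missing values: median for numeric columns,
# a sentinel category for string columns.
for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].median())
    else:
        df[col] = df[col].fillna("missing")
```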

Feature Engineering

Feature engineering involves feature selection and creating dummy variables for the categorical features.

forecast.ipynb
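A rough sketch of these two steps, assuming a hypothetical selected_features list like the one built from the importance ranking above:

```python
import pandas as pd

# Keep only the selected features plus the label column.
features = df[selected_features + ["TARGET"]].copy()

# One-hot encode the categorical (string) columns.
categorical_cols = features.select_dtypes(include="object").columns
features = pd.get_dummies(features, columns=list(categorical_cols))
```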

Prediction

We use LightGBM to create the model and Optuna to tune its hyperparameters.
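A compact sketch of this combination (the search space, number of trials, and validation split are assumptions for illustration):

```python
import lightgbm as lgb
import optuna
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X = features.drop(columns=["TARGET"])
y = features["TARGET"]
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

def objective(trial):
    # Hyperparameter search space (an arbitrary, illustrative choice).
    params = {
        "objective": "binary",
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 16, 256),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
    }
    model = lgb.LGBMClassifier(n_estimators=300, **params)
    model.fit(X_train, y_train)
    pred = model.predict_proba(X_valid)[:, 1]
    return roc_auc_score(y_valid, pred)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params, study.best_value)
```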

In this verification, we achieved almost the same accuracy as the PyCaret baseline. In a real project we would conduct a deeper analysis based on these results, but we will end the PoC-phase verification here.

From here on, we will assume that the PoC results have been reported to the business department and that the project proceeds from PoC toward production. However, a PoC does not suddenly go into production; the PoC system is gradually brought closer to production through several rounds of test operation. We will therefore divide the test operation into three phases, adding functions little by little so that operations become increasingly automated and closer to production.

Supplement: Machine learning model management

MLflow is useful for managing machine learning models. It can track each model together with the hyperparameters explored by Optuna, which becomes valuable as the number of model trials grows.
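A minimal sketch of logging the Optuna trials to MLflow (the experiment and metric names are arbitrary; "study" is the Optuna study from the previous step):

```python
import mlflow

# Record each Optuna trial's parameters and validation AUC in MLflow.
mlflow.set_experiment("home-credit-default-risk")

for trial in study.trials:
    with mlflow.start_run(run_name=f"trial_{trial.number}"):
        mlflow.log_params(trial.params)
        if trial.value is not None:
            mlflow.log_metric("valid_auc", trial.value)
```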

Test Operation

The three phases of test operations

Before we can go from PoC to production, we need to implement features such as automated operations. However, implementing all the necessary functions at once would be difficult in terms of man-hours. (Besides, at this stage, the business department is probably asking you to further improve accuracy.) We will therefore split the necessary functions into three phases and implement them gradually, expanding the system as we operate it. In each phase, we implement the following, respectively:

  1. Building data pipelines and semi-automated operations
  2. Implementation of regular operation API
  3. Migration to the cloud and automation of operations

Test Operation Phase 1: Building data pipeline and semi-automated operations

Purpose of this phase

In this phase, we partially automate the system created in the PoC. Before that, we build a data pipeline by splitting the PoC program into blocks such as feature engineering and prediction, which allows training and inference to be executed in isolation or rerun from the middle. In addition, we introduce Airflow, a workflow engine, so that all the blocks can be executed automatically, in order, and on a schedule.

Architecture in this phase

In the PoC phase, we used a single Jupyter Notebook for preprocessing, prediction, and so on; from this phase, we introduce two open-source tools to execute multiple notebooks in order. The first is papermill, which allows us to run Jupyter Notebooks from the command line with parameters, so that we can make predictions for different months without rewriting the notebooks. The second is Airflow, which runs each notebook in order and also provides scheduling, success/failure notifications, and other functions useful for automating operations.

Data pipeline

We divide the program created in the PoC into four blocks: “data accumulation”, “feature engineering”, “learning”, and “inference”. When splitting the program, the blocks should be loosely coupled to each other, using data as the interface; this limits the impact of changes in program logic. For reference, here is an image of the data pipeline in this article. Each block receives the execution month as a papermill parameter at the top of the notebook, so it can be run for a specific month, as in the sketch below.
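A rough sketch of driving the four blocks with papermill (each notebook is assumed to have a cell tagged "parameters" that defines target_month; the parameter name and paths are assumptions):

```python
import papermill as pm

# Run the four blocks in order for a given month.
target_month = "20210101"

for notebook in ["accumulate", "feature_engineering", "learn", "inference"]:
    pm.execute_notebook(
        input_path=f"notebooks/{notebook}.ipynb",
        output_path=f"output/{notebook}_{target_month}.ipynb",
        parameters={"target_month": target_month},
    )
```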

The following is the code for each block. Basically, it reuses the program from the PoC, with some additions and modifications for operational automation.

Data accumulation

accumulate.ipynb

Feature engineering

feature_engineering.ipynb

Learn model

learn.ipynb

Inference

inference.ipynb

Semi-automating operations

Once each process has been split into an individual program, Airflow can execute them in order. By passing the forecast month as a parameter at runtime, we can run the pipeline for any month. If you want scheduled execution, you can define the date and time as a cron expression in “schedule_interval”. The Airflow code is shown below.

trial_operation.py
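The DAG file is embedded only by name above; a hypothetical sketch of what it might look like (the dag_id, schedule, and paths are assumptions):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Run the four papermill-driven notebooks in order, once a month.
with DAG(
    dag_id="trial_operation",
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 9 1 * *",  # 09:00 on the 1st of every month
    catchup=False,
) as dag:
    tasks = []
    for name in ["accumulate", "feature_engineering", "learn", "inference"]:
        tasks.append(
            BashOperator(
                task_id=name,
                bash_command=(
                    "papermill notebooks/{{ params.nb }}.ipynb "
                    "output/{{ params.nb }}_{{ ds_nodash }}.ipynb "
                    "-p target_month {{ ds_nodash }}"
                ),
                params={"nb": name},
            )
        )

    # Chain the tasks so they run strictly in order.
    for upstream, downstream in zip(tasks, tasks[1:]):
        upstream >> downstream
```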

You can view the defined workflow as a flowchart in Airflow. For example, the code above is visualized as the following figure, which has the same structure as the data pipeline we defined earlier. (Each box is green because the blocks have already completed successfully.)

With the implementation of test operation phase 1, the monthly operations change as shown below; most of the work is now automated.

  • PoC Phase
  1. Upload data for the forecast month
  2. Combine training data from previous months
  3. Preprocess the training data and do feature engineering
  4. Train a model on the training data
  5. Preprocess the test data and do feature engineering
  6. Predict on the test data using the trained model
  7. Download the prediction results
  • Test Operation Phase 1
  1. Upload the data for the forecast month
  2. Run the workflow from Airflow
  3. Download the prediction results

Test Operation Phase 2: Implementation of regular operation API

Purpose of this phase

In phase 1, we greatly automated the monthly operations by splitting functions such as preprocessing and inference into separate programs and executing them in order with papermill and Airflow. In phase 2, we automate further: specifically, we prepare an API and a GUI screen for data upload/download and for triggering the regular operation, which were done manually in phase 1. This allows non-engineering users, such as consultants and the business department, to operate the system easily, so regular operations can be left to the users and the engineers can concentrate on development.

Architecture in this phase

In phase 2, we will build a web server and create a GUI screen to operate it.

Creating a web server

Prepare the following APIs for the web server.

  • Upload function for input files
  • Execution of regular operations
  • Download function for prediction result files

This time, we will use FastAPI to create the web server.

server.py
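server.py is embedded only by name; a hypothetical FastAPI sketch of the three endpoints (the directories, the Airflow REST endpoint, and the basic-auth credentials are assumptions):

```python
from fastapi import FastAPI, File, UploadFile
from fastapi.responses import FileResponse
import httpx

app = FastAPI()

INPUT_DIR = "data/input"      # assumption
OUTPUT_DIR = "data/output"    # assumption
AIRFLOW_API = "http://localhost:8080/api/v1/dags/trial_operation/dagRuns"

@app.post("/upload")
async def upload_input(file: UploadFile = File(...)):
    """Save the uploaded monthly CSV to the input directory."""
    with open(f"{INPUT_DIR}/{file.filename}", "wb") as f:
        f.write(await file.read())
    return {"filename": file.filename}

@app.post("/run")
async def run_regular_operation(target_month: str):
    """Trigger the Airflow DAG for the given month via its REST API."""
    async with httpx.AsyncClient(auth=("airflow", "airflow")) as client:
        resp = await client.post(
            AIRFLOW_API, json={"conf": {"target_month": target_month}}
        )
    return {"airflow_status": resp.status_code}

@app.get("/download/{target_month}")
async def download_prediction(target_month: str):
    """Return the prediction file for the given month."""
    return FileResponse(f"{OUTPUT_DIR}/prediction_{target_month}.csv")
```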

Creating the GUI screen

For the GUI, we need a button that calls the web server API and a form to upload data. In this case, I built the GUI myself with React and TypeScript, but it may be faster to use a library such as Streamlit.

App.tsx

The GUI screen looks like the following image.
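As an aside, if you choose the Streamlit route instead of React, a minimal sketch of the same screen could look like this (the API URL and endpoints match the hypothetical FastAPI sketch above):

```python
import requests
import streamlit as st

API_URL = "http://localhost:8000"  # FastAPI server from the sketch above

st.title("Credit Default Prediction")

# Upload form for the monthly input CSV.
uploaded = st.file_uploader("Upload monthly input CSV", type="csv")
if uploaded is not None:
    requests.post(
        f"{API_URL}/upload",
        files={"file": (uploaded.name, uploaded.getvalue())},
    )
    st.success(f"Uploaded {uploaded.name}")

# Button that triggers the regular operation.
target_month = st.text_input("Target month (e.g. 20210101)")
if st.button("Run regular operation"):
    requests.post(f"{API_URL}/run", params={"target_month": target_month})
    st.info("Workflow triggered")
```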

Test Operation Phase 3: Migration to the cloud and automation of operations

Purpose of this phase

In phase 3, we move the servers to the cloud and move some functions to managed services to further automate regular operations. The purpose of using the cloud is to increase the availability of the system and to delegate infrastructure operations to the cloud, so that we can focus on enhancing and maintaining the application. The basic functions are common to clouds such as AWS, GCP, and Azure, but each has its own features and characteristics, so it is worth comparing them.

In this article, I will briefly examine migration to AWS as an example. There are two migration patterns: pattern 1, in which the system created in test operation phase 2 is migrated to AWS as is, and pattern 2, in which operations are automated further.

Architecture Pattern 1 with AWS: Simple EC2-only configuration

Each server runs on EC2, and data is stored in EBS. Usage is almost the same as on a local Linux machine, so migration should not be difficult. However, uploading input data and downloading prediction results still have to be done manually, and since each function simply runs on EC2, the ease of enhancement and maintenance has not changed much.

Architecture Pattern 2 on AWS: Further automated configuration

In pattern 2, the following issues from pattern 1 are improved.

  • Automation of data input/output
  • Splitting some functions into individual programs and managed services

To automate data input/output, we use S3 as a shared folder for exchanging data with external systems. We can monitor data input/output on S3 with CloudWatch and CloudTrail, and call Airflow’s regular-operation API from Lambda, so that the prediction system runs automatically when an input file is stored (a sketch of such a trigger follows below). With this design, there is no need to set up a GUI or a web server; a public web server in the cloud would also require authentication and vulnerability countermeasures, so this reduces those risks as well.
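A hypothetical sketch of such a Lambda handler, triggered by an S3 ObjectCreated event and calling the Airflow REST API (the environment variables, DAG id, and parameter name are assumptions):

```python
import base64
import json
import os
import urllib.request

DAG_ID = "trial_operation"                 # assumption
AIRFLOW_URL = os.environ["AIRFLOW_URL"]    # e.g. https://airflow.example.com


def lambda_handler(event, context):
    # S3 event: extract the uploaded object's key, e.g. "input/20210101.csv".
    key = event["Records"][0]["s3"]["object"]["key"]
    target_month = key.split("/")[-1].replace(".csv", "")

    # Call Airflow's stable REST API to start a DAG run for that month.
    body = json.dumps({"conf": {"target_month": target_month}}).encode()
    token = base64.b64encode(
        f"{os.environ['AIRFLOW_USER']}:{os.environ['AIRFLOW_PASSWORD']}".encode()
    ).decode()
    req = urllib.request.Request(
        f"{AIRFLOW_URL}/api/v1/dags/{DAG_ID}/dagRuns",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return {"statusCode": resp.status}
```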

As for splitting some of the functions into individual programs and managed services, we did the following:

  • Changed the storage location of input/output files to S3
  • Moved the trigger program for system execution to Lambda
  • Migrated the success/failure notification program to Lambda and SNS

The scope we could split out this time is not very wide, but the program could be divided further using other AWS services to make it easier to enhance and maintain. However, if you go too far you may end up with vendor lock-in, so you also need to consider ease of migration.

This completes the considerations up to test operation phase 3. In reality, there are many technical and business hurdles in turning a one-off PoC analysis into a regularly running production service, but I hope the methods discussed here will be helpful.

Additional things to consider for production operations

Finally, I will list things to consider for production operations in this chapter.

Utilizing the cloud

In test operation phase 3, we migrated to the cloud. Since the cloud has a variety of functions, it is best to utilize them to the extent that they do not significantly sacrifice portability. For example, data governance can be introduced by linking the internal authentication with the cloud authentication function, and the auto-scaling function can be used to handle larger-scale data.

It is also important to eliminate as much of your own code as possible and move to managed services. Considering long-term operation, you should look for a service with functionality similar to your program, since custom code is harder to maintain and tends to depend on the person who wrote it. For example, for Airflow there are managed services such as GCP’s Cloud Composer and AWS’s Amazon Managed Workflows for Apache Airflow, so these are worth considering.

Program reusability

While Jupyter Notebook is easy and convenient for development, notebooks are hard to manage with git and hard to run and test automatically. It may be a good idea to migrate to Python files as needed, depending on the balance between development speed and quality. Also, if the system itself can be built on Docker and Kubernetes, this not only increases robustness and makes the processing easier to scale, but also brings business benefits such as making it easier to expand to other projects.

Data storing

In this article, data was stored in CSV or Pickle format, but it is worth considering which data should be stored in which format. For this purpose, it is useful to manage the definition of each dataset in a spreadsheet when the data pipeline is developed. I often use CSV for data that is difficult to recreate (input data) or that requires external collaboration (prediction results), and Pickle for intermediate data. Pickle is convenient, but it is neither portable nor robust, so it is better to store data in CSV and define the data types separately, or to use the Parquet format if you are familiar with it (a sketch follows).
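A tiny sketch of swapping Pickle for Parquet for intermediate data (requires the pyarrow or fastparquet package; the paths are assumptions):

```python
import pandas as pd

df = pd.read_csv("data/initial.csv")  # path is an assumption

# Parquet preserves column dtypes and compresses well, so it is a more
# portable and robust choice than Pickle for intermediate data.
df.to_parquet("data/intermediate/initial.parquet", index=False)
restored = pd.read_parquet("data/intermediate/initial.parquet")
```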

Data monitoring

To operate a machine learning system continuously, you need to pay attention to the data as well as the system. For example, if the trend of the input data changes, prediction accuracy can be significantly affected even if there is nothing wrong with the system. It is therefore necessary to monitor the input data, for example by checking whether the distribution of each column or its relationship with the labels has changed (a simple sketch follows). Depending on the system, you may also need to verify the fairness of the predictions, for example whether the results vary by gender.
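As one simple, hypothetical sketch of such a check, a two-sample Kolmogorov–Smirnov test per numeric column can flag columns whose distribution may have shifted (the file paths and the 0.01 threshold are assumptions):

```python
import pandas as pd
from scipy.stats import ks_2samp

train = pd.read_csv("data/initial.csv")      # training data (assumption)
current = pd.read_csv("data/20210101.csv")   # this month's input (assumption)

# Compare the distribution of each numeric column between the two datasets.
for col in train.select_dtypes(include="number").columns:
    stat, p_value = ks_2samp(train[col].dropna(), current[col].dropna())
    if p_value < 0.01:
        print(f"Possible drift in {col}: KS={stat:.3f}, p={p_value:.4f}")
```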

Data governance

At the PoC level, access privileges to data may be naturally limited, but as the operation becomes longer and the number of people involved in the system increases, it will become necessary to set appropriate access privileges for each data. In such cases, it is best to utilize the authentication functions of cloud services. For example, by creating individual accounts with AWS IAM, you can flexibly set access privileges to the data stored in S3 according to each individual’s department or position. Also, since cloud services have functions that can be integrated with internal authentication infrastructure, it is a good idea to use these services.

Software and code used in this article

The source code of the example system built in this article is available in the following GitHub repository.

https://github.com/koyaaarr/between_poc_and_production

The versions of the main software used are as follows.
