How to Deploy a Logistic Regression Model in GCP

An insider’s view on Logistic Regression and how to deploy a Logistic Regression model in GCP as a batch prediction…

Himanshu Swamy
Analytics Vidhya
9 min read · Aug 5, 2019


Over the past two decades, Machine Learning has become one of the mainstays of information technology and, with that, a rather central, albeit usually hidden, part of our lives. With the ever-increasing amounts of data becoming available, there is good reason to believe that smart data analysis will become even more pervasive as a necessary ingredient for technological progress. This post will give an inside look into Logistic Regression and a problem that most businesses face: predicting customer default.

Introduction

Logistic Regression is a regression technique used to study the relationship between a dependent variable and one or more independent variables when the dependent variable is categorical. Like all regression analyses, logistic regression is a predictive analysis. It is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.

Sigmoid or Logistic Curve

Logistic Regression Equation

Types of Logistic Regression

Binary and Multinomial

Why Logistic Regression?

Why can’t we use OLS Linear Regression technique to model a BINARY Dependent Variable?

The Linear Probability Model is defined as:

pi = β0 + β1Xi

where pi = probability of occurrence of the event

Two main reasons why OLS Linear Regression does not work with a binary target:

  1. Technical concern: it violates assumptions. A binary (i.e., dichotomous) dependent variable in a linear regression model violates the assumptions of:

— Homoscedasticity

— Normality of the Error Term

2. Fundamental Issue: Bounded Probabilities —

The Linear Probability Model is given by: pi = β0 + β1Xi

— If X has no upper or lower bound, then for any value of β there are values of X for which either pi > 1 or pi < 0

— This is contradictory, as the true values of probabilities should lie within the (0,1) interval

How do we solve the issue of Bounded Probabilities?

2.a) Use Odds instead of Probability of Event — Odds is defined as:

Odds = pi / (1 − pi) = probability of event / probability of non-event

  • As probability of event ranges from 0 to 1, odds ranges from 0 to ∞
  • Transforming probabilities to odds removes the upper bound

2.b) Take the Natural Logarithm of Odds. The log-odds (logit) then ranges from −∞ to +∞, so the remaining lower bound is removed as well.

Some Key components to remember:

1) Sigmoid Function

1.1) Logistic Regression Model:

Z = log(p / (1 − p)) = β0 + β1X1 + β2X2 + … + βkXk

1.2) The probability of the event is therefore estimated from the logit (‘model score’) by the transformation p = 1 / (1 + e^(−Z)), i.e. the sigmoid function.
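A minimal Python sketch of this logit-to-probability transformation (the coefficients and inputs below are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    """Convert a logit (model score) into a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative coefficients and a single observation
beta = np.array([-1.5, 0.8, 0.3])   # beta0, beta1, beta2 (hypothetical values)
x = np.array([1.0, 2.0, 0.5])       # 1 for the intercept, then X1, X2

z = np.dot(beta, x)                 # logit: beta0 + beta1*X1 + beta2*X2
p = sigmoid(z)                      # estimated probability of the event
print(f"logit = {z:.3f}, P(Y=1) = {p:.3f}")
```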

2) Estimation Method: Maximum Likelihood Estimation (MLE)

2.1) Construct Likelihood Function, expressing the likelihood of observing values of dependent variable Y for all n observations

2.2) Create the log-likelihood function to simplify the equation

2.3) Choose values of β’s to maximize log likelihood function
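As a rough illustration of steps 2.1 to 2.3, the sketch below writes down the negative log-likelihood for a binary target and lets a general-purpose optimiser choose the β’s; the data is synthetic and only meant to show the mechanics:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])  # intercept + one predictor
y = rng.integers(0, 2, size=200)                           # synthetic binary target

def neg_log_likelihood(beta, X, y):
    """Negative log-likelihood of a binary logistic model."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    eps = 1e-12                                            # guard against log(0)
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# Choose the betas that maximise the log-likelihood (minimise its negative)
result = minimize(neg_log_likelihood, x0=np.zeros(X.shape[1]), args=(X, y))
print("Estimated coefficients:", result.x)
```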

3) Logistic Regression: Some Key Assumptions

3.1) Dependent variable has to be categorical (dichotomous for binary logistic regression)

3.2) P(Y=1) is the probability of occurrence of event

  • Dependent variable is to be coded accordingly
  • For a binary logistic regression, the class 1 of the dependent variable should represent the desired outcome

3.3) Error terms need to be independent. Logistic regression requires each observation to be independent.

3.4) Model should have little or no multicollinearity.

3.5) Logistic regression assumes linearity of independent variables and log odds.

3.6) Sample size should be large enough.

4) Odds Ratio

An odds ratio for a predictor is defined as the relative amount by which the odds of the outcome increase (Odds Ratio > 1) or decrease (Odds Ratio < 1) when the value of the predictor variable is increased by 1 unit.

4.1) Interpretation

Interpretation of odds ratio depends on the type of predictor: binary or continuous

Odds ratio
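In practice, the odds ratio for each predictor is simply the exponentiated coefficient. A minimal sketch with scikit-learn on synthetic data (the feature names and values are illustrative only):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic training data: two features and a binary target
rng = np.random.default_rng(1)
X = pd.DataFrame({"AMT_INCOME_TOTAL": rng.normal(size=500),
                  "CNT_CHILDREN": rng.integers(0, 4, size=500)})
y = rng.integers(0, 2, size=500)

model = LogisticRegression().fit(X, y)

# exp(beta) turns a coefficient on the log-odds scale into an odds ratio
odds_ratios = pd.Series(np.exp(model.coef_[0]), index=X.columns)
print(odds_ratios)
```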

5) Known-how of Model Validation

During the process of model building, the modeler must be constantly concerned with how closely the model reflects the system definition. Model validation is the process of determining the degree to which a statistical, software-generated model (based on input data) is an accurate representation of the real world.

5.1) Why is Validation Needed?

5.1.1) Generalization

To ascertain whether predicted values from the model are likely to accurately predict responses on future subjects or subjects not used to develop the model.

5.1.2) Stability Check

To test how consistently the model is going to perform over time.

5.1.3) Robustness Check

To test whether the model is an appropriate representation of the real world for the stated purpose and whether the model is acceptable for its intended use.

5.2) Components of Model Validation

5.2.1) Sampling Strategies

  • Sampling strategies are aimed at addressing the uncertainty that can arise in tests using empirical data.
  • Examples: Cross Validation, Bootstrapping, Out-of-Sample
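As a small illustration of one such sampling strategy, the sketch below runs 5-fold cross-validation with scikit-learn on synthetic data; the spread of the fold scores gives a feel for model stability:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))        # synthetic feature matrix
y = rng.integers(0, 2, size=500)     # synthetic binary target

# 5-fold cross-validation scored by AUC
scores = cross_val_score(LogisticRegression(), X, y, cv=5, scoring="roc_auc")
print("Fold AUCs:", np.round(scores, 3), "mean:", scores.mean().round(3))
```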

5.2.2) Power-Testing

  • Power-testing techniques are aimed at measuring the model’s discriminatory power, i.e. how well it separates events from non-events.
  • Examples: Classification Table, K-S Statistic, AUC and Concordance for a classification model

5.2.3) Calibration

  • Calibration techniques are aimed at assessing how closely the model’s predictions match with the actual (i.e. observed) values.
  • Examples: Hosmer-Lemeshow test for a classification model

Different Types of Validation Methods

1. Classification Table (Confusion Matrix)

  • 2x2 matrix of actual and predicted classes
  • Also known as Confusion Matrix or Contingency Table
  • The greater the sum of the primary diagonal (TP + TN), the higher the degree of classification accuracy.
Classification/Confusion Matrix
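A minimal sketch of building such a classification table with scikit-learn, using made-up labels and probabilities:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])                   # actual classes
y_prob = np.array([0.2, 0.6, 0.8, 0.4, 0.1, 0.9, 0.3, 0.7])   # predicted probabilities
y_pred = (y_prob >= 0.5).astype(int)                          # classify with a 0.5 cut-off

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / len(y_true)                            # primary-diagonal share
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}, accuracy={accuracy:.2f}")
```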

2) Concordance and Discordance

  • Concordant : A pair of an event and a non-event is said to be a concordant pair if the event observation has higher predicted event probability than the non-event observation.
  • Discordant : A pair of an event and a non-event is said to be a discordant pair if the event observation has lower predicted event probability than the non-event observation.
  • Tied : A pair of an event and a non-event is said to be a tied pair if the predicted event probability for both the event and the non-event observations is exactly the same.
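A brute-force sketch of counting concordant, discordant and tied pairs (fine for small samples; the data below is made up):

```python
import numpy as np

y_true = np.array([1, 0, 1, 0, 0, 1])
y_prob = np.array([0.9, 0.3, 0.6, 0.6, 0.2, 0.8])

event_probs = y_prob[y_true == 1]        # predicted probabilities for events
nonevent_probs = y_prob[y_true == 0]     # predicted probabilities for non-events

concordant = sum(p_e > p_n for p_e in event_probs for p_n in nonevent_probs)
discordant = sum(p_e < p_n for p_e in event_probs for p_n in nonevent_probs)
tied = sum(p_e == p_n for p_e in event_probs for p_n in nonevent_probs)

total_pairs = len(event_probs) * len(nonevent_probs)
print(f"Concordance = {concordant / total_pairs:.2%}, "
      f"Discordance = {discordant / total_pairs:.2%}, Tied = {tied / total_pairs:.2%}")
```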

3) Receiver Operating Characteristics (ROC)

ROC graph is a 2-dimensional graph in which:

  • True positive rate is plotted on the Y-axis.
  • False positive rate is plotted on the X-axis.
ROC Curve
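A minimal sketch of drawing the ROC curve with scikit-learn and matplotlib, using made-up labels and scores:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_prob = np.array([0.2, 0.6, 0.8, 0.4, 0.1, 0.9, 0.3, 0.7])

fpr, tpr, _ = roc_curve(y_true, y_prob)   # false/true positive rates at each threshold
auc = roc_auc_score(y_true, y_prob)

plt.plot(fpr, tpr, label=f"ROC curve (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="Random (45° line)")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```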

4) Gini Coefficient

The Gini coefficient is a measure of the degree of discrimination between goods (non-events) and bads (events).

  • Gini coefficient is twice the area between ROC curve and 45° random line of equality
  • Gini coefficient varies between 0 and 1

Gini = 0 implies no discrimination

Gini = 1 implies perfect discrimination

Gini Coefficient
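Since the Gini coefficient is twice the area between the ROC curve and the 45° line, it can be computed directly from the AUC. A minimal sketch with illustrative data:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_prob = np.array([0.2, 0.6, 0.8, 0.4, 0.1, 0.9, 0.3, 0.7])

# Gini = 2 * AUC - 1: an AUC of 0.5 (random model) gives Gini 0, an AUC of 1 gives Gini 1
gini = 2 * roc_auc_score(y_true, y_prob) - 1
print(f"Gini coefficient = {gini:.2f}")
```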

5) Kolmogorov-Smirnov (K-S) Statistic

The K-S statistic is the maximum vertical difference between the cumulative distribution curve for events (bads) and the cumulative distribution curve for non-events (goods).

The K-S curve is shown below. It is drawn by plotting the cumulative % of events and the cumulative % of non-events against the cumulative % of the population. The higher the K-S, the better the model.

Be careful: K-S is based on a single point on the good and bad distributions, the point where the cumulative distributions differ the most. It shouldn’t be relied upon without carefully looking at the distributions.
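A short pandas sketch of computing the K-S statistic as the maximum gap between the two cumulative distributions (the data is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "y": np.array([1, 0, 1, 0, 0, 1, 0, 1, 0, 0]),                        # 1 = event (bad)
    "prob": np.array([0.9, 0.3, 0.6, 0.5, 0.2, 0.8, 0.4, 0.7, 0.1, 0.35]),
}).sort_values("prob", ascending=False)

# Cumulative share of events and non-events as we move down the sorted scores
cum_events = df["y"].cumsum() / df["y"].sum()
cum_nonevents = (1 - df["y"]).cumsum() / (1 - df["y"]).sum()

ks = (cum_events - cum_nonevents).abs().max()
print(f"K-S statistic = {ks:.2f}")
```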

6) Decile-wise Event Rate Chart

A decile-wise event rate chart is plotted to gauge if the event rate rank orders well.

  • Moving down from Decile 1 to Decile 10, average value of target (i.e. event rate) should ideally fall monotonically.
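A pandas sketch of such a decile-wise event rate table, using synthetic scores and outcomes (all values are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
prob = rng.uniform(size=1000)                    # predicted probabilities
y = rng.binomial(1, prob)                        # outcomes correlated with the score

df = pd.DataFrame({"prob": prob, "y": y})
df["decile"] = pd.qcut(df["prob"].rank(method="first"), 10, labels=False)
df["decile"] = 10 - df["decile"]                 # decile 1 = highest predicted probability

event_rate = df.groupby("decile")["y"].mean()
print(event_rate)                                # should ideally fall from decile 1 to 10
```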

7) Hosmer — Lemeshow Test

  • Hosmer - Lemeshow test is a goodness-of-fit test for a binary target variable.
  • Unlike many other goodness-of-fit measures, it does not focus on gauging the model’s discriminatory power but aims at judging how closely the observed and the predicted values match.
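There is no built-in Hosmer-Lemeshow test in scikit-learn, so the sketch below implements the usual 10-group version by hand with pandas and SciPy; the data is synthetic and the grouping/formula follow the standard textbook construction:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2

rng = np.random.default_rng(3)
prob = rng.uniform(0.05, 0.95, size=2000)            # predicted probabilities
y = rng.binomial(1, prob)                            # illustrative outcomes

df = pd.DataFrame({"prob": prob, "y": y})
df["group"] = pd.qcut(df["prob"], 10, labels=False)  # 10 groups by predicted risk

grouped = df.groupby("group").agg(obs=("y", "sum"), n=("y", "size"), exp=("prob", "sum"))
# Chi-square statistic comparing observed vs expected events per group
hl_stat = (((grouped["obs"] - grouped["exp"]) ** 2 /
            (grouped["exp"] * (1 - grouped["exp"] / grouped["n"]))).sum())
p_value = chi2.sf(hl_stat, df=len(grouped) - 2)      # g - 2 degrees of freedom
print(f"HL statistic = {hl_stat:.2f}, p-value = {p_value:.3f}")
```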

Practical Implementation

By now you are aware of one of the most common and widely used ML algorithms in the industry. Now let’s apply the model to a sample dataset of banking customers and build an end-to-end pipeline on the Google AI Platform.

Use Case:

The goal is to predict the probability of default for a loan service dedicated to providing lines of credit (loans) to the unbanked population, i.e. predicting whether or not a client will repay a loan or have difficulty.

The activities around data understanding, cleaning, validation, feature engineering and model fitting have been performed separately. This part of the article focuses on model deployment in the cloud.

About the data:

  • SK_ID_CURR — Customer ID
  • TARGET — Will Pay or Not
  • CNT_CHILDREN — Count of children
  • AMT_INCOME_TOTAL — Total Income
  • AMT_CREDIT — Amount Credited every month
  • And other features on customer demographics, employment and source of income.

Procedure:

  1. Data cleaning: the dataset is very clean, so little modification is needed.
  2. EDA: Looking at the column names, I noticed several columns with very similar names, which suggests a potential multicollinearity problem. I plotted features with similar names against each other, and they showed strong correlations, which indicates that feature selection is needed since the model I intended to use is a regression. Here is one plot:

3. Feature engineering: Before fitting my model, there are two things I need to do: remove or combine correlated features and label-encode my categorical features. A minimal sketch of both steps follows.
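The sketch below uses hypothetical column names, not necessarily those of the real dataset:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "AMT_CREDIT": [200000, 350000, 150000],
    "AMT_CREDIT_DUP": [201000, 349000, 151000],        # hypothetical near-duplicate column
    "NAME_CONTRACT_TYPE": ["Cash loans", "Revolving loans", "Cash loans"],
})

# Drop one column of a highly correlated pair
df = df.drop(columns=["AMT_CREDIT_DUP"])

# Label-encode the categorical feature so the regression can consume it
df["NAME_CONTRACT_TYPE"] = LabelEncoder().fit_transform(df["NAME_CONTRACT_TYPE"])
print(df)
```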

4. Once the model is finalized with all the necessary validations and we have checked its performance against the train, test and validation datasets, it’s time to deploy it on the GCP AI Platform for batch prediction.

DEPLOYMENT PROCEDURE

  1. Upload the dataset to a GCP bucket along with the SQL script. This SQL script will be used to create tables in Cloud SQL, which acts as the data source to build the model. Predictions on the new dataset will also be stored as output in Cloud SQL. Please find the screenshot below.

2. Once we create the database in Cloud SQL, our SQL scripts will create the tables.

CREDIT_RISK IS OUR DATABASE

3. Now we run our model development script, Model_development.py, to train the model and generate a pickle file.

The complete script is on GitHub; please refer to the link at the bottom of the article.
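For orientation, here is a condensed sketch of what a script like Model_development.py typically does: load the training data, fit the logistic regression and export model.joblib. The file name, column choices and data source below are placeholders, not the actual script:

```python
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder: in the real pipeline this would be read from the Cloud SQL table
data = pd.read_csv("credit_risk_train.csv")

X = data.drop(columns=["SK_ID_CURR", "TARGET"])
y = data["TARGET"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))

# AI Platform's scikit-learn runtime expects the exported file to be named model.joblib
joblib.dump(model, "model.joblib")
```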

4. Now we register the pickle (joblib) file in AI Platform and create a model version. For this we will be using Bash commands (refer to GitHub).

When creating the model version, make sure you specify the directory in your GCP bucket that contains your model.joblib file.
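If you prefer Python over gsutil for uploading the exported file, a sketch with the google-cloud-storage client is shown below; the project, bucket and directory names are placeholders:

```python
from google.cloud import storage

# Placeholders: replace with your own project, bucket and model directory
client = storage.Client(project="my-gcp-project")
bucket = client.bucket("my-model-bucket")

# AI Platform expects the model *directory*, not the file, when creating the version
blob = bucket.blob("credit_risk_model/model.joblib")
blob.upload_from_filename("model.joblib")
print("Uploaded to gs://my-model-bucket/credit_risk_model/")
```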

Model(Pickle) Image created
Model Version

5. Once our model is deployed and ready to use, we will get the success message below.

6. Our last and final step is to run the deployment script. This will ingest the new data, make predictions using the deployed model on AI Platform, and store the output in Cloud SQL as shown below. This is how we get batch predictions within the GCP platform.
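For orientation, a rough sketch of such a deployment script is shown below. It reads new rows from Cloud SQL, scores them through the deployed AI Platform model via the ml v1 predict API (the actual GitHub script may instead submit a dedicated batch prediction job), and writes the results back; all project, model, table and connection names are placeholders:

```python
import pandas as pd
import sqlalchemy
from googleapiclient import discovery

PROJECT = "my-gcp-project"          # placeholder
MODEL = "credit_risk_model"         # placeholder
DB_URI = "mysql+pymysql://user:password@127.0.0.1:3306/CREDIT_RISK"  # via Cloud SQL proxy

engine = sqlalchemy.create_engine(DB_URI)
new_data = pd.read_sql("SELECT * FROM new_customers", engine)   # placeholder table name

# Call the deployed model; instances must match the feature order used in training
service = discovery.build("ml", "v1")
name = f"projects/{PROJECT}/models/{MODEL}"
body = {"instances": new_data.drop(columns=["SK_ID_CURR"]).values.tolist()}
response = service.projects().predict(name=name, body=body).execute()

# Store the predictions back in Cloud SQL
out = pd.DataFrame({"SK_ID_CURR": new_data["SK_ID_CURR"],
                    "PREDICTION": response["predictions"]})
out.to_sql("predictions", engine, if_exists="replace", index=False)
```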

Deployment Script
Output within Cloud Console
Sample output In Cloud Sql

Now that we have set up our model and deployed it with GCP, we can run it with the assurance that our user data will be secure.

Feel free to improve on this approach. Suggestions are always welcome.

Full code: https://github.com/himswamy/GCP_Batch_Prediction.git

Happy Learning!!
