Machine Learning — 2 (Regression)

Gaurav Madan
18 min read · Dec 17, 2019


Key Topics

  • Relationship between Input and Output variables of type Y = F(X)
  • Modeling Techniques: Statistical Modeling, Mathematical Modeling, Stochastic Modeling
  • Understanding the accuracy and stability of the model

Techniques

  • Linear Regression
  • Logistic Regression

What is Modeling?

It is a mathematical expression that specifies relationships / dependencies. The relationship does not necessarily signify causality.

X represents a Random Variable

x represents one value of a Random Variable

Types of Regression Techniques

  • When Y is numerical, we use Linear Regression
  • When Y is categorical, we use Logistic Regression

There are 30+ other regression techniques, but most problems can be solved using these two. These are open techniques with good interpretability; many of the more specialised techniques lack interpretability.

Linear Regression

  1. First we need to establish the input / predictor variables that have a strong relationship with Y

There can be 2 scenarios

  • Y is Numerical — X is Numerical
  • Y is Numerical — X is Categorical

Case: Both X and Y are Numerical

We calculate the Pearson correlation only if the relationship between X and Y is linear. To identify whether the relationship is linear, use a scatter plot. Correlation is a good measure of the relationship only when the relationship is linear.

Caveat: most of the relationships that we see in the real world will not appear to be linear. However, we will assume linearity to keep things simple.

If we intuitively know that it makes sense to use X as a predictor, but the correlation is close to zero and the scatter plot shows no relationship, then we should check whether Log(X) or Exp(X) has a relationship with Y.

  • abs(CR) > 0.5 — strong correlation
  • abs(CR) between 0.25 and 0.5 — some correlation
  • abs(CR) < 0.25 — little or no correlation

Once this exercise is done for all variables, rank-order the variables by their correlation with Y and choose the top 20–30 variables.
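
A minimal sketch of this rank-ordering step in Python (assuming a pandas DataFrame `df` whose numeric columns are the candidate predictors and whose target column is named "Y"; both names are illustrative):

```python
import pandas as pd

def rank_by_correlation(df: pd.DataFrame, target: str = "Y", top_n: int = 30) -> pd.Series:
    """Rank predictors by absolute Pearson correlation with the target."""
    corr = df.drop(columns=[target]).corrwith(df[target])  # Pearson correlation by default
    return corr.abs().sort_values(ascending=False).head(top_n)

top_vars = rank_by_correlation(df, target="Y", top_n=30)
```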

Case: Y is Numerical, X is Categorical (X has 2 categories)

  1. Use a T-Test to identify the relationship / dependence

H0: mean(Y) when X = category-1 EQUALS mean(Y) when X = category-2

If the null hypothesis is true, there is no relationship.

H1: the two means are not equal (a two-tailed test)

When X has more than two categories, we use ANOVA instead.
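
A minimal sketch of both tests with scipy, assuming a DataFrame `df` with a numeric target "Y" and a categorical column "X" (illustrative names):

```python
from scipy import stats

# One list of Y values per category of X
groups = [grp["Y"].values for _, grp in df.groupby("X")]

if len(groups) == 2:
    # Two categories: independent two-sample t-test (H0: the two group means are equal)
    stat, p_value = stats.ttest_ind(*groups)
else:
    # More than two categories: one-way ANOVA (H0: all group means are equal)
    stat, p_value = stats.f_oneway(*groups)

# A small p-value (e.g. < 0.05) suggests Y depends on X, so X is worth keeping
```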

When to use Linear Regression

  • Predict Y where X is known but Y is unknown
  • To explain and quantify the dependence of Y on different X, for strategic decision making

Real World Examples

  1. Market Mix Models / Marketing Mix Models

A company can spend its marketing / advertising budget on TV, Radio and Newspapers. The target is to maximise sales.

TV Spend | Radio Spend | Newspaper Spend | Sales
500 | 100 | 250 | 1500
250 | 100 | 500 | 1600
100 | 250 | 500 | 1700

To begin with, we plot one of the Spends, say TV Spend against Sales. Through this scatter plot, we need to find the predictor line.

This line is Y = aX + b

Linear Regression is not about predicting the exact value of Y for a given X; it is about getting predictions that minimize the error on an overall basis. The measure of error used here is MSE or RMSE, usually MSE.

MSE = average((actual - predicted)²)

= average((actual - aX - b)²)

We need to minimise MSE with respect to both a and b.

Techniques to minimize (or maximize) something

If there is a graph between X and some f(X) that looks like a bell-shaped curve, the maximum is at the point where the slope = 0, because after that point f(X) starts to decrease. The same logic applies to a minimum: MSE as a function of a is a U-shaped (convex) curve, so the minimum is where the derivative = 0.

  • Calculate MSE
  • Differentiate MSE with respect to a (and b)
  • Find a such that the derivative = 0
  • Find b = average(actual) - a * average(X)

This is called the Ordinary Least Squares (OLS) method for minimizing MSE.

a = tan(θ), where θ is the angle between the regression line and the horizontal.
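
A minimal sketch of the OLS idea for a single X, using numpy and the illustrative spend/sales numbers from the table above:

```python
import numpy as np

x = np.array([500, 250, 100], dtype=float)     # TV spend
y = np.array([1500, 1600, 1700], dtype=float)  # sales

# Setting the derivatives of MSE w.r.t. a and b to zero gives the closed-form OLS solution
a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - a * x.mean()

mse = np.mean((y - (a * x + b)) ** 2)
print(a, b, mse)
```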

Building this practically — in Excel

If there is a single X

  • Option 1: Right-click on the scatter plot, choose Add Trendline, set the option to Linear, and tick Display Equation on chart. The coefficients a and b are displayed on the chart
  • Option 2: Use Solver
  • Option 3: Data Analysis \ Regression

If there are multiple X

  • Use Solver: minimise MSE by changing a, b, c, d in Y = ax + by + cz + d
  • Data Analysis \ Regression. Coefficients column gives the values of a, b, c, d. + other Regression Statistics, Anova Statistics etc.

Understanding the accuracy and stability of the model

Regression Stats

  • Multiple R — provides the correlation between the combined input variables and the output variable. Pearson’s correlation only gives the correlation between a single X and Y. A high “Multiple R” signifies a good model
  • R Square

Var(Y) = average((Y - average(Y))²)

= average((Y - Pred(Y) + Pred(Y) - average(Y))²)

= average((Y - Pred(Y))²) + average((Pred(Y) - average(Y))²) + 2 * average((Y - Pred(Y)) * (Pred(Y) - average(Y)))

The cross term is zero for an OLS fit, so

= average((Y - Pred(Y))²) + average((Pred(Y) - average(Y))²)

Variance = Information

total information = information that we captured + information that we missed

TSS = SSE + MSS

R² = MSS / TSS

is an indicator of % information captured in the model

For a single X, Multiple R equals the absolute value of Pearson’s correlation coefficient.

  • Adjusted R square — with more X variables, R Square will keep increasing, because each variable contributes at least some random information. Adjusted R Square is a penalized version of R Square; if it is much lower than R Square, it means we have included X variables that do not contribute much and should remove them
  • Standard Error = TODO
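
The R Square and Adjusted R Square calculations above can be sketched in a few lines of numpy (assuming arrays `y` of actuals and `y_pred` of predictions from a model with `k` predictors; the names are illustrative):

```python
import numpy as np

def r_squared_stats(y, y_pred, k):
    n = len(y)
    tss = np.sum((y - y.mean()) ** 2)   # total sum of squares: total information
    sse = np.sum((y - y_pred) ** 2)     # error sum of squares: information missed
    mss = tss - sse                     # model sum of squares: information captured
    r2 = mss / tss
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)  # penalised for the number of predictors
    return r2, adj_r2
```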

ANOVA Metrics

Since the model is built on a sample, ANOVA metrics test whether it is also suitable for the population.

  • H0: All coefficients are equal to 0. H1: At least one coefficient is not equal to zero. If H0 is rejected, we know at least one coefficient is non-zero. If H0 is accepted, we don’t use this model.
  • Total Sum of Squares = Regression SS + Residual SS. Regression SS / TSS = R Square
  • MS = SS / df
  • F = MS(Regression) / MS(Residual)
  • This F follows an F Distribution, so we can calculate the P value, shown as “Significance F”. If this is < 0.05, then at least one of the coefficients is non-zero.

Now we need to check which coefficients are non-zero.

  • T Stat follows a T Distribution
  • P Value — H0: Beta is zero. H1: Beta is non-zero
  • Once you identify the variables which are insignificant, remove them one by one and re-run the regression.
  • After multiple iterations, if R Square and Adjusted R Square are still far apart, look at the absolute values of the coefficients and drop the one with the smallest value.
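
A minimal sketch of this iterative workflow with statsmodels (assuming a DataFrame `X` of predictors and a Series `y` as the target; both names are illustrative):

```python
import statsmodels.api as sm

model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.rsquared, model.rsquared_adj)   # R Square and Adjusted R Square
print(model.f_pvalue)                       # "Significance F" from the ANOVA table
print(model.pvalues)                        # per-coefficient p-values (t-tests)

# Backward elimination: drop the least significant variable and refit, one at a time
worst = model.pvalues.drop("const").idxmax()
if model.pvalues[worst] > 0.05:
    X = X.drop(columns=[worst])
```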

Elements of a Good Regression model

  • All variables should be significant
  • R Square should be as high as possible
  • Adjusted R Square should be as close to R Square as possible. There is no well defined threshold for this variation, but 2–5% is an acceptable variation.

Marketing Mix — Real Life

In real life, there are 2 factors that we need to consider

  • The Sales/Spend graph looks like an S curve. Initial small spends result in large gains in sales; subsequently, we get lower benefits from increased spends. For simplification, we choose Linear Regression
  • There is a time lag factor between spend and converted sales. To cater to this, we take the Beta (coefficient) from a distribution between Sales vs Weeks since Spend. The Beta would not be constant.

If the model is over predicting or under predicting, then we can tune the model by changing the intercept or the slope. If nothing works, we rebuild the model

Assumptions of Linear Regression

Assumption | Check | Rectifier

1. Y should have a linear relationship with all X | Scatter plot | Log / Exponent / Power transformation

2. Y follows a normal distribution / residuals (Actual - Predicted) should be normal | Histogram, Q-Q plot | Log / Exponent / Power / Box-Cox transformation

3. X values should not be inter-related with each other (no multicollinearity) | Variance Inflation Factor (VIF) | Drop variables that have high VIF; Ridge Regression, Lasso Regression, Elastic Net Regression (a combination of Ridge and Lasso)

4. Variance of residuals should be constant across all predicted values (homoscedasticity) | Scatter plot | Log / Exponent / Power transformation

5. Residuals of one observation should not be correlated with residuals of the following observations (no autocorrelation) | Durbin-Watson test | Log / Exponent / Power / Box-Cox transformation

Details of each Assumption

1. Y should have a linear relationship with all X

self explanatory / TODO

2. Y follows normal distribution / residuals (Actual — Predicted) should be normal

This is important because, if the residuals are not normal, F does not follow an F distribution and “Significance F” is not computed correctly.

Q-Q Chart

X axis: Theoretical Quantiles = Quantiles of N(0,1)

Y axis: Values From Observations = Quantiles of (R - Mean(R)) / Sd(R), where R are the residuals

If the points on both axes come from a normal distribution, they will form a straight line.

Rectifier

If the residuals don’t follow normality, you can transform Y (the target variable) using Log/Exp/Power. Worst case, use Box-Cox, which can transform most positive-valued variables towards a normal distribution.
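
A minimal sketch of the normality check and the Box-Cox rectifier with scipy (assuming `resid` holds the residuals and `y` the target values; Box-Cox requires strictly positive values):

```python
from scipy import stats
import matplotlib.pyplot as plt

# Q-Q plot of standardised residuals against the quantiles of N(0, 1)
stats.probplot((resid - resid.mean()) / resid.std(), dist="norm", plot=plt)
plt.show()

# Box-Cox returns the transformed Y and the fitted lambda (only works for positive values)
y_transformed, fitted_lambda = stats.boxcox(y)
```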

3. X values should not be inter-related with each other. No multi collinearity

We can check correlation between the X variables using a correlation matrix. However, it only shows the correlation between pairs of variables; there may be correlation involving more than two variables. If there is multicollinearity in the model, the signs of the coefficients between Xn and Y keep changing, we lose the directionality of the relationships, and confidence in the model is reduced.

VIF is a better technique to identify multicollinearity.

VIF = 1 / (1 - R²)

where R² comes from regressing that X variable on all the other X variables, as in the table below.

Variable | Regression with the Other Variables | VIF (ranges from 1 upwards)
X1 | X1 = b0 + b1 * X2 + b2 * X3 + b3 * X4 | VIF(X1)
X2 | X2 = b0 + b1 * X1 + b2 * X3 + b3 * X4 | VIF(X2)
X3 | X3 = b0 + b1 * X1 + b2 * X2 + b3 * X4 | VIF(X3)
X4 | X4 = b0 + b1 * X1 + b2 * X2 + b3 * X3 | VIF(X4)

If the VIF of a particular variable is > 10, then more than 90% of the information captured in this variable is already captured in the other variables.

Rectifier

If this is the case, drop the variable with the highest VIF, then re-run the regression. Drop variables one by one.

If the variable cannot be dropped, because it has business significance, then make its Beta very small. How? Using Ridge / Lasso regression techniques.
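
A minimal sketch of the VIF check with statsmodels (assuming a DataFrame `X` containing the candidate predictors):

```python
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_const = sm.add_constant(X)
vif = {col: variance_inflation_factor(X_const.values, i)
       for i, col in enumerate(X_const.columns) if col != "const"}

# Drop the variable with the highest VIF (if > 10), recompute, and repeat one variable at a time
```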

4. Variance of Residual should be constant across all values of predicted

Plot Predicted (x-axis) vs Residual (y-axis)

If the spread of the residuals grows or shrinks as the predicted value changes, that is heteroskedasticity. In such cases, the Betas will be unstable.

It is not as bad a problem as multicollinearity, so we can live with a little bit of heteroskedasticity.

Rectifier

Take a Log / Exp of Y and check if the problem is resolved.

5. Residuals of one observation should not have correlation with residuals from following observations

We find the correlation between residuals 1 to N-1 and residuals 2 to N (i.e. the lag-1 residuals) to check whether the residuals are correlated with their own lagged values.

When we do Regression in Python, we get a stat called Durbin-Watson. If its value is close to 2, there is no auto correlation. If this value is drastically different from 2, there is high auto correlation.
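
A minimal sketch of both checks (assuming the residuals are available as a numpy array `resid`):

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

lag1_corr = np.corrcoef(resid[:-1], resid[1:])[0, 1]  # correlation of residuals with their lag-1 values
dw = durbin_watson(resid)  # close to 2: no autocorrelation; towards 0 or 4: strong autocorrelation
```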

How to select variables if there are lot of variables

If there are K potential variables, use Max(20, Sqrt(K)) variables in regression.

Logistic Regression

Predict Y when Y is categorical. It is not continuous. This is a classification problem, as we need to predict the category of Y

We cannot use linear regression if Y is not continuous. This is because Y will not be normal.

Also, in linear regression, we try to predict the line. In this case, there is no line.

A regression line is not bounded between 0 and 1; it spreads between +/- infinity. When Y is categorical, the quantity we model, P(Y=1), is finite and bounded between 0 and 1 (mathematically).

Terminology

  • Odds of Horse A winning = # times A won / # times A did not win = P(A) / (1 - P(A))
  • Odds Ratio A winning wrt B = Odds of A / Odds of B

Logistic Regression states:

In case of Binary Classification problem,

log(P(Y=1) / (1 - P(Y=1))) = B0 + B1X1 + B2X2 + …

Which means, log of odds is a linear model.

We can build a model on Y=0, but as convention we always build for Y =1. Y=1 is the scenario which you are trying to predict.

  • P(Y=1) will always lie between 0 and 1.
  • Odds vary between 0 and infinity, so the log of odds varies between -infinity and +infinity, matching the right-hand side

The function log(P(Y=1) / (1 - P(Y=1))) is also called the Logit.

Since the equation is of the form log(P(Y=1) / (1 - P(Y=1))) = B0 + B1X1 + B2X2 + …, we cannot estimate the Betas using the OLS algorithm.

So we need to rewrite it in below manner

P (Y=1) = exp(B0 + B1X1 + B2X2 + …) / (1 + exp(B0 + B1X1 + B2X2 + …))

If P < 0.5, we can say, Y = 0, else Y = 1
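
A minimal sketch of this rewritten form (the feature matrix `X`, coefficient vector `betas` and intercept `b0` are illustrative placeholders, not fitted values):

```python
import numpy as np

def predict_proba(X, betas, b0):
    """P(Y=1) = exp(B0 + B.X) / (1 + exp(B0 + B.X)), i.e. the sigmoid of the linear predictor."""
    z = b0 + X @ betas
    return np.exp(z) / (1.0 + np.exp(z))

y_pred = (predict_proba(X, betas, b0) >= 0.5).astype(int)  # default 0.5 cutoff
```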

Errors in Logistic Regression

Since we cannot use OLS(Ordinary Least Squares) to get the Beta values in this model, we use a method called Maximum Likelihood.

The likelihood is the probability of getting the observations we actually have; Maximum Likelihood chooses the Betas that make this probability as large as possible.

p = P(Y=1)

Likelihood of an observation = p ^ Y * (1-p) ^ (1-Y)

If we have N observations,

Likelihood of all observations = Product of ( p_i ^ Y_i * (1 - p_i) ^ (1 - Y_i) ), where i indexes the observations and p_i = P(Y_i = 1).

The closer this likelihood can be pushed towards 1, the better the model, and the Beta values that maximize it are the ones we need.

Rationale behind maximizing towards 1: the training data has already been observed, so the probability of those observations having happened is 1; a good model should assign them as high a probability as possible.
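
In practice the product above is maximised in log form; a minimal sketch (assuming `y` holds the 0/1 actuals and `p` the predicted P(Y=1) for each observation):

```python
import numpy as np

def log_likelihood(y, p):
    # log of the product of p^y * (1-p)^(1-y) over all observations
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
```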

Maximizing the Likelihood

Newton-Raphson is an iterative method to maximize an expression; it is used when the equations obtained by setting the derivative to zero are too complex to solve directly.

If P(Y=1) > 0.5, then Y = 1, else Y = 0

The 0.5 cutoff comes from the scenario where 0 and 1 are equally likely.

If there is a data set where 0 occurs 99% of the time and 1 occurs 1% of the time, then 0.5 is not a valid cutoff. In this case,

  • we calculate the probabilities of a random sample P(Y=1) = 1 % = 0.01
  • If P(Y=1) > 0.01, then Y = 1 else Y = 0

Getting Predicted Value From Probability

  • One way is, to use 50% probability
  • Second way is, to take the probabilities of occurrence of 0 and 1 in the random sample
  • Third way is, to take a threshold based on past experience and use that as cutoff for classifying Y as 0 or 1. If higher precision is required, we can keep a value close to 80%.
  • Fourth way is, use the cutoff values that make your predictions accurate

Steps to achieve the 4th method

  • First decide a cutoff
  • Predict Y
  • Create a confusion matrix

Accuracy = (TP + TN) / (TP + FP + TN + FN)

If the model is accurate in predicting all the cases correctly, then accuracy =1

Sensitivity = TP / (TP + FN)

Out of actual 1’s, how many did the model predict correctly. Low sensitivity = 0, High sensitivity = 1

Specificity = TN / (TN + FP)

Out of actual 0’s, how many did the model predict correctly. Low specificity = 0, high specificity = 1

Precision = TP / (TP + FP)

Out of all the people the model predicted as 1, what % were actually 1.

Recall = TP / (TP + FN), which is the same as sensitivity.

Out of actual 1’s, how many did the model predict correctly. (The quantity TN / (TN + FN), i.e. out of all the people the model predicted as 0, what % were actually 0, is the negative predictive value.)

Steps:

  • Calculate the confusion matrix for various P between 0 and 1
  • Calculate above metrics for each confusion matrix
  • Depending on which metric you want to optimize for, you can select the probability cutoff.

In a general case, choose the cutoff where sensitivity + specificity is maximum; this is usually close to the point where the sensitivity and specificity curves intersect.
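
A minimal sketch of this cutoff search with scikit-learn (assuming `y_true` holds the actual 0/1 labels and `p` the predicted probabilities):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def metrics_at_cutoff(y_true, p, cutoff):
    y_pred = (p >= cutoff).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }

# Evaluate a grid of cutoffs and pick the one where sensitivity + specificity is maximum
results = {c: metrics_at_cutoff(y_true, p, c) for c in np.arange(0.05, 1.0, 0.05)}
best_cutoff = max(results, key=lambda c: results[c]["sensitivity"] + results[c]["specificity"])
```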

However, all these metrics are useful only when the underlying model is good at predicting Y. How do we check if the model is good?

Goodness of fit of a Classification Model

There are 4 measures of goodness

  1. AUC — area under the curve
  2. C measure or concordance measure
  3. KS or Kolmogorov-Smirnov measure
  4. Gini

These measures are valid for all classification models where the algorithm gives us a probability, e.g. Logistic Regression, Decision Tree, Random Forest etc.

  1. AUC
  • Plot Sensitivity on the y axis and (1 - Specificity) on the x axis
  • This is called the ROC curve
  • AUC is the area under the ROC curve; the 45-degree diagonal corresponds to a random model
  • The greater the area, the better the model
  • A high AUC indicates a very high true positive rate together with a very low false positive rate
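
A minimal sketch with scikit-learn (again assuming actual labels `y_true` and predicted probabilities `p`):

```python
from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, thresholds = roc_curve(y_true, p)   # x axis = 1 - specificity, y axis = sensitivity
auc = roc_auc_score(y_true, p)                # 0.5 = random model, 1.0 = perfect separation
```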

2. Concordance Measure

If the model is good, then P(Y=1) when Y is actually 1 >> P(Y=1) when Y = 0. Therefore, if you take

  • A random observation A from training data, such that actual Y = 1
  • Another random observation B, such that actual Y = 0

and you find that P(Y=1) for A > P(Y=1) for B, then the model is concordant.

There can be 3 possible outcomes: the pair of observations can be concordant, discordant, or tied.

C measure = % concordance for all possible pairs of observations

The C measure should be well above 50% in good models (a random model gives about 50%).

3. KS

  • Sort the predictions based on probabilities, high to low
  • Group the data into deciles
  • Find number of 1s in each decile
  • Find number of 0s in each decile
  • Find cumulative number of 1s in each decile
  • Find cumulative number of 0s in each decile
  • Find % cumulative 1s till each decile = A
  • Find % cumulative 0s till each decile = B
  • Find A — B. This is KS

This defines the separation that model is able to achieve between zeroes and ones.

The decile in which the KS is achieved, is the point till which the model is performing well. In most practical cases, we are not really interested in finding the exact separation between 1s and 0s. All we need is the Top 1s, which we are able to confidently identify.

  • KS can be used as a cutoff point also
  • As a convention, if KS is not achieved within the first 3 deciles, the model is rejected
  • The exception to this rule is when the % of 1s is more than 30%. In general, we assume that the % of 1s in the development data is < 10%
  • A minimum of 50% KS needs to be achieved. Otherwise, the model is rejected.
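
A minimal sketch of the decile-based KS table with pandas (assuming `y_true` and `p` as before):

```python
import pandas as pd

df_ks = pd.DataFrame({"y": y_true, "p": p})
# Decile 1 = the 10% of observations with the highest predicted probability
df_ks["decile"] = pd.qcut(df_ks["p"].rank(method="first", ascending=False), 10, labels=False) + 1

grouped = df_ks.groupby("decile")["y"].agg(ones="sum", total="count")
grouped["zeros"] = grouped["total"] - grouped["ones"]
grouped["cum_pct_ones"] = grouped["ones"].cumsum() / grouped["ones"].sum()
grouped["cum_pct_zeros"] = grouped["zeros"].cumsum() / grouped["zeros"].sum()
grouped["ks"] = grouped["cum_pct_ones"] - grouped["cum_pct_zeros"]

print(grouped["ks"].max(), grouped["ks"].idxmax())  # KS value and the decile where it is achieved
```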

4. Gini

We plot the Deciles on x axis and plot “cumulative % of 1s” on y axis

The higher the area under the curve, the better the model. This is because the 1 values will be concentrated towards the top of the list (the list being sorted by highest probability first).

Gini tells how good the model is: the higher the value, the better the model.

How to choose when there are a lot of variables?
X variables can be continuous or categorical.

  • If the variable is continuous, do a T-Test comparing its mean at Y=0 and Y=1. If the means are the same, the variable is not of much use and can be dropped. Keep variables that have very low P values
  • For categorical variables, do a Chi-Square test; keep variables that have a high chi-square test statistic, i.e. very low P values
  • Calculate VIF for continuous variables and remove multicollinear variables
  • Build a logistic regression and drop variables that have high p-values

Business Use Cases of Logistic Model

  1. Product propensity models — what will the uptake of a new product be? Who are the people who will be interested? The X variables depict previous behaviours of these customers; Y depicts the uptake of the product
  2. Fraud
  3. Bankruptcy
  4. Default on Loan Payments
  5. Choice (Given a customer, which of the car models will be more interesting to the customer)

Case Study — Banking

Credit Risk Modeling

There are 3 types of models created for modeling credit risk.

  1. Probability of Default Model (PD Model) — This is the probability of default on a particular bank loan / credit

2. Loss Given Default Model (LGD Model) — if a customer defaults with amount X on their credit card, the bank expects that Y can be recovered using debt collection agencies.

The LGD model predicts (X - Y) / X * 100%, i.e. the % of the exposure that is lost.

Based on #1 and #2

Loss % = PD * LGD

What % of money is the customer likely to default with.

3. Exposure at Default — Amount of money that a customer defaults with.

Eg.

  • 1-Jan | DPDX (Days Past Due for X amount)
  • 1-Feb | DPD30 (customer has not paid for at least 30 days)
  • 1-Mar | DPD60
  • 1-Apr | DPD90
  • 1-May | DPD120 (This is called Default)

On DPD120, the customer is marked as default and reported to credit bureaus. The case is passed to Recovery Agents.

  • EAD | X (The amount of principal at the time of default)
  • Recovery | Y (assume)
  • Loss Amount | X — Y
  • LGD | (X - Y) / X
  • Expected overall loss across customers | EAD * LGD * PD
  • ALLL (Allowance for Loan and Lease Losses) | amount set apart from profits in the Balance Sheet, to cater to such losses.
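
A small illustrative calculation with made-up numbers (not from the article):

```python
ead = 100000      # Exposure at Default, e.g. outstanding principal X
pd_prob = 0.04    # Probability of Default from the PD model
lgd = 0.60        # Loss Given Default, i.e. (X - Y) / X from the LGD model

expected_loss = ead * pd_prob * lgd   # 100000 * 0.04 * 0.60 = 2400
```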

Steps for building a model

1. Do an Exploratory Data Analysis

2. Share the output with business

3. Outlier treatment

For continuous variables that have outliers, we have 3 options:

  • Drop the observation
  • we can replace by mean
  • we can do flooring / capping

We do not do outlier treatment for all data. We only treat variables that indeed have outliers.

Categorical variables cannot have outliers. If we encode categorical variables as numbers (say A = 1, B = 2, C = 3, D = 4, with D being rare) and then apply capping at the 99th percentile value, the rare category will get replaced by a less rare one (say 3, i.e. C).

4. Missing value treatment

  • In 99% of cases, missing values make up less than 10% of the data. In such cases, we can simply replace them with the mean or mode.
  • If the missing values are > 10%, we can use MICE / a Random Forest model to predict those missing values and replace them in the data.
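
A minimal sketch of both strategies with scikit-learn (assuming a numeric feature DataFrame `df`; IterativeImputer is scikit-learn's MICE-style imputer and is still marked experimental):

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (exposes IterativeImputer)
from sklearn.impute import SimpleImputer, IterativeImputer

# < 10% missing: simple mean replacement (use strategy="most_frequent" for the mode)
df_simple = SimpleImputer(strategy="mean").fit_transform(df)

# > 10% missing: MICE-style iterative imputation, each feature modelled from the others
df_mice = IterativeImputer(random_state=0).fit_transform(df)
```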

5. Variable Selection

We are interested in finding X variables that have a strong relationship with Y.

There are 4 numerical tests that can be used:

  • T Test
  • Chi Square
  • Somer’s D
  • Variance Inflation Factor (VIF)

For a small number of variables, we can create heat maps, though they only show one-to-one relationships.

For a larger number of variables, we can calculate VIF, which also identifies many-to-one relationships.

Do a T test for all the numerical variables. Then take the Top X with low P values

Do a Chi-Square test for all the categorical variables. Then take the top X with high test statistic values.

Logit Plot — Weight of Evidence.

Identify the monotonicity of the relationship between X and Y: Y should either consistently increase or consistently decrease as X increases. In logistic regression we assume a monotonic relationship, so we might have to apply some transformations at this step.

Binning and Bucketing

Somers’ D is a test statistic that tells how related X (a continuous variable) and Y (a categorical variable) are. The higher the value, the higher the (pseudo) correlation and the better the variable.
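
A minimal sketch of a Weight of Evidence calculation for one binned X (assuming a DataFrame `df` with a binned predictor column "X_bin" and a 0/1 target "Y"; the column names are illustrative):

```python
import numpy as np
import pandas as pd

grp = df.groupby("X_bin")["Y"].agg(events="sum", total="count")
grp["non_events"] = grp["total"] - grp["events"]
# WOE per bin: log of (% of all events in the bin) over (% of all non-events in the bin)
grp["woe"] = np.log((grp["events"] / grp["events"].sum()) /
                    (grp["non_events"] / grp["non_events"].sum()))
# Plotting woe against the bins shows whether the relationship with the log-odds is monotonic
```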

Variance Inflation Factor (VIF)

How to select from 2 buckets — continuous and categorical?

For categorical and numerical variables, we apply variable selection independently. How do we identify the individual cutoff?

Select variables in the same proportion as the overall distribution. E.g. there are 200 variables in total: 50 categorical and 150 numerical. If you have to select 20 in total, select 5 categorical and 15 numerical variables. This is not a hard and fast rule though.

How to go from 500 variables to 20 variables?

Don’t go from 500 to 20 variables using just test statistics. Build a funnel: 500 to 200 to 50 to 20. At each stage, reassess whether you need to reduce further or whether you can manage this volume of variables.

6. Build the Model

Training and Testing split

R Square can only be calculated for a numerical Y. For a categorical Y, we can only get an indicative number: Pseudo R Square gives an indicative value of R Square. It is used to prioritize between 2 models; the higher the value, the better the model.

Log Likelihood — Log of the “likelihood” of the observation that we already have.

LLR — Log Likelihood Ratio test — This is equivalent to the ANOVA test from Linear Regression. If its p-value is less than 0.1 (at the 90% significance level) or 0.01 (at the 99% significance level), the model is good: at least one Beta is non-zero.

Z Test

In the P > |z| column, keep variables whose p-value is below the chosen significance level (e.g. 0.01) and drop the variable with the highest p-value otherwise. Run this iteratively, one variable at a time, just like in Linear Regression.
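
A minimal sketch of this step with statsmodels (assuming `X_train` is a DataFrame of the selected predictors and `y_train` the 0/1 target):

```python
import statsmodels.api as sm

logit_model = sm.Logit(y_train, sm.add_constant(X_train)).fit()
print(logit_model.prsquared)    # Pseudo R Square
print(logit_model.llf)          # Log Likelihood
print(logit_model.llr_pvalue)   # LLR test p-value (analogue of "Significance F")
print(logit_model.pvalues)      # P > |z| per coefficient: drop the worst variable and refit
```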

7. Calculate Gini / AUC

Use this to find the goodness of the model

8. Decide cutoff using Sensitivity / Specificity

Use this to decide the cutoff at which we predict Y as 0 or 1.

9. Confusion Matrix
