Machine Learning — 2 (Regression)

Gaurav Madan
18 min read · Dec 17, 2019


Key Topics

  • Relationship between Input and Output variables of type Y = F(X)
  • Modeling Techniques: Statistical Modeling, Mathematical Modeling, Stochastic Modeling
  • Understanding the accuracy and stability of the model

Techniques

  • Linear Regression
  • Logistic Regression

What is Modeling?

It is a mathematical expression that specifies relationships / dependencies. The relationship does not necessarily signify causality.

X represents a Random Variable

x represents one value of a Random Variable

Types of Regression Techniques

  • When Y is numerical, we use Linear Regression
  • When Y is categorical, we use Logistic Regression

There are 30+ other regression techniques, but most problems can be solved using these two. These are open techniques with good interpretability; many of the more specialised techniques lack interpretability.

Linear Regression

  1. First we need to establish the input / predictor variables that have a strong relationship with Y

There can be 2 scenarios

  • Y is Numerical — X is Numerical
  • Y is Numerical — X is Categorical

Case: Both X and Y are Numerical

We calculate the Pearson correlation only if the relationship between X and Y is linear. To identify whether the relationship is linear, use a scatter plot. Correlation is a good measure of the relationship only when the relationship is linear.

Caveat: most of the relationships that we see in the real world will not appear to be linear. However, we will assume linearity to keep things simple.

If we intuitively know that it makes sense to use X as a predictor, but the correlation is close to zero and the scatter plot shows no relationship, then we should check whether Log(X) or Exp(X) has a relationship with Y.

  • abs(CR) > 0.5 — strong correlation
  • abs(CR) between 0.25 and 0.5 — some correlation
  • abs(CR) < 0.25 — little or no correlation

Once this exercise is done for all variables, rank-order the variables by their correlation with Y and choose the top 20–30 variables.
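
A minimal sketch of this rank-ordering step in Python (assuming a pandas DataFrame `df` whose numeric columns are the candidate predictors and whose target column is named "Y"; both names are illustrative):

```python
import pandas as pd

def rank_by_correlation(df: pd.DataFrame, target: str = "Y", top_n: int = 30) -> pd.Series:
    """Rank predictors by absolute Pearson correlation with the target."""
    corr = df.drop(columns=[target]).corrwith(df[target])  # Pearson correlation by default
    return corr.abs().sort_values(ascending=False).head(top_n)

top_vars = rank_by_correlation(df, target="Y", top_n=30)
```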

Case: Y is Numerical, X is Categorical (X has 2 categories)

  1. Use a T-Test to identify the relationship / dependence

H0: mean(Y) when X = category-1 EQUALS mean(Y) when X = category-2

If the null hypothesis is true, there is no relationship.

H1: the two means are not equal (a two-tailed test)

When X has more than two categories, we use ANOVA instead.
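
A minimal sketch of both tests with scipy, assuming a DataFrame `df` with a numeric target "Y" and a categorical column "X" (illustrative names):

```python
from scipy import stats

# One list of Y values per category of X
groups = [grp["Y"].values for _, grp in df.groupby("X")]

if len(groups) == 2:
    # Two categories: independent two-sample t-test (H0: the two group means are equal)
    stat, p_value = stats.ttest_ind(*groups)
else:
    # More than two categories: one-way ANOVA (H0: all group means are equal)
    stat, p_value = stats.f_oneway(*groups)

# A small p-value (e.g. < 0.05) suggests Y depends on X, so X is worth keeping
```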

When to use Linear Regression

  • Predict Y where X is known but Y is unknown
  • To explain and quantify the dependence of Y on different X, for strategic decision making

Real World Examples

  1. Market Mix Models / Marketing Mix Models

A company can spend its marketing / advertising budget on TV, Radio and Newspapers. The target is to maximise sales.

TV Spend | Radio Spend | Newspaper Spend | Sales
500 | 100 | 250 | 1500
250 | 100 | 500 | 1600
100 | 250 | 500 | 1700

To begin with, we plot one of the Spends, say TV Spend against Sales. Through this scatter plot, we need to find the predictor line.

This line is Y = aX + b

Linear Regression is not about predicting the exact value of Y for a given X; it is about getting predictions that minimize the error on an overall basis. The measure of error used here is MSE or RMSE, usually MSE.

MSE = average((actual - predicted)²)

= average((actual - aX - b)²)

We need to minimise MSE with respect to both a and b.

Techniques to minimize (or maximize) something

If there is a graph between X and some f(X) that looks like a bell-shaped curve, the maximum is at the point where the slope = 0, because after that point f(X) starts to decrease. The same logic applies to a minimum: MSE as a function of a is a U-shaped (convex) curve, so the minimum is where the derivative = 0.

  • Calculate MSE
  • Differentiate MSE with respect to a (and b)
  • Find a such that the derivative = 0
  • Find b = average(actual) - a * average(X)

This is called the Ordinary Least Squares (OLS) method for minimizing MSE.

a = tan(θ), where θ is the angle between the regression line and the horizontal.
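
A minimal sketch of the OLS idea for a single X, using numpy and the illustrative spend/sales numbers from the table above:

```python
import numpy as np

x = np.array([500, 250, 100], dtype=float)     # TV spend
y = np.array([1500, 1600, 1700], dtype=float)  # sales

# Setting the derivatives of MSE w.r.t. a and b to zero gives the closed-form OLS solution
a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - a * x.mean()

mse = np.mean((y - (a * x + b)) ** 2)
print(a, b, mse)
```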

Building this practically — in Excel

If there is a single X

  • Option 1: Right-click on the scatter plot, choose Add Trendline, set the option to Linear, and tick Display Equation on chart. The coefficients a and b are displayed on the chart
  • Option 2: Use Solver
  • Option 3: Data Analysis \ Regression

If there are multiple X

  • Use Solver: minimise MSE by changing a, b, c, d in Y = ax + by + cz + d
  • Data Analysis \ Regression. Coefficients column gives the values of a, b, c, d. + other Regression Statistics, Anova Statistics etc.

Understanding the accuracy and stability of the model

Regression Stats

  • Multiple R — provides the correlation between the combined input variables and the output variable. Pearson’s correlation only gives the correlation between a single X and Y. A high “Multiple R” signifies a good model
  • R Square

Var(Y) = average((Y - average(Y))²)

= average((Y - Pred(Y) + Pred(Y) - average(Y))²)

= average((Y - Pred(Y))²) + average((Pred(Y) - average(Y))²) + 2 * average((Y - Pred(Y)) * (Pred(Y) - average(Y)))

The cross term is zero for an OLS fit, so

= average((Y - Pred(Y))²) + average((Pred(Y) - average(Y))²)

Variance = Information

total information = information that we captured + information that we missed

TSS = SSE + MSS

R² = MSS / TSS

is an indicator of % information captured in the model

For a single X, Multiple R equals the absolute value of Pearson’s correlation coefficient.

  • Adjusted R square — with more X variables, R Square will keep increasing, because each variable contributes at least some random information. Adjusted R Square is a penalized version of R Square; if it is much lower than R Square, it means we have included X variables that do not contribute much and should remove them
  • Standard Error = TODO
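
The R Square and Adjusted R Square calculations above can be sketched in a few lines of numpy (assuming arrays `y` of actuals and `y_pred` of predictions from a model with `k` predictors; the names are illustrative):

```python
import numpy as np

def r_squared_stats(y, y_pred, k):
    n = len(y)
    tss = np.sum((y - y.mean()) ** 2)   # total sum of squares: total information
    sse = np.sum((y - y_pred) ** 2)     # error sum of squares: information missed
    mss = tss - sse                     # model sum of squares: information captured
    r2 = mss / tss
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)  # penalised for the number of predictors
    return r2, adj_r2
```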

ANOVA Metrics

Since the model is built on a sample, ANOVA metrics test whether it is also suitable for the population.

  • H0: All coefficients are equal to 0. H1: At least one coefficient is not equal to zero. If H0 is rejected, we know at least one coefficient is non-zero. If H0 is accepted, we don’t use this model.
  • Total Sum of Squares = Regression SS + Residual SS. Regression SS / TSS = R Square
  • MS = SS / df
  • F = MS(Regression) / MS(Residual)
  • This F follows an F Distribution, so we can calculate the P value, shown as “Significance F”. If this is < 0.05, then at least one of the coefficients is non-zero.

Now we need to check which coefficients are non-zero.

  • T Stat follows a T Distribution
  • P Value — H0: Beta is zero. H1: Beta is non-zero
  • Once you identify the variables which are insignificant, remove them one by one and re-run the regression.
  • After multiple iterations, if R Square and Adjusted R Square are still far apart, look at the absolute values of the coefficients and drop the one with the smallest value.
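
A minimal sketch of this iterative workflow with statsmodels (assuming a DataFrame `X` of predictors and a Series `y` as the target; both names are illustrative):

```python
import statsmodels.api as sm

model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.rsquared, model.rsquared_adj)   # R Square and Adjusted R Square
print(model.f_pvalue)                       # "Significance F" from the ANOVA table
print(model.pvalues)                        # per-coefficient p-values (t-tests)

# Backward elimination: drop the least significant variable and refit, one at a time
worst = model.pvalues.drop("const").idxmax()
if model.pvalues[worst] > 0.05:
    X = X.drop(columns=[worst])
```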

Elements of a Good Regression model

  • All variables should be significant
  • R Square should be as high as possible
  • Adjusted R Square should be as close to R Square as possible. There is no well defined threshold for this variation, but 2–5% is an acceptable variation.

Marketing Mix — Real Life

In real life, there are 2 factors that we need to consider

  • The Sales/Spend graph looks like an S curve. Initial small spends result in large gains in sales; subsequently, we get lower benefits from increased spends. For simplification, we choose Linear Regression
  • There is a time lag factor between spend and converted sales. To cater to this, we take the Beta (coefficient) from a distribution between Sales vs Weeks since Spend. The Beta would not be constant.

If the model is over predicting or under predicting, then we can tune the model by changing the intercept or the slope. If nothing works, we rebuild the model

Assumptions of Linear Regression

Assumption | Check | Rectifier

1. Y should have a linear relationship with all X | Scatter plot | Log / Exponent / Power transformation

2. Y follows a normal distribution / residuals (Actual - Predicted) should be normal | Histogram, Q-Q plot | Log / Exponent / Power / Box-Cox transformation

3. X values should not be inter-related with each other (no multicollinearity) | Variance Inflation Factor (VIF) | Drop variables that have high VIF; Ridge Regression, Lasso Regression, Elastic Net Regression (a combination of Ridge and Lasso)

4. Variance of residuals should be constant across all predicted values (homoscedasticity) | Scatter plot | Log / Exponent / Power transformation

5. Residuals of one observation should not be correlated with residuals of the following observations (no autocorrelation) | Durbin-Watson test | Log / Exponent / Power / Box-Cox transformation

Details of each Assumption

1. Y should have a linear relationship with all X

self explanatory / TODO

2. Y follows normal distribution / residuals (Actual — Predicted) should be normal

This is important because, if the residuals are not normal, F does not follow an F distribution and “Significance F” is not computed correctly.

Q-Q Chart

X axis: Theoretical Quantiles = Quantiles of N(0,1)

Y axis: Values From Observations = Quantiles of (R - Mean(R)) / Sd(R), where R are the residuals

If the points on both axes come from a normal distribution, they will form a straight line.

Rectifier

If the residuals don’t follow normality, you can transform Y (the target variable) using Log/Exp/Power. Worst case, use Box-Cox, which can transform most positive-valued variables towards a normal distribution.
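
A minimal sketch of the normality check and the Box-Cox rectifier with scipy (assuming `resid` holds the residuals and `y` the target values; Box-Cox requires strictly positive values):

```python
from scipy import stats
import matplotlib.pyplot as plt

# Q-Q plot of standardised residuals against the quantiles of N(0, 1)
stats.probplot((resid - resid.mean()) / resid.std(), dist="norm", plot=plt)
plt.show()

# Box-Cox returns the transformed Y and the fitted lambda (only works for positive values)
y_transformed, fitted_lambda = stats.boxcox(y)
```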

3. X values should not be inter-related with each other. No multi collinearity

We can check correlation between the X variables using a correlation matrix. However, it only shows the correlation between pairs of variables; there may be correlation involving more than two variables. If there is multicollinearity in the model, the signs of the coefficients between Xn and Y keep changing, we lose the directionality of the relationships, and confidence in the model is reduced.

VIF is a better technique to identify multicollinearity.

VIF = 1 / (1 - R²)

where R² comes from regressing that X variable on all the other X variables, as in the table below.

Variable | Regression with the Other Variables | VIF (ranges from 1 upwards)
X1 | X1 = b0 + b1 * X2 + b2 * X3 + b3 * X4 | VIF(X1)
X2 | X2 = b0 + b1 * X1 + b2 * X3 + b3 * X4 | VIF(X2)
X3 | X3 = b0 + b1 * X1 + b2 * X2 + b3 * X4 | VIF(X3)
X4 | X4 = b0 + b1 * X1 + b2 * X2 + b3 * X3 | VIF(X4)

If the VIF of a particular variable is > 10, then more than 90% of the information captured in this variable is already captured in the other variables.

Rectifier

If this is the case, drop the variable with the highest VIF, then re-run the regression. Drop variables one by one.

If the variable cannot be dropped, because it has business significance, then make its Beta very small. How? Using Ridge / Lasso regression techniques.
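
A minimal sketch of the VIF check with statsmodels (assuming a DataFrame `X` containing the candidate predictors):

```python
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_const = sm.add_constant(X)
vif = {col: variance_inflation_factor(X_const.values, i)
       for i, col in enumerate(X_const.columns) if col != "const"}

# Drop the variable with the highest VIF (if > 10), recompute, and repeat one variable at a time
```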

4. Variance of Residual should be constant across all values of predicted

Plot Predicted (x-axis) vs Residual (y-axis)

If the spread of the residuals grows or shrinks as the predicted value changes, that is heteroskedasticity. In such cases, the Betas will be unstable.

It is not as bad a problem as multicollinearity, so we can live with a little bit of heteroskedasticity.

Rectifier

Take a Log / Exp of Y and check if the problem is resolved.

5. Residuals of one observation should not have correlation with residuals from following observations

We find the correlation between residuals 1 to N-1 and residuals 2 to N (i.e. the lag-1 residuals) to check whether the residuals are correlated with their own lagged values.

When we do Regression in Python, we get a stat called Durbin-Watson. If its value is close to 2, there is no auto correlation. If this value is drastically different from 2, there is high auto correlation.
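
A minimal sketch of both checks (assuming the residuals are available as a numpy array `resid`):

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

lag1_corr = np.corrcoef(resid[:-1], resid[1:])[0, 1]  # correlation of residuals with their lag-1 values
dw = durbin_watson(resid)  # close to 2: no autocorrelation; towards 0 or 4: strong autocorrelation
```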

How to select variables if there are lot of variables

If there are K potential variables, use Max(20, Sqrt(K)) variables in regression.

Logistic Regression

Predict Y when Y is categorical. It is not continuous. This is a classification problem, as we need to predict the category of Y

We cannot use linear regression if Y is not continuous. This is because Y will not be normal.

Also, in linear regression, we try to predict the line. In this case, there is no line.

A regression line is not bounded between 0 and 1; it spreads between +/- infinity. When Y is categorical, the quantity we model, P(Y=1), is finite and bounded between 0 and 1 (mathematically).

Terminology

  • Odds of Horse A winning = # times A won / # times A did not win = P(A) / (1 - P(A))
  • Odds Ratio A winning wrt B = Odds of A / Odds of B

Logistic Regression states:

In case of Binary Classification problem,

log(P(Y=1) / (1 - P(Y=1))) = B0 + B1X1 + B2X2 + …

Which means, log of odds is a linear model.

We can build a model on Y=0, but as convention we always build for Y =1. Y=1 is the scenario which you are trying to predict.

  • P(Y=1) will always lie between 0 and 1.
  • Odds vary between 0 and infinity, so the log of odds varies between -infinity and +infinity, matching the right-hand side

The function log(P(Y=1) / (1 - P(Y=1))) is also called the Logit.

Since the equation is of the form log(P(Y=1) / (1 - P(Y=1))) = B0 + B1X1 + B2X2 + …, we cannot estimate the Betas using the OLS algorithm.

So we need to rewrite it in below manner

P (Y=1) = exp(B0 + B1X1 + B2X2 + …) / (1 + exp(B0 + B1X1 + B2X2 + …))

If P < 0.5, we can say, Y = 0, else Y = 1
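
A minimal sketch of this rewritten form (the feature matrix `X`, coefficient vector `betas` and intercept `b0` are illustrative placeholders, not fitted values):

```python
import numpy as np

def predict_proba(X, betas, b0):
    """P(Y=1) = exp(B0 + B.X) / (1 + exp(B0 + B.X)), i.e. the sigmoid of the linear predictor."""
    z = b0 + X @ betas
    return np.exp(z) / (1.0 + np.exp(z))

y_pred = (predict_proba(X, betas, b0) >= 0.5).astype(int)  # default 0.5 cutoff
```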

Errors in Logistic Regression

Since we cannot use OLS(Ordinary Least Squares) to get the Beta values in this model, we use a method called Maximum Likelihood.

The likelihood is the probability of getting the observations we actually have; Maximum Likelihood chooses the Betas that make this probability as large as possible.

p = P(Y=1)

Likelihood of an observation = p ^ Y * (1-p) ^ (1-Y)

If we have N observations,

Likelihood of all observations = Product of ( p_i ^ Y_i * (1 - p_i) ^ (1 - Y_i) ), where i indexes the observations and p_i = P(Y_i = 1).

The closer this likelihood can be pushed towards 1, the better the model, and the Beta values that maximize it are the ones we need.

Rationale behind maximizing towards 1: the training data has already been observed, so the probability of those observations having happened is 1; a good model should assign them as high a probability as possible.
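
In practice the product above is maximised in log form; a minimal sketch (assuming `y` holds the 0/1 actuals and `p` the predicted P(Y=1) for each observation):

```python
import numpy as np

def log_likelihood(y, p):
    # log of the product of p^y * (1-p)^(1-y) over all observations
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
```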

Maximizing the Likelihood

Newton-Raphson is an iterative method to maximize an expression; it is used when the equations obtained by setting the derivative to zero are too complex to solve directly.

If P(Y=1) > 0.5, then Y = 1, else Y = 0

The 0.5 cutoff comes from the scenario where 0 and 1 are equally likely.

If there is a data set where 0 occurs 99% of the time and 1 occurs 1% of the time, then 0.5 is not a valid cutoff. In this case,

  • we calculate the probabilities of a random sample P(Y=1) = 1 % = 0.01
  • If P(Y=1) > 0.01, then Y = 1 else Y = 0

Getting Predicted Value From Probability

  • One way is, to use 50% probability
  • Second way is, to take the probabilities of occurrence of 0 and 1 in the random sample
  • Third way is, to take a threshold based on past experience and use that as cutoff for classifying Y as 0 or 1. If higher precision is required, we can keep a value close to 80%.
  • Fourth way is, use the cutoff values that make your predictions accurate

Steps to achieve the 4th method

  • First decide a cutoff
  • Predict Y
  • Create a confusion matrix

Accuracy = (TP + TN) / (TP + FP + TN + FN)

If the model is accurate in predicting all the cases correctly, then accuracy =1

Sensitivity = TP / (TP + FN)

Out of actual 1’s, how many did the model predict correctly. Low sensitivity = 0, High sensitivity = 1

Specificity = TN / (TN + FP)

Out of actual 0’s, how many did the model predict correctly. Low specificity = 0, high specificity = 1

Precision = TP / (TP + FP)

Out of all the people the model predicted as 1, what % were actually 1.

Recall = TP / (TP + FN), which is the same as sensitivity.

Out of actual 1’s, how many did the model predict correctly. (The quantity TN / (TN + FN), i.e. out of all the people the model predicted as 0, what % were actually 0, is the negative predictive value.)

Steps:

  • Calculate the confusion matrix for various P between 0 and 1
  • Calculate above metrics for each confusion matrix
  • Depending on which metric you want to optimize for, you can select the probability cutoff.

In a general case, choose the cutoff where sensitivity + specificity is maximum; this is usually close to the point where the sensitivity and specificity curves intersect.
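
A minimal sketch of this cutoff search with scikit-learn (assuming `y_true` holds the actual 0/1 labels and `p` the predicted probabilities):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def metrics_at_cutoff(y_true, p, cutoff):
    y_pred = (p >= cutoff).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }

# Evaluate a grid of cutoffs and pick the one where sensitivity + specificity is maximum
results = {c: metrics_at_cutoff(y_true, p, c) for c in np.arange(0.05, 1.0, 0.05)}
best_cutoff = max(results, key=lambda c: results[c]["sensitivity"] + results[c]["specificity"])
```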

However, all these metrics are useful only when the underlying model is good at predicting Y. How do we check if the model is good?

Goodness of fit of a Classification Model

There are 4 measures of goodness

  1. AUC — area under the curve
  2. C measure or concordance measure
  3. KS or Kolmogorov-Smirnov measure
  4. Gini

These measures are valid for all classification models where the algorithm gives us a probability, e.g. Logistic Regression, Decision Tree, Random Forest etc.

  1. AUC
  • Plot Sensitivity on the y axis and (1 - Specificity) on the x axis
  • This is called the ROC curve
  • AUC is the area under the ROC curve; the 45-degree diagonal corresponds to a random model
  • The greater the area, the better the model
  • A high AUC indicates a very high true positive rate together with a very low false positive rate
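
A minimal sketch with scikit-learn (again assuming actual labels `y_true` and predicted probabilities `p`):

```python
from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, thresholds = roc_curve(y_true, p)   # x axis = 1 - specificity, y axis = sensitivity
auc = roc_auc_score(y_true, p)                # 0.5 = random model, 1.0 = perfect separation
```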

2. Concordance Measure

If the model is good, then P(Y=1) when Y is actually 1 >> P(Y=1) when Y = 0. Therefore, if you take

  • A random observation A from training data, such that actual Y = 1
  • Another random observation B, such that actual Y = 0

and you find that P(Y=1) for A > P(Y=1) for B, then the model is concordant.

There can be 3 possible outcomes: the pair of observations can be concordant, discordant, or tied.

C measure = % concordance for all possible pairs of observations

The C measure should be well above 50% in good models (a random model gives about 50%).

3. KS

  • Sort the predictions based on probabilities, high to low
  • Group the data into deciles
  • Find number of 1s in each decile
  • Find number of 0s in each decile
  • Find cumulative number of 1s in each decile
  • Find cumulative number of 0s in each decile
  • Find % cumulative 1s till each decile = A
  • Find % cumulative 0s till each decile = B
  • Find A — B. This is KS

This defines the separation that model is able to achieve between zeroes and ones.

The decile in which the KS is achieved, is the point till which the model is performing well. In most practical cases, we are not really interested in finding the exact separation between 1s and 0s. All we need is the Top 1s, which we are able to confidently identify.

  • KS can be used as a cutoff point also
  • As a convention, if KS is not achieved within the first 3 deciles, the model is rejected
  • The exception to this rule is when the % of 1s is more than 30%. In general, we assume that the % of 1s in the development data is < 10%
  • A minimum of 50% KS needs to be achieved. Otherwise, the model is rejected.
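
A minimal sketch of the decile-based KS table with pandas (assuming `y_true` and `p` as before):

```python
import pandas as pd

df_ks = pd.DataFrame({"y": y_true, "p": p})
# Decile 1 = the 10% of observations with the highest predicted probability
df_ks["decile"] = pd.qcut(df_ks["p"].rank(method="first", ascending=False), 10, labels=False) + 1

grouped = df_ks.groupby("decile")["y"].agg(ones="sum", total="count")
grouped["zeros"] = grouped["total"] - grouped["ones"]
grouped["cum_pct_ones"] = grouped["ones"].cumsum() / grouped["ones"].sum()
grouped["cum_pct_zeros"] = grouped["zeros"].cumsum() / grouped["zeros"].sum()
grouped["ks"] = grouped["cum_pct_ones"] - grouped["cum_pct_zeros"]

print(grouped["ks"].max(), grouped["ks"].idxmax())  # KS value and the decile where it is achieved
```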

4. Gini

We plot the Deciles on x axis and plot “cumulative % of 1s” on y axis

The higher the area under the curve, the better the model. This is because the 1 values will be concentrated towards the top of the list (the list being sorted by highest probability first).

Gini tells how good the model is: the higher the value, the better the model.

How to choose when there are a lot of variables?
X variables can be continuous or categorical.

  • If the variable is continuous, do a T-Test comparing its mean at Y=0 and Y=1. If the means are the same, the variable is not of much use and can be dropped. Keep variables that have very low P values
  • For categorical variables, do a Chi-Square test; keep variables that have a high chi-square test statistic, i.e. very low P values
  • Calculate VIF for continuous variables and remove multicollinear variables
  • Build a logistic regression and drop variables that have high p-values

Business Use Cases of Logistic Model

  1. Product propensity models — what will the uptake of a new product be? Who are the people who will be interested? The X variables depict previous behaviours of these customers; Y depicts the uptake of the product
  2. Fraud
  3. Bankruptcy
  4. Default on Loan Payments
  5. Choice (Given a customer, which of the car models will be more interesting to the customer)

Case Study — Banking

Credit Risk Modeling

There are 3 types of models created for modeling credit risk.

  1. Probability of Default Model (PD Model) — This is the probability of default on a particular bank loan / credit

2. Loss Given Default Model (LGD Model) — if a customer defaults with amount X on their credit card, the bank expects that Y can be recovered using debt collection agencies.

The LGD model predicts (X - Y) / X * 100%, i.e. the % of the exposure that is lost.

Based on #1 and #2

Loss % = PD * LGD

What % of money is the customer likely to default with.

3. Exposure at Default — Amount of money that a customer defaults with.

Eg.

  • 1-Jan | DPDX (Days Past Due for X amount)
  • 1-Feb | DPD30 (customer has not paid for at least 30 days)
  • 1-Mar | DPD60
  • 1-Apr | DPD90
  • 1-May | DPD120 (This is called Default)

On DPD120, the customer is marked as default and reported to credit bureaus. The case is passed to Recovery Agents.

  • EAD | X (The amount of principal at the time of default)
  • Recovery | Y (assume)
  • Loss Amount | X — Y
  • LGD | (X - Y) / X
  • Expected overall loss across customers | EAD * LGD * PD
  • ALLL (Allowance for Loan and Lease Losses) | amount set apart from profits in the Balance Sheet, to cater to such losses.
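
A small illustrative calculation with made-up numbers (not from the article):

```python
ead = 100000      # Exposure at Default, e.g. outstanding principal X
pd_prob = 0.04    # Probability of Default from the PD model
lgd = 0.60        # Loss Given Default, i.e. (X - Y) / X from the LGD model

expected_loss = ead * pd_prob * lgd   # 100000 * 0.04 * 0.60 = 2400
```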

Steps for building a model

1. Do an Exploratory Data Analysis

2. Share the output with business

3. Outlier treatment

For continuous variables that have outliers, we have 3 options:

  • Drop the observation
  • we can replace by mean
  • we can do flooring / capping

We do not do outlier treatment for all data. We only treat variables that indeed have outliers.

Categorical variables cannot have outliers. If we encode categorical variables as numbers (say A = 1, B = 2, C = 3, D = 4, with D being rare) and then apply capping at the 99th percentile value, the rare category will get replaced by a less rare one (say 3, i.e. C).

4. Missing value treatment

  • In 99% of cases, missing values make up less than 10% of the data. In such cases, we can simply replace them with the mean or mode.
  • If the missing values are > 10%, we can use MICE / a Random Forest model to predict those missing values and replace them in the data.
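
A minimal sketch of both strategies with scikit-learn (assuming a numeric feature DataFrame `df`; IterativeImputer is scikit-learn's MICE-style imputer and is still marked experimental):

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (exposes IterativeImputer)
from sklearn.impute import SimpleImputer, IterativeImputer

# < 10% missing: simple mean replacement (use strategy="most_frequent" for the mode)
df_simple = SimpleImputer(strategy="mean").fit_transform(df)

# > 10% missing: MICE-style iterative imputation, each feature modelled from the others
df_mice = IterativeImputer(random_state=0).fit_transform(df)
```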

5. Variable Selection

We are interested in finding X variables that have a strong relationship with Y.

There are 4 numerical tests that can be used:

  • T Test
  • Chi Square
  • Somer’s D
  • Variance Inflation Factor (VIF)

For a small number of variables, we can create heat maps, though they only show one-to-one relationships.

For a larger number of variables, we can calculate VIF, which also identifies many-to-one relationships.

Do a T test for all the numerical variables. Then take the Top X with low P values

Do a Chi-Square test for all the categorical variables. Then take the top X with high test statistic values.

Logit Plot — Weight of Evidence.

Identify the monotonicity of the relationship between X and Y: Y should either consistently increase or consistently decrease as X increases. In logistic regression we assume a monotonic relationship, so we might have to apply some transformations at this step.

Binning and Bucketing

Somers’ D is a test statistic that tells how related X (a continuous variable) and Y (a categorical variable) are. The higher the value, the higher the (pseudo) correlation and the better the variable.
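
A minimal sketch of a Weight of Evidence calculation for one binned X (assuming a DataFrame `df` with a binned predictor column "X_bin" and a 0/1 target "Y"; the column names are illustrative):

```python
import numpy as np
import pandas as pd

grp = df.groupby("X_bin")["Y"].agg(events="sum", total="count")
grp["non_events"] = grp["total"] - grp["events"]
# WOE per bin: log of (% of all events in the bin) over (% of all non-events in the bin)
grp["woe"] = np.log((grp["events"] / grp["events"].sum()) /
                    (grp["non_events"] / grp["non_events"].sum()))
# Plotting woe against the bins shows whether the relationship with the log-odds is monotonic
```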

Variance Inflation Factor (VIF)

How to select from 2 buckets — continuous and categorical?

For categorical and numerical variables, we apply variable selection independently. How do we identify the individual cutoff?

Select variables in the same proportion as the overall distribution. E.g. there are 200 variables in total: 50 categorical and 150 numerical. If you have to select 20 in total, select 5 categorical and 15 numerical variables. This is not a hard and fast rule though.

How to go from 500 variables to 20 variables?

Don’t go from 500 to 20 variables using just test statistics. Build a funnel: 500 to 200 to 50 to 20. At each stage, reassess whether you need to reduce further or whether you can manage this volume of variables.

6. Build the Model

Training and Testing split

R Square can only be calculated for a numerical Y. For a categorical Y, we can only get an indicative number: Pseudo R Square gives an indicative value of R Square. It is used to prioritize between 2 models; the higher the value, the better the model.

Log Likelihood — Log of the “likelihood” of the observation that we already have.

LLR — Log Likelihood Ratio test — This is equivalent to the ANOVA test from Linear Regression. If its p-value is less than 0.1 (at the 90% significance level) or 0.01 (at the 99% significance level), the model is good: at least one Beta is non-zero.

Z Test

In the P > |z| column, keep variables whose p-value is below the chosen significance level (e.g. 0.01) and drop the variable with the highest p-value otherwise. Run this iteratively, one variable at a time, just like in Linear Regression.
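
A minimal sketch of this step with statsmodels (assuming `X_train` is a DataFrame of the selected predictors and `y_train` the 0/1 target):

```python
import statsmodels.api as sm

logit_model = sm.Logit(y_train, sm.add_constant(X_train)).fit()
print(logit_model.prsquared)    # Pseudo R Square
print(logit_model.llf)          # Log Likelihood
print(logit_model.llr_pvalue)   # LLR test p-value (analogue of "Significance F")
print(logit_model.pvalues)      # P > |z| per coefficient: drop the worst variable and refit
```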

7. Calculate Gini / AUC

Use this to find the goodness of the model

8. Decide cutoff using Sensitivity / Specificity

Use this to decide the cutoff at which we predict Y as 0 or 1.

9. Confusion Matrix
