Linear Regression - Part 1 | Opening The Blackbox
There are a lot of classification and regression techniques that we use as part of our jobs as data scientists. Some of them feel intuitive and some of them do not. However, it's very important to build intuition around these techniques in order to use them wisely. As part of the Opening the Blackbox series, we will try to do just that, using visualisations and code to build intuition around these techniques.
In the first post of this series, I will talk about Linear Regression. In our interviews at Simpl, when we ask potential candidates about Linear Regression, the general reaction we get is “Ah, I have nailed that one down”. However, more often than not we get wrong answers when we try to drill down further into the workings of Linear Regression.
So, what's the first thing that pops into your mind when you hear about Linear Regression?
Most likely it is something similar to the above graph, which means you are not completely oblivious to the term. The rest of this post will focus on getting the elements of linear regression right and then coding some linear regression models from scratch.
By the way, if you are wondering how I generated the above graph, here is a small snippet in Python that will help you replicate something similar.
import pandas as pd
import numpy as np
import seaborn as sns; sns.set(color_codes=True)

def create_dummy_dataset(std_ev=20, b=30):
    """Generate x uniformly in [0, 10) and y = b*x plus Gaussian noise"""
    x = np.random.rand(100) * 10
    error = np.random.randn(100) * std_ev
    return x, (b * x) + error

x, y = create_dummy_dataset()
data = pd.DataFrame({'x': x, 'y': y})
g = sns.lmplot(x='x', y='y', data=data, height=10, aspect=1)
What is Linear Regression?
Linear regression, as the name suggests, is a linear way to model the relationship between independent variable(s) and dependent variable(s). A Simple Linear Regression models the relationship between one independent variable and one dependent variable, while a Multiple Linear Regression models the relationship between multiple independent variables and a dependent variable. There are many more extensions of Linear Regression, like Multivariate Linear Regression, Generalised Linear Models and Hierarchical Linear Models; however, for the scope of this post, we will develop our intuition around Simple and Multiple Linear Regression.
Before we proceed further, notice how many times I have repeated the terms “independent” and “dependent” variables above. For simplicity of writing, and also to follow the standard convention, let's call the independent variables (X0, X1, X2, …, Xn) and the dependent variable Y. Symbols are a great way to scale mathematical formulas and theory within a limited space of writing, so we will follow some of these standard notations wherever needed going forward.
Simple Linear Regression
Let's take an example dataset and problem to understand Simple Linear Regression.
We will use the Boston House Price dataset that is available as part of the Scikit Learn library in python.
from sklearn.datasets import load_boston
boston = load_boston()
The object boston now contains the dataset along with some metadata that can be accessed via different attributes. Let's explore what is available on this object.
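For example, a minimal way to see what the object exposes (the exact keys can vary slightly across scikit-learn versions):
# Inspect the Bunch object returned by load_boston
print(boston.keys())
print(boston.data.shape)       # the feature matrix
print(boston.feature_names)    # names of the 13 features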
If you want to explore the meaning of the different features in the dataset, you can do print(boston.DESCR)
Let's go ahead and create a dataframe from this dataset so that we can start doing some fun stuff with it.
import pandas as pd
data = pd.DataFrame(boston.data)
data.columns = boston.feature_names
data['target'] = boston.target
The above lines create a master dataset in the form of a pandas dataframe, data, for us to work on.
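Before plotting anything, it is worth a quick sanity check on the shape and the first few rows of the dataframe:
print(data.shape)   # 506 rows, 13 features plus the target column
print(data.head())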
Now let's visualise every independent variable (X) against the dependent variable (y), which in this case is target.
import seaborn as sns; sns.set(style="ticks", color_codes=True)

g = sns.pairplot(data,
                 aspect=0.5,
                 height=10,
                 x_vars=["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"],
                 y_vars=["target"])
By looking at the above plots, one can easily see a clear trend in plot 6 and plot 13 (the last plot). These plots correspond to the variables RM and LSTAT. For the purpose of understanding Simple Linear Regression, we will use the variable RM to predict the target.
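If you want to back the visual inspection with a number, a quick (and admittedly rough) check is the correlation of each feature with the target; RM and LSTAT should stand out, with a strong positive and a strong negative correlation respectively:
# Pearson correlation of every feature with the target, sorted
print(data.corr()['target'].drop('target').sort_values())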
A Simple Linear Regression can be represented by the formula below:
y = β0 + β1X + e
In the above equation,
- y is our dependent variable or target, in this particular case Median value of owner-occupied homes in $1000's
- X is the independent variable, in our case RM
- β0 is the intercept of the regression line
- β1 is the slope of the regression line
- e is the error term that explains the deviation between actual and predicted value
e = y - β0 - β1X
or
e = y - (β0 + β1X)
or
e = y - ŷ
The objective is to choose β0 and β1 so as to minimise e. Before minimising e, we need to be clear about what exactly is being minimised: the error terms of individual observations can be positive or negative and can cancel each other out. To prevent this from happening, we minimise the sum of the squared errors, i.e. Σe².
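To make this concrete, here is a tiny sketch that computes Σe² for two candidate lines (the guessed coefficients here are made up purely for illustration); whichever pair gives the smaller sum is the better line, and OLS finds the best pair analytically.
def sum_squared_errors(df, x, y, beta0, beta1):
    """Sum of squared errors for a candidate line y = beta0 + beta1*x"""
    residuals = df[y] - (beta0 + beta1 * df[x])
    return (residuals ** 2).sum()

# Two arbitrary guesses for the target ~ RM relationship
print(sum_squared_errors(data, 'RM', 'target', beta0=0, beta1=4))
print(sum_squared_errors(data, 'RM', 'target', beta0=-30, beta1=8))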
We will be using the Ordinary Least Squares (OLS) approach to minimise the above expression. While selecting a regression technique like OLS, we should take note of its underlying assumptions (I have added an overview of the OLS assumptions in the appendix at the end of this post). If we do the mathematics of minimising the above expression, we reach the result below; you can have a look at the pen & paper derivation for it in my post here.
β1 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)²
β0 = Ȳ − β1X̄
The numerator term of β1 is the covariance (up to a constant factor) and the denominator term, as you would probably know, is the variance (up to the same factor, which cancels in the ratio). The code example below implements the above formulas as pandas functions.
import pandas as pd
import matplotlib.pyplot as plt

def variance(df, x):
    """Sum of squared deviations of column x from its mean (variance up to a constant factor)"""
    return (pow(df[x] - df[x].mean(), 2)).sum()

def covariance(df, x, y):
    """Sum of cross-deviations of columns x and y from their means (covariance up to a constant factor)"""
    return ((df[x] - df[x].mean()) * (df[y] - df[y].mean())).sum()

def calc_beta1(df, x, y):
    """calculates beta1 (β1) = covariance / variance"""
    var = variance(df, x)
    cov = covariance(df, x, y)
    return cov / var

def calc_beta0(df, x, y):
    """calculates beta0 (β0) = mean(y) - β1 * mean(x)"""
    beta1 = calc_beta1(df, x, y)
    return df[y].mean() - (beta1 * df[x].mean())

def simple_linear_regression(df, x, y):
    """perform regression and add the fitted values as a y_hat column"""
    beta0 = calc_beta0(df, x, y)
    beta1 = calc_beta1(df, x, y)
    print("y = " + str(beta0) + "+" + str(beta1) + "X")
    df['y_hat'] = beta0 + (beta1 * df[x])
    return df

df_with_prediction = simple_linear_regression(df=data, x='RM', y='target')

# Plot the observed points and the fitted regression line
x = df_with_prediction['RM']
y = df_with_prediction['y_hat']

g = sns.FacetGrid(data, height=6)
g = g.map(plt.scatter, "RM", "target", edgecolor="w")
plt.plot(x, y, color='r')
plt.show()
Output:
And just like that, you have created your own Simple Linear Regression model.
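If you want an independent reference for the coefficients, numpy's polyfit (with degree 1) fits the same least-squares line; assuming the dataframe from above, its slope and intercept should match our β1 and β0 up to floating-point noise:
import numpy as np

# np.polyfit returns the coefficients of the least-squares fit, highest degree first
slope, intercept = np.polyfit(data['RM'], data['target'], 1)
print("beta0 (intercept):", intercept)
print("beta1 (slope):", slope)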
It is important for us to also evaluate the performance of the prediction we have just made. We will look at two widely used metrics, namely Root Mean Square Error and R-Squared.
Root Mean Square Error
RMSE compares the predicted and observed values and summarises the difference. It is expressed as
RMSE = √( Σ(ŷi − yi)² / n )
import math

def rmse(df, actual, prediction):
    """Root Mean Square Error between the actual and predicted columns"""
    sum_sq_error = (pow(df[prediction] - df[actual], 2)).sum()
    return math.sqrt(sum_sq_error / df[prediction].count())

rmse(df_with_prediction, 'target', 'y_hat')
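As a quick cross-check (scikit-learn is available anyway, since we loaded the dataset from it), taking the square root of mean_squared_error should agree with our hand-rolled rmse():
import numpy as np
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(df_with_prediction['target'], df_with_prediction['y_hat'])
print(np.sqrt(mse))   # should match the rmse() value above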
We get an RMSE of ~6.60. Now let's calculate the R² value.
R-Squared
R-Squared is a measure of how close the actual values are to the fitted regression line. It is also known as the coefficient of determination, and is defined as the fraction of the variance in the data that is explained by the fitted model. It generally ranges between 0 and 1; it can be negative when the fitted model performs worse than simply predicting the mean. A value of 0 indicates that the model does not capture any of the variance at all, and a value of 1 indicates that the model explains all the variability in the data.
Generally speaking, the higher the R-Squared, the better the fit of your model. However, this should not be taken as gospel, and one should look at the plots as well to see whether the fit actually makes sense.
def r_squared(df, actual, prediction):
    """R² as the ratio of explained variance (SSR) to total variance (SST)"""
    sst = ((df[actual] - df[actual].mean()) ** 2).sum()
    ssr = ((df[prediction] - df[actual].mean()) ** 2).sum()
    return ssr / sst

r_squared(df_with_prediction, 'target', 'y_hat')
This gives an R² of ~0.49, which is not too bad for the simple linear regression that we have just done.
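Again, as a sanity check against a library implementation, scikit-learn's r2_score should land on roughly the same value (for an OLS fit with an intercept, the SSR/SST and 1 − SSE/SST definitions coincide):
from sklearn.metrics import r2_score

print(r2_score(df_with_prediction['target'], df_with_prediction['y_hat']))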
Congrats, we have just written our own code for Simple Linear Regression from scratch. I hope this has helped you develop some good intuition, and that it won't be a black box any more.
In the next part of this post, we will do a similar exercise for Multiple Linear Regression to get better results on the same dataset. And this time we will use Gradient Descent to find the coefficients that minimise our error term.
Appendix
Assumptions of OLS:
- The regression model is linear in the coefficients and the error term.
- The error term has a population mean of zero.
- Independent variables are uncorrelated with the error term.
- Observations of the error term are uncorrelated with each other.
- The error term has a constant variance (no heteroscedasticity), i.e. the spread of the residuals should not change with changing predictions (a residual-plot check for this is sketched after this list).
- Independent variables should not be a perfect linear function of the other independent variables.
- The error term is normally distributed.
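A quick, informal way to eyeball the last few assumptions is to look at the residuals themselves: their spread against the fitted values hints at heteroscedasticity, and their histogram hints at (non-)normality. A minimal sketch, reusing df_with_prediction from above:
import matplotlib.pyplot as plt

residuals = df_with_prediction['target'] - df_with_prediction['y_hat']

# Residuals vs fitted values: look for a roughly constant spread around zero
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.scatter(df_with_prediction['y_hat'], residuals, alpha=0.5)
plt.axhline(0, color='r')
plt.xlabel('fitted values (y_hat)')
plt.ylabel('residuals')

# Histogram of residuals: look for a roughly bell-shaped distribution
plt.subplot(1, 2, 2)
plt.hist(residuals, bins=30)
plt.xlabel('residuals')
plt.show()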