Revisiting Regression Analysis

Vaishnavi Sonawane
Published in The Startup · Aug 24, 2020

In supervised learning, we mostly deal with two types of variables: numerical and categorical. Regression deals with numerical target variables, while classification deals with categorical ones.

Regression is one of the most popular statistical techniques used for predictive modelling and data mining in the world of data science. Put simply,

Regression Analysis is a technique used for determining the relationship between two or more variables of interest.

However, of the 10+ types of regression that exist, generally only two or three are used in practice, with Linear Regression and Logistic Regression being the most common. Today we're going to explore the following four types of regression analysis techniques:

  • Simple Linear Regression
  • Ridge Regression
  • Lasso Regression
  • ElasticNet Regression

We will observe their applications, as well as the differences among them, while working on a Student's Score Prediction dataset. Let's get started.

1. Linear Regression

It is the simplest form of regression. As the name suggests, if the variables of interest share a linear relationship, then the Linear Regression algorithm is applicable to them. If there is a single independent variable (here, Hours), it is a Simple Linear Regression; if there is more than one independent variable, it is a Multiple Linear Regression. The mathematical equation that approximates the linear relationship between the independent (predictor) variable X and the dependent (criterion) variable Y is:

Y = β0 + β1X + ε

where β0 and β1 are the intercept and slope respectively, also known as the parameters or model coefficients, and ε is the error term.

Regression analysis is used to find equations that fit the data. Once we have the regression equation (like the one above, relating X and Y), we can use the model to make predictions. When the correlation coefficient suggests that the data can be used to predict future outcomes, and a scatter plot of the data appears to form a straight line, you can use simple linear regression to find a predictive function. Let's try this out, shall we?

Overview of Dataset

Since we are focusing on regression analysis, I have chosen a relatively small dataset. Hence, we won't be needing much in the way of data wrangling.
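A minimal sketch of the loading and inspection step, roughly as follows (the file name student_scores.csv is an assumption; the data simply has an Hours column and a Scores column):

```python
# Load and take a quick look at the student scores data.
import pandas as pd

df = pd.read_csv("student_scores.csv")  # hypothetical local copy of the dataset
print(df.shape)       # number of rows and columns
print(df.head())      # first few Hours/Scores pairs
print(df.describe())  # summary statistics; no missing values to wrangle
```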

Visualizing Data Distribution
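One way to plot the distributions of both columns, assuming seaborn and the df loaded above:

```python
# Histograms (with KDE curves) of study hours and scores.
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.histplot(data=df, x="Hours", kde=True, ax=axes[0])   # distribution of study hours
sns.histplot(data=df, x="Scores", kde=True, ax=axes[1])  # distribution of scores
axes[0].set_title("Hours")
axes[1].set_title("Scores")
plt.tight_layout()
plt.show()
```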

From the above plots, we can see that the distribution of the data is slightly skewed; we can tell because the observed distributions deviate somewhat from the classic 'bell-shaped' normal curve used as a reference.

Visualizing Relationship Between Variables
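A quick sketch of this plot, drawn in green to match the description below (the exact styling is an assumption):

```python
# Scatter plot of Hours vs Scores with a fitted regression line overlaid.
sns.regplot(x="Hours", y="Scores", data=df, color="green")
plt.title("Hours studied vs Score obtained")
plt.show()
```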

In the above plot, we can see that the relationship between Hours and Scores is linear. The green points are the actual observations, while the green line is the best-fit regression line. Since the line slopes upward, this suggests a strong positive correlation between Scores and Hours. Let's verify.

Visualizing Degree of Correlation Between Variables
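One way to visualize this is an annotated heatmap of the correlation matrix (the colour map is an assumption):

```python
# Correlation matrix between Hours and Scores, annotated on a heatmap.
corr = df.corr()
sns.heatmap(corr, annot=True, cmap="Greens")
plt.show()
print(corr)  # the Hours/Scores coefficient comes out around 0.98 on this data
```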

Clearly, Scores and Hours have a very strong correlation coefficient, i.e. 0.98. Now, let's apply and explore our regression algorithms.

First of all, let's fit our data to the Linear Regression model and check out how the accuracy of the model turns out.
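A minimal sketch of the baseline fit, assuming an 80/20 train/test split with random_state=0 (the split and seed are assumptions, so the exact metric values may differ slightly from those quoted later):

```python
# Baseline: plain linear regression, evaluated with MAE and RMSE.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

X = df[["Hours"]]   # single predictor, kept 2-D for scikit-learn
y = df["Scores"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

lin_reg = LinearRegression().fit(X_train, y_train)
pred = lin_reg.predict(X_test)

print("MAE :", mean_absolute_error(y_test, pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, pred)))
print("R^2 :", lin_reg.score(X_test, y_test))
```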

Now, let’s check out the performance of other models.

2. Ridge Regression

Ridge regression is again a statistical technique, used to analyse multiple regression data that is multicollinear in nature, or when the number of predictor (independent) variables is large relative to the number of observations. Multicollinearity occurs when two or more predictor variables are strongly correlated with one another.

Ridge regression applies the L2 regularization technique.

A regularization technique is a technique used to:

  • Minimize the error between the estimated and actual values/observations, and
  • Reduce (regularize) the magnitude of the feature coefficients, and thereby the cost function, to prevent overfitting.

As ridge regression shrinks (regularizes) the coefficients of its equation (as shown below) towards zero, it introduces some bias. But it can reduce the variance to a great extent, resulting in a better mean squared error, i.e. higher accuracy. The amount of shrinkage is controlled by the parameter λ, which multiplies the ridge penalty (and thereby introduces the bias). The higher the λ, the greater the shrinkage. Thus, we get different coefficient estimates for different values of λ.

The cost function of Ridge regression can be mathematically represented as:

Cost = RSS + λ Σ βj²

where RSS = Residual Sum of Squares, which is nothing but the sum of the squared deviations between the actual values and the values predicted by the model, and the sum runs over all model coefficients βj.

To find λ, cross-validation is used. So, let's find the best estimate of λ for our dataset and see how the accuracy turns out.
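A sketch of this step, reusing the train/test split from the linear regression sketch above (the alpha grid is an assumption; the article reports a best value of about 0.01155):

```python
# Cross-validated search for the ridge penalty (alpha), then a final fit.
from sklearn.linear_model import Ridge, RidgeCV

alphas = np.logspace(-4, 2, 100)  # candidate values from very small to large
ridge_cv = RidgeCV(alphas=alphas).fit(X_train, y_train)
best_alpha = ridge_cv.alpha_
print("Best alpha:", best_alpha)

ridge = Ridge(alpha=best_alpha).fit(X_train, y_train)
pred = ridge.predict(X_test)
print("MAE :", mean_absolute_error(y_test, pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, pred)))
```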

Here, RidgeCV provides built-in cross-validation for the ridge regression algorithm, which is used to find the best estimate of our λ parameter (known here as alpha). We define a list alphas spanning a range from small to large values, pass this list as a parameter to RidgeCV, and obtain our best estimate: alpha = 0.01155 for our dataset. Then, we pass this alpha as a parameter to our Ridge regression model and evaluate its accuracy.

This reduced our Mean Absolute Error from 4.18 to 4.05 and our Root Mean Squared Error from 4.65 to 4.56.

3. Lasso Regression

The acronym 'LASSO' stands for Least Absolute Shrinkage and Selection Operator. As the name indicates, this algorithm performs built-in variable (feature) selection as well as parameter shrinkage. Shrinkage is where data values are shrunk towards a central point, like the mean.

So, Lasso regression uses the L1 regularization technique. The objective function of Lasso regression is:

Cost = RSS + λ Σ |βj|

where RSS = Residual Sum of Squares and λ is the shrinkage parameter.

The differences between Ridge and Lasso regression are as follows:

  • The cost function of Lasso regression also contains a shrinkage parameter, but unlike Ridge regression, it penalizes the absolute value of each estimated coefficient βj rather than its square.
  • In Ridge regression, the shrinkage parameter pushes the estimated coefficients towards zero but never makes them exactly zero. In Lasso regression, it can make an estimated coefficient exactly zero, effectively performing feature selection.

This second property of Lasso regression makes its models easier to interpret.

Now, let's implement a Lasso regression model and observe how the accuracy of the predictions turns out.
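A minimal sketch along these lines, again reusing the split from the linear regression step (letting LassoCV choose its own alpha grid is an assumption):

```python
# LassoCV picks the penalty by 10-fold cross-validation, then we evaluate.
from sklearn.linear_model import LassoCV

lasso_cv = LassoCV(alphas=None, cv=10, max_iter=10000).fit(X_train, y_train)
print("Best alpha:", lasso_cv.alpha_)

pred = lasso_cv.predict(X_test)
print("MAE :", mean_absolute_error(y_test, pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, pred)))
```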

Here, the LassoCV class is used to estimate the best value for the λ parameter (called alpha, as shown in the sketch above). We have used 10-fold cross-validation to find our best estimate. Now, let's look at the evaluation metrics.

The Mean Absolute Error has been reduced to 4.03 from 4.05, and the Root Mean Squared Error has been reduced to 4.55 from 4.56. Note that both Lasso and Ridge optimized the evaluation metrics more effectively than Simple Linear Regression did.

4. ElasticNet Regression

  • ElasticNet Regression uses both L1 and L2 Regularization techniques.
  • It can perform feature selection and regularization(shrinking the parameters) simultaneously.
  • The ElasticNet model can keep selecting 'n' features until saturation, whereas the Lasso model tends to pick just one feature from each group of correlated features.

Hence, we can say that it draws on the advantages of both worlds, i.e. Lasso and Ridge regression. ElasticNet regression can be mathematically represented as:

Cost = RSS + λ1 Σ |βj| + λ2 Σ βj²

where λ1 and λ2 are the weights on the L1 and L2 norms respectively. We can clearly see in the equation that the cost function of ElasticNet regression combines the penalty terms from both Lasso and Ridge regression, along with the RSS as the first term.

Here, we use GridSearchCV to perform 10-fold cross-validation, passing a list of alpha values from smallest to largest so that it gives us the best estimate for alpha. Now, let's implement the ElasticNet model and check how the evaluation metrics turn out.
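A sketch of this step under the same assumptions as before (the alpha grid, scoring metric, and default l1_ratio are assumptions, not the article's exact settings):

```python
# Grid search over the ElasticNet penalty strength with 10-fold CV.
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

param_grid = {"alpha": np.logspace(-4, 1, 50)}
grid = GridSearchCV(ElasticNet(max_iter=10000), param_grid,
                    cv=10, scoring="neg_mean_squared_error")
grid.fit(X_train, y_train)
print("Best alpha:", grid.best_params_["alpha"])

enet = grid.best_estimator_
pred = enet.predict(X_test)
print("MAE :", mean_absolute_error(y_test, pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, pred)))
```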

The Mean Absolute Error reduced from 4.03 to 3.94 and the Root Mean Squared Error reduced from 4.55 to 4.5.

Thus, by choosing an appropriate regularization technique, we can optimize a regression model's performance.

Thank You!
