Step-by-Step Regression Analysis

What is Regression Analysis?

Great Learning
12 min read · May 28, 2020

Contributed by: Rahul Singh

Regression analysis is a statistical method performed to estimate the effect of an independent variable (x) on a dependent variable (y). It helps us estimate the contribution of one or more independent variables (X, or a group of Xs) to the dependent variable (Y).

In other words, it is used to understand or describe the relationships between a set of independent variables and dependent variables.

As an outcome of regression analysis, we get a mathematical equation, often called a regression equation. In its simplest form it can be written as:

y = α + βx

α and β in the above equation are parameters, and they remain constant as x and y change.

“α” is the intercept and “β” is the slope.

By determining the values of “α” and “β”, we can calculate the value of “y” for a given value of “x”.

Regression analysis is a predictive modelling technique used to analyse cause-and-effect relationships. It is primarily used for:

  • Prediction and Forecasting
  • Inferring relationships between the independent and dependent variables.

Where can one apply Regression?

We can apply regression to understand how the attributes of a dataset pertaining to a problem are related to each other. For example, we can use it to determine to what extent the kerb weight of a car impacts its performance in terms of mileage. In a nutshell, it is a study of how some phenomena influence others.

Regression is also useful when we attempt to estimate (predict) the value of a dependent variable using one or more predictors (independent variables). For example, on the basis of outdoor temperature, the hour of the day, and the number of members in the family, we can predict the consumption of electricity for that hour of the day.


Regression is widely used in the fields of Medical Science, Finance, Environmental Science, Econometrics, Social Science, and Computer Science.

What is Linear Regression?

Now that we are familiar with regression analysis, let’s understand linear regression.

Linear Regression is a regression analysis of dependent and independent variables when they exhibit a linear relationship. Linear regression is one of the most popular machine learning algorithms. Its popularity is due to the fact that the technique has been around for over 200 years and is one of the most comprehensible algorithms.

Linear regression is a supervised learning technique, and it assumes that the dependence of Y on X1, X2, …, Xp is linear.

Linear regression has many practical uses. Most applications fall into one of the following two broad categories:

If the goal is prediction, linear regression can be used to fit a predictive model to an observed data set of values of the response and explanatory variables. After developing such a model, if additional values of the explanatory variables are collected without an accompanying response value, the fitted model can be used to make a prediction of the response.

If the goal is to explain variation in the dependent variable that can be attributed to variation in the independent variables, linear regression analysis can be applied to quantify the strength of the relationship between the response and the explanatory variables.


What is Simple Linear Regression?

Linear regression is called simple linear regression when there is only one independent variable. Mathematically, it can be represented as:

y = b0 + b1x1 + E

Where :

y: dependent variable

b0: intercept

b1: coefficient of x1 (independent variable)

E: error

What is Multiple Linear Regression?

Linear regression is called multiple linear regression when there is more than one independent variable. Mathematically, it can be represented as:

y = b0 + b1x1 + b2x2 + … + bnxn + E

Where :

y: dependent variable

b0: intercept

b1: coefficient of x1 (independent variable)

b2: coefficient of x2 (independent variable)

bn: coefficient of xn (independent variable)

E: error

What is a Regression Line?

The regression line is a straight line that best fits the data, such that the overall distance from the line to the points (variable values) plotted on a graph is the smallest. The formula for the best-fitting line (or regression line) is:

y = a + bx,

where:

“b” is the slope of the line,

“a” is the y-intercept,

“x” is the explanatory variable, and

“y” is the dependent variable.

The regression line attempts to define the predicted value of “y” (dependent variable) for a given value of “x” (independent variable). The best-fit regression line minimises the sum of the squared distances between the observed (actual) data points and the predicted ones.

The intercept of the regression line helps us estimate the value of “y” (dependent variable) when “x” (independent variable) is zero, i.e., with no effect from “x”.

What are the assumptions of Linear Regression?

Linear regression with a standard estimation technique (such as ordinary least squares) makes several assumptions about the independent and dependent variables.

The following is a list of the major assumptions made by a linear regression model:

  1. Linearity: The linear regression model assumes that the dependent variable is a linear combination of the regression coefficients and the independent variables.
  2. Lack of perfect multicollinearity in the independent variables: To understand this, let’s first answer the question below.

Why is multicollinearity a problem in linear regression?

If the independent variables are not purely independent of each other, then they are correlated. As a result, a change in one variable will induce a shift in the variables correlated with it. If they possess a strong correlation, it is more difficult to keep one variable unchanged while changing the other.

Hence, this makes it difficult for linear regression models to estimate the relationship between the dependent variable and each individual independent variable, as correlated independent variables change simultaneously.

Multicollinearity reduces the power of linear regression models to identify statistically significant independent variables.

  3. Homoscedasticity (constant variance): The errors have the same variance across all values of the independent variables. Let’s understand why homoscedasticity in the errors is important in linear regression.

Heteroscedasticity is the antonym of homoscedasticity. Due to heteroscedasticity, it becomes difficult to estimate the standard errors of the coefficients, and the standard errors we obtain are unreliable.

  4. Independence of errors: Errors are the deviations of the predicted values from the actual values. This assumption states that the errors of the dependent variable are random and are not correlated with each other.
  5. The quantitative data condition: Regression can only be performed on quantitative data. Regression analysis is not a good technique for finding trends in qualitative data.

How to build a Linear Regression Model in Python using the sklearn and statsmodels libraries

Step#1 Importing the required libraries
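A minimal set of imports for this walkthrough might look like the sketch below; pandas handles the data, matplotlib the plots, and statsmodels and sklearn the two models we will build.

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import statsmodels.api as sm                       # Model1: statsmodels OLS
from sklearn.linear_model import LinearRegression  # Model2: sklearn
from sklearn import metrics                        # evaluation metrics
```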

Step#2 Loading the dataset
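Assuming the training and testing splits live in two CSV files (the file names here are hypothetical placeholders), loading them might look like:

```python
# Hypothetical file names; replace with the paths to your own splits
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
```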

Step#3 Let’s check for any missing or NA values in the training and testing data sets, starting with the training data.
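One way to count missing values per column, shown here for the training set:

```python
# Number of NA values in each column of the training data
print(train.isnull().sum())
```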

We can see that there is a missing value for Y.

Now let’s check the testing data set.
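The same check for the test set:

```python
# Number of NA values in each column of the test data
print(test.isnull().sum())
```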

There is no missing value in the test data.

Step#4 Let’s drop the record with the missing value in the training dataset. As it is only one record, removing it will not be much of a concern.
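A sketch using pandas’ dropna():

```python
# Drop the single record with a missing y value
train = train.dropna()
print(train.shape)  # confirm that exactly one row was removed
```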

Now the counts of x and y values are equal in the training data set.

Step#5 Let’s check for useful descriptive statistical values

Applying the pandas “describe()” function:
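A one-line sketch:

```python
# Count, mean, std, min, quartiles, and max for each column
print(train.describe())
```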

We have the total record count, mean, median, standard deviation, and quartiles for our training data.

Step#6 Let’s define our dependent and independent variables for the training and testing data.
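Assuming the columns are simply named “x” and “y” (an assumption about this particular dataset), the split might look like:

```python
# Independent (x) and dependent (y) variables for both splits
x_train = train["x"]
y_train = train["y"]
x_test = test["x"]
y_test = test["y"]
```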

Step#7 Let’s explore the relationship between the dependent variable and the independent variable.
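A scatter plot is the usual choice here; one way to produce it:

```python
# Visualise the relationship between x and y in the training data
plt.scatter(x_train, y_train, alpha=0.5)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Relationship between x and y")
plt.show()
```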

The graphical depiction above clearly shows a very strong linear relationship between the dependent and independent variables.

Step#8 Let’s add a constant. To add a constant, we will create a new variable.

As we know, the linear regression equation is typically written as follows:

y = a + bx,

Since we already have “y” and “x”, here we are trying to account for “a” by adding a constant to our dataset. It is computationally important. We will be leveraging the “add_constant()” function of the statsmodels library.
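A sketch of adding the constant column:

```python
# add_constant() prepends a "const" column of 1s, which statsmodels
# uses to estimate the intercept
x_train_const = sm.add_constant(x_train)
print(x_train_const.head())
```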

The output shows the column of 1s added as the constant.

Step#9 Let’s define the model and fit it.

We will be defining our first model, and for this model, we will be leveraging the “statsmodels” library.

Defining a variable named “Model1” to store the fitted result:
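A sketch of fitting the model with ordinary least squares:

```python
# Regress y on x (plus the constant term) using ordinary least squares
Model1 = sm.OLS(y_train, x_train_const).fit()
```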

Step#10 Let’s look at different parameters of the model summary and interpret it:
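The summary table comes straight from the fitted model:

```python
print(Model1.summary())
```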

From the “Model1” summary we get “const” and “x1”, which help to create our final regression equation:

y = -0.1073 + 1.0007x

Let’s look into the details of the above results:

R-squared:

This is called the coefficient of determination, and it is interpreted as the proportion of the variance in the dependent variable that is predictable from the independent variable.

An R2 of 0 means that the dependent variable cannot be predicted from the independent variable.

An R2 of 1 means the dependent variable can be predicted without error from the independent variable.

An R2 of 0.991 means that 99.1% of the variance in “y” is predictable from “x”.

Adj. R-squared

The adjusted R2 tells you the percentage of variation explained by only the independent variables that actually affect the dependent variable.

Adjusted R2 will penalise you for adding independent variables that do not improve the model.

An adjusted R2 value of 0.991 tells us the model is very well fitted and does not contain any attribute that is not helping to predict “y”.

If the “R2” and “Adjusted R2” values are close to each other, it means the selected features are relevant and doing great. If they are poles apart, it is a clear indication that the selected features are not relevant.

F-statistic:

F-statistic is used to assess the significance of the overall model.

F-Stat: It is a statistical test that compares the fit of the intercept-only model with that of your model. In simple words, if the p-value for the F-stat is less than your significance level, you can reject the null hypothesis that the intercept-only model fits the data as well as your model.

Prob (F-statistic)

It tests the overall significance of the regression model.

It tests the null hypothesis that all of the regression coefficients are equal to zero. This tests the full model against a model with no variables and with the estimate of the dependent variable being the mean of the values of the dependent variable.

The F value is the ratio of the mean regression sum of squares divided by the mean error sum of squares. Its value will range from zero to an arbitrarily large number.

AIC:

The Akaike Information Criterion (AIC) lets you test how well your model fits the data set without overfitting it.

The AIC score rewards models that achieve a high goodness-of-fit score and penalises them if they become overly complex.

Prob(Omnibus):

Prob (Omnibus) is the result of a statistical test indicating the probability that the residuals are normally distributed.

Skew:

A measure of the symmetry of the residuals. We want to see something close to zero, indicating the residual distribution is normal.

Kurtosis:

It is a measure of the “peakedness” of the data. Higher peaks lead to greater kurtosis.

Condition Number:

It is a measure of the sensitivity of a function’s output to changes in its input.

In the case of multicollinearity, we would observe much larger fluctuations in response to small changes in the data; hence, we hope to see a relatively small number.

Almost every parameter indicates that the model is very well fitted to the training dataset.

The R-squared and adjusted R-squared values help us conclude that the model is very well fitted to the data set.

Step#11 Let’s define our final regression equation using model output parameters
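The fitted intercept and slope can be read from the model’s params attribute; a sketch:

```python
# "const" is the intercept (b0) and "x" is the slope (b1)
b0, b1 = Model1.params
print(f"y = {b0:.4f} + {b1:.4f}x")  # y = -0.1073 + 1.0007x
```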

Step#12: Now let’s visualise the regression equation fitment on the data
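One way to overlay the fitted line on the training data:

```python
plt.scatter(x_train, y_train, alpha=0.5, label="observed")
plt.plot(x_train, b0 + b1 * x_train, color="red", label="fitted line")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()
```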

Step#13: Now let’s check how our model is doing on the testing data, which we kept aside for testing our model performance
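Prediction on the test split might look like this; note the constant must be added here too, since Model1 was trained with it:

```python
# Predict on the held-out test data
x_test_const = sm.add_constant(x_test)
y_pred = Model1.predict(x_test_const)

# Store actual and predicted values side by side
df = pd.DataFrame({"Actual": y_test, "Predicted": y_pred})
print(df.head())
```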

Here, we have defined a variable named “df” to store the actual and predicted values in a data frame.

Step#14: Now let’s visualise using bar plots, how far the actual and predicted values are:
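A grouped bar plot of the first 75 test observations, sketched with pandas’ built-in plotting:

```python
# Blue bars: actual values; orange bars: predicted values (default colours)
df.head(75).plot(kind="bar", figsize=(16, 8))
plt.show()
```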

The figure depicts the first 75 observations from the test data. Orange bars show the predicted values and blue bars show the actual values. We can clearly see that the actual and predicted values are very close to each other; despite some amount of error in our predictions, they remain very close to the actual values.

Step#15: Scatter plot visualisation of actual values of dependent variable vs the predicted value
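A sketch; points lying on the diagonal indicate perfect predictions:

```python
plt.scatter(y_test, y_pred, alpha=0.5)
plt.xlabel("Actual y")
plt.ylabel("Predicted y")
plt.show()
```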

Step#16: Final step to visualise the Model1 performance against various well-established evaluation metrics

For regression problems, the following three metrics are commonly used:

  1. Mean Absolute Error (MAE) is the mean of the absolute values of the errors. It is calculated as:

MAE = (1/n) Σ |yi − ŷi|

The best value for this is 0.0.

  2. Mean Squared Error (MSE) is the mean of the squared errors and is calculated as:

MSE = (1/n) Σ (yi − ŷi)²

The best value for this is 0.0.

  3. Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors:

RMSE = √MSE

The best value for this is 0.0.
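All three metrics are available through sklearn’s metrics module; RMSE is simply the square root of the MSE:

```python
print("Mean Absolute Error:", metrics.mean_absolute_error(y_test, y_pred))
print("Mean Squared Error:", metrics.mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error:",
      np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
```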

We got the following:

Mean Absolute Error: 2.4157718500412586

Mean Squared Error: 9.432922192039317

Root Mean Squared Error: 3.0713062680298293

Before we end our long journey, let’s quickly build Model2, leveraging sklearn’s “LinearRegression()” function.

Step#17: Defining Model2 using the sklearn LinearRegression implementation
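A sketch; sklearn expects a 2-D feature array, hence the reshape:

```python
X_train = x_train.values.reshape(-1, 1)
X_test = x_test.values.reshape(-1, 1)

Model2 = LinearRegression()
Model2.fit(X_train, y_train)

# Coefficient of determination (R2) on the training data
print("R2:", Model2.score(X_train, y_train))
```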

It’s great to see that both models have very close “coefficient of determination” or R2 values on the training data.

Model1 (statsmodels) has an R2 value of 0.991.

Model2 (sklearn) has an R2 value of 0.9907.

Step#18: Model2 performance numbers
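Evaluating Model2 on the test set with the same three metrics might look like:

```python
y_pred2 = Model2.predict(X_test)
print("Mean Absolute Error:", metrics.mean_absolute_error(y_test, y_pred2))
print("Mean Squared Error:", metrics.mean_squared_error(y_test, y_pred2))
print("Root Mean Squared Error:",
      np.sqrt(metrics.mean_squared_error(y_test, y_pred2)))
```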

From the above code snippet output, we can conclude that both models produce practically identical evaluation metrics.

This brings our long journey to an end. In this endeavour, we created two linear regression models on the same data set. We also covered the basics of linear regression. I advise you to repeat the same steps if you want to build a multiple linear regression model.
