How to do Linear Regression using Python

alok ranjan
Nov 29, 2019

Before we start writing Python code for linear regression, let us first answer three questions: What is linear regression? Why do we need it? When can we use it?

Linear regression is used to predict a target variable from given feature variables using a best-fit model (or equation).

We need linear regression to estimate a future value of the target variable, or to find the target value for a new value of a feature variable.

Linear regression can make your decisions more accurate and reduce business losses. For example, suppose I own a retail outlet and want to know how much stock to order for the coming week or month. To decide, I need to know how many customers are likely to visit the outlet and their average purchase quantity. Bread and milk are used daily in most households, and these items have a very short shelf life. If you order too much, you take a loss on the unsold items; if you order too little, you lose sales. Either way the merchant loses money. How do you strike a balance? Linear regression helps here by estimating a value close to your actual demand.

The figure below shows a graphical representation of a linear regression model. The blue line is the best-fit regression line, and the black dots are the individual observations in the dataset.

Scatter plot with Regression line

The linear model has the form

Y = W * X + B

where

Y: target

W: weight

X: feature

B: bias

With multiple features, the model has one weight for each feature.
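The weighted-sum form with several features can be sketched in plain Python. This is only an illustration: the weights, bias, and feature values below are made-up numbers, and `predict` is a hypothetical helper, not part of any library.

```python
# Hypothetical weights, bias, and feature values, chosen only for illustration.
weights = [2.0, -1.0, 0.5]    # W: one weight per feature
bias = 3.0                    # B
features = [4.0, 2.0, 6.0]    # X: the feature values of one observation


def predict(features, weights, bias):
    """Return Y = W1*X1 + W2*X2 + ... + Wn*Xn + B for one observation."""
    return sum(w * x for w, x in zip(weights, features)) + bias


y_hat = predict(features, weights, bias)
print(y_hat)  # 2*4 - 1*2 + 0.5*6 + 3 = 12.0
```

Fitting a regression model amounts to finding the weights and bias that make these predictions as close as possible to the observed targets.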

Linear regression rests on a few assumptions:

1. Linear relationship between X and Y

2. Multivariate normality: X and Y should be multivariate normal. In practice this is usually checked on the residuals, which should be approximately normally distributed

3. Homoscedasticity (constant variance): the residuals should have the same variance at every point of prediction

Residual = actual - predicted

4. No multicollinearity: the feature variables should not be strongly correlated with one another

5. No autocorrelation: autocorrelation is the degree of correlation between values of the same variable across observations. The residual values should not be correlated with each other.
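Some of these assumptions can be checked roughly with a few lines of standard-library Python. The sketch below is only a starting point under stated assumptions: all numbers are made up, the observations are assumed to be sorted by predicted value, and `durbin_watson` and `pearson` are hand-written helpers rather than library functions.

```python
from statistics import mean, pvariance

# Made-up values in observation order (sorted by predicted value), for illustration only.
actual = [10.0, 12.0, 15.0, 19.0, 22.0, 26.0]
predicted = [10.5, 11.5, 15.5, 18.5, 22.5, 25.5]
feature_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
feature_b = [2.0, 4.0, 6.0, 8.0, 10.0, 12.5]

# Residual = actual - predicted
residuals = [a - p for a, p in zip(actual, predicted)]

# Homoscedasticity (rough check): residual variance for low vs. high predictions
half = len(residuals) // 2
var_first_half = pvariance(residuals[:half])
var_second_half = pvariance(residuals[half:])


def durbin_watson(errors):
    """Durbin-Watson statistic; values near 2 suggest little first-order autocorrelation."""
    numerator = sum((errors[i] - errors[i - 1]) ** 2 for i in range(1, len(errors)))
    denominator = sum(e ** 2 for e in errors)
    return numerator / denominator


def pearson(u, v):
    """Pearson correlation; values near +1 or -1 between two features hint at multicollinearity."""
    mean_u, mean_v = mean(u), mean(v)
    covariance = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    spread_u = sum((a - mean_u) ** 2 for a in u) ** 0.5
    spread_v = sum((b - mean_v) ** 2 for b in v) ** 0.5
    return covariance / (spread_u * spread_v)


dw = durbin_watson(residuals)
feature_correlation = pearson(feature_a, feature_b)

print("residual variance (first half, second half):", var_first_half, var_second_half)
print("Durbin-Watson:", dw)
print("correlation between features:", feature_correlation)
```

In this toy data the two features are almost perfectly correlated, which is the kind of pattern the multicollinearity assumption warns about.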

Now we are going to perform the regression. The steps below should help you develop your own code and experiment with different parts of the program.

from sklearn import datasets

import pandas as pd

import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_squared_error

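# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this call needs an older scikit-learn version

boston_data=datasets.load_boston()

boston_data.DESCR # dataset description; this works only for built-in datasets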

boston_data.keys()

X = pd.DataFrame(boston_data.data,columns=boston_data.feature_names)

Y = pd.DataFrame(boston_data.target,columns=['MEDV'])

X.columns

X.head()

X.shape

X.isnull().sum()

train_x,test_x,train_y,test_y=train_test_split(X,Y,test_size=0.3,random_state=10)

model1=LinearRegression().fit(train_x,train_y)
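To see what a 70/30 split with a fixed random state does conceptually, here is a simplified standard-library sketch. It is not scikit-learn's implementation: `simple_train_test_split` is a hypothetical helper, and it will not reproduce the exact rows that `train_test_split` selects.

```python
import random


def simple_train_test_split(rows, test_fraction=0.3, seed=10):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # a fixed seed makes the split repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))  # stand-in for 10 observations
train_rows, test_rows = simple_train_test_split(rows)
print(len(train_rows), len(test_rows))  # 7 3
```

The model is fitted only on the training part, so the held-out part can later show how well it predicts data it has not seen.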

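Calculate R-Squared value

R-squared is the proportion of the variation in the target variable that is explained by the derived equation. You can loosely read it as how closely the model's predictions track the actual values; it is a measure of goodness of fit rather than a literal percentage of correct predictions.

model1.score(train_x,train_y) # 0.75 (R-squared on the training data)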
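R-squared can also be computed by hand as 1 minus the ratio of unexplained to total variation, which is the same definition of the coefficient of determination that `score` uses. The actual and predicted values below are made up for illustration.

```python
# Made-up actual and predicted target values, for illustration only.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 6.5, 9.5]

mean_actual = sum(actual) / len(actual)

# Variation left unexplained by the model (sum of squared residuals)
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))

# Total variation of the target around its mean
ss_tot = sum((a - mean_actual) ** 2 for a in actual)

r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 3))  # 1 - 1.0/20.0 = 0.95
```

Calling `score` on the test split (`test_x`, `test_y`) instead of the training split gives a view of how well the model generalises to unseen data.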

Interpretation of R-Squared value

As a rule of thumb:

R-squared >= 0.7: good fit model, accepted

R-squared >= 0.85: best fit model, accepted

R-squared < 0.5: poor fit model, rejected

If your derived model has an R-squared above 0.7, it is generally accepted for prediction.

Predicting target value for test data

pred_y = model1.predict(test_x)

Calculating Mean Square Error (MSE)

mean_squared_error(test_y,pred_y) # 29.511

Mean squared error is the average of the squared differences between the actual and predicted target values. Lower values, closer to 0, are preferred.
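The calculation behind the mean squared error is short enough to write out by hand. The values below are made up for illustration; the square root (RMSE) is included because it is expressed in the same units as the target.

```python
import math

# Made-up actual and predicted target values, for illustration only.
actual = [24.0, 21.5, 30.0, 18.0]
predicted = [22.0, 23.5, 27.0, 19.0]

# Squared difference for each observation: 4.0, 4.0, 9.0, 1.0
squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]

mse = sum(squared_errors) / len(squared_errors)
rmse = math.sqrt(mse)

print(mse)             # 18.0 / 4 = 4.5
print(round(rmse, 3))  # 2.121
```

Applied to the result above, an MSE of 29.511 corresponds to an RMSE of about 5.43, in the same units as MEDV.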

If you want to perform regression using only one feature, the code above needs a small change, because scikit-learn expects the features as a 2-D array. Here is a slightly modified version that performs bivariate regression.

# Bivariate regression

# I want to find the cost of a house based on its number of rooms.

# target variable: cost_of_house (MEDV)

# feature: no_of_rooms (RM)

x = np.array(X['RM'])

y = np.array(Y['MEDV'])

# Selecting a single column gives a 1-D array, but scikit-learn expects a 2-D feature array, so reshape x and y into single-column arrays.

x=x.reshape(len(x),1)

y=y.reshape(len(y),1)

model2=LinearRegression().fit(x,y)

model2.coef_ # 9.10

model2.intercept_ # -34.67

# The fitted equation has the form

# y = 9.10*x+(-34.67)

# Finding y for a given x

x=6.575

y = 9.10*x+(-34.67)

print(x,y)
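For a single feature, the slope and intercept can also be derived by hand with the closed-form least-squares formulas, which shows where numbers like `coef_` and `intercept_` come from. The room counts and prices below are made up, so the result differs from the 9.10 and -34.67 above; `predict_price` is a hypothetical helper.

```python
# Made-up (rooms, price) observations, for illustration only.
rooms = [4.0, 5.0, 6.0, 7.0, 8.0]
prices = [10.0, 15.0, 22.0, 27.0, 36.0]

mean_x = sum(rooms) / len(rooms)
mean_y = sum(prices) / len(prices)

# Closed-form least squares for one feature:
# slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(rooms, prices))
s_xx = sum((x - mean_x) ** 2 for x in rooms)

slope = s_xy / s_xx                  # plays the role of model2.coef_
intercept = mean_y - slope * mean_x  # plays the role of model2.intercept_


def predict_price(n_rooms):
    """Return the fitted line's value for a given number of rooms."""
    return slope * n_rooms + intercept


print(round(slope, 2), round(intercept, 2))  # 6.4 -16.4
print(round(predict_price(6.575), 2))        # 25.68
```

With the fitted scikit-learn model, calling `model2.predict([[6.575]])` before `x` is reassigned gives the same kind of prediction without typing rounded coefficients by hand.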

I know many of you want to know how to find the significance of each variable used in the model. I will cover this in my next articles.

I hope you find this article helpful. Please try the program above and let me know your feedback.

Please share this with your friends, and do send your valuable feedback and suggestions. Thank you.

Happy Learning

Alok Ranjan

alok ranjan

COO, Co-founder of Nikhil Analytics Bangalore India, Data Scientist, love to explore emerging technology