A Beginner’s Guide to Linear Regression in Python with Scikit-Learn

Prabhat Pathak
Published in Analytics Vidhya
6 min read · Jun 28, 2020

The most important algorithm


What is Machine Learning?

Machine learning (ML) is the field of study that gives computers the ability to learn without being explicitly programmed.

That means the program trains itself with experience: by observing data, it learns patterns it can use to make decisions.

Supervised Learning

In supervised learning, we are given a dataset where the relationship between the input variables and the output is already known. We train the machine on this training dataset, and based on that training the machine predicts results for unseen data.

Example :

(a) Regression — In a regression problem, we try to predict results within a continuous output, for example predicting the price of a house from its size.

(b) Classification — In a classification problem, we instead try to predict results in a discrete output. In other words, we try to map input variables into discrete categories, for example labelling an email as spam or not spam.

Unsupervised Learning

Unsupervised learning allows us to approach problems with little or no idea what our results should look like. We can derive structure from data where we don’t necessarily know the effect of the variables.

We can derive this structure by clustering the data based on relationships among the variables in the data.

With unsupervised learning, there is no feedback based on the prediction results.

TYPES OF UNSUPERVISED LEARNING

Unsupervised learning has two main types:

  • Clustering: Clustering is used for analyzing and grouping data that has no pre-labeled classes or class attributes (see the short sketch below).
  • Association: Association discovers the probability of the co-occurrence of items in a collection.
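
As a tiny illustration of clustering (not part of this tutorial's regression workflow), here is a minimal sketch using scikit-learn's KMeans on some made-up, unlabeled points:

from sklearn.cluster import KMeans
import numpy as np

# Six unlabeled 2-D points; KMeans groups them without any target labels
points = np.array([[1, 1], [1.2, 0.9], [0.8, 1.1], [8, 8], [8.2, 7.9], [7.9, 8.1]])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
print(labels)  # e.g. [0 0 0 1 1 1], two discovered groups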

Linear Regression is part of Supervised learning

Regression analysis is a statistical technique used to describe relationships among variables.

The simplest case to examine is one in which a variable Y, referred to as the dependent or target variable, is related to one variable X, called an independent or explanatory variable, or simply a regressor.

If the relationship between Y and X is believed to be linear, then the equation for a line may be appropriate:

Y = β1 + β2X, where β1 is an intercept term and β2 is a slope coefficient.
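
As a quick illustration with made-up numbers, the two coefficients can be estimated by least squares; NumPy's polyfit does this for a straight line:

import numpy as np

# Hypothetical data: y grows roughly linearly with x
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# A degree-1 fit returns [slope, intercept], i.e. [β2, β1]
beta2, beta1 = np.polyfit(x, y, deg=1)
print(beta1, beta2)  # intercept ≈ 0.05, slope ≈ 2.0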

Libraries used:

Scikit-learn: This is an open-source machine learning library used for various algorithms such as regression, classification, and clustering.

seaborn: Seaborn is a library for statistical data visualization, built on top of matplotlib.

Let's start:

pip install scikit-learn

Once you run the above command, the scikit-learn library will be installed on your system.

In this tutorial, I will be using the Boston housing dataset, which contains information about different houses in Boston. The dataset has 506 samples and 14 variables, including the target variable.

First, I will import the required libraries:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Next, I will import the Boston dataset:

# importing boston dataset
df = pd.read_excel('Boston_Housing.xls')
df.head()
Data frame
Variable description.
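
The Boston_Housing.xls file is assumed to be in your working directory. If you don't have it, one alternative sketch is to rebuild the same data frame from the original CMU StatLib archive, where each record is spread over two lines (column names added manually):

import numpy as np
import pandas as pd

url = 'http://lib.stat.cmu.edu/datasets/boston'
raw = pd.read_csv(url, sep=r'\s+', skiprows=22, header=None)
cols = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
        'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
# Each record spans two lines: 11 values on the first, 3 on the second
df = pd.DataFrame(np.hstack([raw.values[::2, :], raw.values[1::2, :2]]), columns=cols)
df['MEDV'] = raw.values[1::2, 2]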

MEDV is our target variable, i.e. our Y.

Data preprocessing:

It's good practice to check whether there are any missing values in the data. We count the number of missing values for each feature using isnull():

df.isnull().sum()

The output shows a zero count for every column, which means we do not have any null values in this dataset.
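
Had any column contained nulls, a common first pass would be to drop or impute them; a minimal sketch:

# Hypothetical handling, only needed if nulls were present:
df = df.dropna()  # drop rows with any missing value
# or impute a single column with its median:
# df['RM'] = df['RM'].fillna(df['RM'].median())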

Once we have the data frame loaded, the most important step is to identify which features (variables) we will use in the regression algorithm. I believe this is the one area where we should focus, because the accuracy of the regression model depends on the variables we choose.

EDA (Exploratory data analysis)

EDA is a very important step for getting a proper understanding of the data. We will create some visualizations and also look at the correlations to see which variables are more relevant for the model.

df.describe()
Description of the data

From the above output, I try to assess the nature of the dataset: does each column follow a normal distribution, or is it noisy? For this, we look at the trends column by column.

So we observe the values of mean, std, min, median (the 50th percentile, i.e. the middle value of the data), and max.

For the CRIM column there is a huge difference between min and max, and a noticeable gap between mean and median too, which suggests this column is noisy and skewed. Similarly, if you observe all the columns, RM seems to follow a more or less normal distribution. For more clarity, let's create some visualizations.
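
One way to quantify this noisiness is to compare the mean with the median and compute the skew of each column; a quick sketch:

# A mean far above the median signals right skew (e.g. CRIM),
# while mean ≈ median suggests a roughly symmetric column (e.g. RM)
for col in ['CRIM', 'RM']:
    print(col, df[col].mean(), df[col].median(), df[col].skew())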

# There are lots of features in the full pairplot, so it is hard to see the
# trend, though if you observe RM and LSTAT, they follow a roughly normal
# distribution, as does NOX. Let's also plot a smaller subset of columns.
sns.pairplot(df, height=1)
col_study = ['ZN', 'INDUS', 'NOX', 'RM']
sns.pairplot(df[col_study], height=2.5)
plt.show()

From the above visualization, you can see that RM follows a normal distribution, though there are some outliers.

As another filter, we can use the correlation coefficient for feature selection, as I discussed earlier. Let's look at the correlation coefficients:

df.corr()
Correlation Matrix

From the above matrix, you can see that RM has a coefficient of 0.7 against MEDV, i.e. it is positively correlated: the more rooms an apartment has, the higher its price, which is intuitive.

LSTAT has a coefficient of -0.74, which means it is negatively correlated.

One more point when selecting features for a linear regression model is to check for multicollinearity. The features RAD and TAX have a correlation of 0.91; such feature pairs are strongly correlated with each other, and we should not select both of them together for training the model. The same goes for the features DIS and AGE, which have a correlation of -0.75.
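
Rather than eyeballing the matrix, you can also list the most correlated feature pairs programmatically; a small sketch:

# Keep only the upper triangle so each pair appears once
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
print(upper.stack().sort_values(ascending=False).head())  # RAD/TAX ≈ 0.91 tops the list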

plt.figure(figsize=(16,10))
sns.heatmap(df.corr(), annot=True)
plt.show()
Heatmap

So for this tutorial, I am selecting RM as X (the independent variable). For practice, you can use multiple features (maybe include LSTAT) and see how the prediction changes. For now, let's go with RM.

Y (MEDV) = β1 + β2 · X (RM)

X = df['RM'].values.reshape(-1, 1)  # scikit-learn expects a 2-D feature array
Y = df['MEDV'].values               # target variable
from sklearn.linear_model import LinearRegression

In this tutorial I am not splitting the dataset into train and test sets; I will demonstrate that in the next tutorial.

model = LinearRegression()
model.fit(X, Y)
plt.figure(figsize=(10, 5))
sns.regplot(x=X.ravel(), y=Y)  # newer seaborn versions require keyword arguments
plt.xlabel('RM')
plt.ylabel('MEDV')
plt.show()
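
The fitted intercept (β1) and slope (β2) live on the model object. Printing them shows roughly how much MEDV rises per extra room (approximate values for this data):

print(model.intercept_)  # β1 ≈ -34.67
print(model.coef_[0])    # β2 ≈ 9.10: each extra room adds about $9.1k to the predicted price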

In the above graph, you can see there are some outliers. Now let's predict the price based on RM.

# Predict prices for hypothetical houses with 5, 10, 15, 2 and 1 rooms
k = np.array([5, 10, 15, 2, 1]).reshape(-1, 1)
model.predict(k)

Output :

array([ 10.83992413, 56.35046904, 101.86101394, -16.46640281, -25.5685118 ])
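
Note the negative prices for 1 and 2 rooms: the model extrapolates linearly below the observed RM range (roughly 3.6 to 8.8 rooms), so those outputs are not meaningful. Each prediction is just β1 + β2·x, which you can verify by hand:

# Reproduce the first prediction manually: β1 + β2 * 5
print(model.intercept_ + model.coef_[0] * 5)  # ≈ 10.84, matching the array above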

Conclusion:

Linear regression is a really simple and amazing algorithm. The biggest advantage of linear regression models is linearity: it makes the estimation procedure simple and, most importantly, these linear equations have an easy-to-understand interpretation on a modular level. In the next tutorial, we will talk about the accuracy and performance of the model, and much more.

I hope this article helps you and saves you a good amount of time. Let me know if you have any suggestions.

HAPPY CODING.

Prabhat Pathak (LinkedIn profile) is an Associate Analyst.
