Linear Regression on Boston Housing Dataset

Navjot Singh · Published in Analytics Vidhya · Jun 3, 2020 · 3 min read

In my previous blog, I covered the basics of linear regression and gradient descent.

Today we will implement Linear Regression on one of the most famous housing datasets, which contains information about different houses in Boston. The dataset ships with the scikit-learn package.

(https://rajivsworklife.files.wordpress.com/2018/02/boston.jpg?w=675&h=448)


First, we import the necessary libraries: NumPy, Pandas, Matplotlib, Seaborn, and scikit-learn. Then we load the housing dataset from scikit-learn using the load_boston function.
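A minimal sketch of these two steps (note that load_boston was deprecated and later removed from scikit-learn, so this assumes an older release where it is still available):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import load_boston

# Load the Boston Housing dataset
# (requires an older scikit-learn where load_boston still exists)
boston = load_boston()
```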

After loading the dataset, we print its field names using the keys() method.
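For example:

```python
# Inspect the fields bundled with the dataset
print(boston.keys())
# e.g. dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])
# (the exact keys can vary slightly between scikit-learn versions)
```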

Here, data contains the measurements for the different houses, target contains the house prices, feature_names contains the names of the feature columns, and DESCR describes the dataset.

Now let’s convert it into a pandas DataFrame! It’s simple: just call pd.DataFrame() and pass in boston.data. We can then check the first five rows with bos.head().
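A sketch of that step, also attaching the target as a PRICE column since we refer to it below (the variable name bos follows the text):

```python
# Build a DataFrame from the raw data and name the columns
bos = pd.DataFrame(boston.data, columns=boston.feature_names)

# Add the target (house prices) as a new column
bos['PRICE'] = boston.target

# Peek at the first five rows
print(bos.head())
```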

To understand the relationship between the target variable and the features, we will plot the distribution of the target variable PRICE using the displot function from the Seaborn library. Then we will create a correlation matrix and plot it with Seaborn’s heatmap function.
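A possible sketch of both plots, assuming a seaborn version that provides displot and the bos DataFrame built above:

```python
# Distribution of the target variable
sns.displot(bos['PRICE'], kde=True)
plt.show()

# Correlation matrix of all columns, rendered as a heatmap
correlation_matrix = bos.corr().round(2)
sns.heatmap(correlation_matrix, annot=True)
plt.show()
```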

The correlation coefficient ranges from -1 to 1. If the value is close to 1, it means that there is a strong positive correlation between the two variables. When it is close to -1, the variables have a strong negative correlation.

The strongest positive correlations are displayed in blue, while the strongest negative correlations are displayed in cream. These are the features we’d like to use in our model.

We can easily observe that RM and LSTAT are the features most strongly correlated with the target, so we will use RM and LSTAT as our features and plot them against PRICE.
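One way to plot the two chosen features against the price (the figure size is an assumption):

```python
plt.figure(figsize=(10, 5))

# Scatter plots of the two chosen features against the target
for i, col in enumerate(['LSTAT', 'RM']):
    plt.subplot(1, 2, i + 1)
    plt.scatter(bos[col], bos['PRICE'])
    plt.xlabel(col)
    plt.ylabel('PRICE')

plt.show()
```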

Preparing the data for training the model

We concatenate the LSTAT and RM columns using np.c_ from the NumPy library.
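For example:

```python
# Stack LSTAT and RM column-wise to form the feature matrix
X = pd.DataFrame(np.c_[bos['LSTAT'], bos['RM']], columns=['LSTAT', 'RM'])

# Target variable
Y = bos['PRICE']
```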

Splitting the data into training and testing sets
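A sketch of the split, holding out 20% of the samples for testing (the split ratio and random_state are assumptions):

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the samples as a test set
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=5)

print(X_train.shape, X_test.shape)
```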

Training and testing the model

We use scikit-learn’s LinearRegression to train our model on the training set and then make predictions on both the training and test sets.
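A minimal sketch of the training and prediction step:

```python
from sklearn.linear_model import LinearRegression

# Fit the model on the training set only
lin_model = LinearRegression()
lin_model.fit(X_train, Y_train)

# Predict on both the training and the test set
y_train_pred = lin_model.predict(X_train)
y_test_pred = lin_model.predict(X_test)
```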

Model evaluation
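A common way to evaluate the fit is with RMSE and the R² score on both splits, for example:

```python
from sklearn.metrics import mean_squared_error, r2_score

# Root mean squared error and R^2 on the training set
rmse_train = np.sqrt(mean_squared_error(Y_train, y_train_pred))
r2_train = r2_score(Y_train, y_train_pred)
print(f"Train RMSE: {rmse_train:.2f}, R2: {r2_train:.2f}")

# Root mean squared error and R^2 on the test set
rmse_test = np.sqrt(mean_squared_error(Y_test, y_test_pred))
r2_test = r2_score(Y_test, y_test_pred)
print(f"Test RMSE: {rmse_test:.2f}, R2: {r2_test:.2f}")
```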

That’s all for the implementation of Linear Regression. Stay tuned for further blogs.

Thank you
