Introduction to Linear Regression with Boston Dataset

Mohd Saquib
Published in The Wisdom
4 min read · Aug 4, 2019

Linear regression is a popular regression technique for finding the relationship between a dependent variable and one or more independent variables. If there is only one independent variable, it is called simple linear regression; with multiple independent variables, it is called multiple linear regression.

Let us say we have a data set which contains the following features: height, weight, gender, ethnicity, and hair colour.

The data set can be represented as $D = \{(x_i, y_i)\}_{i=1}^{n}$, where $y_i \in \mathbb{R}$ (regression).

What's the difference between classification and regression?

When the target can take only a fixed set of discrete values, such as 0 and 1 or True and False, the problem is classification. When the target is a continuous real-valued number, such as height or price, it is regression.

Geometric Intuition from an Example-

Let us assume we want to predict the height of a student given weight, gender, ethnicity, and hair colour.

In linear regression we try to find the line that best fits the given data. Here we find the relation between height and weight.

The equation of this line can be written as:

height = w1 * weight + w0
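
As a minimal sketch of how w1 and w0 might be obtained, the snippet below uses the closed-form least-squares solution for a single feature. The function name `fit_line` and the weight/height values are made up for illustration; the points were chosen to lie exactly on a line so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Return (w1, w0) minimizing the squared error of y = w1 * x + w0."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); intercept follows from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w1 = sxy / sxx
    w0 = mean_y - w1 * mean_x
    return w1, w0


# Made-up weights (kg) and heights (cm) lying exactly on height = 0.5 * weight + 130.
weights = [50.0, 60.0, 70.0, 80.0]
heights = [155.0, 160.0, 165.0, 170.0]

w1, w0 = fit_line(weights, heights)
print(w1, w0)  # prints: 0.5 130.0
```

With real, noisy measurements the points would not sit exactly on the line, and the same formulas would return the line with the smallest total squared error.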

If we want to find the relationship with more independent variables, the fit becomes a plane instead of a line (with one feature the fit is a line in 2D; with two features it is a plane in 3D). Its equation would be written as:

height = w1 * f1 + w2 * f2 + w0, where f1 and f2 are features.

So the final linear equation, with $\hat{y}_i$ denoting the predicted value for the point $x_i$, is written as:

$\hat{y}_i = w^T x_i + w_0$
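
To make the vector form concrete, here is a small sketch of the prediction as a dot product plus an intercept. The helper name `predict` and the numbers are hypothetical and only illustrate the arithmetic.

```python
def predict(w, x, w0):
    """Compute w^T x + w0 for one data point x with weight vector w."""
    return sum(wj * xj for wj, xj in zip(w, x)) + w0


# Hypothetical weights for two features f1 and f2, plus an intercept.
w = [0.5, 2.0]
w0 = 130.0
x = [60.0, 1.0]

# 0.5 * 60 + 2.0 * 1 + 130
print(predict(w, x, w0))
```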

The values of $w$ and $w_0$ are chosen in such a way that they minimize the error between the predicted and the actual values.

Mathematical formulation for linear regression :
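
One common way to write this objective, assuming the usual squared-error loss over the n training points (the text does not name the loss explicitly), is:

$$
(w^*, w_0^*) = \arg\min_{w,\, w_0} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
$$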

where $\hat{y}_i = w^T x_i + w_0$

This is an optimization problem. To keep the weights from overfitting the training data, a regularization term is often added, so the objective becomes:
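
A sketch of the regularized objective, assuming an L2 (ridge) penalty with strength $\lambda \ge 0$; the text does not say which regularizer is meant, so both the penalty and the symbol $\lambda$ are assumptions:

$$
(w^*, w_0^*) = \arg\min_{w,\, w_0} \sum_{i=1}^{n} \left( y_i - (w^T x_i + w_0) \right)^2 + \lambda \lVert w \rVert_2^2
$$

Setting $\lambda = 0$ recovers the plain least-squares problem, while larger values shrink the weights toward zero.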

Let us understand linear regression with the Boston data set in Python.

About the data set-

Boston data set: the Boston data frame has 506 rows and 14 columns.

This data frame contains the following columns:

crim : per capita crime rate by town.

zn : proportion of residential land zoned for lots over 25,000 sq.ft.

indus : proportion of non-retail business acres per town.

chas : Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

nox : nitrogen oxides concentration (parts per 10 million).

rm : average number of rooms per dwelling.

age : proportion of owner-occupied units built prior to 1940.

dis : weighted mean of distances to five Boston employment centres.

rad : index of accessibility to radial highways.

tax : full-value property-tax rate per $10,000.

ptratio : pupil-teacher ratio by town.

black : 1000(Bk - 0.63)² where Bk is the proportion of blacks by town.

lstat : lower status of the population (percent).

medv : median value of owner-occupied homes in $1000s.
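
A minimal sketch of loading and inspecting such a table with only the standard library. It assumes the data is available as CSV text whose header holds the 14 Boston column names. The two data rows in `SAMPLE_CSV` are made-up placeholder values, not real records, and the helper name `load_rows` is hypothetical.

```python
import csv
import io

# Placeholder CSV text standing in for the real file. The header uses the
# Boston column names; the two data rows are made-up values, not real records.
SAMPLE_CSV = """crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,black,lstat,medv
0.1,0.0,8.0,0,0.5,6.0,60.0,4.0,4,300,18.0,390.0,10.0,22.0
0.2,12.5,7.0,0,0.5,6.5,70.0,5.0,5,310,15.0,395.0,12.0,21.0
"""


def load_rows(handle):
    """Read a CSV with a header line; return (header, rows of floats)."""
    reader = csv.reader(handle)
    header = next(reader)
    rows = [[float(value) for value in row] for row in reader]
    return header, rows


# 1. Import the data.
header, rows = load_rows(io.StringIO(SAMPLE_CSV))

# 2. Print the shape as (number of rows, number of columns).
print("shape:", (len(rows), len(header)))

# 3. Get the feature names (every column except the target, medv).
print("features:", header[:-1])

# 4. Print the values.
for row in rows:
    print(row)
```

With the full data set saved locally, passing an open file handle to `load_rows` instead of the placeholder text should report the 506 rows and 14 columns described above.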

In the above code:

1. Importing the data.

2. Printing the shape of the data set.

3. Getting the feature names.

4. Printing the values.
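
The split, fit, and predict steps can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the article's original listing. The data is synthetic, generated from a known linear rule so the recovered weights can be checked, and it is not the Boston data. The helper names `split_train_test`, `solve`, `fit_linear_regression`, and `predict` are hypothetical, and the fit uses the normal equations without regularization.

```python
import random


def split_train_test(rows, targets, test_fraction=0.25, seed=0):
    """Shuffle indices, then hold out a fraction of the points for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [rows[i] for i in train_idx]
    y_train = [targets[i] for i in train_idx]
    x_test = [rows[i] for i in test_idx]
    y_test = [targets[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_linear_regression(rows, targets):
    """Least squares via the normal equations (A^T A) c = A^T y, A = [1 | X]."""
    a = [[1.0] + list(row) for row in rows]
    k = len(a[0])
    ata = [[sum(r[i] * r[j] for r in a) for j in range(k)] for i in range(k)]
    aty = [sum(r[i] * y for r, y in zip(a, targets)) for i in range(k)]
    coef = solve(ata, aty)
    return coef[0], coef[1:]  # intercept w0, weight vector w


def predict(w0, w, row):
    """Predicted value w^T x + w0 for one row of features."""
    return w0 + sum(wj * xj for wj, xj in zip(w, row))


# Synthetic stand-in data: two features and a target that follows
# y = 5 * f1 - 0.5 * f2 + 3 exactly (no noise).
rng = random.Random(42)
X = [[rng.uniform(4.0, 9.0), rng.uniform(2.0, 30.0)] for _ in range(40)]
Y = [5.0 * f1 - 0.5 * f2 + 3.0 for f1, f2 in X]

X_train, X_test, Y_train, Y_test = split_train_test(X, Y)
w0, w = fit_linear_regression(X_train, Y_train)
Y_pred = [predict(w0, w, row) for row in X_test]
mse = sum((y - p) ** 2 for y, p in zip(Y_test, Y_pred)) / len(Y_test)

print("w0:", round(w0, 3), "w:", [round(v, 3) for v in w])
print("test MSE:", round(mse, 6))
```

With scikit-learn installed, `sklearn.model_selection.train_test_split` and `sklearn.linear_model.LinearRegression` cover the same split and fit steps with far less code.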

In the above code:

1. We use linear regression to model the relation between the features and the price, and then compare the actual prices with the predicted prices.

2. We split the data into train and test sets.

3. We fit the linear regression model on X and Y.


This is how we can use linear regression in Python to find the relationship between a dependent variable and a set of independent variables.

Thanks for your time. If you like this, please give a clap. :)
