Identifying Linear Relationships Between Variables In Machine Learning

Daniel Morales
Feb 2

How to identify linear relationships?


Linear models assume that the independent variables, X, have a linear relationship with the dependent variable, Y. This relationship can be described by the following equation (the equation of a straight line):

Y = β₀ + β₁X₁ + β₂X₂ + … + βₙXₙ

Here, the X are the independent variables and the β are coefficients that indicate the change in Y for a unit change in the corresponding X; for example, if β₁ = 10, a one-unit increase in X₁ is associated with a 10-unit increase in Y. If this assumption is not met, the performance of the model may be poor. Linear relationships can be evaluated with scatter plots and residual plots. A scatter plot shows the relationship between an independent variable X and the target Y.

The residuals (the error) are the difference between the real target and the linear estimate of Y obtained from X, i.e. ε = Y − Ŷ.


Linear models assume that the independent variables X have a linear relationship with the dependent variable Y; if that assumption does not hold, the model may perform poorly. Let's visualize the linear relationships between X and Y. Let's import the following libraries: pandas, numpy, matplotlib, seaborn, and sklearn's LinearRegression.

Find the notebook here:

In [1]:

import pandas as pd
import numpy as np
# for plotting
import matplotlib.pyplot as plt
import seaborn as sns
# for linear regression
from sklearn.linear_model import LinearRegression

Let’s import the Boston Houses dataset from scikit-learn

In [2]:

# the dataset for the demo
from sklearn.datasets import load_boston
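
Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this import fails on recent versions. A minimal sketch of one way to load the same data there, following the workaround from scikit-learn's deprecation notice (it assumes the StatLib mirror at http://lib.stat.cmu.edu/datasets/boston is still reachable, and builds the boston dataframe used below directly):

# load_boston is gone in scikit-learn >= 1.2; fetch the raw file instead
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
# each record spans two lines: 11 values, then 3 more (the last is MEDV)
features = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]
feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
                 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
boston = pd.DataFrame(features, columns=feature_names)
boston['MEDV'] = target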

This is how we load the dataset from scikit-learn:

In [3]:

boston_dataset = load_boston()

Then, we create the dataframe with the independent variables as follows

In [4]:

boston = pd.DataFrame(boston_dataset.data,
                      columns=boston_dataset.feature_names)

In [5]:

boston.head()


To access the values of y, use boston_dataset.target. Create a new column called MEDV from that attribute and display the boston dataframe again.

In [6]:

# add the target
boston['MEDV'] = boston_dataset.target
boston.head()

Here is the information about the data set. Familiarize yourself with the variables before continuing with the exercise.

The objective is to predict the median house value, the MEDV column in this dataset; the remaining variables describe features of the houses and neighborhoods. Run the following line:

In [7]:

print(boston_dataset.DESCR)

Boston house prices dataset
---------------------------

**Data Set Characteristics:**

:Number of Instances: 506

:Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

:Attribute Information (in order):
- CRIM: per capita crime rate by town
- ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS: proportion of non-retail business acres per town
- CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX: nitric oxides concentration (parts per 10 million)
- RM: average number of rooms per dwelling
- AGE: proportion of owner-occupied units built prior to 1940
- DIS: weighted distances to five Boston employment centres
- RAD: index of accessibility to radial highways
- TAX: full-value property-tax rate per $10,000
- PTRATIO: pupil-teacher ratio by town
- B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT: % lower status of the population
- MEDV: Median value of owner-occupied homes in $1000's

:Missing Attribute Values: None

:Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980. N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.

.. topic:: References

- Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
- Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

Now let's create a dataframe with a variable x that follows a normal distribution and has a linear relationship with y. Set a random seed of 29 to ensure reproducibility.

In [8]:

np.random.seed(29)

We define a variable n with the value 200, then draw x as n samples from numpy's randn. Finally, we create y by multiplying x by 10 and adding noise: n more randn samples multiplied by 2.

In [9]:

n = 200

In [10]:

x = np.random.randn(n)

In [11]:

x

In [12]:

y = x * 10 + np.random.randn(n) * 2

In [13]:

y

Now we create a dataframe with Pandas with the values of x and y

In [14]:

data = pd.DataFrame([x, y]).T
data.columns = ['x', 'y']
data.head()

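As a side note, the same two-column dataframe can be built without the transpose by passing the arrays as a dict, which some readers may find more direct:

# equivalent construction, naming the columns directly
data = pd.DataFrame({'x': x, 'y': y})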

Then we create a scatter plot with Seaborn's lmplot, passing x, y, the data, and order=1 (a first-order, i.e. linear, fit):

In [15]:

sns.lmplot(x="x", y="y", data=data, order=1)plt.ylabel('Target')
plt.xlabel('Independent variable')
Image for post
Image for post
Image By Author

So far we have generated x and y randomly with Numpy, but we have not yet used the Boston House Prices dataset. In that dataset, y is MEDV, because it is the variable we want to predict, that is, the price, while the x values are all the other columns. Scatter plots only let us compare two variables at a time, so we need one scatter plot per x variable. Let's look at two of them; a sketch for looping over the rest follows after these two plots.

Plot a scatter plot with Seaborn that maps x to the LSTAT column and y to MEDV.

In [16]:

sns.lmplot(x="LSTAT", y="MEDV", data=boston, order=1)

Although not perfect, the relationship is quite linear. Notice, however, that it is a negative linear relationship: as LSTAT increases, the MEDV price decreases.

Now draw the relationship between CRIM and MEDV.

In [17]:

sns.lmplot(x="CRIM", y="MEDV", data=boston, order=1)

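If you want to inspect every predictor rather than just these two, a minimal sketch (assuming the boston dataframe built above) is to loop over the remaining columns, producing one figure per variable:

# one scatter plot with a linear fit per independent variable
for col in boston.columns.drop('MEDV'):
    sns.lmplot(x=col, y='MEDV', data=boston, order=1)
    plt.show()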

As mentioned above, linear relationships can also be assessed by examining the residuals. The residuals are the difference between the real value and the estimated (predicted) value. If the relationship is linear, the residuals should be normally distributed and centered around zero.

Create the model by instantiating scikit-learn's LinearRegression() and assigning it to the variable linreg:

In [18]:

linreg = LinearRegression()

Let’s continue working with the two-column dataframe (x and y) we created with Numpy. Train the model with scikit-learn's fit method. Remember to pass the values of x as a DataFrame, not as a Series.

In [19]:

# fit the model
linreg.fit(data['x'].to_frame(), data['y'])

Out[19]:

LinearRegression()
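
Since y was generated as roughly 10 times x plus noise, a quick sanity check is to look at the learned coefficient and intercept, which should come out close to 10 and 0 respectively:

# the fitted slope should be close to 10 and the intercept close to 0
print(linreg.coef_, linreg.intercept_)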

Let’s get the predictions by calling scikit-learn's predict method, passing the values of x as a dataframe. Assign the result to the variable pred.

In [20]:

pred = linreg.predict(data['x'].to_frame())

In [21]:

pred

Calculate the residual values, and store them in a variable called error

In [22]:

error = data['y'] - pred

In [23]:

error

Now compare the predictions with the real values using a Matplotlib scatter plot of pred versus y.

In [24]:

plt.scatter(x=pred, y=data['y'])
plt.xlabel('Predictions')
plt.ylabel('Real value')

Let’s now look at the residuals against the independent variable with another Matplotlib scatter plot, error versus x.

In [25]:

plt.scatter(y=error, x=data['x'])
plt.ylabel('Residuals')
plt.xlabel('Independent variable x')


Let’s now plot the distribution of the errors by drawing a histogram with Seaborn's distplot and bins=30.

In [26]:

# note: distplot is deprecated in recent Seaborn versions;
# sns.histplot(error, bins=30, kde=True) is the modern near-equivalent
sns.distplot(error, bins=30)
plt.xlabel('Residuals')

Very well: we have analyzed the linear relationship in the dataset we created with Numpy. Now let’s repeat the same steps with the Boston Houses dataset, using only one variable/column, LSTAT: train the model, make the predictions, and plot the relationship and the residuals.

In [27]:

# call the linear model from sklearn
linreg = LinearRegression()

In [28]:

# fit the model
linreg.fit(boston['LSTAT'].to_frame(), boston['MEDV'])

Out[28]:

LinearRegression()

In [29]:

# make the predictions
pred = linreg.predict(boston['LSTAT'].to_frame())

In [30]:

# calculate the residuals
error = boston['MEDV'] - pred

In [31]:

# plot predicted vs real
plt.scatter(x=pred, y=boston['MEDV'])
plt.xlabel('Predictions')
plt.ylabel('MEDV')

# Residuals plot

# if the relationship is linear, the noise should be
# random, centered around zero, and follow a normal distribution

plt.scatter(y=error, x=boston['LSTAT'])
plt.ylabel('Residuals')
plt.xlabel('LSTAT')
# plot a histogram of the residuals
# they should follow a gaussian distribution
sns.distplot(error, bins=30)

Conclusion

In this particular case, the residuals are centered around zero, but they are not distributed homogeneously across the LSTAT values: both larger and smaller LSTAT values show higher residuals. Furthermore, the histogram shows that the residuals do not follow a strictly Gaussian distribution.
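
If you want a number to back up the visual impression from the histogram, a minimal sketch, assuming SciPy is installed, is a Shapiro-Wilk normality test on the residuals (a small p-value is evidence against normality):

from scipy import stats

# Shapiro-Wilk test of normality on the residuals
stat, p_value = stats.shapiro(error)
print(f"statistic={stat:.3f}, p-value={p_value:.4f}")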

Note: we are building a private Slack community of data scientists. If you want to join us, you can register here: https://www.datasource.ai/en#slack

I hope you enjoyed this read! You can follow me on Twitter or LinkedIn.

Thanks for reading!
