# Identifying Linear Relationships Between Variables In Machine Learning

How do we identify linear relationships?

Linear models assume that the independent variables, `X`, have a linear relationship with the dependent variable, `Y`. This relationship is described by the following equation (the equation of a straight line):

`Y = β0 + β1*X1 + β2*X2 + ... + βn*Xn`

Here, the `X` are the independent variables and the `β` are the coefficients, each indicating the change in `Y` for a unit change in the corresponding `X`. If this assumption is not met, the performance of the model may be poor. Linear relationships can be evaluated using scatter plots and residual plots. A scatter plot displays the relationship between an independent variable `X` and the target `Y`.

The residuals (the error) are the difference between the linear estimate of `Y` obtained from `X` and the real target.
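As a quick illustration of this idea, here is a minimal sketch with hypothetical toy data (using NumPy's `polyfit` rather than the scikit-learn model we train later) showing that the residuals of a well-fitted straight line average out near zero:

```python
import numpy as np

# hypothetical toy data: y depends linearly on x, plus a little noise
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.5, size=100)

# fit a straight line (degree-1 polynomial) by least squares
slope, intercept = np.polyfit(x, y, deg=1)
pred = slope * x + intercept

# residuals: real target minus the linear estimate
residuals = y - pred
print(round(residuals.mean(), 6))  # close to 0 for a least-squares fit
```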

Linear models assume that the independent variables `X` have a linear relationship with the dependent variable `Y`. If the assumption is not true, the model may show poor performance. Let's visualize the linear relationships between `X` and `Y`. First, import the following libraries: pandas, numpy, matplotlib, seaborn, and scikit-learn's `LinearRegression`.

In [1]:

```python
import pandas as pd
import numpy as np

# for plotting
import matplotlib.pyplot as plt
import seaborn as sns

# for linear regression
from sklearn.linear_model import LinearRegression
```

Let’s import the Boston House Prices dataset from `scikit-learn`.

In [2]:

```python
# the dataset for the demo
from sklearn.datasets import load_boston
```

This is how we load the dataset from scikit-learn:

In [3]:

```python
boston_dataset = load_boston()
```

Then, we create the dataframe with the independent variables as follows

In [4]:

```python
boston = pd.DataFrame(boston_dataset.data,
                      columns=boston_dataset.feature_names)
```

In [5]:

```python
boston.head()
```

Out[5]:

The values of `y` can be accessed with `boston_dataset.target`. Create a new column called `MEDV` holding those values, and display the boston dataframe again.

In [6]:

```python
# add the target
boston['MEDV'] = boston_dataset.target
boston.head()
```

Here is the information about the dataset. Familiarize yourself with the variables before continuing with the exercise.

The objective is to predict the median house value, the `MEDV` column in this dataset; the remaining variables describe features of the houses and neighborhoods. Run the following line:

In [7]:

```python
print(boston_dataset.DESCR)
```

Boston house prices dataset

---------------------------

**Data Set Characteristics:**

:Number of Instances: 506

:Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

:Attribute Information (in order):

- CRIM per capita crime rate by town

- ZN proportion of residential land zoned for lots over 25,000 sq.ft.

- INDUS proportion of non-retail business acres per town

- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)

- NOX nitric oxides concentration (parts per 10 million)

- RM average number of rooms per dwelling

- AGE proportion of owner-occupied units built prior to 1940

- DIS weighted distances to five Boston employment centres

- RAD index of accessibility to radial highways

- TAX full-value property-tax rate per $10,000

- PTRATIO pupil-teacher ratio by town

- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town

- LSTAT % lower status of the population

- MEDV Median value of owner-occupied homes in $1000's

:Missing Attribute Values: None

:Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.

https://archive.ics.uci.edu/ml/machine-learning-databases/housing/

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic

prices and the demand for clean air', J. Environ. Economics & Management,

vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics

...', Wiley, 1980. N.B. Various transformations are used in the table on

pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression

problems.

.. topic:: References

- Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.

- Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

Now let's create a dataframe with a variable `x` that follows a normal distribution and shows a linear relationship with `y`. Set a random seed of 29 to ensure reproducibility.

In [8]:

```python
np.random.seed(29)
```

We define a variable `n` with the value 200, then a variable `x` drawn with NumPy's `randn` for `n` samples. Finally, we create the variable `y` by multiplying `x` by 10 and adding `randn` noise of `n` samples multiplied by 2.

In [9]:

```python
n = 200
```

In [10]:

```python
x = np.random.randn(n)
```

In [11]:

```python
x
```

In [12]:

```python
y = x * 10 + np.random.randn(n) * 2
```

In [13]:

```python
y
```

Now we create a dataframe with pandas holding the values of `x` and `y`:

In [14]:

```python
data = pd.DataFrame([x, y]).T
data.columns = ['x', 'y']
data.head()
```

Out[14]:

Then, we create a scatter plot with Seaborn for `x`, `y`, `data`, and with `order=1`:

In [15]:

```python
sns.lmplot(x="x", y="y", data=data, order=1)
plt.ylabel('Target')
plt.xlabel('Independent variable')
```

So far we have generated `x` and `y` randomly with NumPy, but we have not used the Boston House Prices dataset. We know that in that dataset the target `y` is `MEDV`, because it is the variable we want to predict, that is, the price! The values of `x` are all the other columns of the dataset. Scatter plots only allow us to compare two variables, so we must make one scatter plot per `x` variable. Let's look at two of them.

Graph a scatter plot with Seaborn that takes the `LSTAT` column as `x` and `MEDV` as `y`:

In [16]:

```python
sns.lmplot(x="LSTAT", y="MEDV", data=boston, order=1)
```

Although not perfect, the relationship is quite linear. But notice that it is a negative linear relationship: as `LSTAT` increases, the `MEDV` price decreases.

Now draw the relationship between `CRIM` and `MEDV`:

In [17]:

```python
sns.lmplot(x="CRIM", y="MEDV", data=boston, order=1)
```

Out[17]:

As we have already seen, linear relationships can also be assessed by evaluating the residuals. The residuals are the difference between the estimated (predicted) and the real value. If the relationship is linear, the residuals should be normally distributed and centered around zero.

Create the model by instantiating scikit-learn's `LinearRegression()` and assigning it to the variable `linreg`:

In [18]:

```python
linreg = LinearRegression()
```

Let’s continue working with the two-column dataframe (`x` and `y`) we created with NumPy. Train the model with scikit-learn's `fit` method. Remember to pass the values of `x` as a DataFrame, not as a Series.

In [19]:

```python
# fit the model
linreg.fit(data['x'].to_frame(), data['y'])
```

Out[19]:

```python
LinearRegression()
```

Let’s get the predictions by calling scikit-learn's `predict` method, passing the values of `x` as a dataframe. Assign the result to the variable `pred`:

In [20]:

```python
pred = linreg.predict(data['x'].to_frame())
```

In [21]:

```python
pred
```

Calculate the residual values, and store them in a variable called `error`

In [22]:

```python
error = data['y'] - pred
```

In [23]:

```python
error
```

Now plot the predictions against the real values with a Matplotlib scatter plot between `pred` and `y`:

In [24]:

```python
plt.scatter(x=pred, y=data['y'])
plt.xlabel('Predictions')
plt.ylabel('Real value')
```

Let’s now see the distribution of the residuals with another Matplotlib scatter plot, between `error` and `x`:

In [25]:

```python
plt.scatter(y=error, x=data['x'])
plt.ylabel('Residuals')
plt.xlabel('Independent variable x')
```

Out[25]:

Let’s now plot the distribution of the errors by drawing a histogram with Seaborn's `distplot` and 30 bins:

In [26]:

```python
sns.distplot(error, bins=30)
plt.xlabel('Residuals')
```

Very well, we have completed the full analysis of linear relationships on the dataset created with NumPy. Now let’s follow the same steps with the Boston Houses dataset, taking into account only one variable/column, `LSTAT`: train the model, make predictions, and plot the relationship and the residuals.

In [27]:

```python
# call the linear model from sklearn
linreg = LinearRegression()
```

In [28]:

```python
# fit the model
linreg.fit(boston['LSTAT'].to_frame(), boston['MEDV'])
```

Out[28]:

```python
LinearRegression()
```

In [29]:

```python
# make the predictions
pred = linreg.predict(boston['LSTAT'].to_frame())
```

In [30]:

```python
# calculate the residuals
error = boston['MEDV'] - pred
```

In [31]:

```python
# plot predicted vs real
plt.scatter(x=pred, y=boston['MEDV'])
plt.xlabel('Predictions')
plt.ylabel('MEDV')
```

Out[31]:

```python
# Residuals plot
# if the relationship is linear, the noise should be
# random, centered around zero, and follow a normal distribution
plt.scatter(y=error, x=boston['LSTAT'])
plt.ylabel('Residuals')
plt.xlabel('LSTAT')
```

```python
# plot a histogram of the residuals
# they should follow a gaussian distribution
sns.distplot(error, bins=30)
```

# Conclusion

In this particular case, the residuals are centered around zero, but they are not distributed homogeneously across the `LSTAT` values: the largest and smallest `LSTAT` values show higher residuals. Furthermore, the histogram shows that the residuals do not follow a strictly Gaussian distribution.
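If you want to go beyond the visual check, a normality test can quantify the impression from the histogram. Here is an optional sketch with synthetic residuals (it assumes SciPy is installed; `shapiro` and the example data are not part of the exercise above):

```python
import numpy as np
from scipy.stats import shapiro

# hypothetical residuals: one Gaussian set, one clearly skewed set
rng = np.random.default_rng(29)
gaussian_resid = rng.normal(size=200)
skewed_resid = rng.exponential(size=200) - 1.0

# Shapiro-Wilk test: a small p-value rejects the hypothesis
# that the residuals come from a normal distribution
_, p_gauss = shapiro(gaussian_resid)
_, p_skew = shapiro(skewed_resid)
print(f"gaussian p = {p_gauss:.3f}, skewed p = {p_skew:.2e}")
```

In our case, you would call `shapiro(error)` on the residuals of the `LSTAT` model.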

**Note**: we are building a private Slack community of data scientists; if you want to join us you can register here: https://www.datasource.ai/en#slack

I hope you enjoyed this read! You can follow me on Twitter or LinkedIn.

Thanks for reading!
