Basics of Correlation: Using Pandas and SciPy to Calculate Correlations
Hello guys,
Today I am going to explain the basics of correlation and how to compute it in Python. I think it is important to understand correlation when working with simple and multiple linear regression.
Correlation is a measure of how strongly two variables are related to one another. The most common measure of correlation is the Pearson correlation coefficient, which, for two sets of paired data 𝑥𝑖 and 𝑦𝑖 (𝑖 = 1, …, 𝑛), is defined as

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

where $\bar{x}$ and $\bar{y}$ are the sample means.
Here 𝑟 is a number between -1 and 1, with 𝑟>0 indicating a positive relationship (𝑥 and 𝑦 increase together) and 𝑟<0 a negative relationship (𝑥 increases as 𝑦 decreases). When |𝑟|=1, there is a perfect linear relationship, while if 𝑟=0 there is no linear relationship (𝑟 may fail to capture non-linear relationships). In practice, 𝑟 is almost never exactly 0, so values of 𝑟 with small magnitude are treated as “no correlation”. |𝑟|=1 does occur, usually when two variables effectively describe the same phenomenon (for example, height in meters vs. height in centimeters, or grocery bill and sales tax).
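To make the definition concrete, here is a minimal sketch that computes 𝑟 straight from the standard Pearson formula and checks it against NumPy. The helper name pearson_r and all of the numbers are my own, made up for illustration. The two cases mirror the points above: the same quantity in different units, and a strong but non-linear relationship.

```python
import numpy as np

def pearson_r(x, y):
    """Sample Pearson correlation computed directly from the definition."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()   # deviations from the mean of x
    dy = y - y.mean()   # deviations from the mean of y
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Illustrative data (made up for this sketch).
height_m = np.array([1.52, 1.60, 1.68, 1.75, 1.83, 1.91])
height_cm = height_m * 100        # same quantity, different units

x = np.linspace(-3, 3, 61)        # symmetric around zero
y_quadratic = x ** 2              # y depends on x, but not linearly

r_units = pearson_r(height_m, height_cm)
r_quadratic = pearson_r(x, y_quadratic)

print(r_units)        # essentially 1
print(r_quadratic)    # essentially 0, even though y is fully determined by x

# The hand-written version should agree with NumPy's corrcoef().
print(np.isclose(r_units, np.corrcoef(height_m, height_cm)[0, 1]))
```

The quadratic case is worth remembering: a value of 𝑟 near zero only rules out a linear relationship, not a relationship in general.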
Now let's use the Boston house prices dataset. The Boston housing prices dataset is included with sklearn as a “toy” dataset (one used to experiment with statistical and machine learning methods). It contains housing prices for various areas of Boston, along with variables such as the crime rate of an area, the age of the homes, and others. While many applications focus on predicting the price of housing based on these variables, I’m only interested in the correlation between these variables (perhaps this will suggest a model later).
Below I load the dataset and create a Pandas DataFrame from it.
# Note: load_boston was removed in scikit-learn 1.2, so this import needs an older version
from sklearn.datasets import load_boston
import pandas as pd
from pandas import DataFrame
import matplotlib.pyplot as plt
%matplotlib inline

boston = load_boston()
print(boston.DESCR)
boston.data
boston.feature_names
boston.target
# Add all the features to a DataFrame
temp = DataFrame(boston.data, columns=pd.Index(boston.feature_names))
boston = temp.join(DataFrame(boston.target, columns=["PRICE"]))
boston  # final data in DataFrame format
Correlation Between Two Variables
We could use NumPy’s corrcoef() function if we wanted the correlation between two variables, say, the local area crime rate (CRIM) and the price of a home (PRICE).
from numpy import corrcoef

# as_matrix() was removed in pandas 1.0; to_numpy() is its replacement
boston.CRIM.to_numpy()  # As a NumPy array
corrcoef(boston.CRIM.to_numpy(), boston.PRICE.to_numpy())

Output:
array([[ 1. , -0.38583169],
[-0.38583169, 1. ]])
The numbers in the off-diagonal entries correspond to the correlation between the two variables. In this case, there is a negative relationship, which makes sense (more crime is associated with lower prices), but the correlation is only moderate.
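If you only need the single number rather than the 2x2 matrix, pandas can return it directly. The sketch below compares the two approaches on a tiny DataFrame. The values are invented for illustration and are not taken from the Boston data; only the column names are borrowed.

```python
import numpy as np
import pandas as pd

# Made-up numbers for illustration only; they are NOT from the Boston dataset.
df = pd.DataFrame({
    "CRIM": [0.1, 0.5, 1.2, 3.4, 8.9, 15.0],
    "PRICE": [34.0, 29.5, 27.0, 21.0, 16.5, 12.0],
})

# NumPy returns the full 2x2 matrix; the off-diagonal entry is the correlation.
r_numpy = np.corrcoef(df["CRIM"].to_numpy(), df["PRICE"].to_numpy())[0, 1]

# pandas Series.corr() returns the single coefficient (Pearson by default).
r_pandas = df["CRIM"].corr(df["PRICE"])

print(r_numpy, r_pandas)  # the two values should match
```

Both calls compute the same Pearson coefficient, so which one to use is mostly a matter of whether the data already lives in a DataFrame.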
Computing a Correlation Matrix
When we have several variables we may want to see what correlations there are among them. We can compute a correlation matrix that includes the correlations between the different variables in the dataset.
When the data is loaded into a Pandas DataFrame, we can use the corr() method to get the correlation matrix.
boston.corr()
This contains a lot of information, but it is not easy to read. Let’s visualize the correlations with a heatmap.
import seaborn as sns  # Allows for easy plotting of heatmaps

sns.heatmap(boston.corr(), annot=True)
plt.show()
The heatmap reveals some interesting patterns. We can see:
- A strong positive relationship between home prices and the average number of rooms for homes in that area (RM)
- A strong negative relationship between home prices and the percentage of lower status of the population (LSTAT)
- A strong positive relationship between accessibility to radial highways (RAD) and property taxes (TAX)
- A negative relationship between nitric oxides concentration (NOX) and distance to major employment areas in Boston (DIS)
- Little or no linear relationship between the Charles River variable (CHAS) and any other variable
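Reading patterns like these off a heatmap by eye gets harder as the number of variables grows. One alternative, sketched here on synthetic data, is to flatten the correlation matrix and rank the variable pairs by the absolute value of 𝑟. The column names a to d and the way they are generated are assumptions made purely for this example.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration: b is almost a copy of a,
# c is negatively and more loosely related to a, d is unrelated.
rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)
df = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=n),
    "c": -a + rng.normal(scale=1.0, size=n),
    "d": rng.normal(size=n),
})

corr = df.corr()

# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Rank by absolute value so strong negative pairs sit next to strong positive ones.
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

On a real DataFrame, the same few lines can be applied to the output of corr() to get a shortlist of pairs worth inspecting.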
Statistical Test for Correlation
Now let's look at a statistical test for correlation. Suppose we want extra assurance that two variables are correlated. We could perform a statistical test of the hypotheses

$$H_0: \rho = 0 \quad \text{versus} \quad H_A: \rho \neq 0$$

(where 𝜌 is the population, or “true”, correlation). This test is provided in SciPy.
from scipy.stats import pearsonr

# Test to see if crime rate and house prices are correlated
pearsonr(boston.CRIM, boston.PRICE)

Output:
(-0.38583168988399053, 2.0835501108141935e-19)
The first number in the returned tuple is the computed sample correlation coefficient 𝑟, and the second number is the p-value of the test. In this case, the evidence that there is any non-zero correlation is strong. That said, just because we can conclude that the correlation is not zero does not mean that the correlation is meaningful.
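The gap between “statistically significant” and “meaningful” is easy to see with a large synthetic sample. In the sketch below (simulated data, with a slope of 0.03 chosen arbitrarily for illustration), 𝑥 explains only about 0.1% of the variance of 𝑦, yet the test still rejects 𝜌 = 0 because the sample is so large.

```python
import numpy as np
from scipy.stats import pearsonr

# Simulated data: a very weak linear effect buried in noise.
rng = np.random.default_rng(42)
n = 100_000
x = rng.normal(size=n)
y = 0.03 * x + rng.normal(size=n)

# pearsonr returns the sample correlation and the two-sided p-value.
r, p_value = pearsonr(x, y)

print(f"r = {r:.4f}, p-value = {p_value:.2e}, r^2 = {r**2:.5f}")
```

With this many observations even a tiny correlation produces a very small p-value, so it is worth looking at the size of 𝑟 (or 𝑟²) and not just at the p-value.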
That’s all about correlation.
I hope you really enjoyed this article — please leave your feedback and suggestions below.
Thanks for reading.