# Basics of Correlation: Using Pandas and SciPy to Calculate Correlations

---

Hello guys,

Today I am going to explain the basics of correlation and how to compute it in Python. Understanding correlation is important when working with simple and multiple linear regression.

**Correlation** is a measure of how strongly two variables are related to one another. The most common measure of correlation is the **Pearson correlation coefficient**, which, for two sets of paired data *𝑥ᵢ* and *𝑦ᵢ*, is defined as

    r = Σᵢ(xᵢ − x̄)(yᵢ − ȳ) / √( Σᵢ(xᵢ − x̄)² · Σᵢ(yᵢ − ȳ)² )

Here *𝑟* is a number between -1 and 1, with *𝑟*>0 indicating a positive relationship (*𝑥* and *𝑦* increase together) and *𝑟*<0 a negative relationship (*𝑥* increases as *𝑦* decreases). When |*𝑟*|=1, there is a perfect *linear* relationship, while *𝑟*=0 means there is no *linear* relationship (*𝑟* may fail to capture non-linear relationships). In practice, *𝑟* is never exactly 0, so values of *𝑟* with small magnitude are treated as "no correlation". |*𝑟*|=1 does occur, usually when two variables effectively describe the same phenomenon (for example, height in meters vs. height in centimeters, or grocery bill and sales tax).
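The two edge cases above are easy to check numerically. Here is a minimal sketch using made-up heights (the data values are illustrative, not from any real survey):

```python
import numpy as np

# Height in meters vs. the same height in centimeters: the two columns
# describe the same quantity, so the correlation is a perfect |r| = 1.
meters = np.array([1.52, 1.60, 1.68, 1.75, 1.83])
centimeters = meters * 100
r_perfect = np.corrcoef(meters, centimeters)[0, 1]
print(r_perfect)  # 1.0 (up to floating point)

# A relationship r can miss: y = x**2 on a symmetric range is a perfect
# *non-linear* relationship, yet r comes out as 0.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2
r_nonlinear = np.corrcoef(x, y)[0, 1]
print(r_nonlinear)  # ~0: no *linear* relationship
```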

Now let's work with the Boston house prices dataset. The Boston housing prices dataset is included with **sklearn** as a "toy" dataset (one used to experiment with statistical and machine learning methods). It includes the results of a survey that prices houses from various areas of Boston, and includes variables such as the crime rate of an area, the age of the homes, and other variables. While many applications focus on predicting the price of housing based on these variables, I'm only interested in the correlation between these variables (perhaps this will suggest a model later).

Below I load in the dataset and create a Pandas `DataFrame` from it.

```python
from sklearn.datasets import load_boston  # note: ships with older versions of scikit-learn
import pandas as pd
from pandas import DataFrame
import matplotlib.pyplot as plt
%matplotlib inline

boston = load_boston()
print(boston.DESCR)   # description of the dataset

boston.data           # the feature values, as a NumPy array
boston.feature_names  # the names of the features
boston.target         # the house prices

# Add all the features (and the price) to a data frame
temp = DataFrame(boston.data, columns=pd.Index(boston.feature_names))
boston = temp.join(DataFrame(boston.target, columns=["PRICE"]))
boston                # the final data in DataFrame format
```

**Correlation Between Two Variables**

We could use NumPy's `corrcoef()` function if we wanted the correlation between two variables, say, the local area crime rate (CRIM) and the price of a home (PRICE).

```python
from numpy import corrcoef

boston.CRIM.values  # as a NumPy array
corrcoef(boston.CRIM.values, boston.PRICE.values)
```

Output:

```
array([[ 1.        , -0.38583169],
       [-0.38583169,  1.        ]])
```

The numbers in the off-diagonal entries correspond to the correlation between the two variables. In this case, there is a negative relationship, which makes sense (more crime is associated with lower prices), but the correlation is only moderate.
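The layout of the `corrcoef()` result generalizes to any pair of variables. A quick sketch with synthetic stand-ins for CRIM and PRICE (made-up data, not the Boston survey) shows where the number of interest lives:

```python
import numpy as np

rng = np.random.default_rng(0)
crime = rng.exponential(scale=3.0, size=500)        # stand-in for CRIM
price = 30.0 - 2.0 * crime + rng.normal(0, 5, 500)  # stand-in for PRICE

m = np.corrcoef(crime, price)
print(m.shape)                       # (2, 2): one row/column per variable
print(np.allclose(np.diag(m), 1.0))  # True: each variable correlates perfectly with itself

r = m[0, 1]   # the off-diagonal entry: correlation between the pair
print(r < 0)  # True here: prices fall as crime rises in this toy data
```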

**Computing a Correlation Matrix**

When we have several variables we may want to see what correlations there are among them. We can compute a **correlation matrix** that includes the correlations between the different variables in the dataset.

When loaded into a Pandas `DataFrame`, we can use the `corr()` method to get the correlation matrix.

`boston.corr()`

While this has a lot of data, it's not easy to read. Let's visualize the correlations with a heatmap.

```python
import seaborn as sns  # allows for easy plotting of heatmaps

sns.heatmap(boston.corr(), annot=True)
plt.show()
```

The heatmap reveals some interesting patterns. We can see:

- A strong positive relationship between home prices and the average number of rooms for homes in that area (RM)
- A strong negative relationship between home prices and the percentage of the population classified as lower status (LSTAT)
- A strong positive relationship between accessibility to radial highways (RAD) and property taxes (TAX)
- A negative relationship between nitric oxides concentration (NOX) and distance to major employment areas in Boston (DIS)
- No relationship between the Charles River variable (CHAS) and any other variable
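Rather than eyeballing the heatmap, the same conclusions can be pulled out programmatically by sorting one column of the correlation matrix. A sketch with a made-up frame follows (on the real data you would use `boston.corr()["PRICE"]`; the column names mimic two of the Boston features):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 506
rm = rng.normal(6.3, 0.7, n)        # stand-in for RM (rooms per home)
lstat = rng.uniform(2.0, 35.0, n)   # stand-in for LSTAT
price = 10.0 + 5.0 * rm - 0.5 * lstat + rng.normal(0, 3, n)
df = pd.DataFrame({"RM": rm, "LSTAT": lstat, "PRICE": price})

# One column of the correlation matrix, strongest relationships first
corr_with_price = df.corr()["PRICE"].drop("PRICE")
order = corr_with_price.abs().sort_values(ascending=False).index
print(corr_with_price.reindex(order))
```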

**Statistical Test for Correlation**

Now let's look at a statistical test for correlation. Suppose we want extra assurance that two variables are correlated. We could perform a statistical test of

    H₀: ρ = 0    versus    Hₐ: ρ ≠ 0

(where *𝜌* is the population, or "true", correlation). This test is provided for in SciPy.

```python
from scipy.stats import pearsonr

# Test to see if crime rate and house prices are correlated
pearsonr(boston.CRIM, boston.PRICE)
```

Output:

```
(-0.38583168988399053, 2.0835501108141935e-19)
```

The first number in the returned tuple is the computed sample correlation coefficient *𝑟*, and the second number is the p-value of the test. In this case, the evidence that there is *any* non-zero correlation is strong. That said, just because we can conclude that the correlation is not zero does not mean that the correlation is meaningful.
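The gap between "statistically significant" and "meaningful" is easy to demonstrate: with a large enough sample, even a negligible correlation produces a tiny p-value. A sketch with simulated data (the 0.02 slope is an arbitrary choice to make the true correlation roughly 0.02):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)
n = 100_000
x = rng.normal(size=n)
y = 0.02 * x + rng.normal(size=n)  # true correlation is roughly 0.02

r, p = pearsonr(x, y)
print(r)  # close to 0.02: far too weak to matter in practice
print(p)  # yet small enough to reject rho = 0 at the usual 5% level
```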

That’s all about correlation.

I hope you really enjoyed this article — please leave your feedback and suggestions below.

Thanks for reading.