Correlation in data
Perhaps the first step towards getting useful information out of a dataset is knowing how the individual parts correlate with one another.
In a nutshell, correlation describes how one set of numbers relates to another. If the two show some relationship, we can use that insight to explore and test causation, and even to forecast future data.
Dataset: Our dataset was cobbled together from monthly average ice cream production in 2011 and the average monthly temperature for the US.
File: correlation/weatherIceCream.csv
Scatter plots are great for visualizing these types of relationships, or at least for identifying whether there is a relationship in the first place. Notice that in the following example we got rid of the month column and are only plotting temperature and ice cream production, since those are our target variables.
File: Correlation/scatters1.py

# Import Libraries
from bokeh.models import HoverTool
from bokeh.plotting import figure, show, output_file
import pandas as pd

# Read Data
df = pd.read_csv("correlation/weatherIceCream.csv",
                 usecols=['Date', 'AVG temp C', 'Ice Cream production'])

hover = HoverTool(tooltips=[
    ("(Temp, Ice Cream Production)", "($x, $y)")
])

p = figure(x_range=(-10, 30), y_range=(35, 90), tools=[hover])

# Main chart definition
p.scatter(df['AVG temp C'], df['Ice Cream production'], size=10)

p.background_fill_color = "mintcream"
p.background_fill_alpha = 0.2
p.xaxis.axis_label = "Avg Temp C"
p.yaxis.axis_label = "Ice Cream Production (1000, Gallons)"

show(p)
Check the live chart or run it from the code sample on GitHub; you can hover over the data points, which makes the relationship easier to see.
If you are a visual person like me, this might be all the proof you need: these two variables are related; when temperature goes up, so does ice cream production. But is there any statistical measurement we can use to corroborate this? The most used one is Pearson's r:
A measurement of the linear correlation of two variables X and Y, or, if you are into formal definitions, the covariance of the two variables divided by the product of their standard deviations.
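That formal definition translates almost directly into code. As a sanity check, here is a small hypothetical helper (not part of the article's code samples) that computes r from raw covariance and standard deviations:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's r: the covariance of x and y divided by the
    product of their standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Covariance and the standard deviations share the same 1/n factor,
    # so it cancels and we can work with raw sums of deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

# A perfectly linear relationship scores 1.0:
print(round(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]), 10))  # → 1.0
```

Feeding it the same two columns as the scipy sample below should give the same coefficient, which is a nice way to convince yourself the formula and the library agree.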
File: pearsonr.py

from scipy.stats import pearsonr
import pandas as pd

# Read Data
df = pd.read_csv("correlation/weatherIceCream.csv",
                 usecols=['Date', 'AVG temp C', 'Ice Cream production'])

# pearson in the hauz !
print(pearsonr(df['AVG temp C'], df['Ice Cream production']))
Reference: scipy.stats.pearsonr, SciPy Reference Guide — "The Pearson correlation coefficient measures the linear relationship between two datasets."
The output is a correlation coefficient of 0.72 (and a two-tailed p-value of 0.0072; see the reference guide for a description). But what does it mean? Let's look at the possible values and how they relate to correlation:
These two extreme scatter plots and correlation coefficients (+1 and -1) imply a perfect linear relationship, which can be positive or negative.
Less extreme values also denote correlation, but imply a less-than-perfect one.
And as the values move towards zero the variables become less related, zero being no correlation at all:
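These extremes are easy to reproduce with synthetic data. A quick sketch, reusing scipy.stats.pearsonr from the sample above (the variable names here are made up for illustration):

```python
import random
from scipy.stats import pearsonr

x = list(range(100))
perfect_positive = [2 * v + 3 for v in x]    # exact line, positive slope
perfect_negative = [-2 * v + 3 for v in x]   # exact line, negative slope
random.seed(42)
unrelated = [random.random() for _ in x]     # no relationship to x at all

print(pearsonr(x, perfect_positive)[0])  # essentially 1.0
print(pearsonr(x, perfect_negative)[0])  # essentially -1.0
print(pearsonr(x, unrelated)[0])         # close to 0
```

Real data almost never lands exactly on -1, 0, or +1, which is why interpreting in-between values like our 0.72 matters.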
So according to our Pearson's r value of 0.72, ice cream production and monthly temperatures are highly correlated.
Causation and other traps
Looking into the correlation of a set of variables is just the start of a journey of discovery, and along this journey there are dead ends and wrong paths, which we'll briefly mention.
The easiest trap to fall into is assuming causation: just because two variables show some sort of correlation or relationship, it doesn't follow that changes to one will affect the other.
For example, ice cream production and bikini sales are probably correlated, but one does not cause the other.
Bidirectionality is also a common wrong turn: it is safe to assume that hot and cold weather influence ice cream production, but not that ice cream production influences the weather!
Past performance is no guarantee of future results
We live in a universe ruled by systems, systems that tend to behave in a number of ways: some are linear for a brief time or at a certain scale, some alternate between states, and most we barely understand and thus perceive as either chaotic or narrowly defined.
What this means for our correlation discussion is that we can correctly perceive a relationship, but unfortunately there is no guarantee that the relationship will hold or that it reflects the real distribution. We are getting into the weeds of correlation here, but these points are important; let's look at two brief examples:
We have somehow successfully established that ice cream production increases with the average temperature, but what if we suddenly (and tragically) discovered that ice cream causes cancer, or consumer preferences shifted to a new hot-weather snack? The distribution would then change, and for a period our forecasts and models would be mistaken.
In a more general sense, the data itself can hint at correlation when there is no linearity. Take Anscombe's quartet, for instance: these four datasets all have the same correlation value of 0.816, yet as you can see on the scatter plots, the data is not linear and the relationships are more complex:
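You can verify this yourself with the quartet's well-known (hard-coded) values. A short sketch, again using scipy.stats.pearsonr: all four pairs produce the same coefficient, even though only the first one looks linear when plotted.

```python
from scipy.stats import pearsonr

# Anscombe's quartet: four small datasets with near-identical summary statistics.
# The first three share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = [
    (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
     [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
]

for i, (x, y) in enumerate(quartet, start=1):
    r, _ = pearsonr(x, y)
    print(f"Dataset {i}: r = {r:.3f}")  # all four print r = 0.816
```

This is exactly why the scatter plot and the coefficient belong together: the number alone can't tell you which of these four shapes you are looking at.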
Correlation is a useful statistical metric for finding linear relationships in your datasets, but it is also fraught with nuance, so rather than a definitive tool, it is a good starting point, along with scatter plots, for gaining insights into causation and forecasting.
I hope this short overview was helpful !
About the Author :
Born Eugenio Noyola Leon (Keno), I am a designer, developer/programmer, artist and inventor currently living in Mexico City. You can find me at: www.k3no.com