Introduction to Statistics for Data Science
Intermediate Level — The Fundamentals of Descriptive Statistics
In the last post I went through some introductory but important statistics concepts to get you started in Data Science. We analysed the terms population and sample, the types of data you might work with, and the different measures you might compute on your data: measures of central tendency (mean, median, mode), measures of variability (variance, standard deviation) and measures of asymmetry (skewness and modality).
This is a more intermediate-level post where we will see some statistical concepts that are less well known but still important in your initial exploratory data analysis. It covers the following:
- Pearson Correlation
- Spearman Correlation
- Correlation matrix
- Spurious correlations
The simplest way to understand these concepts is with a practical example. Let’s take a look at the EPL 2014–2015 Player Heights and Weights dataset, which contains English Premier League players’ height, weight and age, as well as their name, number, position and team.
Starting simple, what do you think is the relationship between Height and Weight for football players? You probably assumed that the taller the player is, the heavier he will be. So, we expect a positive relationship between height and weight.
And it is clear that taller players tend to weigh more, with some exceptions.
Covariance is a measure that indicates how two variables are related. A positive covariance means the variables are positively related, while a negative covariance means the variables are inversely related. The formula for calculating covariance of sample data is shown below.
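For reference, the sample covariance is usually written as follows, using the symbols defined just below. This is the form that divides by n − 1, which is also what pandas computes by default.

```latex
\operatorname{COV}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}
```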
x = the independent variable
y = the dependent variable
n = number of data points in the sample
x̅ = the mean of the independent variable x
ȳ = the mean of the dependent variable y
Using the pandas cov method we can calculate the covariance between Height and Weight:
covariance = data['Height'].cov(data['Weight'])
print('Covariance with Pandas = %0.2f' % covariance)
So, we have a covariance of 34.43. Since the covariance is positive, the variables are positively related: they move together in the same direction.
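To make the formula concrete, here is a minimal sketch that computes the covariance step by step and compares it with pandas. The `toy` DataFrame and its values are made up for illustration and are not taken from the EPL dataset.

```python
import pandas as pd

# Hypothetical toy values, NOT taken from the EPL dataset
toy = pd.DataFrame({
    "Height": [170.0, 175.0, 180.0, 185.0, 190.0],  # cm
    "Weight": [65.0, 72.0, 74.0, 80.0, 88.0],       # kg
})

# Deviation of every data point from the mean of its own variable
height_dev = toy["Height"] - toy["Height"].mean()
weight_dev = toy["Weight"] - toy["Weight"].mean()

# Multiply the deviations pairwise, sum them, and divide by n - 1
n = len(toy)
manual_cov = (height_dev * weight_dev).sum() / (n - 1)

# pandas uses the same n - 1 normalisation by default
pandas_cov = toy["Height"].cov(toy["Weight"])

print("Covariance by hand     = %0.2f" % manual_cov)
print("Covariance with pandas = %0.2f" % pandas_cov)
```

Both lines should print the same number, since the manual steps follow the sample covariance formula term by term.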
Nevertheless, with covariance we have an issue… with the units. If you calculated the covariance by hand you might have noticed this issue, but with pandas it is easy to miss. Take a look again at the formula. We’ve chosen two variables: Height, measured in centimeters (cm), and Weight, measured in kilograms (kg). Notice the numerator, where for each data point you subtract the mean of the respective variable and then multiply both values. In the end, our value for the covariance will be 34.43 cm·kg. This is not very informative! First of all, our covariance depends on the magnitude of our variables. If we had used imperial units (pounds) for Weight, the covariance would return a different value and could mislead us about the strength of the relationship.
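A quick sketch of the unit problem, using made-up values rather than the EPL data. The names `height_cm`, `weight_kg` and `KG_TO_LB` are introduced here only for illustration.

```python
import pandas as pd

# Hypothetical toy values, NOT taken from the EPL dataset
height_cm = pd.Series([170.0, 175.0, 180.0, 185.0, 190.0])
weight_kg = pd.Series([65.0, 72.0, 74.0, 80.0, 88.0])

KG_TO_LB = 2.20462  # approximate kilograms-to-pounds factor
weight_lb = weight_kg * KG_TO_LB

cov_metric = height_cm.cov(weight_kg)  # units: cm * kg
cov_pounds = height_cm.cov(weight_lb)  # units: cm * lb

print("Covariance with weight in kg: %0.2f" % cov_metric)
print("Covariance with weight in lb: %0.2f" % cov_pounds)
```

The players and their relationship are identical in both cases, yet the covariance grows by the conversion factor simply because the numbers for Weight got bigger.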
So, our metric is showing us to what extent these variables change together, which is good, but it depends on the magnitude of the variables themselves, which is generally not what we want. A better question than “How do our variables relate?” is “How strong is the relationship between our variables?”. For that, correlation is the best answer.
As seen before, covariance measures how variables with different units of measure relate. With this measure we could determine whether the variables increase or decrease together, but it was impossible to measure the degree to which they move together, because covariance does not use one standard unit of measurement.
Correlation standardizes the measure of interdependence between two variables and, consequently, tells you how closely the two variables move. The correlation measurement, called a correlation coefficient, will always take on a value between 1 and -1:
- If the correlation coefficient is one, the variables have a perfect positive correlation. This means that if one variable moves a given amount, the second moves proportionally in the same direction.
- If the correlation coefficient is zero, no linear relationship exists between the variables. If one variable moves, you can make no linear prediction about the movement of the other variable; they are uncorrelated.
- If the correlation coefficient is –1, the variables are perfectly negatively correlated (or inversely correlated) and move in opposition to each other. If one variable increases, the other variable decreases proportionally.
To calculate the correlation coefficient for two variables, you would use the correlation formula, shown below.
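Written out with the symbols defined just below, the usual form of the correlation coefficient is:

```latex
r_{(x, y)} = \frac{\operatorname{COV}(x, y)}{s_x \, s_y}
```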
r(x,y) = correlation of the variables x and y
COV(x, y) = covariance of the variables x and y
sx = standard deviation of the variable x
sy = standard deviation of the variable y
With correlation we do not have the same issue as with covariance, since our numerator is in cm·kg and our denominator is also in cm·kg, so the result is unitless.
Therefore, the correlation will be the same regardless of the unit system you are working with.
# calculate the correlation in the metric dataset, in Pandas
correlation = data['Height'].corr(data['Weight'])
# print results
print('Correlation in metric system: %0.2f' % correlation)
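As a sanity check, the sketch below builds the correlation from the formula (covariance divided by the product of the standard deviations) and confirms that changing the unit of Weight leaves it untouched. The values and the conversion factor are assumptions made for illustration, not the EPL data.

```python
import pandas as pd

# Hypothetical toy values, NOT taken from the EPL dataset
height_cm = pd.Series([170.0, 175.0, 180.0, 185.0, 190.0])
weight_kg = pd.Series([65.0, 72.0, 74.0, 80.0, 88.0])
weight_lb = weight_kg * 2.20462  # approximate kilograms-to-pounds factor

# Correlation from the formula: COV(x, y) / (sx * sy)
r_formula = height_cm.cov(weight_kg) / (height_cm.std() * weight_kg.std())

# Correlation straight from pandas, in both unit systems
r_metric = height_cm.corr(weight_kg)
r_pounds = height_cm.corr(weight_lb)

print("Correlation from the formula:  %0.4f" % r_formula)
print("Correlation with weight in kg: %0.4f" % r_metric)
print("Correlation with weight in lb: %0.4f" % r_pounds)
```

All three values agree, because the conversion factor appears in both the numerator and the denominator and cancels out.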
The best way to think about Spearman’s correlation, also known as rank correlation, is to forget the variables’ values and consider only their relative position (rank).
This is the original data distribution:
By ranking the data, the shortest player will have his Height ranked as 1 and the lightest player will have his Weight ranked as 1 as well. When there is a tie, for example two players who would otherwise occupy ranks 221 and 222, neither takes either value and both become 221.5.
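The tie rule is easy to see on a tiny example. The heights below are made up, and the sketch relies on pandas' default ranking, which assigns tied values the average of the positions they would occupy.

```python
import pandas as pd

# Hypothetical heights in cm; two players share the value 185
heights = pd.Series([178, 185, 170, 185, 192], name="Height")

# Default ranking: ascending, ties receive the average of their positions
ranks = heights.rank()

print(pd.DataFrame({"Height": heights, "Rank": ranks}))
```

The two players at 185 cm would occupy positions 3 and 4, so both receive 3.5, just like the two players sharing 221.5 above.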
data[["Height", "Weight"]].rank().plot.scatter(x ="Weight", y="Height")
From this plot we can see a positive correlation between both variables.
Pearson correlation is the most commonly used one and, because it is so widely used, people usually just call it correlation. Spearman’s correlation is simply the Pearson correlation between the ranks or, in simpler terms, the correlation between the “positions on the podium” for Height and the “positions on the podium” for Weight.
We can obtain the same correlation value with pandas by doing the following.
ranked_data = data[["Height", "Weight"]].rank()
ranked_data["Height"].corr(ranked_data["Weight"])
Or if you’re just really lazy.
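A sketch of the shortcut, assuming the intent is to let pandas do the ranking for you through the `method` argument of `corr`. The `toy` DataFrame is made up for illustration.

```python
import pandas as pd

# Hypothetical toy values, NOT taken from the EPL dataset
toy = pd.DataFrame({
    "Height": [170.0, 175.0, 180.0, 185.0, 190.0, 183.0],
    "Weight": [65.0, 72.0, 74.0, 95.0, 88.0, 70.0],
})

# The long way: rank both columns, then take the ordinary correlation
ranked = toy.rank()
spearman_from_ranks = ranked["Height"].corr(ranked["Weight"])

# The short way: ask pandas for Spearman's correlation directly
spearman_direct = toy["Height"].corr(toy["Weight"], method="spearman")

print("Spearman via ranks:  %0.4f" % spearman_from_ranks)
print("Spearman via method: %0.4f" % spearman_direct)
```

Both routes give the same value, so the one-liner is usually enough.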
What if, instead of looking at two variables, we take a look at the correlation between all the variables?
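One way to sketch this with pandas is `DataFrame.corr`, which returns the pairwise correlation of every column with every other column. The rows below are invented for illustration; the text columns are dropped first because a correlation only makes sense for numeric data.

```python
import pandas as pd

# Hypothetical rows, NOT taken from the EPL dataset
toy = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C", "Player D", "Player E"],
    "Height": [170.0, 175.0, 180.0, 185.0, 190.0],  # cm
    "Weight": [65.0, 72.0, 74.0, 80.0, 88.0],       # kg
    "Age": [31, 22, 27, 24, 29],                    # years
})

# Keep only the numeric columns, then compute all pairwise correlations
corr_matrix = toy.select_dtypes(include="number").corr()

print(corr_matrix.round(2))
```

The result is a square, symmetric table with ones on the diagonal, since every variable is perfectly correlated with itself.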
We can also see this correlation matrix in a more visually pleasing way.
One thing you should always keep in mind during your analysis is that correlation does not imply causation. Take a look at the following example from tylervigen.com.
I doubt that the consumption of margarine is one of the main causes of the divorce rate in Maine. Nevertheless, these two variables present a correlation of 99%. As a data scientist, your job is to always question the data and your results, and to be skeptical. Find plenty of evidence that supports your statement before making it. Imagine going to your boss and saying “The more margarine you eat, the more likely you are to end up getting a divorce”.
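One common way such correlations appear is a shared trend over time. The sketch below uses two invented series (they are not the margarine or divorce figures) that both happen to rise year after year, and then compares the correlation of the raw values with the correlation of their year-to-year changes.

```python
import pandas as pd

# Made-up yearly figures for two unrelated quantities that both rise over time
series_a = pd.Series([2.0, 2.3, 2.9, 3.1, 3.8, 4.0, 4.6, 5.1])
series_b = pd.Series([10.0, 14.0, 15.0, 19.0, 22.0, 23.0, 27.0, 30.0])

# Correlation of the raw values: dominated by the common upward trend
corr_levels = series_a.corr(series_b)

# Correlation of the year-to-year changes: the trend is removed
corr_changes = series_a.diff().corr(series_b.diff())

print("Correlation of the raw values: %0.2f" % corr_levels)
print("Correlation of the changes:    %0.2f" % corr_changes)
```

The raw values look almost perfectly correlated, while the changes are barely related at all, which is a good reason to be skeptical of a high correlation between two things that simply grow over the same period.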
Continue learning with the next post:
Advanced Level — The Fundamentals of Inferential Statistics with Probability Distributions