Basic Statistics for Data Analysis With Python

Ritchie Pulikottil
Published in The Startup
5 min read · Oct 16, 2020

When dealing with data, we often need to be familiar with statistics. Here, we will go through some of the absolute basics of statistics that you are expected to know!

Picture Courtesy: clipart-library.com

Before we begin, let us consider a set of age groups sorted in ascending order, just to make things a little easier to understand: 10, 20, 30, 40, 50, 60, 70, 80, 90.

Average

In statistics, we have the mean and the median, and both are referred to as averages. Outside of statistics, however, "average" usually refers to the mean.

The mean is the most commonly known average. If you take our previously defined set of age groups, add them all together, and divide by the total number of values, the result is the mean of the age groups: (10 + 20 + 30 + 40 + 50 + 60 + 70 + 80 + 90) / 9 = 450 / 9 = 50.
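As a quick check, the same calculation can be sketched in plain Python. The variable names here (`ages`, `total`, `count`) are only illustrative and do not come from any library.

```python
# The nine age groups from the running example.
ages = [10, 20, 30, 40, 50, 60, 70, 80, 90]

# Add every value together, then divide by how many values there are.
total = sum(ages)
count = len(ages)
mean = total / count

print("sum:", total)    # 450
print("count:", count)  # 9
print("mean:", mean)    # 50.0
```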

Similarly, the median refers to the middle value of a sorted set of data. In our example, since there are 9 values, the middle value is the 5th element, which is 50. If there is an even number of data points, you take the mean of the two values in the middle to find the median.
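Both cases (odd and even number of values) can be sketched with a small helper. The function name `median` below is a hypothetical helper written for this illustration, not a library call.

```python
def median(values):
    """Return the middle value of the data (hand-written illustration)."""
    ordered = sorted(values)  # the median is defined on sorted data
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: there is exactly one middle element.
        return ordered[mid]
    # Even count: take the mean of the two middle elements.
    return (ordered[mid - 1] + ordered[mid]) / 2


ages = [10, 20, 30, 40, 50, 60, 70, 80, 90]

print(median(ages))       # 9 values, the 5th element: 50
print(median(ages[:-1]))  # 8 values (10..80): (40 + 50) / 2 = 45.0
```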

Percentiles

You can also think of the median as the 50th percentile of the given set of data. This means that 50% of the data is less than the median and 50% of the data is greater than the median. This tells us where the middle of the data is, but to gain a deeper understanding of the data distribution, we often look at the 25th percentile and the 75th percentile as well.

The 25th percentile is the value below which roughly 25% (one-quarter) of the data falls. Similarly, the 75th percentile is the value below which roughly 75% (three-quarters) of the data falls.

If we look at our ages again: 10, 20, 30, 40, 50, 60, 70, 80, 90.
Here, we have 9 values, so 25% of the data is 2.25 data points (9 × 0.25), which is approximately 2. The 3rd data point is therefore greater than 25% of the total data, which gives us the 25th percentile as 30 (the 3rd data point).
Similarly, 75% of the data is 6.75 data points (9 × 0.75), so 6 whole data points fall below it. The 7th data point is greater than 75% of the data, so the 75th percentile is 70 (the 7th data point).

We can clearly see that our data ranges between 10 and 90. The 25th and 75th percentiles tell us that roughly half of our age groups lie between 30 and 70, which helps us gain a better understanding of how the data is distributed.
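There are several conventions for computing percentiles. The sketch below assumes one common convention, linear interpolation between neighbouring values, which is also the default behaviour of NumPy's `percentile` function. The helper name `percentile` is hypothetical and written only for this illustration.

```python
def percentile(values, q):
    """q-th percentile using linear interpolation (one common convention)."""
    ordered = sorted(values)
    # Fractional position of the percentile within the sorted data.
    position = (len(ordered) - 1) * q / 100
    lower = int(position)                     # index just below the position
    upper = min(lower + 1, len(ordered) - 1)  # index just above, clamped
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


ages = [10, 20, 30, 40, 50, 60, 70, 80, 90]

p25 = percentile(ages, 25)
p50 = percentile(ages, 50)
p75 = percentile(ages, 75)
print("25th:", p25, "50th:", p50, "75th:", p75)

# How many age groups fall between the 25th and 75th percentiles (inclusive)?
between = [a for a in ages if p25 <= a <= p75]
print(len(between), "of", len(ages), "values lie between", p25, "and", p75)
```

For this data set the interpolation lands exactly on existing data points (30, 50 and 70), so it agrees with the counting argument above; with other data the result may fall between two values.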

Standard Deviation & Variance

You need standard deviation and variance if you wish to dig a little deeper into the distribution of your data. Both help us understand how spread out our data is.

Standard deviation is the square root of the variance, so let us start with the variance and go back to our age group example: 10, 20, 30, 40, 50, 60, 70, 80, 90. First, we find the mean of the data, which is 50. Then, we calculate how far each value in our dataset is from the mean. For instance, our second element (20) is 30 away from the mean (50 - 20 = 30). Similarly, we find the distance for each and every element in the dataset.

Here’s a list of all these distances: 40, 30, 20, 10, 0, 10, 20, 30, 40. We then square these values and add them together, which gives us 1600 + 900 + 400 + 100 + 0 + 100 + 400 + 900 + 1600 = 6000. We now divide this value by the total number of elements in our dataset, and that gives us the variance: 6000 / 9 = 666.67. To get the standard deviation, we just take the square root of this number and get 25.82.
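The same steps can be sketched in plain Python. Note that this divides by the number of elements, as the text does (often called the population variance); the variable names are illustrative only.

```python
import math

ages = [10, 20, 30, 40, 50, 60, 70, 80, 90]

# Step 1: the mean of the data.
mean = sum(ages) / len(ages)

# Step 2: how far each value is from the mean.
distances = [abs(a - mean) for a in ages]

# Step 3: square each distance and add the squares together.
squared = [d ** 2 for d in distances]
sum_of_squares = sum(squared)

# Step 4: divide by the number of elements to get the variance.
variance = sum_of_squares / len(ages)

# Step 5: the standard deviation is the square root of the variance.
std = math.sqrt(variance)

print("mean:", mean)                      # 50.0
print("sum of squares:", sum_of_squares)  # 6000.0
print("variance:", round(variance, 2))    # 666.67
print("standard deviation:", round(std, 2))  # 25.82
```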

Since the mean is 50 and the standard deviation is 25.82, we can say that most of the population is between 24.18 (50 - 25.82) and 75.82 (50 + 25.82), and that is how we use standard deviation to get a better insight into how the data is distributed in our dataset.
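To see what "most of the population" means for this small data set, here is a minimal sketch that counts how many values fall within one standard deviation of the mean. It uses `mean` and `pstdev` (population standard deviation) from Python's standard `statistics` module; the names `low`, `high` and `inside` are illustrative.

```python
import statistics

ages = [10, 20, 30, 40, 50, 60, 70, 80, 90]

mean = statistics.mean(ages)
std = statistics.pstdev(ages)  # population standard deviation (divides by n)

# The band that is one standard deviation either side of the mean.
low = mean - std
high = mean + std

inside = [a for a in ages if low <= a <= high]

print("range:", round(low, 2), "to", round(high, 2))  # 24.18 to 75.82
print(len(inside), "of", len(ages), "values fall inside:", inside)
```

For these nine values, five (30 through 70) fall inside the band, which is a little over half of the data.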

Statistics with Python

Picture Courtesy: Lynda.com

I don't believe that you need to be a statistics whiz kid to become a Data Scientist. I don't deny that it helps, either. If you are a genius in statistics, that's always better, but it isn't mandatory! If you aren't familiar with statistics, it’s fine! You can still become a Data Scientist! Just go through some basic statistics first. This article doesn't cover all of the basics, but you can start here and move on to better resources.

Once you are confident with the basics, you can always use Python or other programming languages like R to do your work. Yes! We can perform all of the calculations we have discussed so far with Python. We will use the Python package NumPy. There are many more useful things you can do with NumPy, but for now, we will just use a few functions for statistical calculations: mean, median, percentile, std, var.

Try it out:

import numpy as np

data = [11, 22, 33, 44, 55, 66, 77, 88, 99]

print("mean:", np.mean(data))
print("median:", np.median(data))
print("50th percentile (median):", np.percentile(data, 50))
print("25th percentile:", np.percentile(data, 25))
print("75th percentile:", np.percentile(data, 75))
print("standard deviation:", np.std(data))
print("variance:", np.var(data))

And with that, we have covered the absolute basics of statistics with Python. It’s definitely not enough, so make sure to explore more. We might have further articles on this topic. Until then, take care :)
