Measures of Center with Python
When analyzing quantitative data, whether discrete or continuous, we generally discuss four main aspects: center, spread, shape, and outliers.
This article has two goals:
- Understand the measures of center
- Use some basic Python commands to compute these measures of center
There are 3 widely accepted measures of center, also known as measures of central tendency:
- Mean
- Median
- Mode
The goal of each is to get an idea of a “typical” value in the data set.
Consider this table of the number of cars we see passing by a coffee shop on each day of one week.
In Python, we can save this data in the form of a list:
cars = [5,10,8,7,12,34,28]
If we were asked "How many cars would you expect to see on any given day?", we could answer in a few different ways. We could say that it depends on the day or on the week. To answer with a single representative number, however, we commonly use the average, or mean.
Mean
The mean is calculated by adding all of the values in the data set and dividing by the number of data points. The mean of our data set can be calculated as the sum of the number of cars observed each day, divided by the number of days.
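Written as a formula, with x_i denoting the number of cars on day i and n the number of days (notation introduced here for illustration), the calculation for the data above works out as follows:

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
        = \frac{5 + 10 + 8 + 7 + 12 + 34 + 28}{7}
        = \frac{104}{7} \approx 14.86
```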
Here is how we can accomplish the same using Python:
# Importing libraries
import numpy
# Storing the data in a list
cars = [5, 10, 8, 7, 12, 34, 28]
# Calculating the mean and saving the result in a variable
mean = numpy.mean(cars)
# Printing the mean
print("The mean of this data set is =", mean)
If we take a closer look at our data set, only two of the seven days recorded more cars than the calculated mean. The two high counts (34 and 28) pull the mean above most of the data, so in this case the mean does not sit in the middle of the data.
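The observation above can be checked directly. The following is a minimal sketch that uses only built-in functions; the variable names `mean` and `days_above_mean` are chosen here for illustration.

```python
# Daily car counts from the example above
cars = [5, 10, 8, 7, 12, 34, 28]

# Mean by definition: sum of the values divided by how many there are
mean = sum(cars) / len(cars)

# Keep only the counts that are larger than the mean
days_above_mean = [count for count in cars if count > mean]

print("Mean:", round(mean, 2))
print("Counts above the mean:", days_above_mean)
print("Days above the mean:", len(days_above_mean), "of", len(cars))
```

Only the counts 34 and 28 exceed the mean, which is why the mean sits above five of the seven observations.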
The calculated mean is also a decimal value (about 14.86), which is not a number of cars we could actually observe on any day. It is important to note here that the mean isn't always the best measure of center. A more appropriate measure, in this case, might be the median.
Median
The median is the value that splits the ordered data set in half: half of the values lie below it and half lie above it.
For our data set, we have a median of 10, which is a much better response than "about 14.9 cars", as calculated by the mean.
The actual calculation of the median depends on whether we’re working on a data set with an even number of values or an odd number of values.
To calculate the median:
Step 1: Order the values of the data set from smallest to largest.
Step 2: Determine if the data set has an even or an odd number of values.
Step 3: If we have an odd number of observations, the median is simply the value exactly in the middle. In our case, the number of values is odd (7), so the median is 10, i.e., the 4th value when the numbers are ordered from smallest to largest.
If we have an even number of observations, the median is the average of the two values in the middle. For example, if we have 8 observations, we average the fourth and fifth values together when our numbers are ordered from smallest to largest.
For example, if we had not collected any data on Sunday, our data set would have only 6 values, which is an even number of values, and would look something like this:
On ordering the values from smallest to largest, we notice that the two values in the middle are 8 and 10.
So, the median can be calculated as the average of 8 and 10, which gives us 9.
Here is how the Median can be calculated using Python:
# Importing libraries
import numpy
# Storing the data in a list
cars = [5, 10, 8, 7, 12, 34, 28]
# Calculating the median and saving the result in a variable
median = numpy.median(cars)
# Printing the median
print("The median of this data set is =", median)
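The three steps described earlier can also be written out by hand, which makes the odd and even cases explicit. This is a minimal sketch, not how NumPy computes the median internally; `median_by_hand` is a hypothetical helper, and the six-day list assumes, purely for illustration, that the dropped day is the last entry (28).

```python
def median_by_hand(values):
    """Return the median by following the three steps described above."""
    # Step 1: order the values from smallest to largest
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    middle = n // 2
    # Step 2: check whether the number of values is odd or even
    if n % 2 == 1:
        # Step 3 (odd): the value exactly in the middle
        return ordered[middle]
    # Step 3 (even): the average of the two middle values
    return (ordered[middle - 1] + ordered[middle]) / 2

cars = [5, 10, 8, 7, 12, 34, 28]
# Assumption for illustration only: the missing day is the last entry (28)
cars_six_days = [5, 10, 8, 7, 12, 34]

print(median_by_hand(cars))           # odd number of values
print(median_by_hand(cars_six_days))  # even number of values
```

For the seven-day data this gives 10, and for the six-day data it averages 8 and 10 to give 9.0, matching the worked example.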
Mode
The mode is the most frequently observed value in the data set.
For example, in the data set below, we can see that 10 is the most frequently observed value as it occurs 3 times while the other values only occur once. Hence, the Mode for this data set is 10.
A data set could have multiple modes or even no mode at all.
If all the values in our data set are observed with the same frequency, there is no mode.
For example, in our data set, a different number of cars were observed on each day of the week. The frequency of each value in the data set is 1. As there are no specific values that occur more frequently than the others, there is no mode in this data set.
If two or more values in a data set share the same maximum frequency, then we could have multiple modes for the same data set.
In the example below, we can see that there are two values (10 and 8) with a frequency of 3. Hence, this data set has two modes (10 and 8).
Here is how the Mode can be calculated using Python:
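# Importing libraries
from scipy import stats
# Storing the data in a list
cars_2 = [10, 1, 10, 10, 4, 8, 8, 9, 8]
# Calculating the mode and saving the result in a variable
# Note: stats.mode reports a single value; when several values tie, it returns the smallest one
# (recent SciPy versions return a scalar here, older versions a one-element array)
mode = stats.mode(cars_2).mode
# Printing the mode
print("The mode of this data set is =", mode)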
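Because a data set can have several modes or none at all, a single returned value does not tell the whole story. The following is a minimal sketch using `collections.Counter` from the standard library; `find_modes` is a hypothetical helper, and returning an empty list when every value is equally frequent follows the "no mode" convention used in this article rather than any library's behavior.

```python
from collections import Counter

def find_modes(values):
    """Return every mode of the data; an empty list means there is no mode."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    # If every distinct value occurs equally often, treat it as "no mode"
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    # Otherwise, every value that reaches the highest frequency is a mode
    return sorted(v for v, c in counts.items() if c == highest)

cars = [5, 10, 8, 7, 12, 34, 28]
cars_2 = [10, 1, 10, 10, 4, 8, 8, 9, 8]

print(find_modes(cars))    # every value occurs once
print(find_modes(cars_2))  # 8 and 10 both occur three times
```

With the weekly car counts the result is an empty list, while the second data set yields both 8 and 10.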
Summary
- Measures of central tendency are essential because they give us an idea of what the most common, typical, or representative values in a data set might be.
- The Mean is calculated by adding all of the values in a set and dividing the sum by the total number of values in that set.
- The Median is the number in the middle of a data set that is ordered from lowest to highest or from highest to lowest. With an even number of values, it is the average of the two middle values.
- The Mode is the number that repeats most often in a data set. A data set can have one mode, several modes, or no mode at all, and the mode is used less often than the mean or median as a measure of center.