Stats for Marketers: Mean, Median and Mode
Learn the fundamentals of descriptive statistics before moving on to more advanced material.
This is the first article in a series on probability and statistics aimed at marketing and general business users. We'll be covering essential concepts at around the level of introductory and intermediate college courses.
Where possible we’ll look at the intuition behind the concept, the math and then apply it practically using statistical and math packages like Python’s Pandas, Numpy and Scipy.
Data and statistical literacy are essential in all business settings, but one of the things that distinguishes professional marketers from advanced Instagram users is confidence in understanding and applying statistical methods to uncover insights and drive strategy. And, you know, to see whether a campaign actually worked.
To be specific, ‘worked’ means that at the standard 95% confidence level, there’s only a 5% chance we’d see results this strong if the campaign actually had no effect. But we’ll come to all that later.
Let’s start with the basics…
I have a friend Eunice. Eunice has just exported her transaction report from her CRM for the month and now has a list of 300 sales from different clients. The report has a column of client names and a column of transaction values.
Reading all 300 transaction records is tedious, and a person is unlikely to remember all of them, so her manager has asked for a summary of the report.
Since just handing all 300 records over won’t cut it, one way to summarise would be to add up all the transaction values into a total of gross sales. This total is a simple way to describe the 300 data points by literally sum-marising the information into a single number.
Condensing information into more economical packages is essential for us to be able to grasp the data, especially as the dataset gets larger. If Eunice’s company grew very quickly and they were in the fortunate position of having 3,000,000 transactions per month it would be impossible to get a sense of the data by visually inspecting the raw output. Even opening this dataset in trusty Excel is going to be a struggle. Describing this data with a few standard numbers is going to be key.
Besides totals, there are other more sophisticated ways of describing the data points such as the average transaction size, the most common transaction value or how different the order size is between clients. Together these values help describe the data and are called summary statistics.
The Mean
The mean, specifically the arithmetic mean, is the average value of a series of numbers. It’s calculated by taking the sum of all the values and dividing it by the count of the values. The mean is often written as x̄ and pronounced ‘x bar’.
The mean is a measure of the ‘centre’ or ‘location’ of the data: it’s the single value the observations balance around.
The larger the magnitude of a value, the larger its contribution to the mean. If the data is more or less evenly distributed, the mean is a good measure to work with. But there’s a subtle trap if it isn’t.
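The calculation is simple enough to write from scratch; here's a quick sketch in plain Python (later we'll let Pandas do this for us):

```python
def mean(values):
    # Sum of all the values divided by the count of the values
    return sum(values) / len(values)

print(mean([2, 4, 6, 8]))  # (2 + 4 + 6 + 8) / 4 = 5.0
```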
Going back to Eunice’s 300 transaction records, suppose that there are three product tiers in her company: the individual user package, the business package and the enterprise package, which cost $100, $1,500 and $5,000 per month respectively.
Let’s further suppose that there are 214 individual users and 86 enterprise customers. Note that no business packages were sold at all.
If we take the mean of this data, we get an average transaction value of $1,505 for the month. Without knowing the structure of the underlying data, we might conclude that the average customer at Eunice’s company is likely a business user. But the mean tells us nothing about how the data varies, and we can easily draw incorrect conclusions like ‘This product is popular with business users’ or ‘We should allocate more of our media budget towards business users’, when clearly that isn’t the case at all.
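It's worth checking that $1,505 figure by hand, using the tier counts above:

```python
# 214 individual users at $100 and 86 enterprise customers at $5,000
total = 214 * 100 + 86 * 5000   # gross sales: 451,400
mean_sale = total / (214 + 86)  # 300 transactions in all

print(round(mean_sale, 2))  # 1504.67, i.e. roughly $1,505
```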
The Median
Another way to think about the central point of a data series is the value located physically in the middle of the series. If we order the data from the smallest value to the largest, the median is the number halfway along the series.
For an odd number of values, say the numbers from 11 to 15, there are 5 numbers (n = 5). We put them in sequence and then just need to find the (n + 1)/2 th number, in this case the third number, which happens to be 13.
For an even number of values there’s a little more to it. Let’s say we now look at the numbers from 11 to 16. Since there is an even number, if we apply the same formula with n=6, we now need to locate the 3.5th number.
Obviously there isn’t a number conveniently located at the 3.5th position, but we can get an answer by interpolating between the values at the third and fourth positions.
The third value is 13 and the fourth value is 14, so halfway between the two is 13.5. Notice that this is also technically the mean of these two values, since 13 + 14 = 27 and 27/2 = 13.5.
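Both cases can be rolled into one small function; here's a from-scratch sketch in plain Python:

```python
def median(values):
    ordered = sorted(values)  # smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]   # odd count: the middle value
    # even count: interpolate between the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([11, 12, 13, 14, 15]))      # 13
print(median([11, 12, 13, 14, 15, 16]))  # 13.5
```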
Using this approach with Eunice’s transaction data, the median is located at the (300 + 1)/2 = 150.5th position, which falls between the 150th and 151st values. Both of those values are $100, so the median is $100.
Unlike the mean, the median is far less sensitive to very large or very small values dramatically changing the result. These extreme values are commonly called outliers, and we’ll return to them in the next article. We’ll also spend some time on what the difference between the mean and the median can tell us about the shape of the underlying data.
The Mode
The mode is another way of describing the data, although it becomes less useful than the mean and the median as we go deeper into statistics.
The mode is the most commonly occurring — or highest frequency — value in a series of data. If we were to plot the frequency of each value in Eunice’s 300 transactions, one bar would tower over the rest.
Given what we already know about how lopsided the transaction data is, it perhaps isn’t surprising that the Individual package at $100 is the mode of this data.
To find the mode of more complex data, counting the occurrences of individual values and selecting the one with the highest count is the general approach. We will see a lot more of this technique of frequency counts and frequency distributions as we get further into the material.
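As a sketch, Python's standard library handles the counting for us (this assumes a single most-common value; ties would need extra care):

```python
from collections import Counter

def mode(values):
    # Tally how often each value occurs, then take the most frequent
    counts = Counter(values)
    return counts.most_common(1)[0][0]

# Eunice's transactions: 214 sales at $100 and 86 at $5,000
print(mode([100] * 214 + [5000] * 86))  # 100
```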
Applying this in Python
Before analysing the data in Python, we’ll need to read it into a Pandas data frame from the file. It’s a common convention to alias Pandas as ‘pd’ and to refer to a data frame as ‘df’.
import pandas as pd

# Read the data from the csv text file
df = pd.read_csv('transactions.csv')
If this is the first time we’re seeing this data, we can run the head() method to get the first five rows of the data and the column names.
0 145308 100
1 129658 100
2 135004 100
3 130357 100
4 170503 100
Now that we have a better idea of what our data looks like we can begin to do some basic analysis. Since we’ll just be using the Value column in this example, we’ll pull the series out as a separate variable.
sales = df['Value']
The mean of the transactions using Pandas:
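Pandas gives us this directly with the Series’ mean() method. A minimal sketch, rebuilding the sales series from the tier counts above so it runs on its own:

```python
import pandas as pd

# Stand-in for df['Value']: 214 sales at $100 and 86 at $5,000
sales = pd.Series([100] * 214 + [5000] * 86)

print(round(sales.mean(), 2))  # 1504.67
```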
The median of the transactions using Pandas:
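Likewise, median() on the Series; again with a stand-in series so the snippet is self-contained:

```python
import pandas as pd

# Stand-in for df['Value']
sales = pd.Series([100] * 214 + [5000] * 86)

print(sales.median())  # 100.0
```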
The mode of the transactions using Pandas:
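And mode(); note that Pandas returns a Series here, since data can have more than one mode:

```python
import pandas as pd

# Stand-in for df['Value']
sales = pd.Series([100] * 214 + [5000] * 86)

# mode() returns a Series of the most frequent value(s); take the first
print(sales.mode()[0])  # 100
```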
That’s it for this article on the fundamental descriptive statistics of mean, median and mode. We’ve looked at the formulas, some of the intuition behind the math and applied what we’ve learned in Python.
We’ve just scratched the surface of statistics and how much we can accomplish with Python!
Next time, we’ll continue on further with descriptive statistics by looking at the concepts of quartiles, box plots, interquartile range and outliers.