Beyond AVERAGE(): What you need to consider before reporting an average calculation in your next data analysis

Stacy Giroux
Learning Data
6 min read · Mar 7, 2024

Average is a common go-to calculation for data analysts. It’s so powerful because it allows us to summarize large data sets into a single number, making our findings easier to communicate and easier for our audience to understand.

Because we are so used to seeing data summarized into an average, it’s not surprising that we can fall into the trap of thinking it’s always appropriate to use. However, experience has taught me that defaulting to an average calculation without first evaluating whether it’s appropriate can inadvertently result in misleading summarizations.

So, the next time you’re thinking of calculating the average of a dataset, pause first and consider the following:

1. What data type am I working with?

It can be tricky to wrap our minds around, but sometimes a value that appears in a data set as a number is actually a categorical label. This commonly occurs in survey analysis, where responses are given on a Likert (numerical) scale. Think of a past survey you may have filled out where you were asked to rate something on a scale of 1 to 5... or maybe 1 to 10.

While these values do exist in an ordinal or logical order (for example, 1 being the lowest and 5 being the highest rating), individuals taking the survey are not usually given an option to choose in between any two of the values. So, when it comes to analyzing this data, reporting an average can be inadvertently misleading because it implies a choice that didn’t really exist.

And for those surveys that also pair a text descriptor with each number on the scale, reporting an average has the potential to become even more confusing. Suppose a 5-point scale where 4 is labeled “Agree” and 5 is “Strongly Agree.” What is the correct interpretation of a calculated average rating of 4.3? What text descriptor would you use to describe a point 30% of the way between Agree and Strongly Agree?

Instead, my preference is to report which value had the most responses (the mode) or to break down the percentage of responses for each value, as in the sketch below. I find that these methods maintain the integrity of the categorical, ordinal nature of the data while still allowing me to summarize in a concise manner.
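For instance, here’s a minimal Python sketch showing both approaches (the survey_responses list is invented purely for illustration):

```python
from collections import Counter

# Hypothetical 1-5 Likert responses; in practice these would come
# from your survey export.
survey_responses = [5, 4, 4, 5, 3, 4, 5, 2, 4, 4, 5, 3, 4, 5, 4]

counts = Counter(survey_responses)
total = len(survey_responses)

# Mode: the rating chosen most often
mode_rating, mode_count = counts.most_common(1)[0]
print(f"Most common rating: {mode_rating} ({mode_count} responses)")

# Percentage breakdown for each rating on the scale
for rating in sorted(counts):
    pct = counts[rating] / total * 100
    print(f"Rating {rating}: {pct:.0f}% of responses")
```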

Pro Tip: Another case where you’ll want to pause before averaging is when the values themselves are already averages or percentages. An average of averages (or of percentages) may not always be appropriate, or you may need to go deeper and apply a weighted average calculation to be accurate. Alice Zhao expertly lays out this scenario and the appropriate weighted average calculation in her video The ONE Mistake MOST Submissions Made in Our Latest Challenge (and How to Fix It).
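To make the idea concrete, here’s a small Python sketch with made-up store averages and response counts (not the example from the video) showing how a naive average of averages can mislead and how weighting by group size corrects it:

```python
# Hypothetical example: average satisfaction score per store,
# with very different numbers of responses behind each average.
group_averages = [4.8, 3.0]   # per-store average scores (invented)
group_sizes    = [10, 1000]   # responses behind each average (invented)

# Naive "average of averages" ignores group size entirely
naive = sum(group_averages) / len(group_averages)

# Weighted average scales each group's average by its size
weighted = (
    sum(avg * n for avg, n in zip(group_averages, group_sizes))
    / sum(group_sizes)
)

print(f"Average of averages: {naive:.2f}")    # 3.90 -- misleading
print(f"Weighted average:    {weighted:.2f}") # ~3.02 -- representative
```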

2. Is this data normally distributed?

When I’m thinking of calculating an average, I envision taking the collective lump sum from a group and dividing it equally amongst them so that every single person gets the exact same amount (central tendency, or level setting, for the data set). In my imagined scenario, some people end up with a little more than what they started with… some a little less. But overall, I expect the number of people who end up with more and the number who end up with less to be about equal.

Now envision that same scenario, but this time one person started out with 100 times more than anyone else. In that case, everyone other than that one person would end up with more than they started with, while that one person would have significantly less. Is the average still an effective summarization of this data set?

This scenario outlines one of the reasons I’ve made it part of my practice to visualize the raw data set before deciding to calculate the average. The presence of outliers (like the one in the scenario above), or data sets that are heavily concentrated on one side (skewed) or contain multiple peaks (multimodal), can dramatically impact whether an average is an effective summarization. Being able to detect these in your raw data set through visualization or statistical testing can help guard against inadvertently presenting an average that may not really be representative of the overall data.

So, what might you use instead of the average if you’ve identified that your data has outliers? A common alternative you may already be familiar with is the median. Because the median gives you the midpoint of the data set, it is not susceptible to being influenced by a few outliers, making it more representative of the entire data set. This is why it’s often used when reporting things like national household income, for example. For skewed or multimodal data, it may be more effective to report the value or values that occur most frequently in the data set (the mode).
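Here’s a quick Python sketch, using invented income figures, that shows how a single outlier drags the mean while the median holds steady:

```python
import statistics

# Nine ordinary incomes plus one extreme outlier (values invented
# for illustration), in $1,000s
incomes = [42, 45, 48, 50, 52, 54, 55, 58, 60, 5000]

print(f"Mean:   {statistics.mean(incomes):.1f}")    # 546.4 -- pulled up by the outlier
print(f"Median: {statistics.median(incomes):.1f}")  # 53.0  -- resistant to it
```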

Pro Tip: When it comes to visualizing the distribution of raw data, Excel has a built-in chart type called a histogram that makes it very quick and easy. Check out this helpful Microsoft support reference: Create a histogram
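And if you happen to be working in Python instead, the same quick check is a few lines of matplotlib. A minimal sketch, using simulated right-skewed data as a stand-in for your raw values:

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated right-skewed data standing in for a real raw data set
rng = np.random.default_rng(42)
data = rng.lognormal(mean=3.0, sigma=0.8, size=1_000)

plt.hist(data, bins=30, edgecolor="black")
plt.title("Distribution of raw values")
plt.xlabel("Value")
plt.ylabel("Count")
plt.show()

# A quick numeric cue: when the mean sits well above the median,
# suspect right skew (and vice versa).
print(f"Mean:   {data.mean():.1f}")
print(f"Median: {np.median(data):.1f}")
```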

3. How much variability is there in the dataset?

One last thing to consider when summarizing using an average involves a little bit of statistics: something called the coefficient of variation, or CV. The CV provides context on how much variability there is around the calculated average (it’s the standard deviation expressed relative to the mean). As with almost every aspect of data analysis and visualization, providing context here can be incredibly useful to the audience!

Consider two sets of 100 data points, both with a calculated average of 55. One of the data sets has values ranging from 1 to 100, while the other ranges from 53 to 56.

Now, imagine we need to make a critical decision that hinges on that calculated average. Which would you feel more confident about? As the decision maker, knowing the data points cluster much more tightly around the average in the second data set may give you a real boost in confidence. Providing the audience with the CV is a way to express this information mathematically. See the pro tip below for how to calculate and interpret %CV.

Pro Tip: Excel provides a relatively easy way to calculate the CV in two simple steps:

1. Calculate the standard deviation for the data set using the STDEV() function (or one of Excel’s other standard deviation functions). For more information on the syntax and selecting the appropriate function, check out this helpful Microsoft support reference: STDEV() function

2. Divide the calculated standard deviation by the calculated average. You can then multiply the resulting CV ratio by 100 to convert to %CV.

The higher the resulting %CV, the more variability there is in the dataset around the average. The closer the value is to zero, the tighter the values are to the calculated average.
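For those working outside Excel, the same two steps translate directly to Python. A small sketch, with two synthetic data sets invented to echo the scenario above (both averaging about 55, one spread wide and one tightly clustered):

```python
import random
import statistics

def pct_cv(values):
    """%CV = (standard deviation / mean) * 100."""
    return statistics.stdev(values) / statistics.mean(values) * 100

# Two synthetic data sets: both average roughly 55, but one is
# spread widely while the other sits between 53 and 56.
random.seed(0)
wide  = [random.uniform(10, 100) for _ in range(100)]
tight = [random.uniform(53, 56) for _ in range(100)]

print(f"Wide set:  mean={statistics.mean(wide):.1f}, %CV={pct_cv(wide):.1f}%")
print(f"Tight set: mean={statistics.mean(tight):.1f}, %CV={pct_cv(tight):.1f}%")
```

The wide set lands at a %CV near 50%, while the tight set comes in under 2%, which is exactly the kind of context a decision maker can use.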

Wrapping Up

Summarizing large data sets into concise insights is a key outcome that high-performing teams rely on data analysts to deliver. So, when it comes to selecting the appropriate calculations and data transformations, it’s critically important to be purposeful and deliberate. Going beyond the default by taking a moment to pause and consider these checkpoints can help ensure you’re selecting an appropriate measure and ultimately raise the quality of your next data analysis.

Interested in learning more from Stacy and the other experts at Maven Analytics?

Don’t miss her other articles here on Medium, and check out our course catalog!

Happy Learning!

-Team Maven
