Statistical Deep Dive: The Arithmetic Mean

Diving deep into the mathematics behind the humble mean

Rob Taylor, PhD
10 min read · Sep 21, 2022

Introduction

The arithmetic mean is typically introduced in statistics courses, along with the mode and median, when discussing measures of central tendency. Even before this stage, though, computing averages is a relatively straightforward, intuitive concept for most: take all the observations you have, sum them, and then divide by the total number of observations. What could be simpler?

But what might not be obvious to everyone is that the average is actually a parameter in a statistical model. It's not a particularly complex model, granted, but there are some pretty interesting (and likely familiar) things going on that form the foundation for more complex analyses. Like all models, when we compute a mean we estimate an unknown parameter from data subject to some conditions; and, as we'll soon see, the mean is the value that is optimal under a particular loss function (the mode and median have analogous properties, but I won't be discussing those today).

Before diving in, I do want to clarify some terminology and draw a distinction between outcomes and observations. Here, I'll use the term outcome to refer to any of the possible results of a random process, or experiment. The set of all possible outcomes is more formally referred to as the sample space of an experiment and is denoted by S. For example, the sample space for a coin flip is heads or tails, which can be denoted S = {H, T}. Similarly, the sample space for a single die throw is S = {1, 2, 3, 4, 5, 6}.

Observations, on the other hand, are the results actually produced when the experiment is run, and they are unknown in advance. For example, if I threw a die 10 times I might observe the following: X = {1, 4, 2, 6, 5, 4, 5, 2, 3, 1}. On each throw, I cannot know what result I'm going to get, but I can record each observation and then make some inferences about the statistical characteristics of the process under observation.
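If it helps to see this distinction in code, here's a minimal Python sketch (the variable names are mine, purely for illustration):

```python
# A sketch of the outcome/observation distinction using the die example.
sample_space = {1, 2, 3, 4, 5, 6}              # S: all possible outcomes of a single throw
observations = [1, 4, 2, 6, 5, 4, 5, 2, 3, 1]  # X: the results actually recorded over 10 throws

# Every observation is drawn from the sample space, but repeats are allowed.
assert all(x in sample_space for x in observations)
```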

Computing the Mean

First, let’s just re-familiarise ourselves with how the arithmetic mean is computed. To keep things nice and simple, let’s assume we have a finite set of observations X = {x₁, x₂, x₃, …, xₙ} sampled from a population and we wish to calculate the sample mean of these values. Symbolically, the arithmetic mean is defined as:

The canonical formula for the arithmetic mean (image by author).

There is nothing too crazy going on here and the formula just restates the verbal description given in the introduction. But what we can also see is that each observation is weighted by a constant amount. This weight, which is the term in front of the summation sign, is the reciprocal of the sample size n.

If this fact isn’t entirely clear to you, let me explain. Suppose instead we have a set of observations where each value is multiplied by the same amount c. Let cX = {cx₁, cx₂, cx₃, …, cxₙ} be the new set of observations. If we were to add all these values together then we’d get the following sequence:

The sum of constantly weighted observations (image by author).

Note that the constant term does not depend on the subscript i, which implies that c is the same (i.e., is constant) for each observed value. Given this fact, c can actually be pulled out of the summation like so:

Reexpression of the sum of constantly weighted observations (image by author).

This gives exactly the same result, which yields the following equality:

Constants can be pulled out of the sum (image by author).
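If you'd like to convince yourself numerically, here's a small Python check using arbitrary values of my own choosing:

```python
import math

x = [3, 7, 1, 9]   # arbitrary observations
c = 2.5            # an arbitrary constant weight

lhs = sum(c * xi for xi in x)  # sum of the weighted values
rhs = c * sum(x)               # the weighted sum of the values

assert math.isclose(lhs, rhs)  # both equal 50.0
```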

So, when dealing with constants, we observe that the sum of the weighted values is the same as the weighted sum of values. If we now let c = 1 / n and substitute, we get back our formula for the mean:

Reexpression of the arithmetic mean (image by author).

For example, let’s compute the mean for the set of die throws introduced earlier: X = {1, 4, 2, 6, 5, 4, 5, 2, 3, 1}. There are n = 10 observations in total, so each result is weighted by 1 / 10 = 0.1. Substituting everything in, we get the following:

Computing the arithmetic mean using observed die outcomes (image by author).
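And, for completeness, here's the same calculation reproduced in Python:

```python
x = [1, 4, 2, 6, 5, 4, 5, 2, 3, 1]
n = len(x)

print(sum(x) / n)                     # 3.3
print(sum((1 / n) * xi for xi in x))  # 3.3 (up to floating-point rounding)
```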

Weighted Means

You’ll no doubt note that our observations are nothing more than a sequence of values drawn from the sample space. Because there is only a finite number of outcomes, we will inevitably wind up with repeated observations once we collect enough of them. Another way to think of the arithmetic mean, then, is as a weighted sum of outcomes, where the weights reflect the frequency with which particular outcomes occur.

We can generalise this idea of a weighted sum by allowing the weights to vary. To demonstrate, let's define a new set of data X′ = {w₁x₁, w₂x₂, w₃x₃, …, wₙxₙ}, where wᵢ is the weight attached to the iᵗʰ value. Because the weights can now differ, we must account for them when computing the mean. The resultant weighted mean can then be expressed like so:

The weighted mean (image by author).

Applying this to our die roll data, we arrive at the following:

Computing the weighted mean using observed die outcomes (image by author).
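Here's a small Python sketch of that weighted-mean calculation, using the standard library's Counter to tally the frequencies (my choice of tooling, purely for illustration):

```python
from collections import Counter

x = [1, 4, 2, 6, 5, 4, 5, 2, 3, 1]
counts = Counter(x)   # frequency of each distinct outcome, e.g. {1: 2, 4: 2, 2: 2, 6: 1, 5: 2, 3: 1}

numerator = sum(w * value for value, w in counts.items())  # sum of w_i * x_i
denominator = sum(counts.values())                         # sum of the weights, which is just n

print(numerator / denominator)  # 3.3
```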

This just gives the same answer, so no surprises there. But note that in the denominator we sum all of the weights together, which, because the weights are frequencies, is just the total number of observations, n. And since each weight in the numerator is effectively divided by this total, every weight is normalised to a value between zero and one. Realising this, we can re-express the weight for each outcome like so:

Re-expressing outcome weights as a proportion (image by author).

It might not be immediately obvious, but all this is equivalent to weighting each observation equally and then summing the weights associated with each distinct outcome. So, if an outcome occurs more often than another it will naturally receive a higher weight. To illustrate, if we let mᵢ denote the number of times the iᵗʰ outcome occurs in X (where mᵢ ≤ n), and each observation is weighted equally such that c = 1 / n, then we can further re-express pᵢ like so:

A further reexpression of the outcome weights (image by author).

The weight for each outcome is simply the constant weight multiplied by the number of times that outcome occurs. If we then plug this back into the expression for the weighted mean, we arrive at the definition of the expected value:

The expected value (image by author).

Applying this to our die-throw data, we can see that the outcomes {1, 2, 4, 5} each occur twice, while outcomes {3, 6} only occur once. This results in the following set of weights, ordered by outcome: P = {0.2, 0.2, 0.1, 0.2, 0.2, 0.1}. Substituting these values into the expected value we obtain the following:

Computing the expected value using observed die outcomes (image by author).
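In Python, the same computation might look like this (again, just an illustrative sketch):

```python
from collections import Counter

x = [1, 4, 2, 6, 5, 4, 5, 2, 3, 1]
n = len(x)

# Relative frequency p_i = m_i / n for each distinct outcome.
p = {value: m / n for value, m in Counter(x).items()}

expected_value = sum(value * weight for value, weight in p.items())
print(expected_value)  # 3.3 (up to floating-point rounding)
```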

Now, typically the expected value, or expectation operator, is defined over a large (possibly infinite) population of outcomes given some probability model. In this case, the expectation is the mean of all possible outcomes weighted by the probability associated with each. Definitionally, it’s what we expect to see, on average, based on the known characteristics of the process. For random processes like fair coin flips and fair die throws the long-run probabilities are the same for every outcome, but this need not be the case. In fact, any probability distribution can be used to calculate expectations.

When dealing with smaller samples from a wider population, however, sampling variability ensures that the observed weight of each outcome will not always match those long-run probabilities. With a finite set of observations, the mean we compute is the weighted average of the outcomes actually observed in X, where the frequency distribution of the unique values in X provides the weights.

The Middle

Whether you compute the mean using the complete sequence of observations or the observed frequency distribution over the possible outcomes, you’ll get the same result. In either case, the effect is to reduce a set of n observations down to a single, representative value that coincides with some middling, or centre, value.

But here’s the catch: what the average considers the middle to be may not agree with what you intuit the middle to be. To most, the average indicates a typical value, something that is representative of the data as a whole. And when the distribution of outcomes is nice and symmetric, this view of the mean is pretty accurate. But as I’m sure many of you are aware, the mean can be adversely affected by the presence of extreme observations, and this can have significant implications for the perceived representativeness of the statistic.

Ultimately, the goodness, or representativeness, of the mean as a summary statistic is a function of the data itself. For example, consider the difference between average and median house prices. Because the price distribution is right-skewed, houses that sell for very large sums draw the average price upwards. It’s for this reason that the median is typically reported instead: it is less sensitive to extreme values.

But why does the mean behave this way? Well, to understand the effect extreme observations have, we first need to understand how the mean works.

Balancing Act

Consider computing the deviations, or residuals, between the mean and each observation. Some observations will be larger than the mean, and others will fall below it. But if we were to add together all the residuals, it turns out that the sum is zero:

Summed deviations between the mean and data (image by author).
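You can verify this numerically with the die-throw data (a quick Python check; any tiny non-zero remainder you see is just floating-point rounding):

```python
x = [1, 4, 2, 6, 5, 4, 5, 2, 3, 1]
mean = sum(x) / len(x)

residuals = [xi - mean for xi in x]
print(sum(residuals))  # 0.0 (up to floating-point rounding)
```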

So, why is this the case? Well, it occurs because the residuals are perfectly balanced on either side of the mean. For the mean, the “middle” is the point where the residuals to the left are cancelled out by the residuals to the right. One neat way to interpret this fact is that the mean marks the data’s centre of mass.

To help build an intuition, imagine you have a plank of wood balancing on a fulcrum with weights positioned on either end. The centre of mass is the point along the plank where the fulcrum would need to be positioned so that the weights on each end cancel out and the plank remains horizontal. Now, if the weights on each end are equal, then the best position for the fulcrum is smack bang in the middle of the plank. But if the weights differ, with one end much heavier than the other, then the fulcrum needs to shift away from the midpoint and toward the heavier end so that things balance out. That’s effectively what the arithmetic mean does.

This balancing act is a property of ordinary least squares (OLS), an estimation procedure that minimises the sum of squared deviations between an unknown parameter and the data. Many of you will no doubt be familiar with this method, particularly its application in linear regression, which models the conditional mean. But the same principle applies when estimating the unconditional mean, too, and it turns out that the arithmetic mean also satisfies the least squares criterion.

Formally, we can cast the computation of the mean within a model-based framework, and as I mentioned earlier, the model is very straightforward and can be written like so:

A model for estimating the mean (image by author).

This simply states that each value in X can be modelled using a single parameter θ plus an error term, which can be interpreted as an intercept-only model. We next need to find the value of θ that is optimal with respect to a defined loss function. If we use the least squares criterion as our loss function (sometimes called L2 loss), then the mean is the unique value that minimises it:

The mean is the only value that minimizes squared error loss (image by author).

I won’t provide a proof of this result here, but you can check for yourself that it’s true by taking the die-throw data and computing the sum of squares for different candidate values of θ. You should see that the mean produces the smallest sum.
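Here's one way you might run that check in Python, using a simple grid search over candidate values of θ (the grid and the helper function are mine, just for illustration):

```python
x = [1, 4, 2, 6, 5, 4, 5, 2, 3, 1]
mean = sum(x) / len(x)  # 3.3

def sum_of_squares(theta, data):
    """Squared-error (L2) loss for an intercept-only model."""
    return sum((xi - theta) ** 2 for xi in data)

# Evaluate the loss on a grid of candidate parameter values: 0.0, 0.1, ..., 7.0.
candidates = [i / 10 for i in range(71)]
best = min(candidates, key=lambda theta: sum_of_squares(theta, x))

print(best, mean)  # 3.3 3.3 -- the arithmetic mean minimises the squared-error loss
```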

Understanding this also helps intuit why the mean is so sensitive to extreme values. Under squared loss, large deviations incur disproportionately large penalties, so the parameter must shift toward extreme observations in order to keep those deviations, and hence the total loss, small. This is just natural behaviour for the mean.

Comparison to the Median

I’ll only touch on this briefly, but fundamentally the reason the median is less sensitive to extreme values is that it satisfies a different loss function. While it is true that the mean and median coincide when data are symmetrical and balanced, this is not a general result. Instead, if we use absolute error in place of squared error, it turns out that the median is the value that minimises the sum of absolute errors:

The median is the only value that minimizes absolute error loss (image by author).

In this case, larger deviations are not amplified by squaring, which elicits more stable behaviour from the estimate. That is why the median is preferred when data contain extreme values.
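To see both points at once, here's a small Python sketch on made-up data containing one extreme value; a grid search under absolute-error loss lands on the median, while the loss evaluated at the mean is far larger:

```python
import statistics

x = [1, 2, 2, 3, 100]          # made-up data with one extreme observation
mean = statistics.mean(x)      # 21.6
median = statistics.median(x)  # 2

def sum_of_abs_errors(theta, data):
    """Absolute-error (L1) loss for an intercept-only model."""
    return sum(abs(xi - theta) for xi in data)

candidates = range(101)  # integer candidates 0, 1, ..., 100
best = min(candidates, key=lambda theta: sum_of_abs_errors(theta, x))

print(best, median)                  # 2 2 -- the median minimises the absolute-error loss
print(sum_of_abs_errors(median, x))  # 100
print(sum_of_abs_errors(mean, x))    # roughly 156.8 -- far larger, because the mean chases the outlier
```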

Thanks for reading!

If you enjoyed this post and would like to stay up to date then please consider following me on Medium. This will ensure you don’t miss out on any new content.

To get unlimited access to all content consider signing up for a Medium subscription.

You can also follow me on Twitter, LinkedIn, or check out my GitHub if that’s more your thing 😉


Rob Taylor, PhD

Former academic, now data professional and writer. Writing about statistics, data science, and analytics. Machine learning enthusiast and avid football fan.