A (Very) Quick Primer on Descriptive Statistics
When computing descriptive statistics and summaries of our dataset, we will often report the following:
- Mean: The arithmetic average of the values in the dataset
- Median: The value lying at the midpoint of the ordered dataset
- Mode: The most frequently occurring value in the dataset
- Range: Difference between largest & smallest values in the dataset
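As a minimal sketch, the four summaries above can be computed in base R. The vector `x` and the helper `stat_mode()` are hypothetical names used only for illustration. Base R has no built-in function for the statistical mode (its own `mode()` reports an object's storage type instead), so a small helper is assumed here.

```r
# Hypothetical example data (not taken from the article)
x <- c(2, 3, 3, 5, 7, 10)

# Hypothetical helper: return the most frequent value(s) in a numeric vector
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[counts == max(counts)])
}

mean(x)          # arithmetic average
median(x)        # midpoint of the ordered values
stat_mode(x)     # most frequently occurring value
diff(range(x))   # largest value minus smallest value
```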
This is by no means a complete list. See here for more definitive descriptions.
However, it is worth noting that we can also make use of the trimmed mean to understand our dataset.
What is a Trimmed Mean?
A trimmed mean removes a proportion of the largest and smallest observations and then takes the average of the numbers that remain in the dataset (Wilcox, 2005).
For example, imagine a dataset as follows:
dataset <- c(1, 2, 5.2, 6, 7.1, 7.5, 7.8, 8.2, 8.4, 15)
Using R, we can compute the untrimmed mean:
mean(dataset)
A trimmed mean which removes 20% of a dataset in total (10% from each end) would remove the first (lowest) and last (highest) values of our ordered dataset of n=10.
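In base R this corresponds to the `trim` argument of `mean()`, which gives the fraction of observations removed from each end, so trimming 20% in total means `trim = 0.1`. A minimal sketch using the example values from this section (the name `trimmed_mean` is only illustrative):

```r
dataset <- c(1, 2, 5.2, 6, 7.1, 7.5, 7.8, 8.2, 8.4, 15)

# trim = 0.1 drops 10% of the observations from each end,
# which is one value per end when n = 10
trimmed_mean <- mean(dataset, trim = 0.1)
trimmed_mean   # 6.525
```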
This can be confirmed by manually removing those two values from our dataset and calculating the ordinary (untrimmed) mean of what remains:
trimmed <- c(2, 5.2, 6, 7.1, 7.5, 7.8, 8.2, 8.4)
mean(trimmed)
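The same manual check can be generalised into a small function. The sketch below assumes a hypothetical helper, `manual_trimmed_mean()`, which sorts the data, drops `floor(n * prop)` values from each end, and averages the rest. It is an illustration of the idea, not a replacement for the built-in `mean(x, trim = ...)`.

```r
dataset <- c(1, 2, 5.2, 6, 7.1, 7.5, 7.8, 8.2, 8.4, 15)

# Hypothetical helper: drop floor(n * prop) values from each end, then average
manual_trimmed_mean <- function(x, prop) {
  stopifnot(prop >= 0, prop < 0.5)
  x <- sort(x)
  k <- floor(length(x) * prop)        # number of values to drop per end
  mean(x[(k + 1):(length(x) - k)])    # untrimmed mean of what remains
}

manual  <- manual_trimmed_mean(dataset, 0.1)
builtin <- mean(dataset, trim = 0.1)
all.equal(manual, builtin)            # both give 6.525
```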
Why use a trimmed mean?
Working from our earlier definition of the median as the value lying at the midpoint of a dataset, we can see that the median is effectively an extreme trimmed mean (Lane).
For example, consider a trimmed mean taken to its limit, trimming 50% from each end of the ordered dataset:
We can compare this to the output of the median:
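A minimal sketch of this comparison, assuming the example values used earlier in this section. In base R, `mean()` returns the median whenever `trim` is 0.5 or larger, so the two results agree. The names `half_trimmed` and `middle` are illustrative only.

```r
dataset <- c(1, 2, 5.2, 6, 7.1, 7.5, 7.8, 8.2, 8.4, 15)

half_trimmed <- mean(dataset, trim = 0.5)  # trims as much as possible from each end
middle       <- median(dataset)            # average of the two middle values, 7.1 and 7.5

half_trimmed   # 7.3
middle         # 7.3
```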
However, why do we use a trimmed mean when we could use a median?
First, a quick recap on sampling and standard error.
If we are taking a sample from a population, our goal is for our estimate to have as low a standard error (SE) as possible. If our data are normally distributed, then the mean will have a low SE.
As the data move away from a normal distribution, for example towards heavier tails, the mean is no longer optimal and the median can have a lower SE. Under a normal distribution, however, the SE of the median will be higher than that of the mean.
This can be confirmed by drawing a sample from a normal distribution in R and computing the SE of both the median and the mean:
library(WRS2) ## Load WRS2 for msmedse() and trimse()
se_mean <- function(x) sd(x) / sqrt(length(x)) ## Function to calculate the SE of the mean
set.seed(1) ## Fix the random seed so the results can be reproduced
norm <- rnorm(50) ## Draw 50 values from a standard normal distribution
hist(norm) ## View histogram
summary(norm) ## View summary statistics
se_mean(norm) ## SE of the mean
msmedse(norm) ## SE of the median
trimse(norm, tr = 0.10) ## SE of the 10% trimmed mean
The trimmed mean acts as a compromise, allowing us to establish a relatively low SE for both normal and non-normal distributions of data (Wilcox, 2005).
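This compromise can be illustrated with a small simulation in base R. The sketch below is built on assumptions: it estimates each SE empirically as the standard deviation of the estimator across repeated samples, and it uses a t distribution with 2 degrees of freedom as a stand-in for non-normal, heavy-tailed data. The seed, the number of repetitions, and the helper name `empirical_se()` are arbitrary choices for illustration.

```r
set.seed(42)   # arbitrary seed, for reproducibility only

n    <- 50     # sample size, matching the example above
reps <- 5000   # number of simulated samples (an assumption)

# Empirical SE: standard deviation of each estimator across repeated samples
empirical_se <- function(draw) {
  est <- replicate(reps, {
    x <- draw(n)
    c(mean = mean(x), trimmed = mean(x, trim = 0.1), median = median(x))
  })
  apply(est, 1, sd)
}

se_normal <- empirical_se(rnorm)                      # normal data
se_heavy  <- empirical_se(function(n) rt(n, df = 2))  # heavy-tailed data

round(rbind(normal = se_normal, heavy_tailed = se_heavy), 3)
```

Under these assumptions the mean should have a lower SE than the median for the normal samples, while the median and the trimmed mean should both have a lower SE than the mean for the heavy-tailed samples.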
Another point of comparison for the trimmed mean against the median is that, with the median, we are heavily protected against outliers, as we take only a single value in the middle. In effect, we are saying that everything but the midpoint is contaminated. While this offers greater protection against outliers, it comes at the expense of statistical power. A trimmed mean effectively allows us to throw away less of the data. While this is quite a broad definition, it is one that is useful to consider.
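To make the outlier protection concrete, the sketch below takes the example values used earlier and replaces the largest one with a hypothetical recording error (15 entered as 150). The names `outlying` and `summarise_centre()` are assumptions made for illustration.

```r
dataset  <- c(1, 2, 5.2, 6, 7.1, 7.5, 7.8, 8.2, 8.4, 15)
outlying <- replace(dataset, length(dataset), 150)  # hypothetical: 15 recorded as 150

# Hypothetical helper: mean, 10%-per-end trimmed mean, and median of a vector
summarise_centre <- function(x) {
  c(mean = mean(x), trimmed = mean(x, trim = 0.1), median = median(x))
}

before <- summarise_centre(dataset)
after  <- summarise_centre(outlying)
rbind(before, after)
```

Here only the mean moves (from 6.82 to 20.32). The trimmed mean (6.525) and the median (7.3) are unchanged, because the extreme value is discarded before averaging.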
Where trimmed means are used
Trimmed means are used in a multitude of different disciplines. In sports scoring, such as at the Olympics, trimmed means are used to help reduce the effects of outlier bias in a sample (Lane). Trimmed means might also be used in consumer price indexes, in order to reduce volatility.
A note on standard error calculations:
It is worth noting that there are situations where the SE of the median can be lower than that of the mean. Wilcox discusses such situations in Understanding and Applying Basic Statistical Methods Using R.
Furthermore, where there are tied values in a dataset, there is currently no known way of accurately estimating the SE of the median.