Calculating median with MongoDB’s aggregation pipeline
Finding the minimum, maximum or average values using a MongoDB aggregation is fairly easy, since it has built-in pipeline operators like $min
, $max
, and $avg
respectively. In many cases, however, these values do not describe the given set of data accurately. The median can provide a better representation, especially when there are extremely small or large numbers in the list, that distort the mean.
Median
In statistics and probability theory, the median is the value separating the higher half from the lower half of a data sample. The basic feature of the median in describing data compared to the mean (often simply described as the “average”) is that it is not skewed by a small proportion of extremely large or small values, and therefore provides a better representation of a “typical” value.
By definition, median is the middle value in a set of ordered numbers, which means half of the numbers are below and half are above this number. If the data set has odd elements the median will be the middle one. For example, here are the first 9 elements of the Fibonacci Sequence:
1,1,2,3,5,8,13,21,34
The median here is 5, which is the fifth value.
In the case of even numbers, the median is the mean of two middle numbers. Let’s see the first 10 Fibonacci numbers.
1,1,2,3,5,8,13,21,34,55
The median here would be 6.5, that is (5+8)/2.
MongoDB aggregation pipeline
MongoDB provides the aggregation pipeline tool, which be used to do complicated operations on multiple documents. The following aggregation pipeline example demonstrates a solution for finding the median of a data set.
Example
Consider a collection of simple workflows,
which stores the types and durations in milliseconds.
{"_id": 1, "type": "abc", "duration": 78484}
{"_id": 2, "type": "xyz", "duration": 48515}
{"_id": 3, "type": "abc", "duration": 46459}
{"_id": 4, "type": "xyz", "duration": 3609457}
{"_id": 5, "type": "xyz", "duration": 67750}
{"_id": 6, "type": "abc", "duration": 400851}
{"_id": 7, "type": "xyz", "duration": 217639}
Optionally, we can add filters as the first stage with $match
operator, but let’s jump to the next part. We need to sort all the records in numerical order with $sort
operator:
{ $sort: { duration: 1 } }
Next, it is recommended to group the workflows by the type field with $group
, if we want to distinguish them. In this stage, we also push the duration times into an array and save the count of element numbers by using $sum
.
{
$group: {
_id: {type: '$type'},
duration: { $push: '$duration'},
count: { $sum: 1 }
}
}
Finally, in the last stage the median should be calculated. We use projection in this step with the$project
operator. It has to be decided whether the count of the data set is even or odd. It is done by the$mod
operator, which performs a modulo operation and checks the reminder. Depending on the result the previously mentioned different calculation methods have to be applied, which is covered with a conditional expression.
{
$project: {
_id: 0,
type: '$_id.type',
median: {
$cond: {
if: { $eq: [ { $mod: [ '$count' , 2 ] }, 0 ] },
then: {
$avg: [
{
$arrayElemAt: [
'$duration',
{ $add: [ { $divide: ['$count',2] }, -1 ] }
]
},
{
$arrayElemAt: [
'$duration',
{ $divide: [ '$count', 2 ] }
]
}
]
},
else: {
$arrayElemAt: [
'$duration', { $floor: { $divide: [ '$count', 2 ] } }
]
}
}
}
}
}
The aggregation returns the following result:
{ type: 'xyz', median: 142694.5 }
{ type: 'abc', median: 78484 }
Conclusion
In this article we have introduced a solution for finding the median value of a data set using the MongoDB aggregation pipeline. This is a three step calculation using the operators $sort
, $group
and $project
.