Fill missing values in MongoDB aggregation framework

I recently worked in a project where I had to develop different charts with aggregated data. Standard charts as Average values per hour, Average values per date and Average values per month. I learned to work with the MongoDB aggregation framework and I had to deal with missing values.
It was a very hard task to deal with missing values in application code, until I found this question in stackoverflow and I learned how to fill missing values directly using the aggregation framework. In this article I will go step by step explaining the answer and how I applied it to fill missing values after aggregate values per hour.
Attention: This article assumes basic familiarity with the MongoDB aggregation framework.
A former definition: the MongoDB aggregation framework is basically a set of transformations applied to a collection. Transformations are expressed in JSON. These transformations happen in phases, or better-known as “stages” in MongoDB.
The problem
Let’s say we have a collection with sensor data like the following :
[
{ timestamp: 1566016729, value: 20 },
{ timestamp: 1566018729, value: 21 },
{ timestamp: 1566019729, value: 24 },
...
]And we want: “Between two dates, get the average of the field value per hour”.
Using the MongoDB aggregation framework, and two dummy timestamps, of start date and end date respectively, we would do something like this:
db.getCollection('Sensor').aggregate([
{
$match : {
$and :[ { "timestamp" : { $gte : 1566016729 } },
{ "timestamp" : { $lte : 1569962886 } } ]
}
},
{
$addFields : {
"date" : { $toDate : { $multiply : ["$timestamp",1000] } }
}
},
{
$group : {
_id : { hour : { $hour : "$date" } },
"average" : { $avg : "$value" }
}
},
{
$project : {
_id : 0 ,
hour : '$_id.hour',
average : 1
}
},
])Up to here, we would have the result as desired. But there is limitation. Let’s suppose that between the dates, only were values in the first hour of the day. This is the output we would expect:
[
{ "average" : 33.05 ,"hour" : 1 },
{ "average" : null ,"hour" : 2 },
{ "average" : null ,"hour" : 3 },
...
{ "average" : null ,"hour" : 24 }
]But this is what we actually get:
[ { "average" : 33.05 ,"hour" : 1} ]We could fill missing values in application code, but is something very hard to debug and maintain, especially when we have to deal with dates and timezones. So, why not do it directly in the MongoDB aggregation.
The solution
We have to add extra stages in the aggregation. Alone, these stages could not make that much sense, but together and sequentially applied gave us the result desired.
I will explain the extra stages one by one:
i. This stage, groups all values in a unique object. Whose _id is equal to null and the field points is an array where all the objects of the previous stage get appended .
{
$group: {
_id: null,
points: { $push: "$$ROOT" }
}
}ii. This stage starts from creating an array from 1 to 24. I will iterate over each value using the variable hour (also referred as $$hour). For each value from 1 to 24, gets the index of that value in the original array using the command$indexOfArray . If such value exists in our original array the result will the be actual index, otherwise it will be -1 . This index will be stored in the variable hourIndex (also referred as $$hourIndex).
Later, we check if this index was different from -1 which indicates there was a value in our original array. If so, let the original object pass (with the command $arrayElementAt and the index we found prior, $$hourIndex). If equal to -1 means, there was no such value and we have to create a new object. This takes place in the else branch of the$cond command.
{
$project :{
points : {
$map :{
input : { $range : [ 1, 25 ] },
as : "hour",
in : {
$let: {
vars: {
hourIndex: {
"$indexOfArray":[ "$points.hour","$$hour" ]
}
},
in: {
$cond: {
if: { $ne:[ "$$hourIndex", -1 ] },
then: {
$arrayElemAt: [ "$points"," $$hourIndex" ]
},
else: { hour: "$$hour", average: null }
}
}
}
}
}
}iii. Here we expand all the objects that were shrunk into a single field when we created the field points in the stage i .
{
$unwind : "$points"
}iv. Up to here we carried the id: _null and our result is nested. This stage is the last step to put our values as the original input.
{
$replaceRoot: {
newRoot: "$points"
}
}Our final query would be:
db.getCollection('Sensor').aggregate([
{
$match : {
$and :[ { "timestamp" : { $gte : 1566016729 } },
{ "timestamp" : { $lte : 1569962886 } } ]
}
},
{
$addFields : {
"date" : { $toDate : { $multiply : ["$timestamp",1000] } }
}
},
{
$group : {
_id : { hour : { $hour : "$date" } },
"average" : { $avg : "$value" }
}
},
{
$project : {
_id : 0 ,
hour : '$_id.hour',
average : 1
}
},
{
$group: {
_id: null,
points: { $push: "$$ROOT" }
}
},
{
$project :{
points : {
$map :{
input : { $range : [ 1, 25 ] },
as : "hour",
in : {
$let: {
vars: {
hourIndex: {
"$indexOfArray":[ "$points.hour","$$hour" ]
}
},
in: {
$cond: {
if: { $ne:[ "$$hourIndex", -1 ] },
then: {
$arrayElemAt: [ "$points"," $$hourIndex" ]
},
else: { hour: "$$hour", average: null }
}
}
}
}
}
},
{
$unwind : "$points"
},
{
$replaceRoot: {
newRoot: "$points"
}
}
])What else is possible?
We could extend this idea in the case of make aggregation by months, i.e. between two dates, get the average of the field value per month. And fill with null, in months where there was no value.
Also by dates, between two dates, get the average of the field value per date. And fill with nulls, in dates where there was no value. In the case of dates, it would be necessary(easier) generate a date-range using application code.
Once understood how this process works, we could always fill with null values as needed, given certain aggregation.
Closing Remarks
- The original solution of this process comes from this question in stackoverflow and was explained to fill missing values, after aggregating data by dates. I have only tried to explain what the query does, and how to apply it to other type of aggregations.
- We could save us the extra stages and doing it using application code. But from my experience in this project, these aggregation queries are easier to debug and maintain. If there is an error, we could debug the query isolated from the application stack. Using some tool like robo3T , we could add, remove and edit stages as needed to see how data is being transformed. Once we are satisfied with the aggregation query, is just a matter of translate the aggregation query using a MongoDB driver.
- This idea could be extended to fill null values for several kind of aggregations(per hour, per month, per minute, per year).
- Please, let me know in comments if this solution helped you ;) .
References
Original idea about how to do the aggregation, comes mainly from :
- https://stackoverflow.com/questions/52235027/fill-missing-dates-in-records
- https://stackoverflow.com/questions/26406944/mongodb-aggreagte-fill-missing-days
Software used to develop, test and debug the aggregation queries:
The Mongo DB aggregation framework:
