Explaining Aggregation Pipelines

Guy Harrison
MongoDB Performance Tuning
3 min readSep 18, 2021

--

The MongoDB find() command is versatile and easy to use, but the aggregation framework allows you to take it to the next level. Aggregation pipelines can do anything a find() operation can do and much, much more.

Aggregation pipelines also allow you to simplify your application code by reducing logic that might otherwise require multiple find operations and complex data manipulation. When leveraged correctly, a single aggregation can replace many queries and their associated network round trip times.

However, with all the added benefits that using aggregations brings, it also creates a new set of tuning challenges. The most important thing when tuning aggregation pipelines is to know how to interpret Aggregation explain() output.

Tuning aggregation pipelines

If you are not familiar with using explain() with simple find() queries, see this post or check out the MongoDB documentation.

To use explain, we add the .explain() method after the collection name:

The explain() output for an aggregation pipeline a bit different from that generated for a find(). Firstly, aggregation explain output is organized as a series of stages. The stages array contains each of the aggregation stages as individual objects. For example, here’s simplified output from the above explain():

The stages array contains each step of the aggregation. $cursor is a special step that contains the initial find() operation that feeds into each subsequent stage. It has an execution plan similar to that for a normal explain().

We can feed the ‘executionStats’ argument to explain() which then executes the statement and collects timings and other super-helpful statistics. Let’s grab the output of executionStats into an object we can play with. This aggregation is part of some experiments I’m doing with time-series collections and Window functions:

The explainOut object has a lot of data and can be quite confronting. But if we simply want to get each stage and an idea of how long it took to execute, we can use this simple function — stageTimes — to print an estimate of the time spent in each stage.

Note carefully that the times ascribed to each stage appear to be estimates — hence the name executionTimeMillisEstimate. In at least some cases, we’ve seen these timings to be a bit dubious. In particular, the time for each subsequent stage appears to include the timings for the previous stage — you are looking at cumulative execution times, so the function subtracts the previous cumulative time to reveal the delta for each step.

For our MongoDB tuning book, we wrote a helper function that can give you even more insight. You can get our tuning scripts at https://github.com/gharriso/MongoDBPerformanceTuningBook.

The aggregationExecutionStats function from our tuning script, prints what we think is a handy summary of the execution steps in the aggregation pipeline. It shows each stage as well as the sub-steps within the cursor step; this is where you can see (for instance) if you’ve used an index or a collection scan to drive your aggregation:

What we see here, is that the initial $CURSOR stage used an index to retrieve documents — taking about 4.5 seconds. The $GROUP stage added another 1.2 seconds. The sort added no time to the overall execution.

There’s more detail in the explain() output that you can leverage — the amount of memory required for each stage and the presence of a disk sort for instance. This will come in handy in the next blog post when we’ll look in-depth at some complex aggregations.

Aggregation functions are incredibly powerful, but knowing how to tune them is essential. Our book MongoDB Performance Tuning has some tips and tricks that you might be able to leverage to make the most of your Aggregation functions — see chapter 7 in particular.

Guy Harrison and Michael Harrison are authors of MongoDB Performance Tuning (Apress, 2021). They both work at ProvenDB which, amongst other things, brings the immutability and trust of public Blockchains to the power and productivity of MongoDB.

--

--