On MongoDB Aggregation Pipelines

Kartal Kaan Bozdogan
3 min read · Jan 27, 2024


This is the first part of a two-part series, in which I start by introducing MongoDB’s aggregation pipelines and then discuss a sample real-world use case, accompanied by a benchmark demonstrating its superiority over a naive approach. You can get to the second part here.

Aggregation pipelines are a powerful tool for interfacing with MongoDB databases in a declarative and performant way. In most cases, they offer a level of expressiveness and flexibility equivalent to SQL queries.

An aggregation pipeline is always run on a collection. It is composed of a list of stages, where each stage takes as input the list of documents output by the previous stage. The initial stage takes as input the entire collection on which the pipeline is being run. The result of the pipeline is a cursor corresponding to the output of the last stage.
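For instance, invoking a pipeline from the shell looks roughly like this (a minimal sketch; myCollection is a placeholder name, and the stages used here are explained below):

// Running a two-stage pipeline on a collection from mongosh
db.myCollection.aggregate([
{ $set: { foo: "bar" } }, // receives every document in myCollection
{ $count: "total" } // receives the documents output by $set
])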

Let’s have a look at several pipeline stages. The ones that merely transform the input documents are the easiest:

$set

Adds new fields to documents. $set outputs documents that contain all existing fields from the input documents and newly added fields.

The $set stage takes an object (essentially a set of key-value pairs) as its parameter and sets each key in the input documents to the given value. Here is an example:

// Given a database state:
[
{ _id: 0 },
{ _id: 1 }
]

// Running this pipeline on it:
[
{ $set: { foo: "bar" } }
]

// Will result in
[
{ _id: 0, foo: "bar" },
{ _id: 1, foo: "bar" }
]

You can also refer to a field of the document being processed:

// Running this pipeline on the same database:
[
{ $set: { foo: "$_id" } }
]

// Will result in
[
{ _id: 0, foo: 0 },
{ _id: 1, foo: 1 }
]

You can also use aggregation expressions:

// Running this pipeline on the same database:
[
{ $set: { foo: { $add: ["$_id", 1] } } }
]

// Will result in
[
{ _id: 0, foo: 1 },
{ _id: 1, foo: 2 }
]

This allows you to express a wide variety of operations in a purely functional and declarative way, making them amenable to optimization by the query engine while, at least in my opinion, remaining easier to digest than SQL.
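Expressions can also be nested. Here is a small sketch that tags each document as even or odd based on its _id, combining $cond, $eq and $mod:

// Running this pipeline on the same database:
[
{ $set: { parity: { $cond: [ { $eq: [ { $mod: [ "$_id", 2 ] }, 0 ] }, "even", "odd" ] } } }
]

// Will result in
[
{ _id: 0, parity: "even" },
{ _id: 1, parity: "odd" }
]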

The $count stage produces a single output document containing the number of input documents:

// Running this pipeline on the same database:
[
{ $count: "myCount" }
]

// Will result in
[ { myCount: 2 } ]

The $documents stage ignores its input and outputs a given set of documents:

// Running this pipeline on the same database:
[
{ $documents: [
{ foo: "bar" },
{ buz: "fiz" },
]}
]

// Will result in
[
{ foo: "bar" },
{ buz: "fiz" },
]

This might be handy if you want to run the rest of your pipeline on a set of test documents without having to create a test collection containing them.
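One caveat: as far as I know, $documents needs to be the first stage of a database-level pipeline, run with db.aggregate() rather than on a collection. A sketch:

// Running a database-level pipeline with $documents:
db.aggregate([
{ $documents: [ { foo: "bar" }, { buz: "fiz" } ] },
{ $count: "myCount" }
])

// Will result in
[ { myCount: 2 } ]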

$unwind operates on an array field and outputs, for each input document, one document per array element, with the array field replaced by that element:

// Given a database state:
[
{ _id: 0, arr: [ 0, 1 ] },
{ _id: 1, arr: [ 2, 3 ] }
]

// Running this pipeline on it:
[
{ $unwind: "arr"}
]

// Will result in
[
{ _id: 0, arr: 0 },
{ _id: 0, arr: 1 },
{ _id: 1, arr: 2 },
{ _id: 1, arr: 3 },
]

Some other pipeline stages I find interesting include $lookup, corresponding to SQL joins, $facet, which lets you run a single set of documents through multiple sub-pipelines in parallel and collect the results together, $graphLookup, which lets you run recursive graph searches effortlessly, and $group, which lets you group together input documents according to a key or an expression and run aggregations on the resulting groups.
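To give a taste of $group, here is a sketch over a hypothetical collection of orders shaped like { customer: "a", total: 5 }, summing up what each customer has spent:

// Running this pipeline on such a collection:
[
{ $group: { _id: "$customer", totalSpent: { $sum: "$total" } } }
]

// Might result in something like
[
{ _id: "a", totalSpent: 12 },
{ _id: "b", totalSpent: 7 }
]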

In the second part of this series I will talk about how you can design new aggregation pipelines to work with your data and how you might use them to run performant and reliable schema migrations.
