MongoDB Aggregation Unleashed: Crafting Complex Queries for Performance

Omkar Langhe
Published in Globant
Apr 4, 2024 · 16 min read
Image source: https://icon-icons.com/icon/mongodb-original-logo/146424

MongoDB is a database known for its flexibility and power in data management, and its Aggregation Framework is popular for the capabilities it offers out of the box.

In today’s data-driven world, businesses and applications face diverse and complex data sets, so it becomes essential to shape, transform, and analyze information as organizations aim to extract meaningful insights from their data. The Aggregation Framework lets engineers process and manipulate data directly within the database.

What to Expect?

In this article, we will explore the full power of the Aggregation Framework. This guide is for MongoDB users of all levels, from beginners to experienced users.

  • Are you a developer looking to improve query performance?
  • Are you a data engineer working on complex data?
  • Are you someone who wants to explore the advanced capabilities of MongoDB?

If you answered yes to any of these questions, this guide is the roadmap for you.

Key Topics

In this article, we will cover key concepts like operators and how to use them efficiently in stages, walk through a variety of examples to build the foundations, and look at performance optimization techniques that speed up queries. We will also compare good and bad Aggregation queries with examples.

  • Foundations of Aggregation: Concepts like stages, operators, and pipelines along with practical examples.
  • Performance Optimization: Boost query performance with indexing and other best practices.
  • Good vs Bad Aggregation queries: See what optimized queries look like compared to poorly written ones.

By the end, you will be able to write better queries and apply optimization strategies in your own MongoDB deployments.

The Foundations of Aggregation

The MongoDB Aggregation Framework processes and transforms input documents through a series of stages. Each stage performs a specific task, and the output of one stage becomes the input of the next stage in the pipeline.

An example Aggregation pipeline with 2 stages. Image Source: https://devopedia.org/images/article/356/7816.1635954560.png

Above is an example pipeline with 2 stages. Let’s break down the image to understand it.

  1. In the first stage, we use the $match operator on the status field to retrieve the documents from the orders collection whose status is equal to A.
  2. The output of the first stage becomes the input to the second stage, where we use the $group operator to group the documents based on the cust_id field.
  3. The documents with cust_id A123 form one group, and the $sum operator adds up their amount fields. Hence, cust_id A123 has a total of 750.
  4. Similarly, the documents with cust_id B212 form another group, with a total of 200.
  5. In the end, the query outputs 2 documents: A123 with a total of 750 and B212 with a total of 200.
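The figure’s pipeline can be sketched in plain JavaScript to see the data flow. The documents below are hypothetical, chosen only to be consistent with the totals in the image:

```javascript
// Hypothetical orders documents consistent with the figure above.
const orders = [
  { cust_id: "A123", amount: 500, status: "A" },
  { cust_id: "A123", amount: 250, status: "A" },
  { cust_id: "B212", amount: 200, status: "A" },
  { cust_id: "A123", amount: 300, status: "D" },
];

// Stage 1: $match — keep only documents whose status is "A".
const matched = orders.filter((doc) => doc.status === "A");

// Stage 2: $group — group by cust_id and $sum the amount field.
const totals = {};
for (const doc of matched) {
  totals[doc.cust_id] = (totals[doc.cust_id] || 0) + doc.amount;
}

console.log(totals); // { A123: 750, B212: 200 }
```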

The above example might be intimidating at first. So, let’s start with the basics and make step-by-step progress toward advanced concepts.

In the following sections, we'll see several operators we can use in pipeline stages.

$match — Filters specific documents

The $match operator allows you to scan your documents and select only the ones that meet the given criteria. For instance, assume we have the following data about blog writers in an articles collection, with their scores and number of views.

[
{"_id":ObjectId("512bc95fe835e68f199c8686"),"author":"dave","score":80,"views":100},
{"_id":ObjectId("512bc962e835e68f199c8687"),"author":"dave","score":85,"views":521},
{"_id":ObjectId("55f5a192d4bede9ac365b257"),"author":"ahn","score":60,"views":1000},
{"_id":ObjectId("55f5a192d4bede9ac365b258"),"author":"li","score":55,"views":5000},
{"_id":ObjectId("55f5a1d3d4bede9ac365b259"),"author":"annT","score":60,"views":50},
{"_id":ObjectId("55f5a1d3d4bede9ac365b25a"),"author":"li","score":94,"views":999},
{"_id":ObjectId("55f5a1d3d4bede9ac365b25b"),"author":"ty","score":95,"views":1000}
]

Suppose we want to filter the data to only the author "dave"; we would do this as follows with the $match stage.

db.articles.aggregate([
  {
    $match: {
      author: "dave"
    }
  }
]);

The results would be as follows.

[
  {
    "_id": ObjectId("512bc95fe835e68f199c8686"),
    "author": "dave",
    "score": 80,
    "views": 100
  },
  {
    "_id": ObjectId("512bc962e835e68f199c8687"),
    "author": "dave",
    "score": 85,
    "views": 521
  }
]

You would use $match whenever you want to filter documents of a collection and pick only those which you need.
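Conceptually, a simple equality $match behaves like filtering an in-memory array. The sketch below mimics the query above in plain JavaScript, with the ObjectIds shortened to strings for readability:

```javascript
// The articles sample, with _id simplified to a string for this sketch.
const articles = [
  { _id: "512bc95f", author: "dave", score: 80, views: 100 },
  { _id: "512bc962", author: "dave", score: 85, views: 521 },
  { _id: "55f5a192", author: "ahn", score: 60, views: 1000 },
  { _id: "55f5a193", author: "li", score: 55, views: 5000 },
];

// { $match: { author: "dave" } } behaves like an equality filter.
const daves = articles.filter((doc) => doc.author === "dave");

console.log(daves.length); // 2
```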

$group — Groups documents by key

The $group operator separates documents into groups according to a "group key". For instance, assume we have a sales collection as shown below. We will use a $group stage to count the number of documents in the collection.

[
{"_id":1,"item":"abc","price":Decimal128("10"),"quantity":Int32("2"),"date":ISODate("2014-03-01T08:00:00Z")},
{"_id":2,"item":"jkl","price":Decimal128("20"),"quantity":Int32("1"),"date":ISODate("2014-03-01T09:00:00Z")},
{"_id":3,"item":"xyz","price":Decimal128("5"),"quantity":Int32("10"),"date":ISODate("2014-03-15T09:00:00Z")},
{"_id":4,"item":"xyz","price":Decimal128("5"),"quantity":Int32("20"),"date":ISODate("2014-04-04T11:21:39.736Z")},
{"_id":5,"item":"abc","price":Decimal128("10"),"quantity":Int32("10"),"date":ISODate("2014-04-04T21:23:13.331Z")},
{"_id":6,"item":"def","price":Decimal128("7.5"),"quantity":Int32("5"),"date":ISODate("2015-06-04T05:08:13Z")},
{"_id":7,"item":"def","price":Decimal128("7.5"),"quantity":Int32("10"),"date":ISODate("2015-09-10T08:43:00Z")},
{"_id":8,"item":"abc","price":Decimal128("10"),"quantity":Int32("5"),"date":ISODate("2016-02-06T20:20:13Z")}
]

Suppose we want to group all the documents with $group and calculate the total document count using the $count accumulator, in a single query as shown below.

db.sales.aggregate([
  {
    $group: {
      _id: null,
      count: { $count: {} }
    }
  }
])

The results would be as follows.

{
  "_id": null,
  "count": 8
}
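Grouping with _id: null puts everything into a single group. A more typical use is grouping by a field; the plain-JavaScript sketch below mimics { $group: { _id: "$item", count: { $count: {} } } } over the same sales data (wrappers like Decimal128 and ISODate dropped for readability):

```javascript
// The sales sample, reduced to the fields this sketch needs.
const sales = [
  { _id: 1, item: "abc" }, { _id: 2, item: "jkl" },
  { _id: 3, item: "xyz" }, { _id: 4, item: "xyz" },
  { _id: 5, item: "abc" }, { _id: 6, item: "def" },
  { _id: 7, item: "def" }, { _id: 8, item: "abc" },
];

// Count documents per group key, like $group with { $count: {} }.
const counts = {};
for (const doc of sales) {
  counts[doc.item] = (counts[doc.item] || 0) + 1;
}

console.log(counts); // { abc: 3, jkl: 1, xyz: 2, def: 2 }
```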

$project — Selects fields to include in the output

The $project operator passes along only the fields a developer specifies, sending those to the client. For instance, let’s assume we have the following document in a books collection.

{
  "_id": 1,
  title: "abc123",
  isbn: "0001122223334",
  author: {
    last: "zzz",
    first: "aaa"
  },
  copies: 5
}

Suppose we want to include only the _id, title, and author fields in the output document; we would do this as follows with $project. Note that _id is included by default unless explicitly excluded.

db.books.aggregate([
  {
    $project: {
      title: 1,
      author: 1
    }
  }
])

The results would be as follows.

{
  "_id": 1,
  "title": "abc123",
  "author": {
    "last": "zzz",
    "first": "aaa"
  }
}
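In plain JavaScript terms, $project is just picking keys off each document; note that _id travels along by default:

```javascript
const book = {
  _id: 1,
  title: "abc123",
  isbn: "0001122223334",
  author: { last: "zzz", first: "aaa" },
  copies: 5,
};

// { $project: { title: 1, author: 1 } } keeps _id (included by default),
// title, and author, and drops everything else.
const projected = { _id: book._id, title: book.title, author: book.author };

console.log(Object.keys(projected)); // [ '_id', 'title', 'author' ]
```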

$sort — Sorts the documents

The $sort operator sorts the input documents in ascending or descending order. For instance, consider the following restaurants collection with restaurant names and boroughs.

[
{"_id":1,"name":"Central Park Cafe","borough":"Manhattan"},
{"_id":2,"name":"Rock A Feller Bar and Grill","borough":"Queens"},
{"_id":3,"name":"Empire State Pub","borough":"Brooklyn"},
{"_id":4,"name":"Stan's Pizzaria","borough":"Manhattan"},
{"_id":5,"name":"Jane's Deli","borough":"Brooklyn"}
]

We can use the $sort stage on the borough field to sort in ascending order.

db.restaurants.aggregate([
  {
    $sort: {
      borough: 1
    }
  }
])

The results would be as follows.

[
{"_id":3,"name":"Empire State Pub","borough":"Brooklyn"},
{"_id":5,"name":"Jane's Deli","borough":"Brooklyn"},
{"_id":1,"name":"Central Park Cafe","borough":"Manhattan"},
{"_id":4,"name":"Stan's Pizzaria","borough":"Manhattan"},
{"_id":2,"name":"Rock A Feller Bar and Grill","borough":"Queens"}
]

In this example, the sort order may be inconsistent, since the borough field contains duplicate values for both Manhattan and Brooklyn. Documents are returned in alphabetical order by borough, but the order of those documents with duplicate values for borough might not be the same across multiple executions of the same sort.
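One common remedy from the MongoDB documentation is to include a unique field, such as _id, in the sort key, e.g. { $sort: { borough: 1, _id: 1 } }. The plain-JavaScript comparator below mimics that compound sort (a sketch of the ordering logic only, not a real $sort):

```javascript
const restaurants = [
  { _id: 1, borough: "Manhattan" },
  { _id: 2, borough: "Queens" },
  { _id: 3, borough: "Brooklyn" },
  { _id: 4, borough: "Manhattan" },
  { _id: 5, borough: "Brooklyn" },
];

// Sort by borough ascending, then by the unique _id to break ties,
// so the order is the same on every run.
restaurants.sort((a, b) =>
  a.borough === b.borough ? a._id - b._id : a.borough < b.borough ? -1 : 1
);

console.log(restaurants.map((r) => r._id)); // [ 3, 5, 1, 4, 2 ]
```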

$limit — Restricts documents

The $limit operator limits the number of documents passed to the next stage. For instance, consider the following example.

db.article.aggregate([
  {
    $limit: 5
  }
]);

This operation returns only the first five documents passed via the pipeline. This operation does not change the contents of the documents it passes.

$skip — Omits documents

The $skip operator allows you to skip a given number of documents and pass the remaining ones along. For instance, consider the following example.

db.article.aggregate([
  {
    $skip: 5
  }
]);

This operation skips the first five documents in the pipeline. Also, this operation does not alter the content of the documents it passes.
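Together, $skip and $limit are the classic pagination pair. In plain JavaScript the combination maps to Array.slice, as this sketch (with a hypothetical page size and page number) shows:

```javascript
// Hypothetical: 12 documents, pages of 5.
const docs = Array.from({ length: 12 }, (_, i) => ({ _id: i + 1 }));

const pageSize = 5;
const page = 2; // 1-based page number

// [{ $skip: (page - 1) * pageSize }, { $limit: pageSize }] maps to slice():
const start = (page - 1) * pageSize;
const pageDocs = docs.slice(start, start + pageSize);

console.log(pageDocs.map((d) => d._id)); // [ 6, 7, 8, 9, 10 ]
```

Note that, as the Pipeline Optimization section discusses later, large $skip values get slow on big collections, so this pattern suits shallow pagination best.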

$lookup — Left outer join

The $lookup operator performs a left outer join with another collection in the same database. To understand it, it helps to first look at the syntax of $lookup.

{
  $lookup: {
    from: <collection to join>,
    localField: <field from the input documents>,
    foreignField: <field from the documents of the "from" collection>,
    as: <output array field>
  }
}

Now let’s create a collection called orders with the documents below.

// Collection 1

[
{"_id":1,"item":"almonds","price":12,"quantity":2},
{"_id":2,"item":"pecans","price":20,"quantity":1},
{"_id":3}
]

Create another collection called inventory with the documents below.

// Collection 2

[
{"_id":1,"sku":"almonds","description":"product 1","instock":120},
{"_id":2,"sku":"bread","description":"product 2","instock":80},
{"_id":3,"sku":"cashews","description":"product 3","instock":60},
{"_id":4,"sku":"pecans","description":"product 4","instock":70},
{"_id":5,"sku":null,"description":"Incomplete"},
{"_id":6}
]

Now we will do an Aggregation operation on the orders collection. Consider the following query.

db.orders.aggregate([
  {
    $lookup: {
      from: "inventory",
      localField: "item",
      foreignField: "sku",
      as: "inventory_docs"
    }
  }
])

This operation joins the orders documents with the documents from the inventory collection, matching the item field from the orders collection against the sku field from the inventory collection.

The operation returns the documents shown below.

[
  {
    "_id": 1,
    "item": "almonds",
    "price": 12,
    "quantity": 2,
    "inventory_docs": [
      {
        "_id": 1,
        "sku": "almonds",
        "description": "product 1",
        "instock": 120
      }
    ]
  },
  {
    "_id": 2,
    "item": "pecans",
    "price": 20,
    "quantity": 1,
    "inventory_docs": [
      {
        "_id": 4,
        "sku": "pecans",
        "description": "product 4",
        "instock": 70
      }
    ]
  },
  {
    "_id": 3,
    "inventory_docs": [
      {
        "_id": 5,
        "sku": null,
        "description": "Incomplete"
      },
      {
        "_id": 6
      }
    ]
  }
]
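The join semantics, including why order _id 3 picks up the sku: null and missing-sku inventory documents, can be mimicked in plain JavaScript ($lookup treats a missing field like null when matching):

```javascript
const orders = [
  { _id: 1, item: "almonds", price: 12, quantity: 2 },
  { _id: 2, item: "pecans", price: 20, quantity: 1 },
  { _id: 3 },
];
const inventory = [
  { _id: 1, sku: "almonds", description: "product 1", instock: 120 },
  { _id: 2, sku: "bread", description: "product 2", instock: 80 },
  { _id: 3, sku: "cashews", description: "product 3", instock: 60 },
  { _id: 4, sku: "pecans", description: "product 4", instock: 70 },
  { _id: 5, sku: null, description: "Incomplete" },
  { _id: 6 },
];

// Treat a missing field as null, then left-outer-join on item === sku.
const norm = (v) => (v === undefined ? null : v);
const joined = orders.map((order) => ({
  ...order,
  inventory_docs: inventory.filter((inv) => norm(inv.sku) === norm(order.item)),
}));

console.log(joined[2].inventory_docs.map((d) => d._id)); // [ 5, 6 ]
```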

So far, we have covered the most commonly used operators in the Aggregation pipeline, along with examples, and learned how to construct stages from them. Now, let’s consider one more example that combines several stages to produce a single result.

Create a sample collection named sales with the following documents.

db.sales.insertMany([
{"_id":1,"item":"abc","price":Decimal128("10"),"quantity":Int32("2"),"date":ISODate("2014-03-01T08:00:00Z")},
{"_id":2,"item":"jkl","price":Decimal128("20"),"quantity":Int32("1"),"date":ISODate("2014-03-01T09:00:00Z")},
{"_id":3,"item":"xyz","price":Decimal128("5"),"quantity":Int32("10"),"date":ISODate("2014-03-15T09:00:00Z")},
{"_id":4,"item":"xyz","price":Decimal128("5"),"quantity":Int32("20"),"date":ISODate("2014-04-04T11:21:39.736Z")},
{"_id":5,"item":"abc","price":Decimal128("10"),"quantity":Int32("10"),"date":ISODate("2014-04-04T21:23:13.331Z")},
{"_id":6,"item":"def","price":Decimal128("7.5"),"quantity":Int32("5"),"date":ISODate("2015-06-04T05:08:13Z")},
{"_id":7,"item":"def","price":Decimal128("7.5"),"quantity":Int32("10"),"date":ISODate("2015-09-10T08:43:00Z")},
{"_id":8,"item":"abc","price":Decimal128("10"),"quantity":Int32("5"),"date":ISODate("2016-02-06T20:20:13Z")}
])

Now let’s construct a pipeline that calculates the total sales amount, average sales quantity, and sale count for each day in the year 2014.

db.sales.aggregate([
  // First Stage
  {
    $match: {
      "date": {
        $gte: new ISODate("2014-01-01"),
        $lt: new ISODate("2015-01-01")
      }
    }
  },
  // Second Stage
  {
    $group: {
      _id: {
        $dateToString: {
          format: "%Y-%m-%d",
          date: "$date"
        }
      },
      totalSaleAmount: {
        $sum: {
          $multiply: ["$price", "$quantity"]
        }
      },
      averageQuantity: {
        $avg: "$quantity"
      },
      count: {
        $sum: 1
      }
    }
  },
  // Third Stage
  {
    $sort: {
      totalSaleAmount: -1
    }
  }
])

Now let’s break down each stage to understand the complete query.

  1. First Stage — The $match stage filters the documents based on the date field. The condition uses $gte and $lt operators and checks for documents where the date is greater than or equal to January 1, 2014, and less than January 1, 2015.
  2. Second Stage — The $group stage groups the matched documents by the formatted date using the $dateToString operator. The total sale amount for each group is calculated using the $sum operator with $multiply, which multiplies the price and quantity fields. The average quantity for each group is calculated using the $avg operator on the quantity field. The count of documents in each group is obtained using the $sum operator.
  3. Third Stage: The $sort stage sorts results by totalSaleAmount in descending order.

The operation returns the following results:

[
  {
    "_id": "2014-04-04",
    "totalSaleAmount": Decimal128("200"),
    "averageQuantity": 15,
    "count": 2
  },
  {
    "_id": "2014-03-15",
    "totalSaleAmount": Decimal128("50"),
    "averageQuantity": 10,
    "count": 1
  },
  {
    "_id": "2014-03-01",
    "totalSaleAmount": Decimal128("40"),
    "averageQuantity": 1.5,
    "count": 2
  }
]
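To make each stage’s contribution concrete, here is a plain-JavaScript mirror of the three stages (a sketch only, with the Decimal128/ISODate wrappers dropped and one 2015 document included to show the $match filter working):

```javascript
const sales = [
  { item: "abc", price: 10, quantity: 2, date: new Date("2014-03-01T08:00:00Z") },
  { item: "jkl", price: 20, quantity: 1, date: new Date("2014-03-01T09:00:00Z") },
  { item: "xyz", price: 5, quantity: 10, date: new Date("2014-03-15T09:00:00Z") },
  { item: "xyz", price: 5, quantity: 20, date: new Date("2014-04-04T11:21:39.736Z") },
  { item: "abc", price: 10, quantity: 10, date: new Date("2014-04-04T21:23:13.331Z") },
  { item: "def", price: 7.5, quantity: 5, date: new Date("2015-06-04T05:08:13Z") },
];

// Stage 1: $match on the date range.
const in2014 = sales.filter(
  (s) => s.date >= new Date("2014-01-01") && s.date < new Date("2015-01-01")
);

// Stage 2: $group by the formatted day, accumulating sum, average, and count.
const byDay = {};
for (const s of in2014) {
  const day = s.date.toISOString().slice(0, 10); // like $dateToString "%Y-%m-%d"
  if (!byDay[day]) byDay[day] = { totalSaleAmount: 0, quantitySum: 0, count: 0 };
  const g = byDay[day];
  g.totalSaleAmount += s.price * s.quantity;
  g.quantitySum += s.quantity;
  g.count += 1;
}

// Stage 3: $sort by totalSaleAmount descending.
const results = Object.entries(byDay)
  .map(([day, g]) => ({
    _id: day,
    totalSaleAmount: g.totalSaleAmount,
    averageQuantity: g.quantitySum / g.count,
    count: g.count,
  }))
  .sort((a, b) => b.totalSaleAmount - a.totalSaleAmount);

console.log(results[0]); // { _id: '2014-04-04', totalSaleAmount: 200, averageQuantity: 15, count: 2 }
```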

From the examples above, you should now have a good idea of how the Aggregation pipeline works.

Performance Optimization

We can improve Aggregation queries using techniques like indexing and other best practices. Here are some tips to improve the performance of your MongoDB Aggregation queries; the following sections show how to use indexes to speed up data retrieval on a collection.

Single Field Indexing

Single Field Indexing involves creating an index on a single field in a collection. Here’s how you can create a single-field index on fieldName in MongoDB.

db.collection.createIndex({ fieldName: 1 });

db.collection.createIndex() is a function in the MongoDB shell and in MongoDB’s driver APIs that creates indexes. Let’s see how to use it with an example.

Using MongoDB Shell

  • Connect to MongoDB — Open a terminal and run the MongoDB shell by typing mongo. Connect to your MongoDB server by specifying the database and providing the connection details.
mongo --host your_host --port your_port --username your_username --password your_password your_database
  • Create Single Field Index — Use the createIndex method to create an index on a specific field. For example, if you want to create an index on the email field in the users collection, use the following command.
// The 1 indicates an ascending index. If you want a descending index, use -1 instead.
db.users.createIndex({ email: 1 });
  • Verify Index Creation — We can verify that the index has been created by using the getIndexes method. For instance, the below command will display a list of indexes on the users collection.
db.users.getIndexes();

Using MongoDB Driver in a Programming Language (e.g., Node.js with MongoDB Node.js Driver)

  • Install MongoDB Driver — Make sure you have the MongoDB driver installed in your project. You can install it using npm.
npm install mongodb
  • Connect to MongoDB — In your code, establish a connection to the MongoDB server using the driver.
const { MongoClient } = require('mongodb');

const uri = 'mongodb://your_username:your_password@your_host:your_port/your_database';
const client = new MongoClient(uri, { useNewUrlParser: true, useUnifiedTopology: true });

async function connect() {
await client.connect();
console.log('Connected to MongoDB');
}

connect();
  • Create Single Field Index — Once connected, create a single-field index using the createIndex method.
const db = client.db('your_database');
const collection = db.collection('users');

// The 1 indicates an ascending index, similar to the MongoDB shell.
await collection.createIndex({ email: 1 });
  • Verify Index Creation — You can verify the index creation by using the listIndexes method. The below example outputs the list of indexes created.
// listIndexes() returns a cursor, so convert it to an array first.
const indexes = await collection.listIndexes().toArray();
console.log(indexes);

Compound Field Indexing

Compound Indexing allows you to index and sort data on two or more fields within a collection. Here’s how you can create a compound index on the field1 and field2 fields in MongoDB.

db.collection.createIndex({ field1: 1, field2: 1 });

Let’s understand how to use it by considering one example.

Using MongoDB Shell

  • Connect to MongoDB — Open a terminal and run the MongoDB shell by typing mongo. Connect to your MongoDB server by specifying the database and providing the connection details.
mongo --host your_host --port your_port --username your_username --password your_password your_database
  • Create Compound Index — Use the createIndex method to create a compound index on multiple fields. For instance, if you want to create a compound index on the category and price fields in the products collection, use the following command.
// The 1 on category indicates an ascending index, and the -1 on price indicates a descending index.
db.products.createIndex({ category: 1, price: -1 });
  • Verify Index Creation — You can verify that the compound index has been created by using the getIndexes method. For instance, the below command will display a list of indexes on the products collection.
db.products.getIndexes();

Using MongoDB Driver in a Programming Language (e.g., Node.js with MongoDB Node.js Driver)

  • Install MongoDB Driver — Make sure you have the MongoDB driver installed in your project. You can install it using npm.
npm install mongodb
  • Connect to MongoDB — In your code, establish a connection to the MongoDB server using the driver.
const { MongoClient } = require('mongodb');

const uri = 'mongodb://your_username:your_password@your_host:your_port/your_database';
const client = new MongoClient(uri, { useNewUrlParser: true, useUnifiedTopology: true });

async function connect() {
await client.connect();
console.log('Connected to MongoDB');
}

connect();
  • Create Compound Index — Once connected, create a compound index using the createIndex method.
const db = client.db('your_database');
const collection = db.collection('products');

// The 1 indicates an ascending index and -1 indicates a descending index, similar to the MongoDB shell.
await collection.createIndex({ category: 1, price: -1 });
  • Verify Index Creation — You can verify the index creation by using the listIndexes method. The below example outputs the list of indexes created.
// listIndexes() returns a cursor, so convert it to an array first.
const indexes = await collection.listIndexes().toArray();
console.log(indexes);

Text Indexing

Text Indexing enables full-text search on string fields within a collection. It allows you to search text using the $text operator, and supports features such as language-specific stemming, stop words, and relevance scoring. Suppose you have a collection of documents called articles and want full-text search on the content field. You can create a text index to support text search queries. Here’s how you can create a text index in MongoDB.

// create a text index on content field
db.articles.createIndex({ content: "text" });

Once you create a text index, you can use the $text operator in your queries, as shown below.

// Find documents that contain the word "MongoDB" in the "content" field
db.articles.find({ $text: { $search: "MongoDB" } });

Let’s understand how to use it by considering one example.

Using MongoDB Shell

  • Connect to MongoDB — Open a terminal and run the MongoDB shell by typing mongo. Connect to your MongoDB server by specifying the database and providing the connection details.
mongo --host your_host --port your_port --username your_username --password your_password your_database
  • Create Text Index — Use the createIndex method to create a text index on the field containing text content. For example, if you want to create a text index on the content field in the articles collection, use the following command.
db.articles.createIndex({ content: "text" });
  • Perform Text Search — You can then perform text search queries using the $text operator. For instance, we can find documents containing the word “MongoDB” in the content field.
db.articles.find({ $text: { $search: "MongoDB" } });

Using MongoDB Driver in a Programming Language (e.g., Node.js with MongoDB Node.js Driver)

  • Install MongoDB Driver — Make sure you have the MongoDB driver installed in your project. You can install it using npm.
npm install mongodb
  • Connect to MongoDB — In your code, establish a connection to the MongoDB server using the driver.
const { MongoClient } = require('mongodb');

const uri = 'mongodb://your_username:your_password@your_host:your_port/your_database';
const client = new MongoClient(uri, { useNewUrlParser: true, useUnifiedTopology: true });

async function connect() {
await client.connect();
console.log('Connected to MongoDB');
}

connect();
  • Create Text Index — Once connected, create a text index using the createIndex method.
const db = client.db('your_database');
const collection = db.collection('articles');

await collection.createIndex({ content: "text" });
  • Perform Text Search — You can perform text search queries using the $text operator:
const searchResults = await collection.find({ $text: { $search: "mongodb" } }).toArray();
console.log(searchResults);
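To build intuition, the sketch below mimics only the most basic behavior of a $text search with a case-insensitive word match. This is greatly simplified; a real text index also applies stemming, stop words, and relevance scoring:

```javascript
// Hypothetical documents standing in for the articles collection.
const articles = [
  { _id: 1, content: "MongoDB aggregation pipelines are powerful" },
  { _id: 2, content: "Relational joins work differently" },
  { _id: 3, content: "Indexing in mongodb speeds up queries" },
];

// A rough stand-in for { $text: { $search: term } }:
// tokenize on non-word characters and match case-insensitively.
const search = (term) =>
  articles.filter((a) =>
    a.content.toLowerCase().split(/\W+/).includes(term.toLowerCase())
  );

console.log(search("MongoDB").map((a) => a._id)); // [ 1, 3 ]
```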

Wildcard Indexing

A wildcard text index covers every string field in a collection without you having to name the fields. This is particularly useful in scenarios where you perform full-text searches across many fields.

Suppose you want to enable full-text search across fields such as title, description, and content. A wildcard index lets you create a single index to support text search queries on all of them. Here’s how you can create a wildcard text index in MongoDB.

db.articles.createIndex({ "$**": "text" });

Once you create a wildcard index, use the $text operator in your queries to search across all string fields covered by the index:

// Find documents that contain the word "MongoDB" in any string field
db.articles.find({ $text: { $search: "MongoDB" } });

Use case — Only use wildcard indexes when the fields you want to index are unknown or may change later. Consider wildcard indexes in the following scenarios.

  • If your application queries a collection where field names differ between documents, create a wildcard index.
  • If your application queries embedded documents where subfields are inconsistent, create a wildcard index to support queries on subfields.

Sparse Indexing

A sparse index only contains entries for documents that have the indexed field, even if the field’s value is null.

In other words, it skips any document that is missing the indexed field. The index is “sparse” because it does not include every document in the collection.

To create a sparse index, use the createIndex() method with the sparse option set to true. For instance, the below operation in mongosh creates a sparse index on the cust_id field of customers collection.

db.customers.createIndex( { "cust_id": 1 }, { sparse: true } )
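Before looking at concrete shell and driver steps, the entry-selection rule can be sketched in plain JavaScript: a document gets an index entry only when the field exists on it, and a present-but-null value still counts as present.

```javascript
// Hypothetical customers documents.
const customers = [
  { _id: 1, cust_id: "A1" },
  { _id: 2 },                // no cust_id field: excluded from a sparse index
  { _id: 3, cust_id: null }, // field present with null value: still indexed
  { _id: 4, cust_id: "B2" },
];

// A sparse index only creates an entry when the field exists on the document.
const indexEntries = customers
  .filter((doc) => "cust_id" in doc)
  .map((doc) => ({ key: doc.cust_id, id: doc._id }));

console.log(indexEntries.length); // 3
```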

Let’s understand how to use it by considering one example.

Using MongoDB Shell

  • Connect to MongoDB — Open a terminal and run the MongoDB shell by typing mongo. Connect to your MongoDB server by specifying the database and providing the connection details.
mongo --host your_host --port your_port --username your_username --password your_password your_database
  • Create Sparse Index — Use the createIndex() method with the { sparse: true } option to create a sparse index on a specific field. For instance, suppose the status field exists only in a subset of documents in the orders collection and you want to index it.
db.orders.createIndex({ status: 1 }, { sparse: true });
  • Verify Index Creation — You can verify that the sparse index has been created by using the getIndexes method.
db.orders.getIndexes();

Using MongoDB Driver in a Programming Language (e.g., Node.js with MongoDB Node.js Driver)

  • Install MongoDB Driver — Make sure you have the MongoDB driver installed in your project. You can install it using npm.
npm install mongodb
  • Connect to MongoDB — In your code, establish a connection to the MongoDB server using the driver.
const { MongoClient } = require('mongodb');

const uri = 'mongodb://your_username:your_password@your_host:your_port/your_database';
const client = new MongoClient(uri, { useNewUrlParser: true, useUnifiedTopology: true });

async function connect() {
await client.connect();
console.log('Connected to MongoDB');
}

connect();
  • Create Sparse Index — Once connected, create a sparse index using the createIndex method.
const db = client.db('your_database');
const collection = db.collection('orders');

await collection.createIndex({ status: 1 }, { sparse: true });
  • Verify Index Creation — You can verify the index creation by using the listIndexes method. The below example outputs the list of indexes created.
// listIndexes() returns a cursor, so convert it to an array first.
const indexes = await collection.listIndexes().toArray();
console.log(indexes);

Remember that creating an index is a one-time operation, and you only need to do it once for each field you want to index. However, it’s crucial to consider the impact on write performance and disk space, especially in scenarios with frequent write operations.

Pipeline Optimization

It is recommended to keep Aggregation pipelines simple to avoid performance issues; unnecessary stages can degrade performance.

  • Use the $match stage early in the pipeline to filter out unnecessary documents.
  • Use $project to include only the fields needed for further processing. This reduces the amount of data passed through the pipeline.
  • Use the $sort stage early in the pipeline if required. This allows MongoDB to take advantage of indexes for sorting.
  • Large values for $skip can be inefficient, especially with large datasets. Avoid them and consider alternative pagination strategies.
  • Avoid $out whenever possible. This stage writes the results of the Aggregation pipeline to a new collection, which can be resource-intensive.
  • When using the $lookup stage, try to use local fields that are indexed. This can improve the performance of the $lookup operation.
  • If your Aggregation pipeline results in a large dataset, consider the $merge stage, which writes the results to a collection. This can improve performance and avoid memory issues.
  • Use Covered Indexes — If possible, structure your indexes to cover the query. It means that the index itself contains all the fields required by the query. MongoDB can fulfill the query without accessing the actual documents.
  • Analyze Query Execution Plans — Use the explain method to identify areas for optimization.
  • Use allowDiskUse Option — Consider using the allowDiskUse option for large data sets. It allows MongoDB to use temporary files on disk during the Aggregation.
  • Regular Maintenance — Maintain your indexes to ensure they remain effective over time. Besides, use the latest MongoDB version for performance improvements and bug fixes.

Following the above guidelines will improve the performance of your MongoDB Aggregation queries. It’s important to test changes before applying them to production.

Good vs Bad Aggregation queries

Below are some examples illustrating the differences between good and bad MongoDB aggregation queries.
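As one hypothetical illustration (an orders collection queried for "A"-status orders, sorted by date), the pipelines below are defined as plain JavaScript arrays, exactly as they would be passed to aggregate(). The comparison applies the Pipeline Optimization guidelines above:

```javascript
// Bad: $sort runs over every document before any are filtered out, and the
// absence of $project means full documents flow through every stage.
const badPipeline = [
  { $sort: { date: -1 } },
  { $match: { status: "A" } },
];

// Good: $match first shrinks the working set (and can use an index on
// status), then $project keeps only the needed fields before sorting.
const goodPipeline = [
  { $match: { status: "A" } },
  { $project: { cust_id: 1, amount: 1, date: 1 } },
  { $sort: { date: -1 } },
];

// Either array would be passed unchanged, e.g. db.orders.aggregate(goodPipeline).
console.log(Object.keys(goodPipeline[0])[0]); // $match
```

Both pipelines return the same documents; the difference is how much data each stage has to touch along the way.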

Conclusion

In summary, MongoDB’s Aggregation Framework is a powerful tool for data processing, with advanced features that enable developers to perform complex analytics with ease.
