All about Aggregation in MongoDB

Rahul Kumar
8 min readNov 26, 2023

--

In MongoDB, aggregation refers to the process of transforming and combining documents from one or more collections in order to perform various data processing operations. The aggregation framework provides a set of powerful and flexible tools for performing data transformations, filtering, grouping, sorting, and other operations on the data stored in MongoDB.

The aggregation framework is essentially a pipeline-based system where documents pass through a sequence of stages, each stage representing a specific operation. Each stage in the pipeline takes input documents, processes them, and passes the results to the next stage. The output of the last stage is the final result of the aggregation. Aggregation is not like join in MySQL, we can perform join like operations in aggregation using lookup operator.

// basic block of aggregation
db.collection.aggregate(pipeline, options)

Options are additional settings that you can provide within certain stages. For example, the $sort stage can take options like { $sort: { field: 1 } } to sort in ascending order.

There are some pipelines which we use to perform aggregation:-

{ 
$match, $project, $group, $sort, $skip, $limit, $unwind, $lookup, $out,
$addFields, $replaceRoot, $sample, $facet, $redact, $bucket, $graphLookup
}

For more clear understanding let’s create a database like this:-

db.employees.insertMany([
{
"_id": 1,
"name": "Alice",
"department": "HR",
"salary": 50000,
"skills": ["communication", "recruitment"]
},
{
"_id": 2,
"name": "Bob",
"department": "IT",
"salary": 80000,
"skills": ["programming", "database"]
},
{
"_id": 3,
"name": "Charlie",
"department": "HR",
"salary": 55000,
"skills": ["communication", "training"]
}
])

$match: the $match stage is used to filter documents in a collection based on specified criteria. The $match stage is similar to the find query, and it allows you to filter the input documents to the aggregation pipeline.

// Basic format
{ $match: { <query conditions> } }
{ $match: { department: "IT", salary: { $gt: 75000 } } }
  1. Multiple Conditions: You can use multiple conditions within the $match stage to create more complex queries.
  2. Index Usage: The $match stage can take advantage of indexes on the specified fields, making the query more efficient.
  3. Query Operators: You can use various query operators within the $match stage, such as $eq, $ne, $gt, $lt, $in, $regex, etc.
  4. Logical Operators: Logical operators like $and, $or, and $not can be used to combine multiple conditions.

This would filter documents to only include those where the department is “HR”.

db.employees.aggregate([
{ $match: { department: "HR" } }
])

// Output
{ "_id": 1, "name": "Alice", "department": "HR", "salary": 50000, "skills": ["communication", "recruitment"] }
{ "_id": 3, "name": "Charlie", "department": "HR", "salary": 55000, "skills": ["communication", "training"] }

$project: The $project stage is used to reshape or transform documents in the pipeline. It allows you to include, exclude, rename, or transform fields, effectively creating a new set of documents with modified structures. The $project stage is analogous to the SELECT statement in SQL databases.

This would include only the name and salary fields in the output.

db.employees.aggregate([
{ $project: { name: 1, salary: 1 } }
])

// Output
{ "_id": 1, "name": "Alice", "salary": 50000 }
{ "_id": 2, "name": "Bob", "salary": 80000 }
{ "_id": 3, "name": "Charlie", "salary": 55000 }

$group: The $group stage is used to group documents by a specified key and perform aggregations on each group. It is similar to the GROUP BY clause in SQL. The $group stage is often used to calculate aggregated values like counts, averages, sums, or other statistical measures within each group.

This would group documents by department and calculate the average salary for each department.

db.employees.aggregate([
{ $group: { _id: "$department", averageSalary: { $avg: "$salary" } } }
])

// Output
{ "_id": "HR", "averageSalary": 52500 }
{ "_id": "IT", "averageSalary": 80000 }

$sort: This would sort documents by salary in descending order.

db.employees.aggregate([
{ $sort: { salary: -1 } }
])

// Output
{ "_id": 2, "name": "Bob", "department": "IT", "salary": 80000, "skills": ["programming", "database"] }
{ "_id": 3, "name": "Charlie", "department": "HR", "salary": 55000, "skills": ["communication", "training"] }
{ "_id": 1, "name": "Alice", "department": "HR", "salary": 50000, "skills": ["communication", "recruitment"] }

$skip: This would skip the first document in the pipeline.

db.employees.aggregate([
{ $skip: 1 }
])

// Output
{ "_id": 2, "name": "Bob", "department": "IT", "salary": 80000, "skills": ["programming", "database"] }
{ "_id": 3, "name": "Charlie", "department": "HR", "salary": 55000, "skills": ["communication", "training"] }

$limit: This would limit the output to the first two documents in the pipeline.

db.employees.aggregate([
{ $limit: 2 }
])

// Output
{ "_id": 1, "name": "Alice", "department": "HR", "salary": 50000, "skills": ["communication", "recruitment"] }
{ "_id": 2, "name": "Bob", "department": "IT", "salary": 80000, "skills": ["programming", "database"] }

$unwind: The $unwind stage is used to deconstruct an array field from the input documents, creating a new document for each element in the array. This is particularly useful when you have documents with arrays, and you want to perform operations on each element of the array separately.

This would create a separate document for each skill within the skills array.

db.employees.aggregate([
{ $unwind: "$skills" }
])

// Output
{ "_id": 1, "name": "Alice", "department": "HR", "salary": 50000, "skills": "communication" }
{ "_id": 1, "name": "Alice", "department": "HR", "salary": 50000, "skills": "recruitment" }
{ "_id": 2, "name": "Bob", "department": "IT", "salary": 80000, "skills": "programming" }
{ "_id": 2, "name": "Bob", "department": "IT", "salary": 80000, "skills": "database" }
{ "_id": 3, "name": "Charlie", "department": "HR", "salary": 55000, "skills": "communication" }
{ "_id": 3, "name": "Charlie", "department": "HR", "salary": 55000, "skills": "training" }

$lookup: The $lookup stage is used to perform a left outer join on documents from another collection. It allows you to combine documents from two collections based on a specified condition and include the matching documents from the "joined" collection in the result.

// Basic format
{
$lookup:
{
from: <collection>,
localField: <field from the input documents>,
foreignField: <field from the documents of the "from" collection>,
as: <output array field>
}
}

Assuming you have another collection named projects. This would perform a left outer join with the projects collection based on the employees field.{ “_id”: 101, “projectName”: “Training Program”, “employees”: [1, 3] }

{ "_id": 101, "projectName": "Training Program", "employees": [1, 3] }
db.employees.aggregate([
{
$lookup: {
from: "projects",
localField: "_id",
foreignField: "employees",
as: "projects"
}
}
])

// Output
{
"_id": 1,
"name": "Alice",
"department": "HR",
"salary": 50000,
"skills": ["communication", "recruitment"],
"projects": [
{ "_id": 101, "projectName": "Training Program", "employees": [1, 3] }
]
}
{ "_id": 2, "name": "Bob", "department": "IT", "salary": 80000, "skills": ["programming", "database"], "projects": [] }
{ "_id": 3, "name": "Charlie", "department": "HR", "salary": 55000, "skills": ["communication", "training"], "projects": [{ "_id": 101, "projectName": "Training Program", "employees": [1, 3] }] }

$out: Writes the results of the aggregation pipeline to a specified collection. It is used to store the computed results for further analysis or to create a new collection.
This would write the results of the aggregation pipeline to a new collection named newCollection.

db.employees.aggregate([
{ $out: "newCollection" }
])

// Output will be written to a new collection named newCollection.

$addFields: Adds new fields to documents in the pipeline. It is used to create or modify fields based on expressions or values.
This would add a new field bonus to each document, calculated as 10% of the salary.

db.employees.aggregate([
{ $addFields: { bonus: { $multiply: ["$salary", 0.1] } } }
])

// Output
{ "_id": 1, "name": "Alice", "department": "HR", "salary": 50000, "skills": ["communication", "recruitment"], "bonus": 5000 }
{ "_id": 2, "name": "Bob", "department": "IT", "salary": 80000, "skills": ["programming", "database"], "bonus": 8000 }
{ "_id": 3, "name": "Charlie", "department": "HR", "salary": 55000, "skills": ["communication", "training"], "bonus": 5500 }

$replaceRoot: Replaces the existing document structure with a new one specified by a document or an expression. It is often used to promote nested fields to the top level.
This would replace the root document with the department field for documents where the name is "Alice".

db.employees.aggregate([
{ $match: { name: "Alice" } },
{ $replaceRoot: { newRoot: "$department" } }
])

// Output
"HR"

$sample: Randomly selects a specified number of documents from the input. It is useful for obtaining a random subset of documents from a collection.
This would randomly select 2 documents from the collection.

db.employees.aggregate([
{ $sample: { size: 2 } }
])

// Output: Randomly selected documents.

$facet: Enables multi-faceted aggregation, allowing you to define multiple independent pipelines within a single stage. Each sub-pipeline represents a different facet of the data.
This would perform multiple aggregations in parallel, generating statistics for departments and listing distinct skills.

db.employees.aggregate([
{
$facet: {
departmentStats: [
{ $group: { _id: "$department", averageSalary: { $avg: "$salary" } } }
],
skillsList: [
{ $unwind: "$skills" },
{ $group: { _id: "$skills" } }
]
}
}
])

// Output
{
"departmentStats": [
{ "_id": "HR", "averageSalary": 52500 },
{ "_id": "IT", "averageSalary": 80000 }
],
"skillsList": [
{ "_id": "communication" },
{ "_id": "recruitment" },
{ "_id": "programming" },
{ "_id": "database" },
{ "_id": "training" }
]
}

$redact: Filters the documents in the pipeline based on specified conditions, allowing you to control access to data by restricting document content.
This example assumes a field named confidential. This would redact (hide) documents where the confidential field is true
{ "_id": 1, "name": "Alice", "department": "HR", "salary": 50000, "confidential": false }

db.employees.aggregate([
{ $redact: { $cond: { if: { $eq: ["$confidential", false] }, then: "$$DESCEND", else: "$$PRUNE" } } }
])

$bucket: Groups documents into buckets based on specified ranges or conditions. It is useful for creating histograms or categorizing data into distinct groups.
This would categorize documents into buckets based on salary ranges.

db.employees.aggregate([
{
$bucket: {
groupBy: "$salary",
boundaries: [50000, 60000, 70000, 80000],
default: "Other",
output: { count: { $sum: 1 } }
}
}
])

// Output
{ "_id": 50000, "count": 1 }
{ "_id": 60000, "count": 2 }
{ "_id": 70000, "count": 0 }
{ "_id": 80000, "count": 1 }
{ "_id": "Other", "count": 0 }

$grapghLookup: Performs a recursive search on a graph-like data structure. It is commonly used to traverse hierarchical or tree-like structures in documents, like those representing organizational hierarchies.
Assuming a hierarchical structure with a manager field. This would perform a recursive search to build a reporting hierarchy for each employee.

{ "_id": 1, "name": "Alice", "manager": null }
{ "_id": 2, "name": "Bob", "manager": 1 }
{ "_id": 3, "name": "Charlie", "manager": 1 }
db.employees.aggregate([
{
$graphLookup: {
from: "employees",
startWith: "$manager",
connectFromField: "manager",
connectToField: "_id",
as: "reportingHierarchy"
}
}
])

// Output
{
"_id": 1,
"name": "Alice",
"manager": null,
"reportingHierarchy": []
}
{ "_id": 2, "name": "Bob", "manager": 1, "reportingHierarchy": [{ "_id": 1, "name": "Alice", "manager": null }] }
{ "_id": 3, "name": "Charlie", "manager": 1,

Now, you know how to use pipelines in aggregation. Let’s try to use multiple pipelines at once.

// collection
{ "_id": 1, "name": "Alice", "department": "IT", "salary": 80000, "skills": ["programming", "database"] }
{ "_id": 2, "name": "Bob", "department": "IT", "salary": 75000, "skills": ["programming", "testing"] }
{ "_id": 3, "name": "Charlie", "department": "IT", "salary": 85000, "skills": ["database", "management"] }
{ "_id": 4, "name": "David", "department": "IT", "salary": 82000, "skills": ["programming", "design"] }
{ "_id": 5, "name": "Eve", "department": "IT", "salary": 78000, "skills": ["testing", "management"] }
// Query
db.employees.aggregate([
{ $match: { department: "IT" } },
{
$project: {
_id: 0,
employeeName: "$name",
department: 1,
salary: 1,
skills: 1
}
},
{ $unwind: "$skills" },
{
$group: {
_id: "$skills",
averageSalary: { $avg: "$salary" }
}
},
{ $sort: { averageSalary: -1 } },
{ $limit: 2 }
])
// Output
{ "_id": "database", "averageSalary": 82500 }
{ "_id": "programming", "averageSalary": 82250 }

All this is all about aggregation in MongoDB. There are some pipelines, which we need to know in depth. But now you can create a basic level of aggregation with multiple pipelines.

Aggregation overview

--

--