MongoDb Aggregation Pipelines

Steven de Salas
5 min read · Feb 4, 2022


Aggregation Pipelines are a powerful tool to process millions of records quickly and efficiently. If you have used other database systems before, you may be familiar with Materialised Views (Postgres/Oracle) or Indexed Views (SQL Server).

These are all awesome solutions to one specific problem:

How do I combine vast amounts of information into a single dashboard in a way that requires minimum effort to load?

Use MongoDb aggregation pipelines to create dashboards in a way that is easy to load.

You see, people like to consume information in the form of summary tables and charts, but getting that information ready for them often means trawling through thousands or millions of records. This is a slow, expensive process that should never be executed in real time every time a user requests the information (they’ll be waiting a long time, and you’ll be doing a lot of unnecessary processing).

The ideal approach here is two pronged:

  1. Have a way of building the aggregate information INSIDE your database, so that your millions of records do not have to be moved around to get the result.
  2. Store the result to disk, so it is essentially written once and read many times.

While you can define your aggregation pipelines in any language (Node.js, Python, Java, etc.), once they are sent to MongoDb they are executed using C++ inside the MongoDb process, so your thousands or millions of records are processed in an optimized manner and never need to leave the database.

You then get a result which may be a few hundred rows long, but contains data put together from thousands or millions of records. This result can then be stored to disk using the $out or $merge aggregation stages, depending on whether you want to replace the resulting collection or add to it.
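As a concrete illustration, here is a minimal sketch of such a pipeline using the official Node.js driver. The `orders` collection, its fields (`status`, `price`, `quantity`, `createdAt`) and the `daily_sales` output name are all assumptions for the example:

```javascript
// Pipeline that rolls up millions of hypothetical order records into
// one summary row per day, then persists the result with $out.
const dailySalesPipeline = [
  // Stage 1: only look at completed orders (assumed status field).
  { $match: { status: 'completed' } },
  // Stage 2: collapse records into one document per calendar day.
  { $group: {
      _id: { $dateToString: { format: '%Y-%m-%d', date: '$createdAt' } },
      totalSales: { $sum: { $multiply: ['$price', '$quantity'] } },
      orderCount: { $sum: 1 },
  } },
  // Stage 3: write the (few hundred) summary rows to disk,
  // replacing the previous run's output.
  { $out: 'daily_sales' },
];

// Running it with the Node.js driver (not executed here):
async function runPipeline(db) {
  // With $out as the final stage, all the heavy lifting happens
  // inside the MongoDb server; nothing streams back to the client.
  await db.collection('orders').aggregate(dailySalesPipeline).toArray();
}
```

After a run, reading the dashboard is just a cheap `find()` on `daily_sales`.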

How to put together your first aggregation pipeline

If you are going to put together an aggregation pipeline for the first time I recommend using MongoDb Compass, which comes with a UI for building pipelines that makes it easy to get familiar with the language for creating aggregation steps.

Once you are done, you can export the result to your language of choice.

Use the $out/$merge aggregation stages and schedule your execution.

Done building your pipeline? When you are happy with the result, don’t forget to add either the $out or $merge aggregation step at the end: this will save your results to disk as the final stage of the aggregation.

$out: Replaces the target collection with the results of the pipeline
$merge: Adds to or partially modifies your target collection.
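To make the difference concrete, here is a sketch of a $merge final stage; the `daily_sales` target name and `_id` match key are assumptions for the example:

```javascript
// Unlike $out, which replaces the target collection wholesale, $merge
// keeps existing documents and lets you decide what happens on a match.
const mergeStage = {
  $merge: {
    into: 'daily_sales',      // target collection to write into
    on: '_id',                // field(s) used to match existing documents
    whenMatched: 'replace',   // a re-computed day overwrites the old row
    whenNotMatched: 'insert', // days not seen before are simply added
  },
};

// Appended as the last stage of a pipeline, e.g.:
//   db.collection('orders').aggregate([...earlierStages, mergeStage])
```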

You can then schedule execution through a variety of ways:

  • Timed scheduling: Use your favorite scheduler to run your pipeline every few minutes. (I like cron, or cronicle for a great Node.js equivalent with a web UI, but any scheduler will do.)
  • Database triggers: You can also run your aggregation pipeline whenever there is a change to any of your source collections (there might be more than one if your data is put together from several collections). I recommend using Atlas Triggers for this (if your database is hosted on MongoDb Atlas). Here is a good write-up.
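For the timed-scheduling option, the job can be as small as the script below, invoked by cron. The crontab line, connection URI, database name and pipeline are all assumptions for the sketch:

```javascript
// refresh-dashboard.js — a minimal job to run from cron every few
// minutes, e.g. with a crontab entry like:
//
//   */5 * * * * /usr/bin/node /opt/jobs/refresh-dashboard.js
//
async function refreshDashboard(uri = 'mongodb://localhost:27017') {
  // Lazy require so the file can be imported without the driver installed.
  const { MongoClient } = require('mongodb'); // npm install mongodb
  const client = new MongoClient(uri);
  try {
    await client.connect();
    // Re-run the saved pipeline; $out rewrites the summary collection.
    await client.db('shop').collection('orders').aggregate([
      { $group: { _id: '$item', unitsSold: { $sum: '$quantity' } } },
      { $out: 'item_totals' },
    ]).toArray();
  } finally {
    await client.close();
  }
}

// refreshDashboard(); // call unconditionally in the real cron job
```

Each run costs the server one pass over `orders`; every dashboard read in between is served straight from the precomputed `item_totals` collection.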

Why not use another solution?

Aggregation pipelines are the right tool to use if you are already committed to MongoDb.

Obviously it doesn't make a lot of sense to use them if your data is stored in a different backend database. Depending on your database there is probably a more suitable tool for the same niche:

  • PostgreSQL: Use Materialised Views
  • MySQL: No equivalent, but one can be cooked up manually with a stored procedure
  • Couchbase: Use Couchbase Views
  • DynamoDb: Use Global Secondary Indexes
  • MSSQL: Use Indexed Views
  • Oracle: Use Materialised Views

What is a Founding Engineer?

This is a series of articles on Founding Engineer know-how. But they are useful for all kinds of engineers out there.

A Founding Engineer is a special kind of CTO for startups. Early in a startup’s life (seed or pre-seed), you need a developer who can ship product and then build a business around it. CTOs of bigger companies don’t actually spend any time coding; they work on strategy, partnerships, hiring and other higher-leverage activities. But that won’t work for a startup.

A founding engineer will turn over a product quickly (i.e., get s*** done) and iterate it until you have a working solution, as nothing is more important than choosing the right market and getting a solution out to that market so that customers can give you feedback (with their dollars). They can build a solution, store the code in GitHub/GitLab, and set up a database, CI/CD pipelines, cloud hosting, container workloads, email, DNS, SSL, reverse proxying, automated testing, and code branching and collaboration guidelines.

In addition to all of that, the founding engineer must communicate with non-technical folks, both within the organization (including the CEO) and outside the company. They need to be able to grow a team from a single person, and be willing and able to drive decisions, especially around technical issues: when to hire, which technology stack to use, when to use lowcode/nocode, and when to accrue technical debt versus pay it back.

About the Author

Steven de Salas is a Founding Engineer/CTO at LetMePark, a Series-A startup based in Madrid. His superpower is building API-based platforms from scratch with Node.js, MongoDb and Vue.js, but he has many other talents, like playing board games and explaining homework to his 8-year-old twins.

Like the article?

Please press the 👏 right below. It helps more people see it.


Steven de Salas

Developer. Hardware, blockchain, boardgame enthusiast. I create interfaces and automations that make everyday tasks more enjoyable. https://github.com/sdesalas