Parallelizing Tasks Using MongoDB

Lev Eidelman Nagar
Zencity Engineering
3 min read · Dec 13, 2021

As part of Zencity’s data ingestion pipelines, we manage and schedule what we call Data Connector Tasks. These tasks track the state of the atoms of data we pull, process, and enrich through various means.

In this post I’d like to focus on how we managed to scale our scheduling mechanism.

A Bit of Background

After a Data Atom is initially created, we also create an associated Task. In a separate scheduled job, we query our database for tasks pending additional processing and push them into a queue.
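To make the rest of this post concrete, here is a rough sketch of what such a task document might look like. The field names are illustrative, not our actual schema:

```typescript
// Illustrative shape of a Data Connector Task document
// (hypothetical field names; the real schema differs).
interface DataConnectorTask {
  _id: string;          // ObjectId, shown here as a hex string
  dataAtomId: string;   // the data atom this task tracks
  status: 'pending' | 'in_progress' | 'done' | 'failed';
  createdAt: Date;
}
```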

Sounds simple, right? But as we’ll soon see, it’s a bit tricky to pull off at scale.

The Moving Parts

We use MongoDB as our primary database, RabbitMQ as our message queue, and Kubernetes to orchestrate and scale our code. Our application runs on Node.js and is written in TypeScript.

First Draft

Our first, admittedly naive, implementation ran in a single pod every minute, querying the tasks collection in batches of 10,000 at a time:
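The original snippet isn't reproduced here, but a minimal sketch of that first draft might look like the following. The collection name, queue name, and `status` field are assumptions for illustration:

```typescript
import { MongoClient } from 'mongodb';
import amqplib from 'amqplib';

const BATCH_SIZE = 10_000;
const QUEUE = 'data-connector-tasks'; // assumed queue name

// Runs every minute in a single pod: drain pending tasks batch by batch,
// pushing each one onto RabbitMQ and marking it as queued so the next
// batch doesn't pick it up again.
async function scheduleTasks(): Promise<void> {
  const mongo = await MongoClient.connect(process.env.MONGO_URI!);
  const amqp = await amqplib.connect(process.env.RABBITMQ_URI!);
  const channel = await amqp.createChannel();
  await channel.assertQueue(QUEUE);

  const tasks = mongo.db().collection('tasks');

  while (true) {
    const batch = await tasks
      .find({ status: 'pending' }) // 'status' is an assumed field name
      .limit(BATCH_SIZE)
      .toArray();
    if (batch.length === 0) break;

    for (const task of batch) {
      channel.sendToQueue(QUEUE, Buffer.from(JSON.stringify({ taskId: task._id })));
    }

    // Mark the batch as queued so the next iteration fetches fresh tasks.
    await tasks.updateMany(
      { _id: { $in: batch.map((t) => t._id) } },
      { $set: { status: 'queued' } },
    );
  }

  await channel.close();
  await amqp.close();
  await mongo.close();
}
```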

What was wrong with it? At first absolutely nothing. But as we kept ingesting more data, we found that we couldn’t keep up with demand and had to rethink our approach.

Getting Warmer

What’s better than a single pod doing the work? Multiple pods sharing the workload and getting more done. That much is obvious, but how do we ensure that parallel processes don’t accidentally pull the same task, creating a race condition?

Just Shard It

First, we set up our pods so that each is aware of the total number of pods/shards as well as its own shard ID using environment variables.
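In code, that boils down to something like this. The variable names are illustrative; in Kubernetes the values could be injected into the pod spec, for example derived from a StatefulSet ordinal:

```typescript
// Each pod learns the shard topology from its environment.
// TOTAL_SHARDS and SHARD_ID are hypothetical variable names.
const TOTAL_SHARDS = parseInt(process.env.TOTAL_SHARDS ?? '1', 10);
const SHARD_ID = parseInt(process.env.SHARD_ID ?? '0', 10);
```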

Second, we needed to filter the tasks based on these parameters:
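Again the original snippet isn't shown, but the idea was roughly the following: fetch a batch, then keep only the tasks whose ID maps to our shard. The hash function here is a stand-in; any mapping from task ID to shard number that is stable across pods would do:

```typescript
// Map a task's ObjectId (as a hex string) to a shard number by summing
// its character codes. Simplistic, but deterministic and stable.
function shardOf(taskId: string, totalShards: number): number {
  let sum = 0;
  for (let i = 0; i < taskId.length; i++) {
    sum += taskId.charCodeAt(i);
  }
  return sum % totalShards;
}

// After fetching a batch, each pod discards every task that doesn't
// belong to it. Note that every pod still fetches and deserializes the
// full batch — which is exactly the overhead described below.
function filterForShard<T extends { _id: { toHexString(): string } }>(
  batch: T[],
  shardId: number,
  totalShards: number,
): T[] {
  return batch.filter(
    (task) => shardOf(task._id.toHexString(), totalShards) === shardId,
  );
}
```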

It did the trick, but soon enough we noticed that things weren’t working quite as well as we thought…

Filtering such a large data set in application code carries serious performance overhead, which led to Kubernetes occasionally killing pods that took too long to finish, losing all their work in the process. We were back at square one.

It was at this point that we thought: if we were using a relational database such as MySQL or PostgreSQL, we could simply write a SQL query that would shard the data at the source, saving us the cost of having our app do it. Unfortunately, it couldn’t be done in MongoDB… or could it?

Fortunately for us, MongoDB’s aggregation framework turned out to be just the right tool for the job.

Bingo!

After some research, this is what we came up with:
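What follows is a sketch of the approach rather than our exact pipeline. We push a `$match` stage down to MongoDB that buckets each task into a shard — here by taking the ObjectId’s embedded timestamp modulo the shard count, though any deterministic expression over the document would work. The `status` field and collection name remain assumptions:

```typescript
import { MongoClient } from 'mongodb';

const TOTAL_SHARDS = parseInt(process.env.TOTAL_SHARDS ?? '1', 10);
const SHARD_ID = parseInt(process.env.SHARD_ID ?? '0', 10);
const BATCH_SIZE = 10_000;

// Let MongoDB decide which tasks belong to this shard: derive a number
// from each document's _id ($toDate extracts the ObjectId's timestamp,
// $toLong converts it to epoch milliseconds) and keep only documents
// where that number modulo the shard count equals our shard ID.
async function fetchShardBatch() {
  const mongo = await MongoClient.connect(process.env.MONGO_URI!);
  const tasks = mongo.db().collection('tasks');

  return tasks
    .aggregate([
      { $match: { status: 'pending' } },
      {
        $match: {
          $expr: {
            $eq: [
              { $mod: [{ $toLong: { $toDate: '$_id' } }, TOTAL_SHARDS] },
              SHARD_ID,
            ],
          },
        },
      },
      { $limit: BATCH_SIZE },
    ])
    .toArray();
}
```

One caveat with this particular bucketing expression: ObjectId timestamps have one-second granularity, so tasks created within the same second all land on the same shard. A different deterministic expression may distribute the load more evenly.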

As you can see, we delegated the work of figuring out which tasks belong to which shard to MongoDB. This, at last, cleared the bottleneck, allowing us to handle our workload and scale up in the future to meet increasing demand.
