Leverage MongoDB change streams to build a robust ETL pipeline

Mohammed Galalen · Published in Analytics Vidhya · 3 min read · Dec 30, 2020

Before diving into change streams, let’s look at why we wanted to migrate data from Mongo to another SQL database, specifically PostgreSQL. The main issue arose when we connected an analytical tool like Metabase to Mongo: since we are dealing with a document datastore, most of the relationships are embedded, finding relationships between these documents can be hard, and it doesn’t fit our analytical pipeline.

Another thing to keep in mind: we needed the tool to work in near real time (always showing the most recent analytics).

The best solution was to migrate the data to a SQL database and connect any analytical tool directly to it. This can also serve as a secondary backup of the data, because the updates happen in real time (thanks to the Mongo change stream).

So what is a Mongo change stream?

According to the official Mongo website: “Change streams allow applications to access real-time data changes without the complexity and risk of tailing the oplog. Applications can use change streams to subscribe to all data changes on a single collection, a database, or an entire deployment, and immediately react to them. Because change streams use the aggregation framework, applications can also filter for specific changes or transform the notifications at will”.
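For example, you can pass an aggregation pipeline to watch() so the stream only delivers the events you care about. The snippet below is a minimal Node.js sketch, assuming an already-connected db handle; the orders collection and the $match stage are only illustrative.

// Minimal sketch: only receive insert and update events for one collection.
// Assumes `db` is a connected Db instance from the official Node.js driver,
// and the `orders` collection name is just a placeholder.
const pipeline = [
  { $match: { operationType: { $in: ['insert', 'update'] } } },
];

const filteredStream = db.collection('orders').watch(pipeline);
filteredStream.on('change', change => {
  console.log(change.operationType, change.documentKey);
});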

How to use it:

I followed this amazing article to set up Mongo to handle change streams; it is a gold mine of information on this topic.

But the most important point is that Mongo, by default, runs as a standalone instance. “In order to use changeStreams you must convert your “Standalone” database into a “Replica Set” database.”
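As a rough sketch of that conversion for a local development instance (the replica set name rs0 and the paths are placeholders; follow the linked article and the official docs for production setups):

// 1. Restart mongod with a replica set name, e.g.:
//      mongod --port 27017 --dbpath /data/db --replSet rs0
// 2. Then, from the mongo shell, initiate the replica set:
rs.initiate();
// 3. And confirm the node has become PRIMARY:
rs.status();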

With that being said, I highly advise you to check out that article.

Mongo change streams can be used in a lot of use cases to enable real-time apps to work effectively. The official docs point out that they save developers time when implementing ETL pipelines.

System overview:

System overview diagram (created using https://app.diagrams.net/)

This code example is taken from the official MongoDB website:

const collection = db.collection('inventory');
const changeStream = collection.watch();
let newChangeStream;

// Grab the resume token from the first change, close the stream,
// then reopen it from that exact point so no events are missed.
changeStream.once('change', next => {
  const resumeToken = changeStream.resumeToken;
  changeStream.close();

  newChangeStream = collection.watch([], { resumeAfter: resumeToken });
  newChangeStream.on('change', next => {
    processChange(next);
  });
});

The above example uses Node.js to listen for changes on the inventory collection.
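In our ETL service, processChange is where the changed document gets flattened and written to PostgreSQL. The snippet below is only a hypothetical sketch of that step: the pg connection, the inventory table, and its columns (item, quantity) are assumptions, not the actual production schema.

// Hypothetical sketch of processChange: flatten the changed document and
// upsert it into a matching PostgreSQL table using the `pg` driver.
const { Pool } = require('pg');
const pool = new Pool({ connectionString: process.env.POSTGRES_URL });

async function processChange(change) {
  if (change.operationType === 'delete') {
    await pool.query('DELETE FROM inventory WHERE mongo_id = $1', [
      change.documentKey._id.toString(),
    ]);
    return;
  }
  // For inserts/updates, fullDocument holds the current state of the document
  // (for updates this requires watching with { fullDocument: 'updateLookup' }).
  const doc = change.fullDocument;
  await pool.query(
    `INSERT INTO inventory (mongo_id, item, quantity)
     VALUES ($1, $2, $3)
     ON CONFLICT (mongo_id) DO UPDATE SET item = $2, quantity = $3`,
    [doc._id.toString(), doc.item, doc.quantity]
  );
}

The ON CONFLICT clause assumes a unique index on mongo_id, so repeated events for the same document simply overwrite the previous row instead of creating duplicates.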

  • Another thing recently added to this system is a batch pull: a background service that pulls data from MongoDB directly. It runs periodically (every 5 to 7 days) to ensure data integrity (a rough sketch follows below).
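As a rough sketch of what that batch pull can look like (the interval, the collection name, and the reuse of processChange are assumptions, not the actual service):

// Hypothetical sketch of the periodic batch pull: every few days, re-read
// the whole collection and push each document through the same upsert path
// as the change-stream handler.
const SIX_DAYS_MS = 6 * 24 * 60 * 60 * 1000;

async function batchPull(db) {
  const cursor = db.collection('inventory').find({});
  for await (const doc of cursor) {
    await processChange({ operationType: 'insert', fullDocument: doc });
  }
}

setInterval(() => batchPull(db).catch(console.error), SIX_DAYS_MS);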

But unfortunately, there is no free lunch: the main downside of this approach is the maintainability headache it adds for the team. Every field added to or removed from the main schema in MongoDB needs to be reflected in the ETL service and in the SQL schema as well, which makes the system a bit harder to maintain.

In conclusion, Mongo change streams are a really powerful feature: they let you get rid of many third-party tools for enabling real-time behavior in your application, and they are a great fit for most of the common use cases like ETL, collaborative apps, notifications, and chats. The possibilities are endless.
