
Data Processing

Processing MongoDB Oplog

Rebuilding MongoDB documents from Oplog

Atharva Inamdar · Published in TDS Archive · Feb 12, 2019 · 4 min read


In a previous post, I covered what the MongoDB oplog is and its semantics. In this post, I will look at how to process it to get the new state of documents.

Recap

First, let’s remind ourselves of the data manipulation operations: Insert, Update & Delete. For Inserts and Deletes, only the o field is present, containing either the full inserted document or just the _id of the document being deleted. For Updates, the o field contains the changes as $set and $unset commands, and o2 identifies the _id of the document being updated.
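As a rough illustration, the three entry types look something like this (the field values are made up, and real entries carry extra metadata such as ts, h and v that is omitted here):

```python
# Illustrative oplog entry shapes (hypothetical values, metadata fields omitted)

insert_entry = {
    "op": "i",                     # insert
    "ns": "shop.orders",           # namespace: <db>.<collection>
    "o": {"_id": 1, "item": "book", "qty": 2},            # the full new document
}

update_entry = {
    "op": "u",                     # update
    "ns": "shop.orders",
    "o2": {"_id": 1},              # which document was updated
    "o": {"$set": {"qty": 3}, "$unset": {"item": True}},  # what changed
}

delete_entry = {
    "op": "d",                     # delete
    "ns": "shop.orders",
    "o": {"_id": 1},               # only the _id of the removed document
}
```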

We can ignore c (DB Commands) and n (NOOP) operations as these do not modify the data.
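For example, reading the oplog with pymongo and keeping only the data-changing entries for a single collection might look like this (the connection string and the mydb.mycollection namespace are placeholders):

```python
from pymongo import MongoClient

# The oplog lives in the "local" database on a replica set member.
client = MongoClient("mongodb://localhost:27017")
oplog = client.local["oplog.rs"]

# Keep only insert/update/delete entries for the collection we care about,
# skipping 'c' (DB commands) and 'n' (NOOP) entries.
cursor = oplog.find({
    "ns": "mydb.mycollection",
    "op": {"$in": ["i", "u", "d"]},
}).sort("$natural", 1)  # oldest first

for entry in cursor:
    print(entry["op"], entry.get("o2", entry["o"]))
```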

Problem

Let us consider a large MongoDB collection containing over 1TB of data that needs to be transferred to a data warehouse, lake, or other storage system on a daily basis. One method is to perform a full export every day using the mongoexport utility. However, we quickly find that it takes so long that a daily export is infeasible, and we also have to consider the performance impact on the cluster itself.

Another way is to export once, then read a day's worth of updates from the oplog and apply them to the existing documents. Reading the oplog requires far fewer resources on the MongoDB cluster, and it also lets us apply changes at whatever frequency we need.
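A minimal sketch of that apply step, assuming the daily export has been loaded into a dict keyed by _id and the oplog entries are replayed in timestamp order (the helper name below is my own, not from the article):

```python
def apply_oplog_entry(documents, entry):
    """Apply a single oplog entry to a dict of documents keyed by _id."""
    op, o = entry["op"], entry["o"]

    if op == "i":                            # insert: store the full document
        documents[o["_id"]] = o
    elif op == "d":                          # delete: drop the document
        documents.pop(o["_id"], None)
    elif op == "u":                          # update: apply $set / $unset to the existing doc
        doc_id = entry["o2"]["_id"]
        doc = documents.get(doc_id, {"_id": doc_id})
        for field, value in o.get("$set", {}).items():
            doc[field] = value
        for field in o.get("$unset", {}):
            doc.pop(field, None)
        documents[doc_id] = doc
    # Note: $set/$unset can reference nested fields with dotted paths;
    # this sketch only handles top-level fields.

# Usage: replay one day's entries over the previous export, oldest first
# for entry in oplog_entries_sorted_by_ts:
#     apply_oplog_entry(documents, entry)
```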

Solution
