Data Processing
Processing MongoDB Oplog
Rebuilding MongoDB documents from Oplog
In a previous post, I covered what the MongoDB oplog is and its semantics. In this post, I will look at how to process it to get the new state of documents.
Recap
First, let's remind ourselves of the data manipulation operations: Insert, Update & Delete. For Inserts and Deletes, only the o field exists, holding either the full document or just the _id of the document being deleted. For Updates, the o field contains the updates as $set and $unset commands, and o2 notes the _id of the document being updated.
We can ignore c (DB Commands) and n (NOOP) operations as these do not modify the data.
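To make the recap concrete, here is a minimal Python sketch that looks at a single oplog entry and returns the _id of the document it touches. It assumes entries are plain dictionaries with the op, o and o2 fields described above, and that the operation codes are i, u, d, c and n. The function name affected_id is hypothetical, used only for illustration.

```python
def affected_id(entry):
    """Return the _id of the document an oplog entry changes, or None.

    Assumes the entry shape described in the recap:
      insert ("i"): o holds the full document
      delete ("d"): o holds only the _id
      update ("u"): o holds $set/$unset, o2 holds the _id
    Commands ("c") and no-ops ("n") are ignored.
    """
    op = entry.get("op")
    if op in ("i", "d"):
        return entry["o"]["_id"]
    if op == "u":
        return entry["o2"]["_id"]
    return None  # "c", "n", or anything else we do not process


# Hand-written sample entries, not real oplog output.
insert = {"op": "i", "o": {"_id": 1, "name": "Ada"}}
update = {"op": "u", "o": {"$set": {"name": "Grace"}}, "o2": {"_id": 1}}
delete = {"op": "d", "o": {"_id": 1}}
noop = {"op": "n", "o": {"msg": "periodic noop"}}

ids = [affected_id(e) for e in (insert, update, delete, noop)]
```

Grouping entries by this _id is one way to see every change made to a given document in order.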
Problem
Let us consider a large MongoDB collection containing over 1TB of data which needs to be transferred to a warehouse, lake, or another storage system every day. One method is to perform a full export every day using the mongoexport utility. However, we quickly find that it can take a long time, which makes it infeasible for a daily export. We also have to consider the performance impact on the cluster itself.
Another way is to export once, then get the updates (oplog) for one day and apply them to the existing objects. Reading the oplog requires fewer resources on the MongoDB cluster than a full export, and it also allows applying changes at any frequency required.
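The idea of applying a day of oplog entries to an earlier export can be sketched in a few lines of Python. This is a simplified illustration, not a production implementation. It assumes the export is held in memory as a dictionary keyed by _id, that entries arrive in oplog order, and that updates use only $set and $unset with dotted paths into nested documents (array positions are not handled). The names apply_oplog, set_path and unset_path are hypothetical.

```python
from copy import deepcopy


def set_path(doc, path, value):
    """Set a possibly dotted path such as "address.city" (dicts only)."""
    parts = path.split(".")
    for key in parts[:-1]:
        doc = doc.setdefault(key, {})
    doc[parts[-1]] = value


def unset_path(doc, path):
    """Remove a possibly dotted path; a missing path is silently ignored."""
    parts = path.split(".")
    for key in parts[:-1]:
        doc = doc.get(key)
        if not isinstance(doc, dict):
            return
    doc.pop(parts[-1], None)


def apply_oplog(snapshot, entries):
    """Return a new {_id: document} mapping with the entries applied in order."""
    docs = deepcopy(snapshot)
    for entry in entries:
        op = entry["op"]
        if op == "i":
            docs[entry["o"]["_id"]] = deepcopy(entry["o"])
        elif op == "d":
            docs.pop(entry["o"]["_id"], None)
        elif op == "u":
            doc = docs.get(entry["o2"]["_id"])
            if doc is None:
                continue  # assumption: skip updates for documents we never saw
            for path, value in entry["o"].get("$set", {}).items():
                set_path(doc, path, value)
            for path in entry["o"].get("$unset", {}):
                unset_path(doc, path)
        # "c" and "n" entries fall through and are ignored
    return docs


# Hand-written sample data, not real export or oplog output.
snapshot = {1: {"_id": 1, "name": "a", "tmp": True}, 2: {"_id": 2}}
entries = [
    {"op": "u", "o": {"$set": {"name": "b", "addr.city": "X"},
                      "$unset": {"tmp": True}}, "o2": {"_id": 1}},
    {"op": "d", "o": {"_id": 2}},
    {"op": "i", "o": {"_id": 3, "name": "c"}},
    {"op": "n", "o": {"msg": "periodic noop"}},
]
result = apply_oplog(snapshot, entries)
```

Because the function copies the snapshot and only reads the entries, the same day of oplog can be replayed against the same export without changing the inputs.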