Understanding MongoDB Oplog

MongoDB, like many other databases, internally operates using a transaction log. In MongoDB’s case, it is called the oplog. I have been looking into the oplog to understand its operations a bit better so that I can process them for data ingestion. This post documents my learnings.

The oplog is a log of every internal operation used for replication in a MongoDB cluster. In a sharded cluster, each replica set has its own oplog. In all cases, the oplog is a capped collection and can be accessed like any other collection in MongoDB. There are two types of operations in MongoDB: commands and data manipulation ops.
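For example, it can be queried with pymongo like any other collection (a quick sketch; the connection string is assumed):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed local replica set
# The oplog lives in the local database as the capped collection oplog.rs
print(client.local["oplog.rs"].find_one())
```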

Oplog entry structure

Before diving into the oplog commands and values, let’s look at how an oplog entry is structured.
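Here is roughly what an insert entry looks like when read back with pymongo; the field values below are made up for illustration:

```python
from bson.objectid import ObjectId
from bson.timestamp import Timestamp

oplog_entry = {
    "ts": Timestamp(1549263503, 1),    # when the operation happened
    "v": 2,                            # oplog format version
    "op": "i",                         # operation type: i = insert
    "ns": "mydb.items",                # namespace: "<database>.<collection>"
    "fromMigrate": False,              # True when caused by a chunk migration
    "o": {"_id": ObjectId(), "a": 1},  # the document that was written
}
```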

This is a simple example and shows the most common and useful fields. There are many more fields for other types of operations.

Data Operations

I’m going to focus on data manipulation operations and ignore database commands for now. There are only three data manipulation operation types: insert (i), update (u) and delete (d). Each operation is also idempotent, meaning that applying the same operation to a document multiple times produces the same result as applying it once. This is important for understanding how the oplog is interpreted and processed.
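To make the idempotency point concrete, here is a small sketch (the document and the naive apply function are made up for illustration):

```python
def apply_set(document, op):
    # Naively apply an oplog-style $set to an in-memory dict
    updated = dict(document)
    updated.update(op["$set"])
    return updated

doc = {"_id": 1, "a": 1}
update = {"$set": {"a": 5}}

once = apply_set(doc, update)
twice = apply_set(once, update)
assert once == twice  # re-applying the same operation changes nothing
```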

Inserts

The insert operation lists the inserted document as the value of the o field, without an o2 field. If there is a bulk insert or multiple documents being created, each document gets its own oplog entry. The o field includes an _id corresponding to the document ID.
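For example, inserting two documents in one call still produces two separate entries, roughly like this (values are made up):

```python
from bson.objectid import ObjectId

# insert_many([{"a": 1}, {"a": 2}]) yields one oplog entry per document
entries = [
    {"op": "i", "ns": "mydb.items", "o": {"_id": ObjectId(), "a": 1}},
    {"op": "i", "ns": "mydb.items", "o": {"_id": ObjectId(), "a": 2}},
]
```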

Updates

The update operation is for updating parts of a document. In this case, the o2 field contains the _id of the updated document. The o field contains the operation in terms of $set or $unset. There is never a delta or incremental operation: $set records the final updated value rather than the increment. For example, if {"a": 1} gets updated to {"a": 5}, the o field will be {"$set": {"a": 5}}. If a field gets removed, the $unset object will contain the field name.
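As a sketch, the two cases look roughly like this (the namespace and _id are made up):

```python
from bson.objectid import ObjectId

doc_id = ObjectId()  # _id of the document being updated

# {"a": 1} updated to {"a": 5}: the oplog records the final value, not a delta
set_entry = {
    "op": "u",
    "ns": "mydb.items",
    "o2": {"_id": doc_id},    # which document was updated
    "o": {"$set": {"a": 5}},  # the final value of the changed field
}

# Removing the field shows up as $unset
unset_entry = {
    "op": "u",
    "ns": "mydb.items",
    "o2": {"_id": doc_id},
    "o": {"$unset": {"a": True}},
}
```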

Deletes

The delete operation is for deleting entire documents from collections. Unlike insert and update, it does not list the document contents. It only lists the _id of the document in the o field.
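A delete entry therefore looks roughly like this (values are made up):

```python
from bson.objectid import ObjectId

delete_entry = {
    "op": "d",
    "ns": "mydb.items",
    "o": {"_id": ObjectId()},  # only the _id of the deleted document
}
```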

Tailing Oplog

I work with Python, so I will point you to Python resources.

The oplog is stored in the local database as the oplog.rs collection. This is a capped collection, but it can be tailed using a cursor. In Python, the pymongo package can be used to connect to MongoDB and tail the oplog. The pymongo docs have a good example.
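A minimal sketch of that pattern, assuming a local replica set (the connection string is a placeholder and error handling is omitted):

```python
import time

from pymongo import CursorType, MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed connection string
oplog = client.local["oplog.rs"]

# Start tailing from the most recent entry currently in the oplog
last = oplog.find().sort("$natural", -1).limit(1).next()
ts = last["ts"]

while True:
    cursor = oplog.find({"ts": {"$gt": ts}}, cursor_type=CursorType.TAILABLE_AWAIT)
    while cursor.alive:
        for entry in cursor:
            ts = entry["ts"]
            print(entry["op"], entry["ns"])  # process the entry here
        # No new entries at the moment; wait a little before polling again
        time.sleep(1)
```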

Processing Oplog

Knowing the above operations and how they are represented, we can start to think about how to process the oplog. Why would we need to process the oplog? One of the main reasons is to stream changes from MongoDB and generate snapshots for ingesting into a data lake or warehouse. You could also use it directly in your application. To do this, we first need to understand a few nuances of the oplog.

Timestamp

If you aren’t using a library that understands BSON, you will likely see the timestamp as a 64-bit long. Like Unix time, you can treat this as a serially increasing timestamp. The BSON Timestamp type is actually made up of two parts: time and increment. The time portion is a Unix timestamp in seconds since the epoch, and the increment is a serially increasing number denoting the operation’s position within that second.

A BSON Timestamp is written as Timestamp(1549263503, 1), and as a long this is 6654036078271397889. The most significant 32 bits are the time part and the least significant 32 bits are the increment. In Python, this is simple to compute (using the Timestamp class from the bson package that ships with pymongo):
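```python
from bson.timestamp import Timestamp

ts = Timestamp(1549263503, 1)

# Pack the two 32-bit parts into a single 64-bit long
as_long = (ts.time << 32) + ts.inc
assert as_long == 6654036078271397889

# And unpack it again
time_part = as_long >> 32        # 1549263503
inc_part = as_long & 0xFFFFFFFF  # 1
```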

Resharding operations

In the first section, you saw there was a special field in the oplog entry called fromMigrate. It denotes whether the operation belongs to an internal transfer of documents from one shard to another.

During a resharding operation, you will see delete operations with fromMigrate: true on the original shard and insert operations with fromMigrate: true on the new shard. These operations can be ignored, as they don’t change the overall state of the database.
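A consumer of the oplog can therefore filter on this flag, for example (assuming the entry shape shown earlier):

```python
def should_process(entry):
    # Skip internal chunk-migration traffic between shards; it does not
    # change the logical state of the database
    return not entry.get("fromMigrate", False)
```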

If the resharding operation is interrupted, it can leave orphaned documents behind. These won’t usually be visible unless you connect to the replica set directly. They can be manually updated or deleted, but there is no robust way to ignore them.

Failovers and Primary Reelections

MongoDB, being a distributed database, has the concept of a primary node, which can change from time to time. There are multiple strategies to handle this, and they can become complex. See this blog post from MongoDB for more information.

References