An Overview of Databases — Part 8.1: Clocks (MongoDB Replica)

Saeed Vayghani
10 min read · Aug 16, 2024


Part 1: DBMS Flow
Part 2: Non-Relational DB vs Relational

Part 3: CAP and BASE Theorem

Part 4: How to choose a Database?
Part 5: Different Solutions for Different Problems
Part 6: Concurrency Control
Part 7: Distributed DBMS
>> Part 7.1: Distributed DBMS (Apache Spark, Parquet + Pyspark + Node.js)
>> Part 7.2: Distributed DBMS (PostgreSQL Partitioning)
Part 8: Clocks
>> Part 8.1: Clocks (MongoDB Replica)
>> Part 8.2: Clocks (MongoDB Replica and Causal Consistency)
Part 9: DB Design Mastery
Part 10: Vector DB
Part 11: An interesting case, coming soon!

What we are going to discuss in this post:

  1. Setting up the environment
  2. MongoDB Consensus
  3. Write and Read Concern
  4. MongoDB Session
  5. DB consistency and Causal Consistency

Setting up the environment

A MongoDB replica set is a group of MongoDB servers that maintain the same data set, providing redundancy and high availability. A typical replica set consists of:

  1. Primary: The node that receives all write operations and coordinates data replication to the secondaries.
  2. Secondaries: Nodes that replicate the data from the primary. They can vote in elections for primary and can serve read operations based on read preferences.

In our scenario, we have a replica set consisting of:

  • Primary: The main node handling writes.
  • Secondary 1: A secondary node with a 10-second network delay.
  • Secondary 2: Another secondary node with a 15-second network delay.

Challenges with Network Delays

Network delays in secondaries affect several aspects of your replica set:

  1. Replication Lag: The time it takes for changes made on the primary to be replicated to the secondaries.
  2. Read Operations: If your read concern is set to "majority" or you have read preference routing some reads to the secondaries, the delayed replication can impact the freshness of the data you read.
  3. Failover and Elections: If the primary fails, an election is triggered to choose a new primary.

MongoDB Eventual Consistency

Here is the dockerized version of our MongoDB replica/cluster:

# Creating a MongoDB keyfile

openssl rand -base64 756 > /path/to/mongodb-keyfile

chmod 400 /path/to/mongodb-keyfile
# File Name: docker-compose.yml

services:
  mongo1:
    image: mongo:latest
    container_name: mongo1
    restart: unless-stopped
    hostname: mongo1
    ports:
      - "27017:27017"
    environment:
      MONGO_INITDB_ROOT_USERNAME: root
      MONGO_INITDB_ROOT_PASSWORD: example
    command: ["mongod", "--replSet", "rs0", "--bind_ip_all", "--auth", "--keyFile", "/etc/secrets/mongo-keyfile"]
    volumes:
      - mongo-data1:/data/db
      - ./mongo-keyfile:/etc/secrets/mongo-keyfile:ro
    healthcheck:
      test: ["CMD-SHELL", "echo 'db.runCommand(\"ping\").ok' | mongosh localhost:27017/admin -u root -p example --quiet"]
      interval: 10s
      timeout: 10s
      retries: 5

  mongo2:
    image: mongo:latest
    container_name: mongo2
    restart: unless-stopped
    hostname: mongo2
    ports:
      - "27018:27017"
    environment:
      MONGO_INITDB_ROOT_USERNAME: root
      MONGO_INITDB_ROOT_PASSWORD: example
    command: ["mongod", "--replSet", "rs0", "--bind_ip_all", "--auth", "--keyFile", "/etc/secrets/mongo-keyfile"]
    volumes:
      - mongo-data2:/data/db
      - ./mongo-keyfile:/etc/secrets/mongo-keyfile:ro
    healthcheck:
      test: ["CMD-SHELL", "echo 'db.runCommand(\"ping\").ok' | mongosh localhost:27017/admin -u root -p example --quiet"]
      interval: 10s
      timeout: 10s
      retries: 5

  mongo3:
    image: mongo:latest
    container_name: mongo3
    restart: unless-stopped
    hostname: mongo3
    ports:
      - "27019:27017"
    environment:
      MONGO_INITDB_ROOT_USERNAME: root
      MONGO_INITDB_ROOT_PASSWORD: example
    command: ["mongod", "--replSet", "rs0", "--bind_ip_all", "--auth", "--keyFile", "/etc/secrets/mongo-keyfile"]
    volumes:
      - mongo-data3:/data/db
      - ./mongo-keyfile:/etc/secrets/mongo-keyfile:ro
    healthcheck:
      test: ["CMD-SHELL", "echo 'db.runCommand(\"ping\").ok' | mongosh localhost:27017/admin -u root -p example --quiet"]
      interval: 10s
      timeout: 10s
      retries: 5

  mongo-init-replica:
    image: mongo:latest
    container_name: mongo-init-replica
    depends_on:
      - mongo1
      - mongo2
      - mongo3
    entrypoint: ["sh", "-c", "bash /docker-entrypoint-initdb.d/mongo-init.sh"]
    volumes:
      - ./mongo-init.sh:/docker-entrypoint-initdb.d/mongo-init.sh

  mongo-express:
    image: mongo-express:latest
    container_name: mongo-express
    depends_on:
      - mongo1
      - mongo2
      - mongo3
    ports:
      - "8081:8081"
    environment:
      - ME_CONFIG_OPTIONS_EDITORTHEME=default
      - ME_CONFIG_MONGODB_SERVER=mongo1
      - ME_CONFIG_MONGODB_ADMINUSERNAME=root
      - ME_CONFIG_MONGODB_ADMINPASSWORD=example
      - ME_CONFIG_BASICAUTH_USERNAME=admin
      - ME_CONFIG_BASICAUTH_PASSWORD=admin

volumes:
  mongo-data1:
  mongo-data2:
  mongo-data3:



------------------------------
# File Name: mongo-init.sh

#!/bin/bash

sleep 10

echo "Initializing MongoDB Replica Set"
mongosh --host mongo1:27017 -u root -p example --authenticationDatabase admin --eval 'rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "mongo1:27017" },
    { _id: 1, host: "mongo2:27017" },
    { _id: 2, host: "mongo3:27017" }
  ]
})'
------------------------------
# Some helper commands to check the cluster

# starting the replica
docker compose up -d

# connect to the primary one
docker exec -it mongo1 mongosh -u root -p example --authenticationDatabase admin

# connect to the secondary nodes
docker exec -it mongo2 mongosh -u root -p example --authenticationDatabase admin
docker exec -it mongo3 mongosh -u root -p example --authenticationDatabase admin


# check the replica configuration
rs.conf()
rs.status()


# add a custom delay to the secondary nodes (run on the primary; a delayed member must have priority 0)
cfg = rs.conf()
cfg.members[1].priority = 0
cfg.members[1].hidden = true
cfg.members[1].secondaryDelaySecs = 10
rs.reconfig(cfg)

cfg = rs.conf()
cfg.members[2].priority = 0
cfg.members[2].hidden = true
cfg.members[2].secondaryDelaySecs = 15
rs.reconfig(cfg)

MongoDB Consensus

A MongoDB replica set is a group of MongoDB instances that maintain the same dataset, providing redundancy and high availability. A typical configuration involves:

  • One Primary: The node that receives all write operations.
  • Two Secondaries: Nodes that replicate the data from the primary.

How the Replica Set Works

1. Primary Node:

  • Writes: All write operations are directed to the primary node. The writes are recorded in the primary’s oplog (operations log).
  • Leadership: The primary node is responsible for the coordination of writes and the consistency of the dataset. It uses heartbeats to send periodic signals to the secondaries to ensure they are operational.

2. Secondary Nodes:

  • Replication: Secondary nodes replicate data by reading the oplog from the primary and applying the operations to their data set. This replication happens asynchronously, which means there is a small delay (replication lag) in the secondaries.
  • Read Operations: Secondary nodes can also serve read operations if the read preference is configured to allow it. Depending on network delays and replication lag, the data on the secondaries may be slightly behind the primary.

Consensus and Elections

Raft-like Consensus Protocol:
MongoDB's replication protocol is inspired by Raft. It uses a consensus mechanism to ensure that at most one node acts as primary at any time, through elections and leader (primary) selection.

Elections:
If the primary node fails or becomes unreachable, an election process is triggered among the remaining members of the replica set to select a new primary. Nodes vote for a new primary, and the one with the most up-to-date data (with the highest term number and log index) is chosen as the new primary.

Heartbeat Mechanism:
Each member of the replica set sends regular heartbeats to other members to confirm that they are up and running. If a secondary node does not receive a heartbeat from the primary within a certain threshold, it may initiate an election to pick a new primary.
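To build intuition for the election step, here is a toy model (not MongoDB's actual implementation, which also weighs member priorities, votes, and timeouts): among the reachable members, the one with the most up-to-date oplog, compared by (term, log index), wins, and no primary can be elected without a majority. The node names and numbers mirror our delayed-secondary scenario.

```python
# Toy model of a Raft-style election among replica-set members.
# Simplification: each node is just (name, term, log_index); real MongoDB
# elections also involve priorities, actual votes, and heartbeat timeouts.

def elect_primary(nodes, failed):
    """Pick the reachable node with the most up-to-date oplog,
    comparing (term, log_index) lexicographically."""
    candidates = [n for n in nodes if n["name"] not in failed]
    # A majority of the original set must be reachable to elect a primary.
    if len(candidates) <= len(nodes) // 2:
        return None
    return max(candidates, key=lambda n: (n["term"], n["log_index"]))["name"]

nodes = [
    {"name": "mongo1", "term": 5, "log_index": 120},  # current primary
    {"name": "mongo2", "term": 5, "log_index": 118},  # 10 s behind
    {"name": "mongo3", "term": 5, "log_index": 115},  # 15 s behind
]

print(elect_primary(nodes, failed={"mongo1"}))            # mongo2: freshest log
print(elect_primary(nodes, failed={"mongo1", "mongo2"}))  # None: no majority left
```

Note the second call: losing two of three members leaves no majority, so the set stays without a primary rather than risking a split-brain.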

Write and Read Concern

In MongoDB, write concern and read concern are two important concepts that relate to the consistency and durability of data.

Write Concern

Write concern determines the level of acknowledgment requested from MongoDB when performing write operations (inserts, updates, deletes). It allows you to control the level of guarantee that MongoDB provides when reporting the success of a write operation.

Levels of Write Concern

  1. w: 0 - Do not wait for acknowledgment from the server. The driver will not even check if the operation was successful. This is the fastest but least safe option.
  2. w: 1 - Wait for acknowledgment from the primary only. The write is acknowledged once it is applied on the primary; it may not yet be journaled or replicated.
  3. w: <number> - Wait for acknowledgment from the primary server and the specified number of secondary servers.
  4. w: "majority" - Wait for acknowledgment from the majority of replicas in the replica set. This is a good balance between data safety and performance.
  5. j: true or journal: true - Ensure the write operation is committed to the journal on the primary server before acknowledging.
db.collection.insertOne(
  { item: "apple", qty: 100 },
  { writeConcern: { w: "majority", j: true } }
)

Read Concern

Read concern defines the level of isolation for read operations. It determines what data is visible to the read operations and how much consistency is guaranteed.

Levels of Read Concern

  1. "local" (default) - Read from the instance's local data, which might not include the most recent writes from the primary.
  2. "majority" - Returns only the data that has been acknowledged by the majority of the replica set members. This ensures you are reading data that is durable.
  3. "linearizable" - Ensures that the read operation is linearizable, meaning it reflects all acknowledged writes. Suitable for operations that require strictly consistent data.
  4. "available" - Provides no guarantees about the recency of the data. Reads could be significantly behind the actual state of the data.
  5. "snapshot" - Reads data from a specific snapshot (available for certain storage engines like WiredTiger and applicable in transactions).
db.collection.find({ item: "apple" }).readConcern("majority")

Example Use Cases:

1. High Performance, Low Durability:

  • Write Concern: { w: 0 }
  • Read Concern: "local"

2. Balanced Durability and Performance:

  • Write Concern: { w: 1, j: true }
  • Read Concern: "majority"

3. Maximum Consistency and Durability:

  • Write Concern: { w: "majority", j: true }
  • Read Concern: "linearizable"

Simple Implementation: You can find sample code and some explanation on GitHub.

MongoDB Session

In MongoDB, the startSession function is used to create a new session for operations. Sessions are essential for a variety of advanced database features, specifically for multi-document ACID transactions and retryable writes.

Here’s a detailed explanation of what startSession does and why it's important:

What startSession Does

  1. Creates a New Session: When you call startSession, it creates a new session object. This session object is used to group operations together.
  2. Tracks Operations: The session can track multiple operations, allowing the database to treat them as part of a single logical grouping.
  3. Provides Context for Transactions: Sessions are necessary to start, commit, or abort transactions. Transactions allow atomicity across multiple operations and documents.
  4. Supports Causal Consistency: Sessions help ensure causal consistency in your reads. Within a session, causal consistency guarantees that reads reflect that session's earlier writes.

Why startSession is Important

1. Multi-Document Transactions:

  • In MongoDB, a session is required to perform multi-document transactions. Transactions allow you to make atomic, consistent, isolated, and durable (ACID) changes across multiple documents and collections.
  • This is particularly useful for applications that require complex data integrity guarantees.

2. Retryable Writes:

  • MongoDB supports retryable writes (e.g., insertOne, updateOne, deleteOne), which automatically retries certain write operations if they fail due to network errors, primary elections, etc.
  • These retryable writes often rely on sessions to ensure that the operation is completed without duplication.

3. Causal Consistency:

  • When operations are performed in a session, MongoDB ensures that reads are causally consistent. This means that a read operation within a session reflects the most recent writes that were performed within the same session.
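The retryable-writes point deserves a sketch. The mechanism can be modeled as the server remembering the highest transaction number it has executed per session, so a retry of the same write (after a lost acknowledgment, for example) is detected and not applied twice. This is a toy simulation, not the server's actual bookkeeping:

```python
# Toy model of retryable writes: the server tracks the highest txn_number
# executed per session, so a retried write after a network error is
# recognized and not applied a second time.

class Server:
    def __init__(self):
        self.docs = []
        self.executed = {}   # session_id -> last txn_number applied

    def insert(self, session_id, txn_number, doc):
        if self.executed.get(session_id, -1) >= txn_number:
            return "already-applied"      # retry detected, no duplicate
        self.docs.append(doc)
        self.executed[session_id] = txn_number
        return "applied"

server = Server()
sid, txn = "session-1", 1
print(server.insert(sid, txn, {"item": "apple"}))  # applied
# The ack is lost in transit, so the driver retries the same write:
print(server.insert(sid, txn, {"item": "apple"}))  # already-applied
print(len(server.docs))                            # 1 -> no duplicate
```

Without the session identity, the server could not distinguish a retry from a genuinely new insert, which is why retryable writes require sessions.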

DB consistency and Causal Consistency

There are a few different ways to ensure consistency and guarantee the causal order of events. Each method provides different guarantees related to consistency, durability, and operation sequencing, but only some of them specifically address causal consistency.

Causal Consistency ensures that operations within a session reflect a sequential order of execution, respecting cause-effect relationships. In simpler terms, it guarantees that:

  1. If operation A causally precedes operation B, then any read that sees the effects of B also sees the effects of A.
  2. Read operations observe a consistent, predictable order of writes.

This is critical in distributed systems to ensure that operations appear to execute in a logical sequence.

MongoDB utilizes a logical clock, specifically a Lamport clock, referred to as clusterTime. This logical clock is designed to partially order events across all servers in a replica set or a sharded cluster.

To illustrate, consider the following sequential operations:

1. Client Writes Data:

  • Client A writes data to Server 1.
  • Server 1 advances its logical clock and returns clusterTime = T1 to Client A.

2. Client Reads Data:

  • Client A sends a read request to Server 2 with afterClusterTime = T1.
  • Server 2 processes this read request.
  • If Server 2’s logical clock is behind T1, it waits until it advances to T1.
  • Once Server 2 has caught up, it returns data that reflects all writes up to T1.

Benefits of logical clock:

  • Ensuring Coherence: This mechanism ensures that all causally-related operations maintain their order and coherence, providing a consistent view of the data.
  • Robust Consistency Model: By using logical clocks and afterClusterTime, MongoDB offers a robust and efficient consistency model suitable for distributed environments.
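The two-step exchange above can be simulated with a minimal Lamport clock. This sketch is a simplification (MongoDB's clusterTime is a richer structure carried on every message, and a real node blocks rather than returning a "wait" marker), but it shows the core rules: tick on each event, take the max of the two clocks when state flows between nodes, and refuse to serve a read whose afterClusterTime is ahead of the local clock.

```python
# Toy Lamport-clock sketch of clusterTime / afterClusterTime.
# Simplified: a real MongoDB node blocks until it catches up; here we
# just report that the node is behind the requested time.

class Node:
    def __init__(self):
        self.cluster_time = 0
        self.data = {}

    def write(self, key, value):
        self.cluster_time += 1                 # tick on every event
        self.data[key] = (value, self.cluster_time)
        return self.cluster_time               # T1, returned to the client

    def replicate_from(self, other):
        self.data = dict(other.data)
        # Lamport merge: take the max of the two clocks, then tick.
        self.cluster_time = max(self.cluster_time, other.cluster_time) + 1

    def read(self, key, after_cluster_time):
        if self.cluster_time < after_cluster_time:
            return "wait: node is behind T1"   # real node would block here
        return self.data[key][0]

server1, server2 = Node(), Node()
t1 = server1.write("item", "apple")        # Client A writes, receives T1
print(server2.read("item", t1))            # wait: node is behind T1
server2.replicate_from(server1)            # replication catches server2 up
print(server2.read("item", t1))            # apple
```

The key property: because server2's clock only reaches T1 after it has absorbed server1's state, a read gated on `afterClusterTime = T1` can never observe a world that is missing the write that produced T1.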

Here’s a breakdown of which methods ensure causal consistency and which provide other types of consistency:

Methods that Ensure Causal Consistency

1. Cluster Time: By manually managing the clusterTime, you can ensure that reads reflect all causally prior writes up to a specific logical time.

2. Sessions with Causal Consistency: When causalConsistency is set to true in a session, the MongoDB driver ensures that all operations within the session reflect a causal order.

3. Read Concern majority: Using the majority read concern can indirectly ensure causal consistency by reflecting the most recent majority-committed writes, providing a consistent view across replica sets.

Methods that Do Not Specifically Ensure Causal Consistency

4. Write Concern: Write concerns ensure that writes are acknowledged according to specified criteria, but they do not directly address causal relationships between operations.

5. Read Preference: Read preference directs reads to specific replica set members but does not inherently manage the causal order of operations.

6. Transactions: While transactions provide atomicity and strict consistency guarantees for a sequence of operations, they ensure causal consistency within the scope of the transaction itself. However, they do not ensure causal consistency between operations inside and outside the transaction.

In the next article we will cover these six methods in more detail, with sample code.


Saeed Vayghani

Software engineer and application architect. Interested in free and open source software.