MongoDB and WiredTiger: A Journey Through the Storage Engine

Elie Hannouch
5 min read · Sep 21, 2023

Databases are the backbone of modern applications, and MongoDB stands out for its flexibility and scalability. Central to its functionality is WiredTiger, MongoDB’s default storage engine since version 3.2. WiredTiger combines document-level concurrency for high throughput, block compression for compact storage, and an in-memory cache for fast access to frequently used data.

With write-ahead logging for durability and MultiVersion Concurrency Control (MVCC) for consistent point-in-time views of data, WiredTiger handles the core of MongoDB’s data management.

This exploration will delve into the intricacies of WiredTiger, shedding light on the processes and techniques that ensure efficient data storage and retrieval in MongoDB.

Step 1: Data’s First Stop: Initial Write and In-Memory Storage

Data’s first touchpoint is a volatile staging ground: the WiredTiger cache in memory. Writes are applied here first, so later reads and writes to the same data are served from memory instead of waiting on disk.

Example: When Jane uploads a new photo, the write is not applied to the data files on disk right away. It is first staged in memory, where it is immediately visible to her followers.

Step 2: Buffer Pool: Data’s Temporary Residence

The buffer pool, which WiredTiger’s own documentation calls the cache, is the engine’s workhorse. Acting as an intermediary between volatile memory and persistent storage, it keeps hot (frequently accessed) pages in memory and evicts cold ones when space runs low.

Example: Jane’s photo, gaining popularity, remains in the buffer pool longer, ensuring that her followers can quickly view it without delays.
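To make the hot-versus-cold idea concrete, here is a minimal sketch of a page cache with least-recently-used eviction. It is a toy model only: the `PageCache` class, the page names, and the strict LRU policy are assumptions for illustration, and WiredTiger's real eviction logic is considerably more involved.

```python
from collections import OrderedDict


class PageCache:
    """Toy cache that evicts the least recently used page when full."""

    def __init__(self, capacity, disk):
        self.capacity = capacity      # max pages held in memory
        self.disk = disk              # dict standing in for the data files
        self.pages = OrderedDict()    # page_id -> contents, coldest first
        self.disk_reads = 0

    def read(self, page_id):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)   # mark as recently used
            return self.pages[page_id]
        self.disk_reads += 1                  # cache miss: go to "disk"
        value = self.disk[page_id]
        self.pages[page_id] = value
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)    # evict the coldest page
        return value


disk = {"jane_photo": "JPEG bytes", "old_post_1": "text", "old_post_2": "text"}
cache = PageCache(capacity=2, disk=disk)
for page in ["jane_photo", "old_post_1", "jane_photo", "old_post_2", "jane_photo"]:
    cache.read(page)
print(cache.disk_reads)     # 3: one miss per distinct page
print(list(cache.pages))    # ['old_post_2', 'jane_photo']
```

Because the popular photo keeps getting touched, it is never the coldest entry, so the less popular post is the one that gets evicted.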

Step 3: Safeguarding Data: Write-Ahead Logging (WAL) in Action

WAL (Write-Ahead Logging), which MongoDB calls journaling, is not just a backup; it’s a commitment. By recording each change in a sequential log before that change reaches the data files, it provides durability and enables recovery from unexpected interruptions.

Example: If the database shuts down abruptly after her upload has been journaled, Jane’s photo won’t be lost. On restart, the journal is replayed so that her upload is reconstructed and applied to the data files.
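The log-first, apply-second ordering can be sketched in a few lines. This is a simplified illustration, not MongoDB's journal format: `TinyStore`, the JSON-lines file layout, and the fsync-on-every-write policy are all assumptions chosen to keep the example short (real engines typically batch their syncs).

```python
import json
import os
import tempfile


class TinyStore:
    """Toy key-value store that journals each write before applying it."""

    def __init__(self, journal_path):
        self.journal_path = journal_path
        self.data = {}                      # in-memory state, lost on a crash

    def put(self, key, value):
        record = json.dumps({"key": key, "value": value})
        with open(self.journal_path, "a", encoding="utf-8") as journal:
            journal.write(record + "\n")
            journal.flush()
            os.fsync(journal.fileno())      # force the record to stable storage
        self.data[key] = value              # only now apply the change

    @classmethod
    def recover(cls, journal_path):
        """Rebuild the in-memory state by replaying the journal in order."""
        store = cls(journal_path)
        if os.path.exists(journal_path):
            with open(journal_path, encoding="utf-8") as journal:
                for line in journal:
                    record = json.loads(line)
                    store.data[record["key"]] = record["value"]
        return store


journal_file = os.path.join(tempfile.mkdtemp(), "journal.log")
store = TinyStore(journal_file)
store.put("photo:1", {"owner": "jane", "caption": "Sunset"})
del store                                   # simulate a crash: memory is gone
recovered = TinyStore.recover(journal_file)
print(recovered.data["photo:1"]["caption"])   # Sunset
```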

Step 4: Efficient Data Navigation: The Role of B-Trees

B-Trees aren’t just static data structures; they reshape themselves as data changes. WiredTiger uses them to organize collections and indexes hierarchically, optimizing searches, insertions, and deletions. Because the tree stays balanced as it grows, a lookup visits a number of pages that grows only logarithmically with the amount of data.

Example: When someone searches for Jane’s photo using a tag, and the tag field is indexed, the B-Tree makes the search efficient: it descends from the root through a few levels of pages to reach the result.
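A lookup in such a tree can be sketched as a walk from the root page down to a leaf. The two-level tree below is hand-built and hypothetical (the `Node` class, the tags, and the photo IDs are made up for illustration); it shows only the search path, not how WiredTiger lays out pages.

```python
from bisect import bisect_right


class Node:
    def __init__(self, keys, children=None, values=None):
        self.keys = keys            # sorted keys
        self.children = children    # child nodes (internal pages only)
        self.values = values        # payloads (leaf pages only)


def search(node, key):
    """Walk from the root to a leaf, returning (value, pages_visited)."""
    visited = 1
    while node.children is not None:
        # keys[i] is the smallest key stored under children[i + 1]
        node = node.children[bisect_right(node.keys, key)]
        visited += 1
    for k, v in zip(node.keys, node.values):
        if k == key:
            return v, visited
    return None, visited


leaves = [
    Node(["beach", "city"], values=["photo:7", "photo:3"]),
    Node(["forest", "lake"], values=["photo:9", "photo:1"]),
    Node(["sunset", "winter"], values=["photo:42", "photo:5"]),
]
root = Node(["forest", "sunset"], children=leaves)

print(search(root, "sunset"))   # ('photo:42', 2)
print(search(root, "desert"))   # (None, 2)
```

Either way the search touches two pages, the root and one leaf, regardless of how many leaves sit side by side.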

Step 5: The Art of Compression

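WiredTiger’s compression is a masterclass in space optimization. It applies lossless block compression to collection data (snappy by default, with zlib and zstd available as alternatives) and prefix compression to indexes, reducing storage costs and I/O at the price of some CPU time.

Example: The comments and metadata around Jane’s photo are stored compressed and take up far less disk space, yet read back exactly as written. An image that is already compressed, such as a JPEG, gains little from a second pass.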
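The effect of lossless block compression is easy to observe with Python's standard library, which ships zlib, one of the compressors WiredTiger supports. The sample documents below are hypothetical, and JSON is used as a stand-in for MongoDB's BSON encoding, so the exact ratio is illustrative only.

```python
import json
import zlib

# Hypothetical documents: repeated field names and values compress well.
comments = [
    {"photo_id": 42, "author": f"user{i}", "text": "Great shot!"}
    for i in range(200)
]
raw = json.dumps(comments).encode("utf-8")

compressed = zlib.compress(raw, level=6)
restored = zlib.decompress(compressed)

print(len(raw), len(compressed))   # the compressed size is much smaller
print(restored == raw)             # True: compression is lossless
```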

Step 6: Ensuring Durability: Data’s Journey to Disk

This is where data earns its permanence. But it’s not a mere act of writing: rather than overwriting pages in place, WiredTiger writes updated pages to new locations and only then switches over to them, so an interruption mid-write leaves the previous consistent version intact.

Example: As Jane’s photo finds its permanent home on the disk, techniques like copy-on-write ensure that even a power outage mid-write won’t corrupt her photo.
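The copy-on-write idea can be sketched with an append-only list of blocks and a pointer table. This is a toy model: `CowStore` and its `crash_before_commit` flag are hypothetical names used to simulate a power loss, not part of any real API.

```python
class CowStore:
    """Toy copy-on-write store: updates never overwrite a live block."""

    def __init__(self):
        self.blocks = []     # append-only "disk" blocks
        self.root = {}       # page_id -> index of the current block

    def write(self, page_id, contents, crash_before_commit=False):
        self.blocks.append(contents)          # 1. write the new version elsewhere
        if crash_before_commit:
            raise RuntimeError("power lost mid-write")
        new_root = dict(self.root)
        new_root[page_id] = len(self.blocks) - 1
        self.root = new_root                  # 2. switch the pointer in one step

    def read(self, page_id):
        return self.blocks[self.root[page_id]]


store = CowStore()
store.write("photo:42", "caption: Sunset")
try:
    store.write("photo:42", "caption: Sunset at the be", crash_before_commit=True)
except RuntimeError:
    pass
print(store.read("photo:42"))   # caption: Sunset
```

The half-written block exists on the toy disk, but nothing points to it, so readers only ever see the last fully committed version.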

Step 7: Managing Multiple Timelines: MVCC in Action

Concurrency is a challenge, but MVCC (MultiVersion Concurrency Control) turns it into an art. By keeping multiple versions of a document, it lets each reader see a consistent view as of the moment its read began, even while writers are updating the same data.

Example: While Jane updates her photo’s caption, her followers can still view her photo without any inconsistencies.
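A minimal sketch of the versioning idea, assuming a single global commit counter: every write appends a new version, and a reader sees only versions committed at or before its snapshot. `VersionedStore` and the captions are hypothetical; real MVCC also has to handle transactions and version cleanup.

```python
import itertools


class VersionedStore:
    """Toy MVCC store: each write adds a version tagged with a commit timestamp."""

    def __init__(self):
        self.clock = itertools.count(1)
        self.last_commit = 0
        self.versions = {}   # key -> list of (commit_ts, value), oldest first

    def write(self, key, value):
        self.last_commit = next(self.clock)
        self.versions.setdefault(key, []).append((self.last_commit, value))

    def snapshot(self):
        """A snapshot is just the timestamp of the latest commit it may see."""
        return self.last_commit

    def read(self, key, snapshot_ts):
        visible = [v for ts, v in self.versions.get(key, []) if ts <= snapshot_ts]
        return visible[-1] if visible else None


store = VersionedStore()
store.write("caption:42", "Sunset")
reader_snapshot = store.snapshot()             # a follower starts reading
store.write("caption:42", "Sunset in Beirut")  # Jane edits mid-read

print(store.read("caption:42", reader_snapshot))    # Sunset
print(store.read("caption:42", store.snapshot()))   # Sunset in Beirut
```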

Step 8: Capturing Moments — Snapshot Management

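Snapshots provide temporal anchors. Each read operates against a snapshot, a consistent view of the data as of a specific moment, which is what makes consistent reads, transaction rollbacks, and consistent backups possible.

Example: While a backup is scanning Jane’s profile, she edits her caption. The backup keeps reading from its snapshot and sees the caption as it was when the scan began. Snapshots are typically short-lived, though: they are not an undo history, so reverting a caption an hour later is a job for the application, not the storage engine.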

Step 9: Safety Checkpoints

Checkpoints are like lighthouses in data’s vast ocean. Each one marks a consistent state of the data files on disk (by default, MongoDB has WiredTiger take one every 60 seconds), giving recovery a known safe starting point.

Example: In case of a crash, WiredTiger starts from the last checkpoint and replays the journal entries written after it, keeping data loss and corruption to a minimum.
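How a checkpoint and a journal cooperate during recovery can be sketched as follows. The `Engine` class is a hypothetical toy: its "journal" is a plain list that is assumed to survive the crash, standing in for a log file on disk.

```python
import copy


class Engine:
    """Toy engine: a checkpoint captures state, the journal holds later writes."""

    def __init__(self):
        self.memory = {}        # volatile state
        self.checkpoint = {}    # last consistent state "on disk"
        self.journal = []       # writes since the last checkpoint (assumed durable)

    def put(self, key, value):
        self.journal.append((key, value))
        self.memory[key] = value

    def take_checkpoint(self):
        self.checkpoint = copy.deepcopy(self.memory)
        self.journal.clear()    # older journal records are no longer needed

    def crash_and_recover(self):
        self.memory = copy.deepcopy(self.checkpoint)   # start from the safe point
        for key, value in self.journal:                # then replay what came after
            self.memory[key] = value
        return len(self.journal)


engine = Engine()
engine.put("likes:42", 10)
engine.take_checkpoint()
engine.put("likes:42", 11)
engine.put("comments:42", ["Nice!"])
replayed = engine.crash_and_recover()
print(replayed, engine.memory)   # 2 {'likes:42': 11, 'comments:42': ['Nice!']}
```

Only the two writes made after the checkpoint need replaying, which is why frequent checkpoints keep recovery short.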

Step 10: Dynamic Page Management in B-Trees

Within B-Trees, pages are the fundamental units of data storage. WiredTiger’s intelligent algorithms decide when a page should split due to excess data or merge when data is sparse, optimizing storage and retrieval.

Example: As more users like and comment on Jane’s photo, the underlying data page might split, ensuring efficient data organization.
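A page split can be sketched with sorted lists standing in for leaf pages. `MAX_KEYS` is a hypothetical capacity chosen to be tiny so the splits are visible; real pages are sized in kilobytes, and this toy omits merging and the parent pages that point to the leaves.

```python
from bisect import insort

MAX_KEYS = 4   # hypothetical page capacity, far smaller than a real page


def insert(pages, key):
    """Insert key into a sorted list of leaf pages, splitting a full page."""
    # Pick the last page whose first key is <= key (or the first page).
    index = 0
    for i, page in enumerate(pages):
        if page and page[0] <= key:
            index = i
    page = pages[index]
    insort(page, key)
    if len(page) > MAX_KEYS:
        middle = len(page) // 2
        pages[index:index + 1] = [page[:middle], page[middle:]]   # split in two
    return pages


pages = [[]]
for comment_id in [10, 20, 30, 40, 25, 50, 60, 70]:
    insert(pages, comment_id)
print(pages)   # [[10, 20], [25, 30], [40, 50, 60, 70]]
```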

Step 11: Organizing the Data: Data File Management

Data’s residence is a set of files on disk. WiredTiger stores each collection in its own file and each index in its own file, alongside metadata and journal files. Their interplay and management are pivotal for database health.

Example: If Jane’s photos, her list of followers, and her comments are kept in separate collections, each collection lives in its own data file, and each of its indexes lives in another.

Step 12: The Cleanup — Garbage Collection

As data evolves, remnants of its past linger. Garbage collection is WiredTiger’s cleaning crew, reclaiming the space held by obsolete document versions and deleted data so that later writes can reuse it.

Example: When Jane deletes a draft post, the space it occupied is marked for reuse by future writes. The file itself usually does not shrink on disk until an operation such as compact is run.
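One common way to reclaim space is a free list: deleted blocks are remembered and handed out again before the file grows. The sketch below assumes fixed-size blocks and a hypothetical `BlockFile` class; it illustrates reuse within a file, not WiredTiger's actual block manager.

```python
class BlockFile:
    """Toy data file that reuses freed blocks instead of shrinking."""

    def __init__(self):
        self.blocks = []       # the file's blocks, in order
        self.free = []         # indexes of blocks available for reuse

    def write(self, contents):
        if self.free:
            index = self.free.pop()        # reuse a reclaimed block first
            self.blocks[index] = contents
        else:
            self.blocks.append(contents)   # otherwise grow the file
            index = len(self.blocks) - 1
        return index

    def delete(self, index):
        self.blocks[index] = None          # contents are now obsolete
        self.free.append(index)            # reclaimed for reuse, file size unchanged


f = BlockFile()
draft = f.write("draft post")
f.write("photo:42")
f.delete(draft)
size_after_delete = len(f.blocks)          # still 2: the file did not shrink
reused = f.write("new post")
print(size_after_delete, reused, len(f.blocks))   # 2 0 2
```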

Step 13: Synchronized Operations: Managing Concurrent Reads and Writes

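Data operations in WiredTiger are a synchronized ballet. Document-level concurrency control lets many clients modify different documents in the same collection at once; when two writes target the same document, one proceeds and the other hits a write conflict, which MongoDB retries transparently.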

Example: Even as Jane’s photo garners more comments, other users can read the existing comments without any overlap or data inconsistency.
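Conflict detection with retry can be sketched using a version number per document. Everything here is a hypothetical toy: `Document`, `commit`, and the `interfere` hook (which simulates a rival writer slipping in between read and commit) are made up for illustration.

```python
class WriteConflict(Exception):
    pass


class Document:
    def __init__(self, value):
        self.value = value
        self.version = 0


def commit(doc, read_version, new_value):
    """Apply the write only if nobody changed the document since we read it."""
    if doc.version != read_version:
        raise WriteConflict()
    doc.value = new_value
    doc.version += 1


def add_comment(doc, comment, interfere=None):
    """Retry loop; `interfere` is a hypothetical hook simulating a rival writer."""
    attempts = 0
    while True:
        attempts += 1
        seen_version, seen_value = doc.version, doc.value
        if interfere and attempts == 1:
            interfere(doc)                      # another writer sneaks in
        try:
            commit(doc, seen_version, seen_value + [comment])
            return attempts
        except WriteConflict:
            continue                            # re-read and try again


def rival(doc):
    commit(doc, doc.version, doc.value + ["First!"])


photo_comments = Document([])
attempts = add_comment(photo_comments, "Lovely colors", interfere=rival)
print(attempts, photo_comments.value)   # 2 ['First!', 'Lovely colors']
```

Neither comment is lost: the second writer notices the conflict, re-reads the document, and succeeds on its second attempt.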

Step 14: Mirroring Changes: The Process of Replication Synchronization

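In a world of replicas, data’s song must resonate in harmony across all nodes. Replication is handled by MongoDB’s replication layer rather than by the storage engine itself: the primary records every change in its oplog (a capped collection that WiredTiger stores like any other), and the secondaries copy and apply those entries.

Example: If Jane’s photo gets a new comment, the change is written on the primary and then applied asynchronously on each secondary, usually within moments, maintaining data harmony.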
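The log-shipping idea behind replication can be sketched with an ordered operation log that a follower replays from its last applied position. `Primary`, `Secondary`, and the entry format are hypothetical simplifications; a real replica set also deals with elections, acknowledgements, and network failures.

```python
class Primary:
    def __init__(self):
        self.data = {}
        self.oplog = []                     # ordered list of applied operations

    def set(self, key, value):
        self.data[key] = value
        self.oplog.append({"op": "set", "key": key, "value": value})


class Secondary:
    def __init__(self):
        self.data = {}
        self.applied = 0                    # position reached in the primary's oplog

    def sync_from(self, primary):
        """Pull and apply every oplog entry not yet applied; return how many."""
        pending = primary.oplog[self.applied:]
        for entry in pending:
            self.data[entry["key"]] = entry["value"]
        self.applied += len(pending)
        return len(pending)


primary, secondary = Primary(), Secondary()
primary.set("comments:42", ["Nice!"])
primary.set("comments:42", ["Nice!", "Wow"])
in_sync_before = secondary.data == primary.data   # False: not yet synced
applied = secondary.sync_from(primary)
print(in_sync_before, applied, secondary.data == primary.data)   # False 2 True
```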

Step 15: Striking a Balance: Data Between Memory and Disk

WiredTiger’s genius lies in its balance. It deftly juggles data between ephemeral memory and persistent disk, ensuring optimal performance without exhausting resources.

Example: While Jane’s recent posts are cached in memory for blazing-fast access, older posts might be retrieved from disk when a follower decides to browse her history.
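How much memory the cache gets by default follows a simple rule in the MongoDB documentation: the larger of 50% of (RAM minus 1 GB) and 256 MB. The helper below just evaluates that rule; the function name is made up, and the value can be overridden in practice through the `storage.wiredTiger.engineConfig.cacheSizeGB` setting.

```python
def default_cache_gb(ram_gb):
    """Documented default: the larger of 50% of (RAM - 1 GB) and 256 MB."""
    return max(0.5 * (ram_gb - 1), 0.25)


for ram in (1, 4, 16):
    print(ram, default_cache_gb(ram))
# 1 0.25
# 4 1.5
# 16 7.5
```

The rest of the machine's memory is not wasted: the operating system's filesystem cache can still hold recently used data files.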

Step 16: Enhancing Capabilities: How WiredTiger Complements MongoDB Features

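WiredTiger isn’t isolated; it underpins MongoDB’s higher-level features. Features such as TTL indexes or full-text search are implemented in the server, and each one translates into ordinary reads, writes, and deletes at the storage layer.

Example: Jane sets a story to expire after 24 hours using a TTL index. A background task in MongoDB periodically deletes expired documents, and WiredTiger processes those deletions like any others, so removal can lag slightly behind the expiry time.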
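The expiry sweep can be sketched as a periodic pass that deletes anything older than the cutoff. This is a toy in plain Python with hypothetical names (`ttl_sweep`, `created_at`, the story IDs); in MongoDB itself the equivalent is a TTL index with an `expireAfterSeconds` value on a date field.

```python
from datetime import datetime, timedelta, timezone


def ttl_sweep(stories, expire_after_seconds, now):
    """Delete stories whose created_at is older than the TTL; return removed ids."""
    cutoff = now - timedelta(seconds=expire_after_seconds)
    expired = [sid for sid, doc in stories.items() if doc["created_at"] < cutoff]
    for sid in expired:
        del stories[sid]       # to the storage engine this is an ordinary delete
    return expired


now = datetime(2023, 9, 21, 12, 0, tzinfo=timezone.utc)
stories = {
    "story:1": {"owner": "jane", "created_at": now - timedelta(hours=25)},
    "story:2": {"owner": "jane", "created_at": now - timedelta(hours=2)},
}
removed = ttl_sweep(stories, expire_after_seconds=24 * 3600, now=now)
print(removed, list(stories))   # ['story:1'] ['story:2']
```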

As data continues to drive our digital experiences, understanding the intricate dance of information within systems like MongoDB becomes ever more crucial. This deep dive into the storage engine’s processes and techniques offers a glimpse into the meticulous engineering that powers one of the world’s leading NoSQL databases, ensuring data is always available, consistent, and safe.

Curious about the intricate journey of a database request? Dive into my latest article and unravel the mysteries behind it. (The Journey of a Database Request)

For more information about the WiredTiger engine, visit the MongoDB Docs for a deeper dive into the topic.


Elie Hannouch, Lebanese Technologist & MongoDB Champion, drives tech innovation, mentors upcoming talent, and authors to inspire the digital age.