Beyond the hood of MongoDB
For the last few years, I’ve used MongoDB as the main DB solution, both in my military service in an Israel Defense Forces (IDF) intelligence unit and in my previous start-up company.
This article will cover my personal experiments and insights from those years using this powerful, flexible and impressive service.
Many NoSQL databases have proven themselves to be stable and reasonable solutions to many everyday problems, such as real-time responsiveness, handling large volumes of data, or even storing data that just won’t fit into a relational schema.
Nowadays, it’s clear that NoSQL stores have become an integral part of every startup’s lifecycle. It’s almost unimaginable to meet a company that doesn’t use one or more NoSQL services, either in production or as part of its back-office products.
A common approach that helps developers choose which data storage they should use requires answering these three questions:
- How much data will be stored?
- How often should the data store perform read/write operations?
- How many resources do you have? (manpower, funds etc.)
If you have answered these three questions and concluded that you don’t have as many resources as other companies, nor any advanced knowledge of NoSQL stores, but you know that you’re going to store a large amount of data that you expect to be queried heavily, perhaps with some sophisticated queries as well,
THEN, I would strongly recommend using MongoDB as your data store. Thanks to MongoDB’s Nexus architecture and its partitioning mechanism (a.k.a. “shards”), a MongoDB instance can be scaled easily across commodity hardware.
In addition, modifying a data model dynamically should not have any negative impact on performance or result in any system downtime, giving you the flexibility you’re probably looking for in your first steps using a NoSQL data store.
As a side note, let’s take a brief look at some popular NoSQL store types:
- Key-value stores, usually used as a popular solution for caching large volumes of data, or for fast read/write operations keyed by a single key.
One of the most famous NoSQL data stores implementing this approach is Redis by Redis Labs.
- A derived type of the key-value store is the wide-column store.
This type resembles the relational database model, with one main difference: the names and formats of the columns can vary from row to row within the same table.
Data stores such as Cassandra or HBase have become very popular, due to their fast read/write operations and the way they extend the relational data structure.
- The third type, which we’re going to examine closely, is called document-based stores.
- Last but not least are the graph-based stores, probably the most complicated and expensive NoSQL store types; however, for specific jobs, especially those whose queries revolve around the relationships between entities, graph-based stores are probably your only choice.
MongoDB’s model architecture in a nutshell
Before taking a tour of the model architecture, it is important to know the motivation behind it. According to Eliot Horowitz, MongoDB CTO and co-founder, MongoDB wasn’t built from scratch, but rather as an attempt to improve upon an existing relational DB product.
“…the way I think about MongoDB is that, if you take MySQL, and change the data model from relational to document-based, you get a lot of great features...” - Eliot Horowitz
The great features he mentions include the ability to store embedded data as sub-documents, thereby reducing the number of JOIN operations, which ultimately leads to faster queries.
Moreover, the development process becomes more agile, due to the dynamic schemas feature, and the ability to scale horizontally.
BSON is an extension of the regular JSON format, which includes additional data types such as int, long, date, floating point, Decimal128 and more.
A BSON document can include one or more fields (i.e., “columns”), where each field holds a value of a specific data type, including arrays, binary data, objects, or another sub-document.
A BSON document may look like a JSON-formatted file, but it is actually serialized and stored in a binary form, which reduces disk usage.
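To make the binary layout concrete, here is a minimal sketch (not an official BSON library, which real applications should use) that hand-encodes the document `{ a: 1 }` the way BSON lays it out on disk:

```javascript
// Hand-rolled BSON encoding of a single int32 field, for illustration only.
function encodeInt32Doc(fieldName, value) {
  const nameBytes = Buffer.from(fieldName + '\0', 'utf8'); // cstring field name
  const valueBytes = Buffer.alloc(4);
  valueBytes.writeInt32LE(value);                          // little-endian int32
  // element = type byte (0x10 = int32) + field name + value
  const element = Buffer.concat([Buffer.from([0x10]), nameBytes, valueBytes]);
  // document = int32 total length + elements + trailing 0x00 terminator
  const header = Buffer.alloc(4);
  header.writeInt32LE(4 + element.length + 1);
  return Buffer.concat([header, element, Buffer.from([0x00])]);
}

const doc = encodeInt32Doc('a', 1);
console.log(doc.toString('hex')); // 12 bytes total for { a: 1 }
```

Twelve bytes for a one-field document illustrates why the binary form is compact: the type tag and length prefix replace JSON’s quoting and whitespace.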
When inserting a new BSON document (i.e., a “row”) into a collection (i.e., a “table”), MongoDB automatically adds an important field called “_id”; ORMs such as Spring Data add a second one, “_class”.
The _id field’s purpose is straightforward, so there is no need to explain it; however, I will expand on its generation process later in this article.
The _class field holds the fully qualified name of the ORM entity the document maps back to (for those of you who are familiar with OOP concepts, the _class field must point to a concrete class; interfaces aren’t allowed here due to deserialization issues).
Unlike many other NoSQL stores, MongoDB provides document validation,
removing another responsibility from the developer’s application code.
That is, by letting MongoDB validate documents on insert or update, the developer can enforce which fields are mandatory, which data types are allowed, what ranges values may take, and the overall structure of the data.
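As an illustration, here is a minimal sketch of such a validation rule, expressed with the $jsonSchema operator (available since MongoDB 3.6; the collection and field names are hypothetical):

```javascript
// A validator object that could be passed to db.createCollection in the
// mongo shell; "users", "email" and "age" are illustrative names.
const userValidator = {
  $jsonSchema: {
    bsonType: 'object',
    required: ['email', 'age'],                            // mandatory fields
    properties: {
      email: { bsonType: 'string' },                       // allowed data type
      age: { bsonType: 'int', minimum: 0, maximum: 150 },  // allowed range
    },
  },
};

// In the mongo shell this would be attached to the collection like so:
// db.createCollection('users', { validator: userValidator });
console.log(userValidator.$jsonSchema.required);
```

With this in place, an insert missing `email`, or with an out-of-range `age`, is rejected by the server rather than by application code.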
ObjectId is a unique identifier generated automatically when a new BSON document is inserted into MongoDB.
The ObjectId generation mechanism guarantees that, for all practical purposes, each identifier is different from any identifier generated before it.
MongoDB’s query engine does not rest on its laurels. Rather, “MongoDB automatically optimizes queries to make evaluation as efficient as possible…”. For example, it includes a component called the “query optimizer” that periodically runs alternative query plans and selects the plan with the best response time for each query shape.
So what is the “efficient evaluation” they are talking about? Evaluation normally includes the selection of data based on predicates, and sorting data based on the sort criteria provided. The best results of the empirical test are stored as a cached query plan and are updated periodically.
A subset of query optimizations, called “covered queries”, is characterized by what they return: in MongoDB, a query whose results contain only indexed fields is answered without any reads from the source documents.
“Covered queries” are a mixed blessing, as far as features go.
Even though they reduce response time by returning results directly from the index, they can also surface inconsistent data.
Imagine a multi-threaded application that performs many CRUD operations against the DB, where some of the READ operations are covered queries that return results directly from the index.
In such a case, two threads may use MongoDB at the same time, the first one writing/updating/deleting a document while the second performs a covered-query operation.
You may then run into an inconsistent-data problem because of the index rebuilding mechanism:
while one thread inserted/updated/deleted the document, the index serving the covered query may not yet have been rebuilt with the new data changes.
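The failure mode can be sketched with a toy model (this is not MongoDB code; it just simulates an index copy that lags behind the source documents):

```javascript
// Toy simulation of a covered read served from a lazily rebuilt index.
const collection = new Map();  // the source documents
let index = new Map();         // "covered" reads only ever touch this copy

function rebuildIndex() {      // stands in for the periodic index rebuild
  index = new Map(collection);
}

function coveredRead(key) {    // never reads from `collection`
  return index.get(key);
}

collection.set('user1', { qty: 1 });
rebuildIndex();
collection.set('user1', { qty: 2 });     // writer thread updates the document

const stale = coveredRead('user1').qty;  // still 1: the index lags the write
rebuildIndex();
const fresh = coveredRead('user1').qty;  // 2 once the index catches up
console.log(stale, fresh);
```

The window between the write and the rebuild is exactly where a covered query can return data the source documents no longer contain.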
Embedded document vs. separate collection
“Should I create a new collection for this data, or perhaps embed it as a sub-document in an existing one?”
Many developers have a hard time answering this question.
Here are a few guidelines to help you reach the conclusion that is right for you. As a rule of thumb, it is recommended to duplicate (embed) your data for higher speed, and to reference it for more integrity.
Denormalized data (a.k.a. “embedded”), has its own advantages, such as speed, readability, indexing sub-fields, etc.
It also reduces very costly operations, such as AGGREGATION and JOINs.
However, you have to pay attention to the data you are going to store, and decide how tolerant you are of the possibility that it will become inconsistent.
Another very important guideline to mention is the future-proofing guideline.
If you are planning to query this data in different ways in the future, you may want to consider normalizing it (the problem with denormalized data is that it is limited to the context it is embedded in).
Another way to look at this guideline, is by asking whether you’ll be querying for the information in the given field by itself, or only in the context of the larger document.
The last guideline I want to talk about states that you should not embed fields that have unbounded growth.
You may embed 100 or 1,000,000 sub-documents, but do so up front. Given how MongoDB stores data (the WiredTiger engine), it would be fairly inefficient to be constantly appending information to the end of an array.
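The two models side by side might look like this (a hypothetical blog-post example; field names are illustrative):

```javascript
// Denormalized: comments live inside the post document. One read fetches
// everything, but the array grows with every new comment.
const embeddedPost = {
  _id: 'post1',
  title: 'Beyond the hood of MongoDB',
  comments: [
    { author: 'dana', text: 'Great read!' },
    { author: 'omri', text: '+1' },
  ],
};

// Normalized: each comment is its own document referencing the post.
// A second query (or $lookup) is needed, but growth is unbounded-safe
// and comments can be queried on their own, outside the post's context.
const referencedComments = [
  { _id: 'c1', postId: 'post1', author: 'dana', text: 'Great read!' },
  { _id: 'c2', postId: 'post1', author: 'omri', text: '+1' },
];

console.log(embeddedPost.comments.length, referencedComments.length);
```

If comments were capped at a handful, embedding wins; if they can grow without bound or need standalone queries, referencing is the safer shape.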
MongoDB provides a wide spectrum of indexing options, including unique, compound and array indexes, as well as some more specialized options such as TTL, geospatial, partial, sparse, and text-search indexes.
Unique indexes: When an index is declared unique, MongoDB rejects inserts of new documents that carry an existing value for the field on which the unique index has been created.
Compound indexes: This kind of index should be used for queries that specify multiple predicates. An additional benefit of compound indexes is that any leading field within the given index can be used.
Array indexes: For fields that contain an array, each array value is stored as a separate index entry.
Partial indexes: By specifying a filtering expression — a condition established during the index creation — a user can instruct MongoDB to include only documents that meet the desired condition.
Sparse indexes: This kind of index contains entries only for documents that contain the specified field.
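As a sketch, here is how each of these index types could be created in the mongo shell (collection and field names are hypothetical, and the commands assume a live mongod):

```javascript
db.users.createIndex({ email: 1 }, { unique: true });   // unique: duplicate emails rejected
db.orders.createIndex({ userId: 1, createdAt: -1 });    // compound: queries on userId alone can use it too
db.posts.createIndex({ tags: 1 });                      // array (multikey): one entry per array value
db.orders.createIndex(                                  // partial: only "open" orders are indexed
  { createdAt: 1 },
  { partialFilterExpression: { status: 'open' } }
);
db.users.createIndex({ nickname: 1 }, { sparse: true }); // sparse: skips docs without the field
```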
My own experience using MongoDB (v. 3.4) was absolutely incredible!
Its comprehensive indexing options helped me to get the exact impressive response time I was looking for. Moreover, combining it with a rich querying language made MongoDB a worthy candidate as compared to other NoSQL services that were on the table.
One of the main reasons I was trying MongoDB in the first place was the fact that it is a document-based data store, which gave me almost an absolute freedom to implement my data model so that I could reuse it later on.
Another advanced feature MongoDB provides is the way it stores data within binary files, which I configured according to my servers’ limitations and restrictions.
I worked with MongoDB using RoboMongo to increase my velocity, working in both functional and test environments, and I strongly recommend using such tools (I didn’t have the chance to work with MongoDB’s official Compass interface, but as far as I know it is a good fit for both beginners and advanced users, as it includes features for monitoring and index rebuilding).
There is another very helpful feature which I didn’t mention in this article, called the oplog collection — I strongly encourage you to read more about it in other great articles across the web.