Some notes on MongoDB
I’m going to talk about MongoDB, a trending NoSQL solution and the most popular open-source NoSQL database.
NoSQL is the term used for a family of newer datastore technologies whose data models differ from those of traditional relational databases. There are four major types, based on data model:
- Document: MongoDB, RethinkDB, CouchDB, …
- Column: HBase, Cassandra, …
- Key-Value: Redis, Riak, …
- Graph: Neo4j, …
I’m not going to explain each model here; my focus is MongoDB. You can find plenty of articles and books about each model on the Internet.
It’s About Trade-offs
When we talk about NoSQL solutions, it’s always about trade-offs. They are not magic boxes! Relational databases like MySQL and PostgreSQL give developers useful features such as joins, transactions, ACID properties and a rich query language, which make developing applications a lot easier. NoSQL databases, on the other hand, take some of these features away and offer others in return, like easy administration, automatic sharding, automatic failover and automatic replication. Some of them were designed with distributed-systems concepts in mind, to fit the scale-out paradigm.
MongoDB is the most popular open-source NoSQL database, and because of that popularity developers are often tempted to use it without evaluating other solutions. But MongoDB may not be the best choice, and that is what I want to discuss here. As I said, NoSQL is all about trade-offs, but does MongoDB offer as much as it takes away?
MongoDB is a document store. This means data is stored as semi-structured documents; in general a document store may use JSON, XML or other formats, and MongoDB uses JSON-like documents (stored in a binary form called BSON). The document model is very good for simple records: if you want to store employee data, for example, each employee fits well in a document. It is also very good for one-to-many relations, like storing an employee’s phone numbers. But when it comes to many-to-many relations, such as employee skills (several people may have the same skill), problems arise. There are many tutorials on how to store many-to-many relations in document-oriented databases. Some document stores, like RethinkDB, support distributed joins (they can join tables across many nodes). MongoDB has no general join support; the $lookup aggregation stage added in v3.2 only does a left outer join against an unsharded collection. Generally there are two solutions:
- Doing joins in the application layer;
- Storing your data as one-to-many relations and dealing with the redundancy.
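To make the two options concrete, here is a minimal sketch in plain Python. The dicts stand in for documents in two collections; the data and the function names (join_in_application, denormalize, rename_skill) are made up for illustration and are not part of any MongoDB API.

```python
# Hypothetical data: plain dicts stand in for documents in two collections.
employees = [
    {"_id": 1, "name": "Ada", "skill_ids": [10, 11]},
    {"_id": 2, "name": "Linus", "skill_ids": [11]},
]
skills = [
    {"_id": 10, "name": "Python"},
    {"_id": 11, "name": "C"},
]


def join_in_application(employees, skills):
    """Option 1: resolve skill references in application code."""
    skills_by_id = {s["_id"]: s for s in skills}  # lookup table, built once
    return [
        {"name": e["name"],
         "skills": [skills_by_id[i]["name"] for i in e["skill_ids"]]}
        for e in employees
    ]


def denormalize(employees, skills):
    """Option 2: embed skill names in each employee document (redundant copies)."""
    skills_by_id = {s["_id"]: s for s in skills}
    return [
        {"_id": e["_id"], "name": e["name"],
         "skills": [skills_by_id[i]["name"] for i in e["skill_ids"]]}
        for e in employees
    ]


def rename_skill(embedded_employees, old, new):
    """The cost of option 2: a rename must touch every document holding a copy."""
    touched = 0
    for e in embedded_employees:
        if old in e["skills"]:
            e["skills"] = [new if s == old else s for s in e["skills"]]
            touched += 1
    return touched
```

With option 1 you pay an extra query and some code on every read. With option 2 reads are a single document fetch, but a change to shared data (here, renaming a skill) has to be applied to every copy.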
Before the recent v3.2 release, MongoDB’s default storage engine was MMAPv1. MMAPv1 is based on memory-mapped files, which are files whose contents the operating system places in memory by way of the mmap() system call. Memory mapping assigns a file to a block of virtual memory with a direct byte-for-byte correlation. Memory-mapped files offer great benefits when dealing with large files, but they are not optimized for real-time systems. Problems arise when the data grows bigger than the available memory: accessing a page that is not in memory causes a page fault, and the operating system starts swapping pages in and out. In the newest release the default storage engine is WiredTiger, which keeps its own internal cache and handles this case better. So don’t forget to change your storage engine if you are using an older version (WiredTiger has been available since v3.0).
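If you have never used memory mapping directly, the following small sketch shows the idea with Python’s standard mmap module. It maps a scratch file of my own invention, not a MongoDB data file, and says nothing about how MMAPv1 lays out its data.

```python
import mmap
import os
import tempfile

# Create a small scratch file to map (hypothetical data, not a MongoDB file).
fd, path = tempfile.mkstemp()
os.write(fd, b"hello mongodb")
os.close(fd)

with open(path, "r+b") as f:
    # Length 0 maps the whole file into this process's virtual address space.
    with mmap.mmap(f.fileno(), 0) as mm:
        first_word = mm[:5]   # plain slicing; the OS loads the page on demand
        mm[0:5] = b"HELLO"    # the write lands in the mapped page in memory
        mm.flush()            # ask the OS to write dirty pages back to the file

with open(path, "rb") as f:
    on_disk = f.read()        # the file now reflects the in-memory change

os.remove(path)
```

The program never calls read() or write() on the mapped region; the operating system decides when pages are loaded and evicted. That is convenient, but it also means the database has little control once the working set no longer fits in memory.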
When you send a write to MongoDB, by default it sends an acknowledgement back to the client as soon as the change has been written to memory. The in-memory data is synced to disk periodically, so there is a small window in which you may lose acknowledged data. You have to configure this interval (or, more generally, the write concern) to match how much loss you can tolerate.
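The window is easier to see in a toy model. The sketch below is not MongoDB’s implementation and uses no MongoDB API; ToyStore, flush, crash and the durable flag are names I made up to mimic "acknowledge from memory, sync later" versus "sync before acknowledging".

```python
class ToyStore:
    """Toy model only: acknowledges writes from memory, persists on flush()."""

    def __init__(self):
        self.memory = {}  # acknowledged writes, visible to readers
        self.disk = {}    # what would survive a crash

    def write(self, key, value, durable=False):
        self.memory[key] = value
        if durable:
            # Stricter setting: persist before acknowledging.
            self.flush()
        return "ack"

    def flush(self):
        # Stands in for the periodic sync to disk.
        self.disk = dict(self.memory)

    def crash(self):
        # After a crash, only persisted data is left.
        self.memory = dict(self.disk)


store = ToyStore()
store.write("a", 1)
store.flush()                    # the periodic sync happens
store.write("b", 2)              # acknowledged, but not yet synced
store.crash()
lost = "b" not in store.memory   # the acknowledged write did not survive

safe = ToyStore()
safe.write("b", 2, durable=True)
safe.crash()
kept = "b" in safe.memory        # the durable write survived
```

The stricter setting closes the window at the price of slower writes, which is the trade-off you are choosing when you tune the write concern.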
MongoDB doesn’t support transactions. It has atomicity only at the document level. In databases like MySQL or PostgreSQL you have full transactions on a single node; you only lose transactions across nodes when you shard (partition) your data.
Before v3.2, MongoDB only partially supported concurrency. It had database-level locks, which locked the whole database during writes (so no concurrent operations on the whole database!), narrowed to collection-level locks in v3.0. With the new storage engine it now supports document-level locking (like row-level locking in an RDBMS).
One of the things NoSQL databases claim is easy administration. Routine admin tasks may be easy to do in MongoDB (they are easy everywhere!), but sharding your data or setting up a highly available cluster without a single point of failure is still a complex manual task. Other NoSQL databases, like RethinkDB, simplify this greatly.
Learning a New Technology
Another thing you have to deal with is learning a new technology. Setting up a server for testing is easy with MongoDB, but if you want to deploy to production you have to go through the docs. Developers also have to learn a new query language, because MongoDB has its own.
I’ve summarized some of the experiences I had in a project using MongoDB. In the end we switched back to MySQL. So before choosing MongoDB, consider other solutions too.
P.S. Thanks to Behdad Keynejad for editing
Originally published at aidirex.github.io on January 4, 2016.