Over the past 3 months I have completed a rewrite which converted our application backend from a SQL Server to MongoDB. This is quite a shift in design and thinking. In many ways I was forced to rethink my approach to data storage after many years of SQL Server development. In this series, I will explore my thoughts on the subject.
Why the hell doesn’t Mongo have joins?
Joins are a construct that is essentially native to relational databases. Relational databases represent data in 2 dimensional tables (with some exceptions such as XML field support).
On the other hand Mongo is a document data, which supports rich hierarchical data structures. This allows you to embed related data in a single document which can be retrieved in single request. There is no need for a join operation if you are thoughtful about how your document is structured in storage.
How do I get started?
First, spend time understanding your query patterns. Think about how your application accesses data. If you are building a web application, often times your data access patterns mimic the pages in your application. In Mongo terms, this would mean that each page (or entity) is a collection in Mongo. You will then have documents created as instances of the page (or entity).
I made a list of the most common queries in our application, then listed out the full dataset required for each query. I was surprised to find that the number of joins required for common queries was huge. This is definitely an indication that I had over normalized the data structures which was leading to performance bottlenecks.
Almost always embed
Embedding relational data has been our primary means of data access. After understanding the usage patterns, I was able to construct a primary document containing the necessary relational attributes. These is stored at a document save point in the application. Later when the document is retrieved, the necessary relational attributes are available in the document, avoiding joins.
Rarely, we have resorted to application level joins. This is a separate query issued by the application to retrieve a related document. Of course this comes with additional overhead but in rare cases it is unavoidable. I would recommend keeping track of these queries in your application for later review. Think hard about why each application join is necessary and any options you have to avoid them.
The key is to think about optimizing for read operations. There may be exceptions such as application with heavy transactional loads. For most applications, read operations suffer because the SQL backend is normalized. Normalization optimized for writes because you can write a row and have all queries receive that update instantly. But for our application, normalization was the enemy. We simply have don’t need relational data updated immediately. We can rely on eventual consistency across our relationships.
I have learned that the absence of joins actually helps to force me to think about data access in different ways. Since our application is primarily a directory, read operations far out paced write operations. By avoiding expensive joins for each page load, performance is greatly enhanced. Retrieving data is streamlined because we have previously stored a denormalized document containing all necessary data for any given page load.