Document Databases: The new SQL

Mahesh Gattani
4 min readApr 23, 2018

--

If you have been in the technology sector in the last decade, you must have seen a surge in usage of non relational databases. There are many reasons for this move and it has definitely lead to surge in available technologies in the area. Arguably, the most versatile among them are Document Databases.

Document databases, as the name suggests, are used for storing documents. This in comparison with traditional SQL-like databases, leaves the schema free-formed and mostly, open to change. Key-Value databases usually store a singular value per key and allow for quick look up. Document databases bring the best (well, partly) of SQL-like and Key-Value databases together in a single tool. Here documents are usually keyed off a unique element and value can be any collection of additional items. Having the additional items separately allows for extra querying capabilities which are very useful for complex production systems.

There are quite a few Document Databases out there but here are some of the most prevalent ones.

  1. MongoDB
  2. Amazon DynamoDB
  3. Azure CosmosDB
  4. Google Cloud Datastore

MongoDB is the most prevalent open source Document Database. The other three mentioned above are services provided by the top three cloud providers respectively. The feature set provided by all are startlingly similar. Something which stands out, especially comparing with SQL-like databases, is essentially infinite scale. The caveat here is that you must have designed your data model well. Let’s discuss the basic ideas which enables such scaling and how it’s something essential to learn for every distributed systems engineer.

Partitioning

The idea of partitioning data isn’t new. Even in older days, I remember partitioning my hard disk to prevent fragmentation. The idea applies to our discussion in a similar but very profound way. If we can partition data being written to our database well, we can store a lot of data, presumably on numerous disks. This obviously would help with the read throughput as well. The way we do it varies based on which Document Database we choose.

MongoDB: Partitioning is done implicitly by the software. As a partition (shard) becomes too large, it’s broken down into two and spread around. This is done as a background process so the application can keep using the Database without caring about how data is stored. Caveat: under continuous high load, this re-partitioning might always be playing catch up, leading to sub-optimal response times.

Cloud Document Databases: Partitioning is done by the user explicitly. This by definition means you have some intuition about how your data is distributed so that you can choose the most appropriate partition key. The most appropriate partition key will distribute the data uniformly at the same time keeping relevant data in the same partition to allow for fast query times. Caveat: Unevenly distributed data can cause you to pay more for the service. The price point usually increases exponentially with larger partition sizes.

Indexing

Seasoned engineers would remember the continuous loop of trying to find the right indexes to help with query times in SQL-like databases. One significant difference between Key-Value Databases and Document Databases is how the data is actually stored. The Value in Key-Value Databases are stored as blobs, which makes it extremely hard to allow for any special ordering between keys. This makes the use case for Key-Value Databases restricted to Key lookups. Document Databases on the other hand store individual properties of the documents in a more readable manner. This allows systems to create indexes on those properties, which in turn allows for faster look up times. Indexes also allow for faster complex queries in some databases.

Indexing is also done differently in different Document Databases.

MongoDB: MongoDB requires user to define the indexing policies and only the unique key is indexed by default. This assumes that the user knows the read pattern before hand. This is not a big deal though since indexes can be added later too.

Cloud Document Databases: Different cloud services deal with this differently. Some leave it to the user to add index properties and some aggressively index everything and leave it to the user to modify indexing policy. This especially signifies the importance of knowing the tools of trade when using one of these technologies.

Adding indexes, much like SQL-like counter parts, requires extra space to store them and extra CPU cycles to create them. Having many indexed properties can significantly slow down writes in some cases. Paying attention to the read vs write throughput is very important in those cases.

Querying

One might think that with all the differences between Document Databases and traditional SQL-like databases, you will have to re-learn how to query data in this new technology. But surprisingly, that’s not the case usually. Most Cloud based Document Databases provide a SQL-like query language over their tools. The query language is usually a subset of SQL for good reason. This reduces the learning curve for new developers and makes the data available for reading purpose to even non-developers. MongoDB suggests querying using it’s javascript API, which is very powerful too.

But I like SQL. Is it dead?

Definitely not. Document databases are a tool much like SQL-like databases are. Each of them serve a use case and it won’t be ideal to try to think of one these tools as one-size-fits-all. The relational nature of most SQL-like databases allow for data to be stored and queried in a fashion that closely resembles how you would yourself think of data. That is not possible in Document Databases. But in return you get many things starting with the easy scalability. A lot of businesses still runs on SQL-like databases because of the relational properties it provides and I don’t see that changing anytime soon. But I can definitely say Document Databases have found their own place in the land of distributed systems.

Summary

Document Databases are on the rise. It’s tempting to use a database with a lot of scalability and fewer restrictions. It does come with restrictions though and they are important to understand. It’s an essential tool to know for the modern system designers.

--

--