Studies in mongoDB schema design — Pt 1

Published in

MongoDB Performance Tuning

6 min readFeb 15, 2018

Anyone who has spent anytime working with MongoDB knows that designing your document structure — your schema is critical to application performance. Those who came from the relational database world soon realise that the old rules no longer apply — instead of modelling to “third normal form”, we model our collections more towards the application object model and the queries that we need to optimize.

There’s a lot of patterns that can be applied to MongoDB schema design — I’d recommend in particular The little Mongo DB Schema Design book and MongoDB Applied Design Patterns. There are also some very good presentations available from MongoDB.

I’m not going to attempt to cover all or most of the possible design patterns, but I did want to compare two extremes of design from a query performance point of view. The two extremes are:

Embedding everything in a single document.
Linking collections using pointers to data in other collections. This is roughly equivalent to using a relational database’s third normal form model.

There’s a lot of room for compromise between the two approaches and a lot of non-performance related reasons for choosing one over the other (atomic updates and the 16M document limit for instance). Nevertheless, let’s look at how the two extremes compare from a performance point of view — at least for a specific workload.

The Schema

I’m using the classic product-customer-orders-lineitem schema for this example. In the first case, each of these entities was represented by a separate document. In a relational database, we’d diagram this design like this:

In mongoDB we create four collections which look like this:

In the embedded case we put absolutely everything into a single document. The result looks like this:

For my example, I had 1,000 customers, 1,000 products, 51,116 orders and 891,551 lineitems. I had the following indexes defined:

Let’s take a look at some typical operations that we might execute against these schemas.

Getting all data for a customer

It’s very simple to get all the data for a customer when all the information is embedded in a single document. We can get all the data from the embedded version with a query like this:

With an index on email, it's almost instantaneous - 1,000 lookups took only 706ms.

Life is much harder with the fully normalized version. We need to use an aggregation or custom code to achieve the same result, and we need to be sure we have indexes on the $lookup join conditions. Here's the aggregation:

Not surprisingly, the aggregation/join takes way longer than the embedded situation: 1,000 lookups took 6,543 microseconds.

Get all open orders

In a typical order processing scenario, we want to retrieve all the orders that are in an incomplete state. In our example, this is represented by orderStatus=0.

Of course, we can get customers with open Orders like this:

That does give us all customers with at least one open order, but if we only want to retrieve orders that are open, we are going to need to use the aggregation framework:

What is with the duplicate $match statements? The first $match gets us customers with open orders, the second $match gets us the orders themselves. We don't need the first to get the right results, but it does improve performance since it can exploit an index - after we $unwindthe data we can no longer use an index.

It’s far easier to get the orders in the linked data model.

Not surprisingly, the simpler linked query gets the better performance: 60ms vs 5,397ms for the embedded aggregate query.

Top products

Most companies wants to identify best selling products. For the embedded model, we need to unwind the lineitems and aggregate by product name:

In the linked model we also need to use aggregate, with $lookup joins between lineitems and products to get the product names:

Despite having to perform a join, the linked data model performs best. We only have to join after we get the top 10 products, while in the embedded case we have to scan all of the data in the collection. The linked case took 4,288 as compared to 9,369ms for the embedded case.

Inserting new orders

In this case, I looked at inserting a new order for an existing customer. In the embedded case this is simply done by using a $push operation into the customer document:

In the linked data model we have to insert into the lineitems collection and the orders collection:

You might think that the single update would easily outperform the multiple inserts required by the linked model. But actually, the update is a quite expensive operation — especially if there’s not enough spare space in the collection to fit the new data. The linked inserts — though more numerous — are simpler operations. As a result the linked model still outperformed the embedded model in this example: 5,722ms vs 9,272ms for 500 order inserts.

Updating product prices

What if we want to update the price of a product? In the embedded case the product prices are embedded into the lineitems itself. We can do this in a single operation in mongoDB 3.6 using the arrayFilters operator:

Of course, in the linked model we can use a very simple update to the products collection:

The embedded model requires us to touch many more documents than in the linked model. As a result 10 product code price updates took 5,397ms in the embedded model vs only 12ms for the linked version.

Deleting a customer

If we want to delete all data for a single customer in the embedded model we need to iterate through lineitems, orders and customers collections. The code would look something like this:

Of course, in the embedded case things are a lot easier:

The linked example performs very poorly — 26,393ms to delete 50 customersvs 648 for the embedded case. It might be possible to optimize the linked deletes by deleting all the customers and orders in one hit, but no matter what we do we have over a thousand time more documents to delete than in the embedded case — it’s always going to perform relatively poorly.

Summary and conclusion

Lets look at the relative performance of each of these examples graphically:

What we see is that while the embedded model is pretty good at fetching all the data for a single customer or for deleting a customer, it’s not superior to the linked alternative in other situations.

The answer to the question “What is the best data model for my application” is — and always has been — “it depends”.

In the old relational world we could do an almost mechanical mapping to third normal form as a starting point and then tweak that to fit the application workload. In the case of a document database, we tend to start with the most embedded model possible and then pull things apart to reduce redundancy and improve performance. In this study we can see that the linked and embedded extremes each tend to favour very different types of queries. Which model works best for you will depend on which sort of operations are optimisation targets for your application.

Don’t forget in mongoDB you can implement hybrid models that embed a subset of data while linking to the full set. The classic example is a blog system that embeds the first page of comments while linking to subsequent pages. You don’t have to choose one approach for all your collections.

Many things have changed in the world of databases since I wrote my first book on database performance 20 years ago. But one of the fundamentals of tuning has not changed — the only way to be certain that a data model is going to perform is to test it. Having a ‘schema-free’ database does not mean you can get away with a schema-free application, and it is still imperative to get the data model right if you want to achieve optimum application performance.