What We Learned by Working with MongoDB’s Collection Index

Why a query that should have been fast was slow, and how this led us to learn how to optimize filters

Edoardo Zanon
THRON tech blog
7 min read · Sep 16, 2020


We have been using MongoDB successfully since its early versions and, while you may have come across articles against MongoDB usage, we believe it is a very good and mature product for the tasks it has been designed for.

One of the main challenges of using MongoDB is the management of its resources during the query phase. By resources, we mean the precious CPU, but also memory and disk IOPS. This article goes in-depth on MongoDB indexing and shares what we learned while optimizing our queries.

How can we preserve our resources during the query operation?

MongoDB provides indexes to improve query performance. Without indexes, MongoDB must perform a collection scan (analyzing every document in the collection to select those that match the query statement). If an appropriate index exists for a query, MongoDB uses that index to limit the number of inspected documents.

Indexes use a data structure called a B-tree, which makes searches extremely fast. In addition, to achieve the shortest processing time, MongoDB tries to fit indexes entirely in RAM, so the system can avoid reading the index from disk.

During the query process, the query planner, a component of MongoDB, decides which index best suits the query. When the search happens through the index, it is performed in memory and is very efficient. If your query cannot be resolved entirely by the index, MongoDB (often) has to fetch the documents from disk. This is generally an expensive operation and may result in slow queries.

To define an index, you just need to specify a set of fields, each with a sort order (1 for ascending, -1 for descending). MongoDB takes care of arranging them in the B-tree according to the order you provided: the first field is at the root of the tree and the last fields are at the leaves.
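
For instance, this shell command (collection and field names are just illustrative) defines a compound index whose entries are sorted by “category” ascending and, within each category, by “price” descending:

// Entries are sorted by "category" first, then by "price" descending
// within each category value; "category" sits at the root of the B-tree.
db.products.createIndex({ category: 1, price: -1 })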

How to make an index for a complex query?

Unfortunately, there is no index that works for all queries. If you want to optimize the performance of your database, you must understand how your data is structured and how you want to access it, in order to create the best index for the most critical queries.

When you create an index for a query, this rule of thumb, the Equality, Sort, Range (ESR) order, helps you decide the order of fields in the index (a short example follows the list):

  • The fields against which equality conditions of the queries are run.
  • The fields to be indexed should reflect the Sort order of the query.
  • The fields representing the Range of data to be accessed.
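
For example, given a hypothetical query such as:

db.orders.find({ status: "COMPLETE", qty: { $gte: 10 } }).sort({ created: -1 })

the rule would suggest this index (equality field first, then the sort field, then the range field):

db.orders.createIndex({ status: 1, created: -1, qty: 1 })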

But this rule doesn’t fit all cases.

Your data structure plays a very important role in the index definition

For the purpose of this article, let’s consider the collection ‘audit’ (mimicking our audit collection that records the user operations in our platform).

Each document in the ‘audit’ collection has the following structure:

{
  _id: ObjectId,
  tenant: string,
  type: Array[string],
  status: string,
  removed: boolean,
  silent: boolean,
  targetId: string,
  ...
}

Let’s assume we need to support the export of a portion of the audit, filtered by date range (using _id), user status, and type of action.
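
As an aside, filtering by date through _id works because the first 4 bytes of an ObjectId encode the document’s creation time as a Unix timestamp in seconds. A minimal sketch of deriving a range bound from a date (objectIdFromDate is a hypothetical helper, not a built-in):

// Hypothetical helper: build an ObjectId usable as a range bound on _id.
// The first 8 hex characters carry the timestamp in seconds; the remaining
// 16 are zero-padded, which is fine for $gte/$lte boundaries.
function objectIdFromDate(date) {
  var seconds = Math.floor(date.getTime() / 1000);
  return ObjectId(seconds.toString(16) + "0000000000000000");
}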

db.audit.find({
  "_id": {
    "$gte": ObjectId("5e26775e0000000000000000"),
    "$lte": ObjectId("5e4f55de19383d7303d1d8b5")
  },
  "type": {
    "$in": ["TYPE1", "TYPE2", "TYPE3", "TYPE4", "TYPE5", "TYPE6", "TYPE7", "TYPE8", "TYPE9", "TYPE10", "TYPE11", "TYPE12", "TYPE13", "TYPE14", "TYPE15", "TYPE16"]
  },
  "silent": false,
  "status": { "$in": ["COMPLETE"] },
  "removed": false,
  "tenant": "tenant_1"
}, { "_id": -1 }).limit(50)  // projection: only "_id" is returned

There are several indexes available on this collection; the system created them to serve different queries, and the query planner may take advantage of them. Here is the list of available indexes that the query planner evaluates for our query:

_id_1
tenant_1_status_1_silent_1__id_-1
tenant_1_silent_1_targetId_1__id_-1
tenant_1_silent_1_userId_1__id_-1
...

If we execute the query, we see it takes 50 ms to complete. But no existing index satisfies the rules described above. Why is this query fast? To understand why, we can run the query with .explain(), which shows that it uses the index ‘tenant_1_status_1_silent_1__id_-1’.
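
In shell terms (filter elided, it is the same as above):

db.audit.find({ /* same filter as above */ }, { "_id": -1 })
  .limit(50)
  .explain("executionStats")

The IXSCAN stage of the winning plan reports the chosen index in its "indexName" field.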

Looking just at the execution time might not be enough to validate the query performance.

To check if a query is running efficiently you need to use the explain() command and check the relationship between “nReturned”, which is the number of documents that match the query, and “totalDocsExamined”, which is the number of documents analyzed during the query process. In this case, our result was:

{
  seeks: 14,
  nReturned: 50,
  totalDocsExamined: 178,
  totalKeysExamined: 178
}

It turns out the fast execution time was a stroke of luck: the distribution of the data, in this case, was favorable to us.
The bigger the gap between returned documents and examined documents, the longer the query takes. For good performance, the ratio between nReturned and totalKeysExamined should be as close as possible to 1:1.

In fact, if we want to test the real power of the index, we can just run a count() on the previous query. The execution time is 30.1 seconds. The index used was the same, so why is there such a difference?

To perform the count with this index, MongoDB fetches from disk (or cache) every document referenced by the index entries, because some fields in the query are not included in the index. In addition, not all of the scanned documents match the query. As a consequence, CPU usage sits around 10% while disk IOPS is very high (~1,660 IOPS).
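
For reference, this is how the count can be run and profiled (filter elided again):

// The count itself:
db.audit.find({ /* same filter as above */ }).count()

// And its execution plan:
db.audit.explain("executionStats").count({ /* same filter as above */ })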

What the official docs suggest as the correct index

By following the rule previously described, the correct index for this query would be:

tenant_1_silent_1_removed_1_status_1_type_1__id_1
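
Created, in shell terms, like this:

db.audit.createIndex({ tenant: 1, silent: 1, removed: 1, status: 1, type: 1, _id: 1 })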

But, if we run the count() again, we still see an execution time of 30.1 seconds. By running an explain(), we notice that the query planner ignores the new index we created following the documentation’s advice and keeps using the previous one.

Why?

Analyzing the rejected plans in the explain output, we notice that our index produces a high number of “seeks” (repositionings of the cursor within the index while scanning the B-tree), which negatively impacts the effectiveness of the query. Our interpretation is that applying the date filter to every element of the $in clause causes this high number of seeks, generating an operation that the query planner evaluates as too expensive.

What makes us confident in this hypothesis is that if we change our query, reducing the $in operator on the “type” filter to only five items, the query planner chooses the “right” index and the execution time drops to about 60 ms.
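
You can reproduce this comparison by forcing each candidate index with hint() and reading the “seeks” counter of the IXSCAN stage in the explain output; a sketch:

// Force the ESR-style index instead of letting the planner choose.
db.audit.find({ /* same filter as above */ }, { "_id": -1 })
  .limit(50)
  .hint("tenant_1_silent_1_removed_1_status_1_type_1__id_1")
  .explain("executionStats")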

The following graph explains the structure of the index.

B-tree index path used to find the query result

The correct index for our use case

The previous index is incorrect for our case and data distribution.

To make our query as fast as possible, we changed the index to:

tenant_1_silent_1_removed_1_status_1__id_1_type_1
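
That is, in shell terms:

db.audit.createIndex({ tenant: 1, silent: 1, removed: 1, status: 1, _id: 1, type: 1 })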

In this example, the shape of the data drives the construction of the index.

Our data is spread over 4 years, and the insert rate in this collection is constant over time. Our query filters the data over a range of 1 to 30 days. By swapping the positions of “type” and “_id” in the index, the amount of data analyzed by the query becomes much lower than before.

The following diagram shows the path chosen while exploring the index.

If we run the explain() command now, we can see that:

{
  seeks: 1,
  nReturned: 50,
  totalDocsExamined: 0,
  totalKeysExamined: 50
}

Now “seeks” is 1: the scan of the index is simpler and results in a faster query.

The ratio between “totalKeysExamined” and “nReturned” is now 1:1. In the worst-case scenario, when we filter by only one “type”, “totalKeysExamined” can still be high, but in our case the performance with this index is acceptable (130 ms response time), since this is not a customer-facing query but an administrative one with a fairly low access frequency.

The query would also work reasonably well without the “type” field in the index, because the amount of data analyzed during the query is already reduced (and CPU usage is lower). However, to further improve performance and avoid fetching documents from disk, we included it anyway to achieve “totalDocsExamined: 0”.

Having all the queried fields in the index, and enough RAM to hold it, is also a best practice recommended by MongoDB to reduce disk usage.
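
The “totalDocsExamined: 0” above means the plan is fully covered: our export query projects only “_id” (the second argument of the find), and every filtered field is now part of the index, so MongoDB never needs to touch the documents themselves. A sketch of how to verify it:

db.audit.find({ /* same filter as above */ }, { "_id": 1 })  // project only indexed fields
  .limit(50)
  .explain("executionStats")
// A covered plan has no FETCH stage and reports totalDocsExamined: 0.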

In the following graph, we can see the performance of the two indexes. With the last index, the number of keys examined in the index is still high; this metric shows that the query could work even better with another index, although the usage of cluster resources (CPU and disk IOPS) was reduced. Both tests were executed after clearing the cluster cache (objects already fetched from disk).

  • Left graph: two queries run with the old index
  • Right graph: two queries run with the new index

Can we do better?

Yes, we could. This is a tradeoff between efficiency and cost. We could keep both indexes, but that would require more resources. As an example, our index uses 1.4 GB of memory with the current data; the tradeoff between resource usage and query performance is good as it is now.
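
To keep an eye on this cost, the shell exposes per-index sizes:

// On-disk size of each index on the collection, in bytes:
db.audit.stats().indexSizes

// Total across all indexes:
db.audit.totalIndexSize()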

In the future, we might be forced to add another index, especially if the skew of our data changes.

There is no such thing as a perfect index. Databases are getting smarter and might do a good job of identifying good indexes most of the time, but you have to be ready to dig into how the query planner works, how your data distribution is shaped, and how it plays with your indexes; most of the time, this comes down to a tradeoff between performance and resource usage.
