Practical Guide To Analyzing Slow MongoDB Queries

How to quickly identify and mitigate performance bottlenecks of MongoDB operations

Kay Agahd
idealo Tech Blog
8 min read · Mar 9, 2021


In this blog post, I want to share how we use the open-source mongodb-slow-operations-profiler to quickly identify the reasons for performance issues and how to mitigate or even avoid them.

Preparation

After you have installed, configured, and started the mongodb-slow-operations-profiler, open its application status page in a browser: http://<YOUR_WEBSERVER>:<YOUR_PORT>/mongodb-slow-operations-profiler/app?adminToken=<YOUR_TOKEN>

In the search box above the table, enter a part or the whole name of the database that you want to analyze, e.g. backend.

the table on the application status page

Each line in the table corresponds to a database running on a mongoD instance. In this example, searching for backend shows all mongoD’s running the database offerListServiceBackend, which I want to analyze.
Tick the checkbox in the column header to select all lines.

Before you can analyze slow queries, enable profiling by setting a threshold, e.g. 100 milliseconds. This means that all operations taking longer than 100 ms will be written into a capped collection of the database being profiled.
Keep in mind that a low threshold may add stress to the whole database system because the threshold applies to the whole mongoD instance (see MongoDB’s documentation for more details). This also means that you can’t set different thresholds for different databases running on the same mongoD instance at the same time.

start profiling with a threshold of 100 ms
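Under the hood, enabling profiling boils down to MongoDB’s documented profile command with a slowms threshold. Here is a minimal Python sketch of the command document a driver would send; the actual driver call is omitted, only the document shape is shown:

```python
def profile_command(slow_ms: int) -> dict:
    """Build the command document that enables level-1 profiling,
    i.e. logging every operation slower than `slow_ms` milliseconds.
    Equivalent to: db.runCommand({ profile: 1, slowms: slow_ms })"""
    return {"profile": 1, "slowms": slow_ms}

# A 100 ms threshold, as used in this example:
cmd = profile_command(100)
```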

To copy the profiled operations from the capped collection into the collector database for later analysis, start collecting:

start collecting

Click the link refresh to update the metrics in the table. The column #SlowOps shows how many slow operations have been collected so far from each database.

After having collected a bunch of slow operations, click the link analyse to open the analysis page.

Analysis

The analysis page opens with default parameters, showing all collected slow operations that occurred within the last hour on the selected databases with a resolution of one minute. You can change these parameters in the form.

Scroll down to the table and click the link Clean just above the table on the right-hand side to show only the most relevant columns. You can show or hide further columns by clicking the corresponding links.

the table on the analysis page

In the column Group, you see the criteria by which the slow operations have been grouped together. In the screenshot above, for example, only the queried fields (see fields) differ between the two slow-operation types shown.
Just below fields, click the example link to see a concrete slow-operations document on a new page:

the filter expression of an example slow-operations document

Such concrete examples help to interpret the syntax of fields because you can see exactly what one of the queries of the clicked slow-operation type looked like.

Close the example page to return to the analysis page.
Now, let’s dive into the analysis by sorting some columns of the table!

SumMs = most expensive operations

Sort descending by column SumMs (click twice on the column header) to see the most expensive operations first. These operations may be worth optimizing first because together they consumed the most operation time, either because they were slow, because they were executed very often, or both.
Continue reading to learn how to optimize them.
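To make the ranking concrete, here is a small Python sketch of how SumMs is derived: collected slow operations are grouped and their durations summed, then groups are sorted by total time. The group names and the 'group'/'millis' field names are illustrative, not the profiler’s actual schema:

```python
from collections import defaultdict

def rank_by_sum_ms(slow_ops):
    """Group slow operations (dicts with 'group' and 'millis') and
    rank the groups by total operation time, most expensive first."""
    totals = defaultdict(int)
    for op in slow_ops:
        totals[op["group"]] += op["millis"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

ops = [
    {"group": "find offers by shopId", "millis": 120},
    {"group": "find offers by shopId", "millis": 450},
    {"group": "update stock", "millis": 900},
]
# "update stock" ranks first (900 ms total) although it ran only once,
# ahead of the more frequent but cheaper query (570 ms total).
```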

MaxMs = slowest operations

Sort descending by column MaxMs to see the slowest operations first. Since they are slow, they may use a lot of resources, which can slow down the database even if they are not executed very often.
A new or better-suited index may speed them up significantly.

SumBytes = heaviest data streamers

Sort descending by column SumBytes to see first the operations that transferred the most data. Those operations can’t really be quick because they need to fetch and return a lot of data. The question is whether they are really needed, or whether such a use case would be better solved by a streaming platform such as Apache Kafka instead of a database.

SumRet / rKeys / rDocs = query performance

Sort ascending by column rKeys to see first the operations that did not use any index; this is the case when rKeys is 0, because no index key was read from the index.
rDocs is the number of documents read from disk. If it’s greater than 0 while no index was used (rKeys is 0), it indicates a collection scan, which is bad.
If rKeys for query and getmore operations is much higher than the number of returned documents (column Sum ret), the database is scanning many index keys to find the result documents. Consider creating or adjusting indexes to improve index efficiency.

Generally speaking, for query and getmore operations: the higher the ratio of rKeys to rDocs, and the closer rKeys is to Sum ret, the better the query performance.

The optimum is when rDocs is 0 and rKeys is greater than 0, because then the index covers the query, meaning that all data requested by the client was found in the index without needing to fetch any document from disk.
However, keep your indexes small enough that they always fit in RAM. You shouldn’t cover a query with an index if that index would be too big to fit in RAM. Also, fields that change very frequently are a poor choice for indexes.
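The heuristics above can be sketched as a small classifier over the profiler columns. This is a simplified illustration in Python; the 10× factor for "many keys scanned per result" is an arbitrary threshold I chose for the example, not a value from the profiler:

```python
def classify_query(r_keys: int, r_docs: int, n_returned: int) -> str:
    """Rough classification of a query/getmore operation based on the
    columns rKeys, rDocs and Sum ret."""
    if r_keys == 0 and r_docs > 0:
        return "collection scan"     # no index key read, documents fetched
    if r_docs == 0 and r_keys > 0:
        return "covered query"       # answered from the index alone
    if n_returned and r_keys > 10 * n_returned:
        return "inefficient index"   # many keys scanned per returned doc
    return "ok"
```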

wDocs = data restructuring effort

Sort descending by column wDocs to see first the operations that wrote a lot of documents. Writing documents also implies updating data structures such as indexes; each write may require rebalancing parts of the B-tree index structure, which can be resource-intensive.
The question is whether you are able to reduce the number of such operations, or whether a database better suited for heavy writes would be a better fit. MongoDB is optimized for read-intensive rather than write-intensive use cases.

memSort = sorting costs

Sort descending by column memSort to see first the operations that could not use an index to sort the result; this is the case when memSort is true. If you spot such operations, you may adapt the index so that the result is already in the right order instead of being sorted in memory.
Here is an old, but still valid and very good, article that covers some edge cases of building such indexes.
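As a sketch of the idea, the following Python snippet checks whether a compound index can serve both an equality filter and a sort, following the equality-before-sort rule of thumb. It is deliberately simplified (it ignores, for instance, that MongoDB can also traverse an index in reverse), and the field names are hypothetical:

```python
def index_avoids_memsort(index, eq_fields, sort_spec):
    """True if the index's leading fields cover the equality filter and
    the following fields match the requested sort order exactly."""
    prefix = [field for field, _direction in index[:len(eq_fields)]]
    if sorted(prefix) != sorted(eq_fields):
        return False               # equality fields must lead the index
    rest = index[len(eq_fields):len(eq_fields) + len(sort_spec)]
    return list(rest) == list(sort_spec)

# Query: filter on shopId, sort by price descending.
# An index (shopId: 1, price: -1) returns results pre-sorted,
# so memSort stays false; (shopId: 1, price: 1) does not.
good = index_avoids_memsort([("shopId", 1), ("price", -1)],
                            ["shopId"], [("price", -1)])
```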

More insights

Go back to the application status page, tick the checkboxes of the lines of the database(s) you are interested in, and execute the following commands against their mongoD instances.

I describe the commands in order of their importance for spotting performance issues, not in the order of the above screenshot.

host info

Click on the link host info. A new page shows a table with some info about the hosts of the selected database(s).
If the same database is located on multiple mongoD’s (e.g. a replica set and/or sharded cluster), check whether the values in all columns are identical across all mongoD’s, because you want identical hardware, identical software versions, and identical configurations on all database servers. If not, you may just have found the reason for sporadic performance issues.

host info of a replica set having a different number of CPU cores
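Such a consistency check is easy to automate. Given one dict of host metrics per mongoD, the following Python sketch returns the columns whose values differ between hosts; the metric names are illustrative, not the tool’s exact column names:

```python
def inconsistent_columns(host_infos):
    """Return the metric names whose values differ between hosts,
    i.e. candidates for the cause of sporadic performance issues."""
    keys = set().union(*host_infos)
    return sorted(key for key in keys
                  if len({host.get(key) for host in host_infos}) > 1)

hosts = [
    {"cpuCores": 16, "memSizeMB": 65536, "version": "4.4.3"},
    {"cpuCores": 8,  "memSizeMB": 65536, "version": "4.4.3"},
]
# Flags the differing CPU core count, as in the screenshot above.
```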

index access stats

Click on the link index access stats. A new page shows a table with the index access statistics of all collections of the selected database(s).
Sort in ascending order by column #access to see first the indexes that have been accessed least often.

If it’s 0, the index has never been accessed since server restart or index creation (see column since). If, on top of that, it is the _id index, i.e. the primary key, the collection has never been used and may be dropped.

If the number of accesses is quite low, the index may no longer be in use.
In that case, the index wastes resources and is a candidate to be dropped, as long as it’s not a TTL index (see column TTL): when time-to-live (TTL) indexes are used by the internal remove job, the access counter is not incremented.
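The drop-candidate rules above can be expressed as a small filter. This Python sketch assumes a simplified stats format ('name', 'accesses', optional 'ttl' flag); the real table has more columns:

```python
def droppable_indexes(index_stats):
    """Flag indexes that were never accessed since server restart or
    index creation, excluding the mandatory _id index and TTL indexes
    (whose internal remove job does not bump the access counter)."""
    return [ix["name"] for ix in index_stats
            if ix["accesses"] == 0
            and ix["name"] != "_id_"
            and not ix.get("ttl", False)]

stats = [
    {"name": "_id_", "accesses": 0},                    # keep: primary key
    {"name": "shopId_1", "accesses": 0},                # drop candidate
    {"name": "expireAt_1", "accesses": 0, "ttl": True}, # keep: TTL index
    {"name": "price_1", "accesses": 123456},            # keep: in use
]
```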

db stats

Click on the link db stats. A new page shows a table with some statistics about the selected database(s). Evaluate the column indexSize to know whether the indexes still fit in RAM.
Also, if the same database is located on multiple mongoD’s (e.g. a replica set and/or sharded cluster), check whether the numbers in all columns are similar across all mongoD’s. If not, the database is not well balanced among the mongoD’s. The reason may be different configurations, unsuited shard keys, or unsharded collections. Unsharded collections are located on only one shard, hence putting load on this one shard instead of balancing it across all shards.
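A quick way to judge "similar" is to compare each shard’s size against the mean. The 20% tolerance in this Python sketch is an arbitrary heuristic of mine, not a value from MongoDB or the profiler:

```python
def is_balanced(sizes, tolerance=0.2):
    """True if every per-shard size (e.g. dataSize per mongoD) deviates
    at most `tolerance` (default 20%) from the mean, a rough heuristic
    for a well-balanced sharded cluster."""
    mean = sum(sizes) / len(sizes)
    return all(abs(size - mean) <= tolerance * mean for size in sizes)
```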

collection stats

Click on the link collection stats. You see on a new page a table with some statistics about all collections of the selected databases.
Here you can sort by columns such as size, count, storageSize, #indexes, totalIndexSize to get an idea which of the collections is the most “expensive” in terms of resource utilization.
You may use the search box to show only collections you’re interested in.

current ops

If you have not yet enabled profiling and collecting and encounter a sporadic performance issue, e.g. requests are slow or even stuck, you can inspect all currently running operations of the selected databases by clicking on the command current ops.
On the newly opened web page, sort either by column microsecs running or secs running in descending order to see the longest-running operations first.
You may use the search box to show only what you’re interested in, in this example the collection productOffer.
The opid can be used to kill an operation with the command db.killOp(opid) in case it got stuck or slows everything else down (see MongoDB’s documentation).
The other columns of the table should be self-explanatory.

excerpt of the current operations filtered by productOffer
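The same triage can be scripted: filter the current operations by runtime and namespace, then review the resulting opids before killing anything. A Python sketch with illustrative field names ('opid', 'secs_running', 'ns'):

```python
def long_running_opids(current_ops, max_secs=60, ns_filter=None):
    """Return opids of operations running longer than `max_secs`,
    optionally restricted to namespaces containing `ns_filter`
    (e.g. the collection name 'productOffer'). Review each opid
    carefully before passing it to db.killOp(opid)."""
    return [op["opid"] for op in current_ops
            if op.get("secs_running", 0) > max_secs
            and (ns_filter is None or ns_filter in op.get("ns", ""))]

ops = [
    {"opid": 111, "secs_running": 5,   "ns": "backend.productOffer"},
    {"opid": 222, "secs_running": 300, "ns": "backend.productOffer"},
    {"opid": 333, "secs_running": 400, "ns": "backend.shop"},
]
# Filtering on "productOffer" keeps only the stuck operation 222.
```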

And that’s it! I hope you got some interesting insights!
If you found this article useful, give me some high fives 👏🏻 and share it with your friends so others can find it too. Follow me here on Medium (Kay Agahd) to stay up-to-date with my work. Thanks for reading!
Btw. idealo is hiring: Check out our vacancies here.
