MongoDB Slow Operations Analyzer 2.0

From a pure visual analysis tool to a potential administration tool

Kay Agahd
idealo Tech Blog
8 min read · Sep 11, 2018


TL;DR If you are just interested in the result of our work, or want a quick visual overview of your slow queries, you might be interested in our open source project, which is hosted on our Github repo. I also recently wrote a practical guide to analyzing slow MongoDB operations by using this tool.
Keep on reading if you want more insight into this project.

In my first blog post I wrote about our way to visually spot and analyze slow MongoDB operations. The resulting web application was developed back in 2013. Some time later it turned out that our web application lacked an overview of the currently configured database profilers. Since each mongod instance and each database has its own database profiler, we ended up with over 400 profilers. Not all of them run at the same time; only a few, 66 to give a number, run constantly. The rest are enabled on demand when we experience decreasing database performance. Also, we didn’t want to feed a new configuration to the application just to enable some database profilers; we wanted to simply click them to activate or deactivate them. This is where the need for an application status page came from.

Application status page

Besides showing the status of the collector, i.e. where and how many slow operations have been collected (read and written) since the application restart, the application status page should also show every registered database profiler in a table. Since profiling works per database, each database to be profiled is represented by one row. Given the potentially high number of rows, the table has to be filterable. Columns should be sortable, and selectable to be hidden or shown in order to improve visibility. The sum of the values, where it makes sense, should be shown at the bottom of the columns. To implement such a flexible data table client-side, we used DataTables, which we were already using with great ease on the analysis page. Here is a mockup of such a table:

There are the following columns:

  • Label of the database system
  • Replica set name
  • Status (primary, secondary, arbiter, recovering, down, etc.)
  • Host server address
  • Database to be profiled
  • Collections to be profiled
  • slowMS threshold
  • Status whether profiling is running or stopped
  • Status whether collecting is running or stopped
  • Timestamp of the last written slow operation
  • Quantity of the last written slow operations within different time periods (last 10 seconds, 1, 10, 30 and 60 minutes, 12 and 24 hours)

By default, the table is sorted by the columns Label, ReplSet and Status, which gives a very good overview of a whole bunch of clusters.

The last columns, which show the quantity of the most recently collected slow operations, let us see instantly which databases may be behaving abnormally. In such a case, we may either analyze those databases, switch off collecting, or raise their slowMS threshold in order to profile fewer slow operations.
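For reference, toggling a profiler and its threshold boils down to MongoDB’s profile command, issued per database. Here is a minimal sketch with the MongoDB Java driver; the connection string and database name are placeholders:

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import org.bson.Document;

public class ProfilerToggle {
    public static void main(String[] args) {
        // Placeholder connection string; point it at the mongod to configure.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            // Level 1 profiles only operations slower than slowms;
            // level 0 switches the profiler off, level 2 profiles everything.
            Document result = client.getDatabase("offerStore") // placeholder database
                    .runCommand(new Document("profile", 1).append("slowms", 3000));
            System.out.println(result.toJson());
        }
    }
}
```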

It is interesting to note that loading the application status page quickly is not trivial, especially if many MongoDB systems with many databases are to be profiled. It requires that all profiling readers fetch their status update (e.g. whether the database profiler is running or stopped) from the database server concurrently. However, since the webserver has a limited number of CPU cores, it won’t be able to handle all threads in parallel within the given max. response timeout. In such cases the user should be informed that not all threads could terminate in time, and that the application status page might hence be incomplete. Adding more CPU cores to the webserver, increasing the max. response timeout, or profiling fewer or quicker-responding MongoDB servers will alleviate or even avoid this issue.
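A simplified sketch of such bounded concurrent status collection, assuming a plain ExecutorService (the real implementation may differ):

```java
import java.util.List;
import java.util.concurrent.*;

public class StatusCollector {
    // Hypothetical status reader; the real application queries each profiler's state.
    interface ProfilerReader extends Callable<String> {}

    static List<Future<String>> fetchStatuses(List<ProfilerReader> readers,
                                              long maxResponseTimeoutMs)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        try {
            // invokeAll cancels tasks that have not finished within the timeout,
            // so the page can still render, flagged as possibly incomplete.
            return pool.invokeAll(readers, maxResponseTimeoutMs, TimeUnit.MILLISECONDS);
        } finally {
            pool.shutdown();
        }
    }
}
```

Futures that were cancelled on timeout can then be rendered as incomplete rows instead of blocking the whole page.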

Action panel

Now that we have a summary of all database profilers, we can use this view to execute more than just commands like switching profiling on or off. The user just needs to tick the database profiler(s) and then choose the command to execute. The user can choose whether the command should run against the corresponding database system (i.e. mongos router) or against the individually selected nodes (i.e. mongod). The difference is that the command will run either against the entry point of the database system (i.e. router or primary) or against all selected nodes, which may be secondaries as well. Currently implemented commands are:

  • list databases and their collections
  • show currently running operations (requires MongoDB v3.2 or newer)
  • show index access statistics of all databases and their collections (requires MongoDB v3.2 or newer)

Implementing new commands is quite easy: one just needs to create a new Java class which implements the interface de.idealo.mongodb.slowops.command.ICommand. The interface has only two methods: one to execute the database command and one to transform the result into a corresponding table structure.
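The exact signatures may differ, but conceptually the contract looks like this hypothetical sketch:

```java
import java.util.List;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

// Hypothetical rendering of the two-method contract; the real interface is
// de.idealo.mongodb.slowops.command.ICommand and its signatures may differ.
public interface ICommand {

    /** Executes the database command and returns its raw result. */
    Document execute(MongoDatabase database);

    /** Transforms the raw result into rows for the result table. */
    List<List<Object>> toTableRows(Document commandResult);
}
```

Each command thus only has to define what to run and how to tabulate the answer; filtering and sorting come for free from the DataTables view.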

The command result is shown on a new page, in a table built with the aforementioned JavaScript library DataTables, which makes the table filterable and sortable by one or more columns at once.

This is especially useful when we want to see instantly whether a currently running operation might block or slow down the database, which is, in our experience, one of the questions most frequently asked by database users, a.k.a. software developers. If you have used MongoDB’s command currentOp until now to achieve the same, you’ll know how cumbersome it is to get a meaningful picture of the current operations. If you have not, have a look at the output example of the currentOp command and ask yourself how you would query this beast.
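For comparison, this is roughly what querying that beast by hand looks like with the MongoDB Java driver (connection string is a placeholder; on MongoDB 3.6+ the $currentOp aggregation stage is an alternative):

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import org.bson.Document;

public class CurrentOpDump {
    public static void main(String[] args) {
        // Placeholder connection string.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            // currentOp must be run against the admin database.
            Document result = client.getDatabase("admin")
                    .runCommand(new Document("currentOp", 1));
            // "inprog" holds one deeply nested document per running operation:
            // exactly the output you would otherwise have to grep through by hand.
            for (Object op : result.getList("inprog", Object.class)) {
                System.out.println(op);
            }
        }
    }
}
```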

Here is a screenshot of the data table showing the result of the currentOp command, filtered by “offerStore.offer”:

The data table makes it much easier to filter, to sort and finally to spot the slow operation entry than MongoDB’s currentOp command does. With our action panel, almost everybody in our company can tick their database, click on the currentOp command, wait until the result page is loaded and finally sort by the column secs running to instantly see the longest running operations. We might consider adding a kill command to avoid copying & pasting the opId and executing MongoDB’s killOp command manually in case we want to kill such long running operations.
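Such a kill command would essentially wrap the following call; a hypothetical sketch, where opId would come from the selected table row:

```java
import com.mongodb.client.MongoClient;
import org.bson.Document;

public class KillOp {
    // Hypothetical helper: kills the operation with the given opId,
    // which the action panel would read from the selected table row.
    static void killOperation(MongoClient client, long opId) {
        client.getDatabase("admin")
              .runCommand(new Document("killOp", 1).append("op", opId));
    }
}
```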

Last but not least, the application status page also simplifies launching an analysis of collected slow operations, because users just need to tick the database and click the analyse action. This opens the analysis page, preset with the database label(s), server address(es), replica set(s), database(s) and a time span of the last 24 hours, with a default grouping by preset fields and a default resolution of one hour. Users may of course adapt these preset values, but they do not initially need to enter them by themselves.

Further benefit

Another benefit of the globally collected slow operations is that we can easily keep an eye on the evolution of slow operations within any time span. We do this by periodically executing an aggregation query against our collector database:
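The query roughly corresponds to the following sketch with the Java driver; the field names ts, label, millis, nret and resLen are approximations of the collector schema:

```java
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Accumulators;
import com.mongodb.client.model.Aggregates;
import com.mongodb.client.model.Filters;
import org.bson.Document;
import java.util.Arrays;
import java.util.Date;

public class SlowOpMetrics {
    // Aggregates the slow operations collected during the last 60 seconds,
    // grouped per database system (label).
    static Iterable<Document> lastMinutePerLabel(MongoCollection<Document> slowOps) {
        Date oneMinuteAgo = new Date(System.currentTimeMillis() - 60_000L);
        return slowOps.aggregate(Arrays.asList(
                Aggregates.match(Filters.gt("ts", oneMinuteAgo)),
                Aggregates.group("$label",
                        Accumulators.sum("count", 1),
                        Accumulators.max("maxMillis", "$millis"),
                        Accumulators.sum("sumMillis", "$millis"),
                        Accumulators.sum("sumNret", "$nret"),
                        Accumulators.sum("sumResLen", "$resLen"))));
    }
}
```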

The query counts, for each database system, the number of slow operations collected within the last 60 seconds, determines the maximum duration, and calculates the sums of all durations, returned document counts and response lengths. These results are sent to our Graphite server.
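Sending the results to Graphite requires nothing more than its plaintext protocol, i.e. writing “path value timestamp” lines to a TCP socket (commonly port 2003); host and metric path below are placeholders:

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class GraphiteSender {
    // Sends one metric using Graphite's plaintext protocol:
    // "<metric.path> <value> <epoch-seconds>\n" over TCP.
    static void send(String host, int port, String path, double value) throws IOException {
        try (Socket socket = new Socket(host, port);
             Writer out = new OutputStreamWriter(socket.getOutputStream(),
                     StandardCharsets.UTF_8)) {
            long epochSeconds = System.currentTimeMillis() / 1000L;
            out.write(path + " " + value + " " + epochSeconds + "\n");
        }
    }
}
```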

The aggregation query takes only 100 to 200 ms to execute because it uses the index {ts:-1, label:1}. Since only documents inserted within the last minute have to match, only the very beginning of the index needs to be traversed, which makes the query fast although the collector database already contains more than 250 million documents!
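For completeness, creating that index is a one-liner with the Java driver (same hypothetical field names as above):

```java
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Indexes;
import org.bson.Document;

class CollectorIndex {
    // {ts: -1, label: 1}: newest documents come first, so the last-minute
    // $match above only traverses the very beginning of the index.
    static void ensure(MongoCollection<Document> slowOps) {
        slowOps.createIndex(Indexes.compoundIndex(
                Indexes.descending("ts"), Indexes.ascending("label")));
    }
}
```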

With these metrics we have built a Grafana dashboard:

Users can select their database system, here for example offerstore-it, through the select box above the diagram. Then, by clicking on one or multiple metrics below the diagram, they can bring up the corresponding graph(s). In this screenshot, for example, the orange graph shows that the maximum duration per query is 1.24 min while the average of all maximum durations is 7.4 sec, hence a strong variation, also visible in the peaks of the diagram. The yellow graph in this example shows the number of slow queries on the right y-axis; it varies between a maximum of 79 and an average of 15 slow operations per minute. The light blue graph sums up the execution times of all collected slow operations per minute. As you can see, the database system in this example is characterized by many heavy peaks which recur almost perfectly regularly. This is because the database is used almost exclusively by backend jobs that run periodically. If the jobs could be distributed more evenly, the overall database performance would be better.

If we zoom out, from the beginning of 2018 until now, we can see that the number and averaged durations of slow operations, apart from a few heavy spikes, have generally not increased. That’s a good sign that nothing is getting out of hand so far.

Even if an averaged execution time of 4.8 seconds may seem very slow, we must not forget that this is the average of only those queries exceeding our defined slowMS threshold of 3 seconds for this database system. It would not make much sense to lower the threshold, because one should first fix the slowest queries before collecting even more, not-so-slow queries to optimize.

The following screenshot shows the end-to-end response times and number of operations monitored throughout the year 2018 by the main user of this database system:

As you can see, the average response time for findAndModify operations is only 11 ms, although some spikes went up to a maximum of 1.3 seconds. However, such heavy spikes were extremely rare for this database client. Leaving aside these 3–4 spikes throughout the year, the average response time for findAndModify is generally below 5 ms. In the background, you see the number of read and write operations, which reached a maximum of 40M ops/minute (about 660K ops/sec) each. The average of read and write counts together is more than 20M ops/minute (over 330K ops/sec). That’s quite a good overall performance for a database cluster consisting of only 3 shards and storing nearly 300 million documents. Our mongodb-slow-operations-profiler helped us to achieve this good performance.

I’m curious to hear your feedback, suggestions or contributions! I hope you liked both posts of this blog series. If you found this article useful, give me some high fives 👏🏻 and share it with your friends so others can find it too. Follow me here on Medium (Kay Agahd) to stay up-to-date with my work. Thanks for reading!
