Troubleshooting MongoDB 100% CPU load and slow queries

I guess many of you faced once the following CPU spikes on your MongoDB server:

MongoDB server: CPU load issue

If not — you are the MongoDB guru then or do not work on production applications with real users.

Let’s think what can lead to these spikes. It can happen when somebody (or you) run ‘find’ queries against MongoDB instance and these queries take some long amount of time to finish. We also call it “slow queries”. The slow queries can happen when you do not have proper DB indexes. Indexes support the efficient execution of queries in MongoDB. Without indexes, MongoDB must perform a collection scan, i.e. scan every document in a collection, to select those documents that match the query statement. And it can take a while if you have many documents in your collection. You can read more about MongoDB indexes here.

Okay, you got into a situation when your MongoDB server CPU goes crazy. Let’s review what solutions we have to understand where the problem is and fix it then. Fortunately MongoDB is actually quite easy to troubleshoot for common muggles.

In this article I’m going to cover 3 main things:

  • How to detect the queries that lead to CPU load spikes: solution N1
  • How to detect the queries that lead to CPU load spikes: solution N2
  • How to get rid (fix) of these ‘bad’ queries

Solution #1: db.currentOp()

The db.currentOp() function lists the currently running queries with very detailed information. It also includes the duration they have been running (secs_running).

You can go to MongoDB console, run this function and analise the output, but I recommend to start with a filter (secs_running) to reduce the output, e.g:

db.currentOp({“secs_running”: {$gte: 3}}) 

It will return a list of operations in progress that take more than 3 seconds to perform. If your CPU load is 100% — it’s a good filter to use. You can play with different values, sometimes your requests can take up to 10 secs or even more, so use different values to minimise the output and find that single ‘bad query’.

This is what I got on some period:

db.currentOp({“secs_running”: {$gte: 3}})
{
"inprog" : [
{
    "desc" : "conn8003244",
    "threadId" : "0x5e7e56c0",
    "connectionId" : 8003244,
    "opid" : 891016392,
    "active" : true,
    "secs_running" : 3,
    "microsecs_running" : NumberLong(3575970),
    "op" : "query",
    "ns" : "quickblox_production.custom_data",
    "query" : {
        "$query" : {
            "start" : {
                "$lte" : ISODate("2017-11-15T13:57:55Z")
             },
            "application_id" : 53894,
            "class_name" : "Broadcast"
        },
        "$orderby" : {
            "start" : -1
        }
    },
"planSummary": ...,
    "client" : "10.0.0.87:46290",
    "numYields" : 5601,
    "locks" : ...,
    "waitingForLock" : false,
    "lockStats" : ...
}
]
}

I highlighted the most important keys to analize:

  • active: means the query is ‘in progress’ state
  • secs_running: query’s duration, in seconds
  • ns: a collection name against you perform the query.
  • query: the query body.

So now we know how to find slow queries that can lead to high CPU load.

Solution #2: Profiler

The 2nd solution here is to use Database Profiler. The database profiler collects fine grained data about MongoDB queries longer than the specific threshold — threshold in milliseconds at which the database profiler considers a query slow. The database profiler writes all the data it collects to the system.profile collection so you can analyze them later.

You also can set some bigger value as the threshold to minimise the amount of queries for analysis.

For example, the following command sets the profiling level for the current database to 1 and the slow operation threshold is 1000 milliseconds:

user:PRIMARY> db.setProfilingLevel(1, 1000)
{ “was” : 0, “slowms” : 100, “ok” : 1 }
user:PRIMARY> db.getProfilingStatus()
{ “was” : 1, “slowms” : 1000 }

Database will log operations slower than 1000 milliseconds into system.profile collection.

Now you can query for the data against this collection and analize:

db.system.profile.find().pretty()
// or get 'query' operations only and specified fields
db.system.profile.find( { op: { $eq : 'query' } } , {"millis": 1, "ns": 1, "ts": 1,"query": 1}).sort( { ts : -1 } ).pretty()

The documents in the system.profile collection have the form described in Database Profiler Output page.

Interpret the results and fix an issue

So now you have 2 ways to find your ‘bad/slow queries’ which lead to high CPU load. To fix it, we need to understand the cause.

The most common issue observed in the field is Inefficient execution plan. MongoDB needs proper indexes to efficiently search through the documents. When the database can not use an index to support a query or when the existing indexes are not optimal, MongoDB will need to fetch and examine additional documents. An efficient index for a query is easily identifiable: the number of document returned by a query is equal to the number of document fetched from a collection and is equal to the number of keys in the index used. You can easily identify an inefficient plan by looking in the plan summary or by using the explain(‘executionStats’)

Each of the above 2 ways provide information about the query plan. The db.currentOp() method provides the planSummary field — a string that contains the query plan to help debug slow queries. The Database Profiler method provides even more data: keysExamined, docsExamined, nreturned, execStats. All these fields provide very useful information that contains the execution statistics of the query operation.

The best way from my point of view is to use the explain(‘executionStats’) thing.

For example, we had used the db.currentOp() method and caught the following query:

"query" : {
"$query" : {
"UniqueId" : "a6f338db7ea728e0",
"application_id" : 36530,
"class_name" : "Logs"
}
},

Now we are checking its query plan (we just repeat the query and add ‘explain’ in the end):

db.custom_data.find({"application_id" : 36530,"class_name" : "Logs","UniqueId" : "a6f338db7ea728e0"}).explain('executionStats')

and an output:

{
"queryPlanner" : {
"winningPlan" : ...,
"rejectedPlans" : [...],
},

"executionStats" : {
"executionSuccess" : true,
"nReturned" : 1,
"executionTimeMillis" : 754,
"totalKeysExamined" : 97436,
"totalDocsExamined" : 97436,
"executionStages" : {
"stage" : "FETCH",
"filter" : {
"UniqueId" : {
"$eq" : "a6f338db7ea728e0"
}
},
"nReturned" : 1,
"executionTimeMillisEstimate" : 280,
"works" : 97438,
"advanced" : 1,
"needTime" : 97435,
"needFetch" : 0,
"saveState" : 2283,
"restoreState" : 2283,
"isEOF" : 1,
"invalidates" : 0,
"docsExamined" : 97436,
"alreadyHasObj" : 0,
"inputStage" : {
"stage" : "IXSCAN",
"nReturned" : 97436,
"executionTimeMillisEstimate" : 50,
"works" : 97437,
"advanced" : 97436,
"needTime" : 0,
"needFetch" : 0,
"saveState" : 2283,
"restoreState" : 2283,
"isEOF" : 1,
"invalidates" : 0,
"keyPattern" : {
"application_id" : 1,
"class_name" : 1,
"user_id" : 1,
"created_at" : 1
},
"indexName" : "application_id_1_class_name_1_user_id_1_created_at_1",
"isMultiKey" : false,
"direction" : "forward",
"indexBounds" : {
"application_id" : [
"[36530.0, 36530.0]"
],
"class_name" : [
"[\"Logs\", \"Logs\"]"
],
"user_id" : [
"[MinKey, MaxKey]"
],
"created_at" : [
"[MinKey, MaxKey]"
]
},
"keysExamined" : 97436,
"dupsTested" : 0,
"dupsDropped" : 0,
"seenInvalidated" : 0,
"matchTested" : 0
}
}
}
}

As we can see we have the "stage": "FETCH" which actually slowdowns our query execution — it examined almost 100K(!) docs ("docsExamined": 97436). The docsExamined key means we touch the hard disk, not RAM. And do it too often when search. This is bad. Ideally the docsExamined value must be close to zero.

Now we can clearly say we have a wrong index for the query — our query by application_id+class_name+UniqueId is not fully covered by application_id_1_class_name_1_user_id_1_created_at_1 index. The right way here is to add a new index: application_id_1_class_name_1_UniqueId_1.

Conclusion

Here we know now 2 methods to find your slow queries and hence one of the root cause of CPU load spikes. It’s a good thing to apply this technique to very slow queries first and fix them. Then it’s good to review and tune further other requests that take more or around 100ms and get even better performance. Good luck with your MongoDB!

P.S. Need MongoDB consultancy services? You can ping me at khomenkoigor@gmail.com