Understanding and Managing dhandles in MongoDB

Narendra Pal
helpshift-engineering
7 min read · Sep 20, 2023

Introduction

This article is about a less talked-about MongoDB entity called dhandles, or data handles, and how it can affect MongoDB’s stability and performance.

For context, Helpshift is a customer support conversational platform. Our customers can integrate our SDKs with their applications to interact with their users via tickets.

This story is about problems we faced with MongoDB in 2020; the data shared here is from that time.

We use MongoDB at Helpshift. We had the following average requests per second:

  1. Reads — 3.6k per second
  2. Writes — 280 per second

We had a replica-set MongoDB setup with the following specifications —


+---------------+------------+
| Spec          | Value      |
+---------------+------------+
| Instance type | i3.8xlarge |
| CPUs          | 32         |
| RAM           | 244 GB     |
| Replicas      | 4–6        |
+---------------+------------+

We had more than 2TB of data.

Signs of Trouble

In 2020, we started experiencing problems with MongoDB.
During this time, we started seeing occasional read and write failures.
The failures would occur in a flurry for 15–30 seconds and then stop, leaving our critical business objects, e.g. tickets, in an inconsistent state.
For example, a ticket would get created without any message.

The term support ticket refers to a type of interaction between a customer support team and a customer. Tickets can be opened by customers whenever they have an issue. Customers can chat with the support agent to resolve their problems.
Source — https://www.helpshift.com/glossary/what-is-a-support-ticket/

With some analysis of MongoDB graphs, we could correlate these errors with increasing checkpointing time. We started seeing an uptick in checkpointing time around the same time we began facing the issue.
It was nearing 60 seconds, compared to the normal ~10 seconds. Since WiredTiger takes a checkpoint roughly every 60 seconds by default, a checkpoint that itself takes close to 60 seconds delays the next one, which then has even more accumulated dirty bytes to flush.

Increase in checkpointing time for the MongoDB server over 6 months (from ~10s to ~30s)

Checkpointing in databases refers to the process of periodically saving the current state of the database to disk. It involves writing modified data from memory (RAM) to the database’s permanent storage, typically a hard disk or solid-state drive (SSD).

MongoDB uses the WiredTiger storage engine. During checkpointing, WiredTiger flushes dirty bytes from the cache to disk; while this happens, application read/write requests get queued or, in the worst case, fail.
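
If you want to track this yourself, serverStatus exposes WiredTiger’s checkpoint timings. A minimal mongo shell sketch; the field names below are from the WiredTiger statistics of the MongoDB 3.6/4.x era and may differ in newer versions:

// Checkpoint timings as reported by WiredTiger (names may vary by version).
var txn = db.serverStatus().wiredTiger.transaction;
print("last checkpoint (ms):", txn["transaction checkpoint most recent time (msecs)"]);
print("max checkpoint (ms): ", txn["transaction checkpoint max time (msecs)"]);
print("checkpoint running?: ", txn["transaction checkpoint currently running"]);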

This was a huge problem.

The Hidden Villain: Dhandles Revealed

As a stopgap, we used to do a rolling restart of the MongoDB cluster. This lowered the checkpointing time and improved the overall health of MongoDB, but only temporarily.
The checkpointing time would creep back up after a few months, and we still didn’t know the root cause of the increase.

Time was of the essence because our customers were being impacted, so we opted for a MongoDB consulting session to find the root cause of the issue. In this session, the MongoDB consultant collected several pieces of information from us, such as reads/writes per second, dataset size, and our data model.

During these sessions, we also talked about how we designed databases for our customers. We used to create a dedicated database for every customer, with the respective business-entity collections inside it. This design left us with 2200+ databases, each with around 50 collections. We got to know that this might be the problem: it was an inefficient way to model the data.

WiredTiger maintains an entity called a dhandle, or data handle, per data source for reading from and writing to it. The checkpointing process iterates through all dhandles, so a high number of dhandles can increase the checkpointing time.
We started seeing these problems after moving from MMAPv1 to the WiredTiger storage engine.
We were onto something with this new information.

What are dhandles?

From the WiredTiger documentation,

Dhandle is a generic representation of any named data source. Dhandles are required to access any data source in the system. WiredTiger maintains all dhandles in a global list accessed by sessions. Dhandles are created when accessing tables and other data sources that have not been accessed before.
Source — https://source.wiredtiger.com/11.0.0/arch-dhandle.html

Now the question is: what are these named data sources?
They can be databases, collections or indices.
It means the dhandles count grows with every database, collection and index —

dhandles count ≈ databases + collections + indices

Dhandles count calculation

Let’s calculate dhandles count for our MongoDB instance. Following are the details about our data model —


+-------------+---------------------+
| Spec        | Value               |
+-------------+---------------------+
| Databases   | 2200+               |
| Collections | 50 per database     |
| Indices     | 2.45 per collection |
+-------------+---------------------+

Using the above information, the maximum possible dhandles count is —

2200 (databases) + 2200*50 (collections) + 2.45*2200*50 (indices) = 381,700

The above number is the maximum possible dhandles count. Fortunately, the actual count was ~165,000 in our case: since dhandles are only created when a data source is accessed, the data belonging to our inactive customers, which wasn’t being accessed at all, didn’t contribute.
You can get the actual count via the serverStatus command in the MongoDB shell —

db.serverStatus()['wiredTiger']['data-handle']['connection data handles currently active']
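
For comparison, you can also roughly estimate the maximum possible dhandles count from the catalog itself. A mongo shell sketch; note that walking every database and index on a large deployment is slow and will itself touch collections, so treat it as an estimate:

// Rough upper bound: every database, collection, and index can get its own
// dhandle once it is accessed.
var dbs = 0, colls = 0, idxs = 0;
db.adminCommand({ listDatabases: 1 }).databases.forEach(function (d) {
  dbs++;
  var cur = db.getSiblingDB(d.name);
  cur.getCollectionNames().forEach(function (c) {
    colls++;
    idxs += cur.getCollection(c).getIndexes().length;
  });
});
print("databases:", dbs, "collections:", colls, "indexes:", idxs);
print("max possible dhandles ~", dbs + colls + idxs);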

From our consulting sessions, we learned that the recommended maximum for dhandles is 20,000. We had ~165,000 dhandles against a recommended limit of just 20,000!
Moreover, this count was unbounded because of the dedicated database per customer: with every new customer it would keep increasing, further deteriorating MongoDB’s health.

Solutions

With this new knowledge, we started brainstorming solutions to lower the dhandles count. We planned to implement solutions in two phases —

Mitigation: stop dhandles growth

There was some low-hanging fruit: we had many inactive databases, because we never deleted data belonging to customers that had churned unless the customer requested it, and most of this data belonged to free-trial customers.
We started working on deleting this data. After the deletion and a MongoDB server restart, we saw a significant drop in both the dhandles count and the checkpointing time.

This was just one part of the solution. How could we make sure this didn’t happen again? Since we were creating a dedicated database for every new customer, the dhandles count was unbounded. Hence, we decided to use a single common database for all new customers, which put a hard stop to the ever-increasing dhandles count.

Drop in dhandles count after implementing phase 1

There were some caveats to the above solution —
1. Heterogeneous data model
Since new customers used a common database while old customers kept their dedicated databases, the application code needed a way to figure out where a given customer’s data lived.

Ticket schema - dedicated collection
{
  id: Number,
  title: text
}

Ticket schema - common collection
{
  id: Number,
  body: text,
  customer: text
}

For dedicated collections, one can query with just the ID filter to get the ticket document, but for common collections, the query must include both the customer and the ID. Hence, we had to develop another layer to route data requests to the correct database (a sketch of such a routing helper follows after this list).

2. Schema migration scripts
With new product features, we had to introduce new fields to existing collections. This required running scripts that would update the data for all customers.
Such a script would activate dormant dhandles for inactive customers still on dedicated collections.
To keep this in check, we set up a process for reviewing these scripts with a set of SMEs, who could recommend running the migration for active customers only, or in batches (a sketch follows after this list).
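
For caveat 1, the routing layer boils down to knowing where a customer’s data lives and building the right filter. Below is a minimal sketch using the Node.js MongoDB driver; the database names, the legacyCustomers set, and findTicket are illustrative assumptions, not our actual implementation:

const { MongoClient } = require("mongodb");

// Hypothetical routing helper: old customers keep a dedicated database,
// new customers share a common one.
const legacyCustomers = new Set(["acme", "globex"]); // e.g. loaded from config

async function findTicket(client, customer, ticketId) {
  if (legacyCustomers.has(customer)) {
    // Dedicated database: the ticket id alone is enough.
    return client.db(`customer_${customer}`).collection("tickets")
                 .findOne({ id: ticketId });
  }
  // Common database: every query must also be scoped by customer.
  return client.db("common").collection("tickets")
               .findOne({ id: ticketId, customer: customer });
}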
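
For caveat 2, here is a hedged sketch of what an SME-approved backfill could look like in the mongo shell: restricted to a known list of active customers and run in batches, so dedicated databases of inactive customers (and their dormant dhandles) are never touched. The collection name, the new field, and activeCustomerDbs are hypothetical:

// Hypothetical batched backfill of a new field, active customers only.
var activeCustomerDbs = ["customer_acme", "customer_globex"]; // from our own records
var BATCH = 1000;

activeCustomerDbs.forEach(function (name) {
  var coll = db.getSiblingDB(name).getCollection("tickets");
  while (true) {
    // Pick a batch of documents still missing the new field.
    var ids = coll.find({ priority: { $exists: false } }, { _id: 1 })
                  .limit(BATCH).toArray().map(function (doc) { return doc._id; });
    if (ids.length === 0) break;
    coll.updateMany({ _id: { $in: ids } }, { $set: { priority: "normal" } });
    sleep(500); // throttle between batches
  }
});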

Strategize: Reduce dhandles

To reduce dhandles, we planned to do the following things —

Auto customer data deletion job
This job would run daily and delete inactive customers’ data from all of our data stores. Since we never moved old customers’ data to the common collections, this job would clean up their dedicated collections, and the respective dhandles, automatically (a sketch follows after this list).

Data collation
We wanted to move data stored in the dedicated collections into the common collections. This plan never took off because we decided to move to a sharded setup instead. We will talk about this in our future blogs.
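
For the deletion job above, here is a hedged sketch of the core idea in the mongo shell; isInactive stands in for whatever business logic decides that a customer has churned, and the customer_ naming convention is hypothetical:

// Hypothetical daily cleanup: drop dedicated databases of inactive customers,
// which removes their collections, indexes, and eventually their dhandles.
var inactiveCustomers = ["acme", "globex"]; // e.g. from billing records

function isInactive(dbName) {
  return inactiveCustomers.indexOf(dbName.replace("customer_", "")) !== -1;
}

db.adminCommand({ listDatabases: 1 }).databases
  .map(function (d) { return d.name; })
  .filter(function (name) { return name.indexOf("customer_") === 0 && isInactive(name); })
  .forEach(function (name) {
    print("dropping", name);
    db.getSiblingDB(name).dropDatabase();
  });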

More Tuning

Apart from the above solutions, we also did the following —

  1. Upgrade the MongoDB version from 3.6 to 4.x — We wanted to use multi-document transactions, which are only available from 4.0 onwards.
  2. Reduce the TCP keepalive time from 7200 to 120 seconds — The default TCP keepalive value on Linux systems is too high and can cause connections to be dropped silently (see the snippet below).
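
For the keepalive change, this is the usual sysctl-based way to do it on Linux; a sketch, to be adapted to however your hosts manage kernel parameters:

# Check the current value (seconds).
sysctl net.ipv4.tcp_keepalive_time

# Apply immediately (does not survive a reboot).
sudo sysctl -w net.ipv4.tcp_keepalive_time=120

# Persist across reboots.
echo "net.ipv4.tcp_keepalive_time = 120" | sudo tee -a /etc/sysctl.conf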

Learnings

MongoDB requires restarts — in our experience, MongoDB’s performance kept deteriorating over time, and a restart would suddenly make things better.
Even if you delete data, MongoDB doesn’t release resources such as dhandles until an explicit restart is done. In our case, we had to restart the MongoDB setup after the bulk deletion.
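
A rolling restart of a replica set typically means restarting the secondaries one at a time, waiting for each to report SECONDARY again, and then stepping down the primary and restarting it last. Two mongo shell commands commonly used for this; the step-down timeout below is just an example value:

// Check member states before/after restarting each node.
rs.status().members.forEach(function (m) { print(m.name, m.stateStr); });

// On the primary, hand over leadership before restarting it.
// 120 = seconds this node will refuse to become primary again.
rs.stepDown(120);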

Conclusion

Finally, our MongoDB cluster was stable after implementing the above solutions.

But it’s not the end of the story. One of the recommendations from the MongoDB consulting session was to shard the database once it exceeds 2 TB in size. At present, we are migrating large-scale collections to a horizontally scalable database. We will talk about it in future blogs.

Stay Tuned!

Thanks to Somya Maithani, Vineet Naik, Rubal Jabbal, Adityo Deshmukh and Sameer Patil for the review and suggestions.
