Don’t be Afraid of MongoDB Cross-shard Queries

Empower Sharding Strategy to Handle Cross-Shard Queries with Confidence

Chunting Wu

Before discussing sharding, let’s first talk about scaling. We all know there are two types of scaling: vertical scaling, also known as scale-up, and horizontal scaling, aka scale-out.

Different scaling approaches serve different purposes. For MongoDB, if we want to improve query performance, vertical scaling means upgrading the machine specs, while horizontal scaling means increasing the number of replicas so that queries can run on idle replicas.

On the other hand, to increase the amount of data stored, vertical scaling still means upgrading the machine (in this case, the size of the disks), while horizontal scaling means sharding, which is the main topic of this article.

Therefore, we have to understand that the purpose of sharding is to distribute data evenly so that MongoDB can store more data, not to improve query performance. In other words, any query performance gain from sharding is a bonus, not the main goal.
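For context, turning sharding on for a collection is only a small piece of configuration. Below is a minimal sketch using pymongo against a mongos router; the connection string, database, collection, and shard key field are placeholder names, not taken from the article:

```python
# A minimal sketch of sharding a collection with pymongo, run against the
# mongos of an existing sharded cluster. All names here are placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # connect to the mongos router

# Enable sharding on the database (implicit on newer MongoDB versions,
# but harmless to run explicitly).
client.admin.command("enableSharding", "shop")

# Shard the collection on the chosen shard key. For a non-empty collection,
# an index supporting the shard key must already exist.
client.admin.command("shardCollection", "shop.orders", key={"customer_id": 1})
```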

Why specifically mention cross-shard queries in the title?

When we consider sharding, we always think carefully about how to choose the shard key, worry that the choice will hurt query performance in production, and assume we will have to drastically rewrite all the queries. These concerns are actually unnecessary.

Let me state the conclusion up front: when migrating from an unsharded deployment to a sharded cluster, as long as the shard key is chosen correctly, things can only get better, not worse, because the unsharded setup is already the WORST case.

Why so sure?

The formula for query time in a database is as follows.

Ttotal = Ts + Tx

  • Ts is the retrieval time of the database itself, whether by index lookup or by scanning the full collection.
  • Tx is the time to return the query result.

In addition, Ts and Tx have the following properties.

  • Ts ∝ total data volume
  • Tx ∝ result size

When we migrate this collection to a sharded cluster, the picture changes a bit.

The formula will also change a little bit, but the principle is the same.

TclusteredTotal = Max( T1s + T1x , T2s + T2x )

Here T1s and T1x are the retrieval and return times on shard 1, and T2s and T2x are those on shard 2; a cross-shard query finishes only when the slower shard finishes.

Even if there is data skew, I believe we can all agree the following conditions hold, since each shard stores only part of the data and returns only part of the result.

  • Tns < Ts
  • Tnx < Tx

Therefore, a conclusion can be deduced.

TclusteredTotal < Ttotal
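To make the inequality concrete, here is a toy calculation of the formulas above with made-up proportionality constants. Real timings depend on indexes, hardware, and network, so treat the numbers as purely illustrative:

```python
# Toy model of the formulas above, with assumed constants:
# Ts grows with the data volume scanned, Tx with the result size returned.
SCAN_COST_PER_DOC = 0.001      # ms per document examined (assumed)
TRANSFER_COST_PER_DOC = 0.01   # ms per document returned (assumed)

def query_time(docs_scanned: int, docs_returned: int) -> float:
    t_s = docs_scanned * SCAN_COST_PER_DOC
    t_x = docs_returned * TRANSFER_COST_PER_DOC
    return t_s + t_x

# Unsharded: one node scans all 10M documents and returns 10k of them.
t_total = query_time(10_000_000, 10_000)

# Sharded across two shards with some skew (a 60/40 split of data and results);
# the cross-shard query finishes when the slower shard finishes.
t_clustered = max(
    query_time(6_000_000, 6_000),
    query_time(4_000_000, 4_000),
)

print(f"T_total     = {t_total:.1f} ms")      # 10100.0 ms
print(f"T_clustered = {t_clustered:.1f} ms")  # 6060.0 ms -> smaller, even with skew
```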

Nevertheless, there are still some worst cases. For example, chunks are balanced by data size rather than by how frequently the data is accessed. If the busy collections originally lived on two separate machines without sharding, each could consume the full resources of its own machine; after sharding, however, most of the hot data may land on the same shard and end up competing for that shard’s resources.

In other words, this problem only arises when the original MongoDB deployment is already near capacity and its resources are almost exhausted.

How to choose a shard key?

From the above introduction, we know that as long as the shard key is chosen correctly, we don’t need to worry about query performance degrading after sharding. Therefore, how to choose the shard key is pretty important.

I have already provided the ideal formula and described the details in my previous article.

Therefore, in this article, I will only outline the three types of bad shard keys that must be avoided.

Low Cardinality

Suppose we choose an enum field as the shard key, and the range is fixed to [1, 2, 3].

Then even if the documents with value 2 keep growing until their chunk hits the size limit, that chunk can never be split or rebalanced, because all of its documents share the same shard key value; and no matter what, the data can be spread across at most three shards.
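As a sketch of what this looks like in practice (the namespace and field names are hypothetical), a common way out is to extend the low-cardinality field into a compound shard key with a high-cardinality field:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")

# Bad: "status" only takes the values 1, 2, and 3, so documents fall into at
# most three distinct key ranges, and a chunk full of status=2 documents can
# never be split because they all share the same shard key value.
# client.admin.command("shardCollection", "shop.tickets", key={"status": 1})

# Better: add a high-cardinality field so chunks with the same status can
# still be split and spread across more shards.
client.admin.command(
    "shardCollection",
    "shop.tickets",
    key={"status": 1, "customer_id": 1},
)
```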

Ascending

If we use a continuously increasing field as the shard key, the chunks can be kept evenly distributed, but every new write lands in the chunk holding the current maximum key, so rebalancing is triggered continuously and the performance will be horrible.

Furthermore, the data we query most frequently is usually the newest, which means queries and rebalancing will often hit the same shard at the same time, making the situation even worse.
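A sketch of the problematic declaration (hypothetical names); note that the default ObjectId _id is roughly time-ordered and suffers from the same issue:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")

# Problematic: created_at only ever grows, so every insert falls into the
# chunk that owns the current maximum of the key range. That chunk lives on
# a single shard, which absorbs all the write load, repeatedly hits the
# chunk size limit, and keeps shipping freshly written data to other shards.
client.admin.command("shardCollection", "shop.events", key={"created_at": 1})

# Sharding on {"_id": 1} has the same problem, because the default ObjectId
# embeds a timestamp and is therefore roughly increasing as well.
```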

Random or Hash Value

Another common choice is to use a random field (or a hashed key) as the shard key, which also keeps the chunks evenly distributed and does not trigger frequent rebalances, but it increases the cost of each rebalance.

MongoDB’s storage engine, WiredTiger, lays data out on disk in a B-tree, much like MySQL’s InnoDB: documents with nearby keys (in practice, documents written around the same time) end up in the same blocks on disk.

If we use a random shard key, rebalancing triggers random access instead of sequential access.

For instance, when a chunk is full and needs to be migrated, its documents may be scattered across many blocks on disk, so the entire rebalance takes much longer. This is far worse than a migration that only has to read a single contiguous block.
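To see why the locality disappears, here is a toy illustration; md5 is used as a stand-in hash, not MongoDB’s exact hashed-index function:

```python
import hashlib

def toy_hash(value: int) -> int:
    """Stand-in for a hashed shard key: md5 of the value, truncated to 64 bits."""
    digest = hashlib.md5(str(value).encode()).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

# Ten documents inserted back to back sit next to each other on disk, but
# their hashed shard key values are scattered across the whole key space,
# so they belong to entirely different chunks. Conversely, one chunk's worth
# of hashed keys maps back to documents spread all over the collection,
# which is why migrating that chunk causes random reads.
for i in range(10):
    print(f"doc {i}: hashed key = {toy_hash(i)}")
```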

Conclusion

When we encounter a new problem, we often feel hesitant or even scared because it is unfamiliar.

What we need to do is take a deep breath, return to the nature of the problem, analyze its core, and not be confused by its external appearance. Taking cross-shard queries as an example, when we unpack the problem from a general database perspective instead of a MongoDB-specific one, the answer becomes obvious.

Software engineering is already a mature field, and most problems are similar in nature, so don’t panic; think carefully, and most of the answers turn out not to be complicated.


Chunting Wu

Architect at SHOPLINE. Experienced in system design, backend development, and embedded systems. Sponsor me if you like: https://www.buymeacoffee.com/MfGjSk6