MongoDB sharding

Pranay Bathini · The Glitcher · Apr 25, 2020

A practical guide

For starters, MongoDB is a document-oriented NoSQL database used for high-volume data storage. Traditional relational databases use tables and rows; MongoDB uses collections and documents. A document consists of key-value pairs and is the basic unit of data in MongoDB.

Without further delay, let's jump into what sharding is, why it is needed, the sharded cluster architecture in MongoDB, and a practical example with Docker.

Sharding

Sharding is a method for distributing data across multiple machines. MongoDB uses sharding to support deployments with very large data sets and high throughput operations.

Why Sharding?

Database systems with large data sets or high throughput applications can challenge the capacity of a single server. For example, high query rates can exhaust the CPU capacity of the server. Working set sizes larger than the system’s RAM stress the I/O capacity of disk drives.

To address system growth, we have two methods:

  1. Vertical Scaling.
  2. Horizontal Scaling.

Vertical Scaling

Vertical scaling means increasing the capacity of a single server by using a more powerful CPU, adding more RAM, or increasing the amount of storage space. However, there are practical limits to vertical scaling, such as hardware constraints and the load a single server can handle.

Horizontal Scaling

Horizontal Scaling involves dividing the system dataset and load over multiple servers, adding additional servers to increase capacity as required. While the overall speed or capacity of a single machine may not be high, each machine handles a subset of the overall workload, potentially providing better efficiency than a single high-speed high-capacity server.

MongoDB supports horizontal scaling through sharding.

Sharded Cluster

Before going further, we need to understand the components of a sharded cluster.

  1. Shard: Each shard contains a subset of the sharded data. Each shard can be deployed as a replica set to provide redundancy and high availability. Together, the cluster’s shards hold the entire data set for the cluster.
  2. Mongos: The mongos acts as a query router, providing an interface between client applications and the sharded cluster.
  3. Config Servers: Config servers store metadata and configuration settings for the cluster. They are also deployed as a replica set.
Figure: Sharded cluster architecture

Shard Keys

MongoDB uses the shard key to distribute the collection’s documents across shards. The shard key consists of a field or fields that exist in every document in the target collection.

You choose the shard key when sharding a collection. The choice of shard key cannot be changed after sharding. A sharded collection can have only one shard key.

To shard a non-empty collection, the collection must have an index that starts with the shard key. For empty collections, MongoDB creates the index if the collection does not already have an appropriate index for the specified shard key. See Shard Key Indexes.
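
As a minimal mongo shell sketch (someDb, someCollection, and someField are the placeholder names used later in this guide), sharding a collection on a ranged key looks like this:

use someDb
// an index that starts with the intended shard key is required
// before a non-empty collection can be sharded
db.someCollection.createIndex({ someField: 1 })
// run via mongos, after sharding has been enabled on the database
sh.shardCollection("someDb.someCollection", { someField: 1 })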

Note: The choice of shard key affects the performance, efficiency, and scalability of a sharded cluster. A cluster with the best possible hardware and infrastructure can be bottlenecked by the choice of the shard key.

Chunks

MongoDB partitions sharded data into chunks. Each chunk has an inclusive lower and exclusive upper range based on the shard key. MongoDB migrates chunks across the shards in the sharded cluster using the sharded cluster balancer. The balancer attempts to achieve an even balance of chunks across all shards in the cluster.

Balancer and Even Chunk Distribution

To achieve an even distribution of chunks across all shards in the cluster, a balancer runs in the background and migrates chunks across the shards.

Advantages of Sharding

  1. Reads/Writes: MongoDB distributes the read and write workload across the shards in the sharded cluster, allowing each shard to process a subset of cluster operations. Both read and write workloads can be scaled horizontally across the cluster by adding more shards.
  2. Storage Capacity: Sharding distributes data across the shards in the cluster, allowing each shard to contain a subset of the total cluster data. As the data set grows, additional shards increase the storage capacity of the cluster.
  3. High Availability: A sharded cluster can continue to perform partial read/write operations even if one or more shards are unavailable. While the subset of data on the unavailable shards cannot be accessed during the downtime, reads or writes directed at the available shards can still succeed.

In production environments, individual shards should be deployed as replica sets, providing increased redundancy and availability.

Sharded and Non-Sharded Collections

A database can have a mixture of sharded and unsharded collections. Sharded collections are partitioned and distributed across the shards in the cluster. Unsharded collections are stored on a primary shard. Each database has its own primary shard.

Figure: Sharded and unsharded collections in a database

Connecting to a Sharded Cluster

You must connect to a mongos router to interact with any collection in the sharded cluster. This includes sharded and unsharded collections. Clients should never connect to a single shard in order to perform read or write operations.

You can connect to a mongos the same way you connect to a mongod, such as via the mongo shell or a MongoDB driver.
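
For example, assuming a mongos instance is reachable at mongos1 on the default port (the hostnames and database name here are placeholders), a mongo shell connection string can list one or more mongos hosts:

mongo "mongodb://mongos1:27017/someDb"
mongo "mongodb://mongos1:27017,mongos2:27017/someDb"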

Sharded Cluster Setup

Now we will build a cluster that consists of two shards (each a three-node replica set), a three-node config server replica set, and two mongos router nodes. In total, we will have eleven Docker containers running for our MongoDB sharded cluster. Of course, we can expand the cluster according to our needs.

Let’s define our first shard replica set.
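
A minimal sketch of what these service definitions could look like in docker-compose.yaml (the mongo:4.2 image tag, port, and ./data volume paths are assumptions; the container names and the mongors1 replica set name match the commands used later):

version: "3"
services:
  mongors1n1:
    container_name: mongors1n1
    image: mongo:4.2
    # --shardsvr marks this mongod as a shard member; --replSet names its replica set
    command: mongod --shardsvr --replSet mongors1 --dbpath /data/db --port 27017
    volumes:
      - ./data/mongors1n1:/data/db
  mongors1n2:
    container_name: mongors1n2
    image: mongo:4.2
    command: mongod --shardsvr --replSet mongors1 --dbpath /data/db --port 27017
    volumes:
      - ./data/mongors1n2:/data/db
  mongors1n3:
    container_name: mongors1n3
    image: mongo:4.2
    command: mongod --shardsvr --replSet mongors1 --dbpath /data/db --port 27017
    volumes:
      - ./data/mongors1n3:/data/db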

As you can see, we defined our shard nodes by running mongod with the --shardsvr parameter, and we mapped the container's default MongoDB data folder (/data/db) to a volume on the host.

Our second shard replica set is defined as:
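
A similar sketch for the second shard, added under the same services: section (the mongors2 replica set name and the mongors2n1 through mongors2n3 container names are assumed, following the naming of the first shard):

  mongors2n1:
    container_name: mongors2n1
    image: mongo:4.2
    command: mongod --shardsvr --replSet mongors2 --dbpath /data/db --port 27017
    volumes:
      - ./data/mongors2n1:/data/db
  mongors2n2:
    container_name: mongors2n2
    image: mongo:4.2
    command: mongod --shardsvr --replSet mongors2 --dbpath /data/db --port 27017
    volumes:
      - ./data/mongors2n2:/data/db
  mongors2n3:
    container_name: mongors2n3
    image: mongo:4.2
    command: mongod --shardsvr --replSet mongors2 --dbpath /data/db --port 27017
    volumes:
      - ./data/mongors2n3:/data/db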

Let's define our config servers with the --configsvr parameter.
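
A sketch of the config server services, again under services: (the image tag, port, and volume paths are assumptions; the mongocfg1 through mongocfg3 container names and the mongors1conf replica set name match the initiation command used later):

  mongocfg1:
    container_name: mongocfg1
    image: mongo:4.2
    # --configsvr marks this mongod as a config server; the port is pinned to 27017
    command: mongod --configsvr --replSet mongors1conf --dbpath /data/db --port 27017
    volumes:
      - ./data/mongocfg1:/data/db
  mongocfg2:
    container_name: mongocfg2
    image: mongo:4.2
    command: mongod --configsvr --replSet mongors1conf --dbpath /data/db --port 27017
    volumes:
      - ./data/mongocfg2:/data/db
  mongocfg3:
    container_name: mongocfg3
    image: mongo:4.2
    command: mongod --configsvr --replSet mongors1conf --dbpath /data/db --port 27017
    volumes:
      - ./data/mongocfg3:/data/db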

Finally, we are going to define our mongos (router) instances:
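
A sketch of the two router services (the image tag and the host port mappings are assumptions; the --configdb value reuses the mongors1conf replica set and the config server names from above):

  mongos1:
    container_name: mongos1
    image: mongo:4.2
    depends_on:
      - mongocfg1
      - mongocfg2
      - mongocfg3
    # mongos needs the config server replica set to serve cluster metadata
    command: mongos --configdb mongors1conf/mongocfg1:27017,mongocfg2:27017,mongocfg3:27017 --bind_ip_all --port 27017
    ports:
      - "27017:27017"
  mongos2:
    container_name: mongos2
    image: mongo:4.2
    depends_on:
      - mongocfg1
      - mongocfg2
      - mongocfg3
    command: mongos --configdb mongors1conf/mongocfg1:27017,mongocfg2:27017,mongocfg3:27017 --bind_ip_all --port 27017
    ports:
      - "27018:27017"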

These mongos instances depend on our config servers. They take the --configdb parameter to locate the config server replica set and obtain cluster metadata and configuration settings.

After building our docker-compose.yaml file, if we compose it up, we will see eleven running Docker containers: two shards (each a three-node replica set), three config servers, and two mongos routers.
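
Assuming the file is named docker-compose.yaml and sits in the current directory, the cluster can be brought up and inspected like this:

docker-compose up -d
docker ps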

Now our sharded cluster needs to be configured. We will run a few commands to wire up the cluster on the relevant nodes.

First, we will configure our config servers replica set:

docker exec -it mongocfg1 bash -c "echo 'rs.initiate({_id: \"mongors1conf\", configsvr: true, members: [{ _id: 0, host: \"mongocfg1\" }, { _id: 1, host: \"mongocfg2\" }, { _id: 2, host: \"mongocfg3\" }]})' | mongo"

After running the above command, we can check the status by running the following command on our first config server.

docker exec -it mongocfg1 bash -c "echo 'rs.status()' | mongo"

Next, we will build our shard replica sets, starting with the first one.

docker exec -it mongors1n1 bash -c "echo 'rs.initiate({_id : \"mongors1\", members: [{ _id : 0, host : \"mongors1n1\" },{ _id : 1, host : \"mongors1n2\" },{ _id : 2, host : \"mongors1n3\" }]})' | mongo"
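
The second shard replica set is initiated the same way; a sketch, assuming the mongors2 naming used in the compose sketch above:

docker exec -it mongors2n1 bash -c "echo 'rs.initiate({_id : \"mongors2\", members: [{ _id : 0, host : \"mongors2n1\" },{ _id : 1, host : \"mongors2n2\" },{ _id : 2, host : \"mongors2n3\" }]})' | mongo"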

Now, the nodes in each shard replica set know about each other. One of them is elected primary and the other two are secondaries. We can check the replica set status by running the following command on the first shard node (and likewise on a node of the second shard):

docker exec -it mongors1n1 bash -c "echo 'rs.status()' | mongo"

The output will list the three members along with their states (one PRIMARY and two SECONDARY).

Finally, we will introduce our shards to the routers:

docker exec -it mongos1 bash -c "echo 'sh.addShard(\"mongors1/mongors1n1\")' | mongo "

Now our routers, which are the cluster's interface to clients, know about our shards. We can check the shard status by running the command below on the first router node:

docker exec -it mongos1 bash -c "echo 'sh.status()' | mongo "

The output will show the config server replica set and the shards that have been added.

Now our sharded cluster configuration is complete. We don't have any databases yet, so we will create one and enable sharding on it. Since clients should always go through a mongos router rather than an individual shard, we run this and the following commands against mongos1. Note that MongoDB creates databases lazily, so someDb only materializes once sharding is enabled on it or data is written to it.

docker exec -it mongos1 bash -c "echo 'use someDb' | mongo"

Next, we should enable sharding on our newly created database.

docker exec -it mongos1 bash -c "echo 'sh.enableSharding(\"someDb\")' | mongo "

Now, we have a sharding-enabled database on our sharded cluster! It’s time to create a collection on our sharded database.

docker exec -it mongos1 bash -c "echo 'db.getSiblingDB(\"someDb\").createCollection(\"someCollection\")' | mongo "

The collection is not sharded yet. We will shard it on a field named someField.

docker exec -it mongos1 bash -c "echo 'sh.shardCollection(\"someDb.someCollection\", {\"someField\" : \"hashed\"})' | mongo "

The shard key must be chosen very carefully because it determines how documents are distributed throughout the cluster.

You can find the docker-compose.yaml and the same setup without Docker here.

Github link: MongoDB sharding

I have stopped writing on Medium and started writing on my blog, www.pranaybathini.com. Please visit it to check out the latest tech content.

References and additional resources:

  1. https://docs.mongodb.com/manual/core/sharding-shard-key/
  2. https://www.mongodb.com/lp/white-paper/usage/ops-best-practices
  3. https://www.mongodb.com/presentations/webinar-sharding-methods-mongodb?jmp=docs
  4. https://dzone.com/articles/composing-a-sharded-mongodb-on-docker
  5. https://www.mongodb.com/presentations/webinar-everything-you-need-know-about-sharding?jmp=docs
  6. https://docs.mongodb.com/manual/sharding/
