Scaling Up MongoDB: An Intro to Sharding with a Real Dataset

I’ve seen “sharding” enough times in my System Design course, books, and various other resources, and I’ve finally decided to give it a whirl.

Now, I’m not going to bore you with a detailed explanation of what sharding is or why we need it. If you’re curious and want to dive into the nitty-gritty, I’ve got you covered with this comprehensive guide.

I’ve chosen MongoDB for sharding since it provides this feature right out of the box. While most NoSQL databases offer sharding OOTB (Out Of The Box), that doesn’t mean you can’t shard SQL databases — you just have to be willing to relax those foreign-key constraints 😉

Talk is cheap; let’s jump right into it.

Sahrding meme
Sharding

Dataset

Time to binge-watch some data!

For the dataset, I’ve chosen the Netflix dataset available on Kaggle. You’ll need to log in to download it.

Database setup

Docker: Because why have one database container when you can have five?

As we’re essentially adding multiple machines to the cluster for sharding, I’ll be running multiple Docker containers of MongoDB on my local Windows machine to meet my needs (No judgment for not using Linux, okay? 😛).

I’m assuming you already have Docker installed, but if not, please handle that first. You’ll also need Docker Compose to follow this tutorial.

Compose yourself and get Docker up and running!

Pull MongoDB Docker Image

docker pull mongo

Create Docker Network

Create a custom Docker network to allow communication between MongoDB containers:

docker network create mongo-shard-network

To set up a sharded cluster, we’d need 3 components as mentioned in the Mongo documentation:

  1. Config servers: These servers store configuration settings & metadata for the cluster.
  2. Shard: A shard contains a subset of the sharded data.
  3. Mongos: This is essentially a router between your application and the sharded cluster.

Here’s a graphical representation as depicted in the MongoDB documentation:

The image shows a diagrammatic representation of the Sharded Cluster in MongoDB with Config servers, Shard replica sets and Mongo router connecting the App server and the cluster.
Sharded Cluster setup (Image from MongoDB documentation)

#1 Setup config server replica set

Let’s create a Docker Compose file for this. I’m naming it “docker-compose-config.yml”

version: '3.8'
services:
configsvr1:
image: mongo
command: [
"mongod", "--configsvr",
"--replSet", "configReplSet",
"--port", "27019"
]
ports:
- "27019:27019"
networks:
- mongo-shard-network

configsvr2:
image: mongo
command: [
"mongod", "--configsvr",
"--replSet", "configReplSet",
"--port", "27020"
]
ports:
- "27020:27020"
networks:
- mongo-shard-network

configsvr3:
image: mongo
command: [
"mongod", "--configsvr",
"--replSet", "configReplSet",
"--port", "27021"
]
ports:
- "27021:27021"
networks:
- mongo-shard-network

networks:
mongo-shard-network:
external: true

The Config Server Replica Set (CSRS) requires you to have at least 3 members (Read more here).

Start the CSRS:

docker-compose -f docker-compose-config.yml up -d

Initialize Config Server Replica Set:

Enter a container (e.g., configsvr1):

docker exec -it <container_id> bash

Start the Mongo Shell:

mongosh --port 27019

Landmine:
The port should match the container port. You should cross verify this as mentioned in the compose file.

Initiate the Replica set:

rs.initiate({ 
_id: "configReplSet",
configsvr: true,
members: [
{ _id: 0, host: "configsvr1:27019" },
{ _id: 1, host: "configsvr2:27020" },
{ _id: 2, host: "configsvr3:27021" }
]
});

To check the status at any given point:

rs.status();

Your status output should be something like this:

Replica Set status for config server

Exit the Mongo shell and the container!

#2 Start Shards Replica Sets

Let’s create another Docker compose file named “docker-compose-shards.yml”

version: '3.8'
services:
shard1_1:
image: mongo
command: [
"mongod", "--shardsvr",
"--replSet", "shardReplSet1",
"--port", "27018"
]
ports:
- "27018:27018"
networks:
- mongo-shard-network

shard1_2:
image: mongo
command: [
"mongod", "--shardsvr",
"--replSet", "shardReplSet1",
"--port", "27022"
]
ports:
- "27022:27022"
networks:
- mongo-shard-network

shard2_1:
image: mongo
command: [
"mongod", "--shardsvr",
"--replSet", "shardReplSet2",
"--port", "27023"
]
ports:
- "27023:27023"
networks:
- mongo-shard-network

shard2_2:
image: mongo
command: [
"mongod", "--shardsvr",
"--replSet", "shardReplSet2",
"--port", "27024"
]
ports:
- "27024:27024"
networks:
- mongo-shard-network

networks:
mongo-shard-network:
external: true

Start the shards:

docker-compose -f docker-compose-shards.yml up -d

Initialize the Shard Replica Sets

Enter a container (e.g., “shard1_1”) as we did before and open a bash terminal. Start the Mongo shell to port “27018” (based on which shard you entering).

Initiate the replica set:

rs.initiate({
_id: "shardReplSet1",
members: [
{ _id: 0, host: "shard1_1:27018" },
{ _id: 1, host: "shard1_2:27022" }
]
});

Repeat this for “shard2_1” with hosts “shard2_1:27023”, and “shard2_2:27024” and exit Mongo shell and container.

#3 Start a Mongos Router

Let’s create one last docker compose file named “docker-compose-router.yml”:

version: '3.8'
services:
mongos:
image: mongo
command: [
"mongos", "--configdb",
"configReplSet/configsvr1:27019,configsvr2:27020,configsvr3:27021",
"--port", "27017",
"--bind_ip", "0.0.0.0"
]
ports:
- "27017:27017"
networks:
- mongo-shard-network

networks:
mongo-shard-network:
external: true

Landmine:
To avoid connection issues, I recommend stopping the MongoDB server on your host machine (if any). Alternatively, you can run the router on a different port and update its usage accordingly.

Start Mongos router:

docker-compose -f docker-compose-router.yml up -d

Start a Mongo shell and connect to port “27017”. Let’s now add the shards:

sh.addShard("shardReplSet1/shard1_1:27018,shard1_2:27022");
sh.addShard("shardReplSet2/shard2_1:27023,shard2_2:27024");
The image is a screenshot of the windows command prompt. The terminal contains the result of adding the shards to the Mongo router.
Result of adding shards to Mongos server

Whew! That took a while, didn’t it? We’re almost there. Let’s test the connectivity and kick off with a simple application.

Testing it out!

For this, I’ll simply connect my MongoDB Compass to the database as usual. The connection string will remain the same:

mongodb://localhost:27017

Importing the data and Creating a simple Node.js application

Now that we ensured everything is working fine, let’s play with the data and create a simple Node.js application too.

Prepare and Import the Dataset

Since I’ve already connected to MongoDB Compass, I went ahead and created a netflix database with a titles collection and imported the CSV file.

If you’re comfortable with tools like mongoimport, feel free to use them. Otherwise, stick with Compass for a more visual approach!

“Data, assemble!”

Enable Sharding

Once you’ve imported the data (you should see more than 8k documents in your collection at this point), we need to enable sharding on the Netflix database. For this, first, start a Mongo shell to the Mongos router and run:

sh.enableSharding("netflix");

You’d to create an index for “country” on the “titles” collection:

use netflix;
db.titles.createIndex({ country: 1 });

Now let’s shard the collection by “country”:

sh.shardCollection("netflix.titles", { country: 1 });

Creating a Node app

We’re left with the last piece of the puzzle: setting up an application to work with our sharded cluster.

Move to a working folder, initialize npm, and install the express and mongoose packages. Then, let's create an app.js file and define a basic route to read our data:

const express = require("express");
const mongoose = require("mongoose");

const app = express();
app.use(express.json());

// Connect to MongoDB Cluster through Mongos Router
const MONGO_URI = "mongodb://localhost:27017/netflix?readPreference=primaryPreferred";
mongoose
.connect(MONGO_URI, {})
.then(() => console.log("Connected to MongoDB Sharded Cluster"))
.catch((error) => console.error("MongoDB Connection Error:", error.message));

// Define Mongoose Schema and Model based dataset
const titleSchema = new mongoose.Schema({
show_id: String,
title: String,
director: String,
cast: String,
country: String,
date_added: String,
release_year: Number,
rating: String,
duration: String,
listed_in: String,
description: String,
});

const Title = mongoose.model("Title", titleSchema, "titles");

// Define a Read API Endpoint
app.get("/api/titles", async (req, res) => {
const country = req.query.country || ""; // Filter by country if specified
const query = country ? { country: new RegExp(country, "i") } : {};

try {
const titles = await Title.find(query);
res.json(titles);
} catch (error) {
console.error("Error fetching titles:", error.message);
res.status(500).json({ error: "An error occurred while fetching titles" });
}
});

// Start the Express Application
const PORT = 3000;
app.listen(PORT, () => {
console.log(`Server running on port ${PORT}`);
});

Console of the Node app upon starting the server.

Test the API

Let’s fire up a new terminal and run a cURL!

curl http://localhost:3000/api/titles?country=india

Voila! 🤞😛

I’m leaving the results to you. Play around with the API, and add an endpoint to write/update documents. Go crazy!
(All code is uploaded to this repo. You can use it for reference 🥂)

Verifying the shards

I almost forgot! Let’s verify our shards. Some basic checks you could do are:

  • Mongos: Run sh.status() in the Mongo shell and check your “databases” section. Example result:
Shard status command result
  • Another method is to check the config db. We are interested in the chunks and collections collections. You should probably see something like this:
config.chunks collection

And,

config.collections collection

Conclusion

Insane, right? By this, we come to the end of this tutorial. I know it’s a little long, but wasn’t it worth it? (Rhetorical question 😛)

To solidify the concepts, read more about sharding and practice on MongoDB Atlas as well. Load preset data into your cluster and have fun. Also, don’t forget to check out “Aggregations on Sharded Clusters” mentioned in the MongoDB documentation.

Ciao!

--

--