How a lonesome bachelor grows to a large loving family (… of servers)

As most server side developers around, when selecting a database server, I used the well known and basically default choice — MySQL.
This is exactly what was done at the small startup I worked for — eDealya, just like small companies, we went with MySQL.

Then, a year and a half later, we started to develop a new micro-service, this time with mongoDB. For no special reason, just to try a new technology, just to get a feel of NoSQL, and just to add a fresh buzzword to our CVs.

How we got started

I was tasked with setting it all up. Not knowing where to start, I looked for a quick guide, found an interactive tutorial on the mongoDB website that was really cool and easy, I completed it and even won a mongoDB Cup (Yey!)

After playing a little with an installed mongoDB server on my laptop, I immediately noticed a major advantage over relational SQL. mongoDB is much more flexible and does not require changes to the software infrastructure each time you want to add a field to an object (something that happens over and over in a new micro-service).

So the decision was made. We went with mongoDB!

When our mongoDB was young and single

To set up a new mongoDB server, we used an Amazon EC2 instance (which we used for all our services). It was a simple t2.small instance.

After setting up a clean simple linux machine, the installation was quick and easy, a mongoDB instance was up and running in minutes and we could now connect to our service.

The development of the DAO layer was surprisingly easy and intuitive, the Java mongoDB Driver was very similar to the mongoDB shell.

Basically there was no need to touch that layer again since there is no need to make changes when adding a new field, changing the schema or building a prepared statements. Very soon I was thinking about relational SQL as an old dinosaur.

Getting traffic

Once everything was up and running, we opened the service for public use and within a few days started receiving some traffic. As you might guess, traffic was very low at the beginning so we couldn’t really tell how stable mongoDB was, but there were no problems at all and everything was stable with great response times.

As the service became more popular and traffic increased, response times grew. We needed a way to scale without losing data.

Fortunately, our service was only receiving data from clients and responding with ‘200’. So it was pretty easy to move traffic away from the database (temporarily). Our read traffic consisted of cron jobs to generate reports for the company and for our users, our solution for this was to delay generating these reports till after the migration (with almost no damage).

The solution for writing was to use Amazon SQS, a Simple Queue based service to receive data from the web server and hold it until consumed. When the traffic moved away from the database to the queue, we switched to a stronger server (actually turning off the older weaker one). mongoDB was up and running again, so we started SQS consumers to persist the missing data from the queue to the database. Eventually, we redirected the traffic back to the database. We could just leave it to work with the queue, but there was no need for that. Of course we wanted this same mechanism to be available in the future, so we left the code with the ability to reroute the traffic and control its consumers.

As we had prayed for response times dropped and the new server held up for about a month.


After that month, we started getting significantly heavier traffic and the server was reaching its limit. We needed to scale up again. This time around it was easier since we could reuse that queue mechanism.

So again we rerouted the traffic to the queue and started the scale. First, we restarted our mongod process to be replica-ready and sharding-ready as instructed on the mongodb website, added 2 new replicas and waited for them to populate. Later, we added 3 more servers to become a new replicated shard, started 3 new mongoDB config servers on an EC2 micro instances and mongos processes on each service machine.

When the new shard was up and running, it was still empty and not connected, to connect the new shard instance to the database, there are just a few simple things to be done (one example would be enabling sharding and pointing the grid to the new shard).

After defining a collection as sharded, mongoDB balancer starts migrating chunks of data until all shards are balanced.

Chunks balancing can take a long time, hours, even days, but not to worry, mongoDB is fully usable during that time and each query will be forwarded to the correct shard since the mongos gateway knows which chunks are already migrated.

Stable again! but not for long

The new servers grid that included 2 shards of 3 replicas, 3 config servers and mongos gateway for each micro-service (total 9 dedicated mongoDB servers), was holding great, it was fast and stable, the heavy cron jobs executed much faster. We were in a database heaven.

But, as expected, after just a few months that was just not enough and again … we needed to scale. If the migration we saw until now was pretty easy, requiring only some minor work, what we needed now to scale up, was almost nothing!

We needed only one thing: bring up a new replica set and add it as a new shard. Again, the mongoDB balancer will start its work and move chunks from the first two shards to the new one to make them balanced and optimized. From this point on as many shards as needed can be added to scale up for writing and as many replicas to each shard as needed, to scale up for reading.

Over 2 years, we scaled up to 5 shards of 3 replicas (18 dedicated mongoDB servers grid in total), with each shard that gave us more and more months of low response times.

Like all families, not everything is perfect

Nothing is perfect in the software world, not even mongoDB. There were problems we encountered while scaling up. For instance, when we went from three to four shards, something really bad happened. While the mongoDB shard balancer was working, one of the shards got disconnected from the VPC and the machines just died. We still don’t know why, maybe someone did it by mistake, maybe an AWS problem, but the fact is that this was bad. During that time, there was no access to the data on the disconnected shard until we managed to get it back up, but this was not our main problem. A while after everything returned to normal, we began to notice that db.someCollection.count() command returned a lower sum of each iterated document in the database.count(). The count command was critical to the reports generated by the cron jobs, so we had to change the scripts to ensure they were not using the count() command (that is getting the number of documents from the collection metadata).

That was strange, and we were afraid that data got lost, but fortunately that did not happen. It turned out the problem was only on the collection metadata, event still, this needed to be fixed.

We tried everything. We even tried to add a new shard and remove the damaged one but that did not help. The count() command always got a lower number (with the same difference). The only thing that probably could have helped, was to create a new grid and copy to it all the data from the old one with a server script, but we never got to that.


MongoDB is the coolest and easiest database I have ever worked with, by far. It is very flexible (anything can be indexed!) and as shown easy to query and scale. Still, I can understand the fear some big companies have of using mongoDB, it being relatively new, lacking the maturity of most relational databases, there is less information on the web about problems and crashes (this is improving), and data remains the most valuable asset software companies have — this allows no risks to be taken. But for smaller companies who are looking for highly scalable database and enjoyable development, my advice would be mongoDB.

Did you like this post? Follow me on Twitter: @DimaGolds

Dima Goltsman, Server Side Engineer @Wix — WixBookings