Picking the right database for the job

Divino Neto
Inloco Tech Blog · Apr 3, 2019

Our decision-making process to choose a database that would fit our requirements.

The year was 2018. In Loco Media, our mobile ad network and In Loco's main product, was alive and thriving. At that time we were very comfortable using MongoDB for most of our solutions, and it was doing well enough, so we kept using it more and more.

However, as with all startups, growth brought some problems, and some parts of our services were not performing so well.

To assess and remedy that, during the second quarter of that year a team called Ads Tech Scale was put together. The focus of the team was to identify, evaluate, and fix tech debts that would hinder our services in terms of scale and reliability, both then and in the foreseeable future.

The First of Many!

Here’s our scenario: our mobile SDK is installed in many apps and is responsible for requesting ads from our backend by making an HTTP request containing a bunch of information about the user context, including the user’s unique id. When an ad request arrives in our delivery system, one of our features requires us to match the user id against a list of predefined target groups, which are nothing more than large sets of users with similar characteristics.

To achieve that, we indexed these target groups in two MongoDB instances running in a master/slave configuration.
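To make that concrete, here is a minimal sketch of the data model and read path, assuming a pymongo client; the connection string, database, collection, and field names are hypothetical:

```python
from pymongo import MongoClient

# Hypothetical connection string, database, and collection names.
client = MongoClient("mongodb://mongo-primary:27017")
users = client["ads"]["users"]

# One document per user, listing the ids of the target groups it belongs to:
# {"_id": "<user-id>", "target_groups": ["group-a", "group-b", ...]}

def target_groups_for(user_id):
    """Read path: 'all the target groups for user X' is a single lookup by _id."""
    doc = users.find_one({"_id": user_id}, {"target_groups": 1})
    return doc.get("target_groups", []) if doc else []
```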

The problem started when we needed to add or delete one of these groups. To be honest, they were more volatile than I’m making them sound. Most of the time, these target groups were created for a single ad, based on requirements from the customer. So, at any given moment, we had on average 200 of these groups indexed, and they were deleted and created regularly. We executed anywhere between five and ten of these operations a day, and each one meant adding or removing the group id in the documents of every user in the group. That means 1,000,000+ edits in our not-so-resilient MongoDB happening at the same time.
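To give an idea of why those edits hurt, here is a hedged sketch of what a single group edit amounted to, reusing the hypothetical `users` collection from the sketch above:

```python
def add_group(user_ids, group_id, batch_size=10_000):
    """Adds the group id to the document of every user in the group, in batches."""
    for i in range(0, len(user_ids), batch_size):
        users.update_many(
            {"_id": {"$in": user_ids[i:i + batch_size]}},
            {"$addToSet": {"target_groups": group_id}},
        )

def remove_group(group_id):
    """Strips the group id from every document that still references it."""
    users.update_many(
        {"target_groups": group_id},
        {"$pull": {"target_groups": group_id}},
    )
```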

This design became unsustainable fast and was causing nightmares for SREs and other developers. The DB server’s CPU would reach 100% in a fraction of a second and no operation would succeed, which increased the time to edit a target group and made us fail to deliver in a reasonable time. We mitigated this with a circuit breaker, avoiding queries to the DB when it wasn’t responding as fast as we needed and falling back to an empty list instead. However, it was clear that we needed a new solution to support our current scale, which has been growing ever since.
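The circuit breaker itself is a simple pattern. The sketch below is not our production code and the thresholds are made up; it only illustrates the idea of skipping the database and answering with an empty list while it is unhealthy, reusing `target_groups_for` from the first sketch:

```python
import time

class CircuitBreaker:
    """After repeated failures, skip the DB entirely for a cool-down period."""

    def __init__(self, max_failures=5, reset_after_s=30.0):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                return fallback  # circuit open: don't touch the DB at all
            self.opened_at, self.failures = None, 0  # half-open: try again
        try:
            result = fn()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback

breaker = CircuitBreaker()
groups = breaker.call(lambda: target_groups_for("user-123"), fallback=[])
```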

Time for a change…


Before we go deeper into the design process and the solutions proposed, these are the requirements we had to fulfill:

  1. Support a large number of read requests in the format: “I want all the target groups for user X”.
  2. Support a few huge (1,000,000+ operations) write batches in the format: “Add the target group Y to user X”.
  3. Be simple to maintain and reason about.
  4. Increase the cost of maintaining the feature by at most 50%.

So here is the process we went through. In essence, we tried a lot of different databases to get to a definitive answer. Each database was validated against the limitations that concerned us the most, so we didn’t have a single predefined test for all the architectures we tried. Also, since this service sits in the programmatic media delivery flow, we were looking for very low latencies.

#1 — Scaling up our existing MongoDB


If you have ever tried to scale MongoDB up to a sharded cluster, with partitioning and all the fancy stuff, you probably noticed how complicated it quickly gets (and, for the record, we never had a dedicated team of MongoDB specialists). Besides the expected extra nodes, you also need to maintain a new component, the config servers, which keep track of the cluster state and make sure all nodes can share information correctly.
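For reference, this is roughly what the sharded path would have required, sketched with pymongo against a mongos router; the database, collection, and shard key are hypothetical, and the extra shard nodes plus the config servers sit behind the router:

```python
from pymongo import MongoClient

# Sharding commands must be sent to a mongos router, not to an individual shard.
mongos = MongoClient("mongodb://mongos-router:27017")
mongos.admin.command("enableSharding", "ads")
mongos.admin.command("shardCollection", "ads.users", key={"_id": "hashed"})
```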

Given that, and the vast number of multi-master alternatives available, we decided not to go deeper down this path.

#2 — Using a Cassandra cluster


At that time, Cassandra was another trendy piece of tech in our company.

This was the first solution we tried to implement. We set up a cluster with three nodes, with data keyed by user id, deployed a couple of instances of our delivery system to query it, and started simulating the cluster updates we wanted to support in real time. As we already expected, this solution worked like a charm. In terms of latency and throughput, it was the best solution we would find in our tests.
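The data model was roughly the one sketched below, assuming the Python cassandra-driver; the keyspace, table, and column names are hypothetical, with the partition key on the user id so each read stays on a single partition:

```python
from cassandra.cluster import Cluster

cluster = Cluster(["cassandra-1", "cassandra-2", "cassandra-3"])
session = cluster.connect("ads")  # assumes the 'ads' keyspace already exists

session.execute("""
    CREATE TABLE IF NOT EXISTS target_groups_by_user (
        user_id  text,
        group_id text,
        PRIMARY KEY (user_id, group_id)
    )
""")

read_stmt = session.prepare(
    "SELECT group_id FROM target_groups_by_user WHERE user_id = ?"
)
write_stmt = session.prepare(
    "INSERT INTO target_groups_by_user (user_id, group_id) VALUES (?, ?)"
)

session.execute(write_stmt, ["user-123", "group-a"])
groups = [row.group_id for row in session.execute(read_stmt, ["user-123"])]
```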

BUT… there is no free lunch. If you have ever tried to maintain a Cassandra cluster, you know how temperamental it can get. We had a few years of experience with Cassandra by then, and we were starting to realize we didn’t have the expertise or the resources to develop the amount of tooling it requires. If you want to read more about this, you can read this post from one of our colleagues.

Good thing we decided to keep looking after this: a few months later we’d start migrating our Cassandra instances to Scylla. Maybe in the future we’ll try Scylla for this use case.

#3 — Using a CouchDB cluster


Notice that we are not talking about Couchbase here (yet). We are actually talking about CouchDB, its older sibling with a non-binary protocol. Since a senior employee had used CouchDB previously in his career and had great memories of it, we had high expectations for this one.

We did almost the same thing with CouchDB that we did with Cassandra, this time with only two nodes. The first thing we noticed: WOW, the latencies were high. We were shocked at first. A p99 of 5 ms for reads by key? Even MongoDB was performing better than that (when only considering requests that weren’t denied =P).

At first, we thought two nodes might not be enough. Since it’s straightforward to scale CouchDB up and down, we tried adding more nodes. We gave up when we reached six nodes and the latency hadn’t moved at all.

Soon we understood what was happening: CouchDB exposes an HTTP RESTful API, which, compared to a binary protocol, carries a lot of overhead.
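That overhead is easy to see in the read path itself: every lookup by key is a full HTTP round trip. A sketch of what it looks like, with a hypothetical host, database, and document layout:

```python
import requests

COUCHDB = "http://couchdb-1:5984"

def target_groups_for(user_id):
    """Reads the user's document over CouchDB's REST API (one HTTP request per get)."""
    resp = requests.get(f"{COUCHDB}/target_groups/{user_id}")
    if resp.status_code == 404:
        return []
    resp.raise_for_status()
    return resp.json().get("target_groups", [])
```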

We didn’t stop there, though, because the availability this database offered impressed us. Even when we were writing at 10x the parallelism we’d have in production, the two nodes we initially configured kept handling the read load with no increase in latency or drop in throughput. We would have needed a much more expensive Cassandra cluster to achieve this level of availability.

If only a version of this database with a binary protocol existed…

#4 — Using a Couchbase cluster

While searching for ways around CouchDB’s expensive communication protocol, a random Google search result took us to a page in the Couchbase forum. We had never heard of this database before that.

The fact that it used a proprietary binary protocol to deliver the features available in CouchDB was more than enough to convince us to try it.

We replicated the setup from the CouchDB experiment. We soon discovered that Couchbase delivered the same (or very similar) level of availability as CouchDB while also presenting latencies much more suitable for our use case: the p99.9 was less than half of what the CouchDB cluster reported!
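The access pattern we ended up with is sketched below, assuming a recent Couchbase Python SDK (import paths and option classes differ between SDK versions); the cluster address, credentials, bucket name, and key format are hypothetical:

```python
from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions

cluster = Cluster(
    "couchbase://couchbase-1",
    ClusterOptions(PasswordAuthenticator("ads_user", "secret")),
)
collection = cluster.bucket("target-groups").default_collection()

# Write path: one document per user, updated as its target groups change.
collection.upsert("user::123", {"target_groups": ["group-a", "group-b"]})

# Read path: "all the target groups for user X" is a single key-value get
# over Couchbase's binary protocol.
result = collection.get("user::123")
print(result.content_as[dict]["target_groups"])
```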

Was it the end?

Finally, after only four iterations, representing almost two months of entirely dedicated work, we found a solution worth investing our effort in maintaining.

This is not to say that we haven’t had problems with it since. Every day we learn new things about Couchbase that improve our use of it, as it should be with any technology when there are no experts available.

What changed with this process is that we are now using the right tool for the right job: our previous MongoDB setup was designed to work at “web scale”, but programmatic media delivery is a few orders of magnitude beyond that. In fact, we still use MongoDB in some of our user-facing web applications.

Working at a startup, we don’t always have the time to invest in this kind of iterative process to find the best solution to our problems. We have to focus on shipping product features quickly, which doesn’t leave much time to specialize or to handle tasks that won’t impact the product as fast, which explains one of our mottos: “Done is Better than Perfect”. But as In Loco keeps growing, we’ll keep marking very challenging tech debts as resolved, with many more eventually being created.

The problem we were having was in a core component of a product that had been around for years by that time (and is still alive). Before we started tackling it, every single day we would be bothered by at least three alerts telling us the CPU usage of our MongoDB instances was reaching 100%, and they usually required taking some action.

Today, almost a year after this task began, this is one of the most impactful projects I have had the opportunity to work on. It paid for itself pretty quickly, and I expanded my knowledge of database infrastructure a lot.

Are you interested?

If you are interested in building context-aware products through location, check out our opportunities. Also, we’d love to hear from you! Leave a comment and let us know what you would like us to talk about in the upcoming posts.
