How we scaled: 300K QPS & 3-5B requests a day

Asif Ali
6 min read · Jun 13, 2015


Building a web-scale system is hard, especially when you have a really small team (fewer than 10 in engineering). But building good software is more about using the right technology for each use case than having hundreds of people on your team.

To showcase what we’ve done, I am sharing our experience building Reduce Data (http://reducedata.com) and how we scaled it from almost nothing to a peak of 3–5B requests a day.

Reduce Data Campaign Reports View

Key decisions

  1. We designed for (large) scale. We knew an ad serving platform would not wait years before hitting scaling pains; traffic growth could be phenomenal right from the start. So we architected the system to scale out from the get-go, both horizontally and vertically.
  2. We chose Availability and Partition tolerance (AP) over Consistency (CP) because our primary need was a low-latency, high-performance ad auction and serving platform. Data consistency wasn’t much of an issue (for example, ads could start serving a few minutes late and no one would care). Learn more about Brewer’s theorem (also known as the CAP theorem) here.
  3. No vendor lock-in / limited use of proprietary tech: Open source software has reached unquestionable maturity across a vast variety of use cases, and to keep costs low we decided to avoid lock-in with proprietary software.
  4. We built the system around mechanical sympathy: the software was written with an understanding of how the hardware works, so that it could leverage it as fully as possible.
  5. Limited use of cloud technology: We decided early on to limit our use of cloud tech because a) EC2 and its counterparts tend to be very expensive compared to bare-metal machines for ad serving use cases, and b) network jitter, disk virtualization, etc. increased latencies on EC2 in our early tests.
  6. Latency exists; cope with it and try to eliminate it. All lookups should happen in under 1 ms. We used RocksDB and a variety of other solutions as primary caches / embedded databases (a minimal lookup sketch follows this list).
  7. We used SSDs where possible, again to reduce latencies.
  8. We did not virtualize hardware and took advantage of large specs (256GB RAM, 24 core machines) to parallelize a lot of the computation.
  9. Disk writes, if any, were timed and flushed every N seconds with chunks of data.
  10. Nginx was tuned to support keep-alive connections and Netty was optimized to support large concurrent load.
  11. Key data was always available to the ad server instantly (in microseconds). All of this data was stored in in-memory libraries / data structures.
  12. The architecture should be shared-nothing. At least the ad servers that interfaced with the external bidders should be, and they should be extremely resilient. We should be able to unplug ad servers and the system should not even blink.
  13. All key data and results need to be replicated.
  14. Keep a copy of raw logs for a few days.
  15. It was okay if the data was a bit stale and the system inconsistent.
  16. Messaging systems must be fault tolerant. They can crash but must not lose data.
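
To make decision 6 concrete, here is a minimal sketch of what an embedded RocksDB lookup might look like inside a bidder process. The store path, key layout and serialization below are hypothetical; the post does not describe them.

```java
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

// Minimal sketch: RocksDB as an in-process, embedded store so that profile
// lookups never leave the bidder. Path and key layout are hypothetical.
public class ProfileStore implements AutoCloseable {
    static { RocksDB.loadLibrary(); }

    private final RocksDB db;

    public ProfileStore(String path) throws RocksDBException {
        Options options = new Options().setCreateIfMissing(true);
        this.db = RocksDB.open(options, path);
    }

    /** Point lookup against the local store; returns null for unknown users. */
    public byte[] getProfile(String userId) throws RocksDBException {
        return db.get(userId.getBytes());
    }

    /** Local write, e.g. for profile updates streamed in over Kafka. */
    public void putProfile(String userId, byte[] serializedProfile) throws RocksDBException {
        db.put(userId.getBytes(), serializedProfile);
    }

    @Override
    public void close() {
        db.close();
    }
}
```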

Current infrastructure

40–50 nodes across 3 data centres (primarily in the US, with two nodes in Germany)

30 of them high-compute (128–256 GB RAM, 24 cores, top-of-the-line CPUs and, where possible, SSDs)

The rest much smaller: 32 GB RAM, quad-core machines

10G private network + 10G public network

Small Cassandra, HBase and Spark clusters.

Our key requirements were:

  1. The system should be able to support one or more bidders that send RTB 2.0 requests over HTTP (REST endpoints).
  2. The system should be able to participate in an auction, responding with a yes or a no and, for a yes, a price and an ad (see the sketch after this list).
  3. The system should be able to process billions of events each day, peaking at several hundred thousand QPS, in order to choose a small subset of users from the large set of users sent to the platform. The larger the pool of users you have visibility into, the better it is for the advertisers.
  4. Data should be processed as soon as possible, at least for key metrics.
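
A rough sketch of the auction decision in requirement 2: answer "no", or "yes" with a price and an ad. All types, names and thresholds here are hypothetical; in OpenRTB 2.x terms a "no" is simply a no-bid response (for example an empty HTTP 204).

```java
import java.util.Optional;

// Hypothetical types; the post does not describe its internal model.
record BidRequest(String auctionId, String userId, double floorPrice) {}
record BidResponse(String auctionId, double priceCpm, String adMarkup) {}

public class Bidder {
    record Campaign(String id, double maxCpm, String adMarkup) {}

    /**
     * Participate in one auction: answer "no" (empty) or "yes" with a price
     * and an ad for that impression.
     */
    public Optional<BidResponse> decide(BidRequest req) {
        Campaign campaign = findEligibleCampaign(req.userId());     // in-memory lookup
        if (campaign == null || campaign.maxCpm() < req.floorPrice()) {
            return Optional.empty();                                 // no bid
        }
        return Optional.of(new BidResponse(req.auctionId(), campaign.maxCpm(), campaign.adMarkup()));
    }

    private Campaign findEligibleCampaign(String userId) {
        // Placeholder for the in-memory campaign matching described later in the post.
        return new Campaign("c-1", 2.50, "<html>...</html>");
    }
}
```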

Key technologies used were:

  1. HBase and Cassandra for counter aggregation and for traditional datasets such as users, accounts, etc. HBase was chosen for its high write performance and its ability to handle counters fairly well, which suits near real-time analytics use cases.
  2. The primary language for the backend was Java. Although I’ve experimented with C++ and Erlang in the past, Java takes the cake as far as availability of skills goes, and the JVM has come into its own over the last few years.
  3. Google Protobuf for data transfers.
  4. Netty was chosen as the primary backend server, thanks to its simplicity and high performance characteristics.
  5. RocksDB was chosen for writes of user profiles as well as reads during ad serving. It is the embedded database within each bidder. User profiles were synced across RocksDB instances using Apache Kafka.
  6. Kafka was used as the primary messaging queue to stream data for processing.
  7. CQEngine was used as the primary in-memory, fast-querying system, while certain data was stored using atomic objects (see the CQEngine sketch after this list).
  8. Nginx was the primary reverse proxy. Much has been said about it, so we’ll leave it at that.
  9. Apache Spark was used for quick, ad hoc data processing and ML jobs.
  10. Jenkins for CI.
  11. Nagios and New Relic for monitoring servers.
  12. Zookeeper for distributed synchronization.
  13. Dozens of third parties for audience segments, etc.
  14. BitTorrent Sync was used to sync key data across nodes and data centres.
  15. A custom-built quota manager, based on a Yahoo white paper, for budget control. See the presentation below for more details.
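
To give a flavour of item 7, here is a minimal CQEngine sketch for matching campaigns in memory. The Campaign class, the country attribute and the query are illustrative only; the real targeting model is far richer.

```java
import com.googlecode.cqengine.ConcurrentIndexedCollection;
import com.googlecode.cqengine.IndexedCollection;
import com.googlecode.cqengine.attribute.SimpleAttribute;
import com.googlecode.cqengine.index.hash.HashIndex;
import com.googlecode.cqengine.query.option.QueryOptions;
import com.googlecode.cqengine.resultset.ResultSet;

import static com.googlecode.cqengine.query.QueryFactory.equal;

public class CampaignIndex {
    // Hypothetical campaign object; the real targeting model is richer.
    public static class Campaign {
        public final String id;
        public final String country;
        public Campaign(String id, String country) { this.id = id; this.country = country; }
    }

    // An attribute CQEngine can index and query on.
    public static final SimpleAttribute<Campaign, String> COUNTRY =
            new SimpleAttribute<Campaign, String>("country") {
                public String getValue(Campaign c, QueryOptions options) { return c.country; }
            };

    private final IndexedCollection<Campaign> campaigns = new ConcurrentIndexedCollection<>();

    public CampaignIndex() {
        campaigns.addIndex(HashIndex.onAttribute(COUNTRY)); // hash index for fast country lookups
    }

    public void add(Campaign c) {
        campaigns.add(c);
    }

    /** Index-backed, in-memory query; no network or disk involved. */
    public ResultSet<Campaign> matching(String country) {
        return campaigns.retrieve(equal(COUNTRY, country));
    }
}
```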

The System Design and Results

The ad server was built as a simple, non-blocking Netty application that evaluates every incoming HTTP request / impression against campaigns using one of many in-memory stores and CQEngine queries. This lookup incurred no network latency, heavy computation, or blocking operations (such as disk writes) and ran entirely in memory. All computation happened within that node, in memory, and mostly in process.
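
For illustration, a minimal sketch of what such a non-blocking Netty handler could look like. The handler name and the evaluateInMemory(...) call are hypothetical, and the rest of the pipeline (HttpServerCodec, HttpObjectAggregator, keep-alive handling) is omitted.

```java
import io.netty.buffer.Unpooled;
import io.netty.channel.ChannelFutureListener;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.SimpleChannelInboundHandler;
import io.netty.handler.codec.http.*;

import java.nio.charset.StandardCharsets;

// Sketch: every bid request is answered from in-memory state, with no blocking
// calls on the event loop. Assumes an HttpServerCodec and HttpObjectAggregator
// sit earlier in the pipeline.
public class BidRequestHandler extends SimpleChannelInboundHandler<FullHttpRequest> {

    @Override
    protected void channelRead0(ChannelHandlerContext ctx, FullHttpRequest request) {
        // Hypothetical in-memory evaluation (CQEngine queries, RocksDB lookups, etc.).
        String body = evaluateInMemory(request);

        FullHttpResponse response = new DefaultFullHttpResponse(
                HttpVersion.HTTP_1_1,
                body == null ? HttpResponseStatus.NO_CONTENT : HttpResponseStatus.OK,
                Unpooled.copiedBuffer(body == null ? "" : body, StandardCharsets.UTF_8));
        response.headers().set(HttpHeaderNames.CONTENT_TYPE, "application/json");
        response.headers().set(HttpHeaderNames.CONTENT_LENGTH, response.content().readableBytes());

        // Keep-alive handling is elided for brevity; nginx fronts these servers.
        ctx.writeAndFlush(response).addListener(ChannelFutureListener.CLOSE);
    }

    private String evaluateInMemory(FullHttpRequest request) {
        return null; // placeholder: no bid
    }
}
```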

The ad server is a shared-nothing system, with some common components communicating asynchronously with the bidders once every few minutes. These shared components transferred state after computation (such as campaign results, performance and available budgets) every few minutes.

Ad serving itself was extremely performant, delivering results with latencies between 5–15 ms. Raw data was then asynchronously written into Kafka for processing.
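
A minimal sketch of that fire-and-forget Kafka write, assuming a hypothetical raw-events topic with string keys and byte-array values; the actual topic layout and serialization are not described in the post.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

// Sketch: the serving path hands raw events to Kafka asynchronously and never
// waits on the broker. Topic name, serializers and acks setting are placeholders.
public class RawEventLogger implements AutoCloseable {
    private final Producer<String, byte[]> producer;

    public RawEventLogger(String brokers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", brokers);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("acks", "1"); // trade a little durability for lower latency
        this.producer = new KafkaProducer<>(props);
    }

    /** Fire-and-forget send; delivery happens on the producer's background thread. */
    public void log(String auctionId, byte[] rawEvent) {
        producer.send(new ProducerRecord<>("raw-events", auctionId, rawEvent));
    }

    @Override
    public void close() {
        producer.close();
    }
}
```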

Raw data was consumed in chunks by one or more Java processes for aggregation in HBase and for spend / campaign status updates in the Cassandra cluster.
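
A rough sketch of that aggregation side, assuming a chunk of pre-aggregated counts (for example built up from a Kafka consumer poll loop) and a hypothetical campaign_counters table: each (row, metric) bucket becomes a single HBase counter increment.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

import java.io.IOException;
import java.util.Map;

// Sketch: events arrive in chunks, are pre-aggregated in memory, and each
// (campaign row, metric) bucket is written as one HBase counter increment.
public class CounterAggregator implements AutoCloseable {
    private static final byte[] CF = Bytes.toBytes("m"); // hypothetical column family

    private final Connection connection;
    private final Table counters;

    public CounterAggregator() throws IOException {
        Configuration conf = HBaseConfiguration.create();
        this.connection = ConnectionFactory.createConnection(conf);
        this.counters = connection.getTable(TableName.valueOf("campaign_counters"));
    }

    /** Flush one chunk of pre-aggregated counts: rowKey -> (metric -> delta). */
    public void flush(Map<String, Map<String, Long>> chunk) throws IOException {
        for (Map.Entry<String, Map<String, Long>> row : chunk.entrySet()) {
            byte[] rowKey = Bytes.toBytes(row.getKey());
            for (Map.Entry<String, Long> metric : row.getValue().entrySet()) {
                counters.incrementColumnValue(rowKey, CF, Bytes.toBytes(metric.getKey()), metric.getValue());
            }
        }
    }

    @Override
    public void close() throws IOException {
        counters.close();
        connection.close();
    }
}
```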

Some of the raw data was also sent to a Spark cluster for ad hoc processing.
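
For completeness, a tiny sketch of the kind of ad hoc Spark job this enables, using the Java API. The input path, log format and output location are placeholders.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

// Sketch of an ad hoc job over raw logs: count events per campaign.
// Input path, delimiter and output location are placeholders.
public class AdhocCampaignCounts {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("adhoc-campaign-counts");
        JavaSparkContext sc = new JavaSparkContext(conf);

        sc.textFile("hdfs:///raw-events/2015-06-13/*")
          .mapToPair(line -> new Tuple2<>(line.split("\t")[0], 1L)) // campaignId -> 1
          .reduceByKey(Long::sum)
          .saveAsTextFile("hdfs:///adhoc/campaign-counts");

        sc.stop();
    }
}
```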

Improvements

My belief is that RTB is killing ad tech, and both the RTB spec and its transport need to be rethought. Trying to improve this situation is an important organizational goal.

There are also plenty of internal improvements planned, including a better way of replicating data across RocksDB instances, introducing pre-aggregation using the Disruptor framework, and much more.

Questions?

Please do write a note here or tweet at me (twitter.com/azifali) if you have any further questions.

Liked this blog? Please recommend the story and / or follow me on Medium and LinkedIn.

PS: I will continue to improve the quality of the content on this blog over the next few days. So follow me / recommend this blog for updates.

