Race against the clock: how we manage to serve ads under 100 milliseconds
For an ad server, every millisecond counts. We must select an ad for a request as quickly as possible; otherwise we will be knocked out of the business because either the publisher’s ad server or the user won’t have the patience. At GumGum, we manage to fulfill tens of millions of requests every minute while maintaining a latency of less than 100 milliseconds. To understand how we do that, let’s trace an ad request.
The first step is to assign the request to a specific web server. Obviously, we can’t have a single server take all the requests for both cost and high availability reasons. Since our business is across the globe, we run a number of AWS EC2 instances in multiple AWS regions. To mitigate the latency incurred by geographical distances, we use the Geolocation routing policy offered by AWS Route53 to direct the request to the nearest server location.
As previously mentioned, the load is distributed among multiple servers for high availability and low latency, but how many servers should we use? The more the merrier or less is more? Neither! If we run too many servers, we are putting more pressure on our databases because there will be more connections established. In addition, the cost will increase because we use spot instances, and a spot request for more instances is less likely to be fulfilled. On the other hand, if we run too few servers, we can’t afford to lose even one server because a large capacity will be taken out and our availability will be harmed. So we figure out a sweet spot by choosing proper instance types and auto scaling policies to avoid problems on both sides.
Now we have a bunch of servers to service the requests, how does the load balancer decide to which server the request should be sent? Currently we use the round robin policy, but we are looking to explore the intelligent traffic flow offered by Spotinst to score better efficiency.
Request decorating and filtering
After a server receives a request, it decorates the request by gathering contextual data around the request and runs a few filters to decide if the request is invalid. The contextual data of the request is the metadata of the environment — a webpage, image, video, etc. — where the ad will be placed. The topic of an article, the sentiment of the text, and objects in an image are all examples of contextual data.
Now we will encounter the first database in this blog post, DynamoDB, where we store the contextual data. When it comes to databases, there are a ton of techniques to minimize the latency. DynamoDB already provides very good performance, but again, for an ad server, every millisecond counts. That’s why we adopted the DynamoDB accelerator (DAX) which is basically a cache in front of DynamoDB. If you are interested to learn more details, check out our blog post — Migrating from ElastiCache for Memcached to DAX.
One of the important techniques that enables us to handle a huge amount of traffic effectively and efficiently is that we try to avoid wasting resources as much as possible. To that end, we validate the request through a list of filters and discard it as soon as we find out it’s not worth further processing.
Now the request has survived the filters and made it to the next step — ad filtering. We have thousands of candidate ads to select from for each request. To select the best ad, we apply more than 100 ad filters. Most of the ad filters check the restrictions configured for the ad. For example, some ads are configured to be displayed on certain websites. That data is stored in MySQL for persistence, but we can’t afford to query MySQL on every single request. So we cache everything up in HashMaps which gives us the best latency for retrieving the restrictions.
Another optimization we make to expedite the ad selection process is finding the optimal order of executing the ad filters. We make filters that take less time to complete while rendering more ads invalid run before those that are more expensive and filter less ads. Another big part of ad selection is to exclude the ads that have already reached their serving goals. Some ads have a cap in the number of times they can serve because advertisers have a fixed budget.
Due to the distributed nature of our application, we need a centralized datastore that can handle high load on both write and read requests because the ads are constantly served and validated against their goals. For this reason, we chose ScyllaDB as our ad performance (how many times an ad has been served) database. To learn more about our story with Scylla, give Migrating from Cassandra to ScyllaDB a read.
As a famous saying goes: when there is a database there is a cache (yes, I made it up :-)). However, we need to constantly check an ad’s progress against its goal, so a cache seems useless here. Despite the constraint, we still manage to use a cache to speed up the process. The cache is only used to get rid of the ads that have already met their goals. For those that are still firing on all cylinders, we consult the database. Even though a server still makes 400k queries to the database, the latency can still be kept as low as 1 millisecond, thanks to the proper data modeling so each query can directly hit the partition storing the data.
Real time bidding
When the execution of the ad filters completes, we may end up with multiple ads. The next step is to divide them into 2 groups, in-house ads and DSP (demand-side platform) ads.
For DSP ads, we don’t have the ad creatives (the actual ad design to show the user), so we need to ask DSPs for them through real time bidding as per OpenRTB. The bid requests are essentially HTTP requests, and sending the bid requests and waiting for the responses make up the largest chunk of our latency and is a bottleneck that could bring the application down. Therefore, we take advantage of a fine-tuned HTTP connection pool and thread pool to facilitate the job. Besides, some DSPs are less likely to respond or take a longer time to send back bids, so we utilize a 3rd party traffic shaping tool called Rivr offered by Simplaex to help us predict which bid requests will not yield any revenue and hence are not worth sending. This has enabled us to cut back on the number of bid requests to a great extent and maintain our low latency in a cost-efficient way.
After we get the responses from DSPs, we select the winning ad that would bring us the most money, and send back a response with that ad. That completes our journey of the ad request trace.
We traced a typical ad request and looked at how the ad server at GumGum optimizes for latency at each step of the way. The actual workflow of the ad server is a lot more complicated than described in this blog, but that nitty-gritty is not important for this post. In conclusion, we looked at a few ways the ad server at GumGum optimizes for latency:
- Place servers close to the clients
- Have enough servers to split the load but not too many
- Use cache as much as possible
- Avoid waste of resources and unnecessary processing
- Choose the right tool for the job
- Tailor NoSQL data modeling to application access patterns
We are looking to achieve an even lower latency by exploring the following optimizations:
- Routing requests to the servers based on their capacities instead of distributing the load equally
- Execute ad filters concurrently
- Use a more efficient JVM or garbage collection algorithm