Ad Delivery Network Performance Optimization with Flame Graphs
In a busy ad delivery network, low latency and high throughput are the name of the game. This is why we chose OpenResty as our delivery layer.
Last September, our delivery layer experienced a surge in traffic on our European servers: CPU usage in North America was maxing out, and our front-end load balancer was redirecting US traffic to our servers in Europe. Unfortunately, this meant that some traffic was lost to timeouts.
We quickly resolved the problem by adding more hardware to spread the CPU load, buying us time to figure out why our CPUs had maxed out in the first place. All crises are opportunities in disguise, so we dove right in and started a full performance audit of our delivery layer. The primary goal was to reduce CPU usage and thus improve the mean response time of our engine.
While researching how to optimize OpenResty applications, we found a tool that could produce flame graphs for profiling our application. These graphs represent CPU usage and visualize the stack traces of our system.
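The post doesn't name the exact tooling we used, but as one common workflow (an assumption, not necessarily ours), Brendan Gregg's FlameGraph scripts can turn `perf` samples of a worker process into an interactive SVG:

```shell
# Sample on-CPU stacks of an OpenResty/nginx worker ($WORKER_PID is a placeholder) for 30 s.
perf record -F 99 -p "$WORKER_PID" -g -- sleep 30

# Fold the sampled stacks and render the flame graph
# (stackcollapse-perf.pl and flamegraph.pl come from the FlameGraph repository).
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > openresty.svg
```

The width of each box in the resulting SVG is proportional to the CPU time spent in that function, which is what makes a single hot function jump out visually.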
As the flame graph illustrates, one function (tab_len) takes an excessive amount of CPU time. This function computes the length of a table, and it was responsible for 50% of the CPU time our engine spent delivering an impression. Digging into our code base, we found that this function is used to select the best ad for the user. In other words, this very inefficient function is called billions of times every day.
While computing which ad is better to display to the user, we need to randomly choose an item from a population according to weights assigned to every item:
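The original snippet was Lua; a minimal Python sketch of the same naïve approach (function and variable names are hypothetical) might look like:

```python
import random

def weighted_choice_naive(items):
    """Naive weighted selection over (item, weight) pairs.

    Each item is appended to a pool as many times as its weight,
    multiplied by 100 to account for floating-point weights, then a
    random element is picked from the pool.
    """
    pool = []
    for item, weight in items:
        for _ in range(int(weight * 100)):
            pool.append(item)
    return random.choice(pool)
```

For example, `weighted_choice_naive([("ad_a", 0.5), ("ad_b", 1.5)])` builds a 200-element pool just to pick one item.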
This is an intuitive but somewhat naïve approach to weighted random selection, and it has serious performance issues. Essentially, the function fills a table with each element repeated as many times as its weight (multiplied by 100 to account for floating-point weights), then picks a random element from that table. This is highly inefficient: we were inserting elements into tables thousands of times before selecting a single item.
We refactored the function and improved its performance drastically. The new function builds intervals from the elements' cumulative weights and picks a single random number; the item whose interval contains that number is selected. This approach supports floating-point weights and creates much smaller tables.
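A Python sketch of that interval-based replacement (again with hypothetical names; the production implementation was Lua):

```python
import random

def weighted_choice(items):
    """Interval-based weighted selection over (item, weight) pairs.

    Builds one cumulative-weight boundary per item, draws a single
    random number in [0, total], and returns the item whose interval
    contains it. One table entry per item, no repetition.
    """
    intervals = []
    total = 0.0
    for item, weight in items:
        total += weight
        intervals.append((total, item))  # upper bound of this item's interval
    r = random.uniform(0, total)
    for upper, item in intervals:
        if r <= upper:
            return item
    return intervals[-1][1]  # guard against floating-point rounding at the boundary
```

Instead of thousands of insertions, this does one pass over the items and one comparison per item, which is where the CPU savings come from.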
Looking at the flame graph after refactoring shows a big improvement in performance: the function is not visible on the graph now.
While there are still functions that could be improved, none of them stands out the way the previous one did in the earlier flame graph.
A substantial improvement can be seen on the afternoon of September 24, after the code was deployed. This underlines how a flame graph can be a particularly useful tool for pinpointing the functions causing issues in a system. A quick refactor of our weighted random function ended up reducing our CPU load by ~40% during peak traffic hours.
We could have spent countless hours trying to find the culprit of this CPU consumption by hand or even trying to time the request-response cycle of our requests, but having the flame graph tool helped us tremendously.
This shows once again that monitoring, benchmarking, and instrumenting our code are a huge help when debugging issues and improving performance. In other words: if you can measure it, you can improve it.