4 Lessons Learned From Optimizing Our 100B-daily-events Web Handlers

Harel Opler
AppsFlyer Engineering
Mar 3, 2022

AppsFlyer is the global leader in marketing analytics. To be a market leader we need to process insanely large amounts of data.

I’m a part of the team that handles a significant part of AppsFlyer’s incoming traffic, over 100 billion events per day. These events are received in our web handlers where they undergo authentication, validation and normalization, and are then published to other consumers.

A few weeks ago we had a large feature ahead of us that required new code in the web handler. We knew the feature would require extensive work on the service, so we decided it was a good time to show it some love by refactoring and optimizing it.

The optimization process was a success: we eventually increased the throughput of our web handlers by more than 25% (and more than that, as you will understand by the end) while decreasing processing times, thus improving our customer experience and saving cloud computing costs. During the process I learned a few important lessons, and I hope you will find them useful.

First lesson — The profiler is your best friend

A profiler is an application that connects to your application process and pulls runtime data from it — thread behavior, CPU usage, memory layout and more.
A profiler can answer many interesting questions, like -

  • How much time is spent in each function in the code?
  • When does a thread wait on a lock, and which lock is it?
  • How much time does a certain thread or function spend on the CPU?
  • What are the “hot-spots” of memory allocation in your system?

I see the profiler as an X-ray machine — it will not necessarily tell you what is causing your issues, but it will allow you to see them. Once you plug in the profiler, you get a wonderful picture of how your software behaves in real life.

Our web handlers are written in Clojure (which runs on the JVM) so we used a JVM profiler.
I can recommend two popular profilers -

  1. The open-source VisualVM — a very common profiler that comes with most of the basic functionality and will help you deal with the most common issues.
  2. The commercial JProfiler — a very powerful profiler, with many unique and very useful features that give it the edge. It’s not cheap, but it’s worth it.
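
For local processes both profilers can usually attach directly. For a remote instance, one common option is to expose JMX when starting the JVM so the profiler can connect over the network. A minimal example of the relevant JVM flags (the port, hostname and jar name are placeholders, and in a real environment you would want authentication and SSL enabled):

    java -Dcom.sun.management.jmxremote \
         -Dcom.sun.management.jmxremote.port=9010 \
         -Dcom.sun.management.jmxremote.rmi.port=9010 \
         -Dcom.sun.management.jmxremote.authenticate=false \
         -Dcom.sun.management.jmxremote.ssl=false \
         -Djava.rmi.server.hostname=<instance-hostname> \
         -jar web-handler.jar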

The profiler was the main tool I used during the optimization process.
This was actually an iterative process — each iteration began with a thorough profiling of the application, to see whether the changes from the previous iteration had improved its behavior and to locate new areas for improvement. Then came another change, and the process started all over again.

The profiler allowed us to spot and address some significant performance issues -

  • An esoteric flow was causing significant performance degradation for all traffic due to the wrong algorithm being used.
  • A specific thread that handles a small part of the flow for 100% of the traffic was using almost 100% of a CPU core, meaning it had effectively reached an upper bound on traffic and become a bottleneck for all other operations in the service.

The profiler is the strongest tool in your arsenal. Use it to the fullest.

Second lesson — Performance issues tend to follow the Pareto principle — Aim for the significant 20%

The Pareto Principle (AKA the 80/20 principle) is a characteristic of some human systems, where roughly 80% of consequences come from 20% of causes.

This principle shows in many areas of human life -

  • 80% of the land in Italy is owned by 20% of the people
  • 80% of the time spent on customer service is associated with 20% of the customers
  • 80% of traffic on the Internet is associated with 20% of the pages

Turns out this is true for optimizing your code as well.

Well, I didn’t mathematically prove that. It is left as an exercise for the reader :)

The key point is that by fixing just a few significant performance issues, you will gain most of the value.

It’s also safe to assume that optimizations will not usually be tasks of their own, but rather attached to some bigger task (as in our case), or even executed as a developer initiative, so your time working on them will probably be limited.

Being cognizant of the points above yields two important guidelines -

  1. Search for the significant performance issues and address them first
  2. Stop optimizing when you feel you’ve reached the “other 80%” of issues

Finding the most significant issues is not a trivial task, but here are some examples of places you can start your search from -

  • Threads using the CPU almost 100% of the time, creating a bottleneck
  • Locks that are accessed from many threads and might halt them
  • Operations that are performed repeatedly and can be batched together (for example a database read)
  • CPU-bound flows containing synchronous blocking I/O that holds the whole pipeline back
  • Very quick, CPU-bound tasks offloaded to another thread, which is usually not worth the context switch (more on this in the next section)
  • Extensive GC, caused by either excessive memory allocation or a process having insufficient memory
  • Algorithms that involve iterating over large data structures. These can be real low-hanging fruit, with optimizations being both very significant and very easy
  • Working with notoriously expensive data types (e.g. Java’s BigDecimal arithmetic)

Depending on the issue, you can validate that it’s a real issue by using a profiler, reading the code, or applying simple common sense. It is pretty obvious that batching a database read, if possible, will save you time, CPU and network.
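
To make the batching example concrete, here is a minimal Clojure sketch of turning a per-event read into a single batched read. fetch-user and fetch-users are hypothetical lookup functions, not our actual code:

    ;; Naive: one database round trip per event (N events means N reads)
    (defn enrich-events-naive [events]
      (map (fn [event]
             ;; fetch-user is a hypothetical single-key read
             (assoc event :user (fetch-user (:user-id event))))
           events))

    ;; Batched: collect the keys first, then do a single round trip for the whole chunk
    (defn enrich-events-batched [events]
      (let [ids   (distinct (map :user-id events))
            ;; fetch-users is a hypothetical multi-key read returning a map of id to user
            users (fetch-users ids)]
        (map (fn [event]
               (assoc event :user (get users (:user-id event))))
             events)))

If the underlying store supports multi-key reads, the batched version trades N round trips for one.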

As for the second guideline, how do you know when to stop optimizing?

That depends on your application, and to what extent you are trying to optimize it.

But generally speaking, when you feel your optimizations have little effect, if any, you have probably crossed the line to the “other 80%”.

You will usually face this when you move from the ‘big’ issues towards more local, anecdotal ones. How these issues look depends largely on your business logic — optimizing some regex for a generic web server could be negligible (even if the regex itself was significantly optimized), while optimizing text-parsing logic even just a little could be a huge gain for a text search engine.

The point is that when you find yourself optimizing these “negligible” issues, where the lemon has almost no juice left to squeeze, you should probably stop optimizing.

Bottom line — address the “significant 20%” first, stop optimizing when you feel you’ve reached the “other 80%”.


Third lesson — Avoid offloading short-living, CPU-bound tasks to other threads

Usually, multi-threading seems harmless. Moreover, intuitively it seems beneficial — your software can execute its logic in parallel, and while one thread handles something, others can do some other work.

This is true in many cases, but one case where this assumption does not hold is CPU-bound tasks, specifically short-lived ones. Why? Context switching.

A context switch is the act of one process or thread leaving the CPU and another taking its place. Context switching is a relatively slow operation, with both direct and indirect costs, and although switching between threads of the same process is faster than switching between processes, it still incurs significant overhead.

For execution of blocking operations, it is very clear why context switching is worth it — you pay microseconds at most in context switching to offload a blocking operation that takes milliseconds or more. Moreover, once a thread is waiting on a blocking operation to complete (e.g a disk read or network call), the OS removes it from the CPU and blocks it from returning until the blocking operation is done, giving room for other threads to use the CPU.

Now let’s see what happens with CPU-bound tasks — things like serializing/deserializing, calculations, searching in data structures, local data manipulations.

We still pay microseconds at most in context switching, but offload a (usually) much quicker task.

What’s worse is that the task we offload is not blocking, meaning it competes with the previous thread for the CPU.

Now you might say to yourself “Hey, what do we have multiple cores for?”, and that’s somewhat right. You might suffer from a higher cache-miss rate (due to the task running on another core), but under moderate load you will probably not experience any issue.

When we move towards higher-scale scenarios, this becomes more and more problematic. All cores are highly utilized, so threads might wait before they reach the CPU. Since in our scenario all the threads are always in a runnable state (as they are not performing any blocking operations), they preempt each other and context switches happen very frequently, putting stress on the OS with thread-scheduling and context-switching overhead. And all of this is before even mentioning thread contention.

So instead of running a very short, CPU-bound task on the thread that is already handling it and “paying” just the run time of the task, we offload it to another thread and also pay for context switches, OS thread scheduling, cache invalidation and more.
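
Here is a simplified Clojure sketch of the two options. parse-event and publish! are hypothetical placeholders for a quick CPU-bound step and the downstream call, not our production code:

    ;; Offloaded: a tiny CPU-bound task is handed to another thread.
    ;; We pay for scheduling and a context switch, and the deref blocks anyway.
    (defn handle-request-offloaded [raw-event]
      (let [parsed (future (parse-event raw-event))]
        (publish! @parsed)))

    ;; Inline: the same work, done on the request-handling thread.
    ;; No extra thread, no context switch, no scheduling overhead.
    (defn handle-request-inline [raw-event]
      (publish! (parse-event raw-event)))

The second version is essentially the change we ended up making, as described next.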

This precious lesson shone through our optimization process and led to one of the most significant optimizations we made. We had a few pipelines that performed some very quick, CPU-bound operations asynchronously. Once we removed all of them and performed all CPU-bound work on the same thread (the request-handling thread), the behavior of the service improved significantly, with machine load decreasing by around 30%. This makes sense, because now there are fewer threads waiting for the CPU; and since we reduced context switching drastically, the CPU overhead it used to cause could now be used to process more requests. And indeed we saw a significant increase in the throughput of the service.

This is huge, and it illustrates how significant the context-switch overhead is at high scale.

P.S — Some ideas mentioned in this section are tightly related to the concept of non-blocking programming — a paradigm that aims to make use of non-blocking/asynchronous APIs and to minimize the number of threads (and the context switches associated with them), all in order to build applications that scale out well.

Non-blocking programming is beyond the scope of this article but you can read this nice introduction.

Fourth lesson — Improvement all across the board


When we finished the optimization process we were pretty satisfied with the results we saw.

A roughly 30% reduction in EC2 compute costs, along with reduced latencies, was better than we expected.

But as we monitored the web handler after the optimizations, we found out there was more.

The web handler writes the data it receives to Kafka, which has built-in batching — records are grouped by partition (for a certain time window or until a batch-size limit is reached) and sent to the Kafka broker as a batch. Not only that, but multiple batches targeted at the same Kafka broker are grouped together and sent in a single network request.
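
For reference, this producer-side batching is controlled by standard Kafka producer settings such as batch.size and linger.ms. A minimal Clojure sketch of such a configuration (the broker addresses and values here are illustrative, not our actual settings):

    (import '(org.apache.kafka.clients.producer KafkaProducer)
            '(java.util Properties))

    (defn make-producer []
      (let [props (doto (Properties.)
                    (.put "bootstrap.servers" "broker-1:9092,broker-2:9092") ; illustrative addresses
                    (.put "key.serializer" "org.apache.kafka.common.serialization.StringSerializer")
                    (.put "value.serializer" "org.apache.kafka.common.serialization.StringSerializer")
                    (.put "batch.size" "65536")  ; max bytes per per-partition batch (illustrative)
                    (.put "linger.ms" "10"))]    ; wait up to 10 ms so batches can fill up
        (KafkaProducer. props)))

Larger batch.size and linger.ms values generally allow bigger batches, at the cost of slightly higher latency.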

As I mentioned, the optimization allowed us to decrease the number of instances, so now each instance of the service handled significantly more requests per time unit.

This means that for a given instance of the service, the probability of two records ending up in the same batch or two batches ending up in the same network request is higher than before.

And indeed we saw that the average batch size and the average request size sent to the Kafka brokers had increased. Why is this so important? Because batching means fewer network operations and reduces the overall data sent, as every network request has some overhead associated with it.

This had two major effects -

  1. A visible decrease in Kafka cluster utilization — CPU usage, disk usage, latencies: almost every metric improved. The size of the topic (in bytes) was actually reduced by around 10% by this alone!
  2. Since the overall data sent decreased, we also saw a decrease in AWS Regional Data Transfer costs — the cost of data transferred between AWS availability zones, which in our case is the traffic between the service and the Kafka cluster. This traffic is relatively expensive, and we saw an approximate 10% decrease, which makes sense given the topic size reduction.

These side effects are very significant, and taught us an important lesson — optimizations do not end with the optimized service. They have a ripple effect, stretching from the optimized service to infrastructure, network and downstream systems. We are sure we still haven’t seen all the effects of the optimizations we implemented.

That’s all!

The optimization process was fun and very educational. I got to improve our customers’ experience and reduce cloud costs significantly, and I learned a lot of lessons — the most important ones I shared with you here. I hope you find them useful and embark on your own optimization adventure as well.
