All engineers with code in production have been in a situation where all the alerts are red, the activity feed is intermittently down, nobody knows what’s happening, and the service is on life support. This happened to us in October 2017, when traffic to vevo.com surged and our feed began intermittently failing at 120,000+ requests/sec due to excessive memory consumption.
Over the past 18 months, Vevo has invested heavily in monitoring and alerts, which paid off during this outage. Our ops team was quickly alerted, and in coordination with our on-call engineer, we began our pre-established incident management process.
At a high level, this involves creating a Jira issue and linking it to Slack with the Slack Connector. This automatically creates a Slack channel and mirrors updates between that channel and the issue comments, which makes reviewing timelines and coordinating action simpler, both during the initial investigation and during later impact analysis.
Afterward, our ops team took immediate action and increased both node count and memory capacity. The service stabilized, despite a steady rhythm of restarts every 5–10 minutes. Now the investigation could begin in earnest to find the root cause.
Using jstack -l 1 | grep "tid=" (lowercase L and one), it was clear that the memory was being consumed by too many threads. We had more than 20,000 threads, all named pool-#-thread-1. In other words, something was creating hordes of single-thread executors. The output looked something like this and went on for ages:
After a lot of digging, many incorrect hypotheses, and a liberal use of reflection to name thread pools that were not publicly accessible, we finally found the culprit: a third-party WebSocket library we were using. At first glance, everything seemed fine, but every connected client was being assigned its own Executors.newSingleThreadScheduledExecutor() to handle heartbeats and other connected-client messaging. For the most part, these threads were idle: just a few heartbeats per minute and another message at disconnect.
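The naming pattern is easy to reproduce. The following is a minimal sketch, not the library's actual code: the class and method names are hypothetical, and the no-op task stands in for a heartbeat. Each call to Executors.newSingleThreadScheduledExecutor() without a ThreadFactory gets its own default factory, so every executor's only thread ends up named pool-N-thread-1.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; it only imitates the per-client pattern described above.
class PerClientExecutorDemo {

    /** Creates one scheduler per simulated client and counts live threads named pool-N-thread-1. */
    static long countGenericPoolThreads(int clients) {
        List<ScheduledExecutorService> schedulers = new ArrayList<>();
        try {
            for (int i = 0; i < clients; i++) {
                ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
                // Scheduling a task starts the executor's single worker thread.
                scheduler.scheduleAtFixedRate(() -> { }, 0, 30, TimeUnit.SECONDS);
                schedulers.add(scheduler);
            }
            return Thread.getAllStackTraces().keySet().stream()
                    .filter(t -> t.getName().matches("pool-\\d+-thread-1"))
                    .count();
        } finally {
            // Clean up so the demo does not leak the threads it is illustrating.
            schedulers.forEach(ScheduledExecutorService::shutdownNow);
        }
    }

    public static void main(String[] args) {
        System.out.println("threads named pool-N-thread-1: " + countGenericPoolThreads(200));
    }
}
```

With one mostly idle thread per client, the thread count grows linearly with the number of connections, which is the behavior the thread dump revealed.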
After all the hard work to locate the issue, the fix was fairly simple: create one shared scheduler thread and a pool of worker threads to dispatch the events to the clients. This reduced our thread count and memory footprint substantially, since a handful of threads can now do the work that was previously spread across tens of thousands.
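The shape of that fix can be sketched as follows. This is an illustration under assumptions rather than the code we shipped: SharedHeartbeatDispatcher, its methods, and the ws- thread names are all hypothetical. One scheduler thread owns every timer and immediately hands each heartbeat to a small fixed worker pool.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical class: one shared scheduler thread plus a fixed pool of workers.
class SharedHeartbeatDispatcher implements AutoCloseable {
    private final ScheduledExecutorService scheduler;
    private final ExecutorService workers;
    private final Map<String, ScheduledFuture<?>> heartbeats = new ConcurrentHashMap<>();

    SharedHeartbeatDispatcher(int workerThreads) {
        // One timer thread for all clients, instead of one per client.
        scheduler = Executors.newSingleThreadScheduledExecutor(
                task -> new Thread(task, "ws-heartbeat-scheduler"));
        AtomicInteger workerIds = new AtomicInteger();
        workers = Executors.newFixedThreadPool(workerThreads,
                task -> new Thread(task, "ws-worker-" + workerIds.incrementAndGet()));
    }

    void register(String clientId, Runnable heartbeat, long periodMillis) {
        // The scheduler only hands work off, so a slow client cannot delay the timer.
        ScheduledFuture<?> future = scheduler.scheduleAtFixedRate(
                () -> workers.execute(heartbeat), periodMillis, periodMillis, TimeUnit.MILLISECONDS);
        ScheduledFuture<?> previous = heartbeats.put(clientId, future);
        if (previous != null) previous.cancel(false);
    }

    void unregister(String clientId) {
        ScheduledFuture<?> future = heartbeats.remove(clientId);
        if (future != null) future.cancel(false);
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
        workers.shutdownNow();
    }

    /** Waits until every client has had one heartbeat, then counts live "ws-" threads. */
    static long liveThreadsAfterHeartbeats(int clients, int workerThreads) {
        CountDownLatch firstBeats = new CountDownLatch(clients);
        try (SharedHeartbeatDispatcher dispatcher = new SharedHeartbeatDispatcher(workerThreads)) {
            for (int i = 0; i < clients; i++) {
                AtomicBoolean seen = new AtomicBoolean();
                dispatcher.register("client-" + i, () -> {
                    if (seen.compareAndSet(false, true)) firstBeats.countDown();
                }, 10);
            }
            if (!firstBeats.await(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("heartbeats did not fire in time");
            }
            return Thread.getAllStackTraces().keySet().stream()
                    .filter(t -> t.getName().startsWith("ws-"))
                    .count();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("live ws- threads for 1000 clients: " + liveThreadsAfterHeartbeats(1000, 4));
    }
}
```

The thread count here depends on the pool size rather than on the number of connected clients. A real implementation would also need to decide how to handle rejected or failing heartbeat tasks, which this sketch ignores.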
Our metrics tell most of the story here. Memory and CPU usage became very flat after rolling out the fix.
Lesson #1: Name Your Threads
The first major takeaway is that if you’re using threads, you should be naming them. Tracking this down would have been trivial had the threads been appropriately named. Internally, we have a library that wraps Java’s Executors class and mandates names, so you won’t see any generic names coming from Vevo’s libraries. Unfortunately, it is extremely common to see generically named thread pools in many libraries; the Executors class just makes it too easy. Here’s a simple example for creating a ThreadFactory to provide nicely named threads.
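A minimal sketch of such a factory follows. The class name NamedThreadFactory and the feed-heartbeat prefix are illustrative choices, not Vevo's internal library; the only standard pieces are the ThreadFactory interface and the Executors overloads that accept one.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: names every thread "<prefix>-<n>" so thread dumps are self-explanatory.
class NamedThreadFactory implements ThreadFactory {
    private final String prefix;
    private final AtomicInteger counter = new AtomicInteger();

    NamedThreadFactory(String prefix) {
        this.prefix = prefix;
    }

    @Override
    public Thread newThread(Runnable task) {
        // The counter keeps names unique within this factory.
        return new Thread(task, prefix + "-" + counter.incrementAndGet());
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2, new NamedThreadFactory("feed-heartbeat"));
        // Prints the name of the worker that ran the task: feed-heartbeat-1
        pool.submit(() -> System.out.println(Thread.currentThread().getName())).get();
        pool.shutdown();
    }
}
```

Every factory method on Executors has an overload that takes a ThreadFactory, so passing one in is usually a one-line change at the call site.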
Lesson #2: Understand What Your Java Profiler Is Hiding
The second interesting takeaway is that, with threads, even using a Java profiler doesn’t tell the whole story when examining memory use. Our heap stats looked pretty normal during all this because the majority of memory was being allocated for each thread’s stack, which lives in native memory outside of the JVM heap.
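A rough back-of-the-envelope check makes the gap visible. The sketch below compares heap usage with an estimate of thread stack memory. The 1 MiB per-thread figure is an assumption (a common default on 64-bit Linux, but it varies by platform and by the -Xss setting), and it describes reserved rather than necessarily resident memory; the class and method names are hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper comparing heap usage with an estimate of thread stack memory.
class ThreadMemoryEstimate {
    // Assumption: 1 MiB of stack per thread. Check your platform default and -Xss.
    static final long ASSUMED_STACK_BYTES = 1024L * 1024L;

    /** Upper-bound estimate of stack memory reserved for the given number of threads. */
    static long estimateStackBytes(int threadCount, long stackBytesPerThread) {
        return threadCount * stackBytesPerThread;
    }

    public static void main(String[] args) {
        int liveThreads = ManagementFactory.getThreadMXBean().getThreadCount();
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();

        System.out.println("live threads:          " + liveThreads);
        System.out.println("heap used (bytes):     " + heap.getUsed());
        System.out.println("est. stack (bytes):    " + estimateStackBytes(liveThreads, ASSUMED_STACK_BYTES));
        // Under the same assumption, 20,000 threads would reserve roughly 20 GiB of stack.
        System.out.println("est. for 20k threads:  " + estimateStackBytes(20_000, ASSUMED_STACK_BYTES));
    }
}
```

For measured rather than estimated numbers, the JVM's Native Memory Tracking (started with -XX:NativeMemoryTracking=summary and queried with jcmd <pid> VM.native_memory summary) reports thread stack usage directly.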
We’ve since added metrics reporting to all of our thread pools, so tracking down and monitoring thread use across our systems is easier than ever. Using the improved websocket code, which was recently incorporated, we have a nice pool of workers handling thousands of connected clients without a hitch.
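As a sketch of what per-pool reporting can look like, ThreadPoolExecutor already exposes the relevant counters. The ThreadPoolMetrics class and its metric names below are hypothetical, and how the snapshot is shipped to a metrics backend is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical helper that reads the counters a ThreadPoolExecutor exposes.
class ThreadPoolMetrics {

    /** Point-in-time view of a pool; the values are approximate while tasks are running. */
    static Map<String, Long> snapshot(ThreadPoolExecutor pool) {
        Map<String, Long> metrics = new LinkedHashMap<>();
        metrics.put("poolSize", (long) pool.getPoolSize());
        metrics.put("activeCount", (long) pool.getActiveCount());
        metrics.put("largestPoolSize", (long) pool.getLargestPoolSize());
        metrics.put("queuedTasks", (long) pool.getQueue().size());
        metrics.put("completedTasks", pool.getCompletedTaskCount());
        return metrics;
    }

    public static void main(String[] args) throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                4, 4, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        for (int i = 0; i < 100; i++) {
            pool.execute(() -> { });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println(snapshot(pool));
    }
}
```

Calling snapshot on a schedule and publishing the result under the pool's name gives a per-pool thread count over time, which is the signal that would have exposed this leak early.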
Lesson #3: Develop an Incident Management Process Before Your Next Incident
Lastly, in times of crisis, I can’t overstate how valuable having an established incident management process is. I shared a few details of how ours works, which is loosely based on this guide, but be sure to find something that works for your team. Slack and Jira work well for us, but the key is that all of our ops, developers, and managers know where to look during an incident to see what is happening and what mitigating action is being taken. This keeps everyone focused on their area of expertise and helps to bring the issue to a close as quickly as possible.