Kubernetes, Java and GC walks into a bar

The curious case of a greedy service, surprising metrics and a prank call that went wrong

Jakub Coufal
Jamf Engineering
7 min read · Apr 26, 2021

This blog post was written under the Wandera brand prior to its acquisition by Jamf. That’s history; we are now #OneJamf.

Photo by Steve Allison on Unsplash

At Wandera we love Kubernetes, and we love Java even more. Most of our microservices are written in Java, and most of them run in numerous Kubernetes clusters around the world. I would dare to say that our engineers love Java as much as our platform teams love Kubernetes. But the marriage of the two was born out of necessity rather than love, so there are a few rough edges here and there.

Some time ago one of our teams noticed intermittent network issues that would cause failures in the communication between two of our edge services. The workaround was simple: fail over the traffic of the affected service to a healthy instance and fail back after a while. But we couldn’t find anything in the thread dumps that would hint at what the underlying problem was. The backend team introduced a metric and an alert to detect open circuit breakers (CBs) between the edge services so that we could at least react in a timely fashion without affecting our customers.

At Wandera we not only release services independently multiple times a day in a continuous delivery fashion, but we can also enable features for our customers in a similar way using an in-house built feature flag system. One part of this system is a common library used by most of our Java services to check whether a specific feature is enabled for a particular customer/admin/device/whatever. The library was written some time ago and has been used over the years without significant changes, so you could probably call it battle-proven. But as with everything, the devil hides in the details, and a change in the environment or input parameters can lead to surprising results. This will come into play later.
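To give a flavour of how such a check is typically consumed, here is a minimal, hypothetical sketch; the interface, method and flag names are illustrative, not the real library API. The important detail is that the check sits on the hot path of request handling.

```java
// Hypothetical sketch of a feature flag check on the request path;
// the FeatureFlagClient interface and the flag name are illustrative only.
public class FlagUsageExample {

    interface FeatureFlagClient {
        boolean isEnabled(String feature, String customerId);
    }

    static String handleRequest(FeatureFlagClient flags, String customerId) {
        // Evaluated on (potentially) every request, i.e. on the hot path.
        if (flags.isEnabled("new-routing-logic", customerId)) {
            return "new behaviour for this customer";
        }
        return "existing behaviour";
    }
}
```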

The Tiger team

The alert was introduced before Christmas and was relatively quiet until the beginning of January. At that point, the alert started being triggered regularly, with the corresponding escalation to the on-call engineer (rightly so). It seemed to be happening only in our US DCs, which were among the busiest at the time.

We tried changing the local DB that runs next to our edge services and added more resources to the worker nodes in the cluster, but even though the response times of the services started to improve, the circuit breakers kept opening. We had some theories but no hard evidence. One theory, for example, was that the problem could be related to the new DB backend (we had moved from an embedded to an external one), which brought some extra latency and was not handling the load well. Our Prometheus metrics showed increasing latency to the DB under load, while the DB’s own metrics showed no increase whatsoever.

Despite the intermittent nature of the issue and its relatively low impact, we decided to call for a Tiger Team to focus on the issue, together, for a couple of days. The outcome of that focused effort:

  • The team isolated the issue to one of our network services
  • It was somehow connected with a release of the service or with additional load placed on it
  • When locked up, the service was unable to recover unless the load was fully taken off it

While the issue was under investigation by the Tiger Team, the product and backend teams were enabling a couple of features for our customers, features that increased the load on this service. The conclusion wasn’t hard to make: we were observing cause and effect. Enabling the feature flag caused more load on the service, which in turn locked up. The easiest (and arguably most hated) solution was to increase the compute resources assigned to the service. The issue went away, and even the metrics showing the latency of the DB calls improved, which caused some head-scratching.

Nevertheless, the Tiger Team stayed on task and kept analysing the effect of the increased resources… and then it happened again. The product team enabled some features for another batch of US-based customers, which triggered the same alert in one of our European DCs at the exact moment the flag was enabled. But there was no increase in load in that DC, as the affected customers do not run their device fleets in the EU.

Prank call

How could enabling features for customers on one side of the world affect services on the other? Do you recall the library I mentioned earlier (the one we use to check feature flags)? This is where the prank call went wrong.

Imagine you want to prank call someone (a customer with a feature enabled), so you order a phone book (a feature flag with the list of customers for which it is enabled). But the book arrives as an unsorted set of names, so you take your time and build a tree of names to make it easier to find the victim of your call (in the end, there are just a couple of names). You find the name, make the call, laugh for a while, and as you no longer need the book, you throw it in the waste bin. Later the cleaner (the Garbage Collector) comes and takes out the trash.

You live your life like that, with the occasional prank call to make your day, but what if it becomes an obsession? Suddenly you are ordering tens of phone books and spending your time translating them into search trees; your waste bin and your room fill up with old phone books, and the cleaner needs to spend more and more time cleaning up the mess. It even gets to the point where the cleaner has to throw you out of your room (a Full GC) to clean it, so you have even less time to make calls. But you are insatiable and start arguing with the cleaner about the time spent in the room (the CPU quota). On top of that, the phone books keep getting bigger and bigger.

This is how our feature flag library was behaving, which we uncovered through detailed profiling of our edge service. But why was it a problem now? And how is it connected with Kubernetes?

Death by GC

In Kubernetes you have an option to keep the resource usage of a container in check: container limits. A CPU limit gives the container a limited portion of CPU time within a specific time window. Whenever the container uses up that portion, it is stripped of further CPU cycles until the window closes. This is done to reduce the effect of a noisy neighbour on other containers and to protect the other services running on the same node.
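Under the hood, a CPU limit becomes a CFS quota and period on the container’s cgroup. The sketch below is a minimal illustration assuming a cgroup v1 filesystem (cgroup v2 exposes the same information in /sys/fs/cgroup/cpu.max and cpu.stat); it reads the quota from inside a container, along with the throttling counters that cAdvisor turns into the container_cpu_cfs_throttled_periods_total metric.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Minimal sketch for a cgroup v1 container; paths differ under cgroup v2.
public class CpuQuotaCheck {

    public static void main(String[] args) throws Exception {
        // Microseconds of CPU the container may use per window (-1 means unlimited)
        long quota = readLong("/sys/fs/cgroup/cpu/cpu.cfs_quota_us");
        // Length of the window in microseconds (typically 100000, i.e. 100 ms)
        long period = readLong("/sys/fs/cgroup/cpu/cpu.cfs_period_us");

        if (quota > 0) {
            System.out.printf("CPU limit: %.2f cores (%d us of CPU per %d us window)%n",
                    (double) quota / period, quota, period);
        } else {
            System.out.println("No CPU limit applied to this cgroup");
        }

        // nr_periods / nr_throttled / throttled_time show how often the quota was hit
        System.out.println(Files.readString(Path.of("/sys/fs/cgroup/cpu/cpu.stat")));
    }

    private static long readLong(String path) throws Exception {
        return Long.parseLong(Files.readString(Path.of(path)).trim());
    }
}
```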

Java, on the other hand, has the process of garbage collection, which cleans up unused objects in memory and which, even though it is (usually) fast, can be quite expensive to run. It can be expensive enough to deplete the whole CPU quota in a particular window, and on top of that it happens in such short bursts that it is quite difficult to observe using regular metrics.
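The JVM does expose its own GC counters, and sampling them at a fine-grained interval (or exporting them through your metrics library) is one way to make these short bursts visible. A minimal sketch using the standard GarbageCollectorMXBean API:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Prints cumulative GC activity since JVM start; in a service you would
// sample these values periodically and export the deltas as metrics.
public class GcStats {

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms spent collecting%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```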

The combination of the two, plus the glitch in the library, meant that our service fell victim to death by GC:

  • The feature flag (FF) checks generate a ton of garbage (and burn precious CPU cycles on parsing)
  • For Java to cope with it, GCs have to run more often
  • More GC means higher service latency
  • Higher latency means more threads are needed to handle requests
  • More threads → more context switching
  • More context switching → more CPU cycles needed
  • CPU usage reaches the container limit
  • The container gets throttled
  • There is not enough CPU left for garbage collection
  • Memory is not reclaimed fast enough and garbage piles up
  • More garbage collection means even more CPU needed
  • Not enough CPU → CPU throttling
  • CPU throttling → latency increases
  • Even more threads and GC are needed
  • The vicious circle is now complete
  • The service is now (almost) unresponsive

The fix

To fix the situation, the team analysed the library and identified the place that needed fixing. In the end, it was just a tiny abstraction that parsed java.util.Properties into a java.util.HashMap that caused the issue. After the fix, the story reads like this:

After you find the name, make the call, and laugh for a while, you keep the book, as you might need it later. Only on the rare occasion that a new phone book revision comes out do you order a new one and throw out the old.
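In code, the change boils down to caching the parsed view instead of rebuilding it on every check. Here is a rough sketch under the assumption that the flags arrive as java.util.Properties; the class and method names are illustrative, not the actual library:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Illustrative sketch only, not the real Wandera/Jamf library code.
class CachedFeatureFlags {

    private volatile Map<String, String> parsed = new HashMap<>();

    // Before the fix, a conversion like this effectively ran on every check,
    // allocating a fresh HashMap per request and feeding the garbage collector.
    // After the fix, it runs only when a new revision of the flags arrives.
    void refresh(Properties newRevision) {
        Map<String, String> view = new HashMap<>();
        for (String key : newRevision.stringPropertyNames()) {
            view.put(key, newRevision.getProperty(key));
        }
        parsed = view; // swap in the new "phone book"; the old map becomes garbage only now
    }

    // Hot path: a plain map lookup, no parsing and no extra allocation.
    boolean isEnabled(String feature, String customerId) {
        String customers = parsed.get(feature);
        return customers != null && customers.contains(customerId);
    }
}
```

The exact data structure matters less than moving the expensive parsing off the per-request hot path.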

With the fix in production, it was time to measure the results.

  • CPU time spent on checking FFs plummeted by 97%
  • Java Heap memory allocated per second dropped by 80%
  • And finally, the mean response time of the service improved by 60%

Conclusion

  • A few hours/days of full focus on an issue can be bloody effective
  • If you set limits on your containers in Kubernetes, always have alerts for CPU throttling in place
  • Every abstraction in the code comes with a price; measure its effect so it doesn’t get out of hand
  • Telemetry (logs, metrics, traces) is important but in some cases insufficient
  • It is important to be able to profile the service in production
