Surfing on Lava

How Feedback Control and Tuning the BEAM Helped us Weather the Meltdown-Patching Storm

by Rafal Studnicki and Simon Zelazny

Our Stack

We run several Erlang and Elixir services at Grindr, and among them is our geo-presence service. It has moved forward a bit since we last talked about it, and nowadays it’s handling much more traffic.

The soft real-time properties of the Erlang VM allow us to set high standards for our services in terms of tail latencies (maximum, 99.9th, and 99th percentiles). Depending on the operation type, we usually expect response times to be under 50 ms, and we monitor them closely.

Warning Signs

When tail-latency response times rise above normal stable-state levels, it’s usually a good (and early) indicator of something going wrong with our system.

In November, we experienced two occurrences of tail latencies climbing much higher than usual. These events seemed unrelated to any change in client or server code, and abated on their own after a couple of hours.

Fig. 1. November 2017: Anomalously high tail latencies that mysteriously came and went (note logarithmic scale).

The System Goes Bad for Good

Right before Christmas, the mysterious ‘hiccup’ state became the new normal.

As our automated deployments were in freeze mode for the holiday season, we decided to put our faith in the self-regulating capabilities of our system, and not deploy any changes manually.

Fig. 2: Bad times

Calls into the service are protected by feedback-control wrappers, which keep track of the response times of recent calls and throttle inputs so as to maintain a target figure. This means that even though something in the system was slowing down particular responses by a large factor, the average response times were kept in check by the regulator.

If the average metric shot up too high, the regulator mechanism would refuse a fraction of inbound calls until the average fell to appropriate levels.

Fig. 3: The price paid for stability is controlled rejection of some requests (load-shedding).
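A regulator along these lines can be sketched as a small GenServer that keeps an exponentially weighted moving average (EWMA) of recent call durations and rejects a growing fraction of calls as the average drifts above a target. This is an illustrative sketch, not Grindr’s actual implementation; the module name, the 50 ms target, and the smoothing factor are all invented for the example.

```elixir
defmodule LoadRegulator do
  @moduledoc """
  Illustrative feedback-control wrapper: tracks an EWMA of call
  durations and sheds load when the average exceeds a target.
  """
  use GenServer

  @target_ms 50 # desired average response time (assumed)
  @alpha 0.1    # EWMA smoothing factor (assumed)

  def start_link(_opts), do: GenServer.start_link(__MODULE__, 0.0, name: __MODULE__)

  @doc "Run `fun` unless the regulator decides to shed this call."
  def call(fun) do
    if GenServer.call(__MODULE__, :admit?) do
      {time_us, result} = :timer.tc(fun)
      GenServer.cast(__MODULE__, {:observe, time_us / 1000})
      {:ok, result}
    else
      {:error, :overload}
    end
  end

  @impl true
  def init(avg), do: {:ok, avg}

  @impl true
  def handle_call(:admit?, _from, avg) do
    # Reject a fraction of calls proportional to how far the average
    # is above target; always admit while the average is under target.
    p_reject = max(0.0, min(1.0, (avg - @target_ms) / @target_ms))
    {:reply, :rand.uniform() >= p_reject, avg}
  end

  @impl true
  def handle_cast({:observe, ms}, avg) do
    {:noreply, @alpha * ms + (1 - @alpha) * avg}
  end
end
```

Because the regulator only observes durations and admits or rejects calls, it needs no knowledge of *why* the system is slow.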

Figuring it out

With the holidays behind us, we dove in and launched our favorite BEAM inspection tool, system_monitor:

Fig. 4: A session with system_monitor.

If something is ruining your system’s soft real-time properties, it will most likely be revealed by starting a system monitor and waiting a couple of seconds for the alerts to come in.
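Starting a monitor from an IEx shell takes a single call to `:erlang.system_monitor/2`; the 100 ms thresholds below are arbitrary example values:

```elixir
# Make the shell process the system monitor and ask the VM to send a
# message whenever any process garbage-collects for longer than 100 ms
# or occupies a scheduler uninterrupted for longer than 100 ms.
:erlang.system_monitor(self(), [{:long_gc, 100}, {:long_schedule, 100}])

# Alerts then arrive as ordinary messages shaped like
#   {:monitor, pid, :long_gc, info}
# where `info` includes the pause duration and heap sizes; in IEx they
# can be inspected with flush/0.
```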

In the abbreviated session above, we learned that long_gc (long garbage collection) events were happening in only one type of process: the Phoenix Tracker. This made us think that something was going on with the data these processes were storing in their state.

Process heaps clocked in at around 2–3 megabytes — not enough to cause slowdowns from the sheer amount of copying done at gc time. Additionally, the process message queues were empty, which meant that large message structures were not contributing to the rootset.
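Both figures are easy to check on a live node with `Process.info/2`; the snippet below uses `self()` as a stand-in where one of the Tracker pids would go:

```elixir
# Substitute a real Tracker pid for self() on a live node.
pid = self()
info = Process.info(pid, [:total_heap_size, :message_queue_len])

# Heap sizes are reported in machine words, not bytes;
# multiply by the word size (8 on a 64-bit VM) to convert.
heap_bytes = info[:total_heap_size] * :erlang.system_info(:wordsize)
```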

The Tracker processes differ significantly from all other processes in the system, not least in heap size: they are the only processes consistently using over 512 kilobytes of memory each. Heaps of this size are allocated in so-called single-block carriers (see Erlang in Anger, chapter 7, and the ERTS manual), which means that the Erlang VM calls out to the operating system for extra memory whenever it needs to adjust a Tracker heap.
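The threshold in force can be read back from a running node; the exact shape of the returned settings varies between OTP versions, so treat this as a rough probe rather than a stable API:

```elixir
# Inspect the process-heap allocator's configuration on a live node.
# The single-block carrier threshold shows up as an :sbct option
# among the per-instance allocator settings.
:erlang.system_info({:allocator, :eheap_alloc})
```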

This hinted that (de)allocating memory at the operating-system level was perhaps taking longer than usual. Perhaps because of infrastructure work being conducted on the underlying hardware stack? The Cloud is made out of Big Iron, after all.

The Moment of Truth

We had a working hypothesis but we couldn’t reproduce the symptoms in laboratory conditions, even when simulating 4× production traffic. The memory allocation slowness seemed to only occur on our production Presence cluster. Our last recourse was to actually put skin in the game: deploy changes to production and test our assumptions under real-world conditions.

We gave bumping the single-block carrier threshold a shot. Our goal was to force garbage collection of these particular processes to reuse memory in multi-block carriers, which are allocated up front by the VM, instead of single-block carriers. This way, garbage-collecting the Tracker heaps would not require calling out to the operating system, but would remain within userland code.

+MHsbct 10240   # single-block carrier threshold, in kb, up from 512
+MHasbcst 80960 # absolute single-block carrier shrink threshold (scaled proportionally)
+MHlmbcs 102400 # largest multi-block carrier size (scaled proportionally)

Fig. 5: The emulator flags in vm.args

The result was immediately visible in the form of a 2-orders-of-magnitude drop in subscription processing times. We’d effectively bypassed the Linux kernel in managing our large Tracker heaps.

Fig. 6: Drop in tail response times after re-configuring memory allocators in the Erlang VM. Note logarithmic scale.

The cost of this improvement, as one could expect, was an increase in memory usage. Since the multi-block carriers are now much larger, the memory blocks inside them are utilized less efficiently, leading to higher memory fragmentation. In our case, memory consumption went up by approximately 20%, but that’s definitely a price we can pay for reducing tail latencies by two orders of magnitude.

Fig. 7: Memory usage before and after the change

Takeaways

The cloud computing environment is full of uncertainty. Sometimes, we can understand root causes, but sometimes the system’s underlying complexity and opacity work to maintain its secrets. In the case described above, we could plausibly attribute the instability to Meltdown or Spectre mitigation work at Amazon, but hard evidence is hard to procure.

What did prove crucial in helping us power through the holiday season was our built-in load regulator, based on the principles of feedback control. Our load regulator treats the entire system as a black box, measuring only the average response times. It makes no difference which part of the stack is slowing down, be it application code, VM code, the network, or the hardware itself.

By treating the whole system as a black box, we were able to maintain uptime and reasonable responsiveness in the face of instability of unknown origin. Additionally, access to Erlang’s great introspection tools (system_monitor foremost among them) gave us good hints on where to look for concrete mitigation steps.