Troubleshooting: A journey into the unknown

Troubleshooting is a journey. It’s a long, unpredictable trek, one where you know the start and the end points but have zero knowledge about the actual path you need to take in order to get there, to find the root cause of the problem. In your backpack you have knowledge, past experience, and various troubleshooting techniques. For a systems engineer, the enjoyable task of finding the root cause of a problem often feels exactly like this journey into the unknown.


This particular journey relates to an issue we had on servers in our Distributed Load Balancing (DLB) service. The issue itself had nothing to do with load balancing, but the applicable knowledge gained was invaluable. The journey was worthwhile, as always.

The start

Here is the starting point of our journey: a sudden drop of incoming traffic to a single member of the DLB group. The other members instantly picked up the traffic in equal shares, thanks to the Equal-Cost Multi-Path (ECMP) routing we have deployed.

The rate of incoming traffic dropped to zero within a sampling interval (we pull and store metrics every 10 seconds) and traffic recovered after ~50 seconds. In our systems there are five possible reasons for these kinds of traffic drops:

  1. Switches on the north and south sides stop selecting the server as the next hop for incoming traffic due to a configuration change
  2. The Bird Internet Routing Daemon stops running on the DLB server
  3. anycast-healthchecker withdraws routes for all services because they fail their health checks
  4. The BFD protocol on switch side detects that the server isn’t sending hello messages and stops routing traffic to it
  5. Network cable or network card glitch

We examined the log of anycast-healthchecker and found out that all services were successfully responding to health checks. We then looked at the bird log and found the following:

All DLB servers are dual-homed and establish BGP peerings with the switches on the north and south sides. According to RFC 4486, the messages in bold indicate that the Bird daemon received a BGP message to reset the BGP peering due to a configuration change on the switch side.
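
For reference, RFC 4486 defines a set of subcodes for the BGP Cease NOTIFICATION message (error code 6). The lookup below is a small Python sketch of that table. The helper name describe_cease is made up for illustration and is not part of Bird. Subcode 6 appears to be the one matching the "configuration change" reason described above.

```python
# Subcodes of the BGP Cease NOTIFICATION (error code 6), as listed in RFC 4486.
CEASE_SUBCODES = {
    1: "Maximum Number of Prefixes Reached",
    2: "Administrative Shutdown",
    3: "Peer De-configured",
    4: "Administrative Reset",
    5: "Connection Rejected",
    6: "Other Configuration Change",
    7: "Connection Collision Resolution",
    8: "Out of Resources",
}


def describe_cease(subcode: int) -> str:
    """Return a readable name for a Cease subcode (hypothetical helper)."""
    return CEASE_SUBCODES.get(subcode, f"Unknown subcode {subcode}")


print(describe_cease(6))
```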

We looked at the Bird code and the switch logs, and found that the switch had requested the BGP peering reset after three consecutive BFD hello messages went missing. These messages are exchanged over UDP at an interval of 400 milliseconds, with a tolerance of no more than three missed packets (after which the session is declared down).
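
The timing above can be sketched in a few lines of Python. This is a simplified model that assumes a fixed 400 ms interval and a multiplier of 3; real BFD negotiates its intervals and adds jitter, and the function names here are illustrative only.

```python
def bfd_detection_time_ms(tx_interval_ms: int, detect_multiplier: int) -> int:
    """Time without a hello after which the peer declares the session down."""
    return tx_interval_ms * detect_multiplier


def session_would_drop(arrival_times_ms, tx_interval_ms=400, detect_multiplier=3):
    """True if any gap between consecutive hellos exceeds the detection time."""
    limit = bfd_detection_time_ms(tx_interval_ms, detect_multiplier)
    gaps = (later - earlier
            for earlier, later in zip(arrival_times_ms, arrival_times_ms[1:]))
    return any(gap > limit for gap in gaps)


print(bfd_detection_time_ms(400, 3))  # 1200 ms, i.e. 1.2 seconds
```

With these settings, a host that stalls for a little over a second is enough to lose the session.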

The DLB server hadn’t sent BFD hello messages for a period of 1.2 seconds! The most interesting part of the above log is that the failure happened concurrently on both BGP peerings, which are established over two different network cards to different physical switches.

This made us believe that something on the host caused the loss of 3 consecutive BFD messages; it’s very unlikely to have hardware issues at the same time on two different network cards, cables, or switches.

Several occurrences of the issue

The exact same issue was happening on multiple servers at random times across the day. In all occurrences we saw the same lines in the bird log. So, we knew the end of our journey; we just needed to find what prevented the system from sending three consecutive UDP packets, one every 400 milliseconds. We store logs in Elasticsearch, so we created a Kibana dashboard to visualize those errors and started investigating each occurrence.

Our servers are directly connected to the Internet, so we looked for possible short-lived attacks on the TCP layer. UDP traffic is not allowed, so we excluded the possibility of an attack with UDP traffic on ports 80/443. We didn’t notice any sudden increase in incoming TCP, ICMP, or HTTP traffic before the occurrences of the issue.

We also looked at the haproxy log for possible SSL attacks, but we didn’t notice any unusual traffic pattern. So we knew that there was nothing external to the system that could explain the problem.

The first interesting find

The next stage of our journey was haproxy itself. We use collectd for collecting system statistics and haproxystats for haproxy statistics. Both tools help us gather a lot of performance metrics about haproxy and the system as well. Furthermore, haproxy emits log messages, which contain very useful information that can help figure out what is going on in the server.

haproxy exposes CPU usage per process (we run 10 processes): we noticed a spike to 100% utilization around the same time Bird received the messages to reset the BGP peering. In the following graph we can see that all of the haproxy processes had 100% CPU utilization for at least 1 data point.

The sudden increase of CPU usage wasn’t always followed by BGP peering resets. In some cases, BFD issues were reported by Bird before those CPU spikes. Nevertheless, we continued to investigate the CPU spikes as they were very suspicious.

The CPU utilization of a process is the sum of User Level and System Level CPU usage. Thus, we needed to know if haproxy was spending all this CPU power for performing its tasks (SSL computation, business logic processing, etc.) or for asking the system to do something like dispatching data to various TCP sockets or handling incoming/outgoing connections. The two graphs below suggest that haproxy was spending CPU cycles at the system level.
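
To make the user/system split concrete, here is a minimal Python sketch that derives both percentages from two samples of /proc/<pid>/stat on Linux. It is not the collector we used (our graphs came from collectd and haproxystats); the function names are illustrative.

```python
def parse_cpu_ticks(stat_line: str) -> tuple[int, int]:
    """Return (utime, stime) in clock ticks from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may contain spaces,
    # so split on the last ')' before splitting the remaining fields.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(rest[11]), int(rest[12])


def cpu_split(before: str, after: str, elapsed_s: float, hz: int) -> tuple[float, float]:
    """Return (user %, system %) of one CPU consumed between two samples."""
    user_0, system_0 = parse_cpu_ticks(before)
    user_1, system_1 = parse_cpu_ticks(after)
    scale = 100.0 / (hz * elapsed_s)
    return (user_1 - user_0) * scale, (system_1 - system_0) * scale
```

On a real system, hz would come from os.sysconf("SC_CLK_TCK") and the two lines from reading /proc/<pid>/stat one sampling interval apart.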

This gave us a really good starting point for doing more in-depth analysis on what was causing those CPU spikes. We reviewed the haproxy configuration several times and there was nothing suspicious there. haproxy software hadn’t been upgraded recently, so we excluded a possible software regression which could have caused this behaviour.

We contacted HAProxy Technologies, Inc. for assistance. They asked us to collect more information about sessions and TCP connections, as there was a bug that could cause a high number of TCP connections in the CLOSE-WAIT state. According to them, however, that specific bug couldn’t cause CPU spikes.

We also looked at memory utilization of haproxy and there wasn’t anything suspicious there either. But, in all the occurrences we saw a sudden increase of free memory. The system freed ~600MB of memory around the same time as we’d been seeing those CPU spikes.

It wasn’t very clear to us whether those two observations (CPU spikes and the sudden increase of free memory) were the cause or a symptom of our issue. Moreover, this sudden increase of free memory could be related to a garbage collector being invoked by some other service. So, more digging was required to clear up the fog in our path.

(s)Tracing the unknown factor

We run many daemons and cron jobs on our servers. In some occurrences of our problem we saw a puppet run happening at the same time. We decided to look at what was executed on every puppet run.

We set up some scripts that ran pidstat against puppet and a few other daemons. Since the issue was happening at random times across a lot of servers, we had to pick a few servers to run those scripts on and wait for the problem to appear.

After a few days of waiting, we had several traces to analyze. Puppet really loves CPU and it can easily lock a CPU for 4–5 seconds. But it wasn’t causing our problem. Other daemons were hungry for memory and CPU resources at a level that couldn’t explain the sudden increase of free memory.

The HAProxy support department suggested deploying a script that would run strace and dump sessions whenever haproxy’s system-level CPU usage went beyond 30%. The script below was deployed on a single server and was manually invoked for all haproxy processes.

We deployed a script to dump the number of connections, using the ss tool, and the sar tool was adjusted to capture CPU, memory, and network statistics. All those tools were gathering information every second. Since it only takes 1.2 seconds for a BFD session to be detected as down, we had to gather information on such a small interval.
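
As an illustration of the kind of per-second connection snapshot described above, the following Python sketch counts TCP sockets per state from the text of /proc/net/tcp. It is a stand-in and not the ss-based script we deployed; it covers IPv4 only and names just a subset of the kernel’s state codes.

```python
from collections import Counter

# A subset of the kernel's TCP state codes as they appear in /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED",
    "06": "TIME_WAIT",
    "08": "CLOSE_WAIT",
    "0A": "LISTEN",
}


def count_tcp_states(proc_net_tcp: str) -> Counter:
    """Count sockets per TCP state from the contents of /proc/net/tcp."""
    counts = Counter()
    for line in proc_net_tcp.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) > 3:
            # Column 4 holds the state; unknown codes are kept as raw hex.
            counts[TCP_STATES.get(fields[3], fields[3])] += 1
    return counts
```

Calling count_tcp_states(open("/proc/net/tcp").read()) once per second would give a rough view of, for example, how many sockets sit in CLOSE_WAIT.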

While we were waiting for the problem to appear on the target machine, we decided to move the Bird daemon to a CPU core that wasn’t used by haproxy. The 10 haproxy processes are pinned to the last 10 CPUs of the system, so we pinned the Bird daemon to CPU 1 and assigned it a nice level of -17. We did that to make sure it had enough resources to process BFD messages while haproxy was spinning at 100% CPU utilization. We also lowered the CPU priority of the puppet agent so that it would use fewer CPU resources.
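
Pinning a process and changing its nice level can be sketched in Python as follows. This is an assumption-laden illustration, not the commands we ran: it is Linux only, lowering a nice value requires root, and bird_pid in the comment is a placeholder. On Linux the affinity call applies to a single thread, which is adequate only for a single-threaded daemon.

```python
import os


def pin_and_renice(pid: int, cpus: set[int], nice: int) -> None:
    """Restrict a process to the given CPUs and set its nice level (Linux only)."""
    os.sched_setaffinity(pid, cpus)
    os.setpriority(os.PRIO_PROCESS, pid, nice)


# Example matching the settings described above (requires root;
# bird_pid is a placeholder for the daemon's process ID):
# pin_and_renice(bird_pid, {1}, -17)
```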

Light at the end of the tunnel

The issue appeared on the target server and our tracing tools ended up collecting 150MB of data (very close to 10 million lines to read!). We analysed the pidstat, sar, ss and strace outputs and made the following observations together with Willy Tarreau, the author of HAProxy and a Linux kernel developer (note that the high CPU utilization started at 12:41:21 and lasted till 12:41:30):

No change in incoming requests per second prior to the problem:

haproxy stopped for ~500 milliseconds in the middle of some operations, indicating that it was interrupted:

High rate of page free / cache free activity, system was freeing 600MB of RAM per second for a period of ~6 seconds:

System had 92% of the memory already allocated prior to the issue and started freeing a second later:

CPU saturation at system level while the system was freeing memory:

Some low-rate activity for page swapping:

haproxy was writing data, which is odd: it shouldn’t know how, considering that when the service starts, it closes all the file descriptors related to all the files that can cause I/O operations to the filesystem:

All the haproxy processes started to do some minor page faults. They’d touched a free memory area for the first time since that area was last reclaimed:

Memory usage of haproxy processes remained stable and didn’t change one second later, showing that it was just touching memory that was aggressively reclaimed by the system:

Willy Tarreau also inspected session information as they were dumped from haproxy memory and didn’t find anything unusual. He finished his investigation with the following:

  1. virtual machines using memory ballooning to steal memory from the processes and assign it to other VMs. But from what I remember you don’t run on VMs (which tends to be confirmed by the fact that %steal is always 0)
  2. batched log rotation and uploading. I used to see a case where logs were uploaded via an HTTP POST using curl which would read the entire file in memory before starting to send, that would completely flush the cache and force the machine to swap, resulting in random pauses between syscalls like above, and even packet losses due to shortage of TCP buffers.

Given the huge amount of cache thrashing we’re seeing (600 MB/s), I tend to think we could be witnessing something like this. The fact that haproxy magically pauses between syscalls like this can be explained by the fact that it touches unmapped memory areas and that these ones take time to be allocated or worse, swapped in. And given that we’re doing this from userspace without any syscall but consecutive to a page fault instead, it’s accounted as user CPU time.

I also imagined that one process could be occasionally issuing an fsync() form (after a log rotation for example), paralyzing everything by forcing huge amounts of dirty blocks to the disks; that didn’t seem to be the case and there wasn’t ever any %iowait in sar reports, implying that we weren’t facing a situation where a parasitic load is bugging us down in parallel.

Another point fueling the theory of memory shortage is sar’s output (again) showing that the memory was almost exhausted (92% including the cache) and that it started getting better at the exact same second the incident happens.

To sum up: memory shortage led to a sudden and high-rate freeing of memory which locked all CPUs for ~8 seconds. We knew that high CPU usage from haproxy was the symptom and not the cause.
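
The "92% including the cache" figure can be approximated from /proc/meminfo. The Python sketch below is an assumption about how such a number is derived (total minus free, with cache and buffers counted as used); the exact counters sar uses depend on the sysstat version, and the function names are illustrative.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Map each /proc/meminfo key to its numeric value (kB for most keys)."""
    info = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if value.strip():
            info[key.strip()] = int(value.split()[0])
    return info


def memory_used_percent(info: dict[str, int], include_cache: bool = True) -> float:
    """Percentage of memory in use, optionally counting buffers and page cache."""
    used = info["MemTotal"] - info["MemFree"]
    if not include_cache:
        used -= info.get("Buffers", 0) + info.get("Cached", 0)
    return 100.0 * used / info["MemTotal"]
```

Comparing the two variants shows how a machine can look nearly full while its processes use only a small share of the memory.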

Finding the memory eater(s)

What triggered our system to free memory at such high rate (600MB/s) and why was this so painful for our system? Why did the kernel use so much memory (~92%) for caches while active memory was always below ~8GB? There were many questions to answer, which brought us back to tracing mode.

Willy suggested issuing echo 1 > /proc/sys/vm/drop_caches upon log rotation, which we did on all servers. We also issued echo 2 > /proc/sys/vm/drop_caches once in two DLB groups. Both of these actions calmed our issue down, but only for a short time.
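
For readers unfamiliar with drop_caches, the Python sketch below shows what the two echo commands do. The constant and function names are made up for illustration; the path parameter exists only so the function can be tried against an ordinary file, and writing to the real file requires root.

```python
import os

# Values accepted by /proc/sys/vm/drop_caches, per the Linux kernel documentation.
PAGE_CACHE = 1    # free the page cache
SLAB_OBJECTS = 2  # free reclaimable slab objects such as dentries and inodes
BOTH = 3          # free both


def drop_caches(mode: int, path: str = "/proc/sys/vm/drop_caches") -> None:
    """Ask the kernel to drop clean caches by writing the mode to path."""
    if mode not in (PAGE_CACHE, SLAB_OBJECTS, BOTH):
        raise ValueError(f"unsupported drop_caches mode: {mode}")
    os.sync()  # dirty pages are not dropped, so flush them to disk first
    with open(path, "w") as handle:
        handle.write(f"{mode}\n")
```

The operation only discards clean, reclaimable data, which is why it relieves the pressure for a while without removing whatever keeps filling the caches.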

From the many processes running on our servers, we picked 5 with the highest resident memory (RSZ) and started monitoring them very closely with pidstat. We also started monitoring memory activities, noticing a high number of entries for dentry objects in the cache:

atop was also reporting ~100% of SLAB memory as reclaimable memory:

The output of the tracing tools we had put in production didn’t provide many useful indicators of which process(es) could be causing such high memory consumption for caches. The haproxy log (which is rotated every hour) had ~3.5GB of data, and dropping page caches upon log rotation excluded rsyslogd from the investigation as well.

We started to read documentation about memory management and realized that our system may not be tuned correctly, considering that our servers have 96GB of total memory, only ~8GB of active memory and have the following memory settings in place:

  • vm.vfs_cache_pressure set at 100
  • vm.dirty_background_ratio set at 3
  • vm.dirty_ratio set at 10

So, the system had a lot of free memory to use for caches — which it did, and it wasn’t aggressively reclaiming memory from caches even when dentry objects in cache occupied 80GB. That led the system to have around 800MB in free memory in some cases.

We changed vm.vfs_cache_pressure to 200 and freed reclaimable slab objects (which include dentries and inodes) by issuing echo 2 > /proc/sys/vm/drop_caches. We started to see more free memory available (~7GB) after 2 days and then we increased vm.vfs_cache_pressure to 1000. That made the system reclaim memory more aggressively, and the issue was almost entirely resolved.
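
To check which values a host is actually running with, the tunables named above can be read from /proc/sys. The following Python sketch assumes the usual mapping from dotted sysctl names to paths; the helper names are illustrative and the root parameter exists only to make the functions easy to try against a scratch directory.

```python
from pathlib import Path

# The three tunables discussed in this section.
TUNABLES = ("vm.vfs_cache_pressure", "vm.dirty_background_ratio", "vm.dirty_ratio")


def sysctl_path(name: str, root: str = "/proc/sys") -> Path:
    """Translate a dotted sysctl name into its path under root."""
    return Path(root, *name.split("."))


def read_sysctl(name: str, root: str = "/proc/sys") -> str:
    """Return the current value of a sysctl as a string."""
    return sysctl_path(name, root).read_text().strip()


def report(root: str = "/proc/sys") -> dict[str, str]:
    """Collect the current values of the tunables listed above."""
    return {name: read_sysctl(name, root) for name in TUNABLES}
```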

We continued our investigation in the area of dentry caches and found this bug report for the curl tool. The bug report states that when curl makes an HTTPS request, it issues many access system calls for files that don’t exist, have random names and are unique per invocation:

We knew that we use curl in the check_cmd for each service check in the anycast-healthchecker daemon, and that each check runs every 10 seconds for ~10 services. So, we fired up a one-liner to plot the number of dentry objects in cache per second:
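
The original one-liner is not reproduced here. A rough Python equivalent, assuming the counters in /proc/sys/fs/dentry-state are an acceptable proxy for the dentry objects in the slab cache, could look like this (function names are illustrative):

```python
import time


def parse_dentry_state(text: str) -> tuple[int, int]:
    """Return (nr_dentry, nr_unused) from /proc/sys/fs/dentry-state contents."""
    fields = text.split()
    return int(fields[0]), int(fields[1])


def watch_dentries(samples: int = 10, interval_s: float = 1.0,
                   path: str = "/proc/sys/fs/dentry-state") -> None:
    """Print the dentry count once per interval with the change since the last sample."""
    previous = None
    for _ in range(samples):
        with open(path) as handle:
            total, unused = parse_dentry_state(handle.read())
        delta = 0 if previous is None else total - previous
        print(f"{time.strftime('%H:%M:%S')} dentries={total} unused={unused} delta={delta:+d}")
        previous = total
        time.sleep(interval_s)
```

A delta that stays positive from one sample to the next while the host is otherwise idle points at something creating dentries at a steady rate.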

In the following graph we can see that the number of dentry objects was increasing at a high and constant rate:

Bingo! We found the tool which was polluting dentry cache. Finally, we see our destination; time to prepare the cake.

The fix was very easy — just setting the environment variable NSS_SDB_USE_CACHE to YES was enough:
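
As a sketch of how a health check could pass the variable to curl, the Python below builds the child environment and runs the command. It is not our actual check_cmd; the function names, the timeout and the choice of curl options are assumptions, and the variable is only expected to matter for curl builds linked against NSS.

```python
import os
import subprocess


def curl_env(base=None) -> dict:
    """Copy an environment and set NSS_SDB_USE_CACHE for child processes."""
    env = dict(os.environ if base is None else base)
    env["NSS_SDB_USE_CACHE"] = "YES"
    return env


def check_url(url: str, timeout_s: int = 5) -> bool:
    """Run curl against url with the variable set; True when curl exits with 0."""
    result = subprocess.run(
        ["curl", "--silent", "--fail", "--output", os.devnull,
         "--max-time", str(timeout_s), url],
        env=curl_env(),
        check=False,
    )
    return result.returncode == 0
```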

We adjusted the check_cmd of each service check in anycast-healthchecker that used the curl tool, and we also modified a cron job that was running curl many times against an HTTPS site. In the following graph, we can clearly see that the pollution stopped, as the number of dentry objects in the cache was no longer increasing:

Conclusions

  • Your sampling intervals for statistics may hide problems. In our case, we collect metrics every 10 seconds and we were able to (eventually) see the issue clearly. If you collect metrics every minute on systems that receive ~50K requests per second, you won’t be able to see problems that last less than a minute. In other words, you are flying blind. Choose the metrics to collect and pick the intervals very wisely.
  • Abnormal system behaviors must be investigated and the root cause must be found. This secures the stability of your system.
  • Reshuffling TCP connections when a single member disappears from and reappears in the ECMP group didn’t impact our traffic as badly as we initially thought it would.
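
The first point can be illustrated with a small Python calculation. The numbers are made up (a 10% CPU baseline with a 9-second burst at 100%) and only loosely modelled on the spikes described above.

```python
def window_averages(samples, window):
    """Average per-second samples over consecutive, non-overlapping windows."""
    return [sum(samples[start:start + window]) / window
            for start in range(0, len(samples) - window + 1, window)]


# One minute of per-second CPU readings: 10% baseline, 9 seconds at 100%.
cpu = [10] * 60
cpu[20:29] = [100] * 9

print(max(window_averages(cpu, 10)))  # 10-second sampling -> 91.0
print(max(window_averages(cpu, 60)))  # 60-second sampling -> 23.5
```

With 10-second windows the burst is unmistakable; averaged over a full minute it shrinks to a bump that is easy to dismiss.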

I would like to thank Marcin Deranek, Carlo Rengo, Willy Tarreau and Ralf Ertzinger for their support in this journey.

