Linux Kernel bug hunting

Pavlos Parissis
May 25, 2018 · 7 min read

The wonderful journey of troubleshooting has just started :-)

We had zero information about what could possible cause this. So, we thought to start the investigation by looking at system statistics. The interval of pulling those statistics was too high and as a result the statistics hid valuable information.

  • kernel.softlockup_panic = 1
bnx2x: [bnx2x_clean_tx_queue:1205(eth0)]timeout waiting for queue[0]: txdata->tx_pkt_prod(19537) != txdata->tx_pkt_cons(19447)bnx2x: [bnx2x_state_wait:310(eth0)]timeout waiting for state 9watchdog: BUG: soft lockup — CPU#7 stuck for 22s! [kube-apiserver:63429]

The first interesting find

The system statistics per second revealed something very interesting. We looked at all system metrics with the sar tool and we found that the CPU utilization at system level increased to 100% just a few seconds before the crash and only for the CPU mentioned in the crash:

sar -f sa14 -s 08:12:00 -P 41 <← this the CPU mentioned in the crash
Linux 4.14.32–1.el7.x86_64 04/14/2018 _x86_64_ (56 CPU)
08:12:00 AM CPU %user %nice %system %iowait %steal %idle
08:12:01 AM 41 0.00 0.00 0.00 0.00 0.00 100.00
08:12:02 AM 41 0.99 0.00 0.99 0.00 0.00 98.02
08:12:03 AM 41 1.00 0.00 0.00 0.00 0.00 99.00
08:12:04 AM 41 0.00 0.00 0.00 0.00 0.00 100.00
08:12:05 AM 41 0.00 0.00 0.00 0.00 0.00 100.00
08:12:06 AM 41 0.00 0.00 0.00 0.00 0.00 100.00
08:12:07 AM 41 0.99 0.00 0.99 0.00 0.00 98.02
08:12:08 AM 41 0.00 0.00 0.00 0.00 0.00 100.00
08:12:09 AM 41 1.00 0.00 0.00 0.00 0.00 99.00
08:12:10 AM 41 0.00 0.00 0.00 0.00 0.00 100.00
08:12:11 AM 41 0.00 0.00 0.00 0.00 0.00 100.00
08:12:12 AM 41 1.00 0.00 0.00 0.00 0.00 99.00
08:12:13 AM 41 0.00 0.00 0.00 0.00 0.00 100.00
08:12:14 AM 41 0.00 0.00 0.00 0.00 0.00 100.00
08:12:15 AM 41 0.00 0.00 0.99 0.00 0.00 99.01
08:12:16 AM 41 1.00 0.00 0.00 0.00 0.00 99.00
08:12:17 AM 41 1.00 0.00 0.00 0.00 0.00 99.00
08:12:18 AM 41 0.00 0.00 0.00 0.00 0.00 100.00
08:12:19 AM 41 0.00 0.00 1.00 0.00 0.00 99.00 <<---
08:12:20 AM 41 0.99 0.00 54.46 0.00 0.00 44.55 <<--- CPU Spike
08:12:21 AM 41 0.00 0.00 100.00 0.00 0.00 0.00 <<---
08:12:22 AM 41 0.00 0.00 100.00 0.00 0.00 0.00
08:12:23 AM 41 0.00 0.00 100.00 0.00 0.00 0.00
08:12:24 AM 41 0.00 0.00 100.00 0.00 0.00 0.00
08:12:25 AM 41 0.00 0.00 100.00 0.00 0.00 0.00
08:12:26 AM 41 0.00 0.00 100.00 0.00 0.00 0.00
08:12:27 AM 41 0.00 0.00 100.00 0.00 0.00 0.00
08:12:28 AM 41 0.00 0.00 100.00 0.00 0.00 0.00
08:12:29 AM 41 0.00 0.00 100.00 0.00 0.00 0.00
08:12:30 AM 41 0.00 0.00 100.00 0.00 0.00 0.00
08:12:31 AM 41 0.00 0.00 100.00 0.00 0.00 0.00
08:12:32 AM 41 0.00 0.00 100.00 0.00 0.00 0.00

What is next?

We have a diverse hardware profile for our Kubernetes master servers, where we use Dell and HP servers with different network cards. So we excluded a hardware problem or firmware issue. Needless to say, we went through the manual process of confirming that we use the latest firmware for cards, BIOS and other devices. Moreover, we have servers in the Distributed Load Balancing service that run the same kernel but we hadn’t observed any crashes there. So we thought the issue could be related with the type of the workload that we run on master servers for Kubernetes clusters.

Approaching Kernel developers

We tried (sometimes unsuccessfully) to resolve the addresses in the call traces, and we saw recent changes (kernel version 4.12) in that code area, leading us to believe it might be a kernel bug. However, the percentage of trust on this statement is very low as we don’t understand the internals of the kernel.

Conclusion

Hunting kernel bugs does NOT require deep understanding of the kernel internals. It only requires from you to follow standard troubleshooting techniques and communicate the issue in a clear way to the right people.


Zero errors policy

During this troubleshooting, we saw kernel errors during the boot process. At Booking.com, we have a zero errors policy for our systems. Thus, we decided to address those errors. We contacted the maintainers of the kernel subsystem which reported the error, see 4.14.x yields errors for EDAC thread. A developer replied to us within a day, and after several e-mails exchanges, we found the issue.

Booking.com Infrastructure

Core Infrastructure in Booking.com

Pavlos Parissis

Written by

Booking.com Infrastructure

Core Infrastructure in Booking.com