I can reproduce a 15-year-old WTF with TASK_UNINTERRUPTIBLE

Finally, I can do it in a pure software environment, at will!

George Shuklin
OpsOps
Sep 9, 2020


I am almost speechless with elation! Finally! I can reproduce the most HATED bug in the kernel. The bug which haunts only highly loaded production systems with costly hardware, and only under rare conditions!

Throughout my professional life I have occasionally stumbled upon it. All 15 years of my career. It was impossible to reproduce, impossible to fix and impossible to live with. There were only two ways out of this bug: either kill the server yourself or ask the kernel to do it for you. Moreover, the bug is so nasty that a normal shutdown does not work; the system hangs forever. There was a special ritual for this type of error: echo b > /proc/sysrq-trigger, and you had to type it without auto-completion (or else your console was dead, if the bug affected the disk under the root filesystem).
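For reference, the whole ritual fits in two lines; the only extra step worth remembering is that magic SysRq may be disabled by default on some distributions:

    # make sure magic SysRq is enabled (1 = all functions allowed)
    echo 1 > /proc/sys/kernel/sysrq

    # immediately reboot, without syncing or unmounting anything
    echo b > /proc/sysrq-trigger

Both writes go only to procfs, which is why the ritual still works when the root filesystem itself is stuck in D+.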

The bug description

Under unclear conditions some SAS HBAs (mostly LSI, but I got this from Adaptec once or twice too) cause a massive outage for all devices on a specific SCSI host (a scsi host is not ‘the server’, it’s the scsi-specific name for an HBA port on the scsi bus). Usually it happens when a SATA drive in a SAS enclosure dies in a specific way in the middle of a transaction, but I have seen it in other configurations too. When this happens, not only is the ‘buggy device’ dead, but all other devices on this bus are dead too. Moreover, they are not just ‘dead’: any access to those devices puts the accessing process into the TASK_UNINTERRUPTIBLE state (D+ in ps/top), and keeps it there forever. While a process is in this state, it is impossible to:

  1. Kill it (kill -9 is deferred until the process returns from the D+ state; if it never returns, the kill is ignored forever).
  2. Stop it (Ctrl-Z no longer works). As far as I understand, it’s the same story as with kill -SIGKILL, only with kill -SIGSTOP.

On the practical side that means that if your application ever tried to access the device (any device in the whole affected enclosure), you can’t Ctrl-Z it. If the affected device is the underlying block device of your root drive, then every non-cached filesystem access causes the same problem, including an innocent cd. For example, autocompletion may want to scan some directories or write some temporary files. One TAB and the console is dead forever. Obviously, shutdown procedures do not work either, because they try to execute files from the root filesystem, which is ‘D+ forever’.

Additionally, if the application in the HUNG (D+) state has multiple threads, only one thread locks up. The other threads are free to do anything except exit or process signals. That means that in a cluster environment, if one thread is stuck in the D+ state, the application can respond to heartbeats and other network requests, but can’t do any work. Moreover, and this is why I say it ‘haunts’, it CANNOT EXIT. It can’t work and it can’t stop. Many cluster solutions totally do not expect this (if you can’t work, just die, please!).
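A quick way to watch this from the outside is plain ps; the awk filters below are just an illustration:

    # processes stuck in uninterruptible sleep (state starts with D),
    # plus the kernel function they are waiting in (wchan)
    ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'

    # per-thread view: a multi-threaded application may show exactly
    # one thread in D while the rest look perfectly alive
    ps -eLo pid,tid,stat,comm | awk '$3 ~ /^D/'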

So the only solution is a hard reboot: either via the kernel’s ‘b’ magic, or by means of out-of-band power management (IPMI/DRAC/iLO/PDU API).
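The out-of-band route is the usual ipmitool incantation (the BMC address and credentials below are placeholders):

    # power-cycle the box through its BMC; this works regardless of
    # what the OS thinks, because the OS is not involved at all
    ipmitool -I lanplus -H 192.0.2.10 -U admin -P secret chassis power cycle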

I’ve reproduced it in SOFTWARE!

Finally. No expensive SCSI enclosures with dozens of drives. No more high load. No specific hardware at all!

The simple and humble null_blk driver allows me to create those processes!

Warning: don’t do this on production or on your laptop. Use a spare VM (a VM on your laptop is fine, as this is a purely software thing).
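The whole trick is a couple of commands. A minimal sketch, assuming the stock null_blk module (completion_nsec is the value discussed below; the irqmode=2 timer mode and the dd access are my illustration):

    # create /dev/nullb0 where every request "completes" only after
    # completion_nsec nanoseconds (here: 100 seconds)
    modprobe null_blk irqmode=2 completion_nsec=100000000000

    # any non-cached access now sleeps in TASK_UNINTERRUPTIBLE for ~100 s per request
    dd if=/dev/nullb0 of=/dev/null bs=4k count=1 iflag=direct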

That’s all. After it finishes (or in a parallel console, as it takes a long time to finish), you can try to access the /dev/nullb0 device.

And, … D+ is awaiting you!

It shows all the symptoms of a properly hung system, but with a reasonable timeout (completion_nsec=100000000000 is 100 s). If you want, you can make it as big as you like.

Now it’s possible to test software against all the possible D+ crap. Finally! I’m so happy!

Is this a bug?

The thing is that most kernel developers don’t think of the D+ state as a bug. It’s a ‘special state’, and they assume it’s a short event: when it finishes, everything continues to work just fine. The truth is that sometimes some drivers cause this state without any finite timeout. It’s the “driver’s fault”, they say. Unfortunately, this ‘not a bug’ thing causes huge headaches in production and is impossible to fix.

Driver developers ignore this problem too, because they can’t reproduce it. Of course they can’t. They need buggy hardware for that and tons of IO. Have you ever seen a kernel developer’s machine with 30+ drives, an uptime of a few months and a sustained load of 100k+ IOPS? I never have. They can’t reproduce it, therefore there is no bug.

The main issue here is that under some conditions one piece of buggy hardware (a half-dead SATA drive) causes an outage for all the other devices on the bus.

(For gory details: if your enclosure has an LED subsystem, access to that subsystem through /sys can cause the D+ state too!)
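For a taste of what that looks like, even blinking a locate LED goes through the same SCSI host; the exact slot path varies per enclosure, the one below is only an example:

    # enclosure LEDs are exposed by the ses driver under /sys/class/enclosure;
    # this write can also end up in D+ if the host is wedged
    echo 1 > /sys/class/enclosure/0:0:5:0/Slot01/locate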

So it’s no one’s problem but the operators’. And I can finally reproduce it! Hurray!
