Ceph RBD Performance Debug Journey

Xiaolong Jiang
4 min read · Jul 22, 2021


If you are using Ubuntu Bionic or later and RBD as your block device, you are probably going to hit this performance problem. If so, this article is for you!

The way we set up RBD in production today is to create an RBD device on one machine, build XFS on top of that block device, and then export the XFS filesystem over NFS to our customers.
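Roughly, the setup looks like this (pool, image, size, and export path below are placeholders, not our real names):

```
# Create and map an RBD image (pool/image names and size are placeholders)
rbd create mypool/fileserver --size 10T
rbd map mypool/fileserver            # exposes e.g. /dev/rbd0

# Build XFS on the mapped device and mount it
mkfs.xfs /dev/rbd0
mkdir -p /export/data
mount /dev/rbd0 /export/data

# Export the mount over NFS
echo '/export/data *(rw,sync,no_subtree_check)' >> /etc/exports
exportfs -ra
```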

Everything worked until one day we onboarded more users. The NFS daemons started going into the uninterruptible D state, and clients saw degraded performance because the server threads were unresponsive. There was no clear indication of what was going on. To unblock customers immediately, we failed over and stood up a new file server. After a while, however, we hit the same issue again. We captured stack traces and dmesg output for further debugging, and concluded that failover would not help us or really buy us much time. We had to put all our effort into debugging and putting out the fire.

There are multiple reasons an NFS daemon can go into the D state: inactive Ceph PGs, a bad network, an outlier OSD, and so on. After checking all of these, however, we could not reach a conclusion; everything looked reasonable.

Another observation was that Ceph was serving a very high 5K IOPS at only 50 MB/s of throughput. I could not explain this. 5K IOPS is not a big deal by itself, but it is far too much for 50 MB/s of throughput, and based on our use cases it was unlikely the clients were doing such small IO. So what was really going on?
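A quick back-of-the-envelope check shows how small the average IO must have been:

```
# 50 MB/s spread over 5,000 IOPS ≈ 10 KB per IO — far below the 4 MB RBD object size
echo $((50 * 1024 * 1024 / 5000))   # prints 10485 (bytes per IO)
```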

Next, I knew I had to reproduce the issue; I could not debug and make changes directly in production.

To simulate customer traffic, I used fio with different IO sizes. Fortunately, I was able to reproduce the problem even with 1 MB sequential writes. This was good news: being able to reproduce the issue was a big milestone, giving me a repeatable way to monitor whatever stats I wanted.
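The fio job looked roughly like this (the exact parameters here are a reconstruction, not the original command):

```
# 1 MB sequential writes against the NFS client mount
fio --name=seqwrite \
    --directory=/mnt/nfs \
    --rw=write \
    --bs=1M \
    --size=10G \
    --numjobs=4 \
    --group_reporting
```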

I looked at nfsiostat from the client's perspective, and everything was in line with what I expected: the client was sending bytes at the configured wsize. I even mounted the NFS export on the same machine running the NFS daemon, but unfortunately the problem was still there. Then I used iosnoop to track the IO size and latency going into the rbd block device, and noticed that the IO size dropped to 4 KB when performance degraded, generating the unexpectedly high IOPS. I would have expected the IO size to be 4 MB, since the RBD object size is 4 MB. So this was strange.
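For reference, the two views came from something like the following (iosnoop is from Brendan Gregg's perf-tools; the device name is a placeholder):

```
# On the NFS client: per-mount IO sizes and latency, refreshed every 5 seconds
nfsiostat 5

# On the file server: per-IO size and latency hitting the rbd block device
./iosnoop | grep rbd0
```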

Exhausted from reproducing the issue again and again, we went back to the nfsd stack traces and noticed that every IO was doing an fsync. Below is the stack trace.
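One way to capture kernel stacks like this is to dump them from /proc for every nfsd thread stuck in the D state (a sketch; the process-name filter is an assumption):

```
# Dump kernel stacks of nfsd threads currently in uninterruptible sleep (D state)
for pid in $(ps -eo pid,stat,comm --no-headers | awk '$2 ~ /^D/ && $3 == "nfsd" {print $1}'); do
    echo "=== nfsd pid $pid ==="
    sudo cat /proc/$pid/stack
done
```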

So we tried switching the NFS daemon to async mode. This quickly solved the problem, since IO is then batched in the XFS page cache and the flusher thread writes the page cache to the block device in 4 MB IOs.
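Switching to async is just a change to the export options, followed by a reload (path and options here are placeholders):

```
# /etc/exports: async lets nfsd acknowledge writes before they reach the rbd device
#   /export/data *(rw,async,no_subtree_check)
# then reload the export table:
exportfs -ra
```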

However, async has its drawbacks: if the file server crashes, we can lose in-flight bytes that have not yet been flushed to the block device.

So we needed to keep investigating why fsync was slowing things down so much. Even with an fsync per IO, Ceph should still be able to provide good throughput thanks to its horizontally scalable nature.

Looking further into the stack trace above, I noticed the threads were all blocked on rq_qos_wait. What on earth is this? Why do threads need to wait? IO should just go to the block device without any waiting. This was suspicious.

Then I cloned the Bionic kernel source and started reading the NFS and block-layer code, and found the throttling code that shows up in the stack trace. Taking a more detailed look, I found that it throttles based on IO latency. For our RBD setup, latency is naturally high, since it is a network-based block device rather than a local SSD or NVMe drive. This was encouraging! This is the Linux patch that added the feature.
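The feature is the block layer's writeback throttling (WBT). Whether it is active on a device, and what latency target it is using, shows up in sysfs:

```
# Writeback-throttling latency target for the rbd device:
# a positive value (in microseconds) means WBT is active; 0 means it is disabled
cat /sys/block/rbd0/queue/wbt_lat_usec
```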

This feature works well on local SSDs, but definitely not on our RBD setup. So the next step was either to tune the threshold or to disable the feature completely. With everything I had learned, I reached out to our amazing performance team, hoping their experience would provide an extra pair of eyes. The Slack thread ran about 200 messages back and forth. Eventually we nailed it and found the exact config to disable the feature. It turned out the feature was a surprise to our performance engineer as well, since it was added to the kernel relatively recently. We both learned quite a bit from each other.

To disable this latency-based throttling, we need to run echo 0 > /sys/block/<block>/queue/wbt_lat_usec. After disabling it, I ran my previous fio command and it worked like a charm; performance was great.
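Spelled out for a mapped rbd device, plus one possible way to persist the setting across device maps and reboots (the udev rule below is a suggestion, not necessarily what we deployed):

```
# Disable latency-based writeback throttling for the rbd device
echo 0 > /sys/block/rbd0/queue/wbt_lat_usec

# One way to make the setting persistent: a udev rule for all rbd devices
cat > /etc/udev/rules.d/99-rbd-wbt.rules <<'EOF'
ACTION=="add|change", KERNEL=="rbd*", ATTR{queue/wbt_lat_usec}="0"
EOF
udevadm control --reload-rules
```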

Finally, we were able to use the nfsd sync option for consistency while still achieving great performance.

It is fun (and at times desperate) to debug something that feels hard at first; I felt challenged and improved over the journey. Feel free to reach out to me if you want to learn more details.


Xiaolong Jiang

Staff Software Engineer at Kolena. Ex Netflix. Ex Apple. Ex Clari. Ex eBay