Linux FUSE File System Performance Learning

Xiaolong Jiang
4 min read · Oct 10, 2020


Recently I worked on improving the performance of a FUSE file system. There has already been a lot of research in this area, for example https://www.fsl.cs.stonybrook.edu/docs/fuse/fuse-tos19-a15-vangoor.pdf

I learned quite a bit along the way, from how FUSE works to which areas can be improved.

By default, the latest FUSE options are already pretty good: multi-threading, 128 KB "big write", 128 KB read-ahead, and so on. What I found is that for small 4 KB writes, performance is only about 20% of the native file system. This is close to the worst-case scenario for both single-job and multi-job runs with the fio benchmark tool. The reason is that each 4 KB write triggers one getxattr call to fetch the security capability (the security.capability extended attribute), which doubles the write traffic: one getxattr plus the write itself. Right now there is no way to cache this call on the kernel side, so each one involves a kernel/user-space context switch and occupies a slot in the FUSE queue, which in turn slows down regular read/write calls. There is some research work at https://github.com/extfuse/extfuse, but it requires kernel changes to support eBPF. It is still research, and those kernel changes are unlikely to be merged into mainline in the next couple of years. So I haven't found any way to avoid this call if we want to support the extended attribute feature.
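The getxattr call itself cannot be skipped, but the daemon can at least answer the security.capability probe cheaply. Below is a minimal sketch using the libfuse3 high-level API; the handler name and the "no xattrs stored" assumption are mine for illustration, not from any particular FUSE implementation.

```c
#define FUSE_USE_VERSION 31
#include <fuse.h>
#include <errno.h>
#include <string.h>

/* Hypothetical getxattr handler. Every small write from the kernel is
 * preceded by a getxattr("security.capability") probe, so this path
 * should be as cheap as possible. Here we assume the file system does
 * not store that attribute and answer immediately with ENODATA. */
static int myfs_getxattr(const char *path, const char *name,
                         char *value, size_t size)
{
    (void)path; (void)value; (void)size;

    if (strcmp(name, "security.capability") == 0)
        return -ENODATA;   /* no capability set on this file */

    /* ... look up other attributes in the real backing store ... */
    return -ENODATA;
}
```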

Enough blockers. So how do we improve? It turns out there is an option called the "writeback" cache, which is FUSE's use of the page cache for writes. By default, FUSE uses a write-through cache, which forwards each write request to the FUSE layer based on the IO size and page size. With a 4 KB IO pattern, every 4 KB write is forwarded into the FUSE layer and placed on the FUSE queue, which is quite expensive. With the writeback cache, instead of forwarding requests to the FUSE layer immediately, data is cached in the page cache and a writeback thread later flushes the dirty pages to the FUSE layer. The beauty of this is that it can batch 32 4 KB pages into one big 128 KB request for sequential writes. Over a long run, this batching significantly reduces the number of write requests, although we still cannot avoid the getxattr call. With writeback enabled, we were able to double the throughput.

Note that the writeback cache does not help with random writes, since there is no way to batch them into a single write to the FUSE layer; you will see performance jump very high in the first few seconds thanks to the cache, then quickly drop due to constant flushing later on. Also be aware that with the writeback cache there may be bytes sitting in the page cache that have not yet been flushed to the FUSE layer, so if your FUSE daemon crashes, you may lose those bytes.
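Enabling the writeback cache is a one-line change in the daemon's init callback. This is a minimal sketch against the libfuse3 API; checking conn->capable first is my own defensive habit, since older kernels may not offer the capability.

```c
#define FUSE_USE_VERSION 31
#include <fuse.h>

/* Hypothetical init callback: ask the kernel for the writeback page
 * cache so small sequential writes get batched into larger requests. */
static void *myfs_init(struct fuse_conn_info *conn, struct fuse_config *cfg)
{
    (void)cfg;

    if (conn->capable & FUSE_CAP_WRITEBACK_CACHE)
        conn->want |= FUSE_CAP_WRITEBACK_CACHE;

    return NULL;   /* private_data for later callbacks, unused here */
}
```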

While comparing fuse2 and fuse3 passthrough performance, there is a difference in mixed read/write mode. It turns out that in fuse3 there are extra getattr calls, causing more traffic into the FUSE queue. The reason for this call is that FUSE needs to check whether the file has changed so it can invalidate the page cache. However, this is unnecessary for a FUSE file system whose reads and writes all go through the same Linux box, since a write will update the read page cache anyway. It is mainly useful for network-based FUSE file systems, where the page cache needs to be invalidated when external changes are made to the file system. To disable this check, we need to clear the flag FUSE_CAP_AUTO_INVAL_DATA.
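Clearing the capability can be done in the same init callback, as in the sketch below. Whether this is safe depends on the assumption above: the file system is only ever modified through this one mount.

```c
#define FUSE_USE_VERSION 31
#include <fuse.h>

/* Hypothetical init callback: opt out of automatic cache invalidation
 * (and its extra getattr traffic), assuming all reads and writes go
 * through this single local mount. */
static void *myfs_init(struct fuse_conn_info *conn, struct fuse_config *cfg)
{
    (void)cfg;
    conn->want &= ~FUSE_CAP_AUTO_INVAL_DATA;   /* skip getattr-based invalidation */
    return NULL;
}
```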

Splicing byte buffers between kernel and user space is also something I tried, but it did not help throughput.
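For reference, splicing is also negotiated through capability bits in the init callback. The sketch below shows the relevant libfuse3 flags; whether they help depends on whether the daemon's data path can actually use splice-based buffers, and in my tests they did not move the throughput needle.

```c
#define FUSE_USE_VERSION 31
#include <fuse.h>

/* Hypothetical init callback: request zero-copy splice for moving
 * request and reply buffers between the kernel and the daemon. */
static void *myfs_init(struct fuse_conn_info *conn, struct fuse_config *cfg)
{
    (void)cfg;
    if (conn->capable & FUSE_CAP_SPLICE_READ)
        conn->want |= FUSE_CAP_SPLICE_READ;
    if (conn->capable & FUSE_CAP_SPLICE_WRITE)
        conn->want |= FUSE_CAP_SPLICE_WRITE;
    if (conn->capable & FUSE_CAP_SPLICE_MOVE)
        conn->want |= FUSE_CAP_SPLICE_MOVE;
    return NULL;
}
```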

While investigating what I could improve, I learned quite a bit by reading the Linux kernel code. Coming from a Cassandra database background, I noticed that many design choices are fundamentally similar in nature. For example, the communication between the kernel FUSE queue and the FUSE daemon starts with a handshake, much like Cassandra's peer-to-peer node communication. During the initial connection setup, the kernel FUSE module sends a handshake (FUSE_INIT) message to the FUSE daemon in user space, and the daemon sends a reply message back to the kernel module. These messages exchange the protocol version, capabilities, and so on, so the client daemon can agree on the feature set supported by the kernel FUSE module.
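The result of that handshake is visible to the daemon through the init callback. This sketch, again against libfuse3 and with a hypothetical logging choice, just prints what was negotiated.

```c
#define FUSE_USE_VERSION 31
#include <fuse.h>
#include <stdio.h>

/* Hypothetical init callback: after the FUSE_INIT handshake, libfuse
 * exposes the negotiated protocol version and the capability bits the
 * kernel offered, so the daemon can decide which features to enable. */
static void *myfs_init(struct fuse_conn_info *conn, struct fuse_config *cfg)
{
    (void)cfg;
    fprintf(stderr, "FUSE protocol %u.%u, kernel capabilities 0x%x\n",
            conn->proto_major, conn->proto_minor, conn->capable);
    return NULL;
}
```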

Lastly, there is a way to increase the FUSE write request size from 128 KB to 1 MB when sending pages from the kernel to the FUSE daemon. But it is not possible on older Linux kernels: "max_pages" support, which allows up to 256 pages in one write request, landed in kernel 4.20. On older kernels, requests are capped at 128 KB (32 pages).
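With libfuse3 on a 4.20+ kernel, this is done by raising max_write in the init callback, and the library negotiates the matching max_pages with the kernel. The 1 MB figure mirrors what I described above; the rest of the sketch is an assumption about how you would wire it up.

```c
#define FUSE_USE_VERSION 31
#include <fuse.h>

/* Hypothetical init callback: ask for 1 MB write requests. On kernels
 * >= 4.20 this translates to max_pages = 256; older kernels still cap
 * requests at 128 KB (32 pages). */
static void *myfs_init(struct fuse_conn_info *conn, struct fuse_config *cfg)
{
    (void)cfg;
    conn->max_write = 1024 * 1024;   /* 1 MB per FUSE write request */
    return NULL;
}
```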

Finally, this is my first blog post, just summarizing what I learned about FUSE. Hopefully it is also helpful to anyone looking to push FUSE performance even further. Feel free to comment or reach out to me at xiaolong302@gmail.com if you are interested in learning more and learning from each other.
