High throughput object store access via file abstraction

Or Ozeri
Aug 13, 2017


Many applications expect a file system interface for reading their data. The need to port these applications to cloud environments with object storage has led to the development of file-to-object gateways built on FUSE file systems. However, some workloads, such as deep learning (e.g. Caffe), require high-throughput reads, which challenges the design of these gateways.

In this post, I explore two solutions, Goofys and s3fs, for mounting a bucket as a file system over the S3 API using FUSE. The deep learning use case I am considering mainly requires sequential reading, or semi-sequential reading (where the file cursor position over time is approximately linear), of very large files (e.g. 1 TB).

To select the appropriate solution, I investigated the impact of network latency and of the object gateway overhead on end-to-end performance, discussed next.

Network read performance is affected by the capabilities of the client’s NIC (e.g. 10Gbit/s), the object storage read (HTTP GET) operation throughput, and the number of concurrent GET connections issued to the object storage.

Both s3fs and Goofys use concurrent connections by default. Each connection retrieves a different chunk of the file, using S3 range reads. s3fs allows customizing both the number of concurrent connections (via the parallel_count argument) and the chunk size (via the multipart_size argument). Goofys, on the other hand, is currently hardcoded to use 5 connections and a 20 MB chunk size.
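To make the chunked read pattern concrete, here is a minimal Go sketch (my own illustration, not code from either project) of how a gateway can fetch a region of an object as fixed-size chunks over a bounded number of parallel HTTP range requests. The URL, function names, and parameter values are placeholders.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"sync"
)

// fetchRange issues a single HTTP GET with a Range header for bytes [start, end].
func fetchRange(url string, start, end int64) ([]byte, error) {
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", start, end))
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	return io.ReadAll(resp.Body)
}

// readConcurrently reads `size` bytes starting at `offset`, split into fixed-size
// chunks fetched by at most `parallel` connections at a time (compare with
// parallel_count/multipart_size in s3fs, or the hardcoded 5 x 20 MB in Goofys).
func readConcurrently(url string, offset, size, chunkSize int64, parallel int) [][]byte {
	numChunks := (size + chunkSize - 1) / chunkSize
	chunks := make([][]byte, numChunks)
	sem := make(chan struct{}, parallel) // bounds the number of concurrent GETs
	var wg sync.WaitGroup
	for i := int64(0); i < numChunks; i++ {
		wg.Add(1)
		go func(i int64) {
			defer wg.Done()
			sem <- struct{}{}
			defer func() { <-sem }()
			start := offset + i*chunkSize
			end := start + chunkSize - 1
			if last := offset + size - 1; end > last {
				end = last
			}
			data, err := fetchRange(url, start, end)
			if err == nil {
				chunks[i] = data
			}
		}(i)
	}
	wg.Wait()
	return chunks
}

func main() {
	// Example: read the first 100 MB of an object in 20 MB chunks over 5 connections.
	// The endpoint URL is a placeholder.
	chunks := readConcurrently("http://object-store.example/bucket/large-file", 0, 100<<20, 20<<20, 5)
	fmt.Println("fetched", len(chunks), "chunks")
}
```

The semaphore is the knob that bounds how many GETs are in flight at once, which is essentially what parallel_count exposes in s3fs.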

Gateway overhead includes the overhead imposed by the FUSE calls, the overhead for caching algorithms and data-structures, etc.

To isolate the FUSE call overhead, I patched the FUSE readfile function of each solution, replacing it with a stub, so that no s3fs or Goofys code executes; this exposes the overhead of the FUSE library used by each solution. With this patch, each user read request to a file returns immediately with a zero-filled buffer, without being cached or processed by Goofys or s3fs.

To measure the latency overhead of s3fs and Goofys themselves, I applied a second, independent patch (not based on the previous one) that removes all external network reads, replacing the object store read code with a stub function that simply fills the return buffer with zeros. In effect, this mocks the object store with a simple local in-memory service that always returns zero-filled data buffers.
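Conceptually, both patches reduce to replacing a read function with a trivial stub. For the second patch, the stand-in for the object store read path looks roughly like the following sketch (shown in Go, mirroring the Goofys side; the s3fs patch is the analogous change in C++). The function name and signature are illustrative, not the actual patched code.

```go
package gateway

// readObjectRange is an illustrative stand-in for the code path that would
// normally issue an S3 range read. It never touches the network: it simply
// fills the caller's buffer with zeros and reports a full read, mocking an
// object store with zero latency and unbounded throughput.
func readObjectRange(buf []byte, offset int64) (int, error) {
	_ = offset // the offset is irrelevant for zero-filled data
	for i := range buf {
		buf[i] = 0
	}
	return len(buf), nil
}
```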

The following figure illustrates the location of the two patches along the flow of a user read operation:

For each patch, I measured the throughput of a 5 GB file read. The results are summarized in the following table:

Patch                               Goofys      s3fs
FUSE readfile stub (patch 1)        530 MB/s    3150 MB/s
Object store read stub (patch 2)    500 MB/s    1130 MB/s

We can see two interesting takeaways in these results:

1. The FUSE overhead for Goofys is substantially higher than for s3fs (throughput is only 530 MB/s vs. 3150 MB/s). This may be due to the translation overhead of the Go FUSE library used by Goofys (https://github.com/jacobsa/fuse), compared to the native libfuse C library used by s3fs.

2. The overhead introduced by the Goofys implementation itself is negligible, reducing the throughput only from 530 MB/s to 500 MB/s, whereas the added overhead of s3fs is considerable, reducing the throughput from 3150 MB/s to 1130 MB/s. This can be explained by the fact that Goofys uses background threads to read the object from S3, whereas s3fs fetches the object in the foreground of the user’s read requests. Additionally, s3fs uses a file-system based cache, whereas Goofys uses an in-memory cache. I tested s3fs with an in-memory cache by using a tmpfs volume as the cache directory (via the use_cache argument), and got slightly better results (1360 MB/s instead of 1130 MB/s).

Estimating the maximal read throughput

Whether or not background threads are used has a direct consequence on the expected read performance of each solution, depending on the network throughput capabilities. If we denote the maximal object store read throughput as S (measured in MB/s), then we can roughly estimate the read throughput of Goofys and s3fs using the following formulas:

Goofys throughput ≈ min(S, 500) MB/s

s3fs throughput ≈ 1 / (1/S + 1/1360) MB/s

The s3fs formula assumes TLS connections and a tmpfs cache. These formulas are fitted to my test machine. For other environments, you would need to replace the constants 500 (Goofys read throughput with the object store read stub) and 1360 (s3fs read throughput with the object store read stub and a tmpfs cache) with the corresponding measurements. End-to-end experiments I ran with IBM’s COS as a backend verified that these formulas are a good approximation of the actual read throughput.

Let us now explain the formulas. In Goofys, reading from the object store is performed in the background of the user’s read requests, using threads. Thus, the read throughput equals the object store read throughput S, unless S is greater than 500 MB/s, in which case Goofys is bottlenecked by its FUSE overhead.

In s3fs, reading from the object store is done while the user’s request is waiting. Thus, the request latency per MB equals the sum of two latencies: the object store latency per MB (1/S) and the s3fs processing and caching latency per MB (1/1360). Taking the reciprocal of this total per-MB latency gives the throughput in MB/s.
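Expressed as a small code sketch, the two estimates look like this (the 500 and 1360 constants are the stub measurements from my test machine described above; the sample values in main are only illustrative):

```go
package main

import "fmt"

// estimateGoofys returns the expected Goofys read throughput (MB/s) given an
// object store read throughput s (MB/s): background reads hide the store
// latency, so throughput equals s, capped by the ~500 MB/s FUSE overhead ceiling.
func estimateGoofys(s float64) float64 {
	if s > 500 {
		return 500
	}
	return s
}

// estimateS3fs returns the expected s3fs read throughput (MB/s): reads happen
// in the foreground, so per-MB latencies add up (1/s for the store, 1/1360 for
// s3fs processing with a tmpfs cache), and throughput is the reciprocal.
func estimateS3fs(s float64) float64 {
	return 1 / (1/s + 1.0/1360)
}

func main() {
	for _, s := range []float64{100, 500, 1000, 2000} {
		fmt.Printf("S=%4.0f MB/s  goofys≈%6.1f MB/s  s3fs≈%6.1f MB/s\n",
			s, estimateGoofys(s), estimateS3fs(s))
	}
}
```

For example, at S = 1000 MB/s the s3fs estimate comes out to roughly 576 MB/s, which is why its theoretical ceiling can exceed the ~500 MB/s that Goofys is limited to by its FUSE overhead.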

Non-sequential read performance

The above analysis considered the simplest case of purely sequential reads of objects. With random reads, performance gets much worse due to network latency. Note, however, that there are use cases involving reads and seeks where the s3fs caching architecture can outperform Goofys. The s3fs cache guarantees that no offset of the object is ever retrieved twice, whereas Goofys clears its entire object cache on each seek operation. Thus, s3fs may perform significantly better in use cases that include semi-sequential reads (as defined at the beginning of this post).
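To make the caching difference concrete, here is a toy Go sketch (my own illustration of the behavior described above, not code from either project) of how the two policies react to a seek:

```go
package gateway

// rangeCache mimics the s3fs-style policy: it remembers every chunk it has
// already fetched (on disk, or in tmpfs), keyed by offset, so no byte of the
// object is ever retrieved from the object store twice.
type rangeCache struct {
	chunks map[int64][]byte // chunk start offset -> cached data
}

// onSeek for the s3fs-style cache is a no-op: cached chunks stay valid.
func (c *rangeCache) onSeek(newOffset int64) {}

// readaheadCache mimics the Goofys-style policy: an in-memory readahead
// buffer that is only valid while reads stay sequential.
type readaheadCache struct {
	buffered [][]byte
}

// onSeek for the Goofys-style cache drops everything buffered so far, so the
// data after the new offset must be fetched again from the object store.
func (c *readaheadCache) onSeek(newOffset int64) {
	c.buffered = nil
}
```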

Summary and Conclusions

Here is a summary of my findings regarding Goofys and s3fs:

- FUSE overhead: Goofys tops out around 530 MB/s, vs. roughly 3150 MB/s for s3fs (readfile stub).
- Gateway overhead: negligible for Goofys (500 MB/s), considerable for s3fs (1130 MB/s, or 1360 MB/s with a tmpfs cache).
- Object store reads: performed in background threads by Goofys, in the foreground of the user’s request by s3fs.
- Concurrency: hardcoded in Goofys (5 connections, 20 MB chunks), configurable in s3fs (parallel_count, multipart_size).
- Caching: in-memory and cleared on seek in Goofys, file-system based and never re-fetching an offset in s3fs.

For the high-throughput read workload I was aiming for, s3fs seems better than Goofys. One reason is that the s3fs cache, unlike Goofys’ cache, is adapted to seek operations. If your use case requires only sequential reading or a very small number of seeks (e.g. one per 5 GB read), then Goofys is certainly an option to consider. The other appeal I found in s3fs is that its theoretical read throughput limit can exceed 500 MB/s.

For reference, the tests were run on a bare-metal machine running Ubuntu 16.04, equipped with 2 x Intel Xeon E5-2650 v3 (deca-core) CPUs, 16 x 16 GB Micron DDR4 RAM, and a 600 GB Seagate Cheetah HDD. I used Goofys 0.0.9 and s3fs 1.80.
