Fast GCS Downloads with C++

Carlos O’Ryan (Google)
Dom Zippilli (Google)

Google Cloud - Community · Mar 8, 2021

The unaltered C++ logo with text “Fast GCS Downloads”. Learn more about Standard C++ at isocpp.org

Sometimes, you just want to find out: how fast can a download go? We all know cloud storage services can scale horizontally, but sometimes you just need to download a large file to a VM as fast as possible. In this post we will discuss a small program we wrote to help in these cases. It is useful as part of a larger workload, and it also helps with troubleshooting, as it rules out some likely bottlenecks.

The idea behind this program is simple: to download a large file, open N parallel streams from GCS, each downloading a “slice” of the file, then write each slice to the right portion of the destination file. There are, however, some additional considerations:

  • This approach requires sparse file support in your filesystem, making it unsuitable for some FUSE filesystems and for some file types.
  • In our tests, decrypting each stream takes about 40% of a GCE vCPU. While the actual percentage probably varies by VM model, it should be clear that one cannot create an unbounded number of streams. The program (perhaps conservatively) creates 2 threads per vCPU; if you want to change this, the program accepts a different number of threads on the command line (see the sketch after this list).
  • Because starting a stream takes several milliseconds, it is not worthwhile to create very small slices (think of the extreme case of 100 slices, each 1 byte long). This is true even when there are enough vCPUs to handle the additional work.
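For illustration, here is a minimal sketch of how such a default thread count could be computed. The function name and the fallback value are our inventions, not necessarily the program's; the 2× multiplier matches the heuristic described above.

    #include <thread>

    // A sketch: default to 2 threads per vCPU, as described above.
    // std::thread::hardware_concurrency() may return 0 when the value is
    // unknown, so fall back to a small constant in that case.
    int DefaultThreadCount(int command_line_override = 0) {
      if (command_line_override > 0) return command_line_override;
      auto const vcpus = std::thread::hardware_concurrency();
      return vcpus == 0 ? 2 : static_cast<int>(2 * vcpus);
    }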

The program can achieve very high throughput, even on a relatively small VM. For example, we were able to download a 32GiB file at over 1GiB/s using a c2-standard-4 VM.

With that said, let’s jump into the code and explain what it is doing. You can find the full program on GitHub. Keen readers may note that this program uses C++17 features; we think that yields more readable code in this example. If you are using C++11 or C++14, no worries: the library supports both.

The first thing the program needs to do is find out how large the object we want to download is; this takes just two lines of code:
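In context, those two lines look something like this sketch (the surrounding helper and variable names are ours, not necessarily the program's):

    #include "google/cloud/storage/client.h"
    #include <cstdint>
    #include <string>

    namespace gcs = google::cloud::storage;

    // A sketch: look up the object metadata and return its size. Note that
    // GetObjectMetadata() returns StatusOr<ObjectMetadata>, and .value()
    // throws on errors.
    std::int64_t GetObjectSize(gcs::Client client, std::string const& bucket_name,
                               std::string const& object_name) {
      auto metadata = client.GetObjectMetadata(bucket_name, object_name).value();
      return static_cast<std::int64_t>(metadata.size());
    }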

Note the use of .value(): the C++ client library returns StatusOr<T> from most functions. This is an outcome type which contains either the result or an error. .value() returns the contained value on success, and throws an exception on error. Because the client library already retries most operations, a simple program like this can just log these errors and exit; there is no need to implement a retry loop. C++ exceptions are a natural way to express that flow of control. More complex applications could examine the error code at the call site, and continue the work despite the error, maybe in some degraded mode.
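For example, a sketch of the exception-free alternative (assuming a `client` and the same placeholder names as above) might look like:

    // Examine the StatusOr<T> at the call site instead of calling .value();
    // its operator bool tells us whether it holds a result or an error.
    auto metadata = client.GetObjectMetadata(bucket_name, object_name);
    if (!metadata) {
      std::cerr << "Cannot get object metadata: " << metadata.status() << "\n";
      return;  // or fall back to some degraded mode
    }
    std::cout << "Object size: " << metadata->size() << "\n";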

Once we have information about the size of the target object, we can calculate the slice size and the number of threads to perform this download. We use a small function that calculates the size of each slice; a full sketch appears after the description below.

If the object is large enough, the function makes each slice approximately equal in size.

Of course, we need to deal with the case where the object size is not a multiple of the number of threads. We give any extra bytes to the last slice.

Finally, if the object is too small, we just create enough slices of a minimum size.
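Putting these three cases together yields something like the following sketch. The function name, the return type, and the minimum slice size are our choices for illustration; the real program on GitHub differs in its details.

    #include <cstdint>
    #include <utility>
    #include <vector>

    // A sketch: compute the [begin, end) byte range of each slice.
    std::vector<std::pair<std::int64_t, std::int64_t>> ComputeSlices(
        std::int64_t object_size, int thread_count,
        std::int64_t min_slice_size = 4 * 1024 * 1024) {
      // If the object is large enough, make each slice approximately equal.
      auto slice_size = object_size / thread_count;
      auto slice_count = static_cast<std::int64_t>(thread_count);
      if (slice_size < min_slice_size) {
        // The object is too small: create just enough slices of the minimum size.
        slice_size = min_slice_size;
        slice_count = (object_size + slice_size - 1) / slice_size;
      }
      std::vector<std::pair<std::int64_t, std::int64_t>> slices;
      for (std::int64_t i = 0; i != slice_count; ++i) {
        slices.emplace_back(i * slice_size, (i + 1) * slice_size);
      }
      // The object size may not be a multiple of the slice size; give any
      // extra bytes to (or trim them from) the last slice.
      if (!slices.empty()) slices.back().second = object_size;
      return slices;
    }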

The bulk of the work is to download each slice. This is defined in a small function, so we can schedule the work on different threads. The function receives the name of the bucket and object to download, as well as the range of bytes to download from said object. Finally, it receives a file descriptor where the received bytes will be stored. A full sketch of this function appears after the walkthrough below.

We create a separate client in each thread to minimize contention, and then start the download for the desired range.

We read data in relatively large blocks, 1MiB in this case; this minimizes the system call overhead for pwrite() (see below). Downloads are streamed, so the read block size does not have much effect on the download performance, beyond the usual function call overheads. We also need to keep track of where in the file this data needs to be written.

Then we just read a block, and if there is an error we simply stop. Note the use of .read(): the client library minimizes copying when you use this standard C++ iostreams function. It is the most efficient way to read unformatted data.

With the data in a buffer, we update some counters and write to the destination file descriptor. Note that we use is.gcount(): while the client library always waits until your buffer is full, the slice may not be a multiple of the buffer size, so we may receive a “short” buffer at the end.

Finally, we continue reading until the slice is completely downloaded, and then return some informational messages.
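A sketch of the whole function follows, with our own names and simplified error reporting; the program on GitHub differs in its details. Note that gcs::Client::CreateDefaultClient() was how clients were created in the library versions current when this was written.

    #include "google/cloud/storage/client.h"
    #include <unistd.h>
    #include <cstdint>
    #include <iostream>
    #include <string>
    #include <vector>

    namespace gcs = google::cloud::storage;

    // A sketch: download bytes [begin, end) of the object and write them at
    // the same offsets of the file descriptor `fd`. Returns the byte count.
    std::int64_t DownloadSlice(std::string const& bucket_name,
                               std::string const& object_name,
                               std::int64_t begin, std::int64_t end, int fd) {
      // A separate client per thread minimizes contention.
      auto client = gcs::Client::CreateDefaultClient().value();
      auto is = client.ReadObject(bucket_name, object_name,
                                  gcs::ReadRange(begin, end));
      // Read in relatively large blocks to reduce the pwrite() call count.
      std::vector<char> buffer(1024 * 1024L);
      std::int64_t offset = begin;  // where the next block lands in the file
      std::int64_t count = 0;
      do {
        is.read(buffer.data(), buffer.size());
        if (is.bad()) break;  // stop on download errors; is.status() has details
        // The last block of a slice may be "short"; gcount() has its size.
        auto const n = static_cast<std::int64_t>(is.gcount());
        if (n == 0) break;
        if (::pwrite(fd, buffer.data(), static_cast<std::size_t>(n), offset) != n) {
          break;  // stop on write errors
        }
        offset += n;
        count += n;
      } while (!is.eof());
      std::cout << "Downloaded [" << begin << "," << end << "): " << count
                << " bytes\n";
      return count;
    }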

That was the bulk of the work. We do need to schedule these threads:
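For example, assuming the ComputeSlices() and DownloadSlice() sketches above, and using std::async from the standard library (the real program may schedule its threads differently):

    #include <future>
    #include <vector>

    // Launch one task per slice; each runs DownloadSlice() in its own thread.
    // `slices`, `bucket_name`, `object_name`, and `fd` are assumed to be in
    // scope, e.g. inside main().
    std::vector<std::future<std::int64_t>> tasks;
    for (auto const& [begin, end] : slices) {
      tasks.push_back(std::async(std::launch::async, DownloadSlice, bucket_name,
                                 object_name, begin, end, fd));
    }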

And then wait for them:
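Again as a sketch, assuming the `tasks` vector from the previous snippet:

    // Wait for all the downloads, adding up how many bytes each one wrote.
    std::int64_t total_bytes = 0;
    for (auto& task : tasks) total_bytes += task.get();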

Creating the destination file just requires a few lines of code too. Unfortunately, this is relatively low-level code: we need to perform concurrent writes, and neither the standard C++ facilities (iostreams) nor the most commonly used libraries (such as Boost) support them. So we just resort to some POSIX APIs:
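Something along these lines, where `filename` is a placeholder name of ours (the real program also handles errors more carefully):

    #include <fcntl.h>
    #include <sys/stat.h>
    #include <unistd.h>

    // Create (or truncate) the destination file. Each thread can then
    // pwrite() to its own range of the same descriptor without any
    // additional synchronization.
    int fd = ::open(filename.c_str(), O_CREAT | O_TRUNC | O_WRONLY, 0644);
    if (fd < 0) {
      std::cerr << "Cannot create destination file " << filename << "\n";
      return 1;  // assuming this runs in main()
    }
    // ... schedule the download threads and wait for them, then:
    ::close(fd);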

There is, of course, some amount of code to handle command-line parsing and report the performance results, but that is less interesting.

Thanks for Reading!

If you found this interesting, give us a star on GitHub today. You can find more examples and documentation for the Google Cloud C++ client libraries there.
