Fast GCS Downloads with C++

Carlos O’Ryan (Google)
Dom Zippilli (Google)

Google Cloud - Community · Mar 8, 2021

The unaltered C++ logo with text “Fast GCS Downloads”. Learn more about Standard C++ at isocpp.org

Sometimes, you just want to find out: how fast can a download go? We all know cloud storage services can scale horizontally, but sometimes you just need to download a large file to a VM as fast as possible. In this post we will discuss a small program we wrote to help in these cases. It is useful as part of a larger workload, and it also helps with troubleshooting, as it rules out some likely bottlenecks.

The idea behind this program is simple: to download a large file, open N parallel streams from GCS, each downloading a “slice” of the file, then write each slice to the right portion of the destination file. There are, however, some additional considerations:

  • This approach requires sparse file support in your filesystem, making it unsuitable for some FUSE filesystems and for some file types.
  • In our tests, decrypting each stream takes about 40% of a GCE vCPU. While the actual percentage probably varies by VM model, it should be clear that one cannot create an unbounded number of streams. The program (perhaps conservatively) creates 2 threads per vCPU; if you want to change this, the program accepts a different number of threads on the command line (see the sketch after this list).
  • Because starting a stream takes several milliseconds, it is not worthwhile to create very small slices (think of the extreme case of 100 slices, each 1 byte long). This is true even when there are enough vCPUs to handle the additional work.
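For illustration, here is a minimal sketch of how such a default thread count could be computed. The function name and the fallback value are our inventions, not necessarily the program's; the 2× multiplier matches the heuristic described above.

    #include <thread>

    // A sketch: default to 2 threads per vCPU, as described above.
    // std::thread::hardware_concurrency() may return 0 when the value is
    // unknown, so fall back to a small constant in that case.
    int DefaultThreadCount(int command_line_override = 0) {
      if (command_line_override > 0) return command_line_override;
      auto const vcpus = std::thread::hardware_concurrency();
      return vcpus == 0 ? 2 : static_cast<int>(2 * vcpus);
    }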

The program can achieve very high throughput, even on a relatively small VM. For example, we were able to download a 32GiB file at over 1GiB/s using a c2-standard-4 VM.

With that said, let’s jump into the code and explain what it is doing. You can find the full program on GitHub. Keen readers may note that this program uses C++17 features; we think that yields more readable code in this example. If you are using C++11 or C++14, no worries: the library supports both.

The first thing the program needs to do is find out how large the object we want to download is; this takes just two lines of code:
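In context, those two lines look something like this sketch (the surrounding helper and variable names are ours, not necessarily the program's):

    #include "google/cloud/storage/client.h"
    #include <cstdint>
    #include <string>

    namespace gcs = google::cloud::storage;

    // A sketch: look up the object metadata and return its size. Note that
    // GetObjectMetadata() returns StatusOr<ObjectMetadata>, and .value()
    // throws on errors.
    std::int64_t GetObjectSize(gcs::Client client, std::string const& bucket_name,
                               std::string const& object_name) {
      auto metadata = client.GetObjectMetadata(bucket_name, object_name).value();
      return static_cast<std::int64_t>(metadata.size());
    }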

Note the use of .value(): the C++ client library returns StatusOr<T> from most functions. This is an outcome type which contains either the result or an error. .value() returns the contained value on success, and throws an exception on error. Because the client library already retries most operations, a simple program like this can just log these errors and exit; there is no need to implement a retry loop. C++ exceptions are a natural way to express that flow of control. More complex applications could examine the error code at the call site, and continue the work despite the error, maybe in some degraded mode.
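For example, a sketch of the exception-free alternative (assuming a `client` and the same placeholder names as above) might look like:

    // Examine the StatusOr<T> at the call site instead of calling .value();
    // its operator bool tells us whether it holds a result or an error.
    auto metadata = client.GetObjectMetadata(bucket_name, object_name);
    if (!metadata) {
      std::cerr << "Cannot get object metadata: " << metadata.status() << "\n";
      return;  // or fall back to some degraded mode
    }
    std::cout << "Object size: " << metadata->size() << "\n";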

Once we have information about the size of the target object, we can calculate the slice size and the number of threads to perform this download. We use a small function that calculates the size of each slice; a full sketch appears after the description below.

If the object is large enough, the function makes each slice approximately equal in size.

Of course, we need to deal with the case where the object size is not a multiple of the number of threads. We give any extra bytes to the last slice.

Finally, if the object is too small, we just create enough slices of a minimum size.
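Putting these three cases together yields something like the following sketch. The function name, the return type, and the minimum slice size are our choices for illustration; the real program on GitHub differs in its details.

    #include <cstdint>
    #include <utility>
    #include <vector>

    // A sketch: compute the [begin, end) byte range of each slice.
    std::vector<std::pair<std::int64_t, std::int64_t>> ComputeSlices(
        std::int64_t object_size, int thread_count,
        std::int64_t min_slice_size = 4 * 1024 * 1024) {
      // If the object is large enough, make each slice approximately equal.
      auto slice_size = object_size / thread_count;
      auto slice_count = static_cast<std::int64_t>(thread_count);
      if (slice_size < min_slice_size) {
        // The object is too small: create just enough slices of the minimum size.
        slice_size = min_slice_size;
        slice_count = (object_size + slice_size - 1) / slice_size;
      }
      std::vector<std::pair<std::int64_t, std::int64_t>> slices;
      for (std::int64_t i = 0; i != slice_count; ++i) {
        slices.emplace_back(i * slice_size, (i + 1) * slice_size);
      }
      // The object size may not be a multiple of the slice size; give any
      // extra bytes to (or trim them from) the last slice.
      if (!slices.empty()) slices.back().second = object_size;
      return slices;
    }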

The bulk of the work is to download each slice. This is defined in a small function, so we can schedule the work on different threads. The function receives the name of the bucket and object to download, as well as the range of bytes to download from said object. Finally, it receives a file descriptor where the received bytes will be stored. A full sketch of this function appears after the walkthrough below.

We create a separate client in each thread to minimize contention, and then start the download for the desired range.

We read data in relatively large blocks, 1MiB in this case; this minimizes the system call overhead for pwrite() (see below). Downloads are streamed, so the read block size does not have much effect on the download performance, beyond the usual function call overheads. We also need to keep track of where in the file this data needs to be written.

Then we just read a block, and if there is an error we simply stop. Note the use of .read(): the client library minimizes copying when you use this standard C++ iostreams function. It is the most efficient way to read unformatted data.

With the data in a buffer, we update some counters and write to the destination file descriptor. Note that we use is.gcount(): while the client library always waits until your buffer is full, the slice may not be a multiple of the buffer size, so we may receive a “short” buffer at the end.

Finally, we continue reading until the slice is completely downloaded, and then return some informational messages.
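A sketch of the whole function follows, with our own names and simplified error reporting; the program on GitHub differs in its details. Note that gcs::Client::CreateDefaultClient() was how clients were created in the library versions current when this was written.

    #include "google/cloud/storage/client.h"
    #include <unistd.h>
    #include <cstdint>
    #include <iostream>
    #include <string>
    #include <vector>

    namespace gcs = google::cloud::storage;

    // A sketch: download bytes [begin, end) of the object and write them at
    // the same offsets of the file descriptor `fd`. Returns the byte count.
    std::int64_t DownloadSlice(std::string const& bucket_name,
                               std::string const& object_name,
                               std::int64_t begin, std::int64_t end, int fd) {
      // A separate client per thread minimizes contention.
      auto client = gcs::Client::CreateDefaultClient().value();
      auto is = client.ReadObject(bucket_name, object_name,
                                  gcs::ReadRange(begin, end));
      // Read in relatively large blocks to reduce the pwrite() call count.
      std::vector<char> buffer(1024 * 1024L);
      std::int64_t offset = begin;  // where the next block lands in the file
      std::int64_t count = 0;
      do {
        is.read(buffer.data(), buffer.size());
        if (is.bad()) break;  // stop on download errors; is.status() has details
        // The last block of a slice may be "short"; gcount() has its size.
        auto const n = static_cast<std::int64_t>(is.gcount());
        if (n == 0) break;
        if (::pwrite(fd, buffer.data(), static_cast<std::size_t>(n), offset) != n) {
          break;  // stop on write errors
        }
        offset += n;
        count += n;
      } while (!is.eof());
      std::cout << "Downloaded [" << begin << "," << end << "): " << count
                << " bytes\n";
      return count;
    }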

That was the bulk of the work. We do need to schedule these threads:
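For example, assuming the ComputeSlices() and DownloadSlice() sketches above, and using std::async from the standard library (the real program may schedule its threads differently):

    #include <future>
    #include <vector>

    // Launch one task per slice; each runs DownloadSlice() in its own thread.
    // `slices`, `bucket_name`, `object_name`, and `fd` are assumed to be in
    // scope, e.g. inside main().
    std::vector<std::future<std::int64_t>> tasks;
    for (auto const& [begin, end] : slices) {
      tasks.push_back(std::async(std::launch::async, DownloadSlice, bucket_name,
                                 object_name, begin, end, fd));
    }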

And then wait for them:
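Again as a sketch, assuming the `tasks` vector from the previous snippet:

    // Wait for all the downloads, adding up how many bytes each one wrote.
    std::int64_t total_bytes = 0;
    for (auto& task : tasks) total_bytes += task.get();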

Creating the destination file just requires a few lines of code too. Unfortunately, this is relatively low-level code: we need to perform concurrent writes, and neither the standard C++ facilities (iostreams) nor the most commonly used libraries (such as Boost) support them. So we just resort to some POSIX APIs:
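Something along these lines, where `filename` is a placeholder name of ours (the real program also handles errors more carefully):

    #include <fcntl.h>
    #include <sys/stat.h>
    #include <unistd.h>

    // Create (or truncate) the destination file. Each thread can then
    // pwrite() to its own range of the same descriptor without any
    // additional synchronization.
    int fd = ::open(filename.c_str(), O_CREAT | O_TRUNC | O_WRONLY, 0644);
    if (fd < 0) {
      std::cerr << "Cannot create destination file " << filename << "\n";
      return 1;  // assuming this runs in main()
    }
    // ... schedule the download threads and wait for them, then:
    ::close(fd);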

There is, of course, some amount of code to handle command-line parsing and report the performance results, but that is less interesting.

Thanks for Reading!

If you found this interesting, give us a star on GitHub today. You can find more examples and documentation for the Google Cloud C++ client libraries there.
