Google Cloud Storage: Which bucket class gives the best performance?

One of the most common performance questions we get about Google Cloud Storage is “What type of bucket should I use for the best performance?”

Well, to help answer that, let’s describe the bucket classes and run some tests to see which one suits which situation.

GCS buckets in pragmatic terms

GCS asks you to create “buckets” to hold your assets, and the first question is which storage class you want for the bucket:

Regional, Multiregional, Nearline, or Coldline.

Nearline and Coldline are not intended for high-performance systems, so we’ll ignore those right now, and look at the performance of the Regional and Multiregional buckets.

First off, let’s clarify some terminology:

  • A Region is the location of a GCS datacenter.
  • A Multiregion is a collection of Regions (it’s a hierarchy thing).

Regional

There’s a common scenario where you want some client or on-prem system to upload data to a cloud environment, do compute work on it there, and then return the results to the client. Or, in sysadmin terms: “You want your VMs close to your data source to maximize throughput.”

Regional GCS buckets guarantee that all data in the bucket lies in the specified region, for this exact reason.

Now, to be fair, you don’t have fine-grained control over which subregion of the region your data is in; over time it can migrate and move. Each write is replicated to 2 locations, with a separate metadata listing for each location the data is copied to for fault tolerance. This can cause a synchronous remote round trip to all the other metadata locations on a write (due to strong read-after-write consistency).

This means Regional buckets are great for data processing: the physical distance between the replicas is fairly small, so the overhead of write consistency is low.

Multiregional

Multiregional Storage, on the other hand, guarantees 2 replicas that are geo-diverse (100 miles apart), which can give better remote latency and availability.

More importantly, Multiregional heavily leverages edge caching and CDNs to deliver content to end users.

All this redundancy and caching means that Multiregional comes with overhead to sync and ensure consistency between geo-diverse areas. As such, it’s much better for write-once-read-many scenarios. That means objects that are frequently accessed around the world (i.e. “hot” objects), such as website content, streaming videos, gaming or mobile applications.

Perf characteristics

To provide real numbers, we set up a test: upload a 2MB file to a bunch of regional and multiregional buckets, and then fetch that asset (with caching disabled) from a VM in us-west1.

This data appears to show that multiregional buckets perform significantly better for cross-the-ocean fetches; however, the details are a bit more nuanced than that.

Looking back at the logs, the reason the Multiregion buckets perform better in those scenarios is that the data was duplicated to a region (of the multi-region) that provided a better access point (and lower latency) to our fetching client. (To confirm this, I ran the exact same test between us-west1 and europe-west1 directly, and got about 175ms.)
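The post doesn’t include the test harness, so here is a minimal sketch of how a fetch test like the one above could be timed, using only the Python standard library. The names `time_fetches`, `summarize` and `http_fetcher` are made up for illustration, and the bucket URLs in the comments are hypothetical placeholders. How caching was disabled in the original test isn’t stated; one option (an assumption here) is to set the object’s Cache-Control metadata to no-store when uploading it.

```python
import statistics
import time
import urllib.request


def time_fetches(fetch, repeats=10):
    """Call fetch() `repeats` times and return each call's latency in ms."""
    latencies_ms = []
    for _ in range(repeats):
        start = time.perf_counter()
        fetch()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms


def summarize(latencies_ms):
    """Reduce a list of latencies to min / median / max."""
    return {
        "min_ms": min(latencies_ms),
        "median_ms": statistics.median(latencies_ms),
        "max_ms": max(latencies_ms),
    }


def http_fetcher(url):
    """Build a zero-argument callable that downloads `url` in full."""
    def fetch():
        with urllib.request.urlopen(url, timeout=30) as response:
            return len(response.read())
    return fetch


# Example usage (hypothetical bucket and object names, not from the post):
#   url = "https://storage.googleapis.com/my-regional-bucket/test-2mb.bin"
#   print(summarize(time_fetches(http_fetcher(url))))
```

Running this from a VM in one region against buckets in several locations gives per-bucket numbers you can compare. The median is usually a steadier figure than the mean here, since a single slow request can skew an average.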

A bucket of performance

What these tests show us is that there’s no inherent performance difference between the bucket classes themselves. Rather, performance is dominated by the latency that comes from the physical distance between the client and the cloud storage bucket.

As such, we get a handy little rule here:

  • If caching is on, and your access volume is high enough to take advantage of caching, there’s not a huge difference between the two offerings (that I can see with the tests). This shows off the power of Google’s Awesome CDN environment.
  • If caching is off, or the access volume is low enough that you can’t take advantage of caching, then the performance overhead is dominated directly by physics. You should be trying to get the assets as close to the clients as possible, while also considering cost, and the types of redundancy and consistency your data requires.
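To put a rough number on “dominated by physics”, here is a back-of-the-envelope sketch that estimates the lowest possible round-trip time between two locations from their great-circle distance. Everything in it is an assumption rather than a measurement: the coordinates are approximate positions for us-west1 (near The Dalles, Oregon) and europe-west1 (near St. Ghislain, Belgium), and light in fiber is taken to travel at about 200,000 km/s, roughly two thirds of its speed in a vacuum.

```python
import math

EARTH_RADIUS_KM = 6371.0
# Assumption: signal speed in optical fiber, roughly 2/3 the speed of light.
FIBER_SPEED_KM_PER_S = 200_000.0


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def min_round_trip_ms(distance_km):
    """Lower bound on one network round trip over fiber, in milliseconds."""
    return 2 * distance_km / FIBER_SPEED_KM_PER_S * 1000.0


# Approximate coordinates, assumed for illustration only.
US_WEST1 = (45.6, -121.2)     # near The Dalles, Oregon
EUROPE_WEST1 = (50.45, 3.82)  # near St. Ghislain, Belgium

distance_km = great_circle_km(*US_WEST1, *EUROPE_WEST1)
rtt_floor_ms = min_round_trip_ms(distance_km)
print(f"distance ~{distance_km:.0f} km, round-trip floor ~{rtt_floor_ms:.0f} ms")
```

Under these assumptions the distance comes out at roughly 8,000 km, giving a floor of about 80ms for a single round trip. The ~175ms measured between us-west1 and europe-west1 sits well above that floor, which is plausible: real network paths are longer than a great circle, and a fetch generally involves more than one round trip plus the time to transfer the 2MB payload.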