I’m a huge fan of organization, and on more than one occasion, this has extended to naming conventions. Standardized naming helps you understand which files belong together, and what you’re about to look at. It makes searching for things easier, which traditionally also makes finding things easier.*
That being said, there are instances where the sweet smell of sequentially named files could work against you and really bog down your upload speed. It’s as fixable as it is disruptive, and that’s what we’ll cover in this post!
How did we get here? A bit of context
Let’s say you’ve got a security camera snapping photos and uploading them to your Cloud Storage bucket. This camera conveniently names each file with the timestamp of when it was taken, which makes it easy to find footage from specific moments and view the photos chronologically.
At regular intervals throughout the day, these photos are uploaded from the local device to a Google Cloud Storage bucket for display, analysis (and easy searching!) — but you’ve run into an issue.
There’s nothing suspicious on the footage, but you wouldn’t even know it, because it’s taking SO LONG for your photos to upload in the first place. There’s a backlog forming from the uploads, and this backlog is preventing the local images from the camera from being cleared at the appropriate intervals. The delays are maxing out the local space on the camera, and preventing it from documenting anything further until that backlog is cleared.
This means your security system is unreliable until further notice, and nobody wants that.
First you check the basics: There’s no network connectivity issue, and everything else is running smoothly, so we’ll need to look deeper for the source of this disruption.
Let’s start with what happens when files are uploaded to Google Cloud Storage.
Under the Hood — Load Balancing and Unbalancing
Quick note: In this post, we’re talking about the built-in load balancer within GCS. There’s a separate product called “Load Balancer”, but that’s a whole other thing.
When uploading files to Google Cloud Storage, our friend the GCS load balancer is responsible for handling the upload by deciding how to distribute connections between shards.
This means the load balancer is looking for ways to best organize things that need to make it into your storage buckets. Since we love a good organizational mechanism, we dug a little deeper to learn how the load balancer determines the best way to distribute incoming requests to various parts of the GCS architecture.
We found that the load balancer uses file names to guide distribution… and so the plot thickens.
Course Correcting for Upload Performance
After asking around a bit, it was clear that the load balancer is a crafty mechanism indeed — it’s expecting us to have well defined, well named files ready for upload, and that expectation is built into its balancing process.
The load balancer thinks of your sequentially named files as individual members of a family. The load balancer wants to keep this family together, so every file that’s perceived as a part of this family (based on your handy naming convention) will be forced to use the same passage to the cloud.
This means that if you’ve got a security camera taking a photo every second, you’ve got 60 photos per minute, and 3,600 per hour. That’s 86,400 photos to upload a day, and if they’re all taking a single pathway, it’s going to get crowded, and that traffic is going to slow you down.
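To make the "family" idea concrete, here's a minimal sketch in Python. The `YYYYMMDD-HHMMSS.jpg` naming pattern and the date are assumptions for illustration (the camera's real format may differ), and the sketch only shows how much of the name sequential files share. It says nothing about how GCS shards internally.

```python
from datetime import datetime, timedelta
import os

# Hypothetical naming scheme: one photo per second, named by its timestamp.
start = datetime(2024, 1, 1, 0, 0, 0)
names = [
    (start + timedelta(seconds=i)).strftime("%Y%m%d-%H%M%S") + ".jpg"
    for i in range(60 * 60 * 24)  # 60 seconds * 60 minutes * 24 hours
]

# Every name from the same day starts with the same characters,
# so the whole day's photos look like one closely related group.
shared = os.path.commonprefix(names)

print(len(names))  # 86400
print(shared)      # 20240101-
```

Every one of the day's 86,400 names begins with the same nine characters, which is exactly the kind of resemblance that gets them treated as one family.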
So what can we do about it?
Mixing it Up to Speed it Up
The key to avoiding this issue is diversity!** The more diverse the file names, the better the system can balance the load, since it’s not trying to organize everything into a single pathway.
**only if your upload performance issue is due to a bunch of semi-consistent file names, of course!
You can still keep your organized or automated naming conventions, but add a bit of noise to the mix to keep the load balancer from overcompensating.
1. Let’s start with our filename
2. Create a random hash of the filename: this will transform the name into a noisy value
3. Add a marker (short string in the beginning) to make it pseudorandom*
4. Prepend the hash to the original file name
* this is enough to avoid disrupting the load balancer while still preserving the ability to reconstitute image order
A pseudorandom hash of the filename, prepended to the filename, will allow Cloud Storage to better partition the fixed range into shards, improving our upload performance without clobbering our data or disrupting our carefully curated system.
To review, this is the transition you’ll need to make for your sequentially named files:
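One way to sketch that transition in Python is shown here. The post doesn't name a hash function, a prefix length, or a separator, so SHA-256, six hex characters, and an underscore are all assumptions, and `add_hash_prefix` and `strip_hash_prefix` are hypothetical helper names rather than part of any Cloud Storage library.

```python
import hashlib

PREFIX_LEN = 6  # assumed marker length; the post does not specify one


def add_hash_prefix(filename: str) -> str:
    """Prepend the first few hex characters of the name's hash (assumed scheme)."""
    digest = hashlib.sha256(filename.encode("utf-8")).hexdigest()
    return f"{digest[:PREFIX_LEN]}_{filename}"


def strip_hash_prefix(stored_name: str) -> str:
    """Drop the marker and the underscore to get the original name back."""
    return stored_name[PREFIX_LEN + 1:]


# Hypothetical sequential names, as a timestamping camera might produce them.
originals = [f"20240101-0000{second:02d}.jpg" for second in range(5)]

# Names as they would be uploaded: noisy at the front, original name intact.
stored = [add_hash_prefix(name) for name in originals]

# Chronological order is still recoverable by stripping the marker and sorting.
recovered = sorted(strip_hash_prefix(name) for name in stored)

for before, after in zip(originals, stored):
    print(before, "->", after)
```

Because the marker is derived from the filename itself, the same photo always maps to the same stored name, and the original timestamp survives untouched after the underscore.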
To learn more about hashes and eTags, check out this article, and stay tuned for the next installment of Cloud Storage Bytes — the blog!