Google Cloud Storage & Sequential File Names
BUMBLECAM is a nursery camera company whose application uploads snapshots from the nursery cam to the owners of the camera on a regular basis. Their setup was pretty simple: every half second, the camera would snap a photo and store it locally. In battery-saver / bandwidth-saver mode, a bunch of pictures would be batched up on the device before being uploaded to the owner's Google Cloud Storage bucket.
The problem they were seeing was that upload times for the images were painfully slow. Over the course of 20 minutes, this slowdown would build up a serious backlog of images waiting to be uploaded, eventually causing the camera to run out of memory and stop working.
Finding the source of the problem
So let’s run down the checklist here.
First, we ran perfdiag on a sample bucket they created, and verified it had a high throughput from the source. Way higher than what the developer was seeing. This meant that the bucket itself was performing properly in terms of connection and upload speed from the client; the problem had to be in the type of data being uploaded, or how it was being organized.
Since the cameras were embedded hardware, we knew they weren't using gsutil to do the uploads, but rather the native Python client library to upload files. As such, we knew the uploads were going through the fastest available API.
Next we checked the sizes of the files: they were about 100 KB each. They weren't big enough to benefit from composite uploads, and the developer was already using the parallel upload API, so we should have been seeing maximum throughput there.
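As a rough illustration of that setup, here is a minimal sketch of batching uploads through a thread pool. The `upload_fn` parameter is a hypothetical wrapper around whatever performs the single-file upload (e.g. the Python client's `blob.upload_from_filename`); it is parameterized here so the sketch stays self-contained.

```python
from concurrent.futures import ThreadPoolExecutor

def upload_batch(paths, upload_fn, max_workers=8):
    """Upload a batch of files in parallel.

    upload_fn is whatever performs the single-file upload; in a real
    pipeline it might wrap google-cloud-storage's
    blob.upload_from_filename (hypothetical wiring, adjust to your
    client setup).
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order and re-raises any upload exceptions
        return list(pool.map(upload_fn, paths))
```

The point is that the client-side parallelism was already in place; the bottleneck was elsewhere.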
Basically, for all of the most common performance culprits, things were already set up in the ideal manner.
How GCS works when uploading files
At this point, I needed to do research myself, so I turned to Michael Yu’s NEXT 2017 talk, where he provided details on how GCS works behind the scenes. Here’s a generalization of how things work:
When uploading a number of files to GCS, the frontend will auto-balance the connections across a number of shards to handle the transfer. This auto-balancing is, by default, done through the name/path of the file, which is very helpful if the files are in different directories, since each one can be properly distributed to a different shard.
Which means that how you name your files could have an impact on your upload speed.
When the files are co-located in the directory structure, and the names are sequential, the requests are constantly shifting to a new index range, making redistributing the load harder and less effective.
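A toy model makes the problem concrete. The range-partitioning below is a deliberately simplified illustration, not GCS's actual sharding algorithm: object names are assigned to shards by lexicographic range, so sequential names all fall into the same range and pile onto one shard.

```python
def shard_for(name, ranges):
    """Toy range-partitioner: return the shard of the first range whose
    upper bound is >= the name (lexicographic comparison)."""
    for upper, shard in ranges:
        if name <= upper:
            return shard
    return ranges[-1][1]

# Three shards covering the keyspace (toy setup, not GCS internals)
ranges = [("h", 0), ("p", 1), ("z", 2)]

sequential = ["cam/0001.jpg", "cam/0002.jpg", "cam/0003.jpg"]
# Every name starts with "cam/", so all of them land on shard 0 -> hotspot
assignments = {n: shard_for(n, ranges) for n in sequential}
```

All the load concentrates on one shard, and because the names keep marching forward in order, rebalancing the ranges never gets ahead of the traffic.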
And that was exactly our issue. The developer was using the timestamp in the file path. (e.g. YYYY/MM/DD/CUSTOMER/timestamp)
Fixing the sequential name problem
One solution to this problem is to manually break up your linearly-named files into folders, and then upload the folders in parallel. For example, the foo and bar folders can be uploaded in parallel without stomping all over each other. However, Bumblecam wasn't too thrilled with this option, since it would introduce new "housekeeping" dependencies in various parts of the pipeline, not to mention that it might cause other scaling issues down the line if they didn't keep creating new folders.
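For reference, the folder-splitting workaround can be sketched in a few lines. This is a hypothetical round-robin scheme (the fanout value and the two-hex-digit folder names are my own choices for illustration), which spreads linearly-named files across subfolders so parallel uploads hit distinct name ranges.

```python
def split_into_folders(filenames, fanout=16):
    """Spread linearly-named files across `fanout` subfolders
    (round-robin), so parallel uploads target distinct name ranges."""
    return [f"{i % fanout:02x}/{name}" for i, name in enumerate(filenames)]

paths = split_into_folders(["0.jpg", "1.jpg", "2.jpg", "3.jpg"], fanout=2)
# -> ['00/0.jpg', '01/1.jpg', '00/2.jpg', '01/3.jpg']
```

The catch, as noted above, is that something now has to manage those folders forever, which is exactly the housekeeping burden Bumblecam wanted to avoid.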
What we finally settled on was prepending a hash of the filename to the filename itself. There are lots of hash functions out there, but we settled on one that generates a uniform distribution of values over a fixed range (e.g. 00000000 to FFFFFFFF). This allows GCS to partition the fixed range into shards for better load balancing.
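A minimal sketch of the prefix scheme, using the first eight hex digits of an MD5 digest as the uniform prefix (MD5 is used here purely for its distribution, not for security; any hash with uniform output works just as well):

```python
import hashlib

def hashed_name(path):
    """Prepend an 8-hex-digit hash prefix (uniform over 00000000-ffffffff)
    to the object name, so consecutive timestamps scatter across the
    keyspace instead of clustering in one index range."""
    prefix = hashlib.md5(path.encode("utf-8")).hexdigest()[:8]
    return f"{prefix}/{path}"

name = hashed_name("2017/06/01/customer42/1496300000.jpg")
```

Because the prefix is derived from the path itself, it is deterministic: the client can recompute the object name later without storing any extra mapping.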
For BUMBLECAM, this was an easy fix that delivered a massive increase in performance.