Optimizing Google Cloud Storage small file upload performance
DNA SOAP is a bio-tech company focused on understanding more about the human genome.
Their latest release was a software package enabling a distributed protein-folding compute model, in which internal and external machines could be used together to run protein folding calculations.
While this was a great, cost-effective way to tackle a very compute-heavy problem, their new software design wasn’t performing as well as they wanted. Any time they created a new data set to be worked on, there was a 5–6 minute delay before any of the clients started doing computation.
Their architecture was pretty straightforward. Once a new genome dataset came in, it would be mirrored across regions, and then a Data Coordinator would chop the genome into workable blocks and upload that data to Google Cloud Storage. Clients listening on the service would be notified of the available work via Pub/Sub and would grab the blocks directly from GCS to start working.
The problem, as defined by DNA SOAP, was that it took too long to upload all the files to GCS. Here was the interesting part: the Data Coordinator would adjust the number and size of blocks based on the expected number of clients in that region. So with lots of users, there could be a lot of smaller files being sent around. The assumption, then, was that there must be some correlation between file size and GCS upload performance.
Trying to duplicate
This seems like a straightforward enough problem to duplicate. We can generate a bunch of files of various sizes, upload each one to a GCS bucket individually, and see how the timing changes with size.
To test this, I created a small Python script which generated files of various sizes and uploaded each file to a GCS regional bucket 100 times (with a random name each time to avoid collisions). Below, you see the performance (bytes/sec) charted against the size of the objects.
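A minimal sketch of such a benchmark looks like this. The bucket name and object sizes below are placeholders, and the upload path assumes the `google-cloud-storage` client library and working credentials:

```python
import os
import time
import uuid

def make_payload(size):
    """Generate `size` bytes of random data so compression can't skew results."""
    return os.urandom(size)

def throughput(size, elapsed):
    """Bytes per second for a single transfer."""
    return size / elapsed

def run_benchmark(bucket_name, sizes, repeats=100):
    # Imported here so the pure helpers above work without the library installed.
    from google.cloud import storage
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    results = {}
    for size in sizes:
        payload = make_payload(size)
        rates = []
        for _ in range(repeats):
            # Random object name each time to avoid collisions.
            blob = bucket.blob(f"bench/{uuid.uuid4().hex}")
            start = time.monotonic()
            blob.upload_from_string(payload)
            rates.append(throughput(size, time.monotonic() - start))
        results[size] = sum(rates) / len(rates)
    return results

# Usage (requires credentials and an existing bucket):
# results = run_benchmark("my-test-bucket", [1024, 10240, 102400, 1024000])
```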
As the graph shows, upload speed improves as the object size increases, which is mostly due to the reduction of transactional overhead per request.
The reason for this is that GCS is really strong in terms of single-stream throughput, but there’s a fixed latency cost per request, related to ensuring that each object is uploaded and replicated properly. For small files, that fixed cost dominates the transfer time; as files get larger, the overhead becomes a smaller fraction of the total, resulting in higher throughput.
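You can see the shape of this effect with a toy latency model. The overhead and line-rate numbers below are illustrative stand-ins, not measured GCS figures; the point is only that effective throughput approaches the line rate as objects grow:

```python
def effective_throughput(size_bytes, overhead_s=0.05, line_rate_bps=100e6):
    """Toy model: each request pays a fixed overhead plus time on the wire."""
    return size_bytes / (overhead_s + size_bytes / line_rate_bps)

# A small object is dominated by the fixed per-request cost...
small = effective_throughput(10 * 1024)          # ~10 KiB object
# ...while a large object amortizes it almost entirely.
large = effective_throughput(100 * 1024 * 1024)  # ~100 MiB object
print(f"small: {small / 1e6:.2f} MB/s, large: {large / 1e6:.2f} MB/s")
```

Under these made-up parameters the small object achieves a fraction of the line rate while the large one gets close to it, which matches the shape of the measured curve.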
This concept of high overhead per operation is not a new one. If you’ve ever done SIMD programming on the CPU, the same idea exists: batch your operations together so that the overhead of each operation is amortized across the set.
For DNA SOAP, fixing their upload performance on GCP followed the same idea: Parallel uploads.
gsutil provides the `-m` option, which performs a parallel (multi-threaded/multi-process) copy and can significantly increase upload performance. Here’s the same test, but batch-uploading one hundred 200 KB files using `gsutil -m cp <src folder> <dst folder>` rather than uploading each one individually.
Although it’s more pronounced at the right side of the graph, the log-scale version shows that `-m` gives significant improvements in performance at the left side (smaller file sizes) as well.
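The win here isn't specific to gsutil; any client that issues requests concurrently amortizes the per-request latency across the batch. The sketch below simulates that fixed latency with a sleep so you can see the effect without touching the network (a model of the behavior, not a real transfer):

```python
import time
from concurrent.futures import ThreadPoolExecutor

PER_REQUEST_OVERHEAD_S = 0.02  # stand-in for the fixed round-trip cost

def fake_upload(name):
    # Simulate the fixed latency cost of one small-object upload.
    time.sleep(PER_REQUEST_OVERHEAD_S)
    return name

files = [f"block-{i}.dat" for i in range(50)]

start = time.monotonic()
for f in files:
    fake_upload(f)  # one at a time: overhead adds up linearly
serial = time.monotonic() - start

start = time.monotonic()
with ThreadPoolExecutor(max_workers=10) as pool:  # like gsutil -m
    list(pool.map(fake_upload, files))
parallel = time.monotonic() - start

print(f"serial: {serial:.2f}s, parallel: {parallel:.2f}s")
```

With ten workers, the fifty simulated round trips overlap instead of queueing, which is exactly why `-m` helps most when there are many small objects.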
The fix is in!
GCS is a powerhouse when it comes to upload performance, but that power only shows once each request carries enough data to amortize its fixed cost. The transactional overhead of tons of small files means lots of additional round trips (to ensure consistency) and thus lower throughput.
The takeaway? It’s faster to batch-upload your files with `gsutil -m`, but if you have to do things one at a time, make sure you’re using the client API directly rather than invoking gsutil for each one.
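The reason the API wins for one-at-a-time uploads is that a single process can reuse one authenticated client (and its pooled HTTP connection) across every call, instead of paying gsutil's startup and auth cost on each invocation. A sketch of that pattern, assuming the `google-cloud-storage` library and a placeholder bucket name:

```python
def upload_files(bucket_name, paths):
    # Imported inside the function so the sketch can be read and imported
    # without the library installed (`pip install google-cloud-storage`).
    from google.cloud import storage

    client = storage.Client()  # auth and connection setup happen once
    bucket = client.bucket(bucket_name)
    for path in paths:
        # Each upload reuses the client's existing connection, so the only
        # per-object cost is the request itself.
        bucket.blob(path).upload_from_filename(path)

# Usage (requires credentials and an existing bucket):
# upload_files("my-test-bucket", ["block-000.dat", "block-001.dat"])
```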