Introducing gs-fastcopy

David Haley
7 min read · Jul 21, 2024

Efficient data transfer is key to performance, especially when scaling compute horizontally. In domains like bioinformatics, data science, etc., typical workload inputs and outputs range from 100s of megabytes to several gigabytes… or more. The most efficient way to copy data to/from cloud storage is to use CPU and I/O resources in parallel. While one thread uses the disk, another uses the network.

Lab-grade microscope imaging systems for cancer research are evolving from megapixels to gigapixels, to capture an area less than 0.001 square millimeters. That’s a lot of pixels in not a lot of space. Then again, cells are very small. For comparison, a modern smartphone camera sensor produces ~50 megapixels, or ~187.5 MB of data with 10-bit RGB colors.
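
(For the curious: 50 × 10⁶ pixels × 3 color channels × 10 bits = 1.5 × 10⁹ bits, which works out to 187.5 MB.)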

The sheer number of bytes makes scaling hard, never mind processing the data. For instance, architecting around a series of quick steps is impractical if the steps have too much overhead. Overhead challenges are a story for another day. But while exploring those challenges, I learned that standard storage I/O in Python is surprisingly slow: see issue deepcell-imaging#248.

The result: open source Python library gs-fastcopy. [package] [source]

gs-fastcopy quickstart

gs-fastcopy is designed with simplicity in mind. The library provides a file-like interface to cloud storage objects, wrapping the complexity of efficient transfer & (de)compression.

import gs_fastcopy
import numpy as np

with gs_fastcopy.write('gs://my-bucket/my-file.npz.gz') as f:
    np.savez(f, a=np.zeros(12), b=np.ones(23))

with gs_fastcopy.read('gs://my-bucket/my-file.npz.gz') as f:
    npz = np.load(f)
    a = npz['a']
    b = npz['b']

The read API doesn’t support tuning: its underlying implementation, gcloud, picks reasonable defaults. That said, gcloud reads its configuration as usual, so advanced users can still tweak its behavior.

The write API supports these parameters:

  • max_workers: how many processes to allocate at most. The default uses Google’s default, 8 workers.
  • chunk_size: the size in bytes of the pieces to upload in parallel. The default uses Google’s default, 33,554,432 bytes (32 MiB).
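
For example, a tuned write might look like this. This is just a sketch: the values below are illustrative, not recommendations, and I’m assuming the parameters are passed as keyword arguments.

import gs_fastcopy
import numpy as np

# Illustrative tuning only: 16 workers and 64 MiB chunks are example values.
with gs_fastcopy.write(
    'gs://my-bucket/my-file.npz.gz',
    max_workers=16,
    chunk_size=64 * 1024 * 1024,
) as f:
    np.savez(f, a=np.zeros(12), b=np.ones(23))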

Compression uses pigz if the object name ends in .gz. For downloads, unpigz is called on the downloaded file before streaming it to memory. For uploads, once the file is written, pigz compresses it before it’s uploaded to cloud storage.
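
As a rough sketch of that flow (simplified, and not gs-fastcopy’s exact internals), the suffix check and the pigz/unpigz subprocess calls look something like this:

import subprocess

def maybe_compress(local_path, object_name):
    """Sketch only: compress with pigz when the destination ends in .gz."""
    if object_name.endswith('.gz'):
        # pigz uses all available cores and replaces the file with a .gz copy.
        subprocess.run(['pigz', local_path], check=True)
        return local_path + '.gz'
    return local_path

def maybe_decompress(local_path):
    """Sketch only: decompress a downloaded .gz file with unpigz."""
    if local_path.endswith('.gz'):
        # unpigz replaces the .gz file with its decompressed contents.
        subprocess.run(['unpigz', local_path], check=True)
        return local_path[:-len('.gz')]
    return local_path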

Background & motivation

Standard read/write patterns are serial

If you follow Google’s sample code to read/write files [permalink 20240711], you’ll see the Blob interface:

from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.bucket("my_bucket")
blob = bucket.blob("path/to/my/file")

with blob.open("w") as f:
    f.write("Hello world")

with blob.open("r") as f:
    print(f.read())

This provides a simple & idiomatic, file-like interface to cloud storage.

Libraries like smart_open provide even more convenience, supporting multiple cloud providers and on-the-fly (de)compression.

import smart_open

for line in smart_open.open('gs://my_bucket/my_file.txt'):
    print(line)

with smart_open.open('gs://my_bucket/my_file.txt', 'wb') as fout:
    fout.write(b'binary string')

Under the hood, smart_open wraps the Blob interface for reads and writes [as of 2024-07-11]. Alas, the Blob interface reads and writes serially, which means smart_open’s inline compression also runs serially.

Opportunities for parallelism

Parallelism doesn’t magically increase bandwidth. It lets the computers (client and cloud) keep working while other tasks are pending. When one thread writes to memory, another reads from the network. When one thread’s TCP buffer fills up waiting for ACKs, another is still sending data.

Similarly for (de)compression, which adds data-processing CPU time on top of disk/memory/network I/O. Serial processing means I/O devices sit idle while the algorithm runs.

Theory aside, here’s the data point that really motivated me.

Saving raw predictions output to gs://.../raw_predictions.npz
Saved output in 76.97 s

The save wrote a 1.2 GB npz file using savez_compressed [docs]. Using smart_open, the app opens a streaming Blob write, passing through the built-in gzip.GzipFile streaming compression [docs]. Both the compute & storage were in GCP region us-central1. The data was “only” 1.2 GB: the transfer should have been very fast, but it took ~75 s. Was the transfer fast but the compression slow?
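
For reference, the save looked roughly like this (a simplified reconstruction, not the app’s actual code; the bucket path and array are placeholders):

import numpy as np
import smart_open

# Placeholder array standing in for the real 1.2 GB predictions output.
predictions = np.zeros((1024, 1024), dtype=np.float32)

# A streaming Blob write: the compression and the upload both run serially.
with smart_open.open('gs://my-bucket/raw_predictions.npz', 'wb') as f:
    np.savez_compressed(f, predictions=predictions)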

I reached out to my friend Lynn Langit to gut-check the timing. She reminded me that the command-line tool gsutil uses parallelization. I did some research and found this great benchmarking work by Christopher Madden: High throughput file transfers with Google Cloud Storage (GCS).

I summarized Christopher’s work for comparison [spreadsheet]:

Bar chart of upload/download speeds for serial vs. parallel approaches, summarizing data in High throughput file transfers with Google Cloud Storage (GCS) by Christopher Madden. See the spreadsheet linked above for numbers.

These measurements show conclusively that parallel transfer is an order of magnitude faster than serial. 🏎️

So which parallel approach should we choose? The gcloud CLI or the XML multipart API in Python? Both upload from and download to the file system.

gcloud CLI

  • (pro) default settings make good use of available resources
  • (pro) standard CLI tool using standard APIs
  • (con) uses composite uploads which don’t play well with storage classes & retention policies
  • (con) normal installation pulls in full CLI, not just storage component

XML multipart API

  • (pro) extremely fast with tuned settings
  • (pro) specialized API works well with retention, etc.
  • (pro) attempts to clean up intermediate files on exception
  • (con) default settings don’t scale with resources
  • (con) specialized API requires extra permissions
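
For reference, here’s roughly what an XML multipart upload looks like with the Python client’s transfer_manager helper (bucket, object, and file names are placeholders; the parameter values are illustrative):

from google.cloud import storage
from google.cloud.storage import transfer_manager

client = storage.Client()
bucket = client.bucket('my-bucket')
blob = bucket.blob('path/to/my/file.npz.gz')

# Upload the local file in parallel chunks via the XML multipart API.
# chunk_size and max_workers should be tuned to the machine and network.
transfer_manager.upload_chunks_concurrently(
    'local-file.npz.gz',
    blob,
    chunk_size=32 * 1024 * 1024,
    max_workers=8,
)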

Compression

There’s another factor: compression. We want to store & transfer compressed files to reduce time & costs.

The Python XML multipart API doesn’t support compression options. Although gcloud storage cp supports in-flight compression with --gzip-in-flight-all, that leaves the file uncompressed in GCS and compresses serially. gcloud also supports local gzip before uploading with --gzip-local-all, which does write a compressed file to GCS; however, it too compresses the entire file serially.

pigz (pronounced pig-zee) is a Parallel Implementation of GZip by Mark Adler, one of the zlib & gzip co-authors. It divides the input into chunks which are compressed in parallel. Its counterpart unpigz doesn’t decompress in parallel, but it does make use of multiple threads. (Note that decompression is algorithmically simpler than compression.)

Design decisions

For download, we’ll use gcloud because it provides reasonable defaults: naive usage will be good usage. Although XML multipart can provide ~4x speedups when properly tuned, that’s only for in-memory downloads. We use unpigz to decompress downloaded objects, so we need the file system anyway.

Given that gcloud and XML multipart bandwidths are comparable (in fact, gcloud is faster even with defaults), let’s use the simpler tool that doesn’t require tuning: gcloud.
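
As a sketch of that download path (simplified, and not gs-fastcopy’s exact code), shelling out to gcloud might look like:

import os
import subprocess
import tempfile

def download_with_gcloud(gs_uri):
    """Sketch only: parallel download via the gcloud CLI, returning a local path."""
    tmp_dir = tempfile.mkdtemp()
    local_path = os.path.join(tmp_dir, os.path.basename(gs_uri))
    # gcloud storage cp applies sensible parallel defaults out of the box.
    subprocess.run(['gcloud', 'storage', 'cp', gs_uri, local_path], check=True)
    return local_path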

Future work: if we aren’t using compression, it’s better to stream directly to memory. Perhaps even with compression, it might be faster to stream to memory and then decompress in-memory as well.

For upload, the XML multipart bandwidth is much higher, even if the user needs to tune the parallelization. Furthermore, and perhaps more importantly, composite uploads don’t play nicely with storage features like retention policies & storage classes. Composite uploads wrap the normal APIs to write a series of objects, then call a compose API to join them, after which the intermediate objects are deleted. With retention policies, those intermediate objects can’t be deleted…

Conclusion: XML multipart for uploads.

Future work: fall back to composite uploads if we don’t have multipart permissions.

Future work(?): add helpers to detect the GCP machine type & select parallelization settings accordingly. This is hard, though, because we don’t know what else might be happening with machine resources, and we don’t want to disrupt other processes on the machine.
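
A minimal sketch of such a heuristic (hypothetical, not part of gs-fastcopy today) might bound the worker count by the visible CPUs:

import os

def pick_max_workers(default=8, cap=32):
    """Hypothetical heuristic: bound upload workers by the visible CPU count."""
    cpus = os.cpu_count()
    if cpus is None:
        return default
    return max(1, min(cpus, cap))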

For compression, we’ll use pigz and unpigz. We need files on disk anyway. I was astonished by how fast these tools are.

Tests

Look, ma, I have tests! 😎

The tests exercise uncompressed and compressed reads & writes while mocking out GCP calls (gcloud storage cp on the CLI, and upload_chunks_concurrently). In particular, the pigz and unpigz subprocess invocations are tested.

Benchmarking

Here are some measurements using data from the work stream that inspired this project, cancer research using DeepCell for cellular segmentation.

I used this benchmarking script [permalink]. It conducts a series of uploads & downloads using smart_open aka the Blob interface for serial transfer, and gs_fastcopy for parallel transfer. I ran everything in us-central1. The uncompressed numpy file size was 2.1 GB; it compressed to 1.2 GB using gzip and zipfile algorithms.

I ran it from my home computer (my internet speed tests hover around 400 Mbps), and from two machine types: n1-standard-8 and n1-standard-32 (machines we use a lot in our DeepCell work). I used all default settings.

Benchmark results [spreadsheet]

This really shows the impact of parallelism when the network itself isn’t the limiting factor. Here’s another graph with just the cloud results.

Benchmark results [spreadsheet]

This graph shows how important tuning is. Christopher’s work above demonstrated much faster upload speed, by matching workers to machine size. Here, with all default settings, it seems we get the most benefit when uploading a file compressed with pigz (parallel gzip).

Goes to show: the gs-fastcopy work is far from done. The next big step to achieve fast speeds by default is to inspect available processors to pick an appropriate number of upload workers.

Thanks for reading. I’d love to know what you think and/or your experiences using gs-fastcopy. Feel free to submit feedback, or open issues and pull requests in the GitHub repo!

Do you want to understand & improve your workload performance but aren’t sure where to start? I can help! I’m a recognized Google Cloud Expert, and have decades of experience building, tuning, and operating software including on GCP, AWS, and Azure. ️⚙️

Learn more: RedwoodConsulting.io
