Medic7 was seeing a problem uploading their large video files. Their clients would make a video of some genetic test, and then upload it, and it was taking forever. They needed help.
To get a sense of their problem, I uploaded a bunch of 100 MB, 200 MB, and 500 MB files to measure the performance. As you can see in the graph below, the current upload performance seems to cap out at an object size of about 200 MB.
So if we’re uploading a bunch of 4 GB files (say, for video editing), we need a better plan.
Breaking up the thing
The answer to this problem smacks you in the face while using gsutil. Any time you try to upload a “large file” you’ll see the following message.
Breaking this down, gsutil can automatically use object composition to perform uploads in parallel for large, local files being uploaded to Google Cloud Storage. This process works by splitting a large file into component pieces that are uploaded in parallel and then composed in the cloud (with the temporary components deleted at the end).
You can enable this by setting the `parallel_composite_upload_threshold` option on gsutil (or by updating your .boto file, as the console output suggests):
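To make the split, upload, compose, and clean-up steps concrete, here is a minimal in-memory sketch. It is not gsutil’s implementation: the `fake_bucket` dictionary, the `upload_component` helper, the `.partNNN` naming, and the tiny 4-byte component size are all stand-ins invented for illustration, and a real upload would talk to Cloud Storage rather than a dictionary.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a bucket: object name -> bytes.
fake_bucket = {}


def upload_component(name, data):
    """Pretend to upload one component (a real client would do network I/O)."""
    fake_bucket[name] = data
    return name


def parallel_composite_upload(data, dest_name, component_size):
    # 1. Split the payload into fixed-size components.
    parts = [data[i:i + component_size]
             for i in range(0, len(data), component_size)]
    names = [f"{dest_name}.part{idx:03d}" for idx in range(len(parts))]

    # 2. Upload the components in parallel.
    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(upload_component, names, parts))

    # 3. Compose: concatenate the components, in order, into the final object.
    fake_bucket[dest_name] = b"".join(fake_bucket[n] for n in names)

    # 4. Delete the temporary components.
    for n in names:
        del fake_bucket[n]
    return names


payload = b"0123456789abcdef!!"
temp_names = parallel_composite_upload(payload, "video.bin", component_size=4)
print(fake_bucket["video.bin"] == payload, len(temp_names))
```

The ordering in step 3 is the part that matters: the components can finish uploading in any order, but they have to be composed in their original sequence for the final object to match the source file.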
gsutil -o GSUtil:parallel_composite_upload_threshold=150M cp ./localbigfile gs://your-bucket
Where `localbigfile` is a file larger than 150 MiB. gsutil will split the file into chunks and upload them in parallel, increasing upload performance. (Note: the threshold only decides which files get split; chunk size is governed separately by the `parallel_composite_upload_component_size` option, and there are some restrictions on the number of chunks that can be used. Refer to the docs for more information.)
Here’s a graph showing 100 instances of uploading a 300 MB file, first as a regular upload and then as a composite upload.
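To see how a cap on the number of chunks interacts with chunk size, here is a small planning sketch. The `plan_components` function is hypothetical, and the cap of 32 is an assumption taken from Cloud Storage’s limit of 32 source objects per compose request; check the gsutil docs for the limits that apply to your version.

```python
import math

# Assumed cap, mirroring the 32-source limit of a single compose request.
MAX_COMPONENTS = 32


def plan_components(file_size, component_size, max_components=MAX_COMPONENTS):
    """Return (offset, length) byte ranges that together cover the file."""
    if file_size <= 0 or component_size <= 0:
        raise ValueError("sizes must be positive")
    count = min(math.ceil(file_size / component_size), max_components)
    # If the cap kicked in, grow each component so the ranges still cover the file.
    size = math.ceil(file_size / count)
    return [(offset, min(size, file_size - offset))
            for offset in range(0, file_size, size)]


MiB = 1024 * 1024
small = plan_components(300 * MiB, 150 * MiB)       # fits under the cap
large = plan_components(4 * 1024 * MiB, 50 * MiB)   # would need 82 pieces
print(len(small), len(large), large[0][1] // MiB)
```

In this sketch a 4 GiB file with a 50 MiB target would need 82 pieces, so the planner falls back to 32 larger pieces instead. That is the kind of interaction the chunk restriction introduces.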
The challenge with composite uploads
Using parallel composite uploads presents a tradeoff between upload performance and download configuration. If you enable them, your uploads will run faster, but anyone who fetches the object using gsutil (or other Python apps) will need to install a compiled crcmod (see `gsutil help crcmod`) in order to download the file properly.
To be clear, this restriction for crcmod is temporary, and mostly there to protect the integrity of the data and ensure your client doesn’t end up freaking out that things might look different. (CRC values and HTTP ETag headers might show some difference.)
However, if this doesn’t work for your setup, you’ve got three options:
1) Maybe turn it off? Modify the `check_hashes` option of your config files to disable this step. NOTE: It is strongly recommended that you not disable integrity checks. Doing so could allow data corruption to go undetected during uploading/downloading.
2) Don’t use gsutil? To be clear, this isn’t an endorsed recommendation. But if you download the composite object using cURL, wget, or plain HTTP, the fetch will work (you get the composited object). It’s still strongly advised to do CRC checking; it’s just your responsibility to do it now.
3) Machine-in-the-middle? Another way to work around this problem is to use a cloud-side instance to download the file (since crcmod can be installed there), and then re-upload it to a bucket in its entirety. To be clear, this takes time and is more expensive (in terms of transaction costs). However, it completely removes the crcmod restriction, and it might be a net win time-wise, since GCP can easily get ~16 Gbit/s upload speed from an internal VM.
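If you go with option 2 and fetch the object outside gsutil, the check you take on is a CRC32C comparison. Here is a minimal pure-Python sketch of that check, assuming you have the object’s expected checksum as the base64 string Cloud Storage reports in the object metadata. The `verify_download` helper and its `expected_b64` argument are hypothetical names. Pure Python is slow on multi-gigabyte files, which is why a compiled implementation is preferable in practice.

```python
import base64
import struct


def _build_table():
    # Lookup table for CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _build_table()


def crc32c(chunks):
    """Compute CRC-32C over an iterable of bytes objects."""
    crc = 0xFFFFFFFF
    for chunk in chunks:
        for byte in chunk:
            crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def crc32c_b64(chunks):
    # Base64 of the checksum as 4 big-endian bytes.
    return base64.b64encode(struct.pack(">I", crc32c(chunks))).decode("ascii")


def read_chunks(path, chunk_size=1024 * 1024):
    """Yield the file in pieces so large downloads are never fully in memory."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return
            yield chunk


def verify_download(path, expected_b64):
    """Hypothetical helper: True if the local file matches the expected CRC32C."""
    return crc32c_b64(read_chunks(path)) == expected_b64
```

After a cURL or wget download, you would call something like `verify_download("localbigfile", expected_b64)` with the checksum read from the object’s metadata, and treat a `False` result as a corrupted transfer.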
The fix is in!
For Medic7, putting crcmod on each of their internal clients was not an issue: uploads had to be fast, and the videos were then processed internally before being moved to another GCS bucket for distribution, so the machine-in-the-middle approach was almost the de facto setup for them. The use of composite objects resulted in a 50% performance improvement for their clients, which is pretty great!