Google Cloud Storage ~ Upload a large file using cURL

Rosyparmar
4 min read · Dec 19, 2022

There are multiple ways to upload files into Google Cloud Storage today. Some of these are listed below:

  1. Client Libraries
  2. Cloud Console
  3. XML/JSON API Multipart Upload
  4. gsutil tool

We are going to look at the XML API multipart upload in this article.

As the name suggests, this upload method uploads a file in parts and then assembles the parts into a final object that lands in your storage bucket. XML API multipart uploads are compatible with Amazon S3 multipart uploads. This API is the preferred approach when:

  1. You have a large file to upload. With this API you can upload parts of the file in parallel, reducing the time it would otherwise take to upload the entire file in a single request.
  2. You want cheap retries. If one of the requests fails, you only have to upload that specific part again instead of restarting the entire operation.
  3. You are using tools and libraries that must work across different storage providers, or you are migrating from another storage provider to Cloud Storage. In the latter case, you only need to make a few simple changes to your existing tools and libraries to begin sending requests to Cloud Storage.

Some of the size limitations to keep in mind when using this API include:

  • A multipart upload can have up to 10,000 parts.
  • An individual part has a maximum size limit of 5 GiB.
  • An individual part has a minimum size limit of 5 MiB, unless it’s the last part, which has no minimum size limit.
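
The limits above can be checked before you split anything. The following is a minimal sketch, not part of any Google tool: check_part_plan is a hypothetical helper name, and the file size used in the usage note is an illustrative value rather than the real size of the dataset.

```shell
#!/usr/bin/env bash
# Sketch: validate a planned part size against the multipart upload limits.
# check_part_plan is a hypothetical helper, not an official command.

MIN_PART=$((5 * 1024 * 1024))          # 5 MiB minimum (except the last part)
MAX_PART=$((5 * 1024 * 1024 * 1024))   # 5 GiB maximum per part
MAX_PARTS=10000                        # at most 10,000 parts per upload

# Usage: check_part_plan FILE_SIZE_BYTES PART_SIZE_BYTES
# Prints the resulting number of parts; returns non-zero if a limit is violated.
check_part_plan() {
  local file_size=$1 part_size=$2
  if [ "$part_size" -gt "$MAX_PART" ]; then
    echo "part size exceeds 5 GiB" >&2
    return 1
  fi
  # Ceiling division: the last part may be smaller than part_size.
  local parts=$(( (file_size + part_size - 1) / part_size ))
  # The minimum only matters when there is more than one part.
  if [ "$parts" -gt 1 ] && [ "$part_size" -lt "$MIN_PART" ]; then
    echo "part size below 5 MiB" >&2
    return 1
  fi
  if [ "$parts" -gt "$MAX_PARTS" ]; then
    echo "too many parts: $parts" >&2
    return 1
  fi
  echo "$parts"
}
```

For example, check_part_plan 1700000000 $((250 * 1024 * 1024)) reports how many 250 MiB parts a file of 1,700,000,000 bytes would need.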

For the demonstration today, I am using a public Covid-19 dataset from Kaggle. I am going to use the curl command to quickly test the API's functionality. For authentication, I am going to use OAuth credentials.

Step 1: Split your large file into parts:

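There are multiple ways you can do this. I am going to use the split command in Linux to split the file by size. The M suffix means MiB (1,048,576 bytes) and is accepted by both GNU and BSD split. The last argument is the prefix for the part files, so the parts are named covid19aa, covid19ab, and so on.

split -b 250M covid19-dataset.csv covid19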
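
Before uploading, it can be worth confirming that the parts reassemble into the original file. Here is a small sketch of that check; split_and_verify is a hypothetical helper name, and it assumes no other files already match the chosen prefix.

```shell
#!/usr/bin/env bash
# Sketch: split a file, then confirm the parts concatenate back to the original.
# split_and_verify is a hypothetical helper name.

# Usage: split_and_verify FILE PART_SIZE PREFIX
split_and_verify() {
  local file=$1 size=$2 prefix=$3
  split -b "$size" "$file" "$prefix" || return 1
  # split names the parts PREFIXaa, PREFIXab, ... so the glob expands
  # in the same order in which the parts must be uploaded.
  cat "$prefix"?? | cmp -s - "$file" || return 1
  # Print the part files, one per line, in part-number order.
  ls "$prefix"??
}
```

With the file from this article, the call would be split_and_verify covid19-dataset.csv 250M covid19.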

Step 2: Initiate a multipart upload:

This initial request generates an upload ID for use in subsequent PUT requests to upload the data in parts.

curl -X POST -H "Authorization: Bearer #Place_your-OAuth-Token" \
-H "Content-Length: 0" -H "Content-Type:text/csv" \
"https://storage.googleapis.com/curl-test-upload/covid19data.csv?uploads"

Your output should look like:

<?xml version='1.0' encoding='UTF-8'?><InitiateMultipartUploadResult xmlns='http://s3.amazonaws.com/doc/2006-03-01/'><Bucket>curl-test-upload</Bucket><Key>covid19data.csv</Key><UploadId>ABPnzm7hnJXk2rFnv8k5VO5LPdJgyT96qDBiJeWZZ7nFeZ9BLsGUmcuCnk9Wolh38npSN88</UploadId></InitiateMultipartUploadResult>
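
Rather than copying the upload ID by hand, you could pull it out of the response with sed. This is a minimal sketch that assumes the response arrives on a single line, as in the output above; extract_upload_id is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: read an InitiateMultipartUploadResult document on stdin
# and print the text inside its <UploadId> element.
# extract_upload_id is a hypothetical helper name.
extract_upload_id() {
  # ':' is used as the sed delimiter because the pattern contains '/'.
  sed -n 's:.*<UploadId>\([^<]*\)</UploadId>.*:\1:p'
}
```

You would pipe the output of the initiate request into it, for example by adding -s to that curl command and appending | extract_upload_id, then store the result in a shell variable for the following steps.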

Step 3: Upload a part

Each curl execution below uploads one part of the multipart upload and returns a unique ETag for that part. The ETags must be used when completing the multipart upload.

curl -v -X PUT --data-binary @covid19aa -H "Authorization: Bearer #Place_your-OAuth-Token" \
-H "Content-Type:text/csv" \
"https://storage.googleapis.com/curl-test-upload/covid19data.csv?partNumber=1&uploadId=ABPnzm7hnJXk2rFnv8k5VO5LPdJgyT96qDBiJeWZZ7nFeZ9BLsGUmcuCnk9Wolh38npSN88"

curl -v -X PUT --data-binary @covid19ab -H "Authorization: Bearer #Place_your-OAuth-Token" \
-H "Content-Type:text/csv" \
"https://storage.googleapis.com/curl-test-upload/covid19data.csv?partNumber=2&uploadId=ABPnzm7hnJXk2rFnv8k5VO5LPdJgyT96qDBiJeWZZ7nFeZ9BLsGUmcuCnk9Wolh38npSN88"

curl -v -X PUT --data-binary @covid19ac -H "Authorization: Bearer #Place_your-OAuth-Token" \
-H "Content-Type:text/csv" \
"https://storage.googleapis.com/curl-test-upload/covid19data.csv?partNumber=3&uploadId=ABPnzm7hnJXk2rFnv8k5VO5LPdJgyT96qDBiJeWZZ7nFeZ9BLsGUmcuCnk9Wolh38npSN88"

curl -v -X PUT --data-binary @covid19ad -H "Authorization: Bearer #Place_your-OAuth-Token" \
-H "Content-Type:text/csv" \
"https://storage.googleapis.com/curl-test-upload/covid19data.csv?partNumber=4&uploadId=ABPnzm7hnJXk2rFnv8k5VO5LPdJgyT96qDBiJeWZZ7nFeZ9BLsGUmcuCnk9Wolh38npSN88"

curl -v -X PUT --data-binary @covid19ae -H "Authorization: Bearer #Place_your-OAuth-Token" \
-H "Content-Type:text/csv" \
"https://storage.googleapis.com/curl-test-upload/covid19data.csv?partNumber=5&uploadId=ABPnzm7hnJXk2rFnv8k5VO5LPdJgyT96qDBiJeWZZ7nFeZ9BLsGUmcuCnk9Wolh38npSN88"

curl -v -X PUT --data-binary @covid19af -H "Authorization: Bearer #Place_your-OAuth-Token" \
-H "Content-Type:text/csv" \
"https://storage.googleapis.com/curl-test-upload/covid19data.csv?partNumber=6&uploadId=ABPnzm7hnJXk2rFnv8k5VO5LPdJgyT96qDBiJeWZZ7nFeZ9BLsGUmcuCnk9Wolh38npSN88"

curl -v -X PUT --data-binary @covid19ag -H "Authorization: Bearer #Place_your-OAuth-Token" \
-H "Content-Type:text/csv" \
"https://storage.googleapis.com/curl-test-upload/covid19data.csv?partNumber=7&uploadId=ABPnzm7hnJXk2rFnv8k5VO5LPdJgyT96qDBiJeWZZ7nFeZ9BLsGUmcuCnk9Wolh38npSN88"

Because of the -v flag, the output includes the response headers. Capture the ETag value from the response headers of each part; you will need it in the next step.
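
The seven near-identical commands can also be written as a loop that records each ETag. The following is a sketch under assumptions, not the method used above: TOKEN, BUCKET, OBJECT and UPLOAD_ID are variables you would set yourself, and etag_from_headers and upload_parts are hypothetical helper names. It uses curl's -D option to write the response headers to a temporary file.

```shell
#!/usr/bin/env bash
# Sketch only. TOKEN, BUCKET, OBJECT and UPLOAD_ID are assumed to be set
# by the caller; both function names are hypothetical.

# Print the ETag value from a dump of HTTP response headers on stdin.
etag_from_headers() {
  # Header names are case-insensitive; drop CRs and the surrounding quotes.
  tr -d '\r' | awk 'tolower($1) == "etag:" { gsub(/"/, "", $2); print $2; exit }'
}

# Upload the part files given as arguments, in order.
# Prints one "PART_NUMBER ETAG" line per uploaded part.
upload_parts() {
  local n=0 part headers
  headers=$(mktemp)
  for part in "$@"; do
    n=$((n + 1))
    curl -sf -o /dev/null -D "$headers" -X PUT --data-binary "@$part" \
      -H "Authorization: Bearer $TOKEN" \
      -H "Content-Type: text/csv" \
      "https://storage.googleapis.com/$BUCKET/$OBJECT?partNumber=$n&uploadId=$UPLOAD_ID" \
      || { rm -f "$headers"; return 1; }
    echo "$n $(etag_from_headers < "$headers")"
  done
  rm -f "$headers"
}
```

Calling upload_parts covid19a? would upload the parts in order, because the shell expands the glob alphabetically. The loop runs sequentially; uploading parts in parallel is possible but is left out to keep the sketch short.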

Step 4: Complete a multipart upload

This request completes the multipart upload by concatenating the parts into a single object.

curl -X POST -H "Authorization: Bearer #Place_your-OAuth-Token" \
-H "Content-Type:application/xml" \
-d "<CompleteMultipartUpload>\
<Part><PartNumber>1</PartNumber><ETag>b054435419dfc8d3f9302057c6d3bfe5</ETag></Part>\
<Part><PartNumber>2</PartNumber><ETag>a06f0d011f815d2bee2923159b9400ea</ETag></Part>\
<Part><PartNumber>3</PartNumber><ETag>5d8eb89ce5dc25d54896006b6583eaa0</ETag></Part>\
<Part><PartNumber>4</PartNumber><ETag>30b76dcc430cb975144d4493697186c4</ETag></Part>\
<Part><PartNumber>5</PartNumber><ETag>a2a6ca083c7e6d13893dc34c6daf43f6</ETag></Part>\
<Part><PartNumber>6</PartNumber><ETag>44917f12e6eab57a58afa23f2a54932e</ETag></Part>\
<Part><PartNumber>7</PartNumber><ETag>f3cb27fc77a43800f2492e5500f6f259</ETag></Part>\
</CompleteMultipartUpload>" \
"https://storage.googleapis.com/curl-test-upload/covid19data.csv?uploadId=ABPnzm7hnJXk2rFnv8k5VO5LPdJgyT96qDBiJeWZZ7nFeZ9BLsGUmcuCnk9Wolh38npSN88"

Your output should look like:

<?xml version='1.0' encoding='UTF-8'?><CompleteMultipartUploadResult xmlns='http://s3.amazonaws.com/doc/2006-03-01/'><Location>http://storage.googleapis.com/curl-test-upload/covid19data.csv</Location><Bucket>curl-test-upload</Bucket><Key>covid19data.csv</Key><ETag>"c354430a31eadff014a28eaac61b314-7"</ETag></CompleteMultipartUploadResult>
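
Typing the request body for Step 4 by hand gets tedious as the number of parts grows. Here is a small sketch that generates it from a list of part numbers and ETags; build_complete_xml is a hypothetical helper name, and the input format (one "PART_NUMBER ETAG" pair per line) is my own assumption.

```shell
#!/usr/bin/env bash
# Sketch: read "PART_NUMBER ETAG" lines on stdin and print the
# CompleteMultipartUpload request body on a single line.
# build_complete_xml is a hypothetical helper name.
build_complete_xml() {
  printf '<CompleteMultipartUpload>'
  while read -r number etag; do
    printf '<Part><PartNumber>%s</PartNumber><ETag>%s</ETag></Part>' "$number" "$etag"
  done
  printf '</CompleteMultipartUpload>\n'
}
```

The generated document can be saved to a file and passed to curl with -d @filename in place of the inline XML. Parts must be listed in ascending part-number order.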

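You can verify that the object was uploaded in the Cloud Console.

Step 5: Verify that the entire file was uploaded correctly by comparing the hash values of the local file and the uploaded object

Compare the CRC32C values in the two outputs; an object created by an XML API multipart upload has no MD5 hash to compare against.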

gsutil hash covid19-dataset.csv # Local file

gsutil stat gs://curl-test-upload/covid19data.csv # Cloud Storage file
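
The comparison can be scripted instead of done by eye. This is a minimal sketch that assumes both gsutil commands print a line of the form "Hash (crc32c): VALUE"; check that against the output of your gsutil version. crc32c_from_gsutil is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: print the CRC32C value found in `gsutil hash` or `gsutil stat`
# output read from stdin.
# Assumption: the output contains a line like "Hash (crc32c):   VALUE".
# crc32c_from_gsutil is a hypothetical helper name.
crc32c_from_gsutil() {
  awk '/Hash \(crc32c\):/ { print $NF; exit }'
}

# Intended usage (needs gsutil and access to the bucket, so it is left commented out):
# local_crc=$(gsutil hash -c covid19-dataset.csv | crc32c_from_gsutil)
# remote_crc=$(gsutil stat gs://curl-test-upload/covid19data.csv | crc32c_from_gsutil)
# [ "$local_crc" = "$remote_crc" ] && echo "CRC32C values match"
```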


Rosyparmar

Data Analytics@Google | Views, opinions expressed here are my own