Google Cloud Storage ~ Upload a large file using cURL
There are multiple ways to upload files into Google Cloud Storage today. Some of these are listed below:
- Client Libraries
- Cloud Console
- XML/JSON API Multipart Upload
- gsutil tool
We are going to look at XML API Multipart Upload in this article.
As the name suggests, this upload method uploads a file in parts and then assembles the parts into a final object in your storage bucket. XML API multipart uploads are compatible with Amazon S3 multipart uploads. This API is the preferred approach when:
- You have a large file to upload. With this API, parts of the file can be uploaded in parallel, which reduces the time it would otherwise take to upload the entire file in a single request.
- You want resilience to failed requests. If one request fails, you only need to upload that specific part again instead of restarting the entire operation.
- You are using tools and libraries that must work across different storage providers, or you are migrating from another storage provider to Cloud Storage. In the latter case, you only need to make a few simple changes to your existing tools and libraries to begin sending requests to Cloud Storage.
Some of the size limitations to keep in mind when using this API include:
- A multipart upload can have up to 10,000 parts.
- An individual part has a maximum size limit of 5 GiB.
- An individual part has a minimum size limit of 5 MiB, unless it’s the last part, which has no minimum size limit.
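To make these limits concrete, here is a minimal sketch that checks whether a chosen part size fits them. The function names (parts_needed, check_part_plan) and the 1600 MiB file size are hypothetical, picked only for illustration; they are not part of any Google tool.

```shell
#!/bin/sh
# Limits quoted in the list above, expressed in bytes.
MAX_PARTS=10000
MIN_PART_BYTES=$((5 * 1024 * 1024))          # 5 MiB
MAX_PART_BYTES=$((5 * 1024 * 1024 * 1024))   # 5 GiB

# parts_needed FILE_BYTES PART_BYTES
# Prints how many parts the file splits into (ceiling division).
parts_needed() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

# check_part_plan FILE_BYTES PART_BYTES
# Prints "ok N" when the plan fits the limits, otherwise a short reason.
check_part_plan() {
  file_bytes=$1
  part_bytes=$2
  count=$(parts_needed "$file_bytes" "$part_bytes")
  if [ "$part_bytes" -gt "$MAX_PART_BYTES" ]; then
    echo "part too large"
  elif [ "$count" -gt 1 ] && [ "$part_bytes" -lt "$MIN_PART_BYTES" ]; then
    # Only the last part is exempt from the minimum size.
    echo "part too small"
  elif [ "$count" -gt "$MAX_PARTS" ]; then
    echo "too many parts"
  else
    echo "ok $count"
  fi
}

# Hypothetical example: a 1600 MiB file split into 250 MiB parts.
check_part_plan $((1600 * 1024 * 1024)) $((250 * 1024 * 1024))   # prints: ok 7
```

The check is deliberately simple: it assumes every part except the last has the same size, which is what a size-based split produces.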
For today's demonstration, I am using the COVID-19 public dataset from Kaggle. I will use the curl command to quickly test the API, and OAuth credentials for authentication.
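Step 1: Split your large file into parts
There are multiple ways you can do this. I am going to use the split command on Linux to split the file into 250 MiB pieces. The last argument is the prefix for the part names, so the parts are written as covid19aa, covid19ab, and so on.
split -b 250M covid19-dataset.csv covid19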
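Before uploading, it can be worth confirming that the parts really do add up to the original file. The sketch below demonstrates the idea on a tiny stand-in file; sample.csv, the 2500-byte size, and the 1000-byte part size are all hypothetical, chosen so the example runs in a second.

```shell
#!/bin/sh
# Work in a scratch directory so nothing in the current directory is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical stand-in for the real dataset: 2500 bytes of the letter x.
head -c 2500 /dev/zero | tr '\0' 'x' > sample.csv

# Split into 1000-byte parts named sampleaa, sampleab, sampleac.
split -b 1000 sample.csv sample

# The shell expands sample?? in alphabetical order, which is also upload order.
part_count=$(ls sample?? | wc -l | tr -d ' ')
echo "parts: $part_count"

# Concatenating the parts in order should reproduce the original byte for byte.
if cat sample?? | cmp -s - sample.csv; then
  echo "round trip ok"
else
  echo "round trip FAILED"
fi
```

For the real dataset, the same check would be the cat and cmp pipeline with covid19?? and covid19-dataset.csv in place of the sample names.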
Step 2: Initiate a multipart upload
This initial request generates an upload ID for use in subsequent PUT requests to upload the data in parts.
curl -X POST -H "Authorization: Bearer #Place_your-OAuth-Token" \
-H "Content-Length: 0" -H "Content-Type:text/csv" \
"https://storage.googleapis.com/curl-test-upload/covid19data.csv?uploads"
Your output should look like:
<?xml version='1.0' encoding='UTF-8'?><InitiateMultipartUploadResult xmlns='http://s3.amazonaws.com/doc/2006-03-01/'><Bucket>curl-test-upload</Bucket><Key>covid19data.csv</Key><UploadId>ABPnzm7hnJXk2rFnv8k5VO5LPdJgyT96qDBiJeWZZ7nFeZ9BLsGUmcuCnk9Wolh38npSN88</UploadId></InitiateMultipartUploadResult>
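Rather than copying the upload ID by hand, you can pull it out of the response with sed. This is a minimal sketch: the helper name extract_upload_id is my own, and it assumes the response arrives on a single line, as in the sample above.

```shell
#!/bin/sh
# extract_upload_id: reads an InitiateMultipartUploadResult document on stdin
# and prints the text between <UploadId> and </UploadId>.
extract_upload_id() {
  sed -n 's:.*<UploadId>\(.*\)</UploadId>.*:\1:p'
}

# Sample response copied from the output above.
response="<?xml version='1.0' encoding='UTF-8'?><InitiateMultipartUploadResult xmlns='http://s3.amazonaws.com/doc/2006-03-01/'><Bucket>curl-test-upload</Bucket><Key>covid19data.csv</Key><UploadId>ABPnzm7hnJXk2rFnv8k5VO5LPdJgyT96qDBiJeWZZ7nFeZ9BLsGUmcuCnk9Wolh38npSN88</UploadId></InitiateMultipartUploadResult>"

UPLOAD_ID=$(printf '%s\n' "$response" | extract_upload_id)
echo "$UPLOAD_ID"
```

In practice you would pipe the initiate request's curl output straight into extract_upload_id instead of using a saved string.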
Step 3: Upload a part
Each curl command below uploads one part of the multipart upload and returns an ETag for that part. These ETags are required when completing the multipart upload.
curl -v -X PUT --data-binary @covid19aa -H "Authorization: Bearer #Place_your-OAuth-Token" \
-H "Content-Type:text/csv" \
"https://storage.googleapis.com/curl-test-upload/covid19data.csv?partNumber=1&uploadId=ABPnzm7hnJXk2rFnv8k5VO5LPdJgyT96qDBiJeWZZ7nFeZ9BLsGUmcuCnk9Wolh38npSN88"
curl -v -X PUT --data-binary @covid19ab -H "Authorization: Bearer #Place_your-OAuth-Token" \
-H "Content-Type:text/csv" \
"https://storage.googleapis.com/curl-test-upload/covid19data.csv?partNumber=2&uploadId=ABPnzm7hnJXk2rFnv8k5VO5LPdJgyT96qDBiJeWZZ7nFeZ9BLsGUmcuCnk9Wolh38npSN88"
curl -v -X PUT --data-binary @covid19ac -H "Authorization: Bearer #Place_your-OAuth-Token" \
-H "Content-Type:text/csv" \
"https://storage.googleapis.com/curl-test-upload/covid19data.csv?partNumber=3&uploadId=ABPnzm7hnJXk2rFnv8k5VO5LPdJgyT96qDBiJeWZZ7nFeZ9BLsGUmcuCnk9Wolh38npSN88"
curl -v -X PUT --data-binary @covid19ad -H "Authorization: Bearer #Place_your-OAuth-Token" \
-H "Content-Type:text/csv" \
"https://storage.googleapis.com/curl-test-upload/covid19data.csv?partNumber=4&uploadId=ABPnzm7hnJXk2rFnv8k5VO5LPdJgyT96qDBiJeWZZ7nFeZ9BLsGUmcuCnk9Wolh38npSN88"
curl -v -X PUT --data-binary @covid19ae -H "Authorization: Bearer #Place_your-OAuth-Token" \
-H "Content-Type:text/csv" \
"https://storage.googleapis.com/curl-test-upload/covid19data.csv?partNumber=5&uploadId=ABPnzm7hnJXk2rFnv8k5VO5LPdJgyT96qDBiJeWZZ7nFeZ9BLsGUmcuCnk9Wolh38npSN88"
curl -v -X PUT --data-binary @covid19af -H "Authorization: Bearer #Place_your-OAuth-Token" \
-H "Content-Type:text/csv" \
"https://storage.googleapis.com/curl-test-upload/covid19data.csv?partNumber=6&uploadId=ABPnzm7hnJXk2rFnv8k5VO5LPdJgyT96qDBiJeWZZ7nFeZ9BLsGUmcuCnk9Wolh38npSN88"
curl -v -X PUT --data-binary @covid19ag -H "Authorization: Bearer #Place_your-OAuth-Token" \
-H "Content-Type:text/csv" \
"https://storage.googleapis.com/curl-test-upload/covid19data.csv?partNumber=7&uploadId=ABPnzm7hnJXk2rFnv8k5VO5LPdJgyT96qDBiJeWZZ7nFeZ9BLsGUmcuCnk9Wolh38npSN88"
In the verbose output of each request, look for the ETag response header and record its value alongside the part number. You will need one ETag per part in the next step.
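Copying seven ETags out of verbose output is error prone, so here is a rough sketch of how the capture could be scripted. It relies on curl's -D option, which writes the response headers to a file. The helper names, the etags.txt file, and the TOKEN, BUCKET, OBJECT and UPLOAD_ID variables are assumptions of this sketch. The demonstration at the bottom runs offline against a made-up header dump that reuses the part 1 ETag from this article; no request is sent.

```shell
#!/bin/sh
# extract_etag HEADER_FILE
# Prints the ETag header value with carriage returns and any surrounding
# double quotes removed. The header name is matched case-insensitively.
extract_etag() {
  tr -d '\r' < "$1" |
    awk -F': *' 'tolower($1) == "etag" { gsub(/"/, "", $2); print $2 }'
}

# upload_part PART_FILE PART_NUMBER
# Sends one part and appends "PART_NUMBER ETAG" to etags.txt.
# TOKEN, BUCKET, OBJECT and UPLOAD_ID are assumed to be set by the caller.
upload_part() {
  curl -s -o /dev/null -D "headers.$2" -X PUT --data-binary "@$1" \
    -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type:text/csv" \
    "https://storage.googleapis.com/$BUCKET/$OBJECT?partNumber=$2&uploadId=$UPLOAD_ID" &&
    echo "$2 $(extract_etag "headers.$2")" >> etags.txt
}

# A loop over the split output could then replace the seven commands above:
#   n=1; for f in covid19??; do upload_part "$f" "$n"; n=$((n + 1)); done

# Offline demonstration with a simulated header dump.
demo_headers=$(mktemp)
printf 'HTTP/2 200\r\netag: "b054435419dfc8d3f9302057c6d3bfe5"\r\ncontent-length: 0\r\n\r\n' > "$demo_headers"
extract_etag "$demo_headers"
```

The loop uploads parts one after another. Running several upload_part calls in the background would upload them in parallel, at the cost of having to collect failures yourself.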
Step 4: Complete a multipart upload
This request completes the multipart upload by concatenating the parts into a single object.
curl -X POST -H "Authorization: Bearer #Place_your-OAuth-Token" \
-H "Content-Type:application/xml" \
-d "<CompleteMultipartUpload>\
<Part><PartNumber>1</PartNumber><ETag>b054435419dfc8d3f9302057c6d3bfe5</ETag></Part>\
<Part><PartNumber>2</PartNumber><ETag>a06f0d011f815d2bee2923159b9400ea</ETag></Part>\
<Part><PartNumber>3</PartNumber><ETag>5d8eb89ce5dc25d54896006b6583eaa0</ETag></Part>\
<Part><PartNumber>4</PartNumber><ETag>30b76dcc430cb975144d4493697186c4</ETag></Part>\
<Part><PartNumber>5</PartNumber><ETag>a2a6ca083c7e6d13893dc34c6daf43f6</ETag></Part>\
<Part><PartNumber>6</PartNumber><ETag>44917f12e6eab57a58afa23f2a54932e</ETag></Part>\
<Part><PartNumber>7</PartNumber><ETag>f3cb27fc77a43800f2492e5500f6f259</ETag></Part>\
</CompleteMultipartUpload>" \
"https://storage.googleapis.com/curl-test-upload/covid19data.csv?uploadId=ABPnzm7hnJXk2rFnv8k5VO5LPdJgyT96qDBiJeWZZ7nFeZ9BLsGUmcuCnk9Wolh38npSN88"
Your output should look like:
<?xml version='1.0' encoding='UTF-8'?><CompleteMultipartUploadResult xmlns='http://s3.amazonaws.com/doc/2006-03-01/'><Location>http://storage.googleapis.com/curl-test-upload/covid19data.csv</Location><Bucket>curl-test-upload</Bucket><Key>covid19data.csv</Key><ETag>"c354430a31eadff014a28eaac61b314-7"</ETag></CompleteMultipartUploadResult>
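The XML body of the completion request can also be generated instead of typed. The sketch below assumes a text file with one "part number, ETag" pair per line; that file format and the helper name build_complete_body are hypothetical choices of mine. The pairs are sorted numerically so the parts are listed in ascending order, as in the request above.

```shell
#!/bin/sh
# build_complete_body ETAG_FILE
# ETAG_FILE holds one "PART_NUMBER ETAG" pair per line (a format chosen
# for this sketch). Prints a CompleteMultipartUpload document on one line.
build_complete_body() {
  printf '<CompleteMultipartUpload>'
  sort -n "$1" | while read -r number etag; do
    printf '<Part><PartNumber>%s</PartNumber><ETag>%s</ETag></Part>' "$number" "$etag"
  done
  printf '</CompleteMultipartUpload>\n'
}

# Demonstration with the first two ETags from this article, listed out of
# order on purpose to show the sorting.
etag_list=$(mktemp)
printf '%s\n' \
  '2 a06f0d011f815d2bee2923159b9400ea' \
  '1 b054435419dfc8d3f9302057c6d3bfe5' > "$etag_list"

build_complete_body "$etag_list"

# The result could be passed to the completion request like this:
#   curl -X POST ... -d "$(build_complete_body etags.txt)" "https://...?uploadId=..."
```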
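You can verify that the object was uploaded in the Cloud Console.
Step 5: Verify that the entire file was uploaded correctly by comparing the hash values of the local file and the uploaded object. Compare the CRC32C values, since objects created through XML API multipart uploads do not have an MD5 hash.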
gsutil hash covid19-dataset.csv # Local file
gsutil stat gs://curl-test-upload/covid19data.csv # Cloud Storage file
References:
Documentation: https://cloud.google.com/storage/docs/xml-api/overview
Dataset: https://www.kaggle.com/datasets/allen-institute-for-ai/CORD-19-research-challenge