How to use gsutil and Python to deal with files in Google Cloud Storage?

Lynn Kwong
Apr 24
Photo by Chris Ried on Unsplash

Google Cloud Storage can be a convenient option if you want to store your data in the cloud programmatically, especially if you already use the Google Cloud Platform. You can store any data you use in your work, such as plain text files, images, videos, etc. As a beginner, you may prefer to manage your files with the Google Cloud Console, which is very straightforward to use. However, as a developer, you will eventually need the command-line tool or the client library to work with Google Cloud Storage from your code.

Before we introduce the command line tool and the client library, we need to know two basic terms.

  • Bucket. A bucket is a special container that holds your data in Google Cloud Storage. A bucket must have a globally unique name across the whole Google Cloud Storage system. Buckets cannot be nested, but you can create folders inside a bucket to organize your data. A bucket acts like a folder directly under the root directory (/) in a Linux system, such as home, usr, bin, etc.
  • Object. An object is a piece of data stored in a bucket. As mentioned above, it can be any kind of data; it is just a fancy name for a file in Google Cloud Storage. In particular, an object is called a blob in the client libraries. Blob and object mean essentially the same thing but are used in different contexts.

Now that we know the basic terminology, let's begin to use the command-line tool gsutil and the client library google-cloud-storage in Python to work with buckets and objects/blobs. It's better to show the commands in parallel for easier comparison. Before we can use gsutil and the google-cloud-storage library, we need to install and configure them.

  1. Install gsutil.

gsutil is part of the Google Cloud SDK. Depending on your operating system, the procedure to install the Google Cloud SDK can vary. Please check this tutorial for your specific operating system. It should be fairly easy to install. You need to authorize the Cloud SDK after installation. Normally this means running the following commands:

gcloud auth login
gcloud config set project YOUR-PROJECT-ID
gcloud auth application-default login
gcloud auth list

2. Install google-cloud-storage.

Installing google-cloud-storage couldn't be easier:

pip install google-cloud-storage

However, to use this library in your Python code, you need a service account, because each bucket has permissions attached and can only be accessed by authorized clients. You can ask your system administrator to create a service account for you; as a developer, you normally don't need to do it yourself. But if you want to create one for learning purposes, you can follow this tutorial.

When the service account is ready, you need to grant it the corresponding permissions on a bucket (Storage Admin or Read/Write, depending on your use case). Normally you would do this when the service account is created. However, if your team has strict rules for bucket permissions, you may need to create the bucket in the Console beforehand and grant permissions to the service account explicitly.

After the service account is created and configured, we need to create a key for it and download the corresponding JSON credential file, which will be used in your Python code. You can search for Service accounts in the Console. When the Service accounts page opens, find your service account and open it, then click the KEYS tab to create a new key. When a new key is created, a JSON file is generated and downloaded automatically. This JSON file is confidential: store it safely and never commit it as raw text to your repository. You can encrypt it with the Google Cloud Key Management Service (KMS).

Finally, we are almost there 😅. Let’s open your favorite Python IDE and configure Python to use this JSON key file.

from google.cloud import storage

client = storage.Client.from_service_account_json(
    PATH_TO_KEY_JSON_FILE, project=PROJECT_ID
)

PATH_TO_KEY_JSON_FILE is the path to the JSON file you just downloaded. If it's not in the same folder as your Python code, you need to specify a valid path to it, either absolute or relative to your Python project folder. PROJECT_ID is the ID of your project, which can be found on the home page of the Google Cloud Console.
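Alternatively, you can point the standard GOOGLE_APPLICATION_CREDENTIALS environment variable at the key file and let the client pick it up automatically. Below is a minimal sketch; the path and project ID are placeholders you need to replace with your own values.

import os
from google.cloud import storage

# Point Application Default Credentials at the downloaded key file.
# "path/to/key.json" and "my-project-id" are placeholders for illustration.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "path/to/key.json"

client = storage.Client(project="my-project-id")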

Everything is ready; let's begin coding 😃.

Create a bucket with gsutil:

gsutil mb -c standard -l eu gs://my-gsutil-bucket-12005

mb here stands for "make buckets". Don't forget the gs:// prefix for a bucket, or the command won't work properly. With gsutil, if we don't specify any options, the bucket is created with the default storage class, in the default project, and in the default geographical location authorized for your Google Cloud SDK, which is fine for most use cases. If you want more fine-tuned options for your bucket, you can check this document. However, please note that a bucket name must be globally unique, meaning that nobody else has used it before, just like your Google email address.

Create a bucket in Python:

bucket = client.bucket("my-python-bucket-12006")
bucket.storage_class = "STANDARD"
client.create_bucket(bucket, location="eu")

The options for the storage class and location are similar to those for gsutil. For simplicity, I will omit the options and just use the defaults in the following examples.
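Because bucket names are globally unique, creating a bucket whose name is already taken will fail. If you want your code to be rerunnable, one option is to look the bucket up first and only create it when it does not exist yet. A small sketch, reusing the example bucket name from above:

# lookup_bucket returns None if the bucket cannot be found.
bucket = client.lookup_bucket("my-python-bucket-12006")
if bucket is None:
    bucket = client.create_bucket("my-python-bucket-12006")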

If you delete a bucket, all objects in it are deleted as well. You will rarely need to do this in your work. If it is really needed, it is safer to do it via the Console, where you have better visual control over what you are doing.
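For completeness, a bucket can also be deleted from Python. A sketch, using the bucket from the earlier examples; note that the client refuses to force-delete buckets holding more than a few hundred objects, so use it with care:

bucket = client.bucket("my-python-bucket-12006")
# force=True deletes the contained objects before deleting the bucket itself.
bucket.delete(force=True)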

Create an object in a bucket with gsutil, which simply means to upload a file to the bucket.

gsutil cp test.txt gs://my-gsutil-bucket-12005

test.txt is a local file on your computer and can be any file. If it's not in the directory where you run the command, you need to specify a valid path to it. On the destination side, if you want to upload to a folder inside the bucket, you can append the folder path after the bucket name, just as with any hierarchical file system.

Create an object in a bucket with Python, which again means uploading a file to the bucket.

bucket = client.bucket("my-python-bucket-12006")
blob = bucket.blob("test.txt")
blob.upload_from_filename("test.txt")

Note that with Python, you need to create the object (also called blob) first, then upload a file as the data for this object.
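If you want the object to live inside a folder in the bucket, remember that folders are just prefixes of the object name, so you include them in the blob name. A sketch, where "reports/" is a hypothetical folder name used only for illustration:

bucket = client.bucket("my-python-bucket-12006")
# "reports/" acts as a folder; it is simply part of the object name.
blob = bucket.blob("reports/test.txt")
blob.upload_from_filename("test.txt")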

Rename an object in the bucket with gsutil.

gsutil mv gs://my-gsutil-bucket-12005/test.txt gs://my-gsutil-bucket-12005/test-renamed.txt

As you may have noticed, you can use the familiar Linux commands cp, mv, rm, and ls to copy, move/rename, delete, and list objects in a bucket (a Python listing example follows the rename code below). The gsutil commands are very similar to regular file and folder operations in Linux. In comparison, the Python code is a little more verbose but becomes easy once you master the pattern.

Rename an object in the bucket with Python.

bucket = client.bucket("my-python-bucket-12006")
blob = bucket.blob("test.txt")
bucket.rename_blob(blob, "test-renamed.txt")
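Listing objects is the one operation mentioned above that we have not shown yet. A minimal Python sketch, the counterpart of gsutil ls; the optional prefix argument restricts the listing to a "folder" (here "reports/" is again just an illustrative prefix):

# list_blobs returns an iterator of Blob objects in the bucket.
for blob in client.list_blobs("my-python-bucket-12006", prefix="reports/"):
    print(blob.name)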

Copy a file from a bucket to your local computer with gsutil.

gsutil cp gs://my-gsutil-bucket-12005/test-renamed.txt .

Copy a file from a bucket to your local computer with Python.

bucket = client.bucket("my-python-bucket-12006")
blob = bucket.blob("test-renamed.txt")
blob.download_to_filename("test-renamed.txt")

Note that unlike with gsutil, with Python you must specify a valid file path for the local copy of the remote object. If you pass just a dot (.) as with gsutil, an error will occur.
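If you want to save the file into a specific local folder, build a full path for it; the folder must already exist. A sketch assuming a local directory named downloads (a hypothetical name):

import os

# Make sure the target directory exists before downloading into it.
os.makedirs("downloads", exist_ok=True)

bucket = client.bucket("my-python-bucket-12006")
blob = bucket.blob("test-renamed.txt")
blob.download_to_filename(os.path.join("downloads", "test-renamed.txt"))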

Delete an object in a bucket with gsutil.

gsutil rm gs://my-gsutil-bucket-12005/test-renamed.txt

Delete an object in a bucket with Python.

bucket = client.bucket("my-python-bucket-12006")
bucket.delete_blob("test-renamed.txt")

Unlike the other operations, we don't need to create a blob object before deleting it; we just need to specify the blob name (file name) here.
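If the object does not exist, delete_blob raises a NotFound error, so in real code you may want to handle it. A small sketch:

from google.cloud.exceptions import NotFound

bucket = client.bucket("my-python-bucket-12006")
try:
    bucket.delete_blob("test-renamed.txt")
except NotFound:
    # The object was already gone; nothing to do.
    print("Object not found, nothing to delete.")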

I think with this introduction, you have learned enough to start working with buckets and objects in Google Cloud Storage using both gsutil and Python. The default settings and options should suffice for most use cases; if you need more control over your buckets and the corresponding operations, the official Google Cloud Storage documentation is a good place to continue.
