Using Firestore database to access your Cloud Storage metadata

Published in Google Cloud - Community, Jun 26, 2022

Cloud Storage objects have associated metadata, stored as key-value pairs. There are two types of metadata. The first is fixed-key metadata, such as Content-Type or Content-Encoding: the keys are predefined, and we can set their values, for example “Content-Type=image/jpeg”.

But we can also use custom metadata. With custom metadata we can set our own key-value pairs. For example, if we are working with media content like images, we can define metadata like “category=sports” or “category=animals”. If we work with invoices, maybe we can use metadata like “invoice-id”, “customer-id” or “billing-period”.

We can access our object metadata with the console, command line, REST APIs or code. This lets us see the metadata for a particular object. But what if we want to find all the objects with a given metadata value, for example all the invoices for a certain customer-id?

That is where a database comes into play. In this post we will see why Firestore is a good fit for accessing and querying our custom metadata.

Why Firestore?

Firestore is a fully managed, scalable, and serverless document database. For this use case, some of its features are very useful:

Schemaless. We do not know in advance what metadata structure will be used. We need to be able to add new attributes to an object when needed, and to have different attributes for different documents.
Easy querying. Firestore automatically indexes each field by default, so simple queries on a metadata key perform well without extra setup.
Serverless and fully managed. We do not want to spend time and effort on management, tuning and scaling.
Seamless integration with Google Cloud services. We will use not just Firestore, but also Cloud Storage, Cloud Functions, and services that access the database, such as Cloud Run or GKE.
Cost effective. Firestore has a generous free tier, which can be enough to get started with our custom metadata database in development environments.

How does it work?

We have to take care not just of ingesting new documents, but also of keeping them in sync: we update the custom metadata values in Firestore when the object is modified, and remove the document from Firestore when the object is removed from Cloud Storage.

The following video shows the solution in action.

Architecture

We will use these Google Cloud components:

Cloud Storage. We will use a bucket as the repository for our objects. The custom metadata of these objects will be saved in Firestore.
Cloud Functions. We will trigger a Cloud Function on Cloud Storage events, so each time an object is added, deleted, modified or archived we will sync with Firestore.
Firestore. Our database for holding and querying our objects’ custom metadata values.

Solution deployment

The whole solution code is available in this repository.

We will use Cloud Shell to deploy the solution. First, we will set up some environment variables:

REGION=europe-west3
BUCKET_NAME=`gcloud config list --format 'value(core.project)'`
COLLECTION=content

In this sample, we will deploy in region europe-west3 and use “content” as the collection name in Firestore. We will create a bucket with the same name as the project. Use your own values for all of them.

Next, create a Firestore database in Native mode. If prompted, agree to enable the appengine.googleapis.com API.

gcloud app create --region=$REGION
gcloud firestore databases create --region=$REGION

Create a Cloud Storage bucket in the same region:

gsutil mb -l $REGION gs://$BUCKET_NAME

Now we will create the Cloud Functions from the code in the repository. First, download the code and build it:

git clone https://github.com/mahurtado/StorageCustomMetadataFirestore
cd StorageCustomMetadataFirestore/CustomMetadataFirestoreCF
mvn package

The next step is to deploy the Cloud Functions. Notice that we will deploy one function per event:

Object Finalize (google.storage.object.finalize). Sent when a new object is created in the bucket.
Object Delete (google.storage.object.delete). Sent when an object is deleted.
Object Archive (google.storage.object.archive). Sent when a live version of an object is archived or deleted.
Object Metadata Update (google.storage.object.metadataUpdate). Sent when the metadata of an existing object changes.

gcloud services enable cloudbuild.googleapis.com

gcloud functions deploy content-gcs-insert \
 --set-env-vars COLLECTION=$COLLECTION \
 --region $REGION \
 --entry-point com.manolo.content.InsertFile \
 --runtime java11 \
 --memory 512MB \
 --trigger-resource $GOOGLE_CLOUD_PROJECT \
 --trigger-event google.storage.object.finalize \
 --source=target/deployment

gcloud functions deploy content-gcs-delete \
 --set-env-vars COLLECTION=$COLLECTION \
 --region $REGION \
 --entry-point com.manolo.content.InsertFile \
 --runtime java11 \
 --memory 512MB \
 --trigger-resource $GOOGLE_CLOUD_PROJECT \
 --trigger-event google.storage.object.delete \
 --source=target/deployment

gcloud functions deploy content-gcs-metadata-update \
 --set-env-vars COLLECTION=$COLLECTION \
 --region $REGION \
 --entry-point com.manolo.content.InsertFile \
 --runtime java11 \
 --memory 512MB \
 --trigger-resource $GOOGLE_CLOUD_PROJECT \
 --trigger-event google.storage.object.metadataUpdate \
 --source=target/deployment

gcloud functions deploy content-gcs-metadata-archive \
 --set-env-vars COLLECTION=$COLLECTION \
 --region $REGION \
 --entry-point com.manolo.content.InsertFile \
 --runtime java11 \
 --memory 512MB \
 --trigger-resource $GOOGLE_CLOUD_PROJECT \
 --trigger-event google.storage.object.archive \
 --source=target/deployment

Before running our functions, we need to give them access to Firestore and Cloud Storage:

gcloud projects add-iam-policy-binding ${GOOGLE_CLOUD_PROJECT} \
 --member=serviceAccount:${GOOGLE_CLOUD_PROJECT}@appspot.gserviceaccount.com \
 --role=roles/storage.objectAdmin

gcloud projects add-iam-policy-binding ${GOOGLE_CLOUD_PROJECT} \
 --member=serviceAccount:${GOOGLE_CLOUD_PROJECT}@appspot.gserviceaccount.com \
 --role=roles/datastore.user

As a checkpoint, open IAM in the Cloud Console and look for the service account [your project name]@appspot.gserviceaccount.com. It should look like this:

You can also see the deployed Cloud Functions:

Testing

At this point, the solution is ready. Any object uploaded to your bucket will launch the Cloud Function and save your custom metadata in Firestore.

You can use the Console to upload files as shown in the video, or use the command line. An example of object creation with custom metadata key “key1” and value “value1” looks like this:

gsutil -h "x-goog-meta-key1:value1" cp [path_to_your_file] gs://$GOOGLE_CLOUD_PROJECT
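To make the header format concrete, here is a small Java sketch that turns a map of custom metadata into the -h arguments for gsutil. The class and method names are hypothetical and not part of the repository; only the -h flag and the x-goog-meta- prefix come from the command above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration only; not part of the article's repository.
final class MetadataHeaderSketch {
    // Prefix that marks a header as custom metadata, as in the gsutil command above.
    static final String CUSTOM_PREFIX = "x-goog-meta-";

    // Builds one "-h x-goog-meta-<key>:<value>" pair per metadata entry.
    static List<String> headerArgs(Map<String, String> metadata) {
        List<String> args = new ArrayList<>();
        // Copy into a TreeMap so the keys come out in a stable, alphabetical order.
        for (Map.Entry<String, String> entry : new TreeMap<>(metadata).entrySet()) {
            args.add("-h");
            args.add(CUSTOM_PREFIX + entry.getKey() + ":" + entry.getValue());
        }
        return args;
    }

    public static void main(String[] args) {
        Map<String, String> metadata =
                Map.of("customer-id", "C-42", "billing-period", "2022-06");
        // Prints the arguments that would go between "gsutil" and "cp".
        System.out.println(String.join(" ", headerArgs(metadata)));
    }
}
```

Each key-value pair gets its own -h flag, so an object can carry several custom metadata entries in a single upload.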

Open Firestore in the console and see your metadata!

A look at the code

To finish, let us look at the Java code used for the Cloud Functions (full source here). The code that runs when an event is received is the “accept” method.

In this example, we will use the same code for all the events, and will determine the method to execute depending on the event type. Notice how we call the doFinalilze method for both object creation and object update.

Note the logEvent call, which saves a log trace per event. Consider commenting it out in high-throughput scenarios.

In the same way, we call the method doDelete for both object deletion and archiving.
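The dispatch described above can be sketched in plain Java as follows. This is a simplified stand-in, not the repository's code: the class, the Action enum and the actionFor method are hypothetical names, and only the four event type strings come from the deployment section.

```java
// Simplified, hypothetical sketch of the event dispatch; not the repository's code.
final class EventDispatchSketch {
    // Event type strings, as listed in the deployment section.
    static final String FINALIZE = "google.storage.object.finalize";
    static final String METADATA_UPDATE = "google.storage.object.metadataUpdate";
    static final String DELETE = "google.storage.object.delete";
    static final String ARCHIVE = "google.storage.object.archive";

    // What the function should do with the Firestore document for a given event.
    enum Action { UPSERT, DELETE, IGNORE }

    static Action actionFor(String eventType) {
        if (eventType == null) {
            return Action.IGNORE;
        }
        switch (eventType) {
            case FINALIZE:
            case METADATA_UPDATE:
                // Creation and metadata update share the same write path.
                return Action.UPSERT;
            case DELETE:
            case ARCHIVE:
                // Deletion and archiving both remove the document.
                return Action.DELETE;
            default:
                // Unknown event types are ignored rather than treated as errors.
                return Action.IGNORE;
        }
    }

    public static void main(String[] args) {
        System.out.println(FINALIZE + " -> " + actionFor(FINALIZE));
        System.out.println(ARCHIVE + " -> " + actionFor(ARCHIVE));
    }
}
```

Mapping four event types onto two actions is what allows a single entry point to be deployed once per trigger.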

When an object is created or modified this code is executed:

We first construct a document with the custom metadata and write it into Firestore. Update is managed in the same way as insert.

As the document key, we use the full Cloud Storage object name, but we have to replace the path separator “/” with a different one, like “::”, because Firestore document IDs cannot contain a forward slash.
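A minimal sketch of that key conversion could look like the following. The class and method names are hypothetical, and the sample object name is made up for illustration; the repository may build the key differently.

```java
// Hypothetical sketch of the object-name-to-document-ID conversion.
final class DocumentIdSketch {
    // Replacement for the "/" path separator.
    static final String SEPARATOR = "::";

    // Turns a Cloud Storage object name into a string usable as a document ID.
    static String toDocumentId(String objectName) {
        return objectName.replace("/", SEPARATOR);
    }

    // Reverses the conversion. This is only unambiguous when the original
    // object name does not itself contain "::".
    static String toObjectName(String documentId) {
        return documentId.replace(SEPARATOR, "/");
    }

    public static void main(String[] args) {
        String objectName = "invoices/2022/06/inv-001.pdf"; // made-up example
        String documentId = toDocumentId(objectName);
        System.out.println(documentId);
        System.out.println(toObjectName(documentId));
    }
}
```

One caveat of this scheme is that an object whose name already contains “::” cannot be mapped back to its original name, so a separator that never appears in your object names is the safer choice.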

The next piece of code shows how to manage document deletion:

About out-of-order processing

Storage events are built on top of Pub/Sub messaging. This means at-least-once delivery, so a given event may trigger more than one execution, and events may arrive out of order, as stated in the public documentation.

To handle this, we implement conditional writes, using the _updated field to skip duplicate and out-of-order writes.
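The idea behind the conditional write can be illustrated with an in-memory stand-in for the collection. This sketch assumes that _updated holds the object's last update time as epoch milliseconds and that a write is applied only when it is strictly newer than the stored value; both are assumptions made for illustration, as are the class and method names. Against Firestore itself, the check and the write would need to happen atomically, for example inside a transaction.

```java
import java.util.HashMap;
import java.util.Map;

// In-memory illustration of a conditional write; hypothetical, not the repository's code.
final class ConditionalWriteSketch {
    // Field name mentioned in the article; its epoch-millis format is an assumption.
    static final String UPDATED_FIELD = "_updated";

    // Stand-in for the Firestore collection: document ID -> document fields.
    private final Map<String, Map<String, Object>> documents = new HashMap<>();

    // Applies the write only if it is newer than what is already stored.
    // Returns true when the document was written, false when the event was skipped.
    boolean upsertIfNewer(String documentId, Map<String, String> metadata, long updatedMillis) {
        Map<String, Object> existing = documents.get(documentId);
        if (existing != null) {
            long storedMillis = (Long) existing.get(UPDATED_FIELD);
            if (storedMillis >= updatedMillis) {
                // Duplicate delivery (equal) or out-of-order delivery (older): skip.
                return false;
            }
        }
        Map<String, Object> document = new HashMap<>(metadata);
        document.put(UPDATED_FIELD, updatedMillis);
        documents.put(documentId, document);
        return true;
    }

    Map<String, Object> get(String documentId) {
        return documents.get(documentId);
    }

    public static void main(String[] args) {
        ConditionalWriteSketch store = new ConditionalWriteSketch();
        // The newer event arrives first, then a stale one, then a duplicate.
        System.out.println(store.upsertIfNewer("a::b", Map.of("category", "sports"), 200L));
        System.out.println(store.upsertIfNewer("a::b", Map.of("category", "animals"), 100L));
        System.out.println(store.upsertIfNewer("a::b", Map.of("category", "sports"), 200L));
        System.out.println(store.get("a::b"));
    }
}
```

With this rule, replaying the same event or receiving an older one after a newer one leaves the stored document unchanged.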

Conclusion

With this solution we can access our Cloud Storage objects’ custom metadata, either through direct lookups or through complex queries that use Firestore capabilities.