Storing Protocol Buffers data in GCP Datastore using the Datastore Java SDK

Holistic AI Engineering
Feb 3, 2023


Storing protocol buffers in Datastore can be tricky, as we discovered whilst building one of our services. In this article, I want to share how we got around it.

Prerequisites

For this article, I’m assuming that you’re already familiar with Java and protocol buffers.

What are protocol buffers?

Protocol buffers are a mechanism for serialising structured data created by Google. You can find more information about it here.

Protocol buffers are our format of choice here at Holistic AI. We use it for inter-server communications as well as for storage of data on Cloud Storage and Datastore.
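To make that concrete, serialising a message to bytes and back is a one-line operation in Java. Here is a minimal sketch, using MyProtoObject as a stand-in for any class generated from a .proto file (the same placeholder used in the snippets later in this article):

import com.google.protobuf.InvalidProtocolBufferException;

class ProtoRoundTrip {

  // Generated protobuf classes can turn themselves into a compact binary form...
  static byte[] serialise(MyProtoObject message) {
    return message.toByteArray();
  }

  // ...and rebuild the message from those bytes
  static MyProtoObject deserialise(byte[] bytes) throws InvalidProtocolBufferException {
    return MyProtoObject.parseFrom(bytes);
  }
}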

What is GCP Datastore?

“Datastore is a NoSQL document database built for automatic scaling, high performance, and ease of application development” as per the GCP definition that you can find here.

Datastore works with the concept of an Entity, and an Entity can have one or many properties that can be used to store data.

Datastore comes with some limitations around the size of an Entity; the current limit is 1MB per Entity. You can find more information about the Datastore limits here.
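To make the Entity/property model concrete, here is a minimal sketch using the Java SDK. The kind Task and the property names below are made up purely for illustration:

import com.google.cloud.datastore.Datastore;
import com.google.cloud.datastore.DatastoreOptions;
import com.google.cloud.datastore.Entity;
import com.google.cloud.datastore.Key;

class EntityExample {

  public static void main(String[] args) {
    // Client for the project configured in your environment
    Datastore datastore = DatastoreOptions.getDefaultInstance().getService();

    // Every Entity is addressed by a Key: a kind plus a name (or a numeric id)
    Key key = datastore.newKeyFactory().setKind("Task").newKey("sample-task");

    // Properties are named, typed values attached to the Entity
    Entity task =
        Entity.newBuilder(key)
            .set("description", "Write the blog post")
            .set("done", false)
            .build();

    datastore.put(task);
  }
}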

How to store protocol buffers in Datastore

Datastore does not have support for storing vanilla protocol buffers, but there are two ways to work around this:

  1. Store your data as a String value
  2. Store your data as a Blob value

Storing data as a String is pretty straightforward. With the Datastore Java SDK, we could do something like this:

import com.google.cloud.datastore.Datastore;
import com.google.cloud.datastore.Entity;
import com.google.cloud.datastore.Key;
import com.google.cloud.datastore.StringValue;
import com.google.protobuf.MessageOrBuilder;

class DatastoreRepository {

  private final Datastore datastore; // Datastore class comes from the Java SDK

  DatastoreRepository(Datastore datastore) {
    this.datastore = datastore;
  }

  public void putEntity(MessageOrBuilder protoObject, String propertyName, Key key) {
    String data = protoToString(protoObject); // (1)
    Entity document =
        Entity.newBuilder(key)
            .set(
                propertyName, // (2)
                StringValue.newBuilder(data).setExcludeFromIndexes(true).build()) // (3)
            .build();
    datastore.put(document); // (4)
  }
}

A couple of observations about the code above:

  1. protoToString is a helper method that converts a protobuf message into a JSON String. It could look like this:

public static String protoToString(MessageOrBuilder product) {
  try {
    return JsonFormat.printer()
        .usingTypeRegistry(
            JsonFormat.TypeRegistry.newBuilder()
                .add(BoolValue.getDescriptor())
                .add(StringValue.getDescriptor())
                .add(Int32Value.getDescriptor())
                .add(FloatValue.getDescriptor())
                .build())
        .print(product);
  } catch (InvalidProtocolBufferException e) {
    // do something here, e.g. rethrow as an unchecked exception
    throw new IllegalStateException(e);
  }
}

Note that the JsonFormat class comes from the protobuf-java-util artifact (package com.google.protobuf.util), so you may need to add that dependency alongside protobuf-java.

2. When you store an entity in Datastore, every value you set on it needs a property name.

3. You should exclude the property from being indexed; otherwise you will hit another Datastore limitation: the maximum size of an indexed string property is 1,500 bytes.

4. The put method will insert the entity if it does not exist, or update it if it does.
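For completeness, reading the data back is the reverse of the steps above: fetch the Entity, read the String property, and let JsonFormat merge the JSON back into a builder. A minimal companion method for the repository above (using the placeholder MyProtoObject class that also appears in the next snippet, and assuming the same property name was used when writing) could look like this:

public MyProtoObject getEntity(String propertyName, Key key) throws InvalidProtocolBufferException {
  Entity entity = datastore.get(key); // returns null if nothing is stored under this key
  String json = entity.getString(propertyName);
  MyProtoObject.Builder builder = MyProtoObject.newBuilder();
  JsonFormat.parser().ignoringUnknownFields().merge(json, builder); // the reverse of protoToString
  return builder.build();
}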

Storing data as a Blob is just as straightforward with the Datastore Java SDK. The above code, which stores the data as a String, can be rewritten as:

import com.google.cloud.datastore.Blob;
import com.google.cloud.datastore.BlobValue;
import com.google.cloud.datastore.Datastore;
import com.google.cloud.datastore.Entity;
import com.google.cloud.datastore.Key;

class DatastoreRepository {

  private final Datastore datastore; // Datastore class comes from the Java SDK

  DatastoreRepository(Datastore datastore) {
    this.datastore = datastore;
  }

  public void putEntityAsBlob(MyProtoObject object, String propertyName, Key key) {
    Entity newEntity =
        Entity.newBuilder(key)
            .set(
                propertyName,
                BlobValue.newBuilder(Blob.copyFrom(object.toByteArray()))
                    .setExcludeFromIndexes(true)
                    .build())
            .build();
    datastore.put(newEntity);
  }
}
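Reading a blob-backed Entity is just as short: fetch the Entity, read the Blob property, and hand the raw bytes to the generated parser. A minimal companion method (again assuming the same property name that was used when writing) might look like this:

public MyProtoObject getEntityFromBlob(String propertyName, Key key) throws InvalidProtocolBufferException {
  Entity entity = datastore.get(key); // returns null if nothing is stored under this key
  Blob blob = entity.getBlob(propertyName);
  // parseFrom rebuilds the message from the binary form produced by toByteArray()
  return MyProtoObject.parseFrom(blob.toByteArray());
}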

Conclusion

There are obviously pros and cons when it comes to both methods. The main benefit of storing protocol buffers data as a String is that it’s in a human readable format. This allows you to inspect the data in the GCP console.

The main benefit of storing data as blobs is that you can fit more data within Datastore's 1MB Entity limit. From our initial testing, we were able to store approximately 4x more data using blobs, which means the equivalent of roughly 4MB of JSON-encoded data can fit inside a single Entity when stored in binary form.
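The exact ratio depends on the shape of your messages, so it is worth measuring both encodings for your own protos before committing to one. A small sketch of such a comparison, reusing the protoToString helper from above (assumed to be in the same class or statically imported) with the placeholder MyProtoObject class:

import java.nio.charset.StandardCharsets;

class EncodingSizeComparison {

  static void compare(MyProtoObject message) {
    // Size of the JSON text that would go into a StringValue property
    int jsonBytes = protoToString(message).getBytes(StandardCharsets.UTF_8).length;

    // Size of the binary form that would go into a BlobValue property
    int binaryBytes = message.toByteArray().length;

    System.out.printf(
        "JSON: %d bytes, binary: %d bytes (%.1fx smaller)%n",
        jsonBytes, binaryBytes, (double) jsonBytes / binaryBytes);
  }
}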

As we’re constantly looking to learn, we would love to hear your feedback in the comments.

Happy coding 🛠️

Holistic AI is an AI risk management company that aims to empower enterprises to adopt and scale AI confidently. We have pioneered the field of AI risk management and have deep practical experience auditing AI systems, having reviewed over 100 enterprise AI projects covering 20k+ different algorithms. Our clients and partners include Fortune 500 corporations, SMEs, governments and regulators.

We’re hiring :)
