In this post, I’m going to write about Google’s way of serializing and deserializing data in a specific format. Protocol Buffers were initially developed at Google as an internal mechanism for sending data over the wire, with an eye on improving network performance and usage. They were designed to be simpler than XML, yet smaller and faster in storage and transmission.
This post briefly covers the need-to-know aspects of Protobufs (Protocol Buffers). It also serves as a base for the next one, which will be about TensorFlow Records; understanding protobufs made it easier to work with TFRecords as well.
What are protocol buffers?
As per the Google Protobuf Documentation,
Protocol buffers are Google’s language-neutral, platform-neutral, extensible mechanism for serializing structured data — think XML, but smaller, faster, and simpler.
In simple words, protobufs are just another way of serializing your messages into a binary format. The serialized output is a pretty dense sequence of bytes, making an effort to save space. The bytes are then transmitted over the wire, which takes far less time than sending the same data in XML or JSON format. Serializing and deserializing the data is also faster with protobufs.
Why do we need Protocol Buffers?
Protobufs are a replacement for the traditional way of transmitting messages using XML or JSON. They are mainly preferred when performance or network load is the primary concern of the application. Well, let’s see an example of what size a message takes in each of these formats. I have a string “Medium” to serialize here.
The only difference between these formats is that the text is readable in XML and JSON, but not in the protobuf format. But if readability is not a concern, we can see that the space taken by protobufs is far less than that of their XML and JSON variants. This lets us send large chunks of data over the wire while taking much less space than XML or JSON.
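To make the comparison concrete, here is a small sketch that hand-encodes the protobuf wire format for a single string field and compares its size against JSON and XML encodings of the same value (the field name `name` and the exact XML layout are assumptions for illustration):

```python
import json

# Hand-encode the protobuf wire format for a message with one string field:
# key byte = (field_number << 3) | wire_type -> (1 << 3) | 2 = 0x0A
# (wire type 2 = length-delimited), then a varint length, then the UTF-8 bytes.
value = "Medium".encode("utf-8")
proto_bytes = bytes([0x0A, len(value)]) + value

json_bytes = json.dumps({"name": "Medium"}).encode("utf-8")
xml_bytes = "<person><name>Medium</name></person>".encode("utf-8")

print(len(proto_bytes), len(json_bytes), len(xml_bytes))  # 8 18 36
```

The protobuf encoding carries only two bytes of overhead beyond the string itself, because the field name lives in the schema rather than in every message.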
Protocol Buffer Format
Working with protobuf starts with declaring the structure of the data we want to transmit in a proto file, followed by compiling it with protoc. This generates compiled code that serves as the source of the data structure for the transmitted data, on both the sender side and the recipient side.
So here goes a small proto file containing a message type with two fields. The proto file’s extension is “.proto”. The first field is a string field, “name”, and the second, “age”, is of int32 type. The numbers on the right are tags, which are used while deserializing the data: the deserializer relies on the numbering scheme in the tags that we explicitly defined while decoding the data. This proto can now be compiled with the protoc compiler, which produces a compiled class as output. I have compiled it to Python, which lets us use the message type as a normal Python class that can be instantiated as below.
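The original snippet isn’t reproduced here, but based on the description, the proto file would look something like this (the file and package names are illustrative):

```protobuf
// person.proto
syntax = "proto3";

package tutorial;

message Person {
  string name = 1;  // tag 1
  int32 age = 2;    // tag 2
}
```

Compiling it with `protoc --python_out=. person.proto` generates a `person_pb2.py` module in the current directory.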
The code below uses the message from the proto file as a class, instantiates it, and then assigns values to its attributes, similar to the way we instantiate a class in Python.
The serialization process is also very simple: just call the SerializeToString method to get the binary, which can be transferred in bytes format.
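Putting the two steps together, a sketch of the usage (assuming the generated module is named `person_pb2`; this will not run without first compiling the proto file):

```python
import person_pb2  # generated by protoc; module name is an assumption

# Instantiate the message type like a normal Python class
# and assign values to its attributes.
person = person_pb2.Person()
person.name = "Medium"
person.age = 30  # illustrative value

# Serialize to a dense byte string...
data = person.SerializeToString()

# ...and deserialize it back on the receiving side.
received = person_pb2.Person()
received.ParseFromString(data)
```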
Understanding the whole serialization process is out of scope here; protobuf uses varint encoding to pack the data.
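To give a flavor of varints: each byte stores seven bits of the number, least-significant group first, with the high bit set on every byte except the last. A minimal sketch of such an encoder (my own illustration, not the library’s code):

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint, least significant group first."""
    out = bytearray()
    while True:
        byte = value & 0x7F      # take the low seven bits
        value >>= 7
        if value:
            out.append(byte | 0x80)  # set the continuation bit: more bytes follow
        else:
            out.append(byte)         # last byte: continuation bit clear
            return bytes(out)

print(encode_varint(300).hex())  # 'ac02' -- the two-byte encoding of 300
```

Small numbers cost a single byte, which is why protobuf messages with small tags and values stay so compact.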
So, what exactly goes on in the background that allows us to create instances of a Person class? We just created something called a proto file with a message type in it, and then compiled it with the protoc compiler, which created a file named person_pb2.py; this file has been imported here to access the message type Person as a class. To see what’s happening behind the scenes, I have analyzed the compiled file below.
It starts with something called a descriptor, which, as the name suggests, gives basic information about the message type written in the proto file. There are different types of descriptors: FileDescriptor, Descriptor, FieldDescriptor, EnumDescriptor, EnumValueDescriptor, etc.
This one is a FileDescriptor, which captures information about the contents of the file. It contains the name of the proto file, the package it is declared in, the syntax (showing that it follows the proto3 version), and some more information as well.
The next part of the compiled code is the descriptor for the actual message type defined in the proto file. Its name starts with an underscore followed by the message type name in full caps. It contains the name of the message type, its full name (including the scope), and the descriptors for the fields as well. The descriptor for the file above was a FileDescriptor; for a message it is a Descriptor, and for a field it is a FieldDescriptor. A FieldDescriptor uses:
- index — to indicate the order of the field in the proto file
- number — to indicate the tag
- type — to indicate the data type of the field, denoted by a number (9 for string and 5 for int32)
- has_default_value — a boolean field to indicate if a default value has been provided
- default_value — indicates the value provided as the default; if none is provided, protobuf has default values defined for each data type. Please continue to the bottom section to find the default values.
- label — to indicate whether the field is a required, optional, or repeated field. Please continue to the bottom section to find more on these labels.
Okay, so far we have seen the descriptors for the file, the message, and the field types. But we have not yet seen the place where the ‘Person’ class gets created.
We know that an object gets created when we instantiate a class. Similarly, a class gets created when we instantiate a metaclass.
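This class-from-metaclass relationship can be seen with plain Python using the built-in metaclass `type` (a generic illustration, not protobuf’s actual machinery):

```python
# Instantiating the metaclass `type` produces a class...
Person = type("Person", (object,), {"name": "", "age": 0})

# ...and instantiating that class produces an object.
p = Person()
print(p.name, p.age)  # the class attributes act as simple defaults
```

Protobuf does the same thing one level up: its metaclass consumes a message descriptor and emits a ready-to-use message class.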
Google protobuf provides a metaclass, ‘GeneratedProtocolMessageType’, which is responsible for creating classes from protocol message descriptors at runtime. It also injects the field descriptors into the generated classes. This is what allows us to use person_pb2 as a module containing the Person class and its attributes.
But one thing still seems missing: there is no mapping yet between the file descriptor named DESCRIPTOR and the message descriptor _PERSON.
That’s what is shown here: the file descriptor is mapped to the message descriptor by its name. Along with that, it also shows the messages getting registered in the symbol database maintained by protobuf. Both the file descriptor and the message descriptor are registered as symbols in the database.
Enumerations, Nested Types and Messages as field types
We can have enumerations for a list of predefined values. For example, if we want an address to be either a home type or an office type, we can create an enumeration with both of them as its possible values. An enumeration must start with a 0 value, as that is used as the default value for the field if one is not explicitly provided.
We can also use nested message types by declaring one message type inside another. Here we have added an Address message type inside the Employee message type.
We can also use one message type as the type of a field in another message. Here, Department uses Employee as the type of its field employee. This allows us to use custom data types for fields.
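The three ideas above might look like this in a proto file (the names, tags, and the use of `repeated` are illustrative):

```protobuf
syntax = "proto3";

// Enumeration: the first value must be 0, and it doubles as the default.
enum AddressType {
  HOME = 0;
  OFFICE = 1;
}

message Employee {
  string name = 1;

  // Nested message type, declared inside Employee.
  message Address {
    string street = 1;
    AddressType type = 2;  // the enum used as a field type
  }

  Address address = 2;
}

message Department {
  // A message type used as the type of a field in another message.
  repeated Employee employee = 1;
}
```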
The compiled python file can be found here. Descriptors for enum types and enum values are EnumDescriptor and EnumValueDescriptor.
Each field can be assigned a value while initializing an instance of the class. If a field is not initialized, protobuf provides these default values itself:
- For integers, the default value is 0
- For strings, the default value is the empty string
- For bytes, the default value is empty bytes
- For bools, the default value is false
- For enums, the default value is the one with value 0
- For message types, the default value is not set (the field simply remains unset)
- Required means a value must be provided; serializing a required field without a value throws an exception.
- Optional means a value may or may not be provided. If not provided, the default value is used.
- Repeated means the field can be repeated any number of times (including zero).
In proto3, Required is not available, as by default all fields are optional. This post follows the proto3 version, hence no field is prefixed with Required or Optional.
There may be a case where we modify an existing protobuf after using it for some time. In that case, we need to take care of a few things, as the changes need to be backward compatible. If we want to add a new field, the field must use a new tag, and we must not change the tag of an existing field. As proto2 had Required fields, we should not remove required fields while updating; however, optional or repeated fields can be removed. Following these rules, a protobuf message type can be extended easily.
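For example, extending the Person message from earlier in a backward-compatible way might look like this (the new field is illustrative):

```protobuf
message Person {
  string name = 1;   // existing fields keep their tags unchanged
  int32 age = 2;
  string email = 3;  // a new field uses a new, previously unused tag
}
```

Old readers simply skip the unknown tag 3, and new readers see the default value for `email` when parsing old messages.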
When not to use Protocol Buffers
Protobufs serialize data into a binary format that is transmitted as bytes, so the data is very dense and small, and it is faster to transmit over the wire, serialize, and deserialize. But protobufs are not a good fit if the underlying application is a web browser and the data is fed directly to it, or if human readability is needed. Both XML and JSON are readable without even knowing the schema, and are also editable. Editing protobuf-serialized data is not advisable, as it would break the overall structure and meaning of the data.
So, this was all about protobufs. While they look compelling, one needs to know when to use them and when not to. Overall, they are a good choice when performance, network load, or backward compatibility is a concern.
- To learn more about Protobufs, follow Google ProtoBuf Docs.