Protocol Buffers: Text Format
--
When I first started using Google’s Protocol Buffers, I found them quite impressive. Being able to marshal and un-marshal structured data into type-safe containers, based on a schema that can be versioned, is a huge win. Not only that, but support for lower-level languages like C++ means that you never have to write one-off parsing methods or invent your own special message format for each project. Additionally, with support for binary fields, it becomes a breeze to marshal BLOB data, like an image, together with its structured data, like the time it was taken or the name of the person who took it. As Google puts it:
Protocol buffers are Google’s language-neutral, platform-neutral, extensible mechanism for serializing structured data — think XML, but smaller, faster, and simpler. You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages.
The binary format of the marshaled message also means that there is minimal data overhead, and you don’t incur the same Base64 data bloat that you might get by trying to encode an image for transport over JSON. As an added bonus, marshaling and un-marshaling are claimed to be 20–100x faster than with XML, which makes protobuf a strong fit for a wide range of domains.
Given these accolades, I dove in and started applying the technology to my needs, but quickly ran into an issue. Protobuf is a binary format, so working with it by hand becomes tedious. Without an easy way to see what a message contained, I struggled to implement the library and work with it quickly. Specifically, I wanted a way to write configuration files that could be read into protobuf: if I was using protocol buffers to marshal and transport my data, I wanted to be able to easily read data into them. I even went as far as writing custom code to marshal TOML or JSON into a protocol buffer (this is easier in some languages than others), but realized that writing custom code went against the whole idea of the code being auto-generated from the .proto file.
Enter Protobuf Text Format
There is a secret part of the proto SDK that had been hiding from me. It is as elusive on the internet as an ocelot is in the wild, but once I found it, it made a world of difference. Built into both the protoc compiler and the SDK is the ability to encode and decode messages from text! That means that protobuf can actually make for a legitimate configuration format, all while keeping the huge benefits mentioned above.
So how does it work?
Start with a standard .proto schema definition, something like what I have below (named main.proto):
NOTE: Since I am using Go (and trying to make a simple example), it makes things a little easier for the explanation to use “main” as the package name — though this may be a poor choice in practice!
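A minimal sketch of what such a main.proto could look like, reconstructed from the field names used in the text-format examples and the field numbers visible in the hex dump later in this post. The proto3 syntax line, the int32 type for age, and the nested message names Comment and Label are assumptions on my part, so treat this as an approximation rather than the exact listing.

```proto
// main.proto (reconstructed sketch; see the assumptions noted above)
syntax = "proto3";

package main;

message Example {
  string name = 1;
  int32 age = 2;               // assumed type; the wire format only shows a varint
  repeated Comment t = 3;      // "Comment" is a hypothetical message name
  repeated Label labels = 4;   // "Label" is a hypothetical message name
}

message Comment {
  string comment = 1;
}

message Label {
  string source = 1;
}
```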
Compile that into your target language — I’ll use Go for my example:
protoc --go_out=. main.proto
This will generate the Go source code for handling my schema (in the same directory as the proto file). I’m now all set to start encoding or decoding messages between a plain text file and my protobuf.
The Text Format
I can write out a standard text file (main.txt) which contains the data that I wish to marshal into a protobuf. Here is an example of my Example proto data in text format:
name: "Larry"
age: 99
t: [{
  comment: "hello"
}, {
  comment: "world"
}]
labels: [{
  source: "foo"
}, {
  source: "bar"
}]
There are a couple of different ways to structure the data (instead of curly braces, you could wrap nested messages in angle brackets <>), but unfortunately nothing in this realm is terribly well documented, so it’s more of a game of try it and see if it works. I have found that square brackets coupled with curly braces are the easiest to read, feel the most natural, and are supported, so it just makes the most sense to use those! @henridf also has a very short example of this in his protoc-encode.md file.
Extract
From here you are almost home. Now that you have both the generated source code and your text file with the data in it, it’s as simple as using UnmarshalText to read the data into the language-specific data structures.
In a similar fashion, you can use MarshalText, which takes the data in a data structure and writes it to an io.Writer (stdout, a file, etc.) in human-readable text format! We can code all this up just as I have it below:
Compile the above code with go build and then run the executable and you should have:
{Larry 99 [comment:"hello" comment:"world" ] [source:"foo" source:"bar" ] {} [] 0}
Printed to the command line!
It should also create a text file with the following contents:
name: "Larry"
age: 99
t: <
  comment: "hello"
>
t: <
  comment: "world"
>
labels: <
  source: "foo"
>
labels: <
  source: "bar"
>
You can see this format differs quite a bit from the format we used above. As mentioned, there are a few different formats that are supported.
Using “decode” and “encode”
The last option for this is to use the encode and decode options which are built into the protoc compiler. I will show an example below, although I feel that this method is more confusing than the above so I would only venture through this last section if you feel you need a command line tool to encode and decode the messages.
Using the above example files still, we can take the contents of the text file and convert it to binary all in a one-liner:
protoc --encode=main.Example main.proto < gen_main.txt > gen_example.bin
This command tells protoc to encode the contents of gen_main.txt as type Example found in main.proto and store the results into gen_example.bin. If we run a hex dump we can see the contents of the new file:
$ hexdump -C gen_example.bin
00000000  0a 05 4c 61 72 72 79 10  63 1a 07 0a 05 68 65 6c  |..Larry.c....hel|
00000010  6c 6f 1a 07 0a 05 77 6f  72 6c 64 22 05 0a 03 66  |lo....world"...f|
00000020  6f 6f 22 05 0a 03 62 61  72                       |oo"...bar|
00000029
We can decode a binary file through a similar process:
protoc --decode=main.Example main.proto < gen_example.bin
This should print the text format to the command line:
name: "Larry"
age: 99
t {
  comment: "hello"
}
t {
  comment: "world"
}
labels {
  source: "foo"
}
labels {
  source: "bar"
}
And that’s it! Hope you found this helpful!