Protobufs: the Good, the Bad, and the Ugly

Jan 15 · 9 min read

Protobufs are the latest “new hotness” coming out of Google. By defining a basic schema, you can generate strongly typed models for a multitude of programming languages. Your code bases can then use these models for data transfer, serialization, validation, and all sorts of other data goodness.

Protobufs are sent over the wire in a binary format. This format is not self-describing — there are no keys or other descriptive information sent with the data. To properly understand the data being sent, you need the original protobuf schema it was created with. However, the benefits of this format are twofold: messages are extremely small, and are backwards- and forwards-compatible as long as you’re smart about your protobuf schema design.

Should you use protobufs in your next project? As with every technology, you should pick the right tool for the job. In the case of protobufs, their strongly-typed nature, small size over the wire, and backwards- and forwards-compatibility are attractive features. That said, if you’re going to use protobufs in your next project, you should familiarize yourself with the good, the bad, and the ugly.

The good

Simple schemas

Protobufs are dead simple to write. The following is an example of a protobuf written in proto3 syntax:
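A sketch of such a schema, based on the types described below (all names here are illustrative):

```protobuf
syntax = "proto3";

package example;

// An enum must have a zero value, which acts as the default.
enum Status {
  STATUS_UNKNOWN = 0;
  STATUS_ACTIVE = 1;
  STATUS_DOWN = 2;
}

message Instance {
  string id = 1;
  string host = 2;
  uint64 port = 3;
  bool healthy = 4;
  // Arbitrary key-value metadata attached to an Instance.
  map<string, string> tags = 5;
}

message Service {
  string name = 1;
  Status status = 2;
  // An array of Instance messages.
  repeated Instance instances = 3;
}
```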

Notice how we’ve defined a message type for Instance, as well as an enum for Status, which can both be reused by any other message types (in this case, they’re being used by the Service message type). We also have a few primitives on display: string, uint64, and bool are just a few that are available to us. You can also see the use of a repeated type to give us an array of Instance types, and a map type for key-value tags that attach some metadata to our Instance type.

That’s about it for the basics! There are more fancy things you can do, such as importing types from other .proto files, and using special types such as Any or oneof, but we'll save that for another day...

Once you’re ready, simply run this .proto file through the protoc compiler to generate code for any number of languages:

  • C++
  • Java
  • Python
  • Go
  • Ruby
  • Objective-C
  • C#
  • PHP

Although the languages are different, the data models will be functionally identical, and will be able to send and receive the same binary format over the wire.
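For instance, a single invocation can emit models for several languages at once (file and output paths here are illustrative, and each language’s generator plugin must be installed):

```
protoc --java_out=gen/java --python_out=gen/python --go_out=gen/go service.proto
```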

But these are just the languages that Google supports officially…

Custom generators and options

protoc, the protobuf compiler, supports custom generators that allow you to compile protobufs to various languages or add extra functionality through the use of options. Some examples include:

  • protoc-gen-validate, created by Lyft to add validation functions to your generated code
  • protobuf-c, a protobuf code generator for the C programming language

Let’s use protoc-gen-validate as an example in the following protobuf:
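A sketch of what that might look like, assuming validate.proto from protoc-gen-validate is available on the import path (the message and field names are illustrative):

```protobuf
syntax = "proto3";

import "validate/validate.proto";

message ConcertAttendee {
  string name = 1;
  // Rejects any ConcertAttendee with an age under 21.
  uint32 age = 2 [(validate.rules).uint32.gte = 21];
}
```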

Notice how after the age field we've added an option that specifies validation rules. In this case our ConcertAttendee cannot have an age under 21.

When calling protoc, we specify --validate_out as a flag to pass our protobuf files through protoc-gen-validate. It runs alongside any other generators and produces validation functions for a number of supported languages, including Go, C++, and Java. More on this flag later...
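An invocation along these lines (include and output paths illustrative):

```
protoc -I . -I path/to/validate \
  --go_out=gen --validate_out="lang=go:gen" \
  concert.proto
```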

Protobuf has a couple of built-in options, such as [deprecated = true], which indicates that a field has been deprecated and should no longer be used. However, for the most part, the option syntax is used by custom generators for special functionality.

Backwards- and forwards-compatibility

One of the best parts about protobuf is its ability to be backwards- and forwards-compatible. This is something modern client/server applications have to deal with. When you have multiple versions of your client out in the wild, you need to be aware of the data it’s expecting to send and receive, and make sure that older versions remain supported by your latest server code for a reasonable amount of time.

Protobuf understands this, and bakes backwards- and forwards-compatibility into its design.

Let’s use this protobuf as an example:
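As a stand-in, consider a minimal schema like this (hypothetical):

```protobuf
syntax = "proto3";

message Person {
  string name = 1;
  string email = 2;
  uint64 id = 3;
}
```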

When compiled to Go, we’re able to create an object like this:
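Assuming a hypothetical Person message with name (1), email (2), and id (3) fields, the generated code can be used roughly like this — the struct here is a hand-written stand-in for protoc-gen-go output, so the snippet runs without any generated code:

```go
package main

import "fmt"

// Person is a hand-written stand-in for the struct that
// protoc-gen-go would generate from a hypothetical Person
// message (struct tags omitted for brevity).
type Person struct {
	Name  string
	Email string
	Id    uint64
}

func main() {
	p := &Person{
		Name:  "Kevin",
		Email: "kevin@example.com",
		Id:    1234,
	}
	fmt.Println(p.Name, p.Id) // Kevin 1234
}
```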

When sent over the wire, our binary message ends up looking something like this:
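Continuing the hypothetical Person example, a JSON-style sketch of the encoded message — a simplification for illustration, not the actual wire format:

```json
{
  "1": "Kevin",
  "2": "kevin@example.com",
  "3": 1234
}
```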

Notice how the numbers we’ve been associating with fields in our protobuf end up becoming the “keys” in our binary data. This allows us to do a few awesome things:

  1. We can change the names of our fields at any time, and still deserialize the binary data
  2. We can send or receive new fields, and they can simply be ignored if they are unknown to the receiver
  3. We can stop receiving or sending old fields, and they can assume a default value on the receiver’s end

What this translates to is backwards- and forwards-compatibility between clients and servers. Developers need to be responsible when designing and modifying their protobufs to ensure that compatibility is retained. This generally means “don’t change the number used for a field” and “don’t reuse a number that’s already been used when adding a new field”.

Having these numbers explicitly associated with fields also forces you to be thoughtful and critically consider your data contracts. It’s very easy to look at a diff on a .proto file and see where breaking changes will occur.

Sidenote: My example of a binary protobuf message above is not exact, but rather demonstrates how numbers are used to associate data with fields. I’ve used JSON to represent my example, but you can imagine how with the right techniques this message could be compressed and made even smaller, which is exactly what protobuf does!

The bad

proto2 vs. proto3

Up to this point, I’ve been writing all my protobuf examples using proto3 syntax, but this is not the only option.

proto2 is the predecessor to proto3, and the two differ in small but significant ways.

The best way to demonstrate these features is with an example:
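A proto2 sketch of the same hypothetical Person message, showing each of the three label keywords:

```protobuf
syntax = "proto2";

message Person {
  required string name = 1;
  optional string email = 2;
  repeated string phone_numbers = 3;
}
```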

Notice how each message field in proto2 requires an explicit label keyword at the beginning:

  • required indicates that the field is required. If you try to deserialize binary data that doesn't include this field, the data may be rejected.
  • optional indicates that a field is optional. When you compile your protobufs and generate the model code for your languages, this results in reference types that could be null.
  • repeated is the same as in proto3, and effectively represents an array.

The most problematic of these is the required type. Google explains it best:

Required Is Forever: You should be very careful about marking fields as required. If at some point you wish to stop writing or sending a required field, it will be problematic to change the field to an optional field – old readers will consider messages without this field to be incomplete and may reject or drop them unintentionally. [...] Some engineers at Google have come to the conclusion that using required does more harm than good; they prefer to use only optional and repeated.

And it’s not just Google that has noticed this issue. Many large open source projects that use proto2 have also chosen to ditch the required field and make everything optional. For example, etcd only uses optional in all of its proto2 files (example).

Google realized this issue and removed the concept of a required and optional type altogether in proto3. Now all types are considered “optional”, but there is still a small yet important difference in behavior...

In proto2, if you specify a field as optional, it is compiled to a reference type in the language's generated source code. Because of this, you can know whether or not the data you're receiving included a certain field by checking if the reference (or pointer) is null. In proto3, everything is compiled to a value type, so if a field isn't included in binary data, that field assumes the zero-value.

Take for example a uint64 field that we want to default to 8080 if a value isn’t explicitly provided.

In proto3, this isn’t possible:
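A proto3 sketch (message and field names illustrative) — there is simply no syntax for a custom default, and an unset field decodes as the zero-value:

```protobuf
syntax = "proto3";

message Config {
  // No default can be specified; an unset port is always 0.
  uint64 port = 1;
}
```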

But with proto2, this is entirely possible and even built into the syntax as a feature:
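The same sketch in proto2, using the built-in default option:

```protobuf
syntax = "proto2";

message Config {
  // An unset port decodes as 8080.
  optional uint64 port = 1 [default = 8080];
}
```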

So, should you use proto2 or proto3? That depends entirely on your situation. If you need the ability to know whether or not a field was explicitly provided, then you should stick with proto2. Otherwise, proto3 is the way to go, and even includes some new features you may want to have.

The ecosystem for proto2 is unfortunately not as strong as for proto3. For example, protoc does not support compiling proto2 syntax to as many languages as it does for proto3. However, open source protobuf projects are adding support for proto2, and Google itself is continuing to support the syntax by adding more supported languages to the protoc compiler.

Whether you decide to use proto2 or proto3, you should know that they use the exact same binary format, so if you decide to switch from one to the other, you will still have compatibility with messages sent over the wire.

The ugly

Protobuf’s compiler, protoc

Writing protobufs and using them in your code may be a breeze, but getting them to compile is anything but. Google’s protobuf compiler, protoc, does very little hand-holding and makes it nearly impossible to figure out what you’re doing wrong. The biggest complaint about protoc is that its error messages are not helpful — a complaint that in my opinion is well deserved, and something you’ll discover through experience.

I won’t explain everything there is to know about protoc, but I will explain some of the common pitfalls people encounter and how to avoid them.

Include -I everything you need

protoc is designed to do the least amount of work possible. It needs you to explicitly provide all necessary information when compiling protobufs.

One of those pieces of information is your set of protobuf paths. These are set using --proto_path, or the more common -I flag.

Assume we have the following folder structure:
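One structure consistent with the examples below (illustrative):

```
protobuf/
├── common/
│   └── common.proto
└── foo/
    └── foo.proto
```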

Our foo.proto looks like this:
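A sketch of what it might contain (types are illustrative; the key part is the import path starting with the common folder):

```protobuf
syntax = "proto3";

package foo;

import "common/common.proto";

message Foo {
  common.Shared shared = 1;
}
```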

And our imported common.proto looks like this:
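Again a sketch, with an illustrative message type:

```protobuf
syntax = "proto3";

package common;

message Shared {
  string id = 1;
}
```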

We can try and compile foo.proto, but protoc will complain:
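For example, run from the directory containing the protobuf folder, the failure looks something like this (error text abbreviated):

```
$ protoc --go_out=gen protobuf/foo/foo.proto
common/common.proto: File not found.
protobuf/foo/foo.proto: Import "common/common.proto" was not found or had errors.
```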

This is because we didn’t include -I the protobuf folder. We need to explicitly tell protoc where to find all imported .proto files. In this case, our import path in foo.proto starts with the common folder, so we will include -I its parent folder: protobuf
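Something like this (output directory illustrative):

```
$ protoc -I protobuf --go_out=gen protobuf/foo/foo.proto
```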

Now we have one file compiling properly, but what if we want to compile all our .proto files?

Compile .proto files one directory at a time

Let’s assume the same folder structure from the previous section and try compiling all our .proto files at once:
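The attempt and its failure look roughly like this (the exact error text varies by protoc-gen-go version):

```
$ protoc -I protobuf --go_out=gen protobuf/foo/foo.proto protobuf/common/common.proto
protoc-gen-go: error: inconsistent package names: foo common
```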

What gives? Well, since we’re using the Go generator, each folder is treated as a separate package. For some reason, the generator complains if you try to compile multiple packages at the same time. There is a long discussion about this issue, and a recent comment by a protobuf contributor suggests that a fix will be coming soon. In the meantime, the best workaround is to compile your .proto files one at a time.

Because of this, I highly suggest you create a shell script to handle calls to protoc for your project. You may also consider using protowrap, a small protoc wrapper that Square made to solve this limitation.

Silver lining: --xxx_out flags

When compiling protobuf files, you need to tell protoc what kind of output you want to generate. This is where the --xxx_out flags come into play. Each one specifies a different generator to use, and will output a different set of files.

protoc has a number of these built-in for the languages it supports, such as --cpp_out, --java_out, and --go_out. The value passed to these flags is generally the directory you want the generator to save its output to.

A good pattern for compiling protobufs for multiple languages is to have different generators output to different directories:
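For example (directories illustrative):

```
$ protoc -I protobuf \
    --go_out=gen/go \
    --java_out=gen/java \
    --cpp_out=gen/cpp \
    protobuf/foo/foo.proto
```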

This is also where custom generators come in. You’re not limited to just using the languages protoc supports. If protoc doesn’t know what a certain --xxx_out flag corresponds to, it will look in your $PATH for an executable protoc-gen-xxx to handle the generation. This could be a generator for documentation, validation code, or something else.

Wrapping up…

Protobufs have a lot of good things going for them. They’re easy to write, easy to understand, compile to a vast number of languages, and support custom generators for things like validation and help documentation. Their size over the wire is very small and they are designed to be backwards- and forwards-compatible. This makes them extremely attractive as a means of transferring data in a client/server architecture, or in a microservice environment.

However, they are not without their faults. Key design changes between proto2 and proto3 have fragmented the protobuf community, with many projects continuing or planning support for proto2. The biggest hurdle for someone starting with protobufs is the protoc compiler. It doesn’t make it obvious what you’re doing wrong and isn’t particularly well documented, which can cause a lot of head scratching in the early stages.

So, are protobufs the right choice for your project? That’s up to you to decide. Every project has different needs, and developers should pick the technologies that fit best for their projects. Hopefully this post has given you a good idea of what it’s like to work with protobufs, and helped inform your choice.

Even if you don’t use them in your projects, I highly encourage you to play around with them. Some awesome projects like gRPC are coming out of protobufs, and with the large number of newer big open source projects using them, they might turn out to be the “new hotness” that sticks around for a while.

Kevin Snyder

Written by

DevOps Engineer @ Shipt. Go, Docker, Kubernetes. Building cool things and learning all the time.
