Protocol buffers vs JSON | An interview anecdote

Uddeshya Singh · Published in Geek Culture · Mar 26, 2022 · 5 min read

Protocol Buffers (also known as protobuf) and JSON are both mechanisms for serializing structured data: they encode data into byte sequences so that the data can be passed around a distributed system according to an agreed-upon protocol.

That’s a basic introduction we already know. If you’d like to know more about either encoding, I’ve plugged links[1][2] down below in the resources section.

TL;DR

In this particular example, we figure out how Protocol Buffers saved roughly 50% of the space on the wire compared to JSON. The RPC used is unary, with a non-repeated schema.

The Interview 👋

As far as this blog is concerned, allow me to take you on a small interview round. I was interviewing for a pretty renowned startup, in what I think was an LLD + HLD round. In the initial minutes of the discussion, the interviewer wanted to gauge my grip on my projects and past work, and thanks to my gRPC-go package contributions, the conversation drifted in that direction.

“Awesome, so gRPC uses protocol buffers for encoding the data, and as you said, it’s more compact than JSON. Can you tell me how it is more compact than JSON?”

This question stopped me in my tracks right there! In general, if you don’t know the answer, it’s better to let the interviewer know than to shoot arrows in the dark.
Guess what I did? :) 😅

“For passing messages between the server and the stub (client), both entities need a fixed schema file (a proto file in this context). Using that, the gRPC layer encodes the data and sends it over the wire. I don’t remember the math[2] right now, but after some revision I could probably explain it.”

Luckiest shot in the dark in my opinion! The interviewer decided to move ahead and proceed with the low-level and high-level design round as scheduled.

Fast forward to today 🏃

I recently purchased O’Reilly’s “Designing Data-Intensive Applications” after many recommendations and it turned out to be a goldmine!

Four chapters into the book, I came across the topic of encoding again and had flashbacks of this interview, and of why Protocol Buffer encoding is more compact than binary encodings of JSON. (PS: It’s only fair to compare against a binary encoding of JSON rather than the textual one, because Protobuf itself is binary in nature.)

To illustrate that, allow me to take a custom example that you can easily understand and emulate at your end!

Case Study 📑

Imagine you have a gRPC Greeter service with a function SayHello that returns the string "hello + <name>" (the standard gRPC example given in their repository[3]). This will be a unary RPC, not a streaming one.

On invoking this function, the serializing layer will be using the protocol buffer definition defined below.
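A minimal version of that definition, modeled on the standard helloworld example in the gRPC repository[3] (the service, message names, and field numbers below are the ones from that example), would look roughly like this:

```proto
syntax = "proto3";

package helloworld;

// The greeting service definition.
service Greeter {
  // Sends a greeting (unary RPC).
  rpc SayHello (HelloRequest) returns (HelloReply) {}
}

// The request message containing the user's name.
message HelloRequest {
  string name = 1;
}

// The response message containing the greeting.
message HelloReply {
  string message = 1;
}
```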

In case you are unfamiliar with “proto” files, a proto file is basically a predefined schema of the services, functions, and data types that clients and servers are allowed to use during serialization and code generation. Here, a Greeter service will be registered on the server, exposing one SayHello function that takes a HelloRequest object and returns a HelloReply.

At the same time, let’s make a standard HTTP/1.1 GET API that returns a similar message. Let’s keep this one static to “Hello World” to keep the size comparison fair. Given below is the function definition of this API’s service layer.

PS: I am using the Gin framework to facilitate a quick prototype of this GET API.
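A sketch of that service layer, assuming a plain Gin server (the route path and port below are my own placeholders):

```go
package main

import "github.com/gin-gonic/gin"

func main() {
	r := gin.Default()

	// Static "Hello World" payload so the size comparison with gRPC stays fair.
	r.GET("/hello", func(c *gin.Context) {
		c.JSON(200, gin.H{"message": "Hello World"})
	})

	r.Run(":8080") // placeholder port
}
```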

Now, let’s try and hit both of them one by one. I’ll be using Wireshark to capture the encoded data packets sent over the protocol layer and their sizes; the key point of this exercise is to see how compact protocol buffers are when data is sent over the wire.
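For the gRPC side, the call is a plain unary invocation along the lines of the greeter_client in the gRPC examples[3]; the server address and the name I pass in are assumptions on my part (plaintext so Wireshark can read the HTTP/2 frames without TLS decryption):

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	pb "google.golang.org/grpc/examples/helloworld/helloworld"
)

func main() {
	// Dial the local Greeter server (address is a placeholder).
	conn, err := grpc.Dial("localhost:50051", grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("did not connect: %v", err)
	}
	defer conn.Close()

	client := pb.NewGreeterClient(conn)
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	// Unary RPC: one request, one response.
	reply, err := client.SayHello(ctx, &pb.HelloRequest{Name: "World"})
	if err != nil {
		log.Fatalf("could not greet: %v", err)
	}
	log.Printf("Greeting: %s", reply.GetMessage())
}
```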

JSON over the wire 📪

Wireshark snapshot of the JSON response

The image above is a Wireshark snapshot of the response our client received from the HTTP GET API. As you can see, the data {"message":"Hello World"} is encoded over the wire in 25 bytes (the byte sequence 7b 22 6d....22 7d is how the data appears on the wire).

One conclusion you can draw is that "{" is encoded as 7b (its ASCII byte), and so on for every character. Within the same packet, other metadata is also present, like the header “Content-Length: 25” preceding the actual response bytes.
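You can reproduce that byte sequence without Wireshark; a quick sketch using Go’s standard encoding/json, marshaling the same payload the Gin handler returns:

```go
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// Marshal the same payload and inspect its raw bytes.
	b, _ := json.Marshal(map[string]string{"message": "Hello World"})
	fmt.Printf("% x\n", b) // 7b 22 6d 65 73 73 61 67 65 ... 22 7d
	fmt.Println(len(b))    // 25 bytes
}
```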

Protocol Buffer over the wire 📫

Wireshark snapshot of Protocol Buffer Response

aaaaand you guessed it right! This one is a snapshot of the RPC over HTTP/2 that our Greeter client called. Through all that noise, allow me to focus your attention on the gRPC data packet.

Notice how it also carries “Hello World”, yet the size is reduced to just 13 bytes!

Allow me to walk you through these 13 bytes. The first byte, 0a, packs the field number and the wire type together: in the proto file the field ‘message’ is assigned field number 1, strings use the length-delimited wire type 2, and (1 << 3) | 2 gives 0x0a. The second byte, 0b, is the length of the value that follows: 11 bytes.

The remaining 11 bytes are just the value itself, one byte per character of “Hello World”. To dig deeper into how protocol buffers encode their data, I suggest looking at this resource[2].
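You can check this byte layout yourself as well; a small sketch using the generated HelloReply type and the google.golang.org/protobuf package (this marshals the message on its own, without gRPC’s length-prefixed framing):

```go
package main

import (
	"fmt"

	pb "google.golang.org/grpc/examples/helloworld/helloworld"
	"google.golang.org/protobuf/proto"
)

func main() {
	// Marshal just the protobuf message, without gRPC framing.
	b, _ := proto.Marshal(&pb.HelloReply{Message: "Hello World"})
	fmt.Printf("% x\n", b) // 0a 0b 48 65 6c 6c 6f 20 57 6f 72 6c 64
	fmt.Println(len(b))    // 13 bytes: 1 tag byte + 1 length byte + 11 value bytes
}
```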

Insights 🍵

So, in short, we noticed that the space saved by Protocol Buffers is roughly 50%: 13 bytes instead of 25, a saving of 12 bytes, or 48%.

But what made this possible?

You must’ve noticed that in the protocol buffer encoding, we didn’t need to encode the field name alongside the message at all. All it stored was a one-byte tag carrying the field number and wire type (plus a length byte), whereas in JSON the quoted field name “message” alone took 9 bytes, on top of the colon and the surrounding braces.

This is the advantage of Protocol Buffers: both client and server share the same .proto file, so both know the schema and the fields, and can therefore encode and decode values in a much more space-efficient manner. This does, however, come with its own set of complications:

  • Maintaining backward compatibility between client and server applications as the schema evolves requires more care, since fields are identified by number rather than by name (see the sketch after this list).
  • Both server and client need that common proto schema; otherwise the communication simply won’t work. This is one place where JSON clearly wins: neither side needs to know in advance which keys are coming, unless you are specifically unmarshaling into a certain schema.
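To make the compatibility point concrete, here is a hypothetical later revision of HelloReply (the locale field is purely illustrative, not part of the original example):

```proto
// Hypothetical later revision of HelloReply. The new field gets a fresh
// field number; decoders built against the old schema simply skip bytes
// tagged with numbers they don't recognize. Reusing or renumbering existing
// fields, on the other hand, breaks compatibility.
message HelloReply {
  string message = 1;
  string locale  = 2; // hypothetical new field
}
```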

And that, in short, is why protocol buffers are more compact on the wire!

Resources and Links 🔧
