Inner view of protobuf encoding

Yash Suresh Chandra
8 min readMar 20, 2020

--

In this article we will see how protocol buffers encodes data to make it more compact (hence smaller) to transmit. This article does not focus on how to use protocol buffers but instead explores what it does under the hood. Code pieces demonstrated here are in python v3 but similar things will be happening in other languages also.

Here we explore encoding of 6 types -int, string, float, double, dict, array. Having understanding of these 6 types should be sufficient to get a fair idea of working protobuf.

binary puzzle

Protocol Buffers

Protocol buffers are Google’s language-neutral, platform-neutral, extensible mechanism for serializing structured data. Just like xml but smaller in size. It is useful in developing programs to communicate with each other over a wire or for storing data.

Installation

Installation of protobuf compiler is very easy. To include it in python using pip, just run -

pip install protobuf

and you are done.

Usage

We define .proto files on both sides (caller and callee functions). These .proto files contain the schema (called message) of data objects to be transmitted. After defining messages, we just need to run protoc (protobuf compiler) and it will generate the access classes. These access classes can be used directly in your code.

Taking an example of dummy json value -

data.json
  1. Define data.proto, specifying type and number for each field (here we have matched json keys to field names).
data.proto

2. Run protobuf compiler for this file, this will generate data_pb2.py in output directory.

protoc --python_out=. data.proto

3. Use data_pb2.py in your code -

protobuf-json-compare.py

After running this code, we get output -

b'\n\x185e4d67c4599d93d88340b3a3\x1a$9b6d7add-d95c-41cf-9a7f-eedf3b4ec5df*\t$3,754.962\x19http://placehold.it/32x328\x16B\x04blueJ\rJarvis DodsonR\x04maleZ\x06JASPERb\x17jarvisdodson@jasper.comj\x11+1 (942) 515-2514r(614 Grafton Street, Teasdale, Iowa, 3952z\xcc\x02Nisi cillum excepteur ad eu ipsum incididunt id nisi. Proident sit cupidatat sit ullamco ea velit veniam id. Officia nulla laboris nisi quis sint mollit. Labore cupidatat mollit ex magna magna qui sit nulla aliquip duis eiusmod sit non. Sit amet mollit pariatur non minim aliquip officia veniam in aliqua ad amet cupidatat cillum.\r\n\x82\x01\x1b2018-03-22T09:44:03 -06:-30\x89\x01\xde\x8d\x05\x85A%:\xc0\x91\x01\xaf\xce1 {SZ@\x9a\x01\x02eu\x9a\x01\x07nostrud\x9a\x01\x04quis\x9a\x01\x06cillum\x9a\x01\x06veniam\x9a\x01\x03non\x9a\x01\x05ipsum\xa2\x01\x0c\x12\nLorna Owen\xa2\x01\r\x08\x01\x12\tNona Long\xa2\x01\x13\x08\x02\x12\x0fRamona Delacruz\xaa\x011Hello, Jarvis Dodson! You have 7 unread messages.\xb2\x01\x05apple'
1112
777

Well, it did generate ~30% lesser bytes.

Good. Now let’s see how.

Basic Encoding Detour

To learn about protobuf encoding, first we need to revisit some of the techniques that are commonly used for encoding numbers.

varint encoding

Varints are represented in a special way. Each byte in a varint, except the last byte, has the most significant bit (msb) set — this indicates that there are further bytes to come. The lower 7 bits of each byte are used to store the two’s complement representation of the number in groups of 7 bits, least significant group first.

eg. 300

  1. binary representation of 300: 100101100
  2. representing it in groups of 7 bits (lower 7 bits are for number): 0000010 0101100
  3. reverse the order of groups (least significant group first): 0101100 0000010
  4. prepend 1 to first group (its not the end of number) and 0 to second group (signifying end): 10101100 00000010

so 300 becomes 1010110000000010

In JSON, it would have taken 3 bytes to represent 300 00000011 00000000 00000000 , after varint encoding, its 2 bytes 10101100 00000010 .

float encoding

Floating point encoding is a bit complex. You can learn in detail about it here. In a gist, float takes 32 bits in memory — 1 bit for sign + 8 bits for exponent (expressed in the form of e-127 where e is exponent)+ 23 bits for precision.

Let’s take example of number 123.375 as an example.

  1. convert 123.375 into binary: 1111011.011
  2. move decimal to left such that only one set bit is remaining: 1.111011011*(2^(133-127))
  3. we get exponent=133 ie. 10000101, precision=111011011 and sign=0 (for positive).

so 123.375 becomes 0 10000101 11101101100000000000000

double encoding

Encoding of double is same as floating point encoding, difference lies in the size. Double takes 64 bits in memory — 1 bit for sign + 11 bits for exponent (expressed in the form of e-1023 where e is exponent)+ 52 bits for precision.

Taking example of number 123.375.

  1. convert 123.375 into binary: 1111011.011
  2. move decimal to left such that only one set bit is remaining: 1.111011011*(2^(1029-1023))
  3. we get exponent=1029 ie. 10000000101, precision=111011011 and sign=0 (for positive).

so 123.375 becomes:

0 10000000101 1110110110000000000000000000000000000000000000000000

Protobuf Encoding

Now that we have refreshed our basic knowledge, let’s dig in!

wire types

Protobuf categorizes normally used data types into wire types which provides information to find the length of the value.

wire types chart

For wire type 2, length is also encoded along with value.

zero values

Zero values are not transmitted. 0, False, “”, does not take any bytes.

This creates ambiguity at receiving end whether value is 0 or value is not passed at all .

integer encoding

Protobuf encode integers (wire type 0) using varint encoding. If integer value is negative, it is always 10 bytes long.

# returns encoded int value
def encode_varint(int_value):
encoded_value = []
if int_value < 0:
int_value += (1 << 64)
while True:
if (int_value >> 7) > 0:
encoded_value.append((int_value & 0x7f) | 0x80)
else:
encoded_value.append(int_value & 0x7f)
break
int_value = int_value >> 7
return bytes(encoded_value)

Eg. running encode_varint(300) gives result b'\xac\x02' (hex representation of 10101100 00000010).

string encoding

String values are transmitted as it is. They are just transformed into bytes.

# returns encoded string value
def encode_string(string_value):
return string_value.encode('utf-8')

Eg. running encode_string('abc') gives result b'abc' .

float encoding

Protobuf encodes floats into little endian byte order (least significant byte comes first). As we saw earlier, float encoding representation is a bit different. Fortunately, python provides a way to make it easy. You can read more about struct from here.

# returns encoded float value
def encode_float(float_value):
return struct.pack('<f', float_value)

Eg. running encode_float(123.375) gives b'\x00\xc0\xf6B' (little endian hex of 0 10000101 11101101100000000000000)

double encoding

Similar to float encoding, double is encoded in little endian byte order using struct package.

# returns encoded double value
def encode_double(double_value):
return struct.pack('<d', double_value)

Eg. running encode_double(123.375) gives b'\x00\x00\x00\x00\x00\xd8^@' (little endian hex of 0 10000000101 1110110110000000000000000000000000000000000000000000)

tags

Protobuf does not send keys. It them replaces with tags. A tag consists of two things:

  1. field number — where the field lies in structured message.
  2. wire type — provides just enough information to find the length of the following value.

A tag is varint encoding of number field_number<<3|wire_type .

# returns varint encoded tag based upon field number and wire type
def get_tag(field_num, wire_type):
if wire_type in ['string', 'dict', 'array']:
return encode_varint((field_num << 3) | 2)
elif wire_type in ['varint']:
return encode_varint((field_num << 3) | 0)
elif wire_type in ['float']:
return encode_varint((field_num << 3) | 5)
elif wire_type in ['double']:
return encode_varint((field_num << 3) | 1)

Eg. running get_tag(1, 'varint') gives b'\n' (byte string for (1<<3|2) ie. 10)

encoding with tags

Since we have seen how primitive values are encoded and how to generate their tag values, we can write functions for them as follows (note that length is included for wire type 2 fields) -

# returns encoded string value along with its tag
def encode_string_with_tag(string_value, field_num, repeated):
# check if value is an array of string
if repeated:
return encode_repeated_string(string_value, field_num)
tag = get_tag(field_num, 'string')
encoded_value = encode_string(string_value)
encoded_len = encode_varint(len(encoded_value))
return tag + encoded_len + encoded_value
# returns encoded float value along with its tag
def encode_float_with_tag(float_value, field_num, repeated):
if repeated:
return encode_repeated_float(float_value, field_num)
tag = get_tag(field_num, 'float')
encoded_value = encode_float(float_value)
return tag + encoded_value
# returns encoded int value along with its tag
def encode_varint_with_tag(int_value, field_num, repeated):
# check if value is an array of varint
if repeated:
return encode_repeated_varint(int_value, field_num)
tag = get_tag(field_num, 'varint')
encoded_value = encode_varint(int_value)
return tag + encoded_value
# returns encoded double value along with its tag
def encode_double_with_tag(double_value, field_num, repeated):
# check if value is an array of double
if repeated:
return encode_repeated_double(double_value, field_num)
tag = get_tag(field_num, 'double')
encoded_value = encode_double(double_value)
return tag + encoded_value

Eg. running encode_string_with_tag('abc',1) gives b'\n\x03abc' (\n for tag, \x03 for length, abc is our encoded string)

dict encoding

We can see a dict as a collection of different type of values. It encodes each value and concatenates them.

# returns encoded dict
# this function assumes data in a particular format: it contains an array of field and values.
# each field is represented as a tuple of size 4 (value, wire type, field number, if field is an array)
# eg. message Msg { int64 id = 1; string text = 2; } => [(1234, 'varint', 1, False), ('hello world', 'string', 1, False)]
# also we can assume that every protobuf message is a dict at start, so this function is an entry point for encoding.
def encode_dict(data):
encoded_value = b''
for d in data:
value, wire_type, field_num, repeated = d[0], d[1], d[2], d[3]
if not value:
continue
if wire_type == 'string':
encoded_value += encode_string_with_tag(value, field_num, repeated)
elif wire_type == 'varint':
encoded_value += encode_varint_with_tag(value, field_num, repeated)
elif wire_type == 'float':
encoded_value += encode_float_with_tag(value, field_num, repeated)
elif wire_type == 'double':
encoded_value += encode_double_with_tag(value, field_num, repeated)
elif wire_type == 'dict':
encoded_value += encode_dict_with_tag(value, field_num, repeated)
return encoded_value

Since protobuf uses tags to identify field number of a field, there is no point in relying on the order of encoded values.

repeated encoding

There is a slightly different mechanism for encoding repeated values (array of values). It is divided in two parts -

  1. If it is an array of primitive numeric types (integer, float, double), then they are encoded within single key-value (tag + length of bytes + encoded bytes) pair. Such fields are called packed repeated fields.
# returns encoded array of float values
def encode_repeated_float(float_values, field_num):
encoded_values = b''
tag = get_tag(field_num, 'array')
for f in float_values:
ev = encode_float(f)
encoded_values += ev
encoded_len = encode_varint(len(encoded_values))
return tag + encoded_len + encoded_values
# returns encoded array of int values
def encode_repeated_varint(int_values, field_num):
encoded_values = b''
tag = get_tag(field_num, 'array')
for i in int_values:
ev = encode_varint(i)
encoded_values += ev
encoded_len = encode_varint(len(encoded_values))
return tag + encoded_len + encoded_values
# returns encoded array of double values
def encode_repeated_double(double_values, field_num):
encoded_values = b''
tag = get_tag(field_num, 'array')
for d in double_values:
ev = encode_double(d)
encoded_values += ev
encoded_len = encode_varint(len(encoded_values))
return tag + encoded_len + encoded_values

Eg. running encode_repeated_double([12, 33, 21], 1) gives b'\n\x03\x0c!\x15' (\n for tag, \x03 for length, \x0c for 12, ! for 33 and \x15 for 21)

2. Otherwise, protobuf just concatenates encoded value (along with tag) of each element.

# returns encoded array of string values
def encode_repeated_string(string_values, field_num):
encoded_value = b''
for s in string_values:
encoded_value += encode_string_with_tag(s, field_num, False)
return encoded_value
# returns encoded array of dict values
def encode_repeated_dict(dict_values, field_num):
encoded_value = b''
for d in dict_values:
encoded_value += encode_dict_with_tag(d, field_num, False)
return encoded_value

Eg. running encode_repeated_string(['abc', 'def', 'xyz'], 1) gives b'\n\x03abc\n\x03def\n\x03xyz' (\n\x03abc for encoded abc, \n\x03def for encoded def, \n\x03xyz for encoded xyz)

Putting it all together

From what we know now, we can join them together to create our own protobuf encoder.

encode.py

Parting words

There you have it. Even though we discussed for some of the types, other types like bool and enum are encoded in similar ways.

Now that you know, choose wisely if protocol buffers suits your requirements. If you are already using it, information above can help you debug issues. Either way, happy coding!

References

--

--