Inner view of protobuf encoding
In this article we will see how protocol buffers encodes data to make it more compact (hence smaller) to transmit. This article does not focus on how to use protocol buffers but instead explores what it does under the hood. Code pieces demonstrated here are in python v3 but similar things will be happening in other languages also.
Here we explore encoding of 6 types -int, string, float, double, dict, array. Having understanding of these 6 types should be sufficient to get a fair idea of working protobuf.
Protocol Buffers
Protocol buffers are Google’s language-neutral, platform-neutral, extensible mechanism for serializing structured data. Just like xml but smaller in size. It is useful in developing programs to communicate with each other over a wire or for storing data.
Installation
Installation of protobuf compiler is very easy. To include it in python using pip, just run -
pip install protobuf
and you are done.
Usage
We define .proto files on both sides (caller and callee functions). These .proto files contain the schema (called message) of data objects to be transmitted. After defining messages, we just need to run protoc (protobuf compiler) and it will generate the access classes. These access classes can be used directly in your code.
Taking an example of dummy json value -
- Define data.proto, specifying type and number for each field (here we have matched json keys to field names).
2. Run protobuf compiler for this file, this will generate data_pb2.py in output directory.
protoc --python_out=. data.proto
3. Use data_pb2.py in your code -
After running this code, we get output -
b'\n\x185e4d67c4599d93d88340b3a3\x1a$9b6d7add-d95c-41cf-9a7f-eedf3b4ec5df*\t$3,754.962\x19http://placehold.it/32x328\x16B\x04blueJ\rJarvis DodsonR\x04maleZ\x06JASPERb\x17jarvisdodson@jasper.comj\x11+1 (942) 515-2514r(614 Grafton Street, Teasdale, Iowa, 3952z\xcc\x02Nisi cillum excepteur ad eu ipsum incididunt id nisi. Proident sit cupidatat sit ullamco ea velit veniam id. Officia nulla laboris nisi quis sint mollit. Labore cupidatat mollit ex magna magna qui sit nulla aliquip duis eiusmod sit non. Sit amet mollit pariatur non minim aliquip officia veniam in aliqua ad amet cupidatat cillum.\r\n\x82\x01\x1b2018-03-22T09:44:03 -06:-30\x89\x01\xde\x8d\x05\x85A%:\xc0\x91\x01\xaf\xce1 {SZ@\x9a\x01\x02eu\x9a\x01\x07nostrud\x9a\x01\x04quis\x9a\x01\x06cillum\x9a\x01\x06veniam\x9a\x01\x03non\x9a\x01\x05ipsum\xa2\x01\x0c\x12\nLorna Owen\xa2\x01\r\x08\x01\x12\tNona Long\xa2\x01\x13\x08\x02\x12\x0fRamona Delacruz\xaa\x011Hello, Jarvis Dodson! You have 7 unread messages.\xb2\x01\x05apple'
1112
777
Well, it did generate ~30% lesser bytes.
Good. Now let’s see how.
Basic Encoding Detour
To learn about protobuf encoding, first we need to revisit some of the techniques that are commonly used for encoding numbers.
varint encoding
Varints are represented in a special way. Each byte in a varint, except the last byte, has the most significant bit (msb) set — this indicates that there are further bytes to come. The lower 7 bits of each byte are used to store the two’s complement representation of the number in groups of 7 bits, least significant group first.
eg. 300
- binary representation of 300:
100101100
- representing it in groups of 7 bits (lower 7 bits are for number):
0000010 0101100
- reverse the order of groups (least significant group first):
0101100 0000010
- prepend 1 to first group (its not the end of number) and 0 to second group (signifying end):
10101100 00000010
so 300 becomes 1010110000000010
In JSON, it would have taken 3 bytes to represent 300 00000011 00000000 00000000
, after varint encoding, its 2 bytes 10101100 00000010
.
float encoding
Floating point encoding is a bit complex. You can learn in detail about it here. In a gist, float takes 32 bits in memory — 1 bit for sign + 8 bits for exponent (expressed in the form of e-127
where e is exponent)+ 23 bits for precision.
Let’s take example of number 123.375 as an example.
- convert 123.375 into binary:
1111011.011
- move decimal to left such that only one set bit is remaining:
1.111011011*(2^(133-127))
- we get exponent=133 ie. 10000101, precision=111011011 and sign=0 (for positive).
so 123.375 becomes 0 10000101 11101101100000000000000
double encoding
Encoding of double is same as floating point encoding, difference lies in the size. Double takes 64 bits in memory — 1 bit for sign + 11 bits for exponent (expressed in the form of e-1023
where e is exponent)+ 52 bits for precision.
Taking example of number 123.375.
- convert 123.375 into binary:
1111011.011
- move decimal to left such that only one set bit is remaining:
1.111011011*(2^(1029-1023))
- we get exponent=1029 ie. 10000000101, precision=111011011 and sign=0 (for positive).
so 123.375 becomes:
0 10000000101 1110110110000000000000000000000000000000000000000000
Protobuf Encoding
Now that we have refreshed our basic knowledge, let’s dig in!
wire types
Protobuf categorizes normally used data types into wire types which provides information to find the length of the value.
For wire type 2, length is also encoded along with value.
zero values
Zero values are not transmitted. 0, False, “”, does not take any bytes.
This creates ambiguity at receiving end whether value is 0 or value is not passed at all .
integer encoding
Protobuf encode integers (wire type 0) using varint encoding. If integer value is negative, it is always 10 bytes long.
# returns encoded int value
def encode_varint(int_value):
encoded_value = []
if int_value < 0:
int_value += (1 << 64)
while True:
if (int_value >> 7) > 0:
encoded_value.append((int_value & 0x7f) | 0x80)
else:
encoded_value.append(int_value & 0x7f)
break
int_value = int_value >> 7
return bytes(encoded_value)
Eg. running encode_varint(300)
gives result b'\xac\x02'
(hex representation of 10101100 00000010
).
string encoding
String values are transmitted as it is. They are just transformed into bytes.
# returns encoded string value
def encode_string(string_value):
return string_value.encode('utf-8')
Eg. running encode_string('abc')
gives result b'abc'
.
float encoding
Protobuf encodes floats into little endian byte order (least significant byte comes first). As we saw earlier, float encoding representation is a bit different. Fortunately, python provides a way to make it easy. You can read more about struct from here.
# returns encoded float value
def encode_float(float_value):
return struct.pack('<f', float_value)
Eg. running encode_float(123.375)
gives b'\x00\xc0\xf6B'
(little endian hex of 0 10000101 11101101100000000000000
)
double encoding
Similar to float encoding, double is encoded in little endian byte order using struct package.
# returns encoded double value
def encode_double(double_value):
return struct.pack('<d', double_value)
Eg. running encode_double(123.375)
gives b'\x00\x00\x00\x00\x00\xd8^@'
(little endian hex of 0 10000000101 1110110110000000000000000000000000000000000000000000
)
tags
Protobuf does not send keys. It them replaces with tags. A tag consists of two things:
- field number — where the field lies in structured message.
- wire type — provides just enough information to find the length of the following value.
A tag is varint encoding of number field_number<<3|wire_type
.
# returns varint encoded tag based upon field number and wire type
def get_tag(field_num, wire_type):
if wire_type in ['string', 'dict', 'array']:
return encode_varint((field_num << 3) | 2)
elif wire_type in ['varint']:
return encode_varint((field_num << 3) | 0)
elif wire_type in ['float']:
return encode_varint((field_num << 3) | 5)
elif wire_type in ['double']:
return encode_varint((field_num << 3) | 1)
Eg. running get_tag(1, 'varint')
gives b'\n'
(byte string for (1<<3|2)
ie. 10)
encoding with tags
Since we have seen how primitive values are encoded and how to generate their tag values, we can write functions for them as follows (note that length is included for wire type 2 fields) -
# returns encoded string value along with its tag
def encode_string_with_tag(string_value, field_num, repeated):
# check if value is an array of string
if repeated:
return encode_repeated_string(string_value, field_num)
tag = get_tag(field_num, 'string')
encoded_value = encode_string(string_value)
encoded_len = encode_varint(len(encoded_value))
return tag + encoded_len + encoded_value# returns encoded float value along with its tag
def encode_float_with_tag(float_value, field_num, repeated):
if repeated:
return encode_repeated_float(float_value, field_num)
tag = get_tag(field_num, 'float')
encoded_value = encode_float(float_value)
return tag + encoded_value# returns encoded int value along with its tag
def encode_varint_with_tag(int_value, field_num, repeated):
# check if value is an array of varint
if repeated:
return encode_repeated_varint(int_value, field_num)
tag = get_tag(field_num, 'varint')
encoded_value = encode_varint(int_value)
return tag + encoded_value# returns encoded double value along with its tag
def encode_double_with_tag(double_value, field_num, repeated):
# check if value is an array of double
if repeated:
return encode_repeated_double(double_value, field_num)
tag = get_tag(field_num, 'double')
encoded_value = encode_double(double_value)
return tag + encoded_value
Eg. running encode_string_with_tag('abc',1)
gives b'\n\x03abc'
(\n
for tag, \x03
for length, abc
is our encoded string)
dict encoding
We can see a dict as a collection of different type of values. It encodes each value and concatenates them.
# returns encoded dict
# this function assumes data in a particular format: it contains an array of field and values.
# each field is represented as a tuple of size 4 (value, wire type, field number, if field is an array)
# eg. message Msg { int64 id = 1; string text = 2; } => [(1234, 'varint', 1, False), ('hello world', 'string', 1, False)]
# also we can assume that every protobuf message is a dict at start, so this function is an entry point for encoding.
def encode_dict(data):
encoded_value = b''
for d in data:
value, wire_type, field_num, repeated = d[0], d[1], d[2], d[3]
if not value:
continue
if wire_type == 'string':
encoded_value += encode_string_with_tag(value, field_num, repeated)
elif wire_type == 'varint':
encoded_value += encode_varint_with_tag(value, field_num, repeated)
elif wire_type == 'float':
encoded_value += encode_float_with_tag(value, field_num, repeated)
elif wire_type == 'double':
encoded_value += encode_double_with_tag(value, field_num, repeated)
elif wire_type == 'dict':
encoded_value += encode_dict_with_tag(value, field_num, repeated)
return encoded_value
Since protobuf uses tags to identify field number of a field, there is no point in relying on the order of encoded values.
repeated encoding
There is a slightly different mechanism for encoding repeated values (array of values). It is divided in two parts -
- If it is an array of primitive numeric types (integer, float, double), then they are encoded within single key-value (tag + length of bytes + encoded bytes) pair. Such fields are called packed repeated fields.
# returns encoded array of float values
def encode_repeated_float(float_values, field_num):
encoded_values = b''
tag = get_tag(field_num, 'array')
for f in float_values:
ev = encode_float(f)
encoded_values += ev
encoded_len = encode_varint(len(encoded_values))
return tag + encoded_len + encoded_values# returns encoded array of int values
def encode_repeated_varint(int_values, field_num):
encoded_values = b''
tag = get_tag(field_num, 'array')
for i in int_values:
ev = encode_varint(i)
encoded_values += ev
encoded_len = encode_varint(len(encoded_values))
return tag + encoded_len + encoded_values# returns encoded array of double values
def encode_repeated_double(double_values, field_num):
encoded_values = b''
tag = get_tag(field_num, 'array')
for d in double_values:
ev = encode_double(d)
encoded_values += ev
encoded_len = encode_varint(len(encoded_values))
return tag + encoded_len + encoded_values
Eg. running encode_repeated_double([12, 33, 21], 1)
gives b'\n\x03\x0c!\x15'
(\n
for tag, \x03
for length, \x0c
for 12, !
for 33 and \x15
for 21)
2. Otherwise, protobuf just concatenates encoded value (along with tag) of each element.
# returns encoded array of string values
def encode_repeated_string(string_values, field_num):
encoded_value = b''
for s in string_values:
encoded_value += encode_string_with_tag(s, field_num, False)
return encoded_value# returns encoded array of dict values
def encode_repeated_dict(dict_values, field_num):
encoded_value = b''
for d in dict_values:
encoded_value += encode_dict_with_tag(d, field_num, False)
return encoded_value
Eg. running encode_repeated_string(['abc', 'def', 'xyz'], 1)
gives b'\n\x03abc\n\x03def\n\x03xyz'
(\n\x03abc
for encoded abc, \n\x03def
for encoded def, \n\x03xyz
for encoded xyz)
Putting it all together
From what we know now, we can join them together to create our own protobuf encoder.
Parting words
There you have it. Even though we discussed for some of the types, other types like bool and enum are encoded in similar ways.
Now that you know, choose wisely if protocol buffers suits your requirements. If you are already using it, information above can help you debug issues. Either way, happy coding!