Improving Efficiency: LinkedIn’s Transition from JSON to Protocol Buffers

Roopa Kushtagi
5 min read · Jan 10, 2024


Programs usually work with data in at least two different representations:

1. In-memory representation: In memory, data is kept in objects, structs, lists, arrays, hash tables, trees, and so on. These data structures are optimized for efficient access and manipulation by the CPU.

2. Data on file and data over the network: When you want to write data to a file or send it over the network, you have to encode it as a self-contained sequence of bytes.

This is necessary because the data structures used in memory, such as objects or pointers, are specific to the programming language and the runtime environment. Encoding transforms the in-memory representation of data into a format that can be easily and efficiently transmitted or stored as a sequence of bytes.

The translation from the in-memory representation to a byte sequence is called encoding (also known as serialization or marshalling), and the reverse is called decoding (also known as parsing, deserialization, or unmarshalling).
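To make the distinction concrete, here is a minimal sketch in Python (an illustrative choice; the idea itself is language-agnostic) of encoding an in-memory structure into bytes and decoding it back, using JSON as the encoding:

import json

# In-memory representation: a plain Python dict with nested lists.
record = {
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
}

# Encoding (serialization/marshalling): in-memory structure -> bytes.
# The resulting byte sequence is self-contained and can be written
# to a file or sent over the network.
encoded = json.dumps(record).encode("utf-8")

# Decoding (parsing/deserialization/unmarshalling): bytes -> structure.
decoded = json.loads(encoded.decode("utf-8"))
assert decoded == record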

As this is a common problem, numerous encoding formats and libraries are available.

JSON, XML, and CSV are widely known and widely supported standardized encoding formats. They are textual formats, and thus somewhat human-readable.

Challenges with JSON:

JSON, being a standardized encoding format, offers broad programming-language support and is human-readable. However, at LinkedIn it posed a few challenges that resulted in performance bottlenecks.

1. The first challenge is that JSON is a textual format, which tends to be verbose. This results in increased network bandwidth usage and higher latencies, which is less than ideal. While the size can be optimized using standard compression algorithms like gzip, compression and decompression consume additional hardware resources and may be disproportionately expensive or unavailable in some environments (see the sketch after this list).

2. The second challenge was that, due to the textual nature of JSON, serialization and deserialization latency and throughput were suboptimal. LinkedIn uses garbage-collected languages like Java and Python, so improving serialization latency and throughput is crucial for efficiency (also illustrated in the sketch below).
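As a rough illustration of both challenges (a sketch, not LinkedIn's benchmark; real numbers depend on payload shape, library, and runtime), the snippet below measures a JSON payload's size, the extra work gzip adds, and the time spent serializing and deserializing:

import gzip
import json
import timeit

# A repetitive payload standing in for a typical API response.
records = [
    {"userName": f"user{i}", "favoriteNumber": i,
     "interests": ["daydreaming", "hacking"]}
    for i in range(1_000)
]
payload = json.dumps(records).encode("utf-8")

# Challenge 1: textual JSON is verbose. gzip shrinks it, but
# compression and decompression both consume extra CPU.
compressed = gzip.compress(payload)
assert gzip.decompress(compressed) == payload
print(f"raw: {len(payload):,} bytes, gzipped: {len(compressed):,} bytes")

# Challenge 2: text-based (de)serialization itself costs CPU time.
encode_s = timeit.timeit(lambda: json.dumps(records), number=100)
decode_s = timeit.timeit(lambda: json.loads(payload), number=100)
print(f"encode: {encode_s:.3f}s, decode: {decode_s:.3f}s (100 iterations each)")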

When looking for a JSON replacement, LinkedIn wanted an alternative that satisfied a few criteria:

1. Compact payload size for reduced network bandwidth and lower latencies.

2. High serialization and deserialization efficiency.

3. Wide programming language support.

4. Easy integration into their existing REST framework for incremental migration.

After a thorough evaluation of several formats, including Protobuf, FlatBuffers, Cap’n Proto, SMILE, MessagePack, CBOR, and Kryo, Protobuf was determined to be the best option because it performed the most effectively across the above criteria.

Let’s understand what a Protocol Buffer is.

Protocol Buffers:

Protocol Buffers, or Protobuf, was originally developed at Google.

Schema-based encoding: It is a binary encoding library and requires a schema for any data that is encoded. The schema provides a predefined structure for the data, allowing for more streamlined serialization.

With a well-defined schema, Protobuf can efficiently encode and decode data without the need to include field names or additional metadata in the serialized output.

Binary encoding represents data as a sequence of binary digits (0s and 1s). This contrasts with text-based encoding formats like JSON, XML, or CSV, which use human-readable characters to represent data. Binary encoding is more compact and efficient in terms of storage and transmission, resulting in faster serialization and deserialization, but it is not human-readable.
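To see why the binary form is so compact, here is a sketch based on Protobuf's documented wire format: each field is written as a small varint tag carrying the schema's field number and wire type, followed by the value, with no field name in the output at all.

def encode_varint(n: int) -> bytes:
    # Protobuf base-128 varint: 7 bits per byte, least-significant
    # group first, high bit set on every byte except the last.
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

# favorite_number = 1337 is field number 2 with wire type 0 (varint).
tag = encode_varint((2 << 3) | 0)  # tag byte: 0x10
value = encode_varint(1337)        # two bytes: 0xB9 0x0A
print((tag + value).hex())         # "10b90a": 3 bytes in total

The JSON equivalent of that field, "favoriteNumber":1337, takes 21 bytes.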

Protobuf supports a variety of data types, including simple scalar types like integers and strings as well as more complex structures such as nested messages (the counterpart of classes or structs), repeated fields (lists), and enums.

The type efficiency ensures that data is encoded and decoded in a way that aligns with the underlying data types, reducing the overhead associated with type conversion during serialization and deserialization.

Protobuf supports multiple programming languages, allowing for efficient implementation in various language environments.

Example record:

{
"userName": "Martin",
"favoriteNumber": 1337,
"interests": ["daydreaming", "hacking"]
}

Schema used to encode:

message Person {
required string user_name = 1;
optional int64 favorite_number = 2;
repeated string interests = 3;
}

Protocol Buffers encodes this record in only 33 bytes, compared to the 81 bytes of the equivalent compact JSON text.
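Here is a sketch of how that comparison looks with the official Python bindings, assuming the schema above was saved as person.proto and compiled with protoc --python_out=. person.proto, which generates a person_pb2 module:

import json
import person_pb2  # generated by protoc from the schema above (assumed)

person = person_pb2.Person()
person.user_name = "Martin"
person.favorite_number = 1337
person.interests.extend(["daydreaming", "hacking"])
proto_bytes = person.SerializeToString()

json_bytes = json.dumps({
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
}, separators=(",", ":")).encode("utf-8")

print(len(proto_bytes), len(json_bytes))  # expected: 33 vs. 81

# Decoding is symmetric:
restored = person_pb2.Person()
restored.ParseFromString(proto_bytes)
assert restored.user_name == "Martin"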

Also, Protobuf's inherent characteristics (binary encoding, compact representation, schema-based serialization, type efficiency, a standardized binary format, and language agnosticism), together with empirical testing, collectively contributed to its success in achieving high serialization and deserialization efficiency.

Using Protobuf resulted in an average per-host throughput increase of 6.25% for response payloads and 1.77% for request payloads across all services at LinkedIn. For services with large payloads, there was up to a 60% improvement in latency.

Below is the P99 latency comparison chart from benchmarking Protobuf against JSON when servers are under heavy load.

[Chart: P99 latency, Protobuf vs. JSON, under heavy load. Source: LinkedIn Engineering Blog]

Conclusion:

The shift from JSON to Protobuf at LinkedIn showcases tangible efficiency wins, emphasizing the importance of choosing the right encoding format for improved performance at scale.

Also, several topics like this are discussed on my YouTube channel. Please visit: https://www.youtube.com/channel/UCZXINssVU5SlDsV9kqkrTMA?sub_confirmation=1

Appreciate your support.

Must READ for Continuous Learning:

• System Design: https://bit.ly/3S05RGS

• Head First Design Patterns: https://amzn.to/3uDtN9F

• Clean Code: A Handbook of Agile Software Craftsmanship: https://bit.ly/470W9Zf

• Java Concurrency in Practice: https://bit.ly/486vtqz

• Java Performance: The Definitive Guide: https://bit.ly/484BAMk

• Designing Data-Intensive Applications: https://bit.ly/3uDu4cH

• Designing Distributed Systems: https://amzn.to/487C7NV

• Clean Architecture: https://bit.ly/3RwMiWx

• Kafka — The Definitive Guide: https://amzn.to/3NaWUHZ

• Becoming An Effective Software Engineering Manager: https://amzn.to/3NHewv8
