Experimenting To Make Python Services Faster

Mariano Wahlmann
Published in CBI Engineering
Feb 4, 2020

We tested the C++ implementation of Protobuf for Python to see whether it would make inter-service communication more efficient.

At CB Insights, gRPC is our protocol of choice for inter-service communication.

Every time Service A talks to Service B, a request message is created → serialized → sent over the network → deserialized by the recipient → processed → a response message is created → serialized → sent over the network → deserialized by the sender.

In short, any RPC call involves two pairs of serialization/deserialization operations.
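
As a rough illustration of what one of those pairs looks like in Python, here is a sketch that assumes a hypothetical generated message class ExampleRequest from a compiled .proto file (the names are illustrative, not from our actual service definitions):

# Hypothetical module/class generated by the protobuf compiler.
from example_pb2 import ExampleRequest

# Sender side: build the request and serialize it to bytes for the wire.
request = ExampleRequest(query="acme", limit=10)
payload = request.SerializeToString()

# Recipient side: deserialize the bytes back into a message object.
received = ExampleRequest()
received.ParseFromString(payload)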

CPython, the reference implementation of Python, converts source code into bytecode, an intermediate representation, which is then passed to the interpreter. The interpreter decodes the first instruction and calls a function (written in C) to perform whatever operation it specifies.

Once it’s done, the instruction pointer is incremented and the process repeats until the end of the program. Because of this, CPython’s VM is very slow compared to compiled languages (C, C++, Go, Rust, etc.) or VMs with JIT compilers (JVM, CLR, V8, PyPy, etc.). To mitigate this, many performance-sensitive libraries implement their critical parts in C/C++ and integrate them via the Python/C API.
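
Python’s built-in dis module makes that bytecode layer visible. A tiny example, unrelated to our services:

import dis

def add(a, b):
    return a + b

# Prints one line per bytecode instruction (LOAD_FAST, BINARY_ADD/BINARY_OP,
# RETURN_VALUE, ...); the interpreter decodes and dispatches these one by one.
dis.dis(add)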

At the core of gRPC is Protobuf, a fancy name for a binary serialization format. By default, Python’s Protobuf library uses a pure-Python implementation to serialize and deserialize the messages sent back and forth. And, as you can imagine, it is slow compared to implementations in other languages.

There is a not-so-well-documented option to use the C++ implementation of Protobuf for Python. It offloads the serialization/deserialization of Python objects to C++ (machine code) and, in theory, is much faster. So we decided to give it a try.
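
The switch is controlled by an environment variable that has to be set before the protobuf package is first imported. A minimal sketch of flipping it and checking which backend actually loaded:

import os

# Must be set before google.protobuf is imported anywhere in the process;
# in a real deployment this would live in the service's environment, not in code.
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "cpp"

from google.protobuf.internal import api_implementation

# Reports the backend that was actually loaded ("cpp" or "python"); setting
# the variable alone is not enough if the C++ extension is not installed.
print(api_implementation.Type())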

Making Python services faster

To isolate the test as much as possible from other variables, we tested a service locally (no network involved), with logging disabled, that returns a constant response from memory (no databases or other services involved). To do this, we modified service-template-python to return a list of pre-generated stringified UUIDs (36-character strings).
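
A rough sketch of what the modified endpoint looks like; the generated module names, servicer class, and response field below are placeholders rather than the exact service-template-python definitions:

import uuid

# Hypothetical modules generated from servicetemplatepython.proto; the real
# names in service-template-python may differ.
import servicetemplatepython_pb2 as pb2
import servicetemplatepython_pb2_grpc as pb2_grpc

# Pre-generate the payload once at startup so every request serializes a
# constant, in-memory list; the list size is varied per test scenario.
PAYLOAD_SIZE = 10_000
PRECOMPUTED_IDS = [str(uuid.uuid4()) for _ in range(PAYLOAD_SIZE)]

class ServiceTemplatePythonServicer(pb2_grpc.ServiceTemplatePythonServicer):
    def GetEndpointExample(self, request, context):
        # No databases or downstream services: the response is built from the
        # pre-generated 36-character UUID strings held in memory.
        return pb2.GetEndpointExampleResponse(ids=PRECOMPUTED_IDS)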

To run the test, we used ghz, a load-testing tool for gRPC. By default, ghz sends 200 requests to the service under test and measures the response time; in this case we ran them with a concurrency of 1, i.e. not concurrently (to eliminate another variable).

./ghz -call=servicetemplatepython.servicetemplatepython.GetEndpointExample -proto=servicetemplatepython.proto -insecure -c 1 0.0.0.0:8888

The test scenarios were the Python and C++ Protobuf implementations for list sizes of 1, 10, 100, 1,000, 10,000, and 100,000 UUID strings.

Results

[Chart: response times by list size. Y-axis units are milliseconds; light blue is the C++ implementation, dark blue is the Python implementation.]

On smaller payloads, both implementations perform roughly the same (and the measurements are more affected by other variables such as CPU load, garbage collection, etc.). But as the payload size grows, the C++ implementation emerges as the clear winner, in some cases 2x-2.5x faster than the Python implementation.

The beauty of this is that no code change is required for services. The only change needed is to install/compile the protobuf library built with the C++ implementation.
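
Because the speedup depends entirely on which backend gets loaded, a small sanity check at service startup can confirm the C++ implementation is actually in use. A sketch, not code from our services:

from google.protobuf.internal import api_implementation

def assert_fast_protobuf():
    # Type() reports the protobuf backend that was loaded at import time;
    # anything other than "cpp" means the service is on the pure-Python path.
    backend = api_implementation.Type()
    if backend != "cpp":
        raise RuntimeError(
            "Expected the C++ protobuf implementation, got '%s'; check how "
            "the protobuf package was installed/compiled." % backend
        )

assert_fast_protobuf()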

Originally published at https://www.cbinsights.com.
