Implementing, and then replacing, an in-house communication protocol

Enrique Zamudio
Published in bitso.engineering
Sep 7, 2023

Back in 2017, the Bitso system consisted of a monolithic application written in PHP, which we needed to split into more manageable parts, to be able to gradually migrate logic and data into more robust platforms. This is not that story. Well, not exactly.

This is the story of one of the many challenges that splitting the monolith presented: how the existing application would communicate with the new modular services, instead of going directly to a monolithic database. There would be more network jumps; instead of the monolith talking to a database, it would be talking to several new services, written in Java, which would in turn talk to one or more databases. Increased latency was inevitable, but it was crucial to keep it to a minimum.

We evaluated some alternatives:

  • REST API — although great for public APIs, REST is not the best option for internal communication across services. HTTP offers a lot of features which are often simply not needed, and there’s an associated overhead cost.
  • grpc — this technology was in its early days (it was released in August 2016), and it was not available for PHP. But it was a very interesting alternative, and one of its strengths was the use of Protocol Buffers, a binary messaging format developed by Google.

The underlying message format used by grpc, Protocol Buffers, has some very neat features:

  • Very compact binary encoding
  • Fast serialization and deserialization
  • Support for several platforms, including PHP and Java
  • Strongly typed, with common types supported across platforms, such as strings, 32-bit and 64-bit integers, booleans, and simple maps; and of course each message can be used as the type of a field in another message
  • Extensible messages, which allows easily adding fields without having to upgrade both sides of a communication channel at once

So, even though we couldn’t use grpc, we could leverage Protocol Buffers (or protobuf for short). All we needed was a simple protocol to transmit protobuf messages. So I wrote a simple proof-of-concept thing inspired by my previous experience with ISO8583, which is a binary messaging format/protocol widely used by banks; ATMs and credit card terminals use it. ISO8583 communication can be synchronous or asynchronous; each message is preceded by a fixed-length header which is interpreted as an unsigned integer indicating the length of the message that comes after it. So you merely need to read, say, 4 bytes, interpret that as an unsigned 32-bit integer, and then read however many bytes that number says.
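
For illustration, this is roughly what reading one frame of such a protocol looks like in Java. This is just a minimal sketch, not the actual Bitso code: read the 4-byte header, interpret it as the payload length, then read exactly that many bytes and hand them to the generated protobuf parser.

import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

public final class LengthPrefixedFraming {

    public static byte[] readFrame(InputStream in) throws IOException {
        var data = new DataInputStream(in);
        //4-byte big-endian header, interpreted as the length of the message that follows
        //(unsigned in the protocol; real message sizes fit comfortably in a signed int)
        int length = data.readInt();
        byte[] payload = new byte[length];
        data.readFully(payload); //read exactly that many bytes
        //the payload is then handed to the generated protobuf parser, e.g. parseFrom()
        return payload;
    }
}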

Protobuf offers a DSL (Domain-Specific Language) to define messages. This is a very simple example of protobuf messages:

syntax = "proto3";

message UserLogin {
  string username = 1;
  string password = 2;
}

message LoginRequest {
  int64 id = 1;
  UserLogin login = 2;
}

message LoginResponse {
  int64 id = 1;
  int32 rcode = 2;
  string error = 3;
}

The protobuf compiler can generate Java and/or PHP code from these definitions. For the transport, we had to write two implementations of the simple protocol: a Java one, focused on the server side but with a client that was initially meant for testing only (it later became the default client once we started using the same protocol between Java services), and a PHP client.

Then in September 2017, grpc for PHP came out, which was awesome news, but it was only for PHP 7, and we were still using 5.5, so we still couldn’t use it. We kept using this (nameless) thin protocol to transmit protobuf messages across services for the time being. Which, as usual, really meant permanently; since we had overcome the comms obstacle, and it worked just fine, we simply forgot about it and kept using it, focusing on new and improved products for the company.

A big part of why this worked so well was because of how components were designed:

  • An interface with the desired behavior
  • An implementation that lives inside the service, with all the business logic, connection to database, etc.
  • A proxy implementation that lives in a library, which creates protobuf messages, sends them to the service, and converts the protobuf response back to whatever the interface defined.
  • A server-side translator (which we called a handler) to dispatch the protobuf requests, by calling a method in the “real” component and encoding the return value as a protobuf message to send back to the client.

So to follow the example above, we can have an interface with a login method:

public interface LoginService {
    LoginResult login(String username, String password);
}

Then we’ll have an implementation of this interface, which validates the username and password against a database and returns a result indicating success or failure. This can be tested in isolation and has absolutely nothing to do with protobuf; but we can also define another implementation of the interface, which uses protobuf and our simple protocol underneath, to send messages to the service itself:

public class LoginProtoshim implements LoginService {

    private SimpleClient client; //this will point to a remote host and port
    private long autoincrementId; //used to correlate requests with responses

    public LoginResult login(String username, String password) {
        //Create the request
        var request = LoginRequest.newBuilder()
                .setId(autoincrementId++)
                .setLogin(UserLogin.newBuilder()
                        .setUsername(username)
                        .setPassword(password)
                        .build())
                .build();
        //send it to the service
        var response = client.send(request);
        //interpret the response and turn it into a result.
        //protobuf code should never leave the scope of this class on the client side.
        if (response.getRcode() == OK) {
            return LoginResult.SUCCESS;
        }
        return new LoginResult(response.getRcode(), response.getError());
    }
}

In PHP we would do something similar; the protobuf classes are generated and we would use our own client to send the request and read the response.

The server side was a bit more complicated; we used netty for the underlying network sockets and wrote a thin layer on top of it, with message handlers that would receive an already parsed protobuf message and would have to return a specific protobuf response. The conversion would be the other way around:

public class LoginHandler {

    private LoginService login; //this will point to the "real" implementation

    public LoginResponse handle(LoginRequest request) {
        //call the business logic with the data from the protobuf message
        //protobuf code must never leave the scope of the handlers on the server side
        var result = login.login(request.getLogin().getUsername(),
                request.getLogin().getPassword());
        var response = LoginResponse.newBuilder().setId(request.getId());
        //Copy the result data onto the response
        if (result == LoginResult.SUCCESS) {
            response.setRcode(OK);
        } else {
            response.setRcode(result.getErrorCode())
                    .setError(result.getErrorMessage());
        }
        //the caller of this method will send this response to the client
        return response.build();
    }
}

The handler has to be registered in a server component that can receive different message types, parse them, and then pass the incoming protobuf message to the appropriate handler.
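
As a rough illustration of that dispatching step, a registry could map each request type to its handler. This is only a sketch with invented names (HandlerRegistry, MessageHandler), not our real classes:

import com.google.protobuf.Message;

import java.util.HashMap;
import java.util.Map;

public final class HandlerRegistry {

    //The common shape shared by all handlers, e.g. LoginHandler::handle
    public interface MessageHandler<Q extends Message, R extends Message> {
        R handle(Q request);
    }

    //Maps an incoming request type to the handler that knows how to process it
    private final Map<Class<? extends Message>, MessageHandler<?, ?>> handlers = new HashMap<>();

    public <Q extends Message, R extends Message> void register(Class<Q> type, MessageHandler<Q, R> handler) {
        handlers.put(type, handler);
    }

    @SuppressWarnings("unchecked")
    public Message dispatch(Message request) {
        var handler = (MessageHandler<Message, Message>) handlers.get(request.getClass());
        if (handler == null) {
            throw new IllegalArgumentException("No handler registered for " + request.getClass().getSimpleName());
        }
        return handler.handle(request);
    }
}

A LoginHandler would then be plugged in with something like registry.register(LoginRequest.class, loginHandler::handle), and the network layer only ever calls dispatch with whatever message it just parsed.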

To test the LoginHandler, we can either create protobuf messages by hand and send them with the low-level client for our simple protocol, or leverage the LoginProtoshim and use it to send the messages. This second option has the benefit that we’re also testing the protoshim at the same time (my colleague Superserch came up with the name “protoshim” when he wrote the first one).

When an external service needs to authenticate a user, it does so through the LoginService interface. At runtime, the LoginProtoshim is injected, but for testing, one of two things can be done: either mock the LoginService, or inject the implementation from the service instead, eliminating the need to mock anything (as long as the database and any other resources it needs are available).
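
As a sketch of the first option, a consumer can be tested with a mocked LoginService; here using JUnit 5 and Mockito, and SessionOpener is an invented consumer, just to show that callers only ever see the interface:

import static org.junit.jupiter.api.Assertions.assertTrue;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

import org.junit.jupiter.api.Test;

class SessionOpenerTest {

    //Invented consumer: it depends only on the LoginService interface
    record SessionOpener(LoginService login) {
        boolean open(String user, String pass) {
            return login.login(user, pass) == LoginResult.SUCCESS;
        }
    }

    @Test
    void opensASessionWhenLoginSucceeds() {
        //At runtime the LoginProtoshim would be injected instead of this mock
        LoginService login = mock(LoginService.class);
        when(login.login("alice", "s3cret")).thenReturn(LoginResult.SUCCESS);

        assertTrue(new SessionOpener(login).open("alice", "s3cret"));
    }
}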

This way of communicating across services worked really well at first; as we used it more and more, we noticed what it was lacking and gradually added features like metrics, tracing, logging, retries, resilience (through the resilience4j library), and other amenities.

One feature I particularly liked had to do with persistent connections: sometimes, when a service was shutting down, an established connection would be lost. Since we run on kubernetes, we know there has to be at least one other pod available, so we handled reconnections gracefully in the low-level client. We even added a special response from the server indicating that it is in the process of shutting down; the client interprets it, closes that connection, and retries the request over a new connection, which we know will be accepted by a different pod, one that is not shutting down.

Of course, not everything was ponies and rainbows. Our Cloud Ops team was not particularly happy that, this being a custom protocol, there was no out-of-the-box support for liveness/readiness probes, so we had to write our own. We eventually provided one, also written in Java so we could leverage the simple client we already had, and compiled to a native binary with GraalVM.
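
To make that reconnect-on-shutdown behavior a bit more concrete, here is an illustrative sketch, not our actual client; Connection, isShuttingDown, and openConnection are invented names:

import com.google.protobuf.Message;
import java.io.IOException;

public abstract class ReconnectingClient {

    //Minimal view of a persistent, low-level connection to one pod
    public interface Connection extends AutoCloseable {
        Message send(Message request) throws IOException;
        @Override
        void close() throws IOException;
    }

    private Connection connection;

    protected ReconnectingClient(Connection initial) {
        this.connection = initial;
    }

    public Message send(Message request) throws IOException {
        var response = connection.send(request);
        //The server answers with a special "shutting down" response while draining
        if (isShuttingDown(response)) {
            connection.close();
            //A fresh connection will be accepted by a different, healthy pod
            connection = openConnection();
            response = connection.send(request);
        }
        return response;
    }

    //Recognizes the special shutdown response defined by the protocol
    protected abstract boolean isShuttingDown(Message response);

    //Resolves the service address again and opens a new socket
    protected abstract Connection openConnection() throws IOException;
}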

Fast forward to 2020. We finally managed to upgrade to PHP 7, so it was now possible to use grpc on the client. By this time grpc was really mature, with some really nice features that our protobuf comms didn’t have, such as streaming large responses. So, as a proof of concept, we tested migrating a service from our custom protobuf to grpc, with excellent results: in just under a week we had the grpc server and a new protoshim using grpc underneath. This was mostly thanks to the fact that we could reuse the protobuf messages we already had and only needed to add the rpc calls; and of course grpc already includes metrics and tracing, and kubernetes provides liveness and readiness probes for grpc.
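
To give an idea of how little changes on the consumer side, here is a hypothetical grpc-based protoshim. It assumes an rpc was added to the existing .proto along the lines of service Login { rpc DoLogin (LoginRequest) returns (LoginResponse); }, so the LoginGrpc stub below is what grpc-java would generate for that service; none of these names come from our actual code.

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

public class LoginGrpcShim implements LoginService {

    private final LoginGrpc.LoginBlockingStub stub;

    public LoginGrpcShim(String host, int port) {
        ManagedChannel channel = ManagedChannelBuilder.forAddress(host, port)
                .usePlaintext()
                .build();
        stub = LoginGrpc.newBlockingStub(channel);
    }

    @Override
    public LoginResult login(String username, String password) {
        //the correlation id is no longer needed; grpc matches responses to requests itself
        var response = stub.doLogin(LoginRequest.newBuilder()
                .setLogin(UserLogin.newBuilder()
                        .setUsername(username)
                        .setPassword(password)
                        .build())
                .build());
        if (response.getRcode() == OK) {
            return LoginResult.SUCCESS;
        }
        return new LoginResult(response.getRcode(), response.getError());
    }
}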

Today, we’re still in the process of replacing our custom protobuf comms with grpc. This is not a high-priority task because the current implementation is working well; the move is mostly so we don’t have to maintain something which, while being critical to our system, does not really add any value to our users or any sort of competitive advantage. The concrete grpc implementation of a component requires less code than the custom protobuf one, which means less code to maintain, and grpc is also widely used and well supported. And it’s open source, so if we find it lacks something we need, it’s preferable to contribute to the project instead of having to add to our own proprietary solution. So while it may make a few people a bit sad, there’s really no downside to deleting what was once an essential piece of our technology stack in favor of the thing it had been trying to mimic from the beginning; either die a hero, or live long enough to become tech debt.
