gRPC at JobTeaser

Emmanuel Delmas
JobTeaser Engineering
Jan 5, 2023

In the life of a software engineer, there are things that are well documented and others that are not. When we started implementing gRPC in 2019, the available documentation was very basic, so I decided to share my painful experience with you.


Working on a monolith that we wanted to migrate to microservices, I was trying to extract our first business service: a kind of crash test to show the way forward. We had set up Kafka synchronisation and gRPC communication. I had clearly reached the limits of my abilities at the time, and with gRPC in particular, the lack of documentation combined with my incompetence led to a gRPC setup between our first services that was less than production-ready.

No benefits from the HTTP/2 protocol

Working in a Rails app, I first created a gRPC channel in an initializer to make it accessible as a global variable through the Rails config.
But gRPC channels do not survive the memory copy that occurs when Unicorn forks the Rails app in production.
To make it work, I decided to create a new channel each time I needed a connection.

The whole point of the HTTP/2 protocol used by gRPC is to keep the TCP connection open and reuse it several times without repeating the handshake.

In the Rails app, the channels are now set up once per Unicorn fork thanks to a proc and exposed through the Rails config, allowing persistent connections.
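
For illustration, here is a minimal sketch of that idea, assuming a Unicorn-forked Rails app and a placeholder my_service_url:50051 endpoint (the real channel arguments are detailed at the end of this post). The proc builds the channel lazily, on its first call inside each worker, so the channel never has to survive a fork:

# config/initializers/grpc.rb
require "grpc"

Rails.application.config.x.my_service_channel = lambda do
  # Memoised per process: each Unicorn worker builds its own channel on first use.
  @my_service_channel ||= ::GRPC::ClientStub.setup_channel(
    nil,
    "dns:my_service_url:50051",
    :this_channel_is_insecure
  )
end

# Anywhere in the app, after the fork:
# channel = Rails.configuration.x.my_service_channel.call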

No load balancing

In Kubernetes, services handle load balancing to the targeted pods (where the server is running). So each time we created a channel, the traffic was sent to a pod selected by the Kubernetes service. But after fixing the previous error, once a long-lived connection was established, it always sent traffic to the same pod. Indeed, with a Kubernetes service, load balancing happens at the time of the TCP handshake.

A gRPC channel can handle connections to multiple instances thanks to client-side load balancing.

We now use headless Kubernetes services to get the IPs of all the pods. The gRPC channel is set up with the round_robin load balancing policy and builds a connection to every pod.

---
apiVersion: "v1"
kind: "Service"
metadata:
  name: "my-service-with-ip"
spec:
  type: "ClusterIP"
  ports:
    ...
  selector:
    ...
---
apiVersion: "v1"
kind: "Service"
metadata:
  name: "my-headless-service"
spec:
  clusterIP: None
  type: "ClusterIP"
  ports:
    ...
  selector:
    ...
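
On the client side, the channel then targets the headless service through the dns scheme and asks for client-side round robin. Here is a small sketch with placeholder names; the full channel setup is shown at the end of this post:

# The dns scheme makes the gRPC resolver query the headless service name,
# which returns every pod IP instead of a single ClusterIP.
# Within the same namespace, the short name "my-headless-service" also resolves.
target = "dns:my-headless-service.my-namespace.svc.cluster.local:50051"

channel_args = {
  # Client-side load balancing across all resolved pod IPs
  "grpc.lb_policy_name" => "round_robin"
}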

No refresh of the DNS resolver

The DNS resolver is the gRPC component that resolves the given URL into IPs before the channel creates its connections. Each time a connection is closed, the DNS resolver runs again to recreate the missing connection. But what happens if a new pod becomes available? Nothing!

You need to close your connection to trigger a DNS resolver refresh. You can also build your own DNS resolver to trigger it whenever you want.

In the setup of our gRPC server, we added a max connection age to close every connection after 5 minutes. This triggers a DNS resolver refresh on the client side, and all the client gRPC channels rebuild connections to all available pods.
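
In Ruby, this boils down to a single server argument, the same one that appears in the full server setup at the end of this post:

server_args = {
  # Close every server-side connection after 5 minutes (value in milliseconds),
  # forcing clients to re-resolve DNS and reconnect to all available pods.
  "grpc.max_connection_age_ms" => 5 * 60 * 1000
}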

The TCP black hole

This one is a bit esoteric and needs to be experienced to be believed.

When a server shuts down, it closes its connections by sending a signal that the connection is going away. I experienced situations where the client didn't acknowledge that the connection was closed and kept sending traffic through it without ever getting an answer. This led either to an infinite wait without any error, or to a DeadlineExceeded error if you had set up a timeout.

The gRPC keepalive ping ensures that bad connections are detected and closed when the ping times out.

I still don't know where these badly closed connections came from. But we set up a keepalive ping sent every 10 seconds (when there is traffic) with a timeout of 1 second, and it cleans up the bad connections.
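
Concretely, these are the two client channel arguments involved; they show up again in the full client setup at the end of this post:

channel_args = {
  # Send a keepalive ping every 10 seconds while there is traffic...
  "grpc.keepalive_time_ms" => 10000,
  # ...and drop the connection if the ping is not answered within 1 second.
  "grpc.keepalive_timeout_ms" => 1000
}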

Conclusion

I would not say that I fully master gRPC communication, but I'm happy with my investigation: digging into the code of the library in Ruby, Go and C, reading the design documents while trying to guess the method interfaces and, especially, the format of the undocumented options. My setup is probably not perfect, but some developers from Algolia presented exactly the same setup in a talk.

The stability of our gRPC communication was becoming a blocker before I implemented all these fixes in 2020. We had a lot of failures in service-to-service communication and recurrent five-minute downtimes on our extracted services. I have to confess that I still see some small communication failures from time to time (maybe due to a badly configured retry policy). But thanks to the new setup, failures now fix themselves automatically.

My next step is to improve my skills in gRPC monitoring. I still feel a bit blind when working on gRPC issues, and gRPC logs are awful to read.

My gRPC clients


# Defining the channel explicitly, then passing it to the stub
channel = ::GRPC::ClientStub.setup_channel(
  nil,
  "dns:my_service_url:50051",
  :this_channel_is_insecure,
  {
    "grpc.lb_policy_name" => "round_robin",
    "grpc.keepalive_time_ms" => 10000,
    "grpc.keepalive_timeout_ms" => 1000
  }
)
MyClass::Stub.new(
  nil,
  nil,
  channel_override: channel,
  interceptors: [MyInterceptor],
  timeout: 3
)

# Or letting the stub create the channel directly
MyClass::Stub.new(
  "dns:my_service_url:50051",
  :this_channel_is_insecure,
  channel_args: {
    "grpc.lb_policy_name" => "round_robin",
    "grpc.keepalive_time_ms" => 10000,
    "grpc.keepalive_timeout_ms" => 1000
  },
  interceptors: [MyInterceptor],
  timeout: 3
)

// Go client: same round robin and keepalive settings as the Ruby client above.
import (
    "net/url"
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/balancer/roundrobin"
    "google.golang.org/grpc/keepalive"
)

opts := []grpc.DialOption{
    grpc.WithInsecure(),
    // Client-side round robin load balancing across all resolved IPs
    grpc.WithBalancerName(roundrobin.Name),
    grpc.WithKeepaliveParams(keepalive.ClientParameters{
        Time:    10 * time.Second, // keepalive ping every 10 seconds
        Timeout: 1 * time.Second,  // close the connection if the ping is not answered within 1 second
    }),
    // Chain custom and third-party interceptors
    grpc.WithChainUnaryInterceptor(myUnaryInterceptor(), UnaryClientInterceptor()),
    grpc.WithChainStreamInterceptor(myStreamInterceptor(), StreamClientInterceptor()),
}
uri := url.URL{
    Scheme: "dns",
    Path:   "myServiceUrl:50051",
}
conn, err := grpc.Dial(uri.String(), opts...)
if err != nil {
    // handle the dial error
}
defer conn.Close()

My gRPC servers

# Ruby server: close every connection after 5 minutes to trigger client DNS refreshes
::GRPC::RpcServer.new(
  interceptors: [MyInterceptor],
  server_args: {
    "grpc.max_connection_age_ms" => 300000 # 5 minutes
  }
)

// Go server: same 5 minute max connection age
import (
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/keepalive"
)

grpcServerOpts := []grpc.ServerOption{
    grpc.KeepaliveParams(keepalive.ServerParameters{
        MaxConnectionAge: 5 * time.Minute,
    }),
    grpc.ChainUnaryInterceptor(myUnaryInterceptor(), UnaryServerInterceptor()),
    grpc.ChainStreamInterceptor(myStreamInterceptor(), StreamServerInterceptor()),
}
grpc.NewServer(grpcServerOpts...)

Documentation
