gRPC at JobTeaser

Emmanuel Delmas
JobTeaser Engineering
Jan 5, 2023

In the life of a software engineer, there are things that are well documented and others that are not. When we started implementing gRPC in 2019, the available documentation was very basic, so I decided to share my painful experience with you.


Working on a monolith that we wanted to migrate to microservices, I was trying to extract our first business service: a kind of crash test to show the way forward. We had set up Kafka synchronisation and gRPC communication. I had clearly reached the limits of my abilities at the time, and with gRPC in particular, the lack of documentation combined with my incompetence led to a gRPC setup between our first services that was less than production-ready.

No benefits from the HTTP/2 protocol

Working in a Rails app, I first created a gRPC channel in an initializer to make it accessible as a global variable through the Rails config.
But gRPC channels do not survive the memory copy that occurs when Unicorn forks the Rails app in production.
To make it work, I decided to create a new channel each time I needed a connection.

The whole point of the HTTP/2 protocol used by gRPC is to keep the TCP connection open and reuse it several times without repeating the handshake.

In the Rails app, the channels are now set up once per Unicorn fork thanks to a proc and exposed through the Rails config, allowing persistent connections.
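
For illustration, here is a minimal sketch of that idea, assuming a Unicorn-forked Rails app and a placeholder my_service_url:50051 endpoint (the real channel arguments are detailed at the end of this post). The proc builds the channel lazily, on its first call inside each worker, so the channel never has to survive a fork:

# config/initializers/grpc.rb
require "grpc"

Rails.application.config.x.my_service_channel = lambda do
  # Memoised per process: each Unicorn worker builds its own channel on first use.
  @my_service_channel ||= ::GRPC::ClientStub.setup_channel(
    nil,
    "dns:my_service_url:50051",
    :this_channel_is_insecure
  )
end

# Anywhere in the app, after the fork:
# channel = Rails.configuration.x.my_service_channel.call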

No load balancing

In Kubernetes, services handle load balancing to the targeted pods (where the server is running). So each time we created a channel, the traffic was sent to a pod selected by the Kubernetes service. But after fixing the previous error, once a long-lived connection was established, it always sent traffic to the same pod. Indeed, with a Kubernetes service, load balancing happens at the time of the TCP handshake.

A gRPC channel can handle connections to multiple instances thanks to client-side load balancing.

We now use headless Kubernetes services to get the IPs of all the pods. The gRPC channel is set up with the round_robin load balancing policy and builds a connection to every pod.

---
apiVersion: "v1"
kind: "Service"
metadata:
  name: "my-service-with-ip"
spec:
  type: "ClusterIP"
  ports:
    ...
  selector:
    ...
---
apiVersion: "v1"
kind: "Service"
metadata:
  name: "my-headless-service"
spec:
  clusterIP: None
  type: "ClusterIP"
  ports:
    ...
  selector:
    ...
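
On the client side, the channel then targets the headless service through the dns scheme and asks for client-side round robin. Here is a small sketch with placeholder names; the full channel setup is shown at the end of this post:

# The dns scheme makes the gRPC resolver query the headless service name,
# which returns every pod IP instead of a single ClusterIP.
# Within the same namespace, the short name "my-headless-service" also resolves.
target = "dns:my-headless-service.my-namespace.svc.cluster.local:50051"

channel_args = {
  # Client-side load balancing across all resolved pod IPs
  "grpc.lb_policy_name" => "round_robin"
}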

No refresh of the DNS resolver

The DNS resolver is the gRPC component that resolves the given URL into IPs before the channel creates its connections. Each time a connection is closed, the DNS resolver runs again to recreate the missing connection. But what happens if a new pod becomes available? Nothing!

You need to close your connection to trigger a DNS resolver refresh. You can also build your own DNS resolver to trigger it whenever you want.

In the setup of our gRPC server, we added a max connection age to close every connection after 5 minutes. This triggers a DNS resolver refresh on the client side, and all the client gRPC channels rebuild connections to all available pods.
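
In Ruby, this boils down to a single server argument, the same one that appears in the full server setup at the end of this post:

server_args = {
  # Close every server-side connection after 5 minutes (value in milliseconds),
  # forcing clients to re-resolve DNS and reconnect to all available pods.
  "grpc.max_connection_age_ms" => 5 * 60 * 1000
}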

The TCP black hole

This one is a bit esoteric and needs to be experienced to be believed.

When a server shuts down, it closes its connections by sending a signal that the connection is going away. I experienced situations where the client didn't acknowledge that the connection was closed and kept sending traffic through it without ever getting an answer. This led either to an infinite wait without any error, or to a DeadlineExceeded error if you had set up a timeout.

The gRPC keepalive ping ensures that bad connections are detected and closed when the ping times out.

I still don't know where these badly closed connections came from. But we set up a keepalive ping sent every 10 seconds (when there is traffic) with a timeout of 1 second, and it cleans up the bad connections.
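
Concretely, these are the two client channel arguments involved; they show up again in the full client setup at the end of this post:

channel_args = {
  # Send a keepalive ping every 10 seconds while there is traffic...
  "grpc.keepalive_time_ms" => 10000,
  # ...and drop the connection if the ping is not answered within 1 second.
  "grpc.keepalive_timeout_ms" => 1000
}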

Conclusion

I would not say that I fully master gRPC communication, but I'm happy with my investigation: digging into the code of the library in Ruby, Go and C, reading the design documents while trying to guess the method interfaces and, especially, the format of the undocumented options. My setup is probably not perfect, but some developers from Algolia presented exactly the same setup in a talk.

The stability of our gRPC communication was becoming a blocker before I implemented all these fixes in 2020. We had a lot of failures in service-to-service communication and recurrent five-minute downtimes on our extracted services. I have to confess that I still see some small communication failures from time to time (maybe due to a badly configured retry policy). But thanks to the new setup, failures now fix themselves automatically.

My next step is to improve my skills in gRPC monitoring. I still feel a bit blind when working on gRPC issues, and gRPC logs are awful to read.

My gRPC clients


# Defining the channel explicitly, then passing it to the stub
channel = ::GRPC::ClientStub.setup_channel(
  nil,
  "dns:my_service_url:50051",
  :this_channel_is_insecure,
  {
    "grpc.lb_policy_name" => "round_robin",
    "grpc.keepalive_time_ms" => 10000,
    "grpc.keepalive_timeout_ms" => 1000
  }
)
MyClass::Stub.new(
  nil,
  nil,
  channel_override: channel,
  interceptors: [MyInterceptor],
  timeout: 3
)

# Or letting the stub create the channel directly
MyClass::Stub.new(
  "dns:my_service_url:50051",
  :this_channel_is_insecure,
  channel_args: {
    "grpc.lb_policy_name" => "round_robin",
    "grpc.keepalive_time_ms" => 10000,
    "grpc.keepalive_timeout_ms" => 1000
  },
  interceptors: [MyInterceptor],
  timeout: 3
)

// Go client: same round robin and keepalive settings as the Ruby client above.
import (
    "net/url"
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/balancer/roundrobin"
    "google.golang.org/grpc/keepalive"
)

opts := []grpc.DialOption{
    grpc.WithInsecure(),
    // Client-side round robin load balancing across all resolved IPs
    grpc.WithBalancerName(roundrobin.Name),
    grpc.WithKeepaliveParams(keepalive.ClientParameters{
        Time:    10 * time.Second, // keepalive ping every 10 seconds
        Timeout: 1 * time.Second,  // close the connection if the ping is not answered within 1 second
    }),
    // Chain custom and third-party interceptors
    grpc.WithChainUnaryInterceptor(myUnaryInterceptor(), UnaryClientInterceptor()),
    grpc.WithChainStreamInterceptor(myStreamInterceptor(), StreamClientInterceptor()),
}
uri := url.URL{
    Scheme: "dns",
    Path:   "myServiceUrl:50051",
}
conn, err := grpc.Dial(uri.String(), opts...)
if err != nil {
    // handle the dial error
}
defer conn.Close()

My gRPC servers

# Ruby server: close every connection after 5 minutes to trigger client DNS refreshes
::GRPC::RpcServer.new(
  interceptors: [MyInterceptor],
  server_args: {
    "grpc.max_connection_age_ms" => 300000 # 5 minutes
  }
)

// Go server: same 5 minute max connection age
import (
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/keepalive"
)

grpcServerOpts := []grpc.ServerOption{
    grpc.KeepaliveParams(keepalive.ServerParameters{
        MaxConnectionAge: 5 * time.Minute,
    }),
    grpc.ChainUnaryInterceptor(myUnaryInterceptor(), UnaryServerInterceptor()),
    grpc.ChainStreamInterceptor(myStreamInterceptor(), StreamServerInterceptor()),
}
grpc.NewServer(grpcServerOpts...)

Documentation
