Challenges of running gRPC services in production
There are several ways to make services communicate, and they generally involve a transport layer. Our applications often rely on that layer to provide abstractions and features such as load balancing, retries, and high availability.
However, when running a service in production, we get more network-related errors than we’d like. This post intends to show how we mitigated these errors while using gRPC for service-to-service communication.
Why gRPC?
Back in 2016, almost every service at Incognia used the HTTP/1.1/JSON stack for communication. It worked well for a long time, but, as the company grew, some high-traffic services started requiring a more efficient way of communicating with internal clients.
Documentation of JSON APIs was also cumbersome to maintain, as it was not bound to the code itself, meaning that someone could deploy code that changed the API without updating the documentation accordingly.
In the search for a good alternative, we looked into gRPC, which solves the performance and schema definition issues described above, with the following features:
- The API surface is defined directly in the protobuf files, where each method describes its own request/response types
- Both client and server code can be auto-generated in many languages
- It uses HTTP/2 in combination with Protobuf, both binary protocols, resulting in more compact request/response payloads
- HTTP/2 also uses persistent connections, removing the need to constantly open and close connections, as HTTP/1.1 does
But running gRPC services also presented some challenges, mostly because HTTP/2 uses persistent connections.
Challenges of gRPC in production
We are heavy users of Kubernetes, and as such, our gRPC services run on Kubernetes clusters on Amazon EKS.
One of the challenges we faced was ensuring load balancing on our servers. As the number of servers changes dynamically due to autoscaling, the clients must be able to make use of the new servers and drop connections to the ones that are no longer available, while ensuring that requests are well balanced across them by following some load balancing policy.
Load balancing
There are some solutions for this problem, as stated in the gRPC blog, including proxy load balancing and client-side load balancing. In the following sections, we explain the approaches we implemented, in chronological order.
Approach 1: Proxy Load Balancer with Linkerd 1.x
The first approach we implemented used a proxy load balancer, namely Linkerd 1.x, as Figure 1 shows. This solution worked well for some time, solving the load balancing issue from the server perspective, but the client-to-proxy load remained unbalanced, meaning that some Linkerd instances handled a larger number of requests than others.
Unbalanced traffic on the client-to-proxy link later proved to be problematic. Overloaded proxy instances could add too much latency or even run out of memory, becoming increasingly hard to manage.
In addition, this solution added considerable overhead, as it requires an additional network hop, and consumed a significant amount of resources in our Kubernetes cluster, since we deployed Linkerd as a DaemonSet, meaning a Linkerd pod runs on every worker node in the cluster.
Approach 2: Thick gRPC client
To tackle the issues of the first approach, we eliminated the proxy layer and moved the responsibility for load balancing into the client code, which we own.
To handle load balancing in the clients, we used grpc-go’s naming.NewDNSResolverWithFreq(time.Duration) in combination with Kubernetes’ headless Services (to handle discovery of the server pods). In this solution, the clients refresh the pool of hosts they can connect to by polling the target service’s DNS records every few seconds.
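For reference, here is a minimal sketch of what such a headless Service could look like (the my-grpc-server name and port are hypothetical); setting clusterIP to None makes a DNS query for the Service return the individual pod IPs instead of a single virtual IP:

```yaml
# Sketch of a headless Service for the gRPC servers; the "my-grpc-server"
# name and port are hypothetical. With clusterIP set to None, a DNS query
# for the Service name returns the individual pod IPs, which is what the
# client-side resolver polls.
apiVersion: v1
kind: Service
metadata:
  name: my-grpc-server
spec:
  clusterIP: None          # headless: expose pod IPs directly via DNS
  selector:
    app: my-grpc-server    # matches the gRPC server pods
  ports:
    - name: grpc
      port: 50051
      targetPort: 50051
```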
With this setup, clients connected directly to the server pods, which reduced our latency compared to the proxy load balancer approach. The following diagram shows the components involved in this approach.
However, dynamic service discovery using DNS is being deprecated by the Go gRPC implementation in favor of other protocols such as xDS. Not only that: in other languages, it had never been implemented in the first place.
We learned that, although this approach gave us stable and highly performant communication, relying on client-side implementations can be brittle and hard to manage due to the diversity of gRPC implementations. This point holds true for other features as well, such as rate limiting and authorization.
After trying these different approaches, we identified that we needed a generic, low-overhead, language-agnostic way to enable service discovery and load balancing.
Approach 3: Sidecar proxy with Envoy
After some research on the topic, we chose the sidecar pattern: another container is added to the client pod to handle service discovery and load balancing, and to provide some observability into our connections. We chose Envoy for its high performance and deployment simplicity.
In this approach, the client containers connect to the Envoy sidecar, which maintains connections to the target service.
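For illustration, here is a minimal sketch of what the sidecar’s Envoy configuration could look like. This is an assumption for clarity rather than our exact production config: the service name (my-grpc-server), hostname (my-grpc-server.example.internal), and ports are hypothetical.

```yaml
# Illustrative Envoy sidecar configuration (v3 API), not the exact production
# config: the application talks to 127.0.0.1:8080, and Envoy load-balances
# HTTP/2 (gRPC) traffic across the endpoints resolved via DNS.
static_resources:
  listeners:
    - name: grpc_egress
      address:
        socket_address: { address: 127.0.0.1, port_value: 8080 }
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: grpc_egress
                http_filters:
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
                route_config:
                  name: local_route
                  virtual_hosts:
                    - name: my-grpc-server
                      domains: ["*"]
                      routes:
                        - match: { prefix: "/" }
                          route: { cluster: my-grpc-server }
  clusters:
    - name: my-grpc-server
      type: STRICT_DNS              # resolve all records behind the DNS name
      respect_dns_ttl: true         # re-resolve based on the record's TTL
      lb_policy: ROUND_ROBIN
      typed_extension_protocol_options:
        envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
          "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
          explicit_http_config:
            http2_protocol_options: {}   # gRPC requires HTTP/2 to the upstream
      load_assignment:
        cluster_name: my-grpc-server
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: my-grpc-server.example.internal  # hypothetical DNS name
                      port_value: 50051
```

With STRICT_DNS, Envoy resolves every address returned for the configured hostname and balances requests across them, and respect_dns_ttl ties re-resolution to the record’s TTL, which becomes relevant in the graceful shutdown discussion below.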
Using this approach, we got what we were seeking:
- Low latency, as Envoy’s overhead is minimal when compared to Linkerd 1.x
- No additional code in the clients
- Observability, as Envoy exports metrics in Prometheus format
- Ability to enrich the network layer, as Envoy supports features like authorization and rate limiting
Service discovery and graceful shutdown
With proper load balancing configured, we still need a way for Envoy to discover new targets and update its pool of hosts.
There are a couple of options for service discovery with Envoy, such as DNS and EDS (based on xDS). For the sake of simplicity and familiarity, we chose to use DNS.
Integrating DNS service discovery with Kubernetes is quite straightforward: we use external-dns, which lets us specify the hostname and DNS TTL directly on our Kubernetes Service, as follows:
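Below is a sketch of such a Service, building on the headless Service shown earlier; the hostname and TTL values are illustrative, and the two external-dns annotations are what control the published record:

```yaml
# Sketch of the target Service annotated for external-dns; the hostname and
# TTL below are illustrative. For a headless Service, external-dns publishes
# a DNS record per pod IP under the given hostname, with the given TTL.
apiVersion: v1
kind: Service
metadata:
  name: my-grpc-server
  annotations:
    external-dns.alpha.kubernetes.io/hostname: my-grpc-server.example.internal
    external-dns.alpha.kubernetes.io/ttl: "5"   # DNS TTL, in seconds
spec:
  clusterIP: None
  selector:
    app: my-grpc-server
  ports:
    - name: grpc
      port: 50051
      targetPort: 50051
```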
A hidden complexity of using DNS as our service discovery mechanism is that changes take some time to propagate, so we need to give our gRPC clients some leeway to update their host lists before a terminating backend actually stops receiving connections. This makes the graceful shutdown flow a bit trickier: DNS records have a TTL associated with them, and Envoy caches the resolved hosts for that period.
The following diagram shows a basic flow which ends with a failed request:
In this scenario, the second client request fails, as the server pod was no longer available, while the Envoy cache still had its IP.
To solve this issue, we must look at how Kubernetes handles pod termination, which is described in detail here. It consists of two steps that run at the same time: the pod is removed from the Kubernetes Service endpoints (in our case, this also makes external-dns remove the pod’s IP from the DNS records), and the containers are sent a TERM signal, starting the graceful shutdown process.
To solve the terminating host issue, we used Kubernetes’ preStop hooks to delay the TERM signal sent to the containers, as follows:
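Here is a sketch of what this looks like on the server’s pod spec; the image name, sleep duration, and grace period are illustrative, and the sleep just needs to exceed the DNS TTL plus propagation time:

```yaml
# Sketch of the server pod spec with a preStop hook. The sleep delays the
# TERM signal so external-dns and Envoy's DNS cache have time to converge;
# terminationGracePeriodSeconds must leave room for the sleep plus the
# application's own shutdown.
apiVersion: v1
kind: Pod
metadata:
  name: my-grpc-server
spec:
  terminationGracePeriodSeconds: 60
  containers:
    - name: my-grpc-server
      image: my-grpc-server:latest     # illustrative image name
      ports:
        - containerPort: 50051
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 30"]   # delay the TERM signal
```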
With the preStop hook configured, our flow now looks like the following:
With this solution, we give Envoy enough time for its DNS cache to expire and for a new DNS lookup to happen, one that no longer includes the terminating pod’s IP.
Future Improvements
While using Envoy brought us a lot of performance improvements and overall simplicity, DNS service discovery is still not ideal. It is not as robust, since it is based on polling: the clients are responsible for refreshing their pool of hosts when the TTL expires.
A more robust way is to use Envoy’s EDS, which is more flexible, adding capabilities such as canary deployments and more sophisticated load balancing strategies, but we still need some time to evaluate this approach and validate it in a production environment.