gRPC load balancing — Service Meshes

Gökçe Sürenkök · Published in hepsiburadatech · Dec 15, 2020

Cover image is taken from [4]

In this article, I will explain why L7 proxies (service meshes) are a must when your microservices communicate with each other over gRPC.

The Emergence of the Problem

As the Search and Navigation Team of Hepsiburada (aka Mordor), we have been splitting our main service, Mordor API, into microservices. The team preferred gRPC between services for better performance, since payloads are compact binary messages instead of JSON.

However, things got interesting after we deployed and ran load tests to see how much performance we had gained. What we observed was far worse than we expected, so we started to investigate what was causing this drop in performance.

The Moment of Enlightenment

Before digging into the cause and walking through the showcase, I would like to cover some basics.

How Kubernetes does load balancing

As we know, Kubernetes Services are routed by kube-proxy, which works in three different modes: userspace, iptables, or IPVS. We are using kube-proxy in IPVS mode, which is an L4 load balancer built into the Linux kernel. It works at the connection level, distributing TCP connections across endpoints. Since concurrent HTTP/1.1 calls are sent on different connections, this works well with HTTP/1.1.
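To make that connection-level behaviour concrete, here is a minimal Go sketch of concurrent HTTP/1.1 requests (the service URL is hypothetical): every in-flight request that cannot reuse an idle connection opens its own TCP connection, and an L4 balancer such as IPVS can spread those connections across pods.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
)

func main() {
	// Hypothetical in-cluster HTTP/1.1 service, reached through a normal
	// ClusterIP Service handled by kube-proxy/IPVS.
	const url = "http://some-http-service.default.svc.cluster.local/healthz"

	client := &http.Client{} // default transport: HTTP/1.1 with keep-alive

	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			// A concurrent request that finds no idle connection opens a new
			// TCP connection; IPVS assigns each new connection to some pod,
			// so concurrent HTTP/1.1 traffic spreads across endpoints.
			resp, err := client.Get(url)
			if err != nil {
				fmt.Printf("request %d failed: %v\n", n, err)
				return
			}
			resp.Body.Close()
			fmt.Printf("request %d: %s\n", n, resp.Status)
		}(i)
	}
	wg.Wait()
}
```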

However, this does not work with gRPC, because gRPC uses HTTP/2, which multiplexes multiple calls over a single long-lived TCP connection. All gRPC calls over that connection go to the same endpoint.
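In contrast, a gRPC client keeps one HTTP/2 connection and multiplexes every call on it. Here is a minimal grpc-go sketch of that behaviour; the target address is hypothetical, and the standard gRPC health-checking service stands in for a real client stub (it only works if the target actually serves that API).

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	// One ClientConn means, by default, one HTTP/2 connection to whichever
	// endpoint kube-proxy picked when the TCP connection was established.
	conn, err := grpc.Dial(
		"mlserving.default.svc.cluster.local:50051", // hypothetical service address
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()

	client := healthpb.NewHealthClient(conn)

	// Each call below is only a new HTTP/2 stream multiplexed on the SAME TCP
	// connection, so an L4 balancer never sees them as separate units of work:
	// they all land on the single pod behind that connection.
	for i := 0; i < 10; i++ {
		ctx, cancel := context.WithTimeout(context.Background(), time.Second)
		resp, err := client.Check(ctx, &healthpb.HealthCheckRequest{})
		cancel()
		if err != nil {
			log.Printf("call %d failed: %v", i, err)
			continue
		}
		fmt.Printf("call %d: %s\n", i, resp.GetStatus())
	}
}
```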

And we learned this the hard way while testing. Performance was terrible because all the gRPC calls were sent to a single pod, whose resource consumption was significantly higher than that of the other pods of the service.

Service Meshes

As explained above, it was not possible to distribute the load using kube-proxy. This is where you need L7 proxies, which are aware of the connection's protocol and can balance individual requests instead of whole connections.

There are many options you can use; the most well-known ones are Istio (Envoy proxy) and Linkerd.

In our case, we chose Linkerd. I will not explain in detail why we chose Linkerd over Istio: even though Istio has many features, we did not need all of them, and Linkerd performed better since it is optimized for the service mesh sidecar use case.

For curious readers, [2] explains why Linkerd is built on linkerd2-proxy instead of Envoy.
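For completeness, [3] also covers the do-it-yourself alternative to a mesh: making each gRPC client protocol-aware itself, by resolving a headless Service through DNS and round-robining calls across the returned pod IPs. A minimal grpc-go sketch, assuming a hypothetical headless Service named mlserving-headless on port 50051:

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	// The dns:/// scheme makes grpc-go resolve all A records behind the name;
	// with a headless Service, that is the full list of pod IPs. The
	// round_robin policy then balances individual calls across them.
	conn, err := grpc.Dial(
		"dns:///mlserving-headless.default.svc.cluster.local:50051", // hypothetical headless Service
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
	)
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()

	client := healthpb.NewHealthClient(conn)
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	if _, err := client.Check(ctx, &healthpb.HealthCheckRequest{}); err != nil {
		log.Printf("call failed: %v", err)
	}
}
```

This works, but it pushes balancing and service-discovery concerns into every client, which is exactly the per-service plumbing a mesh sidecar takes off your hands.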

Showcase

We have stated the problem and its root cause. Let's see the actual results after injecting linkerd-proxy into our services.

Before Injection

I have chosen tolkien-API for this demonstration. It sends gRPC calls to our ML model serving API.

In this configuration, we have 2 uninjected tolkien-API pods and 3 injected mlserving API pods. I ran a load test that hits the mlserving API directly. The result is the following:

Figure: mlserving API RPS per pod, before injection

We observed that, since there was no L7 proxy injected into tolkien-API, it did not recognize the other pods, and all the requests were sent to a single pod.

After Injection

Now, we rerun the same load test after injecting linkerd-proxy into the tolkien-API pods.

Figure: mlserving API RPS per pod, after injection

After the injection, we can clearly see that the mlserving API pods are receiving the gRPC calls evenly.

Since I ran this showcase in our QA environment and generated the load from my local PC, total requests per second are similar in both cases. In our production environment, however, under heavy load and with linkerd-proxy injected, total throughput increased and latency decreased.

Conclusion

I believe this is a good demonstration of why L7 proxies are a must when you consider gRPC for your microservices. It is still an ongoing journey for our team, and I am sure there will be plenty of problems to solve ahead of us.

DevOps

If you enjoyed this article and want to be a part of this journey, we are hiring DevOps Engineers.

Feel free to email me: gokce.surenkok@hepsiburada.com

References

[1] — https://docs.microsoft.com/en-us/aspnet/core/grpc/performance?view=aspnetcore-5.0

[2] — https://linkerd.io/2020/12/03/why-linkerd-doesnt-use-envoy/

[3] — https://linkerd.io/2018/11/14/grpc-load-balancing-on-kubernetes-without-tears/

[4] — https://hackernoon.com/reviewing-grpc-on-kubernetes-8a705b928abd
