Service Mesh with Linkerd2 | pt. 1

hadican
5 min read · Mar 24, 2023


At Intenseye, we follow emerging and hyped technologies and apply best practices when adopting them. We run thousands of pods on Kubernetes, hosting services written in Scala, Kotlin, Go, Python, and more, and most of them communicate over gRPC.

gRPC is a modern, open-source, high-performance remote procedure call (RPC) framework that uses HTTP/2 for transport. HTTP/2 supports multiplexing many requests over a single TCP connection to reduce the number of round trips. This is where the problem emerges: load balancing.

Once the connection is established, all requests get pinned to a single destination pod, so the load is not balanced. We need an L7-aware (request-level) load balancer instead of an L4 (connection-level) one. You can read the details of the problem here later on.
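
To make the problem concrete, here is a minimal sketch of the setup described above (service, label, and port names are hypothetical): a plain ClusterIP Service in front of gRPC pods, where kube-proxy picks a backend only when a new connection is opened.

```yaml
# Hypothetical gRPC backend behind a regular ClusterIP Service.
apiVersion: v1
kind: Service
metadata:
  name: my-grpc-server
spec:
  selector:
    app: my-grpc-server
  ports:
    - name: grpc
      port: 50051
      targetPort: 50051
# kube-proxy chooses one backend pod when the TCP connection is opened;
# the gRPC client then multiplexes every request over that single HTTP/2
# connection, so the other replicas receive no traffic from this client.
```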

We were seeking a solution to another problem as well: secure transport between microservices. We have tens of components running hundreds of pods in total, and configuring TLS between all of them one by one was intimidating and would have been time-consuming.

We also needed a monitoring system with traffic metrics from all of these components and microservices. We wanted to observe success/failure rates, per-pod RPS, who talks to whom and how frequently, and so on.

There was a single solution to all three of these problems: a service mesh.

What is a Service Mesh?

A service mesh is a tool for adding observability, security, and reliability features to applications by inserting these features at the platform layer rather than the application layer. (source)

The service mesh is typically implemented as a scalable set of network proxies deployed alongside application code, a pattern called a sidecar. These proxies handle the communication between microservices and let you control traffic and gain insights across the system. A service mesh offers great features such as traffic metrics, circuit breaking, mTLS, traffic splitting, retries & timeouts, A/B routing, etc.

source: servicemesh.es
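
In Linkerd2's case, opting into this sidecar pattern is a single annotation: the proxy injector adds a linkerd-proxy container to every pod created in an annotated namespace (or workload). A minimal sketch, with a hypothetical namespace name:

```yaml
# Opting a whole namespace into the mesh; Linkerd's proxy injector will
# add a linkerd-proxy sidecar to every pod created in it from now on.
apiVersion: v1
kind: Namespace
metadata:
  name: my-apps
  annotations:
    linkerd.io/inject: enabled
```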

We started digging into the details of service meshes, evaluating which features mattered to us, how we could benefit from them, and so on. Since a service mesh adds latency and consumes extra resources, these drawbacks had to be measured too. With 1000+ pods, the resource cost would be roughly 1000 × one sidecar. And since we were racing against time, the added latency had to be minimal.
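
Because that per-sidecar cost multiplies across every pod, it helps that Linkerd2 lets you bound the proxy's footprint per workload with the config.linkerd.io/proxy-* annotations. A sketch with made-up values (not our production settings):

```yaml
# Illustrative sidecar resource settings on a workload's pod template;
# the proxy injector reads these annotations when adding the sidecar.
spec:
  template:
    metadata:
      annotations:
        config.linkerd.io/proxy-cpu-request: "50m"
        config.linkerd.io/proxy-memory-request: "32Mi"
        config.linkerd.io/proxy-memory-limit: "128Mi"
```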

After some research and a PoC, we decided on Linkerd2 from among Istio, Consul, and Linkerd2. I must say, servicemesh.es helped us a lot to learn about service meshes and to compare their features.

We chose Linkerd2 over Istio and Consul for three reasons, aside from the features we were seeking (L7 load balancing, mTLS, traffic metrics, etc.):

  • Lightweight (low CPU & memory consumption)
  • Low latency
  • Latency-aware load balancing

Istio has plenty of nice features (thanks to the Envoy proxy), but we did not need all of them. Also, its sidecar proxy's CPU & memory consumption was high compared to Linkerd2's. Consul uses the same Envoy sidecar, so we eliminated it as well. Here is a detailed explanation of why Linkerd2 uses its own proxy instead of Envoy. On top of that, Linkerd2 was very easy to use, while Istio's documentation felt overwhelming.

Linkerd rhymes with “Cardi B”. The “d” is pronounced separately, as in “Linker-DEE”. (source)

Solutions

Problem #1: gRPC Load Balancing

without mesh / with mesh

As you can see in the graph, before the mesh some pods were overloaded like scapegoats while others idled like sloths. After meshing, the load was evenly distributed.
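
The fix itself was simply meshing the workload: the injected proxy terminates the client's HTTP/2 connection and balances each gRPC request across the ready endpoints it discovers. Below is a sketch of the earlier hypothetical server, this time opted in at the workload level:

```yaml
# The hypothetical gRPC server from before, meshed via its pod template.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-grpc-server
spec:
  replicas: 5
  selector:
    matchLabels:
      app: my-grpc-server
  template:
    metadata:
      labels:
        app: my-grpc-server
      annotations:
        linkerd.io/inject: enabled   # adds the linkerd-proxy sidecar
    spec:
      containers:
        - name: server
          image: example.com/my-grpc-server:1.0.0   # hypothetical image
          ports:
            - name: grpc
              containerPort: 50051
# The injected proxy speaks HTTP/2, so it spreads individual gRPC requests
# across all ready pods instead of pinning the whole connection to one.
```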

Problem #2: mTLS

Thanks to the mTLS feature of Linkerd2, we secured internal communication between microservices with a snap of the fingers, Thanos-style. Linkerd2 automatically rotates the proxies' certificates every 24 hours. You can also use cert-manager to rotate the issuer certificate and private key.
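
For the cert-manager part, the idea is to have cert-manager re-issue Linkerd's identity issuer certificate from your trust anchor on a schedule. The sketch below is trimmed from Linkerd's documentation on automatically rotating control plane TLS credentials, and it assumes a cert-manager Issuer named linkerd-trust-anchor already exists in the linkerd namespace:

```yaml
# cert-manager Certificate that keeps the Linkerd identity issuer rotated;
# cert-manager renews it well before its 48h lifetime expires.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: linkerd-identity-issuer
  namespace: linkerd
spec:
  secretName: linkerd-identity-issuer
  duration: 48h
  renewBefore: 25h
  issuerRef:
    name: linkerd-trust-anchor   # Issuer backed by your trust anchor key pair
    kind: Issuer
  commonName: identity.linkerd.cluster.local
  dnsNames:
    - identity.linkerd.cluster.local
  isCA: true
  privateKey:
    algorithm: ECDSA
  usages:
    - cert sign
    - crl sign
    - server auth
    - client auth
```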

Problem #3: Traffic Monitoring

source: linkerd.io

Linkerd2 comes bundled with Prometheus and Grafana, but you can also bring your own instances and configure them by following the official documentation. We did exactly that and pointed Linkerd at our existing instances. Now we have great metrics from every meshed pod and much better observability over the cluster.
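
The key piece when bringing your own Prometheus is a scrape job that targets the admin port of every injected proxy. The snippet below is a simplified sketch of the job described in Linkerd's bring-your-own-Prometheus documentation; the official docs include additional relabeling rules:

```yaml
# Simplified Prometheus scrape job for Linkerd proxy metrics.
scrape_configs:
  - job_name: linkerd-proxy
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only the linkerd-admin port of injected linkerd-proxy containers.
      - source_labels:
          - __meta_kubernetes_pod_container_name
          - __meta_kubernetes_pod_container_port_name
        action: keep
        regex: ^linkerd-proxy;linkerd-admin$
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
```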

Conclusion

Thanks to Linkerd2, we solved our problems and are now living happily ever after. The documentation was very clear, and the getting-started page was easy to follow (plus they have a demo application). Of course, not everything was bright and shiny: we faced a few issues while meshing pods, and some afterwards, but we solved them as well. We even opened an issue on GitHub and got help.

This post is the first part of our service mesh journey, covering what a service mesh is and why we chose Linkerd2. In the second part, we will talk about the problems we faced and how we solved them.
