Service Mesh architectural patterns

Alex Burnos · Published in The Startup · Jun 25, 2019 · 7 min read


In the first article, I described what Service Mesh is through the prism of the problems it came to solve. This article is effectively part two, reviewing common architectural patterns for how exactly a Service Mesh can be implemented to give your distributed deployment service discovery, policy and security management, and observability.

If the last paragraph did not make much sense, I highly recommend reading Service Mesh Explained in Plain English first.

Service mesh without Service Mesh

The most straightforward approach is to make Service Mesh problems an application concern. By programming the necessary logic directly into each service, the service becomes able to discover other services, encrypt traffic, emit metrics, and so on.

An application can implement Service Mesh logic within it

You can optimize a bit and make this logic a library, so it is reusable across many different services.
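To make the idea concrete, here is a minimal sketch of what such a library could look like in Go. Everything in it is illustrative: the Resolve helper, the ".internal" naming scheme, and the metrics stand-in are assumptions for the sketch, not any particular mesh's API.

```go
// A minimal sketch of an in-application "mesh" library: discovery,
// encryption, and metrics live inside the service's own process.
package meshclient

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"
)

// Resolve maps a logical service name to a concrete address.
// A real library would query DNS, Consul, etcd, and so on.
func Resolve(service string) (string, error) {
	return service + ".internal:443", nil // placeholder discovery
}

// Call performs a request with discovery, TLS, and a latency metric,
// so every service that links this library gets mesh behavior.
func Call(service, path string) (*http.Response, error) {
	addr, err := Resolve(service)
	if err != nil {
		return nil, err
	}
	client := &http.Client{
		Transport: &http.Transport{
			// Encrypt traffic at the source (end-to-end).
			TLSClientConfig: &tls.Config{MinVersion: tls.VersionTLS12},
		},
	}
	start := time.Now()
	resp, err := client.Get(fmt.Sprintf("https://%s%s", addr, path))
	recordLatency(service, time.Since(start)) // emit a metric
	return resp, err
}

func recordLatency(service string, d time.Duration) {
	fmt.Printf("request to %s took %v\n", service, d) // stand-in for a metrics client
}
```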

However, now your application has to solve problems that are unrelated to its core logic and are not transparent to your developers (read: developers need to know and care about infrastructure problems rather than their application). If you have a heterogeneous environment (many languages, many technologies), you will need to reimplement this logic for every language and framework in your mesh. You can see how this does not scale well.

The in-client approach gives you the most flexibility and, if done right, potentially the best performance. Can we do better, though, and at least partially shield our developers from Service Mesh problems? They already have too much to worry about.

Gateways, proxies, and thin clients

Centralizing with a gateway

If we do not want Service Mesh related code to live in our applications, the first thing that comes to mind is to centralize it in one place. We can implement a service that acts as a centralized gateway for the traffic between applications. Each application only needs to know how to send traffic to this gateway, and the rest is taken care of from there. We shall name it an "API Gateway."

API Gateway: the simplest way to solve the problem

This pattern is simple to implement: you centralize all the code in one place and relieve developers of infrastructure concerns. Just send your traffic here, bud!
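For illustration, here is a minimal sketch of such a gateway in Go, built on the standard library's reverse proxy. The routing table and backend addresses are hypothetical placeholders; a real gateway would populate them from a discovery system.

```go
// A minimal sketch of an API Gateway: one central process that routes
// traffic between services and is the single place to add auth,
// metrics, retries, and TLS toward the backends.
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"
)

func main() {
	// Logical service name -> backend address (stand-in for discovery).
	routes := map[string]string{
		"orders":  "http://orders.internal:8080",
		"billing": "http://billing.internal:8080",
	}

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// Route /orders/... to the orders backend, and so on.
		name := strings.SplitN(strings.TrimPrefix(r.URL.Path, "/"), "/", 2)[0]
		backend, ok := routes[name]
		if !ok {
			http.NotFound(w, r)
			return
		}
		target, _ := url.Parse(backend) // static strings, parse cannot fail here
		httputil.NewSingleHostReverseProxy(target).ServeHTTP(w, r)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```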

The simplicity comes at a cost. It should not come as a surprise that we have just introduced a single point of failure for all of our services. We need to take care of high availability of the gateway, as well as horizontal scaling as demand grows. You cannot do end-to-end encryption (since you do not control the traffic until it reaches the gateway, you cannot encrypt it early enough, so by definition it is not "end to end"). It is also hard to take reliable measurements, since the gateway only gets visibility into traffic once, and if, it reaches it (was there no request because the application did not send it, or because the network failed?).

We can try to improve the situation a bit by deploying smaller gateways in front of each group of microservices.

Sharding gateway deployments can help with scalability and SPoF

This approach breaks the centralized gateway's scalability concern into several smaller ones, allowing better scale, slightly better measurements, and encryption at least between gateways. However, it still has the same class of issues: encryption is not end-to-end, observability exists only at the gateway level, and you still have centralized failure domains, albeit smaller ones.

What would be the next step from here? A sidecar proxy.

Sidecar proxies — what are you?

You have decided that implementing Service Mesh logic in the application itself does not cut it for you, and the performance characteristics of a gateway rule it out as well. What's the solution?

Sidecar proxies are one of the best compromises existing today

One way to achieve this is to deploy a proxy application as a separate binary or process on the same host (or in the same Kubernetes pod) where your service is running. We will call such a proxy a sidecar proxy, precisely because the operator deploys it as a sidecar to the actual service. All your code has to do is send traffic as if nothing has changed. The proxy transparently intercepts the traffic, and the magic happens. The sidecar proxy implements all of the Service Mesh smartness: it knows where and how to send traffic, encrypts it if needed, emits metrics, and much more, without your service even guessing these activities have happened. Sweet, isn't it?
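As an illustration, here is a heavily simplified sidecar in Go: a local listener that forwards the application's plain traffic upstream, which is where per-request mesh logic would live. The addresses are placeholders, and a production sidecar (Envoy, for example) intercepts traffic via iptables or eBPF rather than a well-known port.

```go
// A minimal sketch of a sidecar proxy: the service sends plain traffic
// to localhost, and the sidecar forwards it upstream, adding mesh
// behavior (mTLS, metrics, retries) on the way.
package main

import (
	"io"
	"log"
	"net"
	"time"
)

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:15001") // sidecar listens next to the app
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			continue
		}
		go forward(conn, "orders.internal:8080") // hypothetical upstream
	}
}

func forward(client net.Conn, upstreamAddr string) {
	defer client.Close()
	start := time.Now()
	upstream, err := net.Dial("tcp", upstreamAddr) // a real sidecar would use mTLS here
	if err != nil {
		log.Printf("upstream dial failed: %v", err) // errors are visible at the source
		return
	}
	defer upstream.Close()
	go io.Copy(upstream, client) // pipe request bytes
	io.Copy(client, upstream)    // pipe response bytes
	log.Printf("connection to %s served in %v", upstreamAddr, time.Since(start))
}
```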

Reliability is pretty good too, since your failure domain is a single sidecar proxy, and each instance of a service gets one: if one proxy goes down, all you lose is one instance of the service.

Downsides? On top of adding complexity to the service deployment process, you also add an extra processing hop for your requests. And since sidecar proxies are usually deployed on both the sending and the receiving side, every request pays for two additional processing steps.

Even deeper integration

There are two architectural patterns that deserve an honorable mention for completeness:

Integration through the networking layer. If only you could embed Service Mesh logic directly into the networking layer of your host (the part of the system that "natively" handles your traffic), you could get everything: transparency to the application, the performance of native networking, and no additional failure domains. It turns out you can, if you program it through eBPF; projects like Cilium take this approach. The downside is the limit on how much functionality you can implement this way, since your code must be small and very efficient.

Integrated thin clients. If you are still considering implementing Service Mesh functionality in your application but want to reduce the effort and computational overhead, thin clients could be the answer. Practically, this means you integrate only a small portion of the Service Mesh code (a client library), while all heavyweight computation lives outside the application in another layer called the "control plane." Look at gRPC-LB as one example, sketched below.
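As a rough sketch of the thin-client shape, here is a gRPC client in Go that delegates discovery to a resolver and keeps only a balancing policy in-process. The target name is hypothetical, and the round_robin policy stands in for gRPC-LB proper, which delegates balancing decisions to a separate look-aside balancer service.

```go
// A minimal sketch of the thin-client idea with gRPC in Go: the client
// embeds only a resolver and a balancing policy; endpoint lists and
// health come from outside the application.
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// The DNS resolver discovers backends; the policy spreads load.
	conn, err := grpc.Dial(
		"dns:///orders.internal:50051", // hypothetical service name
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	// conn is now ready to back service stubs against a balanced set of
	// endpoints, with almost no mesh code inside the application itself.
}
```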

Selecting an architecture

One thing you have to pay attention to when choosing a Service Mesh architecture is how soon it gains visibility into the traffic after a service sends a request. Naturally, the sooner we take control over the traffic, the more precise and reliable the mechanisms to manipulate it can be.

For example, the in-client and sidecar proxy architectures manipulate requests as soon as the application produces them. Benefits? You get things like high-fidelity end-to-end observability and encryption. For encryption, end-to-end means that you can encrypt traffic the moment your service originates it and be sure it does not get decrypted until it reaches its destination. For observability, it means you can see things like the real latency of a request, i.e., the time that passed from when it originated at the source until it reached its destination, and detect request errors exactly when they happen.

Some architectures provide real end-to-end encryption and observability
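Here is a small sketch of what "encrypt at the source" means in practice, assuming a mesh-issued client certificate and CA; all file names and the peer address are placeholders. The key point is that the TLS handshake happens inside the service process, before any byte leaves it.

```go
// A minimal sketch of end-to-end encryption: the TLS session starts
// inside the service process and terminates only at the destination,
// so no intermediate hop ever sees plaintext.
package main

import (
	"crypto/tls"
	"crypto/x509"
	"log"
	"os"
)

func main() {
	cert, err := tls.LoadX509KeyPair("client.crt", "client.key") // this service's identity
	if err != nil {
		log.Fatal(err)
	}
	caPEM, err := os.ReadFile("mesh-ca.crt") // CA that signs all workload certs
	if err != nil {
		log.Fatal(err)
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caPEM)

	// The handshake happens here, before traffic leaves the process.
	conn, err := tls.Dial("tcp", "billing.internal:8443", &tls.Config{
		Certificates: []tls.Certificate{cert}, // mutual TLS: prove who we are
		RootCAs:      pool,                    // verify who they are
	})
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}
```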

Compare this with another design, where you can only control requests between services at some interim point.

On top of the latency of an extra network hop, since you do not control the traffic when it originates at service A, the best you can do is ask service A to always establish a TLS connection to the gateway and then encrypt the traffic again on its way out of the gateway; but trust is already compromised at that point.

End-to-end encryption breaks with middle proxies

It is even worse for observability, since the service mesh has no idea what happened to the request when it originated at service A. If there are errors, they are invisible. High latency at the host itself? You will not see it either. Only when and if traffic reaches our middle proxy can we take measurements, and they only partially reflect the real state of the world.

Does this mean the sidecar proxy model is good and the API gateway is terrible? Of course not; it just depends on which performance and operational parameters you are optimizing for.

Looking for the lowest latency possible? Consider in-application clients. Need end-to-end encryption and the best possible observability, while you cannot touch the code? A sidecar proxy is what you are after. Is neither of these as important as ease of deployment? An API Gateway gives most of the benefits while placing zero burden on the deployment process of your applications.

In conclusion

In their desire to offload the burden of service communication complexity from application developers to the infrastructure, Service Mesh creators came up with multiple architectures that pursue similar objectives but achieve them in different ways. When selecting (or designing) your Service Mesh solution, it is essential to understand the benefits and tradeoffs that come with each of them: from the laborious development of a low-latency service mesh library to merely plugging a gateway in between services.

At the end of the day, the business will dictate the performance characteristics of your applications, which in turn will suggest what kind of Service Mesh works best for you. Choose wisely.
