As part of my learning in the DevOps space, I started exploring service mesh and recently did a podcast on it. Here is the blog version, which I think will be useful for others who want to get the big picture.
This post assumes you are aware of how kubernetes works at a high level. If not, you could read this post first.
Why Service Mesh?
So, our kubernetes world with pods, deployments, services and ingress controllers is buzzing with activity, and we are happy with our newly gained power to deploy our services easily at will and scale them as we see fit.
Then why should we care about another concept called “Service Mesh”, and what does it bring to the table? To answer that, let us think of a scenario where our cluster has grown to have many more services, they call each other to fulfill a service request (east-west traffic), and their inter-dependencies would put spaghetti to shame.
All the kubernetes constructs like Pod, Deployment, Service and Secret help us create and deploy our payload (applications) but do very little to observe, manage or control the flow of traffic between them. This is exactly where a service mesh helps.
Before we get into the details of how a service mesh does this, let us quickly go through some of the important kubernetes concepts that it relies on.
Sidecar Proxy
As the name suggests, this is a proxy server that runs as a sidecar to the main container, intercepting the network traffic and sometimes modifying it (as in the case of mTLS). This decouples the application container (the developer's concern) from other responsibilities like securing the traffic and reporting it to monitoring servers (the operations team's concern). The application container can be deployed independently without worrying about the proxy.
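To make the idea concrete, here is a simplified sketch of what a pod with an injected sidecar looks like. The container names and images below are purely illustrative; a real mesh injects its own proxy (and usually an init container that redirects the pod's traffic through it) automatically:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: svc-a
spec:
  containers:
  - name: app                 # the application container - the developer's concern
    image: example/svc-a:1.0  # hypothetical image
    ports:
    - containerPort: 8080
  - name: sidecar-proxy       # injected by the mesh; all pod traffic flows through it
    image: example/proxy:1.0  # hypothetical image
```

Both containers share the pod's network namespace, which is what lets the proxy intercept traffic without the application knowing about it.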
Custom Resource Definition (CRD)
Kubernetes provides the ability to define your own resources (similar to Service, Pod, Deployment, ReplicaSet, etc.), and you can use kubectl commands to interact with these resources like any built-in kubernetes resource. Almost all service mesh implementations create their own custom resource definitions to manage and control the components they create.
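As a sketch, a minimal CRD looks like this (the group and resource names here are made up for illustration):

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: trafficpolicies.mesh.example.com  # must be <plural>.<group>
spec:
  group: mesh.example.com                 # hypothetical API group
  scope: Namespaced
  names:
    plural: trafficpolicies
    singular: trafficpolicy
    kind: TrafficPolicy
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object                      # a real CRD would define fields here
```

Once applied, `kubectl get trafficpolicies` works just like it does for built-in resources.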
Operator
Operators in kubernetes are more like automation bots. Repetitive tasks that are usually handled by the operations team can be converted into operators. In that sense, even a Deployment resource is an inbuilt operator that knows how to scale resources up / down based on its config.
Service mesh implementations use different operators to automate the actions to be taken based on events that happen in the cluster.
What can a service mesh do?
Let us see some of the scenarios that ops team faces and how they can handle it in their kubernetes cluster
Q: Which of my microservices is under a lot of load, and why?
A: No idea. A default Kubernetes setup does not help answer this. We could deploy a sidecar proxy to our pods so that it can send service-level metrics to a monitoring server like Prometheus. Or, deploy a service mesh, which does this automatically.
Q: How do I keep my customers using at least some parts of the system when a dependent service becomes sluggish?
A: Configure circuit breakers in the sidecar proxies so that the calling services can respond quickly (fail fast) when their dependent services become sluggish or fail. Or, install and configure a service mesh to do this declaratively.
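As one example of declarative configuration, Istio expresses circuit breaking through its DestinationRule CRD. The host name and threshold values below are illustrative and would need tuning for a real cluster:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: svc-b-circuit-breaker
spec:
  host: svc-b.default.svc.cluster.local  # hypothetical service
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 10      # requests queued beyond this fail fast
    outlierDetection:
      consecutive5xxErrors: 5            # eject a backend after 5 straight errors
      interval: 30s                      # how often backends are evaluated
      baseEjectionTime: 60s              # how long an ejected backend stays out
```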
Q: My services' east-west traffic crosses regions / datacenters. How do I secure the communication?
A: Switch to mTLS with a local CA so that the services know they are talking to authentic peers. Or, you guessed it right: install a service mesh.
Q: How can I get a picture of the traffic flow between the services I manage?
A: Solutions like Weave Scope and KubeView give only the resource-dependency graph, not the live traffic status. Configure the sidecars to emit this information. Or, same here… service mesh.
A service mesh is an abstraction of such solutions, so they can be applied to any cluster easily. A service mesh is to managing application traffic what Kubernetes is to creating and deploying our applications.
How does it do it?
A service mesh injects a sidecar proxy into pods so that all network traffic flows through it. This component, being independent of the actual application, can be used in lots of interesting ways, as described below.
Setting up mutual TLS authentication manually is not a simple task, as it involves configuring the certificate authority and handling certificate signing requests sent by each participating service in your cluster. But with a sidecar proxy, this concern can be handled in one place. The application continues to send and receive traffic over plain HTTP, as the sidecar proxy wraps or unwraps the TLS layer.
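In Istio, for instance, turning this on is a one-resource change. Applying the following in Istio's root namespace makes sidecars across the mesh accept only mutual-TLS traffic (this is an Istio-specific example; other meshes have their own equivalents):

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system  # root namespace, so the policy applies mesh-wide
spec:
  mtls:
    mode: STRICT           # sidecars reject any plaintext traffic
```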
The sidecar proxy can also send information about the destination services to a monitoring server like Prometheus, to help understand traffic flow patterns. Since this is also done at the sidecar level, the actual application can stay blissfully ignorant of it all.
As you might have noticed, this monitoring traffic could overwhelm the network, as we could be sending a lot of extra packets. So, many service mesh implementations give us control to enable monitoring on demand.
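One common way this wiring happens is the community-convention pod annotations that a suitably configured Prometheus scrape job looks for. The port below is mesh-specific and purely illustrative; check your mesh's documentation for the actual metrics endpoint:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: svc-a
  annotations:
    prometheus.io/scrape: "true"  # convention; needs a matching Prometheus scrape config
    prometheus.io/port: "15090"   # assumed sidecar metrics port, varies by mesh
```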
We can also route traffic to different versions of a service based on rules configured at the sidecar proxy level. In the above illustration, all requests to “/svc-b/user” hit V1 of service B, whereas all requests to the same service but a different endpoint, “/svc-b/order”, hit the already-deployed V2 of service B. This can be used for A/B testing or canary deployments at the service level.
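The routing described above could be expressed, in Istio terms, as a VirtualService like the sketch below (the host and path names mirror the illustration; the v1/v2 subsets would be defined in a companion DestinationRule, not shown):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: svc-b-routing
spec:
  hosts:
  - svc-b                    # hypothetical service name
  http:
  - match:
    - uri:
        prefix: /svc-b/user  # user endpoint stays on the stable version
    route:
    - destination:
        host: svc-b
        subset: v1
  - match:
    - uri:
        prefix: /svc-b/order # order endpoint is routed to the new version
    route:
    - destination:
        host: svc-b
        subset: v2
```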
If you have come this far and think a service mesh will benefit your team, you can explore service mesh implementations like Istio, Linkerd and Consul. Each has its own feature set and manages the service mesh a bit differently (which would be an interesting post on its own). It would also be worth spending some time on the upcoming Service Mesh Interface, which aims to provide standardization across different implementations.