The formal description of a service mesh is:
A dedicated infrastructure layer for making service-to-service communication safe, fast, and reliable.
But a service mesh is not an entirely new concept; it is a set of network-level features we already know and have seen scattered across many microservices ecosystems:
- Service discovery: Responsible for finding the network location (think of IPs and ports) of a service.
- Load balancing: The ability to distribute requests correctly, and most of the time evenly, across a set of instances of the same service.
- Service security: The identity verification and encryption of service-to-service communications.
- Observability: The ability to visualize which services “talk” to each other, building a kind of service network graph.
- Telemetry and monitoring: The collection of metrics about service requests, things like request success rate, latency, request volume, …
- Reliability: A set of features with the goal of making service requests more reliable, for example automatically retrying failed requests.
- Circuit breaking: The ability to stop sending requests to a service that is overloaded, for example because retries of failed requests are piling up or its maximum request throughput has been reached.
- Traffic shifting: The ability to selectively shift network traffic to different versions of a service. Useful for canary deployments (gradually shift a small percentage of traffic to a newly deployed version, ensuring that if something is wrong with it only a small percentage of requests are affected) or for routing specific requests, such as those with a certain HTTP header, to a test version of a service.
Many of these features, if not all of them, have existed for years implemented in many applications, either directly in their code or by the usage of libraries.
A service mesh is the extraction of network-level features, like the ones mentioned above, to the outside of the application. This is usually accomplished by passing the services' network traffic through a proxy.
Linkerd was the first service mesh solution. Its creators claim to have coined the name “service mesh”. The following image describes the chronology of the major events in Linkerd's history:
In 2013 Twitter was transitioning from a three-tier to a microservices architecture. They needed many of the features mentioned before, so they created an application / service to implement them, giving them information about, and some control over, their microservices' network requests.
Later, in 2016, this solution was made open source and eventually became known as Linkerd v1. It was built on the Twitter tech stack, which used the Java virtual machine and Scala. Its main characteristics are:
- Highly conﬁgurable
- Powerful and complex
- Multi-platform, runs on Kubernetes, ECS, Mesos, …
In 2017, the Linkerd project was donated to the Cloud Native Computing Foundation, a “child” foundation of the Linux Foundation that is also home to the Kubernetes project.
Later, in 2018, Linkerd v2 was released. Applying many lessons learned from Linkerd v1 and trying to fix some of its pain points, a complete rewrite was done in Rust and Golang with performance and simplicity in mind. Linkerd v2's main design goals are:
- Zero config
- Lightweight and simple
- Kubernetes first
The major reasons for the rewrite were worries about performance and difficulty of adoption. Linkerd v1 is heavy on resource consumption; for example, the v1 proxy runs at the node level rather than at the service instance level to limit the performance cost. Linkerd v1 is also complex, with many configuration options and concepts, and its complicated setup is one of the main reasons many applications chose not to use it.
To correct these issues, in Linkerd v2 Rust was picked for the proxy component in order to maximize its efficiency and performance. The remaining components were written in Golang, mainly because Kubernetes is itself implemented in Golang, Golang has many libraries that integrate well with Kubernetes components, and it is easier to get community support since Kubernetes contributors already use Golang.
Since Linkerd v2 was a complete rewrite of v1, it is still playing catch-up with v1 feature-wise. The current Linkerd v2 features (as of v2.4.0) are:
- TCP Proxying and Protocol Detection: It can proxy TCP traffic (other protocols continue to work, but Linkerd will not proxy them) and detect whether the traffic is HTTP or gRPC.
- HTTP, HTTP/2, and gRPC Proxying: If any of these protocols are used, Linkerd can extract success metrics, handle retries, load balance requests, … If none of these protocols are used, Linkerd can only collect byte-level metrics and secure traffic (mTLS).
- Zero config, Automatic Proxy Injection: Linkerd can automatically inject itself into your application (you just need to provide an annotation to enable this) with no other configuration needed.
- Automatic mTLS by default: By default Linkerd will secure your requests via mutual TLS (Transport Layer Security).
- Load Balancing, Retries and Timeouts: For HTTP and gRPC, Linkerd will load balance requests and can automatically retry failed requests (this needs to be explicitly enabled) with timeout awareness.
- Observability, Telemetry and Monitoring: Linkerd is able to build a network service graph and collect metrics.
- Traffic Splitting: Linkerd enables canary or blue/green deployments by incrementally directing percentages of traffic between various services.
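As a rough illustration of traffic splitting, Linkerd uses the SMI TrafficSplit resource. A sketch like the following (service names, namespace, and weights are hypothetical; weights are relative, not percentages) sends a fifth of the traffic to a new version:

```yaml
apiVersion: split.smi-spec.io/v1alpha1
kind: TrafficSplit
metadata:
  name: my-service-split      # hypothetical
  namespace: my-namespace     # hypothetical
spec:
  service: my-service         # the apex service that clients call
  backends:
  - service: my-service-v1    # current version keeps most traffic
    weight: 80
  - service: my-service-v2    # newly deployed version gets a slice
    weight: 20
```

Shifting the weights over time (80/20, then 50/50, then 0/100) is what turns this into a gradual canary rollout.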
Linkerd v2 architecture?
Linkerd v2 divides its components into two planes:
- Data plane: This plane is where the data transfer from your application occurs.
- Control plane: This plane is responsible for collecting and storing information about requests and for controlling the behavior of the data plane.
The following image describes the high level components of Linkerd v2:
The only Linkerd component in the data plane is the proxy. The proxy is injected into your application pods as a container. In reality, two containers are injected into your pods:
- Linkerd init container: This container runs before your pod's containers and configures iptables to forward all TCP traffic to the Linkerd proxy container. This is how Linkerd intercepts your application's traffic without any code changes.
- Linkerd proxy container: This container runs alongside your application containers, proxying the other containers' TCP traffic, recording its metrics and controlling its network requests.
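As a simplified sketch, the redirection set up by the init container amounts to iptables NAT rules like the following (the port numbers are Linkerd's defaults; the real rules also exempt the proxy's own user ID and a configurable list of skip ports):

```
# inbound connections to the pod -> Linkerd proxy's inbound port
iptables -t nat -A PREROUTING -p tcp -j REDIRECT --to-port 4143

# outbound connections from the application -> proxy's outbound port
iptables -t nat -A OUTPUT -p tcp -j REDIRECT --to-port 4140
```

Because the redirection happens at the kernel's networking layer, the application keeps dialing its upstreams by their normal addresses and never knows the proxy is there.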
This proxy is implemented in Rust; it is very lightweight (it takes an average of 1 millisecond to process one request) and it uses many of the control plane components to provide all of Linkerd's features.
The controller component is implemented in Golang and is made up of several containers:
- public-api: Provides a public API for the Linkerd CLI (Command Line Interface) and the Web component to interact with the service mesh.
- identity: Acts as a CA (Certificate Authority) for the proxy component; it issues certificates to the proxy instances, enabling them to use mTLS.
- destination: Provides service discovery and load balancing information to the proxy component.
- tap: Allows real-time inspection of requests for debugging purposes. Currently it only provides simple information, but in the future it should be able to show header and payload information about requests (when the requests use a protocol supported by Linkerd).
- proxy-injector: Responsible for automatically injecting the proxy container into pods. The proxy is only injected if the pod's namespace or its Deployment / StatefulSet has an annotation indicating that Linkerd should join those pods to the mesh.
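For example, adding the injection annotation to a workload's pod template is enough for the proxy-injector to add the sidecar. The resource names and image below are hypothetical; the annotation itself is the real one Linkerd looks for:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service            # hypothetical workload
spec:
  template:
    metadata:
      annotations:
        linkerd.io/inject: enabled   # ask the proxy-injector to add the sidecar
    spec:
      containers:
      - name: app
        image: my-service:latest     # hypothetical image
```

On pod creation the proxy-injector's admission webhook sees the annotation and mutates the pod spec to include the init and proxy containers described above.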
The Web component is a web application that provides the Linkerd dashboard:
The dashboard provides observability and telemetry on the “meshed” services. It also lets you tap into the live requests of services to see in real time what requests are being made.
The Prometheus component is responsible for scraping the metrics of the proxy component and storing them temporarily. This Prometheus instance is configured for performance: it is not meant to store metrics long term, and it keeps 6 hours of metrics by default. If you already have a Prometheus in your cluster to store application metrics, or if you use another solution like Datadog, you should configure it to fetch metrics from Linkerd's Prometheus.
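One way to do that is Prometheus federation: your own long-term Prometheus pulls everything from Linkerd's instance on a schedule. A sketch of such a scrape config (the target address assumes Linkerd's components run in the default `linkerd` namespace; adjust job names and targets to your setup):

```yaml
scrape_configs:
- job_name: 'linkerd'
  honor_labels: true            # keep the labels Linkerd already attached
  metrics_path: '/federate'
  params:
    'match[]':
    - '{job="linkerd-proxy"}'   # per-proxy request metrics
  static_configs:
  - targets:
    - 'linkerd-prometheus.linkerd.svc.cluster.local:9090'
```

With this in place, Linkerd's Prometheus stays a short-lived cache while your existing monitoring stack owns retention and alerting.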
The Grafana component is used to display many dashboards about the metrics of your “meshed” services:
These dashboards are reachable via links from the Web component.
One of the most important drawbacks to look out for is performance loss. Service meshes usually add a performance tax on every single network request due to the use of a proxy. This tax is paid on every network request, meaning that if a request to your application generates 4 microservice requests, the performance tax is multiplied by 4. In Linkerd v2 this tax is minimized by the proxy being very lightweight, “stealing” only 1 millisecond (on average) from your requests. Currently Linkerd v2 is the fastest service mesh.
The first time a new service / endpoint is called there is a small performance tax due to service discovery, but this only happens on the first request; subsequent requests use cached service discovery information.
Since Linkerd v2 is still playing catch-up with v1 feature-wise, a few common service mesh features are missing. Although many of these features are on the Linkerd v2 roadmap, it is unknown when they will be released.
This last one is not really a drawback, but something to keep in mind: Linkerd (and other service meshes) are not complete solutions for application-wide concerns. For example, Linkerd can make service-to-service communication secure with mTLS, but this does not mean your application is fully secured; you still need to address other security issues in your code. Another example is the automatic retry feature: enabling it does not mean you can completely stop worrying about retries in your code. Some requests cannot be blindly retried, and you will need to check state or consistency before attempting a retry.
Linkerd vs Istio?
One of the most asked questions about Linkerd is:
What is the difference between Linkerd and Istio?
Istio is another service mesh solution. Currently the most popular ones are Istio and Linkerd.
The big difference is that Linkerd v2 is more focused on performance and simplicity, sacrificing some features and configurability. On the other hand, Istio is more feature-rich, complex, and configurable.
The main goals of Linkerd are:
- Be extremely lightweight, so it does not add too much to your application's latency.
- Be simple, so you don’t need to learn a bunch of concepts to use a service mesh.
- Zero config to provide good defaults so you don’t need to provide any configuration for the most common use cases.
If you wish to try it out the Linkerd “Getting started” page is a good place for a tutorial.