Service meshes are a reaction to some of the problems that have arisen in the wake of the microservices revolution. The decomposition of large, monolithic applications into fine-grained, independently deployed services has created its own set of challenges. This article will explore the basics of service mesh use cases and architecture so that you can decide whether a service mesh is the right solution to these challenges within your own organization.
Microservices architecture can decrease the complexity of an organization’s software development efforts by breaking software into smaller, more manageable units. However, this decreased intra-service complexity comes at the cost of the increased complexity of inter-service communication.
As the number of microservices within an organization grows from the single-digits to dozens to hundreds or thousands, inter-service complexities can become daunting. Consider some of the problems that come with managing a larger number of microservices:
- Low visibility into service health.
- High inter-service latency.
- Difficulty understanding dependencies between microservices.
- Brittle, ad-hoc authentication mechanisms.
There are multiple ways of addressing these issues.
Development teams can build reliability and security mechanisms into their microservices, but this approach is labor-intensive and hard to repeat across services, which may be written in entirely different languages and run on entirely different tech stacks.
Operations teams can use their orchestration tool of choice (Kubernetes, Docker Swarm, etc.) to improve reliability and observability up to a point. However, once an organization’s network of microservices reaches a certain size, this approach is no longer sustainable.
This is where a service mesh comes in.
While all service mesh products are unique in some ways, they generally share a number of core features:
- Service instance proxies, or “sidecars”
- Advanced Layer 7 load balancing and traffic analysis
- Support for the “Circuit Breaker Pattern”
- Service discovery
- Staged rollouts
- Inter-service logging
- Encryption and authentication
- Much, much more…
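One feature from the list above, the circuit breaker pattern, is easy to illustrate in code. The sketch below is a deliberately minimal, in-process version (a real mesh configures this behavior declaratively in the sidecar, outside application code); the class name and threshold are hypothetical:

```python
class CircuitBreaker:
    """Fails fast after `max_failures` consecutive downstream failures.

    Once the circuit is "open," calls are rejected immediately instead of
    hammering an already-struggling service.
    """

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0

    @property
    def open(self):
        return self.failures >= self.max_failures

    def call(self, fn):
        if self.open:
            # Reject without calling downstream at all.
            raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            raise
        self.failures = 0  # any success resets the failure count
        return result
```

Production implementations also re-close the circuit after a cool-down period and probe the downstream service with a trickle of test requests; that is omitted here for brevity.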
Planes and Sidecars
Service mesh products generally share some core architectural features that allow them to accomplish their goal of improving resiliency, observability, and discoverability across an organization’s microservices.
A service mesh is divided into two logical planes: the Control Plane and the Data Plane.
The Data Plane consists of sidecars that run in-band alongside application containers. The sidecars proxy all traffic to and from the application containers, allowing the service mesh to control traffic without any changes to the microservice’s application logic.
The Control Plane is the out-of-band management layer of the service mesh. It receives telemetry from sidecar containers, enforces policies, and provides other network-wide configuration capabilities. The Control Plane provides a set of tools for centrally controlling the behavior of the data plane, often including a CLI, GUI, and API.
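To make the division of labor concrete, here is a toy, in-process sketch of the two planes. The `ControlPlane` and `Sidecar` names and methods are invented for illustration; in a real mesh these interactions happen over the network between separate processes:

```python
class ControlPlane:
    """Out-of-band management layer: collects telemetry, pushes policy."""

    def __init__(self):
        self.sidecars = []
        self.telemetry = []

    def register(self, sidecar):
        self.sidecars.append(sidecar)

    def report(self, event):
        # Sidecars stream telemetry here for a network-wide view.
        self.telemetry.append(event)

    def set_policy(self, allow):
        # Centralized config push to every sidecar (a toy allow/deny flag).
        for sidecar in self.sidecars:
            sidecar.allow = allow


class Sidecar:
    """In-band proxy: every call to the service passes through it."""

    def __init__(self, name, service, control_plane):
        self.name, self.service, self.cp = name, service, control_plane
        self.allow = True
        control_plane.register(self)

    def call(self, request):
        if not self.allow:
            self.cp.report((self.name, request, "denied"))
            raise PermissionError("blocked by mesh policy")
        response = self.service(request)  # the actual data path
        self.cp.report((self.name, request, "ok"))
        return response
```

Note that the service itself (here just a plain function) knows nothing about policy or telemetry; both are handled entirely by the sidecar and the control plane.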
Service Meshes in Action
Service meshes offer a dizzying array of capabilities. To help make things more concrete, let’s take a look at two real-world scenarios where a service mesh could come in handy.
These examples may seem somewhat disconnected at first glance, but what they have in common is that they each leverage the service mesh’s 360-degree view of the microservices network and its role as an intermediary between each individual microservice:
Imagine that you work for a large, global e-commerce company with storefronts visited by millions of users each day. Every second of downtime has a direct revenue impact. How can you introduce potentially complex updates to your applications without creating expensive service outages?
Blue-green deployments help mitigate this type of risk. There are many variations of blue-green deployments, but they all follow the same basic pattern. Let’s walk through that pattern using our e-commerce site as an example:
- Website Blue is running with live traffic.
- An updated version of the website (Website Green) is deployed and tested while traffic is still going to Website Blue.
- A canary deployment is performed; a small amount of live traffic is diverted to Website Green while you observe that everything is functioning correctly.
- The amount of traffic to Website Green is gradually increased as the traffic going to Website Blue is decreased. This is continued until ALL traffic goes to Website Green.
- Website Blue is taken down.
By gradually increasing traffic (instead of updating everything at once) you give your operations team a chance to roll back changes before there are system-wide consequences. This is especially useful in cases where the service being updated has complex interdependencies with other services.
A service mesh is especially well suited to blue-green deployments because it controls all inter-service traffic (the Data Plane) and provides a centralized place to manage deployments and observe global system health (the Control Plane).
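The gradual traffic shift described above is often implemented as weighted routing keyed to a stable request attribute, so that each user consistently lands on the same version during the rollout. Here is a minimal sketch of that idea; the function and weighting scheme are hypothetical, not any particular mesh’s API:

```python
import hashlib

def route(user_id: str, green_weight: int) -> str:
    """Send roughly `green_weight` percent of users to the green deployment.

    Hashing the user id into one of 100 buckets makes routing deterministic:
    the same user always sees the same version at a given weight.
    """
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "green" if bucket < green_weight else "blue"
```

Ramping the rollout is then just a matter of raising `green_weight` from 0 to 100 in the mesh’s central configuration, with no changes to either website deployment.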
Suppose that your e-commerce company relies on a low-availability Inventory Service to fulfill orders. A number of storefront services call the Inventory Service after a purchase has been made. As your organization has grown, more and more storefront services depend on the Inventory Service. These microservices are built on separate technology stacks by distributed teams.
Developers writing the various storefront microservices have tried to mitigate the unreliability of the Inventory Service by building retry logic into their applications, placing greater demand on the Inventory Service (which only exacerbates the problem).
A service mesh could help mitigate this problem by enforcing retry budgets, which cap retries at a small fraction of regular traffic. Retries can still smooth over transient failures, but the budget prevents a flood of retries from causing a downward spiral of service unreliability.
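A retry budget can be sketched as a pair of counters that permit a retry only while retries remain a bounded fraction of regular requests. The sketch below is loosely modeled on that idea; the class and its parameters are illustrative, not a real mesh’s configuration:

```python
class RetryBudget:
    """Allow retries only while retries / requests stays under `ratio`."""

    def __init__(self, ratio=0.2, min_retries=1):
        self.ratio = ratio              # max retries per regular request
        self.min_retries = min_retries  # small floor so low-traffic services can retry
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def can_retry(self):
        allowed = self.min_retries + self.requests * self.ratio
        return self.retries < allowed

    def record_retry(self):
        self.retries += 1
```

With a 20% ratio, a burst of failures can at worst add one retry for every five regular requests to the struggling Inventory Service, rather than the unbounded amplification of naive per-service retry loops. (Real implementations also decay the counters over a sliding time window, omitted here.)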
The Market for Meshes
The market for service mesh technology has consolidated around a few popular tools. While these tools provide a common set of functionalities, you should carefully examine how each fits the needs of your organization specifically before making an implementation decision.
Istio
Istio is a popular service mesh technology that is enjoying explosive growth. Istio was launched in 2017 with backing from technology powerhouses Google, IBM, and Lyft. Istio utilizes Envoy as its default high-performance sidecar proxy. It works with Kubernetes by default but can be extended to use other orchestration tools.
Linkerd (pronounced “linker-dee”)
Linkerd was the original service mesh technology. It is a JVM-based tool (written in Scala) that evolved from Twitter’s Finagle project. Linkerd is maintained by the Cloud Native Computing Foundation and licensed under Apache v2.
Do I Need a Service Mesh?
If your organization is just building its first few microservices, you probably don’t need a service mesh (yet). Even in organizations with a relatively large number of services, the rich capabilities of your container orchestration tool (Kubernetes, Docker Swarm, etc.) may be enough to accomplish your goals.
Implementing a service mesh is a major investment. It requires your team(s) to learn a new technology and introduces another source of complexity into your architecture. Before investing in a service mesh, make sure that your architecture has reached a level of complexity that makes the investment worthwhile.
If you decide that a service mesh is the right solution for you, consider rolling it out to a subset of microservices before doing an organization-wide deployment. Consider your team’s current tech stack and tech skills before adopting a particular service mesh product.
When implemented correctly, a service mesh is an invaluable part of an organization’s tech strategy. As with any other new technology, educating yourself about whether it fits your particular needs is the best approach.