Adopting Istio for a multi-tenant Kubernetes cluster in production

Vishal Banthia
Aug 28 · 11 min read

This is the 3rd blog post for mercari’s bold challenge month

At Mercari, while migrating our monolithic backend to a microservices architecture, we felt the need for a service mesh and understood its importance in the long run. Most of our incident post-mortem reports have action items such as "implement rate limiting", "implement a better canary release flow", or "better network policies", and this is exactly what a service mesh promises.

Last quarter, we finally decided to take on this challenge and started investing our time in Istio. Since then, we have introduced Istio into our production environment, a multi-tenant, single Kubernetes cluster running more than 100 microservices, without any major incident. In this blog post, I will explain our Istio adoption strategy and our overall journey so far. This post assumes that readers have a fair understanding of what a service mesh and Istio are; Istio’s official documentation does a great job of explaining this.

Motivation

In 2017, due to steep growth in the business and in the number of engineers, we realized our monolithic backend was not scalable and decided to migrate to a microservices architecture. With microservices, we started experiencing a new era of network-related problems: load balancing, traffic control, observability, security… For observability, we succeeded in building a centralized ecosystem for tracing, logging, and metrics using Datadog, but traffic control and security are still at a primitive stage. One of the prime reasons to go for a microservices architecture was to decouple different application components, but due to the lack of modern traffic control policies between services, we still face cascading failures. Introducing circuit breakers and rate limits can potentially solve these cascading failures.

Another big network-related issue for us is gRPC load balancing. We use gRPC for inter-microservice communication. If you have used gRPC in Kubernetes, you must have faced this issue too. In short, gRPC is based on HTTP/2 and uses multiplexing to send RPC calls over a single connection. A Kubernetes Service works at the L4 layer and cannot load balance HTTP/2. A common workaround is client-side load balancing, which requires applications to embed load-balancing logic and couples them to the underlying infrastructure. This is something we really do not want to do, as our belief matches the envoyproxy vision: "The network should be transparent to applications". Our microservices ecosystem is already polyglot, and maintaining infrastructure-specific libraries for each language and keeping services up to date with them is not something developers like to do.

A service mesh helps solve these issues by running a proxy sidecar alongside each application, without changing the application code.

Why Istio?

The Istio service mesh promises to provide a single solution for all network-related problems in a modern microservices architecture: connect, secure, control, and observe. It tries to do this transparently, without modifying any application code, by using envoyproxy as a sidecar. It is open source and backed by Google.

Although it has been a year since Istio was announced as production-ready, it is still hard to find many production success stories. We believe this is because of all the initial challenges required to set up Istio reliably. But once that is done correctly, onboarding developers is low cost because Istio’s CRDs provide the Kubernetes-native user experience our developers are already used to. This led us to start investing our time in Istio.
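As a taste of that Kubernetes-native experience, Istio’s traffic rules are written as custom resources and applied with kubectl like any other manifest. Below is a minimal sketch, with a hypothetical service name, of a VirtualService that splits traffic between two subsets (the subsets themselves would be defined in a companion DestinationRule):

# Hypothetical example, not our actual configuration: a VirtualService
# splitting traffic 90/10 between two subsets of "item-service".
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: item-service
  namespace: item
spec:
  hosts:
  - item-service
  http:
  - route:
    - destination:
        host: item-service
        subset: stable
      weight: 90
    - destination:
        host: item-service
        subset: canary
      weight: 10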

Mercari microservices architecture and cluster ownership model

Before going into Istio, let’s have a look at our microservices architecture and Kubernetes cluster ownership model.

Mercari’s Microservices Architecture

We are on GCP and use its managed Kubernetes service (GKE) to run all our stateless workloads. All client requests come through an API gateway, which routes requests to microservices based on the request’s path. If an endpoint has not yet been migrated to a microservice, the gateway simply proxies the request to the monolithic backend running in our datacenter. This is how we are slowly migrating pieces from the monolith to microservices. Depending on the endpoint, services talk to the gateway using either HTTP or gRPC. All inter-service communication is over gRPC.

In order to minimize cluster operation cost, we use a multi-tenancy model. A multi-tenant cluster is shared by multiple users or teams. In our model, each service has its own namespace, and the service team has full access to its namespace. We also have a cluster operator team called the Microservices Platform Team, which manages the whole cluster and the system namespaces (kube-system, istio-system); this is the team I belong to. This model ensures that responsibilities are well defined based on focus area: a microservice backend team, whose focus is its service, manages the microservice namespace, whereas the platform team, whose focus is the overall infrastructure, manages the cluster. Introducing cluster-wide functionality and ensuring its reliability is the platform team’s responsibility.

Please check these slides to learn more about our architecture in detail.

Multi-Tenant Cluster Ownership Model

Some numbers

  • API Gateway receives 4M RPM during peak time
  • 100+ microservices (100+ namespaces)
  • 200+ developers have direct access to some namespace
  • Mercari, Merpay, Internal Services — all run in the same cluster

As can be seen from these numbers, this cluster is pretty important to us; it has the highest SLO. We need to be very cautious when introducing something that can affect the whole cluster, so ensuring Istio’s reliability is the foremost task. Also, in a multi-tenant cluster it is entirely possible for one team to make a configuration mistake. Our Istio setup should be "well-guarded" from these mistakes: a misconfiguration in one service or namespace should not affect others. And believe me, things can go really, really bad in Istio! More on this later.

Istio Adoption Strategy

Put simply, our strategy was:

Do one thing at a time!

Meaning: introduce one feature at a time, and introduce Istio into one namespace at a time.


Istio’s feature selection

Istio’s mission is to become a single solution for all networking problems in the microservices world: traffic control, observability, and security. It comes with lots and lots of features. The last time I counted the number of Istio-related CRDs, it was around 53. And each feature has some unknown unknowns. For example, all of a sudden some services’ egress requests stopped working because of an outbound port-name conflict. It is good to first narrow the scope of these unknown unknowns by limiting features. Fortunately, Istio’s components are designed so that only the components needed for the required features have to be installed; installing all components is not required.
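To give an idea of what that selective installation can look like, here is an illustrative sketch of Helm values for a minimal install that keeps only pilot and the sidecar injector and disables the rest. This is an assumption for illustration, not our actual configuration, and value names vary between Istio versions, so verify them against your chart:

# Illustrative Helm values for a minimal Istio install; value names
# differ between Istio versions and should be checked against your chart.
pilot:
  enabled: true
sidecarInjectorWebhook:
  enabled: true
mixer:
  telemetry:
    enabled: false
  policy:
    enabled: false
gateways:
  istio-ingressgateway:
    enabled: false
grafana:
  enabled: false
tracing:
  enabled: false
kiali:
  enabled: false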

Out of these three broad feature categories (traffic control, observability, and security), we decided to go for traffic control first. Even within the traffic control category there is a long list of features: load balancing, rate limiting, retries, canary releases, circuit breakers, fault injection, fault tolerance… In reality we need most of these features as soon as possible, but for our initial Istio release we narrowed the requirement down to just load balancing, gRPC load balancing to be precise, for the reason already explained in the motivation section.

With this, we decided our first Istio release goal:

Enable the istio-proxy sidecar (Envoy) in application Pods and extend the service mesh gradually to all namespaces
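In practice, extending the mesh one namespace at a time mostly comes down to opting namespaces into automatic sidecar injection. A minimal sketch, with a hypothetical namespace name, of what that opt-in looks like:

# Opting a single (hypothetical) namespace into automatic sidecar
# injection; the mutating webhook then injects the istio-proxy container
# into new Pods created in this namespace.
apiVersion: v1
kind: Namespace
metadata:
  name: item
  labels:
    istio-injection: enabled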

Istio Feasibility Investigation

After deciding our release goal, we started testing Istio’s feasibility in a sandbox cluster to make sure we could actually do what we had strategized. Our investigation approach and feasibility test requirements were:

  1. Make sure that just installing Istio does not create any side effects in the cluster
  2. Make sure Istio can be introduced gradually, one microservice (namespace) at a time. It should not be backward incompatible, and downstream services should keep working without any modification, whether or not they have the sidecar enabled. Since microservice namespaces are managed by developer teams, we cannot ask them all to introduce Istio at once
  3. Make sure Istio works well for all the types of communication we have inside the cluster: gRPC, HTTP, HTTPS
  4. Make sure there is no noticeable performance degradation in latency and no new errors
  5. Make sure there is no big impact on the cluster if any of Istio’s control plane components goes down

Our investigation journey was not so smooth. We met many challenges that were blockers for us. These issues might not be a big hurdle if your cluster SLO is low, but in our case we had to figure out workarounds and strategies to satisfy the feasibility test requirements.

Challenges

Below are a few of the initial challenges we faced that I thought worth mentioning. Explaining each of them in detail would take a dedicated post, so I will describe them only briefly.

1. Managing istio-proxy lifecycle

network traffic flow in a Pod with istio sidecar enabled

When Istio’s sidecar is enabled in a Pod, all inbound and outbound traffic passes through the sidecar container. It is very important to make sure that:

  • the sidecar starts and is healthy before the application container starts, otherwise any outbound request such as a database connection will fail
  • the application container terminates before the sidecar starts terminating, otherwise outbound requests during that period will fail

Unfortunately, there is no easy way to control this lifecycle in Kubernetes, as a sidecar is not a first-class citizen: container start and termination order is not guaranteed, and any container can start or terminate first. There is an accepted proposal to solve this issue, but the work is in progress and it will take a few releases before this feature lands in Kubernetes. Until then, we use the following workarounds.

Make sure the application container starts after the sidecar

apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
      # Poll the istio-proxy readiness endpoint (istio-agent, port 15020)
      # and exec the application binary only once the sidecar is ready.
      - command: ["/bin/sh", "-c", "while true; do STATUS=$(curl -s -o /dev/null -w '%{http_code}' http://localhost:15020/healthz/ready); if [ \"$STATUS\" -eq 200 ]; then exec /app; else sleep 1; fi; done"]

This ensures the application container’s process starts only after the Envoy sidecar is healthy and ready to take traffic.

Make sure the Envoy sidecar starts terminating only after the application container has terminated

containers:
- name: istio-proxy
  lifecycle:
    preStop:
      exec:
        # Block envoy's shutdown until the application container has
        # closed all of its remaining TCP connections.
        command: ["/bin/sh", "-c", "while [ $(netstat -plunt | grep tcp | grep -v envoy | wc -l | xargs) -ne 0 ]; do sleep 1; done"]

This needs to be added to Istio’s sidecar-injector-configmap.yaml. It makes sure the Envoy sidecar waits until all connections with the application container are terminated. This workaround is taken from this issue.

2. Zero downtime rolling updates

If you are following Istio closely, sporadic 503s during rolling updates are a very common complaint from Istio users. It is not Istio’s fault; it lies in Kubernetes’ design itself. Kubernetes is not consistent, it is "eventually consistent". Istio just adds a little more complexity, making the inconsistency window a bit longer and causing more 503 errors than usual. The popular answer to this issue is to retry these requests, but if downstream services have not enabled Istio, or you are not sure about a service’s idempotency, retries are not feasible.
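For completeness, where retries are safe (the downstream service is idempotent and has the sidecar enabled), Istio can retry transparently through a VirtualService retry policy. A minimal sketch with a hypothetical service name and illustrative values:

# Illustrative retry policy; the service name and numbers are
# hypothetical, not our production configuration.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: item-service
spec:
  hosts:
  - item-service
  http:
  - route:
    - destination:
        host: item-service
    retries:
      attempts: 3
      perTryTimeout: 2s
      retryOn: 5xx,connect-failure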

Kubernetes provides container lifecycle hooks such as preStop, which developers can use to reduce the side effects of this inconsistency. We also configure preStop hooks in application Pods based on the protocol they serve. Explaining this in detail would take a separate post in itself.

For services using gRPC

apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60
      containers:
      - lifecycle:
          preStop:
            exec:
              # Keep the terminating Pod serving long enough for clients
              # to refresh their connection pools to the new endpoints.
              command: ["sh", "-c", "sleep 30"]

In both scenarios, whether the downstream service has Istio enabled or not, client-side load balancing is used. In client-side load balancing, the client maintains a connection pool and refreshes it when the upstream service’s endpoints are updated. This sleep ensures that the upstream service keeps listening for new connections until the endpoints are updated.

For services using HTTP

apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 75
      containers:
      - lifecycle:
          preStop:
            exec:
              # Make envoy fail its health check so that it drains existing
              # connections, then keep the Pod around while that happens.
              command: ["/bin/sh", "-c", "wget -qO- --post-data '' localhost:15000/healthcheck/fail; sleep 30"]

If the downstream service has Istio enabled, the scenario is the same as the gRPC one, and sleep 30 is sufficient because the Istio sidecar refreshes connections when endpoints are updated. When the downstream does not have Istio enabled and client-side load balancing is not used, which is the case for HTTP, the upstream needs to close the connection gracefully; otherwise the client will never know the connection has been closed and will keep sending requests. In Istio, the sidecar, not the application container, holds the connections with clients, so connections need to be closed from the sidecar itself. By calling Envoy’s healthcheck/fail admin endpoint, we can forcefully drain all upstream connections during rolling updates.

3. Istio’s Kubernetes Service port-name convention

Kubernetes Services work at the L4 layer and do not know the L7 protocol. Istio needs to know the higher-level application protocol beforehand so that it can configure the sidecar accordingly. To do that, it relies on a naming convention for Kubernetes Service ports: `<protocol>[-<suffix>]`. If a service does not follow this convention, it can create conflicts that affect many other services in the mesh. The situation is even worse when you have headless Services.

In a multi-tenant cluster, we cannot trust that every service will follow the correct convention. To solve this, we use stein, a YAML validator with custom policies, in our centralized manifest repository. A custom Kubernetes validating admission webhook is a work in progress.
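As an illustration of the convention (the service and port names here are hypothetical), the prefix of each port name tells Istio which protocol to configure the sidecar for:

# Hypothetical Service following Istio's port-name convention.
apiVersion: v1
kind: Service
metadata:
  name: item-service
spec:
  selector:
    app: item-service
  ports:
  - name: grpc-api       # "grpc" prefix: treated as gRPC (HTTP/2)
    port: 5000
    targetPort: 5000
  - name: http-metrics   # "http" prefix: treated as HTTP/1.1
    port: 8080
    targetPort: 8080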

Investigation Results

1. Make sure that just installing Istio does not create any side effects in the cluster

We did not see any negative impact on cluster health from just installing Istio. The Istio control plane runs in its own namespace and is just like any other service.

2. Make sure Istio can be introduced gradually

This was very tricky, and this is where we had to come up with all the workarounds related to sidecar lifecycle management, rolling-update errors, and the port-name convention. Using these workarounds, we were able to confirm that this is doable.

3. Make sure Istio works well for all the types of protocols we have

Yes, it works well, although we had to create protocol-specific workarounds for rolling-update errors.

4. Make sure there is no performance degradation

We measured latency before and after introducing Istio and did not see much of an increase; it was in the range of 5–10 ms at p99. One reason for this low overhead is that we are not using any Mixer functionality yet.

5. Make sure we have a backup plan when Istio components are down

We did resiliency testing for the Istio components we currently use (pilot, sidecar-injector). Our results show there will not be a big impact if any of these components goes down temporarily. The sidecar will keep working if pilot is down, although it will not receive the latest configuration updates. We prepared all the necessary monitoring and playbooks to deal with these scenarios.

Conclusion

Adopting Istio in a multi-tenant cluster needs the right strategy, and if the cluster SLO is high it is all the more tedious. It took a team of 2 members a whole quarter to reliably introduce Istio into our production cluster, but we are happy with the end result. Our initial goal of using Istio for gRPC load balancing has been met, and we have already started investing in other traffic control features, which we will roll out to production gradually.

In the future, we would also like our control plane to be fully managed, the same as GKE, and we are looking forward to Traffic Director once it starts supporting Istio CRDs. This was one of the main reasons for choosing Istio in the first place.

Lastly, if you are interested in working on the Microservices Platform Team at Mercari, please have a look at our hiring page. Lots of interesting projects are going on.
