Service Mesh Journey At Trendyol — Part 1

Gokhan Karadas
Published in Trendyol Tech
6 min read · Apr 28, 2020

We are a tech company. Technology is the driver; e-commerce is the outcome. The Trendyol Tech team enables us to grow faster by using technology. We have cross-functional, autonomous teams that are organized around a line of business, learn continuously, and have various technical competencies. We are all working towards the same goal: to create an excellent experience for our customers at Trendyol.

Before going deep into service mesh, I want to explain why we need one.

At Trendyol we run hundreds of microservices that together deliver the best customer experience we can. Running microservices at scale comes with its own challenges. Compared to last year, the number of our microservices keeps growing, so it becomes increasingly difficult to understand the interactions between all these services. When a problem occurs in a microservice world, it can be really difficult to find where it is. A service mesh addresses these challenges by implementing cross-cutting capabilities, such as traffic management, monitoring, security, and deployment strategies, as configuration that is managed as code.

We could try to solve this problem by integrating client-side libraries. In many ways, libraries like Finagle, Stubby, and Hystrix were the first service meshes. However, they were specific to the details of their surrounding environment and required the use of specific languages and frameworks.

Early in 2019, we started looking at ways to handle this problem and concluded that a service mesh was the answer. There are many service mesh tools available, such as Istio, Linkerd 2, AWS App Mesh, Consul Connect, and Kuma. We found Istio to be the best option for our production use because it offered more mature features and a larger community.

What is Service Mesh?

A service mesh manages all service-to-service communication within a microservice-based system. It typically accomplishes this with "sidecar" proxies that are deployed alongside each service. It is useful for traffic management and monitoring, but like every useful thing it comes with tradeoffs, such as added complexity and runtime resource usage.

  1. Service 1 sends a request to Service 2.
  2. On exiting Service 1, the request is redirected to its sidecar.
  3. The sidecar sends the request to Service 2 through the mesh.
servicemesh.es

How did we decide Istio is the best option for us?

Before starting the service mesh implementation, we analyzed our requirements. In our use case, we strongly needed timeout management, circuit breaking, gRPC load balancing, rate limiting, performance, and monitoring.

We ran benchmark tests to measure the latency added to the communication between our microservices. Istio and the other service mesh products were nearly the same in this respect, although I think Istio's resource consumption is higher than the others'.

PoC Team

We have more than thirty teams solving e-commerce problems at Trendyol. First, we looked for a team that would be easy to integrate into an Istio mesh. We decided to start with the Browsing team, which is responsible for serving personalization and product detail data to other gateways.

Istio Mesh

Istio is a service mesh built around the Envoy proxy to manage and control the flow of traffic, secure services, and show what is happening between them. In addition, Istio works well with other common infrastructure and monitoring components such as Jaeger, Grafana, Kiali, and Prometheus.

  • Load balancing for HTTP, gRPC.
  • Rich routing rules, retries, failovers, and fault injection.
  • Monitoring mesh services with more details.
  • Service to service authentication and authorization.
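
As an illustration of the first bullet, here is a minimal sketch of how gRPC load balancing is usually expressed in Istio. gRPC multiplexes requests over long-lived HTTP/2 connections, so connection-level balancing tends to pin a client to one backend; the sidecar can balance per request instead. The service name `recommendation`, its labels, and the port number are hypothetical, not taken from our setup.

```yaml
# Hypothetical Kubernetes Service: the "grpc" port name lets Istio
# treat the traffic as gRPC instead of opaque TCP.
apiVersion: v1
kind: Service
metadata:
  name: recommendation          # hypothetical service name
spec:
  selector:
    app: recommendation
  ports:
    - name: grpc                # port name prefix selects the protocol
      port: 9090
      targetPort: 9090
---
# Ask the client-side sidecars to spread requests across all endpoints.
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: recommendation
spec:
  host: recommendation
  trafficPolicy:
    loadBalancer:
      simple: ROUND_ROBIN
```

Round robin is only one of the available policies; which one fits best depends on the workload.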

We spent several months load testing and configuring Istio in our staging environment. First of all, we ran service-to-service load test benchmarks. We found that our baseline setup (without the Envoy proxy sidecar) handled more requests per minute (rpm) than the setup with the Envoy proxy.

The first test is the 75k rpm baseline. The second one is with the Envoy proxy. We changed some Istio configuration to find the best rpm for us. I will cover this in depth in the next parts.

Browsing Team Mesh Graph

Which Istio features are the priority for you?

Istio has great features: connectivity, observability, security, and traffic control. It is very hard to adopt all of them at once, so we decided to focus first on service resilience and monitoring. For service resilience, we enabled retry policies, timeouts, and circuit breaking. Some teams have different requirements, such as end-to-end authentication and JWT policy enforcement.
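
To make the resilience features concrete, here is a minimal sketch of a timeout, a retry policy, and circuit breaking for a single service. The service name `product-detail` and every threshold are hypothetical placeholders, not the values we run in production. Timeouts and retries live in a VirtualService, while circuit breaking (connection pool limits plus outlier detection) lives in a DestinationRule.

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: product-detail            # hypothetical service name
spec:
  hosts:
    - product-detail
  http:
    - route:
        - destination:
            host: product-detail
      timeout: 2s                 # overall deadline for the request
      retries:
        attempts: 3               # retry up to three times
        perTryTimeout: 500ms      # deadline for each attempt
        retryOn: 5xx,connect-failure
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: product-detail
spec:
  host: product-detail
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100           # cap on connections to the service
      http:
        http1MaxPendingRequests: 50   # cap on queued requests
    outlierDetection:
      # Older Istio releases name this field consecutiveErrors.
      consecutive5xxErrors: 5         # eject a host after five 5xx responses
      interval: 10s                   # how often hosts are evaluated
      baseEjectionTime: 30s           # minimum ejection duration
      maxEjectionPercent: 50          # never eject more than half the hosts
```

Keeping the per-try timeout multiplied by the number of attempts within the overall timeout avoids retries that can never complete.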

The Hidden Cost of the Data Plane

Be aware of the extra CPU and memory that the Envoy proxy consumes, and scale your cluster accordingly. Why? Because the mesh adds another container, the Envoy sidecar, to every pod in your clusters. So the question is: how much is it going to cost? That depends on how many resources the Istio proxies consume, which in turn depends on how much traffic they handle.

global:
  proxy:
    resources:
      requests:
        cpu: 10m
        memory: 128Mi
      limits:
        cpu: 2000m
        memory: 2024Mi

Raise the proxy resource limits if your sidecars need more CPU than the configured limit (2000m, i.e. 2 CPUs, in the values above).
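
Not every workload needs the same proxy budget. As a sketch, Istio's sidecar injector also reads per-pod annotations that override the global proxy resources, so a busy service can get a bigger sidecar without changing the defaults for everyone. The Deployment name, labels, image, and values below are hypothetical, and the set of supported annotations may differ between Istio versions.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: product-detail                 # hypothetical workload
spec:
  replicas: 2
  selector:
    matchLabels:
      app: product-detail
  template:
    metadata:
      labels:
        app: product-detail
      annotations:
        # Per-pod overrides for the injected Envoy sidecar.
        sidecar.istio.io/proxyCPU: "100m"           # CPU request
        sidecar.istio.io/proxyMemory: "128Mi"       # memory request
        sidecar.istio.io/proxyCPULimit: "2000m"     # CPU limit
        sidecar.istio.io/proxyMemoryLimit: "1024Mi" # memory limit
    spec:
      containers:
        - name: app
          image: example.com/product-detail:1.0.0   # placeholder image
          ports:
            - containerPort: 8080
```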

data:
  mesh: |-
    # Set enableTracing to false to disable request tracing.
    enableTracing: false
    # which makes pilot generates tracing sampling config for envoy
If you do not need distributed tracing across the whole mesh, you should disable it. With tracing disabled, Envoy does not send tracing information to the tracing backends.

Check Control Plane

What happens if your control plane goes down? Pilot, a control plane component, is responsible for Istio's traffic management features and for updating all sidecars with the latest mesh configuration. When Pilot detects a change in the mesh (it watches Kubernetes resources, which are persisted in etcd), it pushes the new configuration to the sidecars over a gRPC connection. All of this configuration lives in the sidecar proxy's memory. When Pilot goes down, existing traffic is not affected, because each Envoy proxy keeps the last configuration it received.

Nginx 426 Upgrade Problem

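After we enabled the ingress gateway, our Nginx started receiving 426 Upgrade Required errors. Nginx proxies requests to upstreams over HTTP/1.0 by default, and Envoy rejects HTTP/1.0 requests unless told otherwise, so we enabled HTTP/1.0 support on the Istio ingress gateway.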

> GET / HTTP/1.0
> Host: kiali.istio-system.svc.cluster.local:20001
> Accept: */*
>
< HTTP/1.1 426 Upgrade Required
< server: envoy
< content-length: 0

Enable support for HTTP/1.0 traffic on the ingress gateway:

gateways:
  istio-ingressgateway:
    env:
      ISTIO_META_HTTP10: '"1"'

Conclusion

A service mesh is a critical component of a large system. It provides security, traffic management, and observability while taking those concerns out of developers' hands. On the other hand, every benefit comes with a learning curve, cost, and management complexity. In the next part, we will go into more detail about Istio.
