Service Mesh Journey At Trendyol — Part 2

Gokhan Karadas
Trendyol Tech
Published in Trendyol Tech · 7 min read · Apr 28, 2020

In Part 1 of this series, I talked about why we decided to use a service mesh at Trendyol and covered some basic Istio concepts. In this part, we will go into more detailed Istio configuration and tuning for production.

The Istio Service Mesh Architecture

  • Istio service mesh is an intentionally designed abstraction that has both a control plane and a data plane.
  • Istio is a service mesh created by the combined efforts of IBM, Google, and Lyft. The sidecar pattern is enabled by the Envoy proxy and is based on containers.
https://istio.io/docs/ops/deployment/architecture/
  • A control plane that controls the overall network infrastructure and enforces policy and traffic rules.
  • A data plane that uses sidecars based on Envoy, an open source edge proxy.

Getting started

To install Istio we use the Istio Operator. There are several configuration profiles: default, demo, minimal, and remote. We use the minimal configuration profile and changed some of its behavior, such as the Envoy proxy CPU and memory limits. The minimal profile does not include an ingress gateway, so we enabled it to accept outside traffic into the cluster. We don't need mTLS for internal services for now, so Istio is configured with mTLS in "PERMISSIVE" mode, which allows non-TLS requests to enter the mesh.
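An IstioOperator manifest along these lines captures that setup. This is a minimal sketch, not our exact production values; the resource numbers are illustrative:

```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  profile: minimal
  components:
    ingressGateways:
      # the minimal profile disables the ingress gateway, so enable it
      - name: istio-ingressgateway
        enabled: true
  values:
    global:
      proxy:
        # override the default sidecar resource settings
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 256Mi
```

PERMISSIVE is Istio's default mTLS mode, so no extra policy is needed until you decide to enforce STRICT mTLS.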

Sidecar Injection Strategy

In order to use the Istio mesh, we need to enable Envoy sidecar injection.

Istio works by injecting an Envoy-based sidecar proxy alongside your application, which intercepts all inbound and outbound calls going to and from the pod. There are two ways of injecting the Istio sidecar into a pod: manual sidecar injection and automatic sidecar injection. We are using manual sidecar injection in the production environment.

You can disable auto injection on a namespace so that no automatic injection occurs there:

kubectl label namespace trendyol-team istio-injection=disabled
kubectl get namespace -L istio-injection

We were able to roll out Istio on an application-by-application basis by adding the following annotation to pods:

sidecar.istio.io/inject: "true"

Injection occurs at pod creation time. Kill the running pod and verify a new pod is created with the injected sidecar.

One pod, two containers example

The istio-init container sets up the pod network traffic redirection to/from the Istio sidecar proxy. Init containers execute before the sidecar proxy starts.

Application Startup Fail With Sidecar

In the stage environment, after enabling Istio for an application, we realized that some pods could not connect to outbound services. We investigated and found this was occurring because the application container was starting before the istio-proxy container was ready, which caused outbound requests to fail.

We worked around this by excluding specific outbound ports from the traffic redirection set up by the init container. We couldn't find a better solution.

"traffic.sidecar.istio.io/excludeOutboundPorts": "8091,11210"

Ingress Gateway

In order to handle outside traffic to the cluster, we use the Istio ingress gateway. If you remember, we enabled it in our configuration. We use a Gateway to manage inbound traffic. We deployed the ingress gateway as a DaemonSet; in a large Kubernetes cluster you need to separate it per namespace deployment.

kind: Gateway
apiVersion: networking.istio.io/v1beta1
metadata:
  name: browsing-gateway
  namespace: default
spec:
  servers:
    - hosts:
        - 'www.trendyol.com'
      port:
        name: http
        number: 80
        protocol: HTTP
  selector:
    istio: ingressgateway

In this configuration, we accept incoming HTTP traffic for www.trendyol.com. You can create multiple gateway definitions to manage your traffic; for example, you may want to block some IP ranges.
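A Gateway only opens the port; traffic still needs a VirtualService bound to it to reach a backend. A minimal sketch, assuming a hypothetical `browsing-api` service behind the gateway above:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: browsing-vs             # hypothetical name
  namespace: default
spec:
  hosts:
    - 'www.trendyol.com'
  gateways:
    - browsing-gateway          # the Gateway defined above
  http:
    - route:
        - destination:
            # hypothetical backend service and port
            host: browsing-api.default.svc.cluster.local
            port:
              number: 8080
```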

Handle Too High Envoy Resource Overhead

One of the core problems to be solved when Envoy was developed was the observability of services. Therefore, Envoy has embedded a large number of statistics from the very beginning to better observe services.

Envoy offers fine-grained statistics, even at the IP level of the entire cluster. Each IP address carries different metadata under different services, so the same IP address is tracked independently under each service. As a result, Envoy consumes a huge amount of memory. To address this issue, a stats toggle was added to Envoy to enable or disable IP-level statistics. Disabling IP-level statistics directly reduces memory usage.

source_workload: source.workload.name | "unknown"
source_workload_namespace: source.workload.namespace | "unknown"
source_principal: source.principal | "unknown"
source_app: source.labels["app"] | "unknown"
source_version: source.labels["version"] | "unknown"
destination_workload: destination.workload.name | "unknown"
destination_workload_namespace: destination.workload.namespace | "unknown"
destination_principal: destination.principal | "unknown"
destination_app: destination.labels["app"] | "unknown"
destination_version: destination.labels["version"] | "unknown"
destination_service: destination.service.host | conditional((destination.service.name | "unknown") == "unknown", "unknown", request.host)
destination_service_name: destination.service.name | "unknown"
destination_service_namespace: destination.service.namespace | "unknown"
request_protocol: api.protocol | context.protocol | "unknown"
response_code: response.code | 200
grpc_response_status: response.grpc_status | ""
response_flags: context.proxy_error_code | "-"
connection_security_policy: conditional((context.reporter.kind | "inbound") == "outbound", "unknown", conditional(connection.mtls | false, "mutual_tls", "none"))

Istio Pilot Tuning

Pilot provides service discovery for the Envoy sidecars, traffic management capabilities for intelligent routing (e.g., A/B tests, canary deployments, etc.), and resiliency (timeouts, retries, circuit breakers, etc.). It converts high-level routing rules that control traffic behavior into Envoy-specific configurations and propagates them to the sidecars at runtime. Pilot abstracts platform-specific service discovery mechanisms and synthesizes them into a standard format consumable by any sidecar that conforms to the Envoy data plane APIs. Every new sidecar, which means every pod you scale out, adds more load to the Istio control plane.

Pilot connects to every istio-proxy, and every istio-proxy reports metrics to Telemetry V2. Before migrating each namespace to Istio, we monitored and tuned the control plane.

Before tuning our control plane, whenever istio-pilot pushed configuration to the sidecars, our service performance degraded during the push. Pilot CPU usage was almost 2 cores.

We changed the pilot push concurrency value, which defaults to 100 and limits the number of concurrent pushes allowed. We also added some delay to istio-pilot pushes.

PILOT_PUSH_THROTTLE=1: Limits the number of concurrent pushes allowed. On larger machines this can be increased for faster pushes.
PILOT_DEBOUNCE_AFTER=10s: The delay added to config/registry events for debouncing. This will delay the push by at least this interval. If no change is detected within this period, the push will happen; otherwise we keep delaying until things settle, up to a maximum of PILOT_DEBOUNCE_MAX.
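These settings are environment variables on the pilot container. A sketch of the relevant fragment of the pilot deployment (the deployment and container names vary by Istio version; in newer versions the component is `istiod`):

```yaml
# Fragment of the pilot deployment in istio-system
spec:
  template:
    spec:
      containers:
        - name: discovery
          env:
            - name: PILOT_PUSH_THROTTLE
              value: "1"
            - name: PILOT_DEBOUNCE_AFTER
              value: "10s"
```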

After Tuning

CPU usage after 16:00

Envoy Proxy Concurrency

The Envoy proxy concurrency parameter controls the number of sidecar worker threads; tuning it can reduce CPU utilization and improve your application's performance. The default value is 0, which means a worker thread is started for each CPU core. You can experiment with this setting to find the best throughput. Be aware that each change affects the memory and CPU usage of each sidecar process.
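Concurrency can be overridden per pod through the proxy config annotation; a sketch, where two worker threads is an illustrative value, not a recommendation:

```yaml
# Pod template annotation; concurrency can also be set mesh-wide
# in the global proxy configuration.
metadata:
  annotations:
    proxy.istio.io/config: |
      concurrency: 2
```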

Ingress & Egress Gateway Gzip Filter

You can enable gzip for the ingress and egress gateways. Gzip compression is useful in situations where large payloads need to be transmitted without compromising the response time.

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: ingressgateway-gzip-ef
  namespace: istio-system
spec:
  workloadSelector:
    labels:
      istio: ingressgateway
  configPatches:
    - applyTo: HTTP_FILTER
      match:
        context: GATEWAY
        listener:
          filterChain:
            filter:
              name: "envoy.http_connection_manager"
              subFilter:
                name: "envoy.router"
      patch:
        operation: INSERT_BEFORE
        value:
          name: envoy.gzip
          config:
            remove_accept_encoding_header: true
            compression_level: BEST
