Service Mesh Journey At Trendyol — Part 2

Gokhan Karadas
Trendyol Tech
Published in Trendyol Tech · 7 min read · Apr 28, 2020

In Part 1 of this series, I talked about why we decided to use a service mesh at Trendyol and covered some basic Istio concepts. In this part, we will go into more detailed Istio configuration and tuning for production.

The Istio Service Mesh Architecture

  • Istio service mesh is an intentionally designed abstraction that has both a control plane and a data plane.
  • Istio is a service mesh created by the combined efforts of IBM, Google, and Lyft. The sidecar pattern is enabled by the Envoy proxy and is based on containers.
https://istio.io/docs/ops/deployment/architecture/
  • A control plane that controls the overall network infrastructure and enforces policy and traffic rules.
  • A data plane that uses sidecars based on Envoy, an open source edge proxy.

Getting started

To install Istio we use the Istio Operator. There are several configuration profiles: default, demo, minimal, and remote. We use the minimal configuration profile and changed some of its behavior, such as the Envoy proxy CPU and memory limits. The minimal profile does not include an ingress gateway, so we enabled it to accept outside traffic into the cluster. We don't need mTLS for internal services for now, so Istio is configured with mTLS in "PERMISSIVE" mode, which allows non-TLS requests to enter the mesh.
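An IstioOperator manifest along these lines captures that setup. This is a minimal sketch, not our exact production values; the resource numbers are illustrative:

```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  profile: minimal
  components:
    ingressGateways:
      # the minimal profile disables the ingress gateway, so enable it
      - name: istio-ingressgateway
        enabled: true
  values:
    global:
      proxy:
        # override the default sidecar resource settings
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 256Mi
```

PERMISSIVE is Istio's default mTLS mode, so no extra policy is needed until you decide to enforce STRICT mTLS.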

Sidecar Injection Strategy

In order to use the Istio mesh, we need to enable Envoy sidecar injection.

Istio works by injecting an Envoy-based sidecar proxy alongside your application, which intercepts all inbound and outbound calls going to and from the pod. There are two ways of injecting the Istio sidecar into a pod: manual sidecar injection and automatic sidecar injection. We are using manual sidecar injection in the production environment.

You can disable auto injection on a namespace so that no automatic injection occurs there:

kubectl label namespace trendyol-team istio-injection=disabled
kubectl get namespace -L istio-injection

We were able to roll out Istio on an application-by-application basis by adding the following annotation to pods:

sidecar.istio.io/inject: "true"

Injection occurs at pod creation time. Kill the running pod and verify a new pod is created with the injected sidecar.

One pod, two containers example

The istio-init container sets up the pod network traffic redirection to/from the Istio sidecar proxy. Init containers execute before the sidecar proxy starts.

Application Startup Fail With Sidecar

In the stage environment, after enabling Istio for an application, we realized that some pods could not connect to outbound services. We investigated and found this was occurring because the application container was starting before the istio-proxy container was ready, which caused outbound requests to fail.

We worked around this by excluding specific outbound ports from the traffic redirection set up by the init container. We couldn't find a better solution.

"traffic.sidecar.istio.io/excludeOutboundPorts": "8091,11210"

Ingress Gateway

In order to handle outside traffic to the cluster, we use the Istio ingress gateway. If you remember, we enabled it in our configuration. We use a Gateway to manage inbound traffic. We deployed the ingress gateway as a DaemonSet; in a large Kubernetes cluster you need to separate it per namespace deployment.

kind: Gateway
apiVersion: networking.istio.io/v1beta1
metadata:
  name: browsing-gateway
  namespace: default
spec:
  servers:
    - hosts:
        - 'www.trendyol.com'
      port:
        name: http
        number: 80
        protocol: HTTP
  selector:
    istio: ingressgateway

In this configuration, we accept incoming HTTP traffic for www.trendyol.com. You can create multiple gateway definitions to manage your traffic; for example, you may want to block some IP ranges.
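A Gateway only opens the port; traffic still needs a VirtualService bound to it to reach a backend. A minimal sketch, assuming a hypothetical `browsing-api` service behind the gateway above:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: browsing-vs             # hypothetical name
  namespace: default
spec:
  hosts:
    - 'www.trendyol.com'
  gateways:
    - browsing-gateway          # the Gateway defined above
  http:
    - route:
        - destination:
            # hypothetical backend service and port
            host: browsing-api.default.svc.cluster.local
            port:
              number: 8080
```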

Handle Too High Envoy Resource Overhead

One of the core problems to be solved when Envoy was developed was the observability of services. Therefore, Envoy has embedded a large number of statistics from the very beginning to better observe services.

Envoy offers fine-grained statistics, even at the IP level of the entire cluster. Each IP address carries different metadata under different services, so the same IP address is tracked independently under each service. As a result, Envoy consumes a huge amount of memory. To address this issue, a stats toggle was added to Envoy to enable or disable IP-level statistics. Disabling IP-level statistics directly reduces memory usage.

source_workload: source.workload.name | "unknown"
source_workload_namespace: source.workload.namespace | "unknown"
source_principal: source.principal | "unknown"
source_app: source.labels["app"] | "unknown"
source_version: source.labels["version"] | "unknown"
destination_workload: destination.workload.name | "unknown"
destination_workload_namespace: destination.workload.namespace | "unknown"
destination_principal: destination.principal | "unknown"
destination_app: destination.labels["app"] | "unknown"
destination_version: destination.labels["version"] | "unknown"
destination_service: destination.service.host | conditional((destination.service.name | "unknown") == "unknown", "unknown", request.host)
destination_service_name: destination.service.name | "unknown"
destination_service_namespace: destination.service.namespace | "unknown"
request_protocol: api.protocol | context.protocol | "unknown"
response_code: response.code | 200
grpc_response_status: response.grpc_status | ""
response_flags: context.proxy_error_code | "-"
connection_security_policy: conditional((context.reporter.kind | "inbound") == "outbound", "unknown", conditional(connection.mtls | false, "mutual_tls", "none"))

Istio Pilot Tuning

Pilot provides service discovery for the Envoy sidecars, traffic management capabilities for intelligent routing (e.g., A/B tests, canary deployments, etc.), and resiliency (timeouts, retries, circuit breakers, etc.). It converts high-level routing rules that control traffic behavior into Envoy-specific configurations and propagates them to the sidecars at runtime. Pilot abstracts platform-specific service discovery mechanisms and synthesizes them into a standard format consumable by any sidecar that conforms to the Envoy data plane APIs. Every new sidecar, which means every pod you scale out, adds more load to the Istio control plane.

Pilot connects to every istio-proxy, and every istio-proxy reports metrics to Telemetry V2. Before migrating each namespace to Istio, we monitored and tuned the control plane.

Before tuning our control plane, whenever istio-pilot pushed configuration to the sidecars, our service performance degraded during the push. Pilot CPU usage was almost 2 cores.

We changed the pilot push concurrency value, which defaults to 100 and limits the number of concurrent pushes allowed. We also added some delay to istio-pilot pushes.

PILOT_PUSH_THROTTLE=1: Limits the number of concurrent pushes allowed. On larger machines this can be increased for faster pushes.
PILOT_DEBOUNCE_AFTER=10s: The delay added to config/registry events for debouncing. This will delay the push by at least this interval. If no change is detected within this period, the push will happen; otherwise we keep delaying until things settle, up to a maximum of PILOT_DEBOUNCE_MAX.
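These settings are environment variables on the pilot container. A sketch of the relevant fragment of the pilot deployment (the deployment and container names vary by Istio version; in newer versions the component is `istiod`):

```yaml
# Fragment of the pilot deployment in istio-system
spec:
  template:
    spec:
      containers:
        - name: discovery
          env:
            - name: PILOT_PUSH_THROTTLE
              value: "1"
            - name: PILOT_DEBOUNCE_AFTER
              value: "10s"
```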

After Tuning

CPU usage after 16:00

Envoy Proxy Concurrency

The Envoy proxy concurrency parameter controls the number of sidecar worker threads; tuning it can reduce CPU utilization and improve your application's performance. The default value is 0, which means a worker thread is started for each CPU core. You can experiment with this setting to find the best throughput. Be aware that each change affects the memory and CPU usage of each sidecar process.
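Concurrency can be overridden per pod through the proxy config annotation; a sketch, where two worker threads is an illustrative value, not a recommendation:

```yaml
# Pod template annotation; concurrency can also be set mesh-wide
# in the global proxy configuration.
metadata:
  annotations:
    proxy.istio.io/config: |
      concurrency: 2
```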

Ingress & Egress Gateway Gzip Filter

You can enable gzip for the ingress and egress gateways. Gzip compression is useful in situations where large payloads need to be transmitted without compromising the response time.

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: ingressgateway-gzip-ef
  namespace: istio-system
spec:
  workloadSelector:
    labels:
      istio: ingressgateway
  configPatches:
    - applyTo: HTTP_FILTER
      match:
        context: GATEWAY
        listener:
          filterChain:
            filter:
              name: "envoy.http_connection_manager"
              subFilter:
                name: "envoy.router"
      patch:
        operation: INSERT_BEFORE
        value:
          name: envoy.gzip
          config:
            remove_accept_encoding_header: true
            compression_level: BEST
