Istio’s published benchmarks state:
As of Istio 1.1, a proxy consumes about 0.6 vCPU per 1000 requests per second.
For our first edge in the service mesh (two proxies, one on each side of the connection), we’re looking at 1,200 cores for the proxies alone per million requests per second. Google’s pricing calculator estimates around $40/month/core for n1-standard-64 nodes, which puts this single edge at roughly $50k/month per million RPS.
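The arithmetic behind that estimate can be sketched in a few lines. This is only a back-of-the-envelope calculation using the figures quoted above; the constant and function names are my own, and the $40 price is the rounded calculator estimate rather than an exact quote.

```python
# Back-of-the-envelope proxy cost for one mesh edge, using the figures quoted above.
VCPU_PER_1000_RPS = 0.6    # Istio 1.1 benchmark figure, per proxy
PROXIES_PER_EDGE = 2       # one sidecar on each side of the connection
USD_PER_CORE_MONTH = 40    # rounded n1-standard-64 price per core (approximate)


def edge_cost(rps):
    """Return (cores, USD per month) consumed by the proxies on one edge."""
    cores = rps / 1000 * VCPU_PER_1000_RPS * PROXIES_PER_EDGE
    return cores, cores * USD_PER_CORE_MONTH


cores, usd = edge_cost(1_000_000)
print(f"{cores:,.0f} cores, ~${usd:,.0f}/month")
```

With the rounded $40 figure this comes out near $48k per month for 1,200 cores, the same ballpark as the estimate above.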
Looks like the values-istio-test.yaml is going to raise the CPU requests by quite a bit. If I’ve done my math correctly, it’s around 24 CPUs for the control plane and 0.5 CPU for each proxy. That’s more than my current personal account quota. I will re-run the tests once my request to increase my CPU quotas is approved.
I needed to see for myself if Istio was comparable to another open-source service mesh: Linkerd.
Installing the Service Meshes
First thing, I installed SuperGloo in the cluster:
I used SuperGloo because it was super simple to get both service meshes bootstrapped quickly, with almost no effort on my part. We’re not using SuperGloo in production, but it was perfect for a task like this. It was literally two commands per mesh. I used two clusters for isolation: one for Istio and one for Linkerd.
The experiment was run on Google Kubernetes Engine. I used Kubernetes 1.12.7-gke.7 and a node pool of n1-standard-4 nodes with node autoscaling enabled (min 4, max 16).
I then installed both service meshes using the command line tool.
After a few minutes of CrashLooping, the control planes stabilized.
(Note: SuperGloo currently only supports Istio 1.0.x. This experiment was re-tested with Istio 1.1.3 with no measurable difference.)
Set up Istio Auto Injection
To get Istio to install the Envoy sidecar, we use the sidecar injector, which is a MutatingAdmissionWebhook. It’s out of the scope of this article, but in a nutshell, a controller watches all new pod admissions and dynamically adds the sidecar and the initContainer which does the
At Shopify, we wrote our own admission controller to do sidecar injection, but for the purposes of this benchmark, I used the one that ships with Istio. The default one does injection when the label istio-injection: enabled is present on the namespace:
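As a rough illustration of the label check described above, here is a minimal sketch of the decision an injector makes for each new pod. It is a simplification and not Istio’s actual injector code; the function name and the sample label sets are hypothetical, and the real webhook takes more than this one label into account.

```python
# Simplified sketch of the namespace-label check; not Istio's actual injector code.
INJECTION_LABEL = "istio-injection"


def should_inject(namespace_labels):
    """Inject the sidecar only when the namespace carries istio-injection: enabled."""
    return namespace_labels.get(INJECTION_LABEL) == "enabled"


# The label sets below are hypothetical examples.
print(should_inject({"istio-injection": "enabled"}))
print(should_inject({"team": "platform"}))
```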
Set up Linkerd Auto Injection
To set up Linkerd sidecar injection, we use annotations (which I added manually with
The Istio Resiliency Simulator (IRS)
We developed the Istio Resiliency Simulator to try out some traffic scenarios that are unique to Shopify. Specifically, we wanted something that we could use to create an arbitrary topology to represent a specific portion of our service graph that was dynamically configurable to simulate specific workloads.
The flash sale is a problem that plagues Shopify’s infrastructure. Compounding that is the fact that Shopify actually encourages merchants to have more flash sales. For our larger customers, we sometimes get advance warning of a scheduled flash sale. For others, they come completely by surprise and at all hours of the day & night.
We wanted IRS to be able to run “workflows” that represented topologies and workloads that we’d seen cripple Shopify’s infrastructure in the past. One of the main reasons we’re pursuing a service mesh is to deploy reliability and resiliency features at the network level, and proving that it would have been effective at mitigating past service disruptions is a big part of that.
The core of IRS is a worker which acts as a node in a service mesh. The worker can be configured statically at startup, or dynamically via a REST API. We use the dynamic nature of the workers to create workflows as regression tests.
An example of a workflow might be:
- Start 10 servers, as service
- Start 10 clients, sending 100 RPS each to
- Every 10 seconds, take down 1 server, monitoring 5xx levels at the client
At the end of the workflow, we can examine logs & metrics to determine a pass/fail for the test. In this way, we can both learn about the performance of our service mesh and also regression test our assumptions about resiliency.
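To make the shape of such a workflow concrete, here is a toy model of the example above. It is not IRS code: the per-server capacity, the pass/fail thresholds, and all names are assumptions chosen purely for illustration, and it treats any request beyond the remaining capacity as a 5xx.

```python
# Toy model of the example workflow; capacity and thresholds are assumptions.
CLIENTS = 10
RPS_PER_CLIENT = 100
SERVERS = 10
KILL_INTERVAL_S = 10
SERVER_CAPACITY_RPS = 250   # hypothetical per-server limit, not a measured value


def run_workflow():
    """Yield (time_s, live_servers, fraction of requests answered with 5xx)."""
    offered = CLIENTS * RPS_PER_CLIENT
    for step in range(SERVERS + 1):
        live = SERVERS - step   # one server is taken down per interval
        served = min(offered, live * SERVER_CAPACITY_RPS)
        yield step * KILL_INTERVAL_S, live, (offered - served) / offered


def passes(max_error_rate=0.01, min_servers=4):
    """Pass if the 5xx rate stays under the threshold while enough servers remain."""
    return all(err <= max_error_rate
               for _, live, err in run_workflow() if live >= min_servers)


for t, live, err in run_workflow():
    print(f"t={t:>3}s live={live:>2} 5xx={err:.0%}")
print("pass" if passes() else "fail")
```

In a real run the error rate would come from client logs and metrics rather than a capacity formula, but the pass/fail check at the end has the same structure.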
(Note: We’re thinking of open-sourcing IRS, but are not ready to do so right now.)
IRS for Service Mesh Benchmarking
For this purpose, we set up some IRS workers as follows:
irs-client-loadgen: 3 replicas that send 100 RPS each to
irs-client: 3 replicas that receive a request, wait 100ms, and forward the request to
irs-server: 3 replicas that return
With this setup, we can measure a steady stream of traffic between 9 endpoints. The sidecars on irs-server receive a total of 100 RPS each and irs-client sees 200 RPS (inbound & outbound).
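The per-sidecar numbers follow from the topology, and a short sketch makes the bookkeeping explicit. It assumes the request path is loadgen to client to server and that traffic is spread evenly across the 3 replicas in each tier; the dictionary layout and function name are mine.

```python
# Per-replica sidecar traffic implied by the topology, assuming even load balancing.
LOADGEN_RPS = 100   # each loadgen replica sends 100 RPS
REPLICAS = 3        # every tier runs 3 replicas, so traffic maps 1:1 per replica

SIDECAR_RPS = {
    "irs-client-loadgen": {"inbound": 0, "outbound": LOADGEN_RPS},
    "irs-client": {"inbound": LOADGEN_RPS, "outbound": LOADGEN_RPS},
    "irs-server": {"inbound": LOADGEN_RPS, "outbound": 0},
}


def total_rps(name):
    """Total requests per second handled by one sidecar of the given workload."""
    rps = SIDECAR_RPS[name]
    return rps["inbound"] + rps["outbound"]


for name in SIDECAR_RPS:
    print(f"{name}: {total_rps(name)} RPS per sidecar, "
          f"{total_rps(name) * REPLICAS} RPS across the tier")
```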
We monitor the resource usage via DataDog, since we don’t maintain a Prometheus cluster.
First, we looked at the control plane CPU usage.
The Istio control plane uses ~35x more CPU than Linkerd’s. Admittedly this is an out-of-the-box installation, and the bulk of the Istio CPU usage is from the istio-telemetry deployment, which can be turned off (at the cost of features). Even with the mixer removed, Istio’s control plane still uses over 100 millicores, which is 4x more CPU than Linkerd’s.
Next, we looked at the sidecar proxy usage. This should scale linearly with your request rate, but there is some overhead for each sidecar which will affect the shape of the curve.
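One way to picture "linear plus overhead" is a simple model with a fixed idle cost and a per-request cost. The sketch below is only illustrative: the slope reuses Istio’s published 0.6 vCPU per 1000 RPS figure (0.6 millicores per request per second), and the idle overhead is a made-up placeholder, not a measurement from this experiment.

```python
# Illustrative linear model: CPU = fixed overhead + per-request cost * RPS.
PER_RPS_MCORES = 0.6    # from Istio's published 0.6 vCPU per 1000 RPS
IDLE_MCORES = 10.0      # hypothetical fixed overhead per sidecar


def sidecar_cpu_mcores(rps, per_rps=PER_RPS_MCORES, idle=IDLE_MCORES):
    """Estimated sidecar CPU in millicores at a given request rate."""
    return idle + per_rps * rps


low, high = sidecar_cpu_mcores(100), sidecar_cpu_mcores(200)
print(f"100 RPS: {low:.0f} mcores, 200 RPS: {high:.0f} mcores, ratio {high / low:.2f}")
```

Because of the fixed term, doubling the request rate in this model less than doubles the CPU, which is the kind of curve-shape effect to keep in mind when comparing sidecars that see different traffic.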
These results made sense, since the client proxy receives 2x the traffic of the loadgen proxy: for every outbound request from the loadgen, the client gets one inbound and one outbound.
We see the same shape of results for the Istio sidecars.
Overall, though, the Istio/Envoy proxies use ~50% more CPU than Linkerd.
We see the same pattern on the server side:
On the server side, the Istio/Envoy sidecar uses ~60% more CPU than Linkerd.
For this synthetic workload, Istio’s Envoy proxy uses over 50% more CPU than Linkerd’s proxy. Linkerd’s control plane uses a tiny fraction of the CPU that Istio’s does, especially when considering only the “core” components.
We’re still trying to figure out how to mitigate some of this CPU overhead. If you have some insight or ideas, we’d love to hear from you.