Expedia Group Technology — Engineering

Enhancing Istio’s Performance: Unveiling the Impact of Lock Contention on Expedia Group’s Compute Platform

A deep dive into the steps taken to uncover the root cause of Istio’s lock contention

Raghav Grover
Expedia Group Technology

--

Photo by Frames For Your Heart on Unsplash

Istio, a popular CNCF graduated open-source service mesh, provides a powerful set of features, including traffic management, load balancing, service discovery, and telemetry. However, Expedia Group™️’s Compute Platform team faced its own challenges with Istio while scaling it across hundreds of Kubernetes clusters, where a single cluster can run thousands of pods and see more than fifty releases a day.

This blog discusses how Expedia Group’s Compute Platform team built a custom Istio Pilot with added observability and tweaked controller queues. This helped narrow down the root cause by uncovering lock contention that introduced significant delays in proxy configuration propagation and resulted in outages. This write-up also covers the impact of the interim workaround Expedia Group implemented before the fix was available upstream.

Describing the issue

Applications deployed into one of our large Kubernetes-based clusters faced many 503 errors from Istio, and logs were full of upstream_reset_before_response_started{connection_failure}. After isolating one of the pods reported as an upstream host in the logs, we found that ingress pods continued sending traffic to a pod that had been terminated 9 minutes earlier, as shown in the following image. Given the criticality of this problem, we raised an issue with the Istio community to get help.

Istio ingress continued to route traffic to a terminated pod for 9 minutes.

Correlating other metrics with cluster health at the time suggested that applications with high replica counts had been released shortly before the issue, accompanied by the HPA scaling down another large application. Was the cluster too large for Istio to handle? Let us find out next!

Investigating some plausible bottlenecks

Are Kubernetes informers delayed?

Delayed informers would cause Istiod, the control plane, to act on stale cluster state, which would explain routing traffic to an already terminated pod. But API server metrics didn’t indicate this, so we ran a custom informer locally connected to the cluster and observed the difference between the pod creation timestamp and when the informer event was received: it was on the order of milliseconds, so this theory was ruled out!
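
For reference, here is a minimal sketch of the kind of check we ran, assuming a client-go informer connected through a local kubeconfig; the kubeconfig path and the logging are illustrative, not our exact tool:

package main

import (
	"log"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Connect using the local kubeconfig (illustrative path).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	factory := informers.NewSharedInformerFactory(client, 0)
	podInformer := factory.Core().V1().Pods().Informer()

	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			pod := obj.(*corev1.Pod)
			// For pods created after the watch starts, this is roughly the
			// delay between the API server persisting the pod and the
			// informer delivering the event to us.
			lag := time.Since(pod.CreationTimestamp.Time)
			log.Printf("pod %s/%s: informer lag %v", pod.Namespace, pod.Name, lag)
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	cache.WaitForCacheSync(stop, podInformer.HasSynced)
	select {} // run until interrupted
}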

Is the endpoint controller not reacting to pod terminations fast enough?

We took AWS audit logs and analyzed all events on Endpoints and EndpointSlices with a script that output the count of healthy addresses at different points in time. That matched what we saw in our dashboards, pointing to no lag from the endpoint controller.
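
The heart of that script was simply counting ready addresses per EndpointSlice at each point in time. A rough Go sketch of that counting logic (the function name is ours, for illustration):

package audit

import (
	discoveryv1 "k8s.io/api/discovery/v1"
)

// readyAddresses counts the addresses in an EndpointSlice whose Ready
// condition is true, i.e. the "healthy addresses" we tracked over time.
func readyAddresses(slice *discoveryv1.EndpointSlice) int {
	count := 0
	for _, ep := range slice.Endpoints {
		if ep.Conditions.Ready != nil && *ep.Conditions.Ready {
			count += len(ep.Addresses)
		}
	}
	return count
}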

Is Istiod starved for resources?

We saw heavy CPU throttling in Istiod, but even raising the CPU allocation to a much higher value didn’t help, so we wanted to get to the root cause.

Is Istiod generating too many pushes for a single pod termination and thus getting overworked?

We collected events from the audit logs and observed a spike in pushes from Istiod. In the chart below, the gray line shows partial pushes compared to the blue line, which shows Kubernetes events. However, this theory was ruled out by the Istio maintainer here.

Number of Kubernetes events compared to full and partial Istio pushes

Primary bottleneck — Istiod

Istiod is primarily divided into three parts (in sequence):

  • Controller queue — Events from all Kubernetes informers are put into this nonconcurrent queue. The informers include watches on pods, services, endpoints, nodes and more.
  • Debounce — Events are grouped together to limit the number of push requests (a simplified sketch of this grouping follows the diagram below).
  • Push queue — After debouncing, events are pushed to this queue, and PILOT_PUSH_THROTTLE, which defaults to 100, is the maximum number of concurrent pushes of Istio config to connected proxies.
Istiod internals, events being pushed to Controller queue and then to the push queue after being debounced.
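
To make the debounce step more concrete, here is a rough sketch of grouping events with a quiet period and a maximum delay. It is not Istio’s implementation (Istio tunes this behavior with environment variables such as PILOT_DEBOUNCE_AFTER and PILOT_DEBOUNCE_MAX); it just illustrates the idea:

package debounce

import "time"

// Debounce groups incoming events into a batch and flushes the batch once no
// new event has arrived for `quiet`, or once `maxDelay` has elapsed since the
// first event of the batch. Parameter names are illustrative.
func Debounce(events <-chan string, quiet, maxDelay time.Duration, flush func([]string)) {
	var batch []string
	var firstSeen time.Time

	for {
		if len(batch) == 0 {
			// Nothing pending: block until the first event of the next batch.
			ev, ok := <-events
			if !ok {
				return
			}
			batch = append(batch, ev)
			firstSeen = time.Now()
			continue
		}

		// Wait for the quiet period, but never beyond maxDelay overall.
		wait := quiet
		if remaining := maxDelay - time.Since(firstSeen); remaining < wait {
			wait = remaining
		}

		select {
		case ev, ok := <-events:
			if !ok {
				flush(batch)
				return
			}
			batch = append(batch, ev)
		case <-time.After(wait):
			flush(batch)
			batch = nil
		}
	}
}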

There are out-of-the-box metrics provided around the Push queue and Debounce queue, but during the incident and the pod churn there was no elevation in push time (pilot_proxy_queue_time and pilot_xds_push_time). On the contrary, it was more suppressed than usual, which seemed to indicate that the blockage was somewhere earlier in the pipeline.

Push and queue time is more suppressed than usual indicating blockage is somewhere prior.

Istio remote debugging

Fortunately, we could reproduce the issue in our test environments by triggering many rolling restarts of high replica-count deployments. Istio has excellent documentation on remote debugging, which allowed us to attach a debugger in our IDE to a running pod via Delve, leading us to the following snapshot. It shows around 20K events in the queue that took a long time to clear, signifying a contention!

20K events in the controller queue

After some code deep dives, it was clear that instances of this queue are used in multiple places, and events from the API server are lined up here and dequeued one at a time. The items in this controller queue are callbacks, or functions, that get processed (executed) when they are dequeued right here. Snippet below:

// The dequeued callback is executed here; on error it is logged (simplified).
if err := task(); err != nil {
	log.Errorf("error executing queued task: %v", err)
}

These functions are pushed into the queue when an informer receives an event. For example, a pod creation event push looks something like this:

c.queue.Push(func() error {
	return wrappedHandler(nil, obj, model.EventAdd)
})

These handlers process events based on the type of the Kubernetes object: pod, service, endpoints and more. For each object type and event type there is a different handler.
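
Conceptually, then, the controller queue is a single-worker queue of callbacks: anything behind a slow callback has to wait. A simplified sketch of that behavior (not Istio’s actual queue implementation):

package queue

import "log"

// Task is a deferred callback, like the ones pushed by the informer handlers.
type Task func() error

// Instance is a simplified, single-worker task queue: items run strictly one
// at a time, so a single slow callback delays everything queued behind it.
type Instance struct {
	tasks chan Task
}

func NewInstance(size int) *Instance {
	return &Instance{tasks: make(chan Task, size)}
}

// Push enqueues a callback; it runs only after all earlier items finish.
func (q *Instance) Push(t Task) {
	q.tasks <- t
}

// Run drains the queue with a single goroutine (no concurrency).
func (q *Instance) Run(stop <-chan struct{}) {
	for {
		select {
		case <-stop:
			return
		case task := <-q.tasks:
			if err := task(); err != nil {
				log.Printf("task failed: %v", err)
			}
		}
	}
}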

From an observability standpoint, this queue only logged errors while processing events; there were no metrics on the depth or latency of the queue that could have pointed us in the right direction.

Uncovering lock contention

After finding the source of the bottleneck, the controller queue, the next challenge was to identify the type of event involved and add observability around:

  • Type of object associated with the callback
  • The time it took to dequeue and start processing
  • The time it took to process the event and execute the callback

Since the Push function was being called from more than twenty places, we decided to change the implementation and modify the interface to push an enriched version of the callback:

type EnrichedTask struct {
	Task       func() error
	Type       string
	QueuedTime time.Time
}

So the Task field still held the original callback, but we modified all call sites to also set Type and QueuedTime to provide more information. Now, while dequeuing and executing Task(), we add timers to measure the time taken to dequeue and the time taken to process the event, and log it all, something like below:

c.queue.Push(&queue.EnrichedTask{
	// Event push time
	QueuedTime: time.Now(),
	// Unchanged Task field
	Task: func() error {
		return wrappedHandler(nil, obj, model.EventAdd)
	},
	// otype is the type of the K8s object: pod, service, endpointslice, node
	Type: otype + "-create",
})
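
On the dequeue side, the worker can derive both measurements from the enriched task, roughly like this (a simplified sketch of our instrumentation, not the exact code):

package queue

import (
	"log"
	"time"
)

// EnrichedTask mirrors the struct shown above.
type EnrichedTask struct {
	Task       func() error
	Type       string
	QueuedTime time.Time
}

// processTask shows where the two timers sit: one for how long the item sat
// in the queue before being dequeued, one for how long the callback took.
func processTask(t *EnrichedTask, queueLen int) {
	queueLatency := time.Since(t.QueuedTime)

	start := time.Now()
	err := t.Task()
	processTime := time.Since(start)

	log.Printf("Dequeuing task %s, queue time %.1f (secs), queue length: %d",
		t.Type, queueLatency.Seconds(), queueLen)
	log.Printf("Dequeued task %s, took %.1f (secs)", t.Type, processTime.Seconds())
	if err != nil {
		log.Printf("task %s failed: %v", t.Type, err)
	}
}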

After deploying these changes, it was clear that processing EndpointSlice events took longer than any other type:

Dequeuing task EndpointSlice-update , process time 81 (secs), 
queue length: 2830
Dequeued task EndpointSlice-update , took 0.8 (secs)

The dequeuing alone took more than a minute, and this was happening before the Debounce and Push queues even came into play. The queue length climbed close to 3,000 according to the logs above, and each EndpointSlice callback took 800 milliseconds to process. So even if only 500 of those 3,000 items were EndpointSlice update events, it would take approximately 400 seconds before the last event was even processed!

We saw this exact behavior during outages, where ingress pods were still sending traffic several minutes after pod termination because the whole controller queue was jammed up and took minutes to drain.

Since we knew that only EndpointSlice events were facing elevated processing times, it was time to dive deeper and add more logs to the EndpointSlice-related services.

After some back and forth adding logs in multiple places, we found that acquiring this lock took around 700 milliseconds.

ads Time taken to ep.Lock(): 0.716286

When we reported this on our GitHub issue, the maintainers quickly acknowledged it and requested that we share CPU profiles.

After we generated CPU profiles from our custom Istiod image, the maintainer swiftly addressed the issue, fixing the lock contention.

Root cause of contention

The lock was being shared between two components of Istiod Pilot:

  • While processing Kubernetes events from the controller queue (shown on the left in the image below)
  • While pushing the xDS config to the proxy
Two code paths waiting on the lock!

Actual fix

The actual fix from the maintainer moved the heavyweight processing out of the critical section, so it now runs outside the lock, and turned one side of the lock into a read lock.
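
The pattern is not Istio-specific. The sketch below is illustrative, not the actual Istio code, but it shows the general shape of the change: compute outside the critical section, hold the write lock only to swap the result in, and let the read path use a read lock:

package endpoints

import "sync"

type endpointIndex struct {
	mu    sync.RWMutex
	state map[string][]string
}

// Before (conceptually): the expensive per-endpoint processing ran while
// holding the write lock, so event processing and config pushes contended
// on the same mutex.
func (e *endpointIndex) updateHoldingLock(key string, raw []string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.state[key] = expensiveTransform(raw) // heavy work inside the lock
}

// After (conceptually): do the heavy work first, then take the lock only to
// swap in the result.
func (e *endpointIndex) update(key string, raw []string) {
	computed := expensiveTransform(raw) // heavy work outside the lock

	e.mu.Lock()
	e.state[key] = computed
	e.mu.Unlock()
}

// Readers on the push path can now share a read lock instead of queuing
// behind a single exclusive lock.
func (e *endpointIndex) snapshot(key string) []string {
	e.mu.RLock()
	defer e.mu.RUnlock()
	return e.state[key]
}

// expensiveTransform stands in for the heavyweight processing.
func expensiveTransform(in []string) []string {
	out := make([]string, len(in))
	copy(out, in)
	return out
}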

Workaround

Until the official upstream fix was in place, the team found that reducing PILOT_PUSH_THROTTLE to roughly the number of virtual CPUs largely alleviated the problem. This works because the number of goroutines on the other side (pushing config to the proxies) dropped from the default of 100 down to 16, and fewer goroutines fighting for that lock meant less contention. After implementing this workaround in our custom Istiod image, we observed the following drop in the time taken to execute the callback.

The processing times of Endpoint slices dropped off a cliff when we tweaked the environment variable, PILOT_PUSH_THROTTLE.
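
To see why this helps: PILOT_PUSH_THROTTLE effectively caps how many push goroutines run at once. The sketch below uses illustrative names to show the general semaphore-style throttling pattern and why a lower cap means fewer goroutines contending for the shared lock:

package push

import "sync"

// throttledPush runs push for each proxy with at most `throttle` goroutines
// in flight at once, a simplified analogue of what PILOT_PUSH_THROTTLE caps.
// Every in-flight pusher is a potential contender for the shared lock, so
// lowering the cap from 100 to roughly the vCPU count reduced contention.
func throttledPush(proxies []string, throttle int, push func(proxy string)) {
	sem := make(chan struct{}, throttle) // counting semaphore
	var wg sync.WaitGroup

	for _, p := range proxies {
		sem <- struct{}{} // block while `throttle` pushes are already running
		wg.Add(1)
		go func(proxy string) {
			defer wg.Done()
			defer func() { <-sem }()
			push(proxy)
		}(p)
	}
	wg.Wait()
}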

Improved telemetry

Expedia Group’s Compute Platform team added metrics to Istio around the controller queue, which can be enabled by setting the environment variable ISTIO_ENABLE_CONTROLLER_QUEUE_METRICS to true.

The following metrics were added:

  • pilot_worker_queue_depth — The total number of items in the queue at any time.
  • pilot_worker_queue_latency — Histogram of the time it takes to dequeue an item and start processing it.
  • pilot_worker_queue_duration — Histogram of the time taken to execute the callback.

The metrics are available from Istio v1.19.0.

Acknowledgments

Big shout-out to the maintainers of Istio for building such a great open-source project and for being super responsive on GitHub issues in acknowledging and fixing the problem. We are really glad they accepted Expedia Group’s contribution to improve the metrics around the controller queues for the benefit of the community at large.

This write-up showed how Expedia Group’s Compute Platform team dove deep into the issue and took systematic steps to rule out the various possibilities, navigating the challenges posed by limited observability and ultimately identifying the root cause, a result that helped both Expedia Group and the open-source Istio community at large.
