EXPEDIA GROUP TECHNOLOGY — SOFTWARE

All about ISTIO-PROXY 5xx Issues

Sharing my experience of the Istio service mesh’s most common 5xx issues, and how to identify and resolve them

Vinod Canumalla
Expedia Group Technology

--

What is the Issue?

At Hotels.com™ (part of Expedia Group™) we recently implemented an Istio service mesh. The Istio sidecar injector automatically attaches an istio-proxy sidecar to every pod. Once all apps are enabled with istio-proxy sidecars, a common issue faced by many apps and services is HTTP 503/504 response errors for upstream service calls. These errors are generated by the istio-proxy sidecar. In the Reliability Engineering team we are often called on to analyse these subtle, intermittent issues, and this blogpost details that analysis. We found that the HTTP 503 errors can occur on inbound or outbound calls between apps within the service mesh, or on calls made to apps external to the mesh. The most impactful are outbound calls going to services outside the mesh over the internet; the least impactful are inbound calls to app containers within the mesh. The HTTP 504 errors are less common, but do happen when a response from an upstream service takes longer than 15 seconds, which is the default response timeout within istio-proxy.

Istio Ingress-gateway metrics dashboard (not from Istio, created in-house)

The symptoms of these issues look the same as errors from genuine upstream component failures. However, the istio-proxy itself can also cause these errors, because the default Istio configuration is not a good fit for every application in the mesh. The pattern you typically notice for the more impactful outbound service calls is a burst of 503 errors for a period of time, then a recovery on its own for some time, with that sequence repeating.
For the less impactful inbound calls, you may not see any errors on your application metrics dashboard, because the client retries and almost always succeeds on retry; however, the 503 errors are still present in the istio-proxy logs.
For outbound calls, the 504 response timeout error does impact requests, but it is only an issue if your component dependencies’ SLOs are longer than 15 seconds.

You can see the HTTP 503/504 upstream errors in the istio-proxy logs. For more information on the istio-proxy log format and response flags, refer to the Envoy access log format documentation.


Before looking at the reasons for these errors, I’ll show a few sample error log messages:

Inbound Call Error example

Below is an entry in the istio-proxy sidecar log for an incoming request, where the “UC” response flag indicates “upstream connection termination in addition to 503 response code”, meaning that for some reason the connection between istio-proxy and its upstream container within the pod was terminated.

There is also another response flag, “UF”, which indicates “upstream connection failure in addition to 503 response code”; again, for some reason istio-proxy was unable to connect to its upstream container within the pod.

[2020-01-23T09:29:59.633Z] "GET /service_a/v2.1/get?ids=123456 HTTP/1.1" 503 UC "-" "-" 0 95 11 - "-" "-" "bc6f9ba3-d8dc-4955-a092-64617d9277c8" "srv-a.xyz" "127.0.0.1:8080" inbound|80|http|srv-a.xyz.svc.cluster.local - 100.64.121.11:8080 100.64.254.9:35174 -

[2019-12-12T08:38:43.976Z] "GET /service_b/v2.5/get?getIds=123456 HTTP/1.1" 503 UF "-" "-" 0 91 0 - "-" "-" "d0dbd0fe-90db-46d0-b614-7eaea3764f79" "srv-b.xyz.svc.cluster.local" "127.0.0.1:8080" inbound|80|http|srv-b.xyz.svc.cluster.local - 100.65.130.7:8080 100.64.187.3:53378 -

Outbound Call Error example

Below is an example error log message from one of the application pod’s istio-proxy sidecar logs, where the app is accessing an external service domain “external.service.com”. The “UF” response flag next to the HTTP 503 code indicates “upstream connection failure in addition to 503 response code”, similar to the inbound case above: for some reason the proxy was unable to establish a connection to the remote host.

[2019-10-25T17:27:02.903Z] "GET /External.WebService/v2/Ids HTTP/1.1" 503 UF "-" "-" 0 91 0 - "-" "Apache-HttpClient/4.5.5 (Java/11)" "c75a750d-9efd-49e7-a419-6ed97298b543" "external.service.com" "55.92.142.17:80" PassthroughCluster - 55.92.142.17:80 100.65.254.14:39140 -

Below are a few example error log messages for an upstream response timeout, where the “UT” response flag next to the HTTP 504 code indicates “upstream request timeout in addition to 504 response code”. Here the istio-proxy waited up to 15,000 milliseconds before giving up, as shown by the request duration field (15004 and 15001 in the examples below, the fifth field after the response flag).

[2020-01-23T07:18:17.609Z] "POST /ext_svc/1111000222333/add?currency=GBP&locale=en_GB HTTP/1.1" 504 UT "-" "-" 150 24 15004 - "-" "Jersey/2.27 (HttpUrlConnection 11.0.4)" "7e9a9207-690f-45b1-b574-fc806b117a84" "testservice.prod.com" "10.186.199.78:80" PassthroughCluster - 10.186.199.78:80 100.65.161.7:37302 -

[2020-01-22T23:05:06.499Z] "GET /api/1234567 HTTP/1.1" 504 UT "-" "-" 0 24 15001 - "-" "ABC.3.1.449" "4403f657-f324-4c60-befd-343a23e9520a" "apisvc.prod.com" "10.186.199.78:80" PassthroughCluster - 10.186.199.78:80 100.66.91.11:50288 -

Note: for TLS connections without Istio mutual TLS (mTLS), there will not be any HTTP status codes or UT/UC response flags reported in the proxy logs. The traffic is TLS-encrypted and cannot be inspected by istio-proxy unless mTLS is used.


Why is this happening?

Debugging these issues requires a combination of approaches: manual tests while observing logs and error patterns, istio-proxy debug/trace logging, and tcpdump packet captures. All the data collected led to the following conclusions.

For Inbound calls

All inbound calls to an application are intercepted by iptables rules set up by Istio within the pod, which redirect the incoming request to the istio-proxy container listening on port 15001. The istio-proxy then opens a new connection over the loopback interface (127.0.0.1) to the application container, since the application and istio-proxy sidecar containers share the same network namespace.

As istio-proxy works at layer 7, it does per-request load balancing on persistent connections, for both inbound and outbound connections between pods in the cluster. The issue is that, to work around the limitations of Kubernetes’ default layer 4 (TCP connection) load balancing, as described in my other blogpost, application servers such as Tomcat or Finagle are configured with parameters like maxKeepAliveRequests or a server session MaxLifeTime to recycle connections. Connections are recycled after serving a certain number of requests, or periodically after a fixed time.

In the figure below, the connection shown by the red arrow is terminated by the application server while istio-proxy is already pipelining another request onto it, which results in an upstream connection termination and a 503 error code (response flag UC). This issue is more visible on high request-rate applications, where the chance of pipelining more requests onto an established connection is higher.

Inbound / outbound request flow in the istio-proxy sidecar (the iptables redirect makes the flow go through the sidecars)

There is also another type of 503 error for inbound requests, with response flag UF (upstream connection failure). This happens when the service container itself crashes (e.g. due to an OOM kill or any other error that causes the service container to crash).

For Outbound calls

In a standard Istio mesh environment, a request to an external service for which there is no specific Istio configuration (such as a ServiceEntry) is routed through istio-proxy as a PassthroughCluster request, so no routing or traffic management is applied; it is plain HTTP/1.1 over a TCP connection to the destination service host IP address. The application initiating an outbound request performs DNS resolution of the external service domain before the request leaves the localhost network interface. The iptables rules within the pod then redirect the request to the istio-proxy sidecar, which terminates the connection and opens a new connection to the remote host IP to proxy the request.

istio-proxy outbound request flow from an Istio-enabled pod through the egress firewall to the external service

The istio-proxy does not close a persistent connection unless the application that initiated the connection closes it. As a result, the application performs no further DNS resolutions for new requests, because from its point of view the connection is already established. Over time this leads to istio-proxy returning a 503 error to the application in these scenarios:

  • External service call going through a corporate security component: if access to the external service host is over the internet or outside the corporate network, the call goes through a security component (an egress firewall), which does two things when a DNS request comes in: 1) it checks whether access to the external service is allowed by comparing against a list of allowed domains, and 2) if it is allowed, it adds the resolved IPs to an IPSET, along with the ports found in the whitelist file, with an expiration time of 1 hour (3600 seconds) by default. These IPSET rules appear as below on the firewall host:
[ec2-user@egress-proxy-4 ~]$ ipset list | grep external
55.92.142.17,tcp:80 timeout 3697 comment "external.service.com."
55.92.142.17,tcp:443 timeout 3697 comment "external.service.com."

Once the above rule expires, any new request from istio-proxy that reuses the same IP address without a fresh DNS resolution fails with an error like this, seen in the istio-proxy trace/debug logs: “upstream connect error or disconnect/reset before headers. reset reason: connection failure” with HTTP/1.1 503. The reason the application sometimes recovers on its own is that another application using the same external service may have just started and performed its first request, which re-adds the IPSET rule and lets all apps going through the same firewall instance succeed again.

  • External service host is an AWS ELB with dynamic IPs: in this scenario the 503 issue is rare, because the host IPs change less frequently than component deployments within the cluster. However, istio-proxy can still produce 503 errors if the application uses persistent connections to the external host IPs and the ELB instances are added or replaced.

What are the possible solutions?

For Inbound calls

Ideally the istio-proxy within the pod should handle upstream connection close events from the localhost upstream server. Until we understand whether this behaviour is a bug or by design in later versions of Istio, a simple workaround resolves the issue for inbound calls: allow unlimited requests to be pipelined on a connection and rely only on an idle timeout for connection recycling.

for Tomcat app server (Connector setting; -1 allows an unlimited number of keep-alive requests per connection)
maxKeepAliveRequests = -1

for Finagle app server
# Don't set any value for the server session's maxLifeTime; it then defaults to unbounded.
maxLifeTime(timeout: Duration)

In the past, in-cluster request routing within a Kubernetes cluster went through the ServiceIP (<service_name>.<namespace>), which works at layer 3/4 and only load balances connections. It was therefore necessary to close connections after serving a certain number of requests in order to distribute requests evenly across instances. Although frequent connection recycling adds latency, it is preferable to an uneven request load that overloads some instances. With Istio there is no need for connection recycling, because istio-proxy works at layer 7 and load balances requests across persistent connections, which also improves latency. With layer 7 request load balancing, the only time a connection needs to be closed is on idle timeout, and both server and client applications should implement idle connection eviction.
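
The sidecar’s own connection pool toward an upstream service can be tuned in a similar spirit through a DestinationRule. The snippet below is only a sketch for the srv-a service seen in the earlier logs: the resource name is hypothetical, and the connectionPool.http.idleTimeout field is not available in every Istio release, so verify the fields against your version’s reference docs before relying on it.

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: srv-a-dr                       # hypothetical name, for illustration only
spec:
  host: srv-a.xyz.svc.cluster.local
  trafficPolicy:
    connectionPool:
      http:
        maxRequestsPerConnection: 0    # 0 = unlimited requests per connection (no recycling)
        idleTimeout: 60s               # close idle upstream connections after 60s (check version support)

Note that this only tunes the sidecar’s connection pool; the application servers still need their own idle-timeout settings as described above.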

For Outbound calls

We now know the reason for the 503 errors on outbound calls through istio-proxy: DNS resolution for the external service host needs to be performed periodically. There are a few options to resolve this:

  • Don’t use persistent connections for outbound calls: this resolves the issue but does not scale when you have many requests per second, and it requires a code change for an existing application that uses persistent connections through an HTTP client connection pool.
  • Perform DNS resolutions periodically: change your application to periodically close the connection and open a fresh one, forcing a new DNS resolution. This works, but again requires code changes to the existing application.
  • Configure Istio to perform DNS resolutions periodically: this is the best solution and can be implemented very easily without changing application code. Istio traffic routing lets you configure a ServiceEntry and VirtualService for routing calls to external services; configuring these for the external services avoids the 503 issues.
  • Configure Istio to increase the response timeout: to avoid 504 response timeout errors and accommodate upstream service calls that take longer than 15 seconds, configure the timeout value within the VirtualService as shown below.

The following example ServiceEntry and VirtualService instruct istio-proxy to perform DNS resolution periodically and set the response timeout to 30s instead of the default 15s.

ServiceEntry

apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: ext-service-se
spec:
  hosts:
  - external.service.com
  location: MESH_EXTERNAL
  ports:
  - number: 80
    name: service-http
    protocol: HTTP
  resolution: DNS

VirtualService

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: ext-service-vs
spec:
  hosts:
  - external.service.com
  http:
  - timeout: 30s
    route:
    - destination:
        host: external.service.com
      weight: 100

The above ServiceEntry with DNS resolution uses a 300-second DNS refresh interval by default, which can be seen in the istio-proxy config dump:

"dns_refresh_rate": "300s",
"dns_lookup_family": "V4_ONLY",
"load_assignment": {
"cluster_name": "outbound|80||external.service.com",
"endpoints": [
{
"lb_endpoints": [
{
"endpoint": {
"address": {
"socket_address": {
"address": "external.service.com",
"port_value": 80
}
}
},
"load_balancing_weight": 1
}
],
"load_balancing_weight": 1
}
]
}

This DNS refresh interval is too long for many applications in AWS clusters; AWS best-practice guidance recommends a DNS cache TTL as low as 60 seconds. The DNS refresh rate parameter, global.proxy.dnsRefreshRate, is a global istio-proxy setting that defaults to 300 seconds. Changing it affects the entire mesh, so take care to choose a common value that is suitable for all apps and services in the mesh. Your Istio platform team should review and make the necessary change to the DNS refresh rate.
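
For reference, here is a minimal sketch of how this could be overridden, assuming a Helm-based Istio install where the parameter maps to a chart values key; the exact key names depend on your Istio version and installation method, so verify them against your install before applying.

# Helm values override for the Istio chart (key names assumed; verify for your version)
global:
  proxy:
    dnsRefreshRate: 60s   # default is 300s; 60s aligns with the AWS DNS TTL guidance above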

Summary

So, with an istio-proxy sidecar added to your applications, a 5xx error can have a number of causes, which must be carefully analysed by going through application metrics, upstream service metrics, istio-proxy logs and Istio configuration. Further investigation may also require enabling istio-proxy debug/trace logs and capturing network packets with tcpdump in a lab environment.

Setting up the appropriate ServiceEntry and VirtualService Istio configuration and the server-side connection keep-alive configuration, as described above, is key to resolving the HTTP 503 and 504 errors for upstream service calls.

Analyses like these can be time consuming and require a wide range of skills. In the Reliability Engineering team we feel it is helpful to the community to report our results in detail, so that others encountering these issues can benefit from our investigations. I hope you find this analysis and the explanations useful.

Update: at the time of writing, the Istio version used was 1.2.10. Most of the points discussed in this blog should apply to other Istio versions too, but some of the default values or configuration mentioned above may not match the latest Istio documentation.
