Expedia Group Technology — Engineering

Service Slow to Retrieve AWS Credentials

Working around high latencies while using the AWS Java SDK

Francesco Feltrinelli
Expedia Group Technology


An hourglass flowing sand
Photo by NeONBRAND on Unsplash

The problem

We recently came across a tricky production issue at Expedia Group™️. One of our Kotlin services had intermittent response time spikes of up to five seconds.

Diagram showing service response time spikes up to 5 seconds
Service response time spikes up to 5 seconds

This was surprising because the service is a simple proxy in front of an AWS DynamoDB table, and Dynamo is known to be fast. The service itself runs in AWS, so round-trip times to Dynamo were expected to be low.

In an attempt to improve the response times, we updated our Maven pom.xml to use the latest 1.x version of the AWS Java SDK. This had the positive effect of capping the latency spikes at one second instead of five, but didn’t reduce their frequency. To get more insight, we enabled the AWS SDK default metrics (following this blog post) and set the com.amazonaws.latency logger to DEBUG level in Logback. This both pushes useful metrics to CloudWatch and writes them to the log file.
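For reference, this is roughly what that setup looks like in code. It is a minimal sketch, assuming Logback is the SLF4J binding; the logger would normally be configured in logback.xml, and the metrics can equally be enabled with the com.amazonaws.sdk.enableDefaultMetrics JVM flag instead of programmatically:

import ch.qos.logback.classic.Level
import ch.qos.logback.classic.Logger
import com.amazonaws.metrics.AwsSdkMetrics
import org.slf4j.LoggerFactory

fun enableAwsSdkLatencyInsights() {
    // Push the AWS SDK default metrics (including latency metrics) to CloudWatch.
    // Equivalent to starting the JVM with -Dcom.amazonaws.sdk.enableDefaultMetrics.
    AwsSdkMetrics.enableDefaultMetrics()

    // Log the per-request latency breakdown (ClientExecuteTime, HttpRequestTime,
    // CredentialsRequestTime, ...) to the application log. This is normally done
    // in logback.xml with a logger named com.amazonaws.latency at DEBUG level.
    (LoggerFactory.getLogger("com.amazonaws.latency") as Logger).level = Level.DEBUG
}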

Credentials request time

We then used this Splunk query to search the service’s logs for slow calls to Dynamo:

logger=com.amazonaws.latency | rex field=message "ClientExecuteTime=\[(?<client_exec_time>[^\]]*)" | where client_exec_time > 500

Here is a fragment from the logs with times in ms:

ClientExecuteTime=[884.268]
HttpRequestTime=[20.519]
CredentialsRequestTime=[862.945]

It seemed that the request to Dynamo took a long time to execute (ClientExecuteTime) not because of the round trip to Dynamo (HttpRequestTime), but because of the time taken to request credentials (CredentialsRequestTime). This made us wonder: what happens during the credentials request?

It is common to configure the AWS SDK to use DefaultAWSCredentialsProviderChain to obtain the credentials for accessing an AWS service. The provider chain tries each credentials provider in sequence until it finds one able to return credentials. In our case that provider was InstanceProfileCredentialsProvider, which retrieves credentials by making a network call to the Amazon EC2 Instance Metadata Service (IMDS). By default the call goes to 169.254.169.254, with paths like /latest/meta-data/iam/security-credentials/[..].
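For context, this is a minimal sketch of how such a client is typically built with the default provider chain (the region here is just an example, not the one our service uses):

import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
import com.amazonaws.regions.Regions
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder

// Build a DynamoDB client that resolves credentials through the default chain.
// On EC2, without environment or profile credentials, the chain ends up on
// InstanceProfileCredentialsProvider, which calls IMDS over the network.
val dynamoDb = AmazonDynamoDBClientBuilder.standard()
    .withRegion(Regions.US_EAST_1) // example region
    .withCredentials(DefaultAWSCredentialsProviderChain.getInstance())
    .build()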

Indeed, we could see all these network calls to IMDS in the logs, with response time spikes of up to one second:

Diagram showing IMDS response time spikes up to 1 sec
IMDS response time spikes up to 1 sec

So it seemed clear that the problem was slow calls from the service to IMDS to retrieve credentials. We looked for solutions on the web.

Known issue

It turned out that intermittent high-latency responses from IMDS are a known global problem (see #1, #2, #3, #4, #5, #6) with no real solution from AWS so far. Moreover, the SDK applies a non-configurable one-second timeout to these calls. InstanceProfileCredentialsProvider, and in particular its credentials fetcher, has logic to cache the retrieved credentials and to refresh the cache when they are expired or near expiration, and in any case once per hour; if the call to IMDS fails, it falls back to the cached credentials. That’s why the latency peaks towards IMDS were capped at one second and yet the calls to Dynamo didn’t fail: the credentials provider fell back to the cached credentials, which were still valid.

Now, if the SDK allowed the IMDS connection timeout to be configured, we would have set it well below one second to fail faster, but it is not configurable (see here). Moreover, InstanceProfileCredentialsProvider has an optional asynchronous refresh mechanism, where a background thread periodically fetches the credentials. Unfortunately, the way it is implemented doesn’t force a cache refresh, so it doesn’t help prevent the high-latency responses (see here for an example of someone enabling it without improvement).
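For completeness, this is roughly how that built-in asynchronous refresh can be enabled; a sketch only, and as noted above it did not remove the latency spikes for us:

import com.amazonaws.auth.InstanceProfileCredentialsProvider
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder

// Opt in to the SDK's built-in background refresh thread.
// It refreshes the cache periodically but does not force a refresh,
// so a slow IMDS call can still land on the request path.
val provider = InstanceProfileCredentialsProvider(true)

val dynamoDb = AmazonDynamoDBClientBuilder.standard()
    .withCredentials(provider)
    .build()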

Our custom solution

As an internal customized solution, we implemented a credentials provider named AsyncCachedCredentialsProvider, which wraps an existing upstream credentials provider, caches the credentials it retrieves, and periodically forces a cache refresh on a background thread. This way, live service requests always hit the cache and never pay the IMDS latency.
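The actual implementation is internal, but the idea can be sketched roughly as follows; the class shape, refresh period and error handling below are assumptions for illustration, not the exact code:

import com.amazonaws.auth.AWSCredentials
import com.amazonaws.auth.AWSCredentialsProvider
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit
import java.util.concurrent.atomic.AtomicReference

// Rough sketch: wrap an upstream provider, serve credentials from an in-memory
// cache, and force the upstream provider to refresh on a background thread.
class AsyncCachedCredentialsProvider(
    private val upstream: AWSCredentialsProvider,
    refreshPeriodMinutes: Long = 5 // assumed value, not the real one
) : AWSCredentialsProvider {

    // Eagerly fetch once at startup so the cache is never empty.
    private val cache = AtomicReference(upstream.credentials)

    private val scheduler = Executors.newSingleThreadScheduledExecutor { runnable ->
        Thread(runnable, "credentials-refresh").apply { isDaemon = true }
    }

    init {
        scheduler.scheduleAtFixedRate(
            { runCatching { refresh() } }, // a failed IMDS call must not kill the thread
            refreshPeriodMinutes, refreshPeriodMinutes, TimeUnit.MINUTES
        )
    }

    // Live requests always read from the cache and never wait on IMDS.
    override fun getCredentials(): AWSCredentials = cache.get()

    // Force the upstream provider to re-fetch, then update the cache.
    override fun refresh() {
        upstream.refresh()
        cache.set(upstream.credentials)
    }
}

The provider then wraps the chain the service was already using and is plugged into the DynamoDB client builder via withCredentials(...), so the rest of the service code stays unchanged.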

After redeploying the service using AsyncCachedCredentialsProvider, the response time spikes were gone:

Diagram showing that service response time spikes disappeared after using AsyncCachedCredentialsProvider
Service response time spikes disappeared using AsyncCachedCredentialsProvider

Conclusion

If you are using the AWS Java SDK, your application may well be affected by this global IMDS problem. We’ve shown how a custom credentials provider can improve application response times.

And that was it! I hope this was useful.

This blog post is provided for educational purposes only. Certain scenarios may utilize fictitious names and test data for illustrative purposes. The content is intended as a contribution to the technological community at large and may not necessarily reflect any technology currently being implemented or planned for implementation at Expedia Group, Inc.
