Expedia Group Technology — Engineering
Service Slow to Retrieve AWS Credentials
Working around high latencies while using the AWS Java SDK
The problem
We recently came across a tricky production issue at Expedia Group™️. One of our Kotlin services had intermittent response time spikes of up to five seconds.
This was surprising because the service is a simple proxy to an AWS DynamoDB table, and Dynamo is notoriously fast. The service itself is running in AWS, so round trip times to Dynamo were expected to be low.
In an attempt to improve the response times, we updated our Maven pom.xml file to use the highest 1.x version of the AWS Java SDK. This capped the latency spikes at one second instead of five, but didn't reduce their frequency. To get more insight, we enabled the AWS SDK default metrics, following an AWS blog post on the topic, and enabled the logger com.amazonaws.latency at DEBUG level in Logback. This both pushes useful metrics to CloudWatch and logs them in your log file.
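For reference, a minimal Logback configuration sketch that enables this logger (the appender name and pattern here are illustrative, not from our actual setup):

```xml
<configuration>
  <appender name="FILE" class="ch.qos.logback.core.FileAppender">
    <file>service.log</file>
    <encoder>
      <pattern>%d %-5level %logger - %msg%n</pattern>
    </encoder>
  </appender>

  <!-- Log the AWS SDK latency metrics at DEBUG level -->
  <logger name="com.amazonaws.latency" level="DEBUG"/>

  <root level="INFO">
    <appender-ref ref="FILE"/>
  </root>
</configuration>
```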
Credentials request time
We then used this Splunk query to go through the logs of the service and return the slow calls to Dynamo:
logger=com.amazonaws.latency | rex field=message "ClientExecuteTime=\[(?<client_exec_time>[^\]]*)" | where client_exec_time > 500
Here is a fragment from the logs with times in ms:
ClientExecuteTime=[884.268]
HttpRequestTime=[20.519]
CredentialsRequestTime=[862.945]
It seemed that the request to Dynamo took a long time to execute (ClientExecuteTime), not because of the round trip time to Dynamo (HttpRequestTime), but because of the time to request the credentials (CredentialsRequestTime). This made us wonder: what is happening during the credentials request?
It is common to configure the AWS SDK to use DefaultAWSCredentialsProviderChain to obtain the credentials for accessing an AWS service. The provider chain sequentially tries each credentials provider until it finds one able to return the credentials. In our case the provider was InstanceProfileCredentialsProvider. This provider retrieves credentials by making a network call to the Amazon EC2 Instance Metadata Service (IMDS). By default the call goes to 169.254.169.254, with paths like /latest/meta-data/iam/security-credentials/[..].
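The chain pattern itself is straightforward; here is a stdlib-only sketch of it (the interface and class names are illustrative stand-ins, not the SDK's actual AWSCredentialsProvider types):

```java
import java.util.List;

// Illustrative stand-in for the SDK's credentials provider abstraction
interface CredentialsProvider {
    String getCredentials(); // returns credentials, or throws if unavailable
}

class ProviderChain implements CredentialsProvider {
    private final List<CredentialsProvider> providers;

    ProviderChain(List<CredentialsProvider> providers) {
        this.providers = providers;
    }

    @Override
    public String getCredentials() {
        // Try each provider in order; the first one that succeeds wins
        for (CredentialsProvider p : providers) {
            try {
                return p.getCredentials();
            } catch (RuntimeException e) {
                // this provider couldn't supply credentials; try the next one
            }
        }
        throw new IllegalStateException("No provider could supply credentials");
    }
}
```

In the real chain, an instance-profile provider like the one above sits last, after environment variables and system properties have been tried.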
Indeed, we could find all these network calls to IMDS in the logs, with response time spikes of up to one second.
So it seemed clear that the problem was with slow calls from the service to IMDS to retrieve credentials. We looked for solutions on the web.
Known issue
It turned out that intermittent high-latency responses from IMDS are a known, widespread problem (see #1, #2, #3, #4, #5, #6) without a real solution from AWS so far. Moreover, the SDK has a non-configurable one-second timeout on the IMDS call. InstanceProfileCredentialsProvider, and in particular its credentials fetcher, caches the retrieved credentials and refreshes the cache when they are expired or near expiration, and in any case once per hour; if the upstream call to IMDS fails, it falls back to the cached credentials. That's why the latency peaks to IMDS were capped at one second, yet the calls to Dynamo weren't failing: the credentials provider was falling back to the cached credentials, which were still valid.
Now, if the SDK supported configuring the connection timeout to IMDS, we would have set it much lower than one second to fail faster, but it is not configurable (see here). InstanceProfileCredentialsProvider also has an optional asynchronous refresh mechanism, in which a background thread periodically fetches the credentials. Unfortunately, the way it is implemented doesn't force a cache refresh, so it doesn't help prevent high-latency responses (see here for an example of someone enabling it without improvement).
Our custom solution
As an internal custom solution, we implemented a credentials provider named AsyncCachedCredentialsProvider, which wraps an existing upstream credentials provider, caches the credentials retrieved by the upstream provider, and periodically forces an asynchronous cache refresh. This way, live service requests always hit the cache, with no added latency.
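A minimal sketch of the idea, using only the JDK (the class name matches ours, but the simplified credentials type and upstream Supplier are illustrative stand-ins for the SDK's AWSCredentialsProvider interface):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Wraps an upstream credentials source, serving cached credentials on every call
// and force-refreshing the cache asynchronously on a fixed schedule.
class AsyncCachedCredentialsProvider {
    private final Supplier<String> upstream; // illustrative stand-in for the upstream provider
    private final AtomicReference<String> cache = new AtomicReference<>();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "credentials-refresh");
                t.setDaemon(true);
                return t;
            });

    AsyncCachedCredentialsProvider(Supplier<String> upstream, long refreshPeriodSeconds) {
        this.upstream = upstream;
        cache.set(upstream.get()); // one blocking fetch at startup to seed the cache
        scheduler.scheduleAtFixedRate(this::refresh,
                refreshPeriodSeconds, refreshPeriodSeconds, TimeUnit.SECONDS);
    }

    private void refresh() {
        try {
            cache.set(upstream.get()); // slow IMDS calls happen here, off the request path
        } catch (RuntimeException e) {
            // keep serving the previously cached credentials, like the SDK's own fallback
        }
    }

    String getCredentials() {
        return cache.get(); // live requests never wait on IMDS
    }
}
```

The key difference from the SDK's built-in async refresh is that the background task unconditionally replaces the cached value, so a request thread never triggers a synchronous fetch.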
After redeploying the service using AsyncCachedCredentialsProvider, the response time spikes were gone.
Conclusion
If you are using the AWS Java SDK on EC2, your application may be affected by this IMDS problem. We've shown how a custom credentials provider can improve application response times.
And that was it! I hope this was useful.
This blog post is provided for educational purposes only. Certain scenarios may utilize fictitious names and test data for illustrative purposes. The content is intended as a contribution to the technological community at large and may not necessarily reflect any technology currently being implemented or planned for implementation at Expedia Group, Inc.