DynamoDB Latency Issue While Using Java SDK

Ajay Pilaniya
Published in Airtel Digital
6 min read · Jul 4, 2022

The Problem

At Airtel Digital, we recently encountered a tricky issue in one of our APIs, the one responsible for fetching the streaming URL of the content requested by the end user. When analyzing the latency of this Playback API, we saw that every 30 minutes or so the API took 2–3 seconds to respond. This was very strange behaviour. We checked whether any of our resources were being throttled, but none of them were even close to throttling.

This API uses two services:

1. MongoDB: to fetch content details

2. DynamoDB: to fetch user data

At Airtel Digital, we have a good analytics system in place for monitoring data points, so we added metrics to both of these services to record the execution time of every method that calls them. Since DynamoDB is known for delivering results in single-digit milliseconds, we first looked at the MongoDB cluster to see whether it was slow to fetch the data. To our surprise, the metrics showed that the step fetching user data from DynamoDB was the culprit.
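To illustrate the kind of per-method timing described above, here is a minimal sketch using only the JDK. It is not our actual analytics code: `MethodTimer`, `timed`, and the step name `fetchUserData` are hypothetical names chosen for the example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Hypothetical helper: records the most recent execution time of a named step. */
final class MethodTimer {
    // Last observed duration per step name, in milliseconds.
    private final Map<String, Long> lastMillis = new ConcurrentHashMap<>();

    /** Runs the step, records how long it took (even if it throws), and returns its result. */
    <T> T timed(String name, Supplier<T> step) {
        long start = System.nanoTime();
        try {
            return step.get();
        } finally {
            lastMillis.put(name, (System.nanoTime() - start) / 1_000_000);
        }
    }

    /** Returns the last recorded duration in milliseconds, or -1 if the step never ran. */
    long lastMillis(String name) {
        return lastMillis.getOrDefault(name, -1L);
    }
}
```

A call such as `timer.timed("fetchUserData", () -> loadUser(id))` (where `loadUser` stands in for the real lookup) would then make the slow step visible once the recorded values are shipped to a dashboard.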

In our Playback API, we make a GETITEM request to DynamoDB to fetch the user details, and in some cases this call was taking more than 2 seconds. As you can see in the graph below, on average we had 2 spikes every hour.

Root Cause Analysis (RCA)

There are many steps between our code issuing a request and that request landing on DynamoDB, and we wanted to understand exactly which step was taking the most time. So we did our analysis in the following order:

RCU Throttling

First, we needed to make sure that our DynamoDB table wasn’t being throttled for exceeding its provisioned read capacity units (RCUs):

As you can see, our RCU consumption was well below the provisioned capacity. We also never received any read throttle events, as you can see in the figure below:

GETITEM Latency

Since we make a GETITEM request in every API call, we checked its latency on the DynamoDB console:

As you can see, the maximum latency on DynamoDB was 32.5 ms, so DynamoDB itself seemed to be doing fine.

DynamoDB Client Configuration

We use the following DynamoDB client configuration across our Java applications:

ClientConfiguration config = new ClientConfiguration();
config.setRequestTimeout(500);
config.setClientExecutionTimeout(3 * 1000);
config.setRetryPolicy(PredefinedRetryPolicies.getDefaultRetryPolicyWithCustomMaxRetries(10));

AmazonDynamoDBClientBuilder.standard()
        .withClientConfiguration(config)
        .withRegion("ap-south-1")
        .build();
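To see how these settings interact, a rough back-of-the-envelope sketch helps. As we understand the SDK, `setRequestTimeout` bounds a single HTTP attempt, while `setClientExecutionTimeout` bounds the whole call including retries. The class and method names below are hypothetical, and the calculation deliberately ignores backoff delays between retries, so treat it as an upper bound rather than an exact model.

```java
/** Hypothetical budget check; ignores backoff delays between retries. */
final class TimeoutBudget {

    /** Upper bound on how many attempts can fully time out before the overall budget runs out. */
    static int maxTimedOutAttempts(int requestTimeoutMs, int clientExecutionTimeoutMs, int maxRetries) {
        if (requestTimeoutMs <= 0 || clientExecutionTimeoutMs <= 0 || maxRetries < 0) {
            throw new IllegalArgumentException("timeouts must be positive and retries non-negative");
        }
        int attemptsAllowed = maxRetries + 1;                              // first try plus retries
        int attemptsThatFit = clientExecutionTimeoutMs / requestTimeoutMs; // attempts within the budget
        return Math.min(attemptsAllowed, attemptsThatFit);
    }
}
```

With the values above, `TimeoutBudget.maxTimedOutAttempts(500, 3000, 10)` evaluates to 6: if every attempt times out, the 3-second overall budget is exhausted long before the limit of 10 retries is reached.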

AWS recommends that you tune these settings according to your requirements. Following this AWS blog, we tried fine-tuning our configuration, but we didn’t see any major improvement.

After countless iterations of trial and error, we found that fine-tuning these values wasn’t reducing the latency. Upon further research, we found that the SSL handshake can sometimes be slow, and if a handshake happens on every API call, this could add up to considerable overall latency. So we enabled TCP keep-alive on the client’s connections by adding the following line to the configuration:

config.setUseTcpKeepAlive(true);

We were hopeful that this would fix the issue, but we were wrong and found ourselves back at square one.

As a last resort to fix this, we decided to record every metric and push it to our ELK stack. So we created a class to listen for the metrics whenever a request was fired :

@Service
@Slf4j
public class DynamoMetricRequest extends RequestMetricCollector {

    @Autowired
    ApplicationEventPublisher eventPublisher;

    @Override
    public void collectMetrics(Request<?> request, Response<?> response) {
        try {
            eventPublisher.publishEvent(new DynamoMetricEvent(this, request, response));
        } catch (Exception ex) {
            log.error("Exception while collecting metrics", ex);
        }
    }
}

As you can see, we collected every DynamoDB metric and published it to the ELK stack. After about an hour, we had the following data points:

This shows us that most of the time was being taken by CredentialsRequestTime.

Upon further research, we found that this is a known issue with the AWS SDK, and so far the AWS team has not provided a clear solution. When we don’t specify a credentials provider for an AWS client, it uses DefaultAWSCredentialsProviderChain. Below are the details of how this works, as per this blog on AWS:

The default credential provider chain looks for credentials in this order:

1. Environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. The AWS SDK for Java uses the EnvironmentVariableCredentialsProvider class to load these credentials.

2. Java system properties aws.accessKeyId and aws.secretKey. The AWS SDK for Java uses the SystemPropertiesCredentialsProvider to load these credentials.

3. Web Identity Token credentials from the environment or container.

4. The default credential profiles file, typically located at ~/.aws/credentials (location can vary per platform), and shared by many of the AWS SDKs and by the AWS CLI. The AWS SDK for Java uses the ProfileCredentialsProvider to load these credentials.

You can create a credentials file by using the aws configure command provided by the AWS CLI, or you can create it by editing the file with a text editor. For information about the credentials file format, see AWS Credentials File Format.

5. Amazon ECS container credentials — loaded from the Amazon ECS if the environment variable AWS_CONTAINER_CREDENTIALS_RELATIVE_URI is set. The AWS SDK for Java uses the ContainerCredentialsProvider to load these credentials. You can specify the IP address for this value.

6. Instance profile credentials — used on EC2 instances, and delivered through the Amazon EC2 metadata service. The AWS SDK for Java uses the InstanceProfileCredentialsProvider to load these credentials. You can specify the IP address for this value.
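The lookup order above follows a simple "first provider that succeeds wins" pattern. The sketch below models that pattern with plain JDK types. It is a simplified illustration, not the SDK's implementation: `ProviderChainSketch` is a hypothetical name, and credentials are reduced to a `String`.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Supplier;

/** Simplified model of a credentials provider chain: the first provider that yields a value wins. */
final class ProviderChainSketch {

    /** Walks the providers in order and returns the first credentials found, if any. */
    static Optional<String> resolve(List<Supplier<Optional<String>>> providers) {
        for (Supplier<Optional<String>> provider : providers) {
            Optional<String> credentials = provider.get();
            if (credentials.isPresent()) {
                return credentials; // stop at the first provider that succeeds
            }
        }
        return Optional.empty(); // no provider could supply credentials
    }
}
```

On an EC2 instance with no environment variables, system properties, or profile file configured, every earlier entry in such a chain comes back empty, which is how a client ends up on the instance profile provider at the end of the list.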

In our case, the chain resolved to InstanceProfileCredentialsProvider, which retrieves credentials by making a network call to the Amazon EC2 Instance Metadata Service (IMDS). The provider keeps these credentials cached, but when they are about to expire it makes another network call to IMDS, and that call can occasionally be slow. If the call to IMDS fails, the provider falls back to the cached credentials, but since we found no option to configure a timeout on the IMDS call, there was no way to avoid this latency.
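The effect is easiest to see in a simplified model of a lazily refreshed cache. The sketch below is not the SDK's code: `LazyExpiringCache` is a hypothetical class, the `loader` stands in for the network call to IMDS, and the clock is injected only to keep the example testable. The point is that whichever caller happens to arrive after expiry pays for the refresh on its own thread.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Simplified model of a lazily refreshed cache: the caller that hits an expired value pays for the reload. */
final class LazyExpiringCache<T> {
    private final Supplier<T> loader;   // stands in for the network call to IMDS
    private final long ttlMillis;       // how long a loaded value is considered fresh
    private final LongSupplier clock;   // injectable clock, in milliseconds
    private T value;
    private long loadedAt;
    private boolean loaded;

    LazyExpiringCache(Supplier<T> loader, long ttlMillis, LongSupplier clock) {
        this.loader = loader;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    synchronized T get() {
        long now = clock.getAsLong();
        if (!loaded || now - loadedAt >= ttlMillis) {
            value = loader.get();       // runs on the calling (request) thread
            loadedAt = now;
            loaded = true;
        }
        return value;
    }
}
```

Most calls return immediately from the cache. Only the occasional call that lands after expiry is slow, which matches the pattern of isolated spikes we saw in the API.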

The Solution

When we make a GETITEM request, the DynamoDB client uses DefaultAWSCredentialsProviderChain to fetch the credentials, and when the credentials are near expiration it makes a network call to IMDS. This was the main cause of the latency spikes in our API. To keep this network call off the request path, we implemented a custom AWS credentials provider:

@Slf4j
@Service
public class AwsCustomCredentialsProvider implements AWSCredentialsProvider {
    private static final AWSCredentialsProvider awsCredentialsProvider = new DefaultAWSCredentialsProviderChain();
    private static final AtomicReference<AWSCredentials> cachedAwsCredentials = new AtomicReference<>();

    @Override
    public AWSCredentials getCredentials() {
        return cachedAwsCredentials.get() != null ? cachedAwsCredentials.get() : fetchAndUpdateCredentials();
    }

    @Override
    public void refresh() {
        fetchAndUpdateCredentials();
    }

    @Scheduled(fixedRate = 5 * 60 * 1000)
    private AWSCredentials fetchAndUpdateCredentials() {
        long start = System.currentTimeMillis();
        log.info("Refreshing AWS Credentials....");
        try {
            cachedAwsCredentials.set(awsCredentialsProvider.getCredentials());
            log.info("Refreshed AWS Credentials successfully in : {} ms", System.currentTimeMillis() - start);
        } catch (Throwable th) {
            log.error("fetchAndUpdateCredentials :: Exception occurred", th);
        }

        return cachedAwsCredentials.get();
    }
}

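This custom credentials provider asks the default chain for credentials every 5 minutes on a scheduler thread and caches the result, so any call to IMDS happens outside the request path. This way, the DynamoDB client always picks the credentials from the cache.

We configured the client to use this provider as follows: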

AmazonDynamoDBClientBuilder.standard()
        .withClientConfiguration(config)
        .withCredentials(awsCustomCredentialsProvider)
        .withMetricsCollector(dynamoMetricRequest)
        .withRegion("ap-south-1")
        .build();

And bingo, our latency was reduced and there were no more spikes!
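For readers who are not on Spring, the same background-refresh idea can be sketched with a plain `ScheduledExecutorService`. This is a minimal illustration under our own assumptions, not the code we run in production: `BackgroundRefreshedValue` is a hypothetical class and the `loader` stands in for the call to the default credentials chain.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/** Keeps a value fresh on a background thread so that readers never wait for a reload. */
final class BackgroundRefreshedValue<T> implements AutoCloseable {
    private final AtomicReference<T> cached = new AtomicReference<>();
    private final Supplier<T> loader;
    private final ScheduledExecutorService scheduler;

    BackgroundRefreshedValue(Supplier<T> loader, long period, TimeUnit unit) {
        this.loader = loader;
        this.scheduler = Executors.newSingleThreadScheduledExecutor(runnable -> {
            Thread thread = new Thread(runnable, "background-refresh");
            thread.setDaemon(true); // do not keep the JVM alive just for refreshes
            return thread;
        });
        refresh(); // load once up front; if this fails, get() returns null until a refresh succeeds
        scheduler.scheduleAtFixedRate(this::refresh, period, period, unit);
    }

    private void refresh() {
        try {
            cached.set(loader.get());
        } catch (RuntimeException e) {
            // Keep serving the previous value; an uncaught exception would cancel future runs.
        }
    }

    /** Never blocks on the loader: always returns whatever was cached last. */
    T get() {
        return cached.get();
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

The trade-off is the same as in our provider: readers may see a value that is up to one refresh period old, so the period has to stay comfortably shorter than the lifetime of the credentials.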

Conclusion

If your application uses the AWS SDK for Java, you should ideally configure a metrics collector and a custom credentials provider to reduce the response latency of your APIs.
