Monitoring API Call Retries

G. Nervadof · Published in Qoala Engineering · Nov 29, 2021 · 5 min read

The title may sound trivial, and many people have already implemented it, but it is often overlooked. To start the story with the retry mechanism: failure is unfortunate, but it can happen any time one system calls another. A retry mechanism is helpful in many of those cases, and monitoring that retry mechanism with metrics and logs is often even more helpful.

Qoala, as an insurtech company, has many partners and dependencies on other services, so integration with them is inevitable. A retry mechanism, along with supporting information like metrics and logs, is crucial. At Qoala, we implement retry mechanisms for calling some third-party services/APIs, and we add metrics and/or logs between retries, as well as when the operation finally succeeds or fails, for monitoring and troubleshooting purposes.

There are many ways to implement a retry mechanism: you can use a third-party library or implement your own. Besides backoff strategies (which determine how long to wait before the next attempt, e.g. fixed delay, exponential delay, etc.), one feature that we need is an interceptor before or after each retry, meaning we can run some actions before or after retrying the failed API request, such as adding custom metrics and/or logs for our monitoring. Before we go into the details, let's take a look at how metrics and logs help with observability, and how and where we store them at Qoala.

Metrics help with measuring component functionality, monitoring, and defining thresholds for usage that requires attention. At Qoala, we store the metrics in Datadog. If necessary, we can also set alerts/notifications when a certain metric threshold has been exceeded; for example, we set up Datadog to send alerts via email if one metric (for a failure event) exceeds xx counts in yy hours, depending on the case. On the other hand, logs help with the more technical details of a specific occurrence and can be used to investigate problems and their root causes. In the case of API calls and retries, the logs should certainly contain the API request and response data, so they remain useful if all of the retries fail and we need to reprocess those calls somehow.
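As a rough illustration of what such an alert could be driven by (the metric name, threshold, and time window below are purely made up for the example), a Datadog metric monitor uses a query along these lines:

```javascript
// Purely illustrative Datadog monitor query: alert when the failure counter
// exceeds 50 increments over the last 4 hours. The numbers are not our real thresholds.
const failureAlertQuery =
  'sum(last_4h):sum:travel_insurance_policy_api_call{status:failed}.as_count() > 50';
```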

Let's go into the implementation. For the metrics and logs in the retry mechanism, we use something like this in our Node.js services; it's simple and straightforward.
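A minimal sketch of the idea, with option and hook names (callWithRetry, onRetry, onSuccess, onFailure) chosen just for illustration rather than taken from our exact code, could look like this:

```javascript
// Minimal retry helper sketch (illustrative, not the exact Qoala implementation).
// It retries a failing async call with a configurable backoff and exposes hooks
// so the caller can emit metrics/logs before each retry and on success/failure.
async function callWithRetry(fn, {
  maxAttempts = 3,
  baseDelayMs = 500,
  strategy = 'exponential', // or 'fixed'
  onRetry = () => {},       // hook: called before each retry
  onSuccess = () => {},     // hook: called when the call finally succeeds
  onFailure = () => {},     // hook: called when all attempts have failed
} = {}) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    try {
      const result = await fn();
      onSuccess({ attempt, result });
      return result;
    } catch (err) {
      lastError = err;
      if (attempt === maxAttempts) break;
      const delay = strategy === 'exponential'
        ? baseDelayMs * 2 ** (attempt - 1)
        : baseDelayMs;
      onRetry({ attempt, delay, error: err });
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  onFailure({ attempts: maxAttempts, error: lastError });
  throw lastError;
}

module.exports = { callWithRetry };
```

The onRetry, onSuccess, and onFailure hooks are where the metric and log calls described below would go.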

One of the Node.js libraries we can use to send metric events to Datadog is hot-shots, a StatsD client. The code/wrapper snippet for sending custom metrics looks like this.
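A minimal sketch of such a wrapper, assuming a helper named sendMetric and default hot-shots client options, could be:

```javascript
// Minimal metric wrapper sketch using the hot-shots StatsD client.
// The sendMetric helper and the client options are illustrative assumptions.
const StatsD = require('hot-shots');

const client = new StatsD({
  errorHandler: (err) => console.error('StatsD error', err),
});

// Increment a counter metric by 1, tagged e.g. ['status:active'] or ['status:failed'].
function sendMetric(metricName, metricTags = []) {
  client.increment(metricName, 1, metricTags);
}

module.exports = { sendMetric };
```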

You can see the metricName and metricTags variables there. metricName is, well, the metric name; we can filter the metrics in Datadog by this name. metricTags is an array containing the tags that we want to measure; a tag is usually in the format "key:value" for further filtering. The increment method simply increments a counter, and in this context the counter is effectively the metric tag itself. Later on, we can see how many times those metric tags were triggered within a certain time range. At Qoala, we have metrics for calling insurance APIs such as issuing insurance policies; for example, the metric name could be travel_insurance_policy_api_call, and the metric tags could be status:active for a successful call to the insurance API and status:failed for a failure after all retry attempts fail. We increment the count of these tags whenever issuing a policy succeeds or fails.
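Putting those pieces together, a hypothetical call site for issuing a travel policy might look like the following; issueTravelPolicy, the module paths, and the status:retried tag are assumptions for illustration only:

```javascript
// Hypothetical call site combining the retry helper and the metric wrapper above.
const { callWithRetry } = require('./callWithRetry');
const { sendMetric } = require('./sendMetric');

const METRIC_NAME = 'travel_insurance_policy_api_call';

// Placeholder for the actual third-party API request (not shown here).
async function issueTravelPolicy(payload) {
  // e.g. an HTTP POST to the insurance partner's policy endpoint
  return { policyNumber: 'TRV-0001', payload };
}

async function issuePolicyWithMonitoring(payload) {
  return callWithRetry(() => issueTravelPolicy(payload), {
    maxAttempts: 3,
    strategy: 'exponential',
    onRetry: ({ attempt, error }) => {
      sendMetric(METRIC_NAME, ['status:retried']);
      console.warn(`Retrying policy issuance (attempt ${attempt}): ${error.message}`);
    },
    onSuccess: () => sendMetric(METRIC_NAME, ['status:active']),
    onFailure: ({ error }) => {
      sendMetric(METRIC_NAME, ['status:failed']);
      console.error(`Policy issuance failed after all retries: ${error.message}`);
    },
  });
}

module.exports = { issuePolicyWithMonitoring };
```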

The metrics can then be viewed in a Datadog dashboard and visualized, for example, as a time series showing how many calls succeeded, were retried, or failed over time. Other monitoring services likely have a similar feature.

If you haven't used metrics in your services and you'd like to use them for monitoring your API calls and retries, it's good to have some tags that describe the corresponding event, be it a failed, retried, or successful API call. You can also add more tags depending on your case.

We use AWS CloudWatch and a database as the logging targets. There is nothing fancy about the logging: we store the timestamp of the event, the request, the response, and some other specific data. You could use structured logging for better log filtering, log levels to distinguish the importance of logs and reduce noise, sensitive-data masking, and other logging best practices, but the important thing is the data itself. For retried API calls, it's obviously important to have the request and response data in the logs, so we can reprocess them later if all the retry attempts fail. Logging is a must-have part of monitoring because it contains all of the details that the metrics simply don't have.
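As a rough sketch (the field names are assumptions, not our exact schema), a structured log entry for a retried call could look like this:

```javascript
// Illustrative structured log entry for a failed/retried API call.
// The point is to capture the request and response so the call can be
// reprocessed later if every retry attempt fails.
function buildRetryLogEntry({ attempt, request, response, error }) {
  return {
    timestamp: new Date().toISOString(),
    event: 'third_party_api_retry',
    attempt,
    request,                     // the payload we sent
    response: response || null,  // the response body, if any
    error: error ? error.message : null,
  };
}

// Example usage: emit the entry as JSON (e.g. collected by CloudWatch)
// and/or persist it to a database.
console.log(JSON.stringify(buildRetryLogEntry({
  attempt: 2,
  request: { policyType: 'travel', amount: 100000 },
  response: { status: 503, body: 'Service Unavailable' },
  error: new Error('Upstream timeout'),
})));
```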

To wrap things up, monitoring API retries is useful, especially for high-traffic APIs. With metrics, we can see when some API calls are not behaving well, such as how many calls have been retried or have failed within a given time range. If a third-party service's performance is degrading (as we can see from the metrics and logs), retrying API calls at the current rate might make it worse; by monitoring it, we can act accordingly, for example by adjusting the retry backoff. Even when we can't or won't do anything with the failed API calls because retrying doesn't really help based on the API response (an exhausted API call quota or third-party service maintenance, for instance), having the metrics gives an overview of the problem and of what to do next: increasing the API call quota, notifying customers about the maintenance, or whatever other strategy fits your case. So in monitoring API retries, metrics can certainly help with the general overview and alerts, while the logs help with the details of what's going on with the API calls.
