A journey to fault-tolerant Kinesis Producers

Emre Kaya
Insider Engineering
3 min read · Feb 14, 2022

In every development lifecycle, we may face problems with the services we use, especially when we are dealing with queue systems. There are some common problems that everyone can run into from time to time. In this post, we are going to examine the most crucial ones.

P.S. This article is based on the AWS SDK for JavaScript.

Before going for a deep dive, let's look at the primitive version of our producer.
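A minimal sketch of such a primitive producer might look like this (the stream name, region, and record shape below are assumptions, not our actual values):

```javascript
// A minimal "primitive" producer using the AWS SDK for JavaScript (v2).
// Stream name, region, and record payloads are placeholders.
const AWS = require('aws-sdk');

const kinesis = new AWS.Kinesis({ region: 'us-east-1' });

function putRecords(records, callback) {
  const params = {
    StreamName: 'my-example-stream',
    Records: records.map((record) => ({
      Data: JSON.stringify(record),
      PartitionKey: record.id,
    })),
  };

  kinesis.putRecords(params, (err, data) => {
    if (err) {
      // Only catches errors for the whole request, not partial failures.
      return callback(err);
    }

    return callback(null, data);
  });
}
```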

It is pretty simple: it just checks whether there is an error in the callback. But this can mislead us because of the behaviour of the AWS SDK.

Batch operations in the AWS SDK are not atomic. If there are partial failures, the SDK will still treat the call as a success. This means records can be lost without you even knowing it, if you don't define an alarm and/or log the response of the PutRecords operation. Here is an example response that includes a partial failure.
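(The values below are illustrative; the sequence number, shard IDs, and account number are made up.)

```json
{
  "FailedRecordCount": 1,
  "Records": [
    {
      "SequenceNumber": "49543463076548007577105092703039560359975228518395019266",
      "ShardId": "shardId-000000000000"
    },
    {
      "ErrorCode": "ProvisionedThroughputExceededException",
      "ErrorMessage": "Rate exceeded for shard shardId-000000000001 in stream my-example-stream under account 111111111111."
    }
  ]
}
```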

When we realized this partial-failure behaviour, our first move was to increase the number of shards for our stream and to define an alarm based on the WriteProvisionedThroughputExceeded metric.
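As a rough sketch, such an alarm could be defined programmatically like this (the alarm name, threshold, period, and stream name are assumptions):

```javascript
// A sketch of defining the alarm with the AWS SDK for JavaScript (v2).
// Alarm name, threshold, period, and stream name are placeholders.
const AWS = require('aws-sdk');

const cloudwatch = new AWS.CloudWatch({ region: 'us-east-1' });

const params = {
  AlarmName: 'my-example-stream-write-throughput-exceeded',
  Namespace: 'AWS/Kinesis',
  MetricName: 'WriteProvisionedThroughputExceeded',
  Dimensions: [{ Name: 'StreamName', Value: 'my-example-stream' }],
  Statistic: 'Sum',
  Period: 60,               // seconds
  EvaluationPeriods: 1,
  Threshold: 0,
  ComparisonOperator: 'GreaterThanThreshold',
  TreatMissingData: 'notBreaching',
};

cloudwatch.putMetricAlarm(params, (err) => {
  if (err) console.error('Failed to create alarm', err);
});
```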

But it was not enough. Over the following days, the alarm fired many times. At that point, we decided that we could not solve this problem just by increasing the number of shards, so we started looking for an efficient solution that recovers failed records.

We knew that we could easily extract the failed records from the response by checking the ErrorCode and/or ErrorMessage fields. But, at that point, we were confused about the record ordering in the response. Thankfully, the PutRecords response returns the records in exactly the same order as they were given in the request.

The PutRecords response includes an array of response Records. Each record in the response array directly correlates with a record in the request array using natural ordering, from the top to the bottom of the request and response. The response Records array always includes the same number of records as the request array.

put-records — AWS CLI 1.22.23 Command Reference
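With the ordering confirmed, a failed entry in the response can be paired with its original request record purely by index. Here is a small illustrative helper (the names are placeholders, not our actual code):

```javascript
// Pairs failed response entries with the original request records by index.
function extractFailedRecords(requestRecords, response) {
  if (response.FailedRecordCount === 0) {
    return [];
  }

  // The i-th response entry corresponds to the i-th request record.
  return response.Records
    .map((result, index) => ({ result, original: requestRecords[index] }))
    .filter(({ result }) => result.ErrorCode || result.ErrorMessage)
    .map(({ original }) => original);
}
```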

After this confirmation, we created a new service for our Lambda producer that checks for failed records and tries to put them into the stream again.

Basically, every PutRecords operation is initialized with a retry limit. This limit defines how many times we re-send the failed records to the stream, until the response no longer contains a failed record. Also, every retry attempt waits for a while before re-sending, to avoid overloading the stream.
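A simplified sketch of this retry logic might look like the following (the retry limit, delay, and stream name are assumptions, not our production values):

```javascript
// A retrying producer sketch: re-sends only the failed records,
// up to a retry limit, with a small delay between attempts.
const AWS = require('aws-sdk');

const kinesis = new AWS.Kinesis({ region: 'us-east-1' });

const RETRY_LIMIT = 3;
const RETRY_DELAY_MS = 200;

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function putRecordsWithRetry(records, streamName) {
  let pending = records;

  for (let attempt = 0; attempt <= RETRY_LIMIT && pending.length > 0; attempt += 1) {
    if (attempt > 0) {
      await sleep(RETRY_DELAY_MS * attempt); // back off a bit on every retry
    }

    const response = await kinesis
      .putRecords({ StreamName: streamName, Records: pending })
      .promise();

    if (response.FailedRecordCount === 0) {
      return;
    }

    // The response preserves request ordering, so indexes line up:
    // keep only the records whose response entry reports an error.
    pending = pending.filter((_, index) => response.Records[index].ErrorCode);
  }

  if (pending.length > 0) {
    throw new Error(`${pending.length} record(s) not delivered after ${RETRY_LIMIT} retries`);
  }
}
```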

After this solution, everything was fine for a while. Then, in the following days, we started to struggle with timeout issues in the producer. Timeouts are a common issue that can happen in any call, even within the AWS SDK. It was another challenge in our journey, so we started checking the AWS SDK documentation to find an efficient way to solve it.

Here is what we found while checking the documentation:

  • The default timeout for each request is 2 minutes (120,000 ms). That means the SDK waits 2 minutes before it terminates the request and retries it.
  • The default timeout for establishing a new connection is equal to the request timeout if you don't set it explicitly.

After the investigation, we applied the following httpOptions to the Kinesis instance.
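Here is a sketch of that configuration (the timeout values are placeholders, not our exact production settings):

```javascript
// Tightening the HTTP timeouts on the Kinesis client (AWS SDK for JavaScript v2).
// The values below are placeholders.
const AWS = require('aws-sdk');

const kinesis = new AWS.Kinesis({
  region: 'us-east-1',
  httpOptions: {
    connectTimeout: 1000, // time allowed to establish a new connection (ms)
    timeout: 5000,        // time allowed for the whole request (ms); default is 120000
  },
  maxRetries: 3,          // how many times the SDK itself retries a failed request
});
```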

That was the last piece of the puzzle :)

It is time to see the metrics of our new producer 🎉

Conclusion

If you are working with Kinesis Streams and want to minimize data loss, you must be aware of partial failures in batch operations and of your connectivity configuration.
