Solving race conditions with Kinesis Data Streams vs SQS FIFO Queues

Viktor Trako
Cazoo Technology Blog

--

There are often many challenges when integrating with third-party APIs that require the evaluation of different architectural solutions before making any decisions. Sometimes you may find yourself engineering to compensate for the limitations of said third-party APIs.

At Cazoo, we run asynchronous Lambda functions on AWS that are triggered by events subscribed to in our event bus. These events are raised when something happens in the value chain. We recently built an integration with a third-party provider of a route planning and resource optimisation system. The third party models resource data, such as drivers and vehicles, into a schedule. Creating, updating or deleting an entity is done by fetching the schedule for a specific time period, applying the changes to it, and then pushing the modified schedule back up. Each change is triggered by a corresponding event in the event bus.

Because of this initial fetch step, the limitation is that the data held by the third party could end up in an inconsistent state if a new event arrives before the update for an event currently in flight has completed. This does not play well with the asynchronicity of Lambda functions and event-driven architecture. In this case, we need to guarantee that events are processed in the order in which they are raised, so we have to rate-limit the flow of events. Figure 1.0 helps to illustrate the issue.

It is important to note that although the event throughput for this service is very low, it may be volatile, as many events could be raised at the same time. To reduce complexity, we had built the application to be stateless, and we wanted to maintain this property when exploring solutions.

There are different approaches to solving this problem, but if you want a solution that keeps your application stateless, there are serverless options available on AWS that allow you to control the event flow through configuration when provisioning these resources on the platform.

We explored two such services that showed promise in helping to solve the race-condition issue illustrated above: SQS FIFO Queues and Kinesis Data Streams.

AWS SQS FIFO Queue

Similar to AWS SQS standard queues, AWS SQS FIFO is a fully managed message queuing service. The difference is that it adds functionality for building serverless applications where the order of messages is important. This makes it an obvious choice for our use-case, as it would guarantee that events are processed in order, eliminating the race condition.

The event producer can tag messages with a group ID. Messages that share a group ID belong to the same message group, and SQS FIFO guarantees that the events within a message group are processed in order. This makes the solution horizontally scalable if your use-case allows the events to be segregated into message groups.

Let’s put this to the test and observe both the order of the events processed from the queue and the scalability potential.

For the initial test, an SQS FIFO queue was set up with a Lambda function trigger, a reservedConcurrency of 1 and a batch size of 1 (the maximum number of messages the Lambda function can take off the queue per invocation). For this test, all messages were tagged with the same group ID, which means one message group was used.
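As a rough sketch, this is how such a setup could be provisioned with Terraform (our Infrastructure as Code tool of choice, as mentioned later in this post). The resource names and Lambda packaging details here are hypothetical, not our production configuration:

```hcl
# Hypothetical IAM role so the sketch is self-contained.
resource "aws_iam_role" "schedule_sync" {
  name = "schedule-sync-role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "lambda.amazonaws.com" }
    }]
  })
}

# FIFO queue names must end in ".fifo".
resource "aws_sqs_queue" "schedule_events" {
  name                        = "schedule-events.fifo"
  fifo_queue                  = true
  content_based_deduplication = true
}

resource "aws_lambda_function" "schedule_sync" {
  function_name = "schedule-sync"
  role          = aws_iam_role.schedule_sync.arn
  handler       = "index.handler"
  runtime       = "nodejs18.x"
  filename      = "lambda.zip"

  # reservedConcurrency of 1: at most one instance of the
  # function runs at any given time.
  reserved_concurrent_executions = 1
}

resource "aws_lambda_event_source_mapping" "queue_trigger" {
  event_source_arn = aws_sqs_queue.schedule_events.arn
  function_name    = aws_lambda_function.schedule_sync.arn

  # Batch size of 1: each invocation takes a single message
  # off the queue.
  batch_size = 1
}
```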

Under test, we observed that SQS FIFO provides the desired rate limiting and that events are processed in the order in which they are emitted. The news was not so great in terms of performance. The graph below illustrates the performance observed under test.

Let us see if we can improve on this by increasing the reservedConcurrency and segregating our events into two message groups using their group IDs. When setting the reservedConcurrency for the Lambda function, the best practice is to set it to at least the number of message groups or higher. For this test, the reservedConcurrency was set to 3, the batch size to 1, and the messages were separated into two message groups.
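The only infrastructure change from the sketch above is the concurrency setting on the function; the group IDs themselves are assigned by the producer when sending each message:

```hcl
resource "aws_lambda_function" "schedule_sync" {
  function_name = "schedule-sync"
  role          = aws_iam_role.schedule_sync.arn
  handler       = "index.handler"
  runtime       = "nodejs18.x"
  filename      = "lambda.zip"

  # Best practice: at least as many as there are message groups.
  # Two message groups, so allow up to three concurrent invocations.
  reserved_concurrent_executions = 3
}
```

The graph below illustrates the performance under test.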

This is a huge improvement in performance. Due to the reservedConcurrency setting, three concurrent Lambda function invocations were observed. It is important to note the following two points:

  1. There is no dedicated Lambda invocation for a specific message group. Each Lambda invocation could be triggered by messages from any message group.
  2. Events within a message group are processed in order, but some events from different message groups were observed to be processed out of order. This is important to take into account when segregating your data into message groups in a way that fits your use-case.

AWS Kinesis Data Streams

Kinesis Data Streams (KDS) is a highly scalable and durable real-time data streaming service for building high-throughput, cost-effective data pipelines.

There are many use cases for KDS in capturing high volumes of data to power real-time analytics, enabling your business to make informed, data-driven decisions. It can be used to capture log and event data to build observability and provide granular insights into your application's performance. There are plenty of other use cases centered around high-throughput data processing and analytics.

We found that KDS can be configured to suit our use-case, allowing us to control concurrency, rate-limit the throughput and process our events in the order in which they are emitted.

This can be achieved through a combination of the shard count, batch size and parallelizationFactor settings. We set up a new Kinesis Data Stream with a shard_count of 1 and a trigger to our Lambda function with a batchSize of 1 and a parallelizationFactor of 1.
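Again as a hedged sketch, reusing the hypothetical function from the SQS example above, this configuration could look as follows in Terraform:

```hcl
resource "aws_kinesis_stream" "schedule_events_stream" {
  name        = "schedule-events"
  shard_count = 1
}

resource "aws_lambda_event_source_mapping" "stream_trigger" {
  event_source_arn  = aws_kinesis_stream.schedule_events_stream.arn
  function_name     = aws_lambda_function.schedule_sync.arn
  starting_position = "LATEST"

  # One record per invocation, one concurrent invocation per shard.
  batch_size             = 1
  parallelization_factor = 1
}
```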

The graph below illustrates the performance observed when running this load test.

The performance of KDS is much better when benchmarked against our SQS FIFO queue solution using one message group and a reservedConcurrency of 1, as the graph below illustrates.

This solution scales as our data volume increases. For our use-case, we can segregate the data into 1-n shards by using the dates in each entity to calculate a value to use as the partition key. For example, we could have 7 shards, partitioning the data by day of the week. Each shard can trigger its own Lambda function invocation, allowing us to process more data in parallel while the data for any given 24-hour period is still processed in order, preventing a race condition from ever occurring.
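To illustrate the day-of-week idea (this is a sketch of the design, not a configuration we ran under test), only the shard count changes on the stream; the partition key itself is supplied by the producer with each record:

```hcl
resource "aws_kinesis_stream" "schedule_events_by_day" {
  name = "schedule-events-by-day"

  # One shard per day of the week. The producer derives the
  # partition key (e.g. "MONDAY") from the dates in each entity;
  # all records with the same key land on the same shard and are
  # processed in order. Note that keys are hashed to shards, so
  # the mapping of seven keys to seven shards is not guaranteed
  # to be one-to-one.
  shard_count = 7
}
```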

To test this out, we set up a load test, increasing the number of shards in the stream to two and producing events partitioned across these two shards (Figure 1.2). When benchmarking the results against SQS FIFO queues with two message groups and a reservedConcurrency of 3, we can see that SQS FIFO performance is more competitive, as the graph below illustrates.

Under test, we observed that Kinesis Data Streams allows us to control concurrency limits and throughput. The Kinesis stream trigger for the Lambda function honors the settings as expected, allowing us to observe the events being processed sequentially from the stream when using one or more shards to partition our data. This is the desired outcome for the problem we are trying to solve, preventing any race conditions from occurring.

Similar to SQS FIFO queues, any of the Lambda invocations can process data from any partition, but events belonging to a partition are processed in order.

Our event source mapping uses a parallelizationFactor of 1 (the default) and a batchSize of 1. This means there is one Lambda invocation per shard, processing one batch of data from the shard at a time. The number of parallel Lambda invocations per shard is proportional to the parallelizationFactor.
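For example, if throughput ever demanded it, raising the factor on the event source mapping would allow multiple concurrent batches per shard. A sketch, amending the mapping from the earlier example:

```hcl
resource "aws_lambda_event_source_mapping" "stream_trigger" {
  event_source_arn  = aws_kinesis_stream.schedule_events_stream.arn
  function_name     = aws_lambda_function.schedule_sync.arn
  starting_position = "LATEST"
  batch_size        = 1

  # Concurrency scales as shard count x parallelization factor:
  # e.g. 2 shards x a factor of 5 = up to 10 concurrent invocations.
  # Records sharing a partition key are still processed in order.
  parallelization_factor = 5
}
```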

What’s next?

We have determined under test that both SQS FIFO queues and KDS can provide a solution for our use-case. Which one you choose depends on your application's data throughput and volatility. The shape of your events is important in determining whether you can partition your data into shards for KDS or into different message groups for SQS FIFO. There are also some service limits to consider when making a decision. You can find out more about KDS Quotas and Limits and SQS Quotas in the AWS documentation.

For the use-case described above, we can get away with one partition or message group due to the low data throughput we expect. Based on performance alone, KDS is the more performant solution, and it can be used as an event source with a data relay set up between EventBridge and KDS.

But how do we get our event data from the event bus, in our case EventBridge, to a Kinesis Data Stream? In the second part of this post, I would like to walk you through how we set up an EventBridge to Kinesis Data Stream relay using Terraform as our Infrastructure as Code tool of choice.
