Adopting an Event-Driven Architecture: A comparative look at AWS messaging solutions using Kinesis

Mario Bittencourt

Published in

SSENSE-TECH

10 min readOct 1, 2021

Part IV of the SSENSE-TECH four-part series aims to explore some of the several solutions available within AWS.

Part I: Adopting an Event-Driven Architecture: A comparative look at AWS messaging solutions using SQS:

Part II: Adopting an Event-Driven Architecture: A comparative look at AWS messaging solutions using SNS

Part III: Adopting an Event-Driven Architecture: A comparative look at AWS messaging solutions using EventBridge

In parts I, II, and III we covered the simplest solution (SQS) to increasingly more powerful and complex solutions such as SNS and EventBridge. While these three services cover a plethora of applications, there are situations where you may benefit from a different approach.

This article will cover Kinesis and how its streaming capabilities can be used to decouple applications and handle fan-out in a different way than the other solutions we’ve reviewed so far. To close this series, I will also provide a simple set of heuristics to help determine which service is more suitable for your problem/application.

You get a stream, they get a stream, we all get a stream

When we revisit the solutions explored so far, there is one thing in common for all of them: Decoupling is achieved by a producer sending messages that are then made available to consumers, and once consumed they are removed.

Kinesis approaches this differently, where the same message — not a copy — is available to all consumers and even after being processed it remains available. It can be used to decouple your application, handle vast amounts of data flowing in, and enable more complex analytics with the data in “transit”.

Solutions like this belong to a category commonly known as streaming solutions.

Figure 1. A stream of events with multiple consumers having access to the same messages but at different positions.

Kinesis Data Stream

Kinesis technically encompasses a family of services offered by AWS including Data Stream, Data Analytics, and Firehose.

I will focus on the Data Stream which is the one that could be compared to probably the most popular streaming alternative: Kafka.

Similar to Kafka, it provides a way for data to flow into your application by allowing multiple producers to add data that is organized into a stream to later be presented to the consumers and potentially be further manipulated.

Basics

Data Record

Represents the data/event/message we want to capture from the producer and make it available for the consumers.

The actual data format is not relevant to Kinesis since it is treated as a base64-encoded blob.

As part of each data record, you have to specify a partition key that Kinesis will use to determine where to store the record within the stream.

Shard

Consists of a sequence of data records grouped within a stream.

A shard has a different capacity of reads and writes, both expressed in the number of records and bandwidth per second.

For writes, you can have up to 1,000 records per second while respecting 1 MB per second.

For reads, you can have up to 5 requests per second while respecting 2 MB per second.

Data Stream

Consists of the data records organized in 1 or more shards. You can store the data records for up to one year, with the default retention period being 24 hours.

Currently, it is possible to have up to 499 shards per stream and you can add or remove shards as needed to satisfy the ingestion and/or consumption requirements.

Contrary to other managed services, scaling is not automatic and needs to be triggered by you.

AWS offers a simple formula to help you find how many shards you would need.

Number_of_shards = max(incoming_write_bandwidth_KB/1024, outgoing_read_bandwidth_KB/2048)

You take the average size of a record in KB and multiply it by the number of records per second your producers are expected to write. This will provide you with the bandwidth needed on the producer side.

For the consumer side, take the previously calculated number and multiply it by the number of consumers.

This way you balance the influence of producers and consumers against the read and write limits a single shard can provide.

For example, imagine you expect your consumers to write 800 records per second with an average of 5 KB in size and you have 3 consumers for the stream, one to send the records to a data lake, one to update a local projection, and a final one to send a manipulated version of the record to an external system.

Number_of_shards = max(3.9, 11.71) = 11.72 => 12 shards.

Figure 2. Overall view of Kinesis components.

Writing records to a stream

Being a producer to a Kinesis stream is very simple and needs only 3 things:

The name of the stream you want to add the record to
The partition key
The content of the record

The response of the PutRecord will inform of the sequence number of the record in the stream and which shard the record was written to.

Kinesis will use a hash function in the partition key and with that it will choose which shard the record will be written to.

Kinesis also exposes a PutRecords API to allow you to send up to 500 records where any single record can be up to 1 MB — including the partition key — as long as the total size of the request is not over 5 MB.

Similarly as seen with EventBridge, if you choose to send multiple records at the same time, AWS will try to send them all but will not stop at the ones that failed. As such the ordering could be compromised when records fail to be written to the stream.

The result of a PutRecords will contain the number of failures and a list of all records, successfully written or otherwise. The successful records will report the SequenceNumber and ShardId while the unsuccessful will report the ErrorCode and their associated ErrorMessage.

AWS provides a list of the errors that are commonly associated with Kinesis here. When integrating with Kinesis, or any other service, you should look for those errors to understand how your application should react when they happen.

Consuming records from a stream

The consumer side is more complex, as you have to specify the stream, what shard, and from what position you want to start receiving the records.

To facilitate this process, AWS offers a KCL (Kinesis Client Library) that handles saving checkpoints to inform where in the shard your consumer last received a record and reacts when resharding happens.

Figure 3. Using KCL with a non-Java application.

The simplest way to consume records from Kinesis is by configuring a stream to be the event source for Lambda execution. If you do so there is no need to manage the KCL requirements as the records are pulled from the shard by the Lambda service and delivered to your function.

The Lambda service polls the stream every second and takes a batch of records from Kinesis and makes it available for the invoker to deliver them to your function.

Figure 4. How a Lambda function gets invoked by using Kinesis as the event source.

A recommendation when using Serverless is to enable the custom checkpoints, this way if you have issues processing a batch of records in Lambda you can inform the point of failure so the next Lambda iteration would not receive the entire batch, but starting from the one that failed.

Handling increased throughput

Let’s discuss two different approaches that can be used if you are faced with one or more of the following:

Need to increase the throughput as your producers send more events;
Need to have a bigger fan-out of the consumers;
Each consumer takes a long time processing each record.

Adding more shards

As mentioned earlier, each shard has a maximum capacity for writes and reads. If a given number of shards is not enough, you can increase it and gain extra throughput. When you do so a process called resharding takes place.

Once done, the partitioning scheme is updated to take into account the new shards so when new records are written, they may end up in different shards than before.

Figure 5. Adding shards cause new records to be written to a different one based on the partition key.

This can help increase the rate of events being written or if you need to add many consumers.

One aspect that is important to note is that the consumers will only start accessing the new shards after they’ve read all the records of the old shards, this way preserving the order within the shard.

Adding parallel consumers per shard

Sometimes the problem is not that you have many different consumers, but that the rate a single consumer provides is not enough to handle the influx of records. In this case, you can enable the parallel execution of the same consumer.

With Lambda you can configure it to run up to 10 simultaneous executions per shard. To guarantee the order, records with the same partition key in each shard are sent to the same invocation of a Lambda.

Figure 6. Parallel execution. Each Lambda receives records from different batchers. Order within the batcher matches the shard.

In both cases, if the partition key is not diverse enough you may end up not having significant gains, so make sure to monitor the IteratorAge metric and if none of the aforementioned approaches helps, consider redefining your partition key.

Reducing latency with enhanced fanout

As discussed earlier, each shard offers a capacity of up to 5 requests per second and 2 MB of bandwidth, whichever happens first. This means we can have up to 5 consumers where each one can make one request per second, making the latency vary from 200 ms to 1 second. Similarly, the bandwidth is shared among the same consumers.

If your application has many consumers or higher bandwidth requirements, Kinesis offers enhanced fanout (EFO), enabling up to 20 consumers. Each one has non-shared bandwidth and an average latency of 20 ms.

This option comes with an extra cost that involves an hourly charge + the traffic consumed by those consumers.

Choosing the right solution for your application

It is certain that as your ecosystem grows you are likely to use more than one of the presented AWS services to integrate your applications. However, choosing the right one can be daunting, especially as many of them have overlapping features.

Instead of trying to come up with a set of rules to determine the right service, I prefer to think in terms of what I know about my application’s needs, and use this as heuristics to narrow down the possible candidates. Then you put them and use additional dimensions to select the best choice, such as cost and latency.

Does it need to send commands to another application?

Commands have an implicit 1:1 relationship between the origin and the destination, which can be handled by a point-to-point connection.

Direct SQS (queues) are normally the simplest and recommended solution.

Do you require multiple applications to be informed, or react to the same information?

This means you need some form of fan-out, where the same information (message/event) is made available for multiple destinations. Since once consumed from a queue the message disappears, using SQS directly is out of the question. SNS, EventBridge, and Kinesis are all candidates and you will need to inquire more to help decide.

How many destinations need to be made aware of the information?

SNS is known for the massive number of subscribers a given topic can have. If you need such several subscribers it is likely the recommended solution.

If the number is much smaller, ideally less or equal to 5, then both EventBridge and Kinesis could also be selected. Your additional criteria could be latency, potential source, need to filter messages and destinations.

Do you need to filter so not all destinations receive all messages?

SNS offers filtering based on the attributes of the message and EventBridge can do the same based on the entire content of the message.

Do you need to be able to reprocess the messages?

If you want to receive the messages again, Kinesis is a good option as it offers this functionality and can persist the records for up to one year.

Do you need to integrate third-party applications as the source of information?

If your third-party application is an AWS partner, EventBridge can offer you integration with the least amount of effort.

Is latency a big factor for the destination?

SNS has the lowest average latency when compared to EventBridge and Kinesis. Kinesis comes second with EventBridge being the last.

More than being an exhaustive list, answering the above questions has the added benefit of forcing you to think about aspects of your application that you would not normally do.

Wrapping up

As we saw, Kinesis provides yet another solution that allows you to decouple your application, and more specifically, a managed way to bring stream handling capabilities. It is a powerful option in the managed services space that, although not as configurable as Kafka, offers a solid solution with support for handling spikes in the volume of records being produced.

Choosing the right solution(s) is tricky due to the many variables that have to be considered, but with the provided set of heuristics, you can narrow the selection or at least surface the questions you should be asking yourself about your system’s needs.

Now let’s go build something fun!

Access parts 1 through 3 of this SSENSE-TECH series tackling how to adopt an event-driven architecture below:
Part I: A comparative look at AWS messaging solutions using SQS

Part II: A comparative look at AWS messaging solutions using SNS

Part III: A comparative look at AWS messaging solutions using EventBridge

Editorial reviews by Deanna Chow, Liela Touré, and Pablo Martinez. Want to work with us? Click here to see all open positions at SSENSE!