Kinesis (AWS) vs. Pub/Sub (GCP), and how they stand next to Kafka
Over the last year I have worked, at scale, with both of these managed streaming services. Let's go over each one: its capabilities, concepts, and weaknesses, and, in the end, compare both cloud solutions to Kafka.
What are the components here?
Data is published to a topic. A topic can have multiple subscriptions, and each subscription serves one consumer.
Needless to say, each subscription is isolated from the others; it simply holds the stream state for its consumer.
You don't have to worry about scaling: Pub/Sub "knows" how to adapt itself to the rate of published messages. There is no need to get your hands dirty with shards/partitions; GCP arranges it for you.
You can monitor Pub/Sub with Stackdriver metrics, at either the topic level or the subscription level.
These two metrics show that the subscription has messages that it can't ack in time.
In the metrics below, you can see the publish rate vs. the ack rate. In this case they are identical, which indicates that the ack process is performing well, without bottlenecks.
Ordering is on a best-effort basis, using the system-supplied publishTime attribute to deliver messages in the order that they were published. Cloud Pub/Sub does not guarantee exactly-once or in-order delivery.
Delivery is also on a best-effort basis: Pub/Sub does not guarantee exactly-once delivery.
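Because delivery is at-least-once, the consumer side should be idempotent. A minimal sketch (the `(message_id, data)` shape is a simplified stand-in for a real Pub/Sub message, and the in-memory set would be a durable store in production):

```python
# Minimal sketch of an idempotent consumer for an at-least-once stream.
# message_id/data are simplified stand-ins for a real Pub/Sub message;
# in production you would persist the seen IDs durably.

def make_handler(process):
    seen = set()  # in-memory dedup store (use a durable store in production)

    def handle(message_id, data):
        if message_id in seen:
            return False  # duplicate delivery: skip reprocessing
        process(data)
        seen.add(message_id)  # record only after successful processing
        return True

    return handle
```

This way, a redelivered message is acknowledged but not processed twice.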
What are the components here?
The stream is similar to a "topic" in Pub/Sub: a stream of records. A stream must be configured with a number of shards, which defines its capacity:
A shard is a unit of throughput capacity. Each shard ingests up to 1MB/sec and 1000 records/sec, and emits up to 2MB/sec. To accommodate for higher or lower throughput, the number of shards can be modified after the Kinesis stream is created.
When you create your stream you must choose the number of shards, and it is very important to remember: you pay per shard-hour! If there are a couple of hours where you expect an increase in traffic, you need to set the shard count for the highest traffic you get, even if you don't need it for the rest of the day…
When creating a stream, AWS helps you calculate your needed shard number:
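That calculation follows directly from the per-shard limits quoted above (1 MB/s and 1,000 records/s ingest, 2 MB/s egress): the shard count must cover the tightest of the three. A rough sketch of the arithmetic:

```python
import math

def required_shards(write_mb_per_sec, records_per_sec, read_mb_per_sec):
    """Estimate the shard count from the Kinesis per-shard limits:
    1 MB/s and 1000 records/s ingest, 2 MB/s egress."""
    return max(
        math.ceil(write_mb_per_sec / 1.0),   # ingest bandwidth limit
        math.ceil(records_per_sec / 1000.0), # ingest record-rate limit
        math.ceil(read_mb_per_sec / 2.0),    # egress bandwidth limit
        1,                                   # a stream needs at least one shard
    )
```

For example, 3.5 MB/s in, 2,500 records/s, and 5 MB/s out needs 4 shards (the ingest bandwidth is the bottleneck).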
What happens when I have spikes?
Then you will get: "Rate exceeded for shard shardId-000000000052 in stream [some stream name] under account…"
If it is a temporary spike, this may be acceptable, because the producer library (KPL) will retry, and the spiked records will be ingested on the next retry.
AWS also recommends implementing a backpressure mechanism to avoid such spikes.
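One common shape for that mechanism is retrying a throttled put with exponential backoff and jitter, so producers spread out instead of hammering the shard again in unison. A minimal sketch (the `put_record` callable is a hypothetical stand-in for your actual Kinesis put):

```python
import random
import time

def put_with_backoff(put_record, record, max_attempts=5, base_delay=0.1):
    """Retry a throttled put with exponential backoff plus jitter.
    `put_record` is any callable that raises on 'Rate exceeded' throttling."""
    for attempt in range(max_attempts):
        try:
            return put_record(record)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the throttling error
            # back off exponentially, with jitter to avoid a thundering herd
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.0))
```

The jitter factor matters: without it, all throttled producers retry at the same instant and hit the limit again together.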
What happens when I have to increase the number of shards? Or decrease it?
You can increase the shard count to at most double its current size, and you can decrease it to no less than half its current size. This limitation can force you to perform the transition in several iterations.
Once you change the shard count, Kinesis handles the re-sharding for you, with zero downtime.
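The double/halve limit means a large jump has to be walked in steps. A small sketch of how many update calls that takes (assuming each call can at most double or halve the current count):

```python
import math

def reshard_steps(current, target):
    """List the intermediate shard counts needed to reach `target`,
    assuming each update can at most double or halve the current count."""
    steps = []
    while current != target:
        if target > current:
            current = min(current * 2, target)            # at most double
        else:
            current = max(math.ceil(current / 2), target)  # at most halve
        steps.append(current)
    return steps
```

For example, going from 3 shards to 20 takes three updates (3 → 6 → 12 → 20), and each one triggers its own re-sharding.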
For every record you send to Kinesis you must specify a partition key; this key is used to load-balance the records across the available shards.
I usually do
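Under the hood, Kinesis MD5-hashes the partition key into a 128-bit space that is split into hash ranges, one per shard. A rough sketch of that mapping, assuming evenly split ranges (real streams can have custom ranges after re-sharding):

```python
import hashlib

def shard_for_key(partition_key, num_shards):
    """Mimic Kinesis routing: MD5-hash the partition key into the
    128-bit space and map it onto evenly split shard hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    hash_value = int(digest, 16)           # 128-bit integer
    range_size = 2 ** 128 // num_shards    # assume evenly split ranges
    return min(hash_value // range_size, num_shards - 1)
```

This is why the same partition key always lands on the same shard (preserving order for that key), while distinct keys spread out across shards.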
You can monitor it in CloudWatch, at the shard level and at the stream level; alerts are supported as well.
And if you want to monitor it per consumer? This is a bit challenging: you have to make sure your consumer sends metrics to CloudWatch (and you pay per API call). Read more here.
You also need to monitor "WriteProvisionedThroughputExceeded" and "ReadProvisionedThroughputExceeded", which show you cases where you write/read above your shard limits.
You can see in the metrics below that, from time to time, there is a WriteProvisionedThroughputExceeded; this can be correlated with the spikes in incoming data.
Kinesis supports batching of records; it has a dedicated API for it, "PutRecords", rather than "PutRecord".
Each PutRecords request can support up to 500 records. Each record in the request can be as large as 1 MB, up to a limit of 5 MB for the entire request.
Keep in mind that when producing aggregated records in one API call, you also need to de-aggregate them when consuming. Read more here.
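On the producer side, the native PutRecords limits quoted above can be enforced with a simple chunker: at most 500 records and 5 MB per request, 1 MB per record. A sketch (payloads here are plain `bytes`; this is batching for the native API, not KPL aggregation):

```python
MAX_RECORDS_PER_BATCH = 500
MAX_BATCH_BYTES = 5 * 1024 * 1024   # 5 MB per PutRecords request
MAX_RECORD_BYTES = 1 * 1024 * 1024  # 1 MB per record

def batch_records(records):
    """Group byte payloads into PutRecords-sized batches:
    at most 500 records and 5 MB per request."""
    batches, current, current_bytes = [], [], 0
    for record in records:
        size = len(record)
        if size > MAX_RECORD_BYTES:
            raise ValueError("record exceeds the 1 MB per-record limit")
        # start a new batch if adding this record would break either limit
        if current and (len(current) == MAX_RECORDS_PER_BATCH
                        or current_bytes + size > MAX_BATCH_BYTES):
            batches.append(current)
            current, current_bytes = [], 0
        current.append(record)
        current_bytes += size
    if current:
        batches.append(current)
    return batches
```

Each resulting batch can then be sent as one PutRecords call.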
Order can be achieved at the shard level: the producer sends each record with its partition key, which determines the shard, and each record gets a sequence number. So when the consumer reads the data, it receives it per shard, in sequence order.
A record can be delivered to a consumer more than once, so the application must enforce exactly-once semantics. For more information, see Handling Duplicate Records in the Amazon Kinesis Data Streams documentation.
Comparing both cloud solutions to Kafka (the most popular):
Kafka doesn't support automatic scaling; when you need to send more data into a topic, you have to add partitions, brokers, replicas, and more.
You end up with a lot of maintenance and multiple related components in the cluster that need to be adjusted.
These operations aren't needed when you use Kinesis/Pub/Sub. Even though Kinesis isn't scaled up/down automatically, it's still easier than doing it in Kafka.
Kafka can support ordered messages at the partition level: a consumer reads data from a partition, so it gets the messages in order.
When choosing Kafka you aren't locked in to a cloud vendor, and you can change cloud providers without touching any code/infrastructure.
Kafka, because of its growing popularity, has a strong ecosystem: countless tools and third-party connectors/sinks/providers, which give you easy integration with the external world.
Kafka has managed solutions like Confluent, Aiven and more.
They can build the cluster for you on any cloud: AWS/GCP/Azure.
This is worth checking, because Kafka has its own strengths and the managed service solves the cluster-maintenance part. Now you can benefit from other Kafka features: Kafka Streams, message persistence, transactions, message ordering, exactly-once guarantees…
Pub/Sub:
- Zero operations; no need to define partitions or shards.
- Scaling is built in, without any required operation (just check that you don't hit some quota limit).
- Pay only for usage.
- Monitoring is friendly and simple.
- Producer code is straightforward, without handling partitions, shard errors, or backpressure.
Kinesis:
- Scaling is also managed, but you have to know your expected usage and operate the scale-up/scale-down accordingly; this doesn't happen automatically.
- You pay for the shards you have opened, even if you aren’t using all of them.
- Monitoring is not always free, in case you have to send custom metrics (per consumer).
- Producer code can be complex; it is recommended to use the KPL library (Java only).
Kafka, on the one hand, has a strong ecosystem and rich features, and it keeps growing and improving; it is open source, has a large community, and doesn't lock you in to a cloud vendor.
But on the other hand, it comes with a high maintenance cost. It is better if you can buy a managed Kafka service, delegate the maintenance/scaling to an external service, and focus on your logic and business; then you can enjoy the best of both worlds.