The tale of two messaging platforms: Apache Kafka and Amazon Kinesis

Guest post by Parviz Deyhim, Data & Analytics Practice Lead, Datapipe

Streaming data processing has become increasingly prevalent, and a range of platforms and frameworks has been introduced to reduce its complexity. As Datapipe’s data and analytics consultants, we are frequently asked by customers to help pick the right solution for them. Based on those customer engagements, we decided to share our findings in our Apache Kafka vs. Amazon Kinesis whitepaper. In this post, we summarize some of the whitepaper’s important takeaways.

Apache Kafka or Amazon Kinesis?

Both Apache Kafka and Amazon Kinesis are data ingest platforms designed to ingest data durably, reliably, and at scale. Both offerings share common core concepts, including replication, sharding/partitioning, and application components (consumers and producers).

However, before you build a real-time application, you should consider some key differences between the two:

1. Apache Kafka is an open-source, distributed messaging solution that was initially developed at LinkedIn. As a user, you are responsible for installing and managing clusters, and for ensuring high availability, durability, and failure recovery. In contrast, Amazon Kinesis is a managed platform, so you don’t have to be concerned with hosting the software and the resources.

2. The cost of using either solution varies considerably. Apache Kafka requires that you host and manage the framework. That means you are responsible for picking the right compute resources and storage capabilities, getting involved in capacity planning, and managing failure detection and recovery. All of these considerations result in resource costs (such as EC2 instances) and human costs (such as DevOps engineers). In contrast, given the hosted nature of Amazon Kinesis, the resource and human costs are significantly lower. However, for certain data ingest patterns Apache Kafka is more cost effective and should be carefully considered as an option.

3. There are also architectural differences, such as end-to-end data ingest, consumption latency, and scalability models. For example, in order to scale Apache Kafka, you have to monitor for hot partitions and move or add partitions as needed. In contrast, Amazon Kinesis provides scalability by allowing you to split a given shard to increase capacity or merge two shards to reduce capacity and lower cost.

4. The last section of the whitepaper provides a high-level overview of the producer and consumer APIs of the two solutions. Given that user applications are directly affected by how the APIs work, it’s important to pay attention to which features each solution supports. For example, Apache Kafka can retain the last known message for each key in a topic (a feature known as log compaction). This feature allows users to de-duplicate data based on a given key. Amazon Kinesis, on the other hand, does not provide this feature; you have to build this capability by using the API.
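
To make the last point concrete, here is a minimal sketch of the kind of key-based de-duplication an application might build on its own when the platform does not retain the latest message per key. It is not taken from the whitepaper and does not call any Kafka or Kinesis API. The record shape (key, value, sequence number) and the function name compact_latest are assumptions made for illustration only.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical record shape for this sketch: (key, value, sequence number).
Record = Tuple[str, str, int]


def compact_latest(records: Iterable[Record]) -> List[Record]:
    """Keep only the record with the highest sequence number for each key."""
    latest: Dict[str, Record] = {}
    for key, value, seq in records:
        current = latest.get(key)
        # Replace the stored record only when this one is newer.
        if current is None or seq > current[2]:
            latest[key] = (key, value, seq)
    # Return the survivors ordered by sequence number.
    return sorted(latest.values(), key=lambda record: record[2])


if __name__ == "__main__":
    batch = [("user-1", "a", 1), ("user-2", "b", 2), ("user-1", "c", 3)]
    for record in compact_latest(batch):
        print(record)
```

In a real consumer, the latest-value table would usually live in a durable store rather than in process memory, so that the de-duplicated state survives restarts.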

For an in-depth analysis of the two solutions in terms of core concepts, architecture, cost analysis, and the application API differences, see the Apache Kafka vs. Amazon Kinesis whitepaper.

This post and the whitepaper focus on the data ingest component of your architecture, but as your organization evolves, the other components of your data analytics capabilities will also change and evolve. As a next-generation managed services provider and experts in the public cloud, Datapipe has helped many customers such as Trulia, Milliman, and others with their data analytics needs. If you have questions regarding this post or other projects, feel free to contact us to get in touch with one of our experts. Hopefully, we had a chance to meet you in person last week in San Francisco, where we sponsored the Big Data & Analytics event during the AWS Loft Architecture Week, October 11–13. We also gave a talk at the event called “Working with Apache Kafka and Amazon Kinesis on AWS.”