Photo by Robert Haverly

Apache Kafka vs Amazon Kinesis to Build a High Performance Distributed System

By Kyle Wild
Published in The Event Log · 7 min read · Feb 13, 2017


At Keen IO, we’ve been running Apache Kafka in a pretty big production capacity for years, and are extremely happy with the technology. We also do some things with Amazon Kinesis and are excited to continue to explore it.

Apache Kafka vs Amazon Kinesis

For any given problem, if you’ve narrowed it down to choosing between Kinesis and Kafka for the solution, the choice usually depends more on your company’s size, stage, funding, and culture than it does on your use case (although I believe that for some use cases, the answer is obviously Kafka, as I’ll get to later). If you’re a distributed systems engineering practice, have lots of distributed DevOps / cluster management / auto-scaling / stream processing / sysadmin chops, and prefer interacting with Linux over interacting with an API, you may choose Kafka regardless of other factors. The inverse is true if you’re more of a web, bot, or app development practice and are bigger fans of services like Amazon RDS, Amazon EC2, Twilio, and SendGrid than of services like Apache ZooKeeper and Puppet.

In somewhat-artificial tests, Kafka today has more horsepower out of the box on rough numbers, and it can be tuned to outperform Kinesis in terms of raw numbers on practically any given test. But are you really going to do all that tuning? And are those really the factors that matter most to you, or are there other pros and cons to consider? By analogy: a Corvette can beat a Toyota Corolla in a lot of tests, but maybe gas mileage is what matters most to you, or longevity, or interoperability. Or, like lots of business decisions, is it Total Cost of Ownership (TCO) that wins the day?

What follows is a bit of a side-by-side breakdown of the big chunks of the TCO for each technology.

Performance (can it do what I want?)

For the vast, vast, vast majority of the use cases you may be considering them for, you really can’t go wrong with either of these technologies from a performance perspective. There are other great posts (Ingestion Comparison Kafka vs Kinesis) that point to the numbers demonstrating where Kafka really shines in this department.

Advantage: Kafka — but performance is often a pass/fail question, and for nearly all cases, both pass.

Setup (human costs)

I would say Kinesis is more than just slightly easier to set up than Kafka. When compared with roll-your-own on Kafka, Kinesis abstracts away a lot of problems (you mentioned cross-region stuff, but you’d otherwise also have to learn and manage Apache ZooKeeper, cluster management/provisioning/failover, configuration management, etc.). Especially if you’re a first-time user of Kafka, it’s easy to sink days or weeks into turning Kafka into a scale-ready production environment. Kinesis, by contrast, will take you a couple of hours at most, and because it’s in AWS, it’s production-worthy from the start.

Advantage: Kinesis, by a mile.

Ongoing ops (human costs)

It also might be worth adding that there can be a big difference between the ongoing operational burden of running your own infrastructure (and a 24-hour pager rotation to deal with hiccups, building a run book over time based on your learnings, etc — the standard Site Reliability stuff), vs. just paying for the engineers at AWS to do it for you.

In many Kafka deployments, the human costs related to this part of your stack alone could easily run into the high hundreds of thousands of dollars per year.

The comment below is right: That ops work still has to be done by someone if you’re outsourcing it to Amazon, but it’s probably fair to say that Amazon has more expertise running Kinesis than your company will ever have running Kafka, plus the multi-tenancy of Kinesis gives Amazon’s ops team significant economies of scale.

Advantage: Kinesis, by a mile.

Ongoing ops (machine costs)

This one is hard to pin down, as the only way to be certain for your use case is to build fully-functional deployments on Kafka and on Kinesis, then load-test them both for costs. This is worthwhile for some investments, but not others. Still, we can make an educated guess. Because Kafka exposes low-level interfaces, and you have access to the Linux OS itself, Kafka is much more tunable. This means (if you invest the human time) your costs can go down over time based on your team’s learning, seeing your workload in production, and optimizing for your particular usage. With Kinesis, your costs will probably go down over time automatically because that’s how AWS as a business tends to work, but that cost reduction curve won’t be tailored to your workload (mathematically, it’ll work more like an averaging-out of the various ways Amazon’s other customers are using Kinesis, which means the more typical your workload is for them, the more you’ll benefit from AWS’ inevitable price reductions).

Meanwhile (and this is quite like comparing cloud instance costs, e.g. EC2, to dedicated hardware costs) there’s the utilization question: to what degree are you paying for unused machine/instance capacity? On this front, Kinesis has the standard advantage of all multi-tenant services, from Heroku and SendGrid to commuter trains and HOV lanes: it is far less likely to be over-provisioned than a single-tenant alternative would be, which means a given project’s cost curve can much better match the shape of its usage curve. Yes, the vendor makes a profit margin on your usage, but AWS (and all of Amazon, really) is a classic example of penetration pricing, never focused on extracting big margins.
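To make the utilization question a bit more concrete, here is a rough sketch in Python. Every number in it is made up for illustration (the load profile, the unit cost, the headroom factor, and the vendor markup are all assumptions, not real Kafka or Kinesis figures). It assumes a single-tenant cluster has to be provisioned for peak load plus some headroom, while a multi-tenant service bills roughly in proportion to usage at a marked-up unit price.

```python
# Hypothetical hourly load over one day (arbitrary throughput units):
# quiet overnight, busier during the day, with a two-hour spike.
hourly_load = [10] * 8 + [40] * 4 + [100] * 2 + [40] * 6 + [10] * 4


def provisioned_cost(load, unit_cost, headroom=1.25):
    """Single-tenant: pay for peak capacity (plus headroom) every hour."""
    capacity = max(load) * headroom
    return capacity * unit_cost * len(load)


def usage_based_cost(load, unit_cost, vendor_markup=1.5):
    """Multi-tenant: pay only for what is used, at a marked-up unit price."""
    return sum(load) * unit_cost * vendor_markup


def utilization(load, headroom=1.25):
    """Fraction of the provisioned capacity that is actually used."""
    return sum(load) / (max(load) * headroom * len(load))


UNIT_COST = 0.01  # assumed cost per unit of capacity per hour

self_hosted = provisioned_cost(hourly_load, UNIT_COST)
managed = usage_based_cost(hourly_load, UNIT_COST)

print(f"provisioned for peak: ${self_hosted:.2f}/day")
print(f"billed by usage:      ${managed:.2f}/day")
print(f"utilization of the provisioned cluster: {utilization(hourly_load):.0%}")
```

The spikier the load, the lower the utilization of a peak-provisioned cluster, and the more room the vendor's markup has before the usage-based bill catches up. A flat, predictable workload narrows or reverses that gap, which is one way to read "special snowflake" below.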

Advantage: Probably Kinesis, unless your project is a super special snowflake.

Incident Risk

Your risks of production issues will be far lower with Kinesis, as others have answered here.

After your team has built up a few hundred engineer-years of managing your Kafka cluster — or if you can find a way to hire this rare and valuable expertise from the outside — these risks will decline significantly, so long as you’re also investing in really good monitoring, alerting, 24-hour pager rotations, etc. The learning curve will be less steep if your team also manages other heavy distributed systems.

But between go-live and the point where you have grown or acquired that expertise, can you afford outages and lost data? The impact depends on your case and where it fits into your business. The risk is difficult to model mathematically, because if you could predict a given service outage or data loss incident well enough to model its impact, you’d know enough to avoid the incident entirely.

Advantage: Kinesis

Conclusion

In conclusion, the TCO is probably significantly lower for Kinesis. So is the risk. And in most projects, risk-adjusted TCO should be the final arbiter.
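For readers who want to plug in their own estimates, here is one minimal way to sketch "risk-adjusted TCO" in Python. The structure simply adds up the cost categories discussed above (setup, ongoing human ops, ongoing machine costs) plus an expected incident cost. The field names, the three-year horizon, and all of the dollar figures are placeholders I made up for illustration, not measurements from Keen IO or from AWS.

```python
from dataclasses import dataclass


@dataclass
class Option:
    name: str
    setup_cost: float          # one-time human cost to reach production
    ops_human_per_year: float  # on-call rotation, run books, tuning
    machine_per_year: float    # instances or service fees
    incidents_per_year: float  # expected number of serious incidents
    cost_per_incident: float   # estimated business impact of each one


def risk_adjusted_tco(opt: Option, years: int = 3) -> float:
    """Setup cost plus recurring and expected incident costs over the horizon."""
    recurring = opt.ops_human_per_year + opt.machine_per_year
    expected_risk = opt.incidents_per_year * opt.cost_per_incident
    return opt.setup_cost + years * (recurring + expected_risk)


# Placeholder figures only; substitute your own estimates.
self_hosted = Option("self-hosted Kafka", 50_000, 400_000, 60_000, 2.0, 25_000)
managed = Option("managed Kinesis", 5_000, 40_000, 90_000, 0.5, 25_000)

for opt in (self_hosted, managed):
    print(f"{opt.name}: ${risk_adjusted_tco(opt):,.0f} over 3 years")
```

The hard part is the incident inputs, for the reason given under Incident Risk: they are guesses. Treat the output as a way to compare scenarios under different assumptions, not as a forecast.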

Addendum

So why do my team and I use Kafka, despite the fact that the risk-adjusted TCO may be higher?

The first answer is historical: Kinesis was announced in November 2013, which was well after we had built on Kafka. But we would almost certainly choose Kafka even if we were making the call today.

Three core reasons:

  • Event streaming is extremely core to what we do at our company. In the vast majority of use cases, data engineering is auxiliary to the product, but for us it is the product: one of our products is called Keen Streams, and is itself a large-scale streaming event data input + transformation + enrichment + output service. Kafka helps power the backbone of the product, so tunability is key for our case.
  • Nothing is more tunable than running an open source project on your own stack, where you can instrument and tweak any layer of the stack (on top of Kafka, within Kafka, code in the Linux boxes underneath, and configuration of those boxes to conform to a variety of workloads). And because what we sell is somewhere between PaaS and IaaS ourselves, and because performance is a product feature for us as opposed to an auxiliary nice-to-have on an internal tool, we’ve chosen to invest heavily into that tuning and into the talent base to perform that tuning.
  • Apache Kafka is open source and can be deployed anywhere. Given that infrastructure cost is a key input to our gross margins, we enjoy a lot of benefits by being able to deploy into various environments — we’re currently running in multiple data-centers in both IBM and AWS. Meanwhile, data location is a key input to some enterprise customers’ decision-making process, so it’s valuable for us to maintain control over where all of our services, including the event queue itself, are deployed.

At Keen IO, we built a massively scalable event database that allows you to stream, store, compute, and visualize all via our lovingly-crafted APIs. Keen’s platform uses a combination of Tornado, Apache Storm, Apache Kafka, and Apache Cassandra, which allows for a highly available and scalable, distributed database.
