Kafka on AWS without breaking the bank

Paul Carr
Investing in Tech
Mar 19, 2019 · 7 min read

When I first read the recommended specs of Zookeeper and Kafka, my “Tighty-Sense” was tingling. Surely this was going to cost us a fortune, but through trial and error I found a setup capable of delivering over a million messages per second for under $50 a month.

Research

Like most people, I googled for a guide to setting up Kafka on AWS. I found various best practices and recommendations for memory usage: some suggested a 4G heap for the JVM, Confluent seemed to be suggesting 6G, one guide recommended r4.xlarge as the lowest instance type to consider, and dedicated instance storage was a must. Already sounding very pricey. I’m always dubious of “recommended” specs; I prefer to try things out for myself and really explore the boundaries before committing to a stack.

Then there’s resilience: you really need at least three instances in different AZs, with replication on, to make sure you can recover your data when something goes bang. That’d be three r4.xlarge instances… $574 per month per environment. We have DEV, QA, STAGING and PROD environments, so we’d be looking at around $2,300 per month in EC2 costs alone.
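
As a back-of-the-envelope check (my own arithmetic, assuming on-demand pricing at the time of roughly $0.266/hr for an r4.xlarge):

3 instances x $0.266/hr x 720 hrs ≈ $574 per environment per month
$574 x 4 environments ≈ $2,300 per month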

How much!

Goals

  • Three nodes minimum for quorum.
  • Terraformed immutable infrastructure.
  • Must be able to terminate a node on the fly without losing any messages.
  • Must be easily scalable, should our throughput demand increase.
  • Zero downtime.

I decided on using detachable EBS volumes, allowing us to terminate a node without losing any data, and making resizing the disk a doddle:

resource "null_resource" "zoos" {
count = "${var.zookeeper-max}"
triggers {
dns_name = "zoo${count.index}.example.co.uk:${var.zkClientPort}"
}
}
resource "aws_ebs_volume" "kafkaVolumes" {
count = "${var.zookeeper-max}"
availability_zone = "${element(aws_instance.zookeeper.*.availability_zone ,count.index)}"
size = 80
tags = {
Name = "${var.env}-kafkaVolume-${count.index}"
}
lifecycle {
ignore_changes = ["availability_zone"]
}
}
resource "aws_volume_attachment" "kafka_att" {
count = "${var.zookeeper-max}"
device_name = "/dev/sdh"
volume_id = "${element(aws_ebs_volume.kafkaVolumes.*.id,count.index)}"
instance_id = "${element(aws_instance.zookeeper.*.id,count.index)}"
depends_on = ["aws_instance.zookeeper","aws_ebs_volume.kafkaVolumes"]
lifecycle {
ignore_changes = ["aws_instance.zookeeper"]
}
}
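
The user data then has to format (on first use only) and mount the attached volume. The snippet below is a minimal sketch of what that step might look like rather than our actual user data; the device name and mount point are assumptions. On t3 (Nitro) instances the volume attached as /dev/sdh appears as an NVMe device such as /dev/nvme1n1, which is why the resize command further down targets that device.

#!/bin/bash
# Sketch only: /dev/nvme1n1 and /var/lib/kafka are assumptions, not taken from our setup.
DEVICE=/dev/nvme1n1
MOUNT_POINT=/var/lib/kafka

# Only create a filesystem if the volume is blank, so re-attaching an
# existing volume to a replacement instance keeps its data.
if ! blkid "$DEVICE" > /dev/null 2>&1; then
  mkfs -t ext4 "$DEVICE"
fi

mkdir -p "$MOUNT_POINT"
mount "$DEVICE" "$MOUNT_POINT"

# Persist the mount across reboots.
echo "$DEVICE $MOUNT_POINT ext4 defaults,nofail 0 2" >> /etc/fstab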

First Attempt

By default, Confluent’s ZooKeeper starts up with a 512M heap and Kafka with 1G. The smallest instance I could get this config to function on was a t3.small. Once up & running, I unleashed the producer & consumer performance tests provided in the Confluent package. It was EASILY coping with the load: CPU didn’t move above 1%, OS memory usage sat around 30%, so I decided to try dropping the heap to a ridiculously small 256M and see if I could squeeze it all onto a t3.micro.

Second Attempt

Kafka was easy enough; I changed the Terraform user data to include:

export KAFKA_HEAP_OPTS="-Xmx256M -Xms256M"

And lo and behold, it worked. But then I noticed ZooKeeper was also on 256M. The only successful way I could find to give Kafka and ZooKeeper different heap sizes was to re-export KAFKA_HEAP_OPTS between launching ZooKeeper and Kafka.
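
In user-data terms that ends up looking something like this (a sketch with an assumed Confluent install path; both of Confluent's start scripts pick up KAFKA_HEAP_OPTS, which is why the re-export trick works):

# Sketch only: /opt/confluent is an assumed install location.
export KAFKA_HEAP_OPTS="-Xmx128M -Xms128M"   # ZooKeeper heap
/opt/confluent/bin/zookeeper-server-start -daemon /opt/confluent/etc/kafka/zookeeper.properties

export KAFKA_HEAP_OPTS="-Xmx256M -Xms256M"   # Kafka heap
/opt/confluent/bin/kafka-server-start -daemon /opt/confluent/etc/kafka/server.properties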

So now my Kafka is on 256M, ZooKeeper on 128M, and they’re running on t3.micros, each with a detachable standard EBS volume of 80GB.

Nirvana-ish… as long as I can stop one node without losing any messages, I can increase the instance type node by node. And if disk space is running out, we can increase it on the fly!

Performance Tests

I hammered it with the performance tests, fully expecting something to fall over, but to my amazement after an hour it was all still running! It was producing & consuming ~1.4 million messages per second. Each message was 500 bytes long.

To run the performance test, in one session, start the producer:

./kafka-producer-perf-test --topic test_22Jan19 --num-records 10000000000 --record-size 500 --throughput 2000000 --producer.config ~/producer.properties

Then in a second session, start the consumer:

./kafka-consumer-perf-test  --messages 50000000000 --broker-list=zoo0.dev.example.co.uk:9092,zoo1.dev.example.co.uk:9092,zoo2.dev.example.co.uk:9092 --topic test_22Jan19 --group test_consumer_group2 --num-fetch-threads 6 --reporting-interval 5000 --show-detailed-stats --threads 6 --print-metrics

So that’s attempting to produce and consume 2 million messages per second.

The user data for each EC2 instance installs and configures ZooKeeper, Kafka, node_exporter and kafka_exporter for Prometheus (to give me some idea of what’s going on inside).

Kafka Stats:

It’s managing to produce and consume 1.4 million messages per second (681.25 MB/sec), but I asked for 2 million, so it must be being throttled by EBS bandwidth or network bandwidth. Over the 70 minutes I left it running, the rate was remarkably consistent.
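
For scale, a quick back-of-the-envelope (my own numbers, treating MB as decimal megabytes):

1,400,000 msgs/sec x 500 bytes ≈ 700 MB/sec ≈ 5.6 Gbit/sec (what it achieved)
2,000,000 msgs/sec x 500 bytes ≈ 1 GB/sec ≈ 8 Gbit/sec (what the test asked for)

That’s a lot of bytes to push through a burstable instance, so falling short of the requested 2 million isn’t a shock.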

Here’s what was happening on the boxes themselves:

Node Stats

It managed to push CPU up to ~30%, with huge network traffic as you’d expect, but memory usage was absolutely fine. I suspect this may change once we’re running many topics, with many consumers & producers on the go. It was eating through disk at a rate of 0.77G per minute, and I’d only attached 80G EBS drives… “Oooh,” I thought, “I’ll start the tests again, and whilst they’re running I’ll change the disk size on the fly.”

Disk Resizing Test

Much to my delight, changing the volume size in Terraform does NOT destroy and re-create the volume as I’d expected. So I started the performance tests again, waited a few minutes for the rate to level out, then applied the new volume size of 160GB. Once the Terraform apply was complete, I had to ssh to each instance and run…

sudo resize2fs /dev/nvme1n1

which expands the filesystem to use the newly added disk space.
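
For completeness, the per-broker sequence looks roughly like this (a sketch; the mount point is an assumption, and there’s no growpart step because the filesystem sits directly on the device rather than on a partition):

# after 'terraform apply' has grown the EBS volume to 160GB
lsblk                          # confirm the kernel can see the new size
sudo resize2fs /dev/nvme1n1    # grow the ext4 filesystem into the new space
df -h /var/lib/kafka           # check the extra space is available (mount point assumed)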

Results:

The Kafka stats are exactly the same; it didn’t bat an eyelid at the disk resize.

Node Stats

Same for the node stats, though there is a slight jitter in memory and disk space used at about 12:45, presumably when I applied the change.

The only notable difference in Grafana is file system space; it looks like I increased it just in the nick of time:

What’s the catch?

Too good to be true? Well, something had to give. When I checked the CloudWatch metrics for the Kafka instances, it was clear that this throughput wasn’t sustainable forever.

CPU Credit Balance
EBS Byte Balance %

So, as you can see, at that rate our poor little t3.micro would eventually run out of CPU credits and EBS byte balance. CPU credits aren’t really a problem, as we can turn on “unlimited” credits, which is kinda like an overdraft: you can dip into it, but they’ll charge through the nose for it.

EBS byte balance is a pain though. The results showed that the byte balance would run out after 769 minutes (12.8 hrs). But… hang on, the majority of our customers are in one country, and the majority of those go to bed at night, so the overnight load is going to be negligible, giving the instances time to recharge their byte balance before the next day’s onslaught begins. T3.micros have a MAX credit balance of 288 and recharge at 12 credits per hour… so as long as everyone sleeps for (288 / 12) 24 hours a day, we’re laughing. 😜

Seriously though, it’s very difficult to predict what the message throughput will need to be in advance. I can safely say it’ll be nowhere near a million per second.

Summary

To sum up: we haven’t put Kafka live yet, there are only a handful of user stories that require a messaging solution, and we’ve not the foggiest how many messages per second we’ll need to support. But by being able to increase instance type, bandwidth, and disk size ON THE FLY, and by having good alerts from Grafana, we can safely start off on a ridiculously feeble setup and scale vertically as and when required.

This initial feeble setup may end up being more than adequate for our needs, and coming in at ~$22 per month in EC2 costs and ~$20 in EBS storage, I think it’d be extremely difficult to get more bang for the buck.
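
For what it’s worth, the EC2 figure stacks up on the back of an envelope (assuming on-demand pricing at the time of roughly $0.0104/hr for a t3.micro):

3 instances x $0.0104/hr x 720 hrs ≈ $22 per month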

NEXT UP: try the same tests with multiple topics and consumers.

Paul Carr
DevOps Tech Lead striving for value for money from cloud services.