Deploying your own Kafka cluster in AWS via Terraform and Ansible
A little bit of context
Last year we started a company-wide effort of migrating all our infrastructure from a traditional data center to AWS (Amazon Web Services). This effort is still ongoing, but we can share some experiences from the journey so far.
I’m part of the big-data team and, before AWS, our team had at its disposal a 5-machine cluster running the Hortonworks distribution of Hadoop. We used this cluster for running all our “big data” services (HBase, Kafka, and NiFi), performing all our on-demand computation (Spark), and storing all our data (HDFS). As you can imagine, it was starting to get a bit crowded.
Moving to AWS gave us the chance to make our lives easier — with some initial significant effort, I admit. We are now much more flexible in our deployment: we can easily increase/decrease the resources of a cluster based on the expected load and deploy a new cluster if we need to test some new configurations. This, of course, comes at a cost — as does everything in AWS ;)
We recently finished moving all our big-data infrastructure to AWS, which now includes, for each environment (beta-testing and production):
- 1 Kafka cluster with its own dedicated Zookeeper ensemble
- 1 NiFi cluster with its own dedicated Zookeeper ensemble
- 1 HBase EMR-based cluster
- on-demand Spark EMR-based clusters
In this post, we will describe how we deploy a Kafka cluster with its own dedicated Zookeeper ensemble.
The problem
Let’s focus on how to get a Kafka cluster up and running in AWS. If you have used Kafka before, you know that it requires Zookeeper to coordinate the brokers. Therefore, the problem that we are trying to solve is actually starting a Zookeeper ensemble and a Kafka cluster in AWS.
What tools to use
Our infrastructure team actually led the decision on this matter. They started the migration of our company infrastructure using Terraform as it fulfills all our current requirements.
To provision the machines, the infra team has always used Ansible, and we decided to adopt it as well, since it lets us rely on a common codebase (e.g., mounting disks, setting up users and permissions).
Our solution
You can find the code for the described solution in our blog’s GitHub repository. Disclaimer: the code is not to be considered production-ready; it is meant to provide a starting point for setting up your Kafka cluster, but it will need a few edits to be adapted to each specific scenario.
We decided to take a two-step approach:
- We start all the necessary AWS resources using Terraform: security groups, EC2 instances, EBS volumes, and so on.
- We deploy the necessary software on each of the instances and start the services using Ansible.
In the beginning, we thought of using a simpler one-step approach and doing both infrastructure creation and software deployment using Terraform (e.g., via provisioners). We decided, however, to go with the two-step solution, as Ansible gives us much more freedom in provisioning. Also, as mentioned earlier, doing the provisioning via Ansible allows us to reuse part of the codebase used for other infrastructure pieces.
Step 1: Creating resources via Terraform
First things first, we need to create the EC2 instances which will be part of our cluster. As mentioned earlier, we want to instantiate a Kafka cluster composed of N brokers (we use 3 in this example) and a serving Zookeeper ensemble composed of M nodes (we use 3 here too).
Security Groups
We assume that a VPC with at least one public subnet is already set up in the AWS account — how to actually do that would be a matter for a whole other post. However, we will need to create 3 new security groups:
- kafka-security-group: our Kafka brokers will live in this security group; therefore, we need to grant access to this security group from all Kafka producers/consumers which will use the cluster. Moreover, the Kafka brokers need to talk to each other, so we will also need to open the Kafka ports within the security group (usually 9092).
- zookeeper-security-group: our Zookeeper instances will live in this security group; therefore, we need to grant access to this security group from all Kafka brokers which use Zookeeper as coordinator. We will also need to allow communication among Zookeeper nodes on the relevant ports (usually 2888 and 3888).
- kafka-cluster-security-group: this is not strictly necessary, but it makes things easier, as both our Kafka brokers and Zookeeper nodes will live in this security group and we can set up common communication patterns — for example, allowing SSH access from maintainers’ and developers’ machines.
We can easily create those security groups using Terraform; for example, the zookeeper-security-group can be defined using something like the following:
resource "aws_security_group" "zookeeper" {
name = "zookeeper-security-group"
description = "Allow Zookeeper traffic"
vpc_id = "${data.aws_vpc.vpc.id}" ingress {
from_port = 2181
to_port = 2181
protocol = "tcp"
security_groups = ${data.aws_security_group.kafka.id}
}
ingress {
from_port = 2888
to_port = 2888
protocol = "tcp"
self = true
}
ingress {
from_port = 3888
to_port = 3888
protocol = "tcp"
self = true
}
}
The problem with having all the rules defined within the security group itself is that when you change the rules, Terraform will likely destroy and recreate the security group itself and detach/reattach it to all the affected instances. To overcome this issue, we create an empty security group and then use the Terraform resource aws_security_group_rule
to create each rule independently and attach it to the security group; something like this:
resource "aws_security_group" "zookeeper" {
name = "zookeeper-security-group"
description = "Allow Zookeeper traffic"
vpc_id = "${data.aws_vpc.vpc.id}"
}resource "aws_security_group_rule" "allow_zookeeper_quorum" {
type = "ingress"
from_port = "2181"
to_port = "2181"
protocol = "tcp"
source_security_group_id = "${data.aws_security_group.kafka.id}"
security_group_id = "${aws_security_group.zookeeper.id}"
}
This way you can add or remove rules to/from the security group without having to worry about Terraform destroying and recreating the security group itself, and no changes will be made to any instance to which the security group is attached.
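For example, the intra-ensemble rules shown inline earlier can also be expressed as standalone rules — here is a minimal sketch for the 2888 peer port (the resource name is ours; the 3888 rule would be analogous):

resource "aws_security_group_rule" "allow_zookeeper_peer_2888" {
  type              = "ingress"
  from_port         = "2888"
  to_port           = "2888"
  protocol          = "tcp"
  self              = true
  security_group_id = "${aws_security_group.zookeeper.id}"
}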
For the complete definition of the security groups, see the GitHub repo.
Kafka Brokers and Zookeeper Nodes
Now we can create the actual Kafka brokers and Zookeeper nodes; the Terraform resource aws_instance will come in handy here. There are a few attributes which we need to specify:
- ami: we have our own AMI (created by our infra team) which we use for most of our AWS instances and which is customized for our needs (mainly users and groups). However, any Ubuntu-18-based AMI will do the job.
- instance_type: we can definitely go small here, but not too small. We will need an instance with at least 2GB of memory for Kafka to run properly; Zookeeper is much less demanding and a smaller instance will suffice.
- associate_public_ip_address: we want this set to true to be able to reach all the Kafka and Zookeeper instances from outside of our VPC (e.g., for maintenance).
- vpc_security_group_ids: here we can specify a list of security groups; for example, we will need to list the IDs of kafka-security-group and kafka-cluster-security-group for our Kafka brokers.
- availability_zone: the availability zone needs to be the same for both Kafka and Zookeeper instances.
- key_name: last, but not least, don’t forget to assign a key to the instance to be able to SSH to it.
We will, of course, want to use variables to set most of the attributes listed above, so our Terraform code will look something like the following:
resource "aws_instance" "zookeeper" {
ami = "${data.aws_ami.image_latest.id}"
instance_type = "${var.instance_type}"
key_name = "${var.key_name}" subnet_id =
"${tolist(data.aws_subnet_ids.public_subnets.ids)[2]}"
availability_zone = "${var.availability_zone}"
associate_public_ip_address = true
vpc_security_group_ids =
["${data.aws_security_group.kafka_cluster.id}",
"${aws_security_group.zookeeper.id}"] depends_on = ["aws_security_group.zookeeper"]
count = "${var.instance_count}"
}
We use the count attribute to create as many instances of the same type as we like. For example, for Zookeeper, we will set the variable instance_count to 3, so that we will create 3 identical nodes for our Zookeeper ensemble.
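For reference, the variables used above could be declared along these lines — a minimal sketch in which the defaults (instance type and count) are our own illustrative choices, not values taken from the repo:

variable "instance_type" {
  description = "EC2 instance type for the Zookeeper nodes"
  default     = "t3.small"
}

variable "instance_count" {
  description = "Number of nodes in the Zookeeper ensemble"
  default     = 3
}

variable "availability_zone" {
  description = "Availability zone shared by the Kafka brokers and Zookeeper nodes"
}

variable "key_name" {
  description = "Name of the EC2 key pair used for SSH access"
}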
A word about storage
As Kafka will likely store some data (depending on your retention policy), the Kafka brokers will have more or less demanding storage needs. In our example code on GitHub, we simply define the root_block_device block of the aws_instance with a predefined size:
root_block_device {
  volume_size = "${var.volume_size}"
}
However, in a real deployment, we will probably want to attach independent EBS volumes to our Kafka instances and size them appropriately — the Kafka documentation suggests using multiple disks for data storage to increase throughput. Showing how to set up volumes is outside the scope of this post, but we refer you to the Terraform aws_ebs_volume and aws_volume_attachment resources. Also, consider that adding EBS volumes will require some extra steps in Ansible to mount the devices.
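As a rough sketch of what that could look like (the resource names, volume size, device name, and the aws_instance.kafka reference are illustrative assumptions, not taken from the repo):

resource "aws_ebs_volume" "kafka_data" {
  availability_zone = "${var.availability_zone}"
  size              = 500
  type              = "gp2"
  count             = "${var.instance_count}"
}

resource "aws_volume_attachment" "kafka_data" {
  device_name = "/dev/xvdf"
  volume_id   = "${element(aws_ebs_volume.kafka_data.*.id, count.index)}"
  instance_id = "${element(aws_instance.kafka.*.id, count.index)}"
  count       = "${var.instance_count}"
}

The attached volumes would still need to be formatted and mounted by Ansible afterwards.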
Now we have all the resources and networking that we need for our cluster to operate. The next step will be to actually set up the software on the instances and start the appropriate services.
Step 2: Provisioning software via Ansible
We need Kafka and Zookeeper to be installed on our bare instances before we can do anything with them. We will use the Confluent distribution of both Kafka and Zookeeper to keep our setup more standard.
We recently found out that Confluent provides a series of Ansible playbooks which can be used (after some tuning) to set up the desired Confluent services. You can find them here: Confluent Ansible playbooks. We suggest you take a look there for inspiration; however, in the following paragraphs, we will guide you through the steps necessary to install Kafka and Zookeeper.
First, we need to define the groups and roles to assign to the instances we created in the previous step. We do this by defining an Ansible inventory that will look something like this:
[kafka]
test-kafka-broker-001
test-kafka-broker-002
test-kafka-broker-003

[zookeeper]
test-kafka-zookeeper-001
test-kafka-zookeeper-002
test-kafka-zookeeper-003
For each group (kafka and zookeeper) we list the hostnames of the instances belonging to it. For each of those groups, we also define a respective role which contains the actual Ansible steps.
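A top-level playbook tying those groups to their roles could look roughly like this — a minimal sketch in which the role names and the become setting are assumptions, not taken from the repo:

- hosts: zookeeper
  become: true
  roles:
    - zookeeper   # installs Confluent, renders zookeeper.properties and myid, starts the service

- hosts: kafka
  become: true
  roles:
    - kafka       # installs Confluent, renders server.properties, starts the brokers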
Configuring Zookeeper
Installing Confluent and starting the Zookeeper service is almost all we need to do here. However, there are a couple of extra steps which need attention:
- Each Zookeeper node needs to know what the full quorum of nodes is; therefore, we need to include in zookeeper.properties the hostnames of all the Zookeeper nodes. We do this via a template step where the template file of zookeeper.properties contains the following:
{% for host in groups[ "zookeeper" ]|sort %}
server.{{loop.index}}={{ hostvars[host].inventory_hostname }}:2888:3888
{% endfor %}
- Each Zookeeper node needs to have a unique integer N assigned (in the file myid), which must be the same as the N in the zookeeper.properties property named server.N. For this we use a similar construct:
{% for host in groups[ "zookeeper" ]|sort %}
{% if hostvars[host].ansible_default_ipv4.address == ansible_default_ipv4.address %}
{{loop.index}}
{% endif %}
{% endfor %}
Note the use of sort after retrieving the Zookeeper group: sorting is crucial so that each Zookeeper node gets the same N in its myid file as is reported for it in the zookeeper.properties of all the other nodes.
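To make this concrete: with the 3-node inventory above, the rendered quorum section of zookeeper.properties would be identical on every node, and the myid file on test-kafka-zookeeper-002, for example, would simply contain 2:

server.1=test-kafka-zookeeper-001:2888:3888
server.2=test-kafka-zookeeper-002:2888:3888
server.3=test-kafka-zookeeper-003:2888:3888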
Configuring Kafka
We can reuse the same steps for installing Confluent. Then, we need to configure the Kafka service via templating of server.properties. A couple of crucial points:
- We use the automatic generation of broker IDs by setting the property broker.id.generation.enable to true.
- We set up the listeners; we need an SSL listener for connections coming from outside the VPC (producers and consumers), and a plaintext listener for connections from within the cluster:
listeners=SSL-PUBLIC://{{ inventory_hostname }}:9092,PLAINTEXT-LOCAL://{{ inventory_hostname }}:9094
- We configure the connection to Zookeeper by listing all the instances of the quorum:
zookeeper.connect={% for host in groups[ "zookeeper" ]|sort -%}
{% set delimiter = "" if loop.last else "," %}
{{ hostvars[host].inventory_hostname~":"~zk_client_port~delimiter }}
{%- endfor %}
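Again with the inventory above, and assuming zk_client_port is set to 2181 (the client port we opened in the Zookeeper security group), the rendered line would come out as:

zookeeper.connect=test-kafka-zookeeper-001:2181,test-kafka-zookeeper-002:2181,test-kafka-zookeeper-003:2181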
Securing the connection to Kafka
In our setup, the only listener reachable from outside the VPC is the SSL one, so that data is encrypted in transit. Some extra Ansible steps are needed for setting up a secure connection. We do not cover them here, but they are reported in the example code in the GitHub repo.
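For orientation only: because the listeners above use custom names, server.properties also has to map each listener name to a security protocol and point the brokers at a keystore/truststore. A minimal sketch — the paths, placeholders, and the choice of inter-broker listener are assumptions, not the repo’s actual values:

listener.security.protocol.map=SSL-PUBLIC:SSL,PLAINTEXT-LOCAL:PLAINTEXT
inter.broker.listener.name=PLAINTEXT-LOCAL
ssl.keystore.location=/etc/kafka/ssl/kafka.broker.keystore.jks
ssl.keystore.password=<keystore-password>
ssl.key.password=<key-password>
ssl.truststore.location=/etc/kafka/ssl/kafka.broker.truststore.jks
ssl.truststore.password=<truststore-password>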
Summary
In this blog post, we described how we deployed our own Kafka cluster with a dedicated Zookeeper ensemble. We used Terraform to create the required resources and Ansible to provision the necessary software.
We provided example code on GitHub which contains all the code and steps described here, plus some extra required parts. The code should not be considered ready-to-use, but more like a tasting menu: take something here and something there depending on your needs.
We hope you enjoyed the post and that it helps those of you who also want to build your own self-managed Kafka cluster.