Kafka — Event Hub project

Rodrigo Kellermann · Published in gb.tech · Apr 7, 2022

Event Hub project

Kafka is a new platform at Grupo Boticário (GB). We started building it in 2021 as a project called “Event Hub”. /*we are hiring!*/

Event Hub aims to collect, across the entire company, data that is relevant to more than one area/system. The purpose is to make that data available inside Kafka and let the interested parties pick it up from there.

During the project's start phase, we discussed how to build the first Kafka platform. For many reasons, including our strong AWS relationship and footprint, we decided to start with Amazon Managed Streaming for Apache Kafka (AWS MSK).

We could start small, billing would go through our existing AWS billing process, and we would take advantage of the automations/integrations already delivered by AWS MSK.

AWS MSK

The second important question was: are we going for centralized schema control with Kafka?

We considered that important, and we decided to try out Avro schemas with the Confluent Schema Registry.

So, we had Kafka and Avro to begin with. What would happen once people started coding and throwing data into Kafka? There would be many questions and not many standards set.

What if the Event Hub project team created some code called “reference applications”?

So, applications with producers and consumers were created using NodeJS, Java/Kotlin, Python, etc., following what we defined as our standards for configurations/libraries/definitions.

This way, developers who are going to put real code into production can look at the reference apps as a starting point and, of course, reach out to the team with additional questions.
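As an illustration, a minimal KafkaJS consumer in the spirit of those reference apps might look like the sketch below; the client id, group id, topic name, and broker address are hypothetical placeholders, not the actual GB standards.

```typescript
import { Kafka } from 'kafkajs'

// Hypothetical client/broker settings; the real reference apps define
// the company standards for these values.
const kafka = new Kafka({
  clientId: 'reference-consumer',
  brokers: ['b-1.example.kafka.us-east-1.amazonaws.com:9096'],
  ssl: true,
})

const consumer = kafka.consumer({ groupId: 'reference-consumer-group' })

async function main() {
  await consumer.connect()
  await consumer.subscribe({ topics: ['orders.sub'], fromBeginning: false })
  await consumer.run({
    // Called once per message; offsets are committed automatically.
    eachMessage: async ({ topic, partition, message }) => {
      console.log(`${topic}[${partition}] ${message.value?.toString()}`)
    },
  })
}

main().catch(console.error)
```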

Once people started coding, we discovered new behaviors and issues.

For instance, KafkaJS has a behavior where it always connects to the first broker in the bootstrap list, so a function that randomizes the list was provided to avoid it.
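A minimal sketch of that workaround, assuming KafkaJS and hypothetical MSK broker hostnames: shuffling the bootstrap list before handing it to the client means each instance starts from a random seed broker.

```typescript
import { Kafka } from 'kafkajs'

// Hypothetical MSK bootstrap brokers.
const brokers = [
  'b-1.example.kafka.us-east-1.amazonaws.com:9096',
  'b-2.example.kafka.us-east-1.amazonaws.com:9096',
  'b-3.example.kafka.us-east-1.amazonaws.com:9096',
]

// Fisher-Yates shuffle, so clients don't all hit the first broker.
function shuffle<T>(items: T[]): T[] {
  const copy = [...items]
  for (let i = copy.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1))
    ;[copy[i], copy[j]] = [copy[j], copy[i]]
  }
  return copy
}

const kafka = new Kafka({
  clientId: 'reference-producer',
  brokers: shuffle(brokers),
  ssl: true,
})
```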

Lambda code connecting to Kafka: if you “connect” and “disconnect” inside the handler, every invocation floods Kafka with new TCP connections (which increases broker CPU load). So, we defined creating the connection outside the Lambda handler as a good practice.
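A sketch of that practice, again assuming KafkaJS; the broker address and topic name are placeholders:

```typescript
import { Kafka, Producer } from 'kafkajs'

// Created once per Lambda execution environment, outside the handler,
// so warm invocations reuse the same TCP connections.
const kafka = new Kafka({
  clientId: 'lambda-producer',
  brokers: ['b-1.example.kafka.us-east-1.amazonaws.com:9096'], // placeholder
  ssl: true,
})
const producer: Producer = kafka.producer()
const connecting = producer.connect() // starts connecting at cold start

export const handler = async (event: { message: string }) => {
  await connecting // make sure the connection is established
  await producer.send({
    topic: 'orders.pub', // placeholder topic
    messages: [{ value: event.message }],
  })
  // Intentionally no disconnect(): tearing the connection down on every
  // invocation is exactly what floods the brokers with new connections.
}
```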

MSK deployment

The MSK environment is currently composed of 6 brokers distributed over 3 availability zones. We use topic defaults of 6 partitions per topic and a replication factor of 3.
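For illustration, creating a topic with those defaults through the KafkaJS admin client would look roughly like this (the topic name and broker address are hypothetical):

```typescript
import { Kafka } from 'kafkajs'

const admin = new Kafka({
  clientId: 'eventhub-admin',
  brokers: ['b-1.example.kafka.us-east-1.amazonaws.com:9096'], // placeholder
  ssl: true,
}).admin()

async function createTopic() {
  await admin.connect()
  await admin.createTopics({
    topics: [{
      topic: 'orders.created', // hypothetical topic name
      numPartitions: 6,        // our topic default
      replicationFactor: 3,    // one replica per availability zone
    }],
  })
  await admin.disconnect()
}

createTopic().catch(console.error)
```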

Storage autoscaling is enabled and triggers at 60% disk usage.

From a security perspective, all communication uses TLS, and authentication uses SASL/SCRAM integrated with AWS Secrets Manager.
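In client terms, that combination looks roughly like the KafkaJS configuration sketch below; in our setup the credentials come from Secrets Manager, so the environment variables here are just placeholders.

```typescript
import { Kafka } from 'kafkajs'

const kafka = new Kafka({
  clientId: 'reference-app',
  brokers: ['b-1.example.kafka.us-east-1.amazonaws.com:9096'], // placeholder
  ssl: true, // all traffic over TLS
  sasl: {
    mechanism: 'scram-sha-512', // MSK SASL/SCRAM uses SCRAM-SHA-512
    // These credentials live in AWS Secrets Manager; environment
    // variables are only a stand-in for this sketch.
    username: process.env.KAFKA_USERNAME!,
    password: process.env.KAFKA_PASSWORD!,
  },
})
```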

To help with MSK sizing, AWS provides this spreadsheet.

The MSK best practices guide is another very important link to help you get started.

So far, we haven't hit a situation where the default partition count per topic was not enough to handle the required performance, but it's something to keep in mind.


Schema Registry Environment

It's built on Kubernetes using the CP Helm Charts. An internal endpoint is exposed through an AWS load balancer, and the schema definitions themselves are stored in a Kafka topic.
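For illustration, registering an Avro schema against that endpoint uses the Schema Registry REST API; the registry URL, subject, and schema below are hypothetical.

```typescript
// POST /subjects/{subject}/versions is the standard Schema Registry API
// for registering a new schema version.
const registryUrl = 'http://schema-registry.internal:8081' // placeholder

const schema = {
  type: 'record',
  name: 'OrderCreated',
  namespace: 'com.example.eventhub', // hypothetical namespace
  fields: [
    { name: 'orderId', type: 'string' },
    { name: 'amount', type: 'double' },
  ],
}

async function registerSchema() {
  const res = await fetch(`${registryUrl}/subjects/orders-value/versions`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/vnd.schemaregistry.v1+json' },
    body: JSON.stringify({ schema: JSON.stringify(schema) }),
  })
  console.log(await res.json()) // e.g. { id: 42 } — the global schema id
}

registerSchema().catch(console.error)
```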

Monitoring && Tools

We decided to keep using New Relic (NR), as with the company's other platforms, so we use CloudWatch Metric Streams to send MSK monitoring data to NR.

Besides that, we felt some tools were missing, so we started exploring and deployed:

Cluster Manager for Apache Kafka (CMAK), created by Yahoo: an admin tool where you can perform operations on the cluster (create topics, reassign partitions, etc.).

Cruise Control for Apache Kafka, created by LinkedIn: it helps keep Kafka healthy, showing unbalanced brokers, failed partitions, performance issues, etc.

Eventhub Registry Engine, an internal tool created to help provision new topics and new schemas. Basically, you open a Pull Request (PR) and the engine checks whether your proposal complies with its rules (topic name, schema version, etc.) and then creates the resources on Kafka and the Schema Registry. This way, we avoid manual toil for DevOps and keep our standards.

Eventhub Endpoint Discovery, an internal Lambda that periodically calls “get bootstrap brokers” on MSK and updates some internal DNS endpoints. We defined DNS entries for “*.pub” and “*.sub” (see the sketch after this list).
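The following is a rough sketch of what such a discovery Lambda could look like, using the AWS SDK for JavaScript v3; the cluster ARN, hosted zone, and record naming scheme are hypothetical stand-ins for the internal setup.

```typescript
import { KafkaClient, GetBootstrapBrokersCommand } from '@aws-sdk/client-kafka'
import {
  Route53Client,
  ChangeResourceRecordSetsCommand,
} from '@aws-sdk/client-route-53'

const kafka = new KafkaClient({})
const route53 = new Route53Client({})

// Hypothetical identifiers; the real ARN, zone, and naming are internal.
const CLUSTER_ARN = process.env.CLUSTER_ARN!
const HOSTED_ZONE_ID = process.env.HOSTED_ZONE_ID!

export const handler = async () => {
  // Ask MSK for the current bootstrap brokers (they can change after
  // maintenance or broker replacement).
  const { BootstrapBrokerStringSaslScram } = await kafka.send(
    new GetBootstrapBrokersCommand({ ClusterArn: CLUSTER_ARN }),
  )
  const brokers = (BootstrapBrokerStringSaslScram ?? '').split(',')

  // Upsert one CNAME per broker under a stable internal name.
  const Changes = brokers.map((hostPort, i) => ({
    Action: 'UPSERT' as const,
    ResourceRecordSet: {
      Name: `b${i + 1}.eventhub.pub.internal.`, // hypothetical naming
      Type: 'CNAME' as const,
      TTL: 60,
      ResourceRecords: [{ Value: hostPort.split(':')[0] }],
    },
  }))

  await route53.send(
    new ChangeResourceRecordSetsCommand({
      HostedZoneId: HOSTED_ZONE_ID,
      ChangeBatch: { Changes },
    }),
  )
}
```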

What is next?

Kafka REST Proxy: some teams are testing it with applications that cannot speak the Kafka protocol (see the sketch after this list).

MSK Connect: we have created the first two integrations using managed Kafka Connect from AWS.

Improvements to internal automation tools, providing more features and capabilities to avoid toil for DevOps teams.
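For a sense of what REST Proxy access looks like, here is a hypothetical sketch of producing a JSON message through the Confluent REST Proxy v2 API; the proxy URL and topic are placeholders.

```typescript
// Producing through the Confluent REST Proxy: plain HTTP, no Kafka protocol.
async function produceViaRestProxy() {
  const res = await fetch('http://rest-proxy.internal:8082/topics/orders.pub', {
    method: 'POST',
    headers: { 'Content-Type': 'application/vnd.kafka.json.v2+json' },
    body: JSON.stringify({
      records: [{ value: { orderId: '123', amount: 99.9 } }],
    }),
  })
  console.log(await res.json()) // offsets for the produced records
}

produceViaRestProxy().catch(console.error)
```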

As new applications enter the Kafka world, we see new problems, some of which affect broker performance, so we regularly keep an eye on the metrics in New Relic and on Cruise Control alerts.

Thanks to the whole Event Hub team!
