Event-Driven Architecture at Scale Using Kafka

Amit Sharma
Engineered @ Publicis Sapient
8 min read · Jan 2, 2023


Introduction

You are a digital technology leader who is either starting a digital transformation (perhaps pandemic-forced) or already neck deep in one. Your focus is to gain a competitive advantage in your field by building highly robust, scalable, and agile applications. Event-driven architecture (EDA) helps achieve that level of non-functional quality by requiring minimal coupling between systems while still allowing them to interact, which makes it ideal for modern, distributed systems. Because modern application designs are often event-driven, they require an advanced understanding of event-driven architecture.

Event-driven microservices allow services to communicate in real time, enabling data to be consumed in the form of events before it is even requested. This approach extends the core principles of microservices, including decoupling, separation of concerns, agility, and real-time streaming of event data.

There are many technologies in this space: open source, cloud-native, and vendor-specific. Based on our experience, Kafka is a great fit for most use cases requiring scale because of its distributed nature. A vast number of companies use it to build high-performance data pipelines, enable real-time data analysis, and integrate data from critical applications.

Pitfalls faced in most Large Kafka Implementations

Error Handling/Reliability

We know this is the most important quality of modern applications, but with the many configurations available in Kafka, it gets complicated to select the right combination. The right circuit-breaking settings with exponential backoff and jitter can give you a resilient application, but true reliability in a distributed system also requires attention to producer and consumer configurations such as acks, consumer delivery semantics, and offset commit strategies. Weighing throughput (best effort) against reliability (never lose my message) is fair game based on the use case.
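
As a minimal sketch of reliability-leaning client settings (the broker address, group name, and exact values are illustrative assumptions, not universal recommendations):

import java.util.Properties;

// Reliability-first client settings; tune values for your own throughput and latency needs.
public class ReliabilityConfigs {

    public static Properties producerProps() {
        Properties p = new Properties();
        p.put("bootstrap.servers", "broker1:9092"); // hypothetical broker address
        p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("acks", "all");                       // wait for all in-sync replicas
        p.put("enable.idempotence", "true");        // retries will not introduce duplicates
        p.put("delivery.timeout.ms", "120000");     // upper bound on send attempts plus retries
        return p;
    }

    public static Properties consumerProps() {
        Properties c = new Properties();
        c.put("bootstrap.servers", "broker1:9092");
        c.put("group.id", "orders-processor");      // hypothetical consumer group
        c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        c.put("enable.auto.commit", "false");       // commit offsets only after successful processing
        return c;
    }
}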

Schema Evolution

A message's schematic sanctity is a gigantic problem in distributed applications, where producer and consumer applications don't live within the same product boundaries. A lot of implementations fail in this space because testing, and retrieving information from each other, becomes unmanageable. A design-first, schema-led approach is the best solution to this problem, although it still requires collaboration across product teams to get right. This ensures that a well-formatted and valid message is exchanged between a producer and a consumer. Data semantics checks can be added on top to achieve data quality. Doesn't that sound amazing?
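
For example, here is a sketch of a backward-compatible schema change using Avro (the record name and fields are hypothetical): the new field carries a default, so consumers on the old schema can still read new messages, and a registry configured for backward compatibility would accept the change.

import org.apache.avro.Schema;

// Schema evolution sketch: v2 adds an optional field with a default value.
public class SchemaEvolutionExample {

    static final String V1 =
        "{\"type\":\"record\",\"name\":\"OrderCreated\",\"fields\":["
      + "{\"name\":\"orderId\",\"type\":\"string\"},"
      + "{\"name\":\"amount\",\"type\":\"double\"}]}";

    static final String V2 =
        "{\"type\":\"record\",\"name\":\"OrderCreated\",\"fields\":["
      + "{\"name\":\"orderId\",\"type\":\"string\"},"
      + "{\"name\":\"amount\",\"type\":\"double\"},"
      + "{\"name\":\"currency\",\"type\":\"string\",\"default\":\"USD\"}]}"; // new field with default

    public static void main(String[] args) {
        // Use separate parsers because Avro's parser caches type names.
        Schema v1 = new Schema.Parser().parse(V1);
        Schema v2 = new Schema.Parser().parse(V2);
        System.out.println("v1 fields: " + v1.getFields());
        System.out.println("v2 fields: " + v2.getFields());
        // Removing a field or changing a type without a default would break old consumers
        // and should be rejected by the registry's compatibility check.
    }
}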

Automation

Manual toil is always a killer for modern application implementations, and Kafka-based implementations are no different. If you don't nail your IaC for standing up Kafka, you are already one step in the ditch. Automation with tools like CDK, Terraform, etc. will get you the desired infrastructure, but think about extending it to the automated creation of topics and schemas, back-up of schemas, metric-based elasticity for brokers, and DLQ management and automation to make life easy.

Serverless vs long-running Compute

Another silent assassin behind failed implementations is not using the right compute for producer and consumer applications. While serverless is very tempting, it is not always the right choice. Apply the lens of throughput, scale, and long-running vs. on-demand workloads to decide whether serverless will work for you. Pressure-test this in your automated testing environments to make the right choice.

So how should you proceed? Think of it as a common framework

Templated infrastructure, producers, and consumers can be created with all the NFRs (reliability, resiliency, security, SRE, etc.) built into the framework, which different product teams can then take and extend for their business use cases. Centralized control with decentralized implementation helps reduce development effort and improve standardization.

Key aspects to take care of are:

  • Standardize the technology platform for eventing and streaming use cases like Kafka
  • Standardize patterns around producing/consuming events, evolving schemas (compatibility included), stream processing, etc.
  • Standardize SRE & Auditing design for event and stream processing
  • Set up the data quality governance around events and their schemas within a domain and across domains.
  • Establish a consistent approach to reliability, fault tolerance, and event reprocessing.

How to get the key aspects of Kafka implementation right

Schema Validation to drive design first and Data Quality

Message schemas can be used as a contract between producer and consumer, ensuring that a producer publishes messages in the format a consumer expects. This contract enforces data governance between the message that is produced and the message that is consequently consumed. The Schema Registry is an application external to Kafka that maintains the database of schemas. Confluent provides its own version of the Schema Registry, and AWS provides the same functionality as part of its Glue service.

The Schema Registry can become a single point of failure, so measures should be taken to increase its availability. It is a good design decision to run multiple instances of the Schema Registry and deploy them in auto-scale mode, so the registry remains available as load increases.

Messages need to carry schema information along with the original payload: either the entire schema or an identifier of the schema is sent with each message. To keep this performant, it is beneficial to cache schemas locally so that producers and consumers do not have to make frequent calls to an external component like the Schema Registry.
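
The sketch below shows a producer using Confluent's Avro serializer, assuming a registry at a hypothetical URL and a hypothetical topic and schema; the serializer looks up or registers the schema, caches it locally, and sends only a small schema id with each message.

import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Producer sketch: Avro value serialization backed by a schema registry.
public class AvroProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://schema-registry:8081"); // schemas are cached client-side

        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"OrderCreated\",\"fields\":["
          + "{\"name\":\"orderId\",\"type\":\"string\"},"
          + "{\"name\":\"amount\",\"type\":\"double\"}]}");

        GenericRecord order = new GenericData.Record(schema);
        order.put("orderId", "o-123");
        order.put("amount", 42.50);

        try (KafkaProducer<String, Object> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "o-123", order));
        }
    }
}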

Any modifications or changes to a schema should be made and tested in a lower environment before they are implemented in the Production environment. This way, a well-tested and well-designed schema is introduced in Prod, as rolling out a changed or new schema version will need a bit of downtime.

Reliable Producers

For large volumes of data transactions in an organization, it is not only important to have a performant system that can handle heavy data traffic, it is equally important to have a reliable system that ensures all the data that is sent is also received. Kafka is highly fault tolerant and has internal mechanisms to ensure that messages are sent and received reliably: transient send failures are retried a configurable number of times in the hope that the error resolves itself. However, sometimes this built-in level of error handling is not enough.

If a producer is unable to send messages to the Kafka cluster, they need to be stored somewhere so they can be resent later. A Dead Letter Queue (DLQ) topic, without schema validation, can be used to hold failed messages and replay them later, with a header attribute carrying metadata about the context of the failure. A DLQ topic will not help, however, when the Kafka cluster itself is unavailable, because there is no connectivity to the cluster; in that scenario a fallback that does not involve any Kafka component, such as a database (DB), is needed.
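
A minimal sketch of this pattern, assuming hypothetical topic names and a hypothetical persistToDatabase() fallback:

import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// On delivery failure (after the client's own retries are exhausted), route the message
// to a DLQ topic with the failure reason in a header; if even that fails because the
// cluster is unreachable, fall back to storage outside Kafka.
public class ReliableSend {

    private final KafkaProducer<String, String> producer;

    ReliableSend(KafkaProducer<String, String> producer) {
        this.producer = producer;
    }

    void sendWithFallback(String topic, String key, String value) {
        producer.send(new ProducerRecord<>(topic, key, value), (metadata, exception) -> {
            if (exception == null) {
                return; // delivered successfully
            }
            try {
                ProducerRecord<String, String> dlqRecord =
                    new ProducerRecord<>(topic + ".dlq", key, value);
                dlqRecord.headers().add("failure.reason",
                    exception.getClass().getSimpleName().getBytes(StandardCharsets.UTF_8));
                producer.send(dlqRecord); // the DLQ lives in the same cluster, so this is best effort
            } catch (Exception clusterUnavailable) {
                persistToDatabase(key, value); // cluster unreachable: fall back to non-Kafka storage
            }
        });
    }

    private void persistToDatabase(String key, String value) {
        // hypothetical: write the failed message to a database or outbox table for later resend
    }
}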

Consumers that can scale

Kafka Consumers generally take a long-polling approach when it comes to consuming messages from topics in near real-time. This is optimized by parallelizing individual consumer applications to consume from separate Topic-Partitions. Given this, it is vital to design your applications to scale as safely as possible.

There are many important factors to consider when designing scalable consumer applications (such as segregated consumer group naming, handling CommitFailedException, static group membership, etc.). The way you consume must capitalize on the unit of parallelism the framework provides, and you should avoid writing code that interferes with scalability patterns. How your consumer handles retriable and non-retriable failures can have downstream impacts on reliability and performance, so choose an approach that makes sense for your business use case. Finally, rebalancing is an inevitable part of most pub-sub frameworks, so optimize your design to rebalance only when necessary.
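
Here is a sketch of such a consumer, assuming hypothetical topic, group, and instance names: offsets are committed manually after processing, static group membership limits unnecessary rebalances, and a CommitFailedException is treated as a signal that records will be redelivered.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.CommitFailedException;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Consumer sketch: scale out by running more instances in the same group (up to the
// partition count); each instance owns its share of topic-partitions.
public class ScalableConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");
        props.put("group.id", "orders-processor");
        props.put("group.instance.id", "orders-processor-1"); // static membership: fewer rebalances on restart
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false");
        props.put("max.poll.records", "200"); // keep each poll small enough to finish before max.poll.interval.ms

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // hypothetical business logic; decide retriable vs non-retriable here
                }
                try {
                    consumer.commitSync(); // commit only after the whole batch is processed
                } catch (CommitFailedException e) {
                    // the group rebalanced while we were processing; the records will be
                    // redelivered, so processing must be idempotent
                }
            }
        }
    }

    static void process(ConsumerRecord<String, String> record) {
        // hypothetical processing step
    }
}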

Automation to keep sanity

  • IaC using Terraform or similar tools to stand up Kafka clusters in your cloud or on-prem.
  • Use IaC to automate topic and schema creation, for example:
resource "kafka_topic" "mytfkafkatopic" {
for_each = toset(var.topic_name)
name = "${var.region}-${var.market}-${var.environment}-${each.key}"
replication_factor = var.replication_factor
partitions = var.partitions
config = {
"segment.ms" = var.segment
"cleanup.policy" = var.cleanup-policy
"retention.ms" = var.retention-ms
}
  • Automate broker scaling using defined metrics
  • Automate cluster rebalancing using cruise control
  • Automate testing using tools like Testcontainers

Testing & Observability

Unit tests ensure that the code has no glaring issues, but integration and performance tests are vital for uncovering problems in an actual running system. Much of a producer's or consumer's functionality can only be tested alongside a running Kafka and Schema Registry setup. This is easy to achieve with local Docker containers. For JVM applications, the Testcontainers library can be used to create an end-to-end integration testing suite. This isolated environment is the perfect place to test all functionality without worrying about changing an existing Kafka cluster.
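
A sketch of such a test with JUnit 5 and Testcontainers (the image tag, topic name, and assertions are assumptions; a fuller suite would also start a Schema Registry container and consume the record back):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.junit.jupiter.api.Test;
import org.testcontainers.containers.KafkaContainer;
import org.testcontainers.junit.jupiter.Container;
import org.testcontainers.junit.jupiter.Testcontainers;
import org.testcontainers.utility.DockerImageName;

// Integration test sketch: spin up a throwaway Kafka broker in Docker for each test run.
@Testcontainers
class KafkaIntegrationSketchTest {

    @Container
    static final KafkaContainer KAFKA =
        new KafkaContainer(DockerImageName.parse("confluentinc/cp-kafka:7.4.0"));

    @Test
    void producesToDisposableCluster() {
        Properties props = new Properties();
        props.put("bootstrap.servers", KAFKA.getBootstrapServers()); // points at the container
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "o-1", "hello"));
            producer.flush();
        }
        // next step (omitted): consume the record back and assert on its value
    }
}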

To fully capitalize on those benefits, we need to closely monitor our infrastructure and catch problems before they arise. “Preventing fires is better than fighting them.”
  • Monitor the Kafka cluster
  • Monitor distributed event services
  • Monitor topics

Administration

Kafka does not ship with a default UI, but there are some third-party tools that can be used to display Kafka resources graphically.
Kafdrop is one such web UI for viewing Kafka topics and browsing consumer groups. It displays information such as brokers, topics, partitions, and consumers, and lets you view messages.

You can also extend this tool, as we did, to get additional benefits like DLQ management and automation, with features such as automated retry of retryable error messages and bulk fix-and-upload of DLQ messages for replay.

DLQ message update and Replay Capability

Conclusion

Enterprises that use event-driven architecture can enjoy the advantages of scalable and potent real-time communication. Many strategic undertakings can benefit from this, including the internet of things (IoT), e-commerce, data integration across systems, and edge and fraud detection.

There is a lot of buzz around Kafka and how it can pave the way for your digital transformation success. When properly executed, Kafka can be a decisive tool for sending and receiving data across your network. An improper implementation will not only keep you from processing data more efficiently, it may also set off a technical debt cycle. So control these pitfalls and keep their solutions in mind as you implement this potentially powerful tool.

Given the demand for Kafka, choosing the right cluster setup, whether SaaS or cloud-native, is very important. Please see part 2 of this blog to review the criteria for choosing the right one.

Contributors

Amit Sharma: Head Of Engineering

Kamakhya Das: Senior Architect

Gopa Roy: Senior Developer

Mckenzie Murphy, Rakib Kamal & Warren Bigelow: Software Developers

Amit Sharma
Engineered @ Publicis Sapient

Certified Cloud Architect and expert with 18 years of experience in the design and delivery of cloud-native, cost-effective, high-performance DBT solutions.