Iris, the one stop shop for your Kafka broker monitoring needs

Iris
7 min read · Sep 12, 2022


‘What’s a Kafka?’

Apache Kafka is an open-source, event-driven data streaming platform originally developed at LinkedIn and written in Scala and Java. It was later donated to the Apache Software Foundation.

Built on the backbone of message queues and publish-subscribe systems, Kafka’s main use is to manage large streams of data in real-time. Kafka allows users to send messages between applications in highly distributed systems.

Let’s take John, for example. John’s a smart guy. He likes to grind LeetCode and share his solutions with his software engineering followers — he has 14 million dedicated LeetCode followers because he’s that good. John publishes his solutions to the internet while his followers wait intently for the next release. Some of John’s followers keep up with him. Others are stuck and falling behind; they need more time with each problem before they’re ready for John’s next solution.

John and his dedicated LeetCode followers

So how can we resolve this issue? A real-time WebSocket connection won’t work, since it requires every follower to be online the moment John publishes. And we don’t want to cache the solutions locally or in an external database, since that consumes extra system resources and shifts the responsibility onto each consumer. Surely, there has to be a better option.

This is where a message broker comes in. By using a message broker like Apache Kafka, we can easily send data to a persistent log, grouped into topics, that can be accessed at any time by any consumer holding a subscription to that topic. If John’s followers need time to digest a LeetCode hard solution, they can resume from their last read position as John releases more and more solutions. With Apache Kafka’s message broker architecture, they can do just that: John’s followers can receive all of the LeetCode solutions from John’s posts, independent of John’s pace.
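
To make the idea concrete, here is a minimal sketch in plain Python — not Kafka itself, just a toy model of the same pattern: the topic is an append-only log, and each consumer tracks its own offset, so slow readers resume exactly where they left off while fast readers stay current.

```python
# Toy model of a topic log with per-consumer offsets (illustration only).
class TopicLog:
    def __init__(self):
        self.events = []   # append-only log of published messages
        self.offsets = {}  # consumer name -> index of next unread message

    def publish(self, message):
        self.events.append(message)

    def poll(self, consumer):
        """Return the next unread message for this consumer, or None."""
        offset = self.offsets.get(consumer, 0)
        if offset >= len(self.events):
            return None  # caught up; nothing new yet
        self.offsets[consumer] = offset + 1
        return self.events[offset]

log = TopicLog()
log.publish("Two Sum solution")
log.publish("LRU Cache solution")

fast = (log.poll("alice"), log.poll("alice"))  # Alice reads both right away
slow = log.poll("bob")                         # Bob has only read the first
log.publish("Median of Two Sorted Arrays solution")
# Bob later resumes from his own offset and misses nothing.
```

The key property is that the log never forgets on a consumer’s behalf: publishing more messages doesn’t advance anyone’s offset, so each subscriber consumes at its own pace.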

John, his dedicated LeetCode followers and an integrated Kafka solution

‘Ok, but who uses it?’

As major tech industries around the world rapidly grow, Apache Kafka has been perfectly situated to scale with the market’s need for faster message brokering. It’s been widely adopted by tech giants like Pinterest, Airbnb, Cisco and Salesforce, to name a few, making it the de facto message broker in many modern software infrastructures.

In 2021, PayPal designed a new system architecture using Apache Kafka as a buffer and load distributor for their crawling-jobs process, streaming URLs into a processing pipeline built on their Akka platform. After implementing the new design, PayPal’s performance increased tremendously: batch processing became more efficient, thread usage dropped, and CPU utilization fell by 90%. Most impressively, Kafka enabled an 80% reduction in PayPal’s overall codebase.

As larger volumes of data records are being delivered across brokers, the need for lower latency and higher fault tolerance increases. Kafka has become an industry standard allowing organizations to modernize their data strategies with event streaming architecture.

‘If it’s so great, how does it work?’

Good question. Here’s the long story, in short.

Each topic in the pub-sub system is an ordered log of events. As producers write data to Kafka, these events are appended to the topic’s log. Unlike a message queue, Kafka brokers persist a log of the written data, and each consumer pulls messages off the broker according to an index, called the offset. This allows multiple consumers to read data in written order and even resume from their last read point in the event of a crash.

However, within a consumer group, only a single consumer may consume from a given log at any one time. This works well for many independent consumers fetching data from the same topic at different points in its timeline. But what happens when multiple consumers need to share the work of pulling data from the same topic?

To increase horizontal scalability, Kafka implements partitions as a way to distribute a topic’s messages to many consumers. Partitions split a topic’s log into several sub-logs that can be written and read in parallel. These topics and their partitions are managed by the Kafka broker, which is responsible for appending events to topic logs, delivering data to consumers, and implementing the logic required for replication.
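
A quick sketch of how messages spread across partitions. Kafka’s default partitioner hashes the message key (murmur2 in the real client); here we use Python’s `zlib.crc32` purely for illustration, which preserves the important property — same key, same partition, so per-key ordering survives parallelism.

```python
import zlib

# Toy partitioner (illustration only; Kafka's real client uses murmur2).
NUM_PARTITIONS = 3
partitions = [[] for _ in range(NUM_PARTITIONS)]

def partition_for(key: str) -> int:
    # Messages with the same key always land in the same partition,
    # preserving per-key ordering while allowing parallel consumption.
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

def produce(key: str, value: str):
    partitions[partition_for(key)].append(value)

produce("user-42", "clicked")
produce("user-42", "purchased")  # same key -> same partition, in order
produce("user-7", "clicked")     # different key may land elsewhere
```

Because each partition is an independent sub-log, one consumer per partition can run in parallel — and that is what lets a consumer group scale horizontally.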

For data availability, one replica of each topic partition is elected as the ‘leader’: the replica through which all reads and writes for that partition primarily flow. The remaining replicas act as ‘followers’. As data is produced to the leader, it is replicated to the followers. If the leader’s node fails, a follower is promoted to leader and the cycle continues. This process is fully automated, requires no management by the user, and remains a core part of what makes Kafka so powerful and highly available.
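
An illustrative model of that failover cycle — not Kafka’s actual controller logic, just the shape of it: one replica serves traffic, and when it fails, an in-sync follower is promoted automatically.

```python
# Toy model of leader failover for a single partition (illustration only).
class Partition:
    def __init__(self, replicas):
        self.replicas = list(replicas)  # broker ids hosting a copy
        self.leader = self.replicas[0]  # first replica starts as leader

    def fail(self, broker):
        """Simulate a broker going down; promote a follower if needed."""
        self.replicas.remove(broker)
        if broker == self.leader and self.replicas:
            # the next in-sync replica takes over automatically
            self.leader = self.replicas[0]

p = Partition(replicas=[101, 102, 103])
p.fail(101)  # leader goes down; a follower becomes the new leader
```

In real Kafka the promotion is restricted to replicas that are fully in sync with the old leader, which is why the in-sync replica count (seen below) matters so much for availability.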

https://www.cloudkarafka.com/blog/part1-kafka-for-beginners-what-is-apache-kafka.html

Even these processes are not infallible. Large businesses replicate their data precisely because brokers can go offline and leader nodes can fail. Multiple brokers can drop out due to server connection issues, and leaders may become unavailable if the number of in-sync replicas falls below the configured minimum.

These failure events can be diagnosed by characteristic telemetry data: message response times and consumer lag begin to climb, and requests exceed the retry threshold and start to time out. But wouldn’t it be more helpful to diagnose these problems preemptively, in the lead-up to the failure? That’s why it’s important to monitor Kafka’s health by selecting significant key metrics, ensuring the broker runs smoothly and the client experience is never interrupted.
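
Preemptive diagnosis can be as simple as watching for a sustained upward trend rather than a hard threshold. A toy sketch — the sample values and window size are invented for illustration:

```python
# Flag a broker as degrading when lag climbs across consecutive samples,
# before requests actually start timing out (illustration only).
def lag_trending_up(samples, window=3):
    """True if the last `window` lag samples are strictly increasing."""
    recent = samples[-window:]
    return len(recent) == window and all(
        a < b for a, b in zip(recent, recent[1:])
    )

healthy = [120, 118, 121, 119]     # lag hovering around a baseline
degrading = [120, 450, 900, 1400]  # lag climbing sample over sample
```

A trend check like this fires while the system is still serving traffic, which is exactly the window in which an operator can still act.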

And that’s where Iris comes into play.

‘Where have I heard of that?’

Iris is one of the newest open source products on the scene, seeking to fill the gap between pay-to-use monitoring services like Datadog and free-to-use tools like Prometheus that demand more manual setup. Iris is an open source Apache Kafka health monitoring suite built under OSLabs. It gives developers the ability to chart and log real-time broker health metrics, delivering in key categories like server diagnostics, message throughput, log size, latency and connectivity errors.

The simple UI is quick to set up and instantly begins monitoring your broker health. This no-frills solution allows the user to dynamically display and visualize a wide array of metrics like…

  • Messages In/Out per Sec to monitor broker data stream flow and latency
  • Request Rate and Queue size for diagnosing probable stagnations in request completion
  • Offline Partition Count to understand data I/O availability
  • Under Replicated Partitions to compare in-sync replicas against the configured replication standard
  • Broker Network and I/O Activity to measure reasonable process durations
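
To make one of these metrics concrete: consumer lag is the gap between the newest offset in a partition (the log end) and the offset a consumer group has committed. A small sketch — the offset numbers are invented for illustration:

```python
# Per-partition consumer lag = log end offset - committed offset
# (illustration only; offset values below are made up).
def consumer_lag(log_end_offsets, committed_offsets):
    """How many messages each partition's consumer has yet to read."""
    return {
        partition: log_end_offsets[partition] - committed_offsets.get(partition, 0)
        for partition in log_end_offsets
    }

log_end = {0: 1500, 1: 980, 2: 2040}    # latest offset per partition
committed = {0: 1500, 1: 900, 2: 1700}  # the consumer group's progress

lag = consumer_lag(log_end, committed)
```

A lag of zero means the consumer is fully caught up; a lag that keeps growing is one of the earliest warning signs the previous section described.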

Developers can pick and choose which live metrics they want to analyze, and across which time frames. Multiple individual line charts can be graphed in your feed within the highly customizable UI. At the same time, data is constantly logged to a SQL database configured for easy retrieval of historical data.

Iris’ newest feature lets the developer view this historically logged data through user-defined queries. Data charted from a past time frame can now be compared directly against data arriving in real time. Baselining key health metrics has never been simpler!

User Interface for the Iris application

Future Work

As we continue to grow, our goal with Iris is to provide the developer with an ‘as simple as possible, no simpler’ monitoring solution to ensure we are delivering a robust system. We wanted to architect a product stripped of the complexity of most monitoring suites.

In the coming weeks, we will continue to add more key metrics based on the community’s feedback and optimize the UI/UX for an even smoother developer integration and experience. We are hard at work refactoring the backend architecture to optimize data request patterns, while also hosting Iris in an AWS EC2 environment to offer a platform-agnostic tool and free up the developer’s resources.

Please check us out on GitHub and LinkedIn to follow our progress! We’d love to hear feedback from the developer community on all the ways we can improve Iris!

Iris: LinkedIn | GitHub

Brennan Lee: LinkedIn | GitHub

John (Huu) Le: LinkedIn | GitHub

Michael Reyes: LinkedIn | GitHub

Walter Li: LinkedIn | GitHub

Support this Project

If you would like to support the active development of Iris and OSLabs:

  • Clap this Article!
  • Fork or star the repo on GitHub
  • Contribute to this project by raising a new issue or making a PR to solve an issue
