Exploring Data Pipelines in Apache Kafka with Streams Explorer

Salomon Popp
Published in bakdata
6 min read · Feb 15, 2021

When working with large-scale streaming data, it is crucial to monitor your pipelines and to explore their individual parts.

Streams Explorer allows examining Apache Kafka data pipelines in a Kubernetes cluster, including the inspection of schemas and the monitoring of metrics. It is part of our open-source projects and easy to set up if you’re already using streams-bootstrap or faust-bootstrap for your application deployments. Streams Explorer gives you a central starting point in your toolkit to explore and monitor streaming applications and topics without manually searching through Kubernetes.

Overview

[Screenshot: Streams Explorer]

By default, Streams Explorer shows all data pipelines consisting of streams-bootstrap Java and/or faust-bootstrap Python applications deployed in one or multiple Kubernetes namespaces. The selection menu at the top allows picking a specific pipeline to focus on. We recommend naming individual pipelines through the pipeline label in your Kubernetes deployments, which makes them easier to tell apart in large environments. Otherwise, a data pipeline is named after one of its streaming applications by default.

Metrics from Prometheus are shown below each node. Topics include the approximate number of records (size) and the rate of messages in or out per second over the last five minutes. Streaming application nodes show the current consumer lag and the number of replicas of the deployment. Because Streams Explorer cannot directly tell which consumer group belongs to which application, you need to set a consumerGroup annotation on your app deployments.

Displaying metrics in the graph requires Kafka Exporter & Kafka Lag Exporter to collect Kafka metrics for Prometheus.

The Details view underneath the graph displays information about the selected node. This includes the Avro schema for topics, configuration details for streaming apps, and information about Kafka connectors along with links to external services that can be customised to your setup. More on that later when we get to extending the functionality using plugins.

Usage

The setup is straightforward. We can deploy streams-explorer to our Kubernetes cluster using the Helm chart.

helm repo add streams-explorer https://raw.githubusercontent.com/bakdata/streams-explorer/v1.0.3/helm-chart/
helm install streams-explorer streams-explorer/streams-explorer --values helm-chart/values.yaml

When deploying Streams Explorer to the cluster, a service account for accessing the Kubernetes API is created by default.

Setup instructions for Docker Compose and standalone deployments can be found in the streams-explorer repository.

Showcase: Demo pipeline

[Figure: Demo pipeline for ATM fraud detection]

We include a demo pipeline for ATM fraud detection in the repository, which serves as a running example to showcase Streams Explorer’s features. The original by Confluent is written in KSQL and outlined in this blog post.

We rebuilt the pipeline from scratch using our streams-bootstrap library, which provides a unified way of configuring and deploying Kafka Streams applications. There is a separate blog post going into more detail about streams-bootstrap (previously called common-kafka-streams).

The demo pipeline consists of four streaming applications. We can build the containers using Jib and push the resulting images to the container registry.

gradle jib -Djib.to.image=url-to-container-registry.com/streams-explorer-demo-transactionavroproducer -Djib.container.mainClass=com.bakdata.kafka.TransactionAvroProducer

Now we deploy them to Kubernetes using Helm. Make sure to set the URLs to your Kafka broker and Schema Registry in the values.yaml file and fill in your Docker image from the previous step.

helm repo add bakdata-common https://raw.githubusercontent.com/bakdata/streams-bootstrap/1.6.0/charts/
helm repo update
helm install --values values-transactionavroproducer.yaml demo-transactionavroproducer bakdata-common/streams-app

Repeat the steps for the remaining three streaming applications.

Let me quickly introduce the ATM fraud detection demo pipeline:

The pipeline consists of four streaming applications. The first one is a producer that converts incoming transactions from raw JSON to the Transaction Avro schema. After that, the TransactionJoiner looks for consecutive transactions from the same account within 10 minutes and joins them for the next step. Then the actual fraud detection takes place, checking for the following three conditions:

  1. the same account number as a prior transaction
  2. a different location than the prior transaction
  3. within 10 minutes of the prior transaction

If all three conditions are true, the fraudulent transaction is forwarded to the AccountLinker. Here, additional checks are performed, such as calculating the distance and the required travel speed between the two locations. Finally, this last step enriches the data with information about the originating account.
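To make the three conditions concrete, here is a minimal Python sketch of the check. The actual demo implements this logic as a Java Kafka Streams application, so the types and field names below are purely illustrative.

from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Transaction:
    account_id: str  # hypothetical field names, for illustration only
    location: str
    timestamp: datetime

def is_possible_fraud(previous: Transaction, current: Transaction) -> bool:
    """Check the three fraud conditions on a pair of joined transactions."""
    same_account = previous.account_id == current.account_id
    different_location = previous.location != current.location
    within_window = current.timestamp - previous.timestamp <= timedelta(minutes=10)
    return same_account and different_location and within_window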

Time to run some test data through the pipeline and see it in action. First, we need to find the leader Kafka broker for the accounts topic and set up a port-forward to localhost.

kafka-topics --zookeeper localhost:2181 --describe --topic atm-fraud-accounts-topic

We will use this topic to store our example accounts. The following command handles both converting the test data to the Account Avro schema and writing it to the topic.

kafka-avro-console-producer --broker-list localhost:9092 --topic atm-fraud-accounts-topic --property value.schema=$(cat src/main/avro/Account.avsc | tr -d '\040\011\012\015') --property schema.registry.url=http://localhost:8081 < test-data/accounts.txt

To generate our stream of incoming transactions (legitimate or fraudulent), we use gess. Once the daemon is running, we pipe the output from the UDP port to kafkacat, which writes the individual messages to the input topic from which our TransactionAvroProducer reads.

./gess.sh start
nc -v -u -l 6900 | kafkacat -b localhost:9092 -P -t atm-fraud-raw-input-topic

While letting our topic populate for a few minutes, we can monitor the pipeline in Streams Explorer. The metrics for messages in/out per second should be slowly increasing, indicating the stream of incoming transactions. At this point, you can check the topic size for the atm-fraud-possiblefraudtransactions-topic to confirm the fraud detection is working.

Plugins

Streams Explorer includes a plugin loader that loads Python files from an external, user-defined folder, so you can extend and customise the functionality to fit your specific setup. There are currently two different types of plugins (LinkingServices and Extractors), with sample implementations provided for each of them. Plugins can be easily added to the backend core, which is available as a Python package on PyPI.
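To illustrate the general idea of such a plugin loader (a hedged sketch, not the actual streams-explorer implementation), every Python file in the plugin folder could be imported with importlib:

import importlib.util
from pathlib import Path

def load_plugins(plugin_dir: str) -> list:
    """Load all Python files from a user-defined folder and return the loaded modules."""
    modules = []
    for path in sorted(Path(plugin_dir).glob("*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)  # executes the plugin file
        modules.append(module)
    return modules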

The LinkingService extends the Details view at the bottom of the screen to include links to external tools and services depending on the node type. The included DefaultLinker works with AKHQ, Grafana and Kibana by adding links to those services, making it convenient to inspect further information about the selected node, such as logs of a streaming app, messages inside a topic or metrics in the included Grafana dashboards.

You can create a custom Linker by implementing the abstract LinkingService class. The DefaultLinker, for example, creates a NodeInfoListItem for topics that links to AKHQ, a GUI to manage and view data inside your Apache Kafka cluster.
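Here is a minimal, self-contained sketch of that pattern. The stand-in classes below only mirror streams-explorer’s LinkingService and NodeInfoListItem and are not its actual API, so check the repository for the real interface; the AKHQ URL pattern is likewise just an assumption.

from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional

@dataclass
class NodeLink:
    """Stand-in for a NodeInfoListItem shown in the Details view."""
    name: str
    url: str

class BaseLinker(ABC):
    """Stand-in for the abstract LinkingService."""

    @abstractmethod
    def topic_link(self, topic_name: str) -> Optional[NodeLink]:
        ...

class AKHQLinker(BaseLinker):
    """Links topic nodes in the Details view to their page in AKHQ."""

    def __init__(self, akhq_base_url: str):
        # e.g. "http://akhq.example.com" (hypothetical, adjust to your setup)
        self.akhq_base_url = akhq_base_url

    def topic_link(self, topic_name: str) -> Optional[NodeLink]:
        # URL path is illustrative; adapt it to your AKHQ cluster name
        return NodeLink(name="AKHQ", url=f"{self.akhq_base_url}/ui/kafka/topic/{topic_name}")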

Extractors can be used to add various types of sources or sinks by extracting information from the configuration of Kafka Connectors, CronJobs, or environment variables of streaming application deployments. All extractors should implement the Extractor base class. For an example, take a look at ElasticsearchSink, which uses the Elasticsearch sink connector configuration to extract the corresponding Elasticsearch index and adds a node to the Streams Explorer graph that visualises the index. The basic pattern for adding a source could look like this:
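The following is a hedged sketch of that pattern; the class and method names are stand-ins rather than streams-explorer’s actual Extractor API. It watches a streaming app’s environment variables for a hypothetical INPUT_S3_BUCKET setting and emits a source node for it.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Source:
    """Stand-in for a source node added to the pipeline graph."""
    name: str
    node_type: str
    target: str  # the streaming app this source feeds into

class S3FileSourceExtractor:
    """Hypothetical extractor: emits a source node when a streaming app reads its input from an S3 bucket."""

    def __init__(self):
        self.sources: List[Source] = []

    def on_env_var(self, name: str, value: str, app_name: str) -> Optional[Source]:
        # Assumed hook called for each environment variable of a deployment;
        # the real Extractor base class defines the actual callbacks.
        if name == "INPUT_S3_BUCKET":  # hypothetical variable name
            source = Source(name=value, node_type="s3-bucket", target=app_name)
            self.sources.append(source)
            return source
        return None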

Conclusion

We hope you enjoyed reading this Streams Explorer introduction and like how it simplifies data pipeline exploration.

You have seen a sample application that demonstrates how streams-bootstrap and Streams Explorer work together, letting you easily create Kafka Streams applications that can be visualised and monitored within complex data pipelines.

We plan to continue developing Streams Explorer and adding new features. Contributions and feedback are always welcome. For ideas and suggestions, please open an issue on GitHub to get in touch with us.

Thanks to Victor Künstler, Alexander Albrecht, Christoph Böhm, Philipp Schirmer, Benjamin Feldmann, and Torben Meyer.
