Streaming analytics with Kafka and ksqlDB

The Green River — Utah Stream Processing. Photo by Doc Searls

This post was co-authored with Maha Arunachalam (DevOps engineer at Pluralsight), in collaboration with Theo Cowan (DevOps engineer), Connor McKay (machine learning engineer), and Jeff Lewis and Zander Nickle (streaming software engineers).

If you’ve been around the dataverse a bit, you’ve likely heard of Apache Kafka. Perhaps you’ve even played with it. Overall, Kafka (the tool, not the philosopher) provides a horizontally scalable stream-processing engine that offers highly performant, fault-tolerant, real-time data feeds.

Messaging

We’ve embraced Kafka at Pluralsight, and in this post we’ll give a high-level view of how we’re using it and ksqlDB (formerly KSQL) to do streaming analytics. First, let’s step back a bit. As more engineering orgs organize themselves around microservices, the need arises for inter-component and inter-team communication. At first blush, RESTful endpoints seem great for solving that problem. However, as outlined in this post, inter-dependent services are better off loosely coupled and asynchronous, whereas HTTP web services are synchronous. Thus the need arises for message queues. While outside the scope of this post, teams at Pluralsight use RabbitMQ and its AMQP protocol for inter-team messaging.

More and more, however, product teams at Pluralsight (PS) are using Kafka for both this asynchronous inter-team communication and for analytics. As stated here, however, Kafka isn’t really a message queue in the traditional sense at all — in implementation it looks nothing like RabbitMQ or other similar technologies. It is much closer in architecture to a distributed filesystem or database than a traditional message queue.

Kafka at PS

At Pluralsight, we like Kafka not only because it provides high durability and horizontal scalability, but also because it makes it easy to keep numerous consumers in sync with a given data source. To enable the analytics benefits of Kafka, we store data in perpetuity and leverage both event-style and entity-style topics. For those of you more oriented around databases, an entity is what you’re used to. Entities describe the current state — how many channels does this user have, how much content have they viewed, etc. Events, on the other hand, are, well, events. For example, a user started a course, a user changed their interests, a user searched for X. We orient topics around either one.
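In ksqlDB terms (more on ksqlDB below), this distinction maps roughly onto tables and streams: a table models entity-style current state, while a stream models an append-only log of events. A minimal sketch, with hypothetical topic and column names:

```sql
-- Entity-style: a table keyed by user, holding current state.
-- Topic, column, and key names are illustrative, not our actual schema.
CREATE TABLE user_profile (
    user_id VARCHAR PRIMARY KEY,
    channel_count INT,
    minutes_viewed BIGINT
) WITH (KAFKA_TOPIC = 'user-profile', VALUE_FORMAT = 'JSON');

-- Event-style: a stream recording things that happened, one row per event.
CREATE STREAM course_started (
    user_id VARCHAR,
    course_id VARCHAR,
    started_at BIGINT
) WITH (KAFKA_TOPIC = 'course-started', VALUE_FORMAT = 'JSON');
```

An entity-style topic is typically log-compacted (the latest record per key wins), while an event-style topic retains every record.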

Kafka differs from a database in that you don’t query Kafka itself. You can set up consumers directly on Kafka topics to act on messages as they arrive, but more commonly you will materialize the data into Postgres, S3, etc. This enables running queries in the traditional sense against the data from topics. We replicate most topics out to Postgres, and for most of our Kafka topics, materializing into (and querying from) Postgres works fairly well. However, as our topic sizes and analytics needs have grown, we’ve also leveraged AWS’s Athena/S3 data lake solution, which provides a Presto-like query engine without our having to manage EC2 instances. While S3 doesn’t handle updates to Kafka records well, so this isn’t a perfect solution, the S3/Athena environment does work nicely for event-style Kafka topics.
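As a sketch of what that materialization can look like, ksqlDB can manage Kafka Connect connectors with SQL statements. The hostname, credentials, and topic name below are hypothetical:

```sql
-- Land an events topic into Postgres via the Kafka Connect JDBC sink.
-- Connection details and topic name are illustrative.
CREATE SINK CONNECTOR postgres_sink WITH (
    'connector.class' = 'io.confluent.connect.jdbc.JdbcSinkConnector',
    'connection.url'  = 'jdbc:postgresql://postgres:5432/analytics',
    'connection.user' = 'kafka_sink',
    'topics'          = 'course-started',
    'auto.create'     = 'true'   -- create the target table if it doesn't exist
);
```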

At Pluralsight, we deploy Kafka along with the Confluent platform. This gives us all of the Kafka functionality described above, as well as additional features, such as schema management and enforcement, and Control Center for monitoring brokers. We also plan on leveraging more of their platform moving forward to increase the security, durability, and availability of our Kafka brokers.

Streaming landscape

ksqlDB (formerly called KSQL) is from Confluent and is based on Kafka Streams. It provides a user-friendly API that puts streaming analytics at the fingertips of data scientists (and not just engineers). Since it’s based on Kafka Streams, ksqlDB provides resilient stream processing operations like filters, joins, maps, and aggregations.

You may have heard of Spark Streaming, an extension of the core Spark API, which historically was the most popular processing system for streaming analytics. There are teams working with Spark Streaming at PS, though it has proven more difficult to get working with AWS EMR than Confluent-based ksqlDB has. The other things we like about ksqlDB are its tighter integration with Kafka (since it’s based on Kafka Streams) and its simpler API compared to Spark Streaming. Again, putting streaming analytics within reach of data scientists is huge.

How PS is using ksqlDB

While there is much more to explore, ksqlDB offers: 1) easy-to-use source and sink connectors; 2) streaming analytics; and 3) the ability to create derived topics.

ksqlDB and a ksqlDB cluster (see more below) offer tools not only for doing streaming analytics, but also for fundamental data engineering tasks via Confluent Kafka connectors. For example, you can easily create Kafka topics via connectors from Postgres (via Debezium) or from MQTT for IoT/Arduino-style applications. At PS we use the S3 sink connector for landing data from Kafka to S3, so that we can leverage a map-reduce-style data lake (such as AWS Athena) for fast querying. If you’re excited to dig in, there is an impressive repo with examples here where you can quickly spin up Kafka and ksqlDB via Docker and sink geo-location data to ksqlDB like this: smartphone -> MQTT -> Kafka -> ksqlDB. It’s a great way to get your hands dirty.
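For a flavor of what connector creation looks like from ksqlDB, here is a sketch of a Debezium source and an S3 sink. All hostnames, bucket names, and topic names are hypothetical, and real deployments need credentials and more configuration:

```sql
-- Source: stream row changes from a Postgres table into Kafka via Debezium.
CREATE SOURCE CONNECTOR users_source WITH (
    'connector.class'      = 'io.debezium.connector.postgresql.PostgresConnector',
    'database.hostname'    = 'postgres',
    'database.port'        = '5432',
    'database.user'        = 'replicator',
    'database.dbname'      = 'app',
    'database.server.name' = 'app',
    'table.include.list'   = 'public.users'
);

-- Sink: land an events topic in S3 for querying with Athena.
CREATE SINK CONNECTOR s3_sink WITH (
    'connector.class' = 'io.confluent.connect.s3.S3SinkConnector',
    'topics'          = 'course-started',
    's3.bucket.name'  = 'ps-kafka-lake',
    's3.region'       = 'us-west-2',
    'storage.class'   = 'io.confluent.connect.s3.storage.S3Storage',
    'format.class'    = 'io.confluent.connect.s3.format.json.JsonFormat',
    'flush.size'      = '1000'   -- records per S3 object
);
```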

ksqlDB also makes it easy to finally bring typical SQL-style syntax to streaming data. This works via a stream, which is based on a Kafka topic. The upshot is that one can do GROUP BYs and filtering on one stream, or join multiple streams. GROUP BY feasibility, however, depends on stream size and pod memory. As stated here, unlike with typical SQL, with ksqlDB we create programs that operate continually over unbounded streams of events, ad infinitum. These processes stop only when you explicitly terminate them. This functionality opens up a wide variety of applications, one of which is derived topics.
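As an example of that SQL-style syntax, here is a minimal continuous aggregation. The topic and column names are hypothetical:

```sql
-- A stream over an existing events topic (names are illustrative).
CREATE STREAM course_views (
    user_id VARCHAR,
    course_id VARCHAR
) WITH (KAFKA_TOPIC = 'course-views', VALUE_FORMAT = 'JSON');

-- A persistent query: view counts per course, updated continuously
-- until the query is explicitly terminated.
CREATE TABLE views_per_course AS
    SELECT course_id, COUNT(*) AS view_count
    FROM course_views
    GROUP BY course_id
    EMIT CHANGES;
```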

In the Data Warehouse / OLAP world, it’s often beneficial to create denormalized tables to provide easy access to standard metrics and dimensions, simplify joins, etc. With Kafka we create derived topics to achieve similar ends. When your analytics are based around Kafka, it becomes convenient to have derived topics gathering standard user attributes or metrics from several topics into one to facilitate things like model-building, experimentation, and product research. Creating a new Kafka topic from a stream is as simple as filling out a parameter when creating the underlying stream itself.
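Concretely, a derived topic can look like the sketch below. The stream, columns, and topic names are illustrative:

```sql
-- The base events stream (hypothetical schema).
CREATE STREAM course_started (
    user_id VARCHAR,
    course_id VARCHAR,
    plan_type VARCHAR
) WITH (KAFKA_TOPIC = 'course-started', VALUE_FORMAT = 'JSON');

-- The WITH clause is the parameter that names the derived Kafka topic.
CREATE STREAM premium_course_starts
    WITH (KAFKA_TOPIC = 'premium-course-starts') AS
    SELECT user_id, course_id
    FROM course_started
    WHERE plan_type = 'premium'
    EMIT CHANGES;
```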

Kubernetes and ksqlDB

Over the last few years, software firms have moved to containers as an efficient way to deploy apps with better parity between development, staging, and production servers. Because Kubernetes makes packaging and deploying those applications easy, it reigns as the most popular open source container orchestration platform.

At Pluralsight, we have set up our own Kubernetes cluster. Though spinning up our own cluster has its pros and cons, the freedom to customize to our requirements makes it preferable to using the AWS managed service. In the cluster, each product team has a namespace in which to provision resources for itself. We use Helm or Kustomize to template applications and deploy them to Kubernetes via YAML manifests. The Kustomize template for ksqlDB, for example, captures the memory requirements, the Kafka cluster details, and the IAM role needed to access the S3 sink bucket in AWS.
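To give a feel for what such a template encodes, here is a rough Deployment fragment. The image tag, role annotation, broker address, and memory figures are illustrative, not our actual manifest:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ksqldb-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ksqldb-server
  template:
    metadata:
      labels:
        app: ksqldb-server
      annotations:
        iam.amazonaws.com/role: ksqldb-s3-sink   # IAM role for the S3 sink bucket
    spec:
      containers:
        - name: ksqldb-server
          image: confluentinc/ksqldb-server:0.28.2
          env:
            - name: KSQL_BOOTSTRAP_SERVERS       # Kafka cluster details
              value: kafka-broker:9092
          resources:
            requests:
              memory: 4Gi
            limits:
              memory: 8Gi                        # memory requirements
```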

In addition, the Kubernetes web UI provides the detailed information needed to check or update the config, services, and logs for the different applications a team has deployed. This templating and UI monitoring make the setup approachable to data scientists without a heavy engineering background. Once the ksqlDB cluster is provisioned with those tools, we can create our connectors, run our streaming queries, etc. In other words, Kubernetes lets our data scientists worry about the task at hand instead of resource provisioning.

Finally

For those of you who have been in the data world a while, practical streaming analytics has perpetually felt just over the horizon. With tools like ksqlDB, data scientists writ large will finally be able to leverage these techniques to improve the business. In the future, we’ll get into the weeds on specific ksqlDB projects as well as our work with Spark Streaming and Materialize.io — stay tuned!


Levi Thatcher
Data Science and Machine Learning at Pluralsight

I’m a Principal Data Scientist at Pluralsight, where we’re democratizing tech skills.