How to Create a Self-Serve Streaming Data Platform

Gal Krispel
Riskified Tech

--

In the age of data in tech, you’ve probably stumbled upon buzzwords like ‘big data’, ‘data lakes’, ‘data streaming’, and so forth. The demand for seamless pipelines that move data points from A to B is enormous, and just getting the data from A to B isn’t enough. You also need to get it there in real time!

As a Software Engineer, how often have you been asked to move data around, modify it, and load it into a different platform? This probably rings a bell. Depending on your company’s needs, this demand can grow into dozens or hundreds of integrations. It’s difficult to maintain all of these integrations, not to mention all the code you’d need to write per integration.

Ideally, you’d have a single platform to manage all these integrations, with reusable code shared across them. The good news is you can build that platform on open-source code without spending a single dollar on commercial software.

In this post, I will show you how to lay the foundations for managing your own data platform based on open-source code and allow anyone in your organization to create seamless streaming data pipelines. First, let’s dive into what the common practice is when it comes to moving data around.

ETL as the Data Integration Process

The common practice in data engineering for maintaining data pipelines is ETL. ETL (Extract, Transform, Load) is the process of extracting data from various sources, transforming it, and loading it into another data store.

Traditional ETL uses a batching strategy: it extracts data from a bounded data source and loads it into the destination at a scheduled interval. This usually covers offline-consumption use cases such as BI reports and dashboards, where freshness isn’t crucial and waiting a few minutes or hours still meets your business requirements. If you don’t need the data in real time, batch jobs will cut it, and they’re usually simpler to maintain. See the Airflow project for further reference.

The need often arises for online applications to fetch fresh data from an online data store that your application does not communicate with directly. Batch processing won’t cut it here: these business needs require a continuous, ongoing stream of data and won’t tolerate the latency overhead of batching. By the time the data reaches its destination, it’s already stale.

That’s where streaming ETLs come into play. They consume a live feed from unbounded data sources (streaming technologies like Kafka, Pulsar, Kinesis, and Pub/Sub), which provides a fine-grained way to process data continuously as it arrives.

In a microservices architecture, services share data asynchronously through message brokers like the ones mentioned above. Sharing data this way is effective, but each poll only hands you raw records off the wire. Message brokers aren’t queryable; they provide a stream of immutable log data. If you want to query, aggregate, modify, or filter this data, you typically need a consumer service that reads it, perhaps applies stateless transformations such as masking fields, removing fields, or filtering on a predicate, and then loads it into a queryable online data store. That’s a classic example of a streaming ETL.

In a data-intensive product, these use cases can grow massively, and it would be pretty hard to maintain these ETLs on a large scale. Each requires separate deployments, secrets, permissions, and, worst of all — different code bases. Ideally, you’d manage these ETLs on a single platform, where you could configure them according to your business needs and manage different source plugins.

Kafka Connect as the Platform’s Engine

The Apache Kafka community identified this pain point and created an easy-to-use tool called Kafka Connect for moving data from external systems into Apache Kafka and from Apache Kafka into external systems.

Kafka Connect allows you to define Connectors using JSON configurations that stream data from Kafka (Sink) or to Kafka (Source), and it can do single message transformations without a single line of code. Kafka Connect is a free, open-source solution, so you can start using it in your company right away (assuming you already have an Apache Kafka cluster up and running).
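
For example, this is roughly what a connector definition looks like when you POST it to the /connectors endpoint of the Connect REST API. The FileStreamSink connector class and the MaskField transformation ship with Apache Kafka; the connector name, topic, file path, and field name below are made up for illustration.

{
  "name": "orders_file_sink",
  "config": {
    "connector.class": "org.apache.kafka.connect.file.FileStreamSinkConnector",
    "tasks.max": "1",
    "topics": "orders",
    "file": "/tmp/orders.txt",
    "transforms": "maskEmail",
    "transforms.maskEmail.type": "org.apache.kafka.connect.transforms.MaskField$Value",
    "transforms.maskEmail.fields": "customer_email"
  }
}

Later in this post, we’ll manage definitions like this with Terraform instead of hand-crafted API calls.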

Kafka Connect is well suited to act as the infrastructure engine of your data platform, turning the whole process into a self-service experience.

We will use this technology to implement our self-serve data platform. Before kickstarting this project, here are a few requirements for our platform:

1. Horizontal/vertical scale needs to be seamless (hint — Kubernetes, the obvious choice).

2. The deployment image and connector plugins must be decoupled to allow maximum flexibility between deployment environments, so no fat JARs baked into the image! This way, you can adjust each Connect cluster to carry only the minimum required plugins.

3. Users should be able to add new connectors independently, and their changes will be version-controlled.

4. Users need a centralized view of the connectors to perform basic operations such as describing, restarting, or pausing.

5. Connector alerts are routed to the owning team to provide complete visibility over the status of their assets.

Deployment

Our preferred choice was Kubernetes, which goes well with Kafka Connect’s distributed mode. Confluent has many Docker images you could use to get going, but using their Helm Charts speeds up the process of getting the Kafka Connect cluster running. We needed to fork the cp-kafka-connect chart and apply some changes to meet our requirements. If you’re unsure how to deploy a Helm Chart, I’d suggest going over the official quickstart.
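
As a rough illustration, a values override for a cp-kafka-connect-style chart might look something like the snippet below. Treat the keys as assumptions rather than the chart’s actual schema: the exact names depend on the chart version and on the changes in your fork, so double-check them against your own values file.

# Illustrative Helm values for a forked cp-kafka-connect chart (key names are assumptions)
replicaCount: 3

image: confluentinc/cp-kafka-connect
imageTag: 7.5.0

kafka:
  bootstrapServers: "PLAINTEXT://kafka-bootstrap:9092"

configurationOverrides:
  "group.id": "self-serve-connect"
  "key.converter": "org.apache.kafka.connect.json.JsonConverter"
  "value.converter": "org.apache.kafka.connect.json.JsonConverter"
  "plugin.path": "/usr/share/java,/usr/share/confluent-hub-components"

prometheus:
  jmx:
    enabled: true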

CI/CD

Kafka Connect uses connectors to integrate external systems with Kafka, either into Kafka (Source) or out of Kafka (Sink). The open-source ecosystem maintains community connectors, which are free to use. Kafka Connect ships with some pre-installed connectors, such as MirrorMaker 2, but we wanted to add connector plugins seamlessly as our business use cases grew.

One way to add connectors to the setup is to pack the plugin JARs into the Connect cluster image, but this approach tightly couples connectors to cluster deployments. It requires building a new image each time a connector is added, and if you re-use the image for several deployments, you might end up with unused plugins that slow down your system and waste resources.

A simple way to add connectors to the deployment dynamically is to use a Kubernetes Init Container. You could download connectors straight from the internet using the Confluent Hub CLI (you’ll need the CLI pre-installed in your Init Container). Still, downloading crucial components from the internet isn’t the best practice, so it’s better to manage your plugin artifacts in an Artifactory or an object store like S3.

We used a simple interface to manage our connector plugins on top of a Confluent Hub mirror. The list of connector plugins is written to a YAML manifest, and CI caches the connector JARs from the internet into the company’s Artifactory.

The list of desired connectors per deployment is stored as a Kubernetes ConfigMap, and the initContainer uses that ConfigMap to determine which connectors to pull from the Artifactory. This allowed flexibility between deployments, while each deployment started only with the minimum required plugins.
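
Below is a simplified sketch of that wiring: a ConfigMap listing the plugins a given deployment needs, and an init container that pulls each one from Artifactory into the volume Connect loads plugins from. The plugin names, versions, image, paths, and Artifactory URL are all illustrative, not our actual setup.

# ConfigMap listing the plugins this Connect deployment needs (names/versions are illustrative)
apiVersion: v1
kind: ConfigMap
metadata:
  name: connect-plugins
data:
  plugins: |
    confluentinc-kafka-connect-s3:10.5.7
    debezium-debezium-connector-postgresql:2.4.2

# ...and a fragment of the Connect Deployment's pod spec (simplified)
initContainers:
  - name: fetch-plugins
    image: busybox:1.36
    command: ["/bin/sh", "-c"]
    args:
      - |
        # Download and unpack each plugin listed in the mounted ConfigMap
        while read plugin; do
          name="${plugin%%:*}"; version="${plugin##*:}"
          wget -q "https://artifactory.example.com/kafka-connect-plugins/${name}-${version}.zip" \
            -O "/tmp/${name}.zip"
          unzip -oq "/tmp/${name}.zip" -d /usr/share/confluent-hub-components
        done < /etc/connect-plugins/plugins
    volumeMounts:
      - name: plugin-volume           # shared with the Connect container's plugin.path
        mountPath: /usr/share/confluent-hub-components
      - name: plugin-manifest         # the ConfigMap above
        mountPath: /etc/connect-plugins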

IaC to Manage Connectors

Disclaimer: I’m a Terraform fan.

Did you know even Domino’s Pizza has a Terraform provider? Surely, if you can order pizza with Terraform, you can also manage your connectors as Terraform resources.

Terraform is a good choice for managing your connectors instead of raw API calls, for several reasons:

  1. Connector definitions are version-controlled; keeping track of every change or rollback is much harder when you use only API calls.
  2. Terraform lets you modify or validate user input, which is crucial if you want to set ground rules for all connectors.
  3. Terraform modules allow you to standardize resources and enforce desired configurations.
  4. It grants your users complete independence to safely add new connectors without worrying about misconfigurations. Terraform modules guarantee that your infrastructure is consistent, reusable, easy to configure, and modular.

There are some great Terraform providers out there for Kafka. Our preferred choice is Mongey’s Kafka Connect Provider.

One example of a critical rule we enforce on managed connectors is prepending the team name to the connector name. Kafka Connect does not store any tagging metadata, so prepending the owner to the connector name lets us capture the team name for monitoring purposes.
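
As a sketch, a thin Terraform module wrapping the provider can enforce that convention. The module layout, variable names, and Connect URL below are illustrative rather than our actual code; the kafka-connect_connector resource itself comes from Mongey’s provider.

terraform {
  required_providers {
    kafka-connect = {
      source = "Mongey/kafka-connect"
    }
  }
}

provider "kafka-connect" {
  url = "http://kafka-connect.internal:8083" # assumption: your Connect REST endpoint
}

variable "team" { type = string }        # owning team, becomes the connector name prefix
variable "name" { type = string }        # connector name without the prefix
variable "config" { type = map(string) } # the raw connector configuration

resource "kafka-connect_connector" "this" {
  # Enforce the "<team>_<name>" convention we rely on for monitoring
  name = "${var.team}_${var.name}"

  config = merge(var.config, {
    "name" = "${var.team}_${var.name}"
  })
}

Teams then call this module with only their team name and connector config, so the naming rule cannot be bypassed.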

Monitoring and Visibility

JMX is the standard way to monitor any JVM-based application, and Kafka Connect is no different. Luckily, the Confluent Kafka Connect Helm Chart has you covered and provides the infrastructure to expose your JMX metrics. Prometheus can easily scrape these metrics if you set the endpoint as a Prometheus target.

You can extend the JMX config to add custom labels. Earlier, I mentioned that we prepend the team name to each Kafka connector’s name; we catch that prefix with a regex and add it as a custom label, which lets us segment metrics by team. For example, for a connector called "myteam_my_awesome_connector", "myteam" is the first matched group and is extracted into the "team" label.

# A rule fragment from the JMX exporter configuration's rules section
- pattern: "kafka.connect<type=connector-metrics, connector=([^_]+)(_?.+)><>(.+): (.+)"
  name: kafka_connect_connector_metrics
  value: 1
  labels:
    connector: $1$2
    team: $1
    $3: $4
  type: UNTYPED

Once your Connect cluster metrics are stored in Prometheus, the next step should be to set up alerts using the Prometheus Operator. This example could speed up your learning curve if you haven’t deployed Prometheus rules in Kubernetes before.
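
To give a flavor, a PrometheusRule that fires when a connector reports a failed status, based on the metric and team label defined above, could look like the sketch below; the rule name, severity, and the label your Operator selects rules by are assumptions to adapt to your setup.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kafka-connect-alerts
  labels:
    release: prometheus  # assumption: whatever label your Prometheus Operator selects rules by
spec:
  groups:
    - name: kafka-connect
      rules:
        - alert: KafkaConnectorFailed
          # The metric below is produced by the JMX exporter rule shown earlier
          expr: kafka_connect_connector_metrics{status="failed"} == 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Connector {{ $labels.connector }} (team {{ $labels.team }}) is in a failed state"

Because the team label rides along from the JMX rule, Alertmanager can route each alert to the owning team, which covers requirement 5 above.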

No monitoring system is complete without visuals, and Confluent’s Helm chart already ships with an all-in-one Grafana dashboard. If you’re only interested in the Kafka Connect components, you can delete the irrelevant panels once you import the dashboard.

UI

Your Kafka Connect cluster is up and running, and connectors are streaming data to or from Kafka, but it’s pretty hard to manage connectors by sending ad-hoc HTTP requests just to check on their status.

Having a UI to administer your cluster and view your connector status is a good practice, especially if you expose your connector assets to the rest of the company. In our case, we implemented a UI over the Connect REST Interface, where we could better control user actions.

Our connectors are managed in Terraform, so we limited the UI to read-oriented operations such as listing and describing connector configs, with one exception: the ability to actively restart a connector.
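
Under the hood, these operations map to a handful of standard endpoints on the Connect REST API:

GET  /connectors                  # list all connectors
GET  /connectors/{name}/config    # describe a connector's configuration
GET  /connectors/{name}/status    # connector and task status
POST /connectors/{name}/restart   # the one write operation we allow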

The good news is you don’t have to implement your UI because there are many alternatives in the open-source community. lenses.io has a very good open-source UI for Kafka Connect — check out their GitHub repository.

Wrap up

Managing a self-serve ETL platform might sound intimidating at first, and many companies rush into major development efforts to create a tailor-made solution, or simply buy one that does it for them. Still, the good news is that you can use open-source software to achieve the same capabilities without writing the code yourself.

Kafka Connect has an active community, and many projects have been built around the Connect platform. Take a look at the Debezium project, a change-data-capture solution that fits perfectly into Kafka Connect as a streaming replacement for batch extraction. It provides connector classes for several databases and is maintained by its community. It works great for offline consumption and simplifies your Event-Driven Architecture when producing to Kafka is a side effect of writing to the database.
