Designing Event-Based Data Pipelines with Apache Kafka

Martin Arroyo
99P Labs
May 2, 2023

Introduction

Our process for designing and building data pipelines has evolved from lessons learned after hitting bottlenecks and limitations with traditional Extract, Transform, and Load (ETL) methods. The data we work with is both large in volume and delivered at high velocity, and typical batch processing workflows created bottlenecks in both processing and query speed that needed to be addressed.

These learnings influenced the development of the third iteration of our data platform — Platform 3 — where Apache Kafka has become a key component in our ingestion and ETL strategy. Since adopting Kafka, we have been able to significantly increase our throughput while decreasing processing times.

What follows is the design philosophy that evolved organically from those lessons. It is not meant to be prescriptive; rather, we are sharing insights from our experience building pipelines with Apache Kafka and what works for us.

What is Apache Kafka?

Apache Kafka is an open-source, distributed streaming platform that enables the handling of real-time data feeds and processing at scale. Originally developed at LinkedIn and now maintained by the Apache Software Foundation, Kafka is designed to support high-throughput, fault-tolerant, and low-latency data streaming, making it an ideal choice for applications involving big data, data integration, and real-time analytics. Kafka is built on a publish-subscribe model, which allows producers to send messages to topics and consumers to read those messages in a fault-tolerant and scalable manner.

Kafka’s architecture consists of four key components: producers, brokers, topics, and consumers. Producers create and publish messages to topics, which are logical channels used to categorize and store data. Brokers are responsible for storing and managing the published messages, ensuring fault tolerance and load balancing. Consumers subscribe to topics and process the messages as they arrive. Kafka’s ability to handle massive amounts of data while maintaining high performance has led to its widespread adoption across numerous industries, including finance, telecommunications, and social media.
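
To make the publish-subscribe model concrete, here is a minimal sketch using the kafka-python client; the broker address, topic name, and message contents are placeholders for illustration rather than details of any real deployment.

```python
# Minimal publish-subscribe sketch using the kafka-python client.
# Broker address, topic name, and payload are placeholders.
from kafka import KafkaProducer, KafkaConsumer

# A producer publishes messages to a topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("telematics-raw", key=b"vehicle-123", value=b'{"speed_kph": 72}')
producer.flush()

# A consumer subscribes to the topic and reads messages as they arrive.
consumer = KafkaConsumer(
    "telematics-raw",
    bootstrap_servers="localhost:9092",
    group_id="demo-readers",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.key, message.value)
```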

What role does it play in our data pipelines?

We use Kafka to coordinate and move data between stages in our data pipelines. It facilitates the transfer of “data events” between these stages, which then trigger some process on the data, such as parsing, aggregation, or a transformation of some kind.

Design Philosophy

Overview

When designing a pipeline, our goal is to continuously collect data from a given source and produce a queryable data set that enables downstream analytics. First, we consider how our data will be ingested from source to sink. After that, we determine how to shape, transform, and ultimately present the data for further analysis by internal and external stakeholders.

Our data processing flow consists of two key phases: a data ingestion phase followed by an ETL phase. Each phase is made up of a collection of processes, called workers, which are Kafka clients, each responsible for a single unit of work on the data. These workers listen to topics and respond to “data events” published by other workers, and they publish their own “data events” to a topic once they have completed their process. Each worker typically represents a single, logical stage in a pipeline (e.g., a parser worker that represents the parsing stage of the data ingestion phase).
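
As a rough illustration of the worker pattern, the sketch below shows a hypothetical parser-style worker built with the kafka-python client: it listens for data events on an upstream topic, performs its single unit of work, and publishes a new data event for the next stage. The topic names, the JSON event format, and the parse step are invented for the example; as described later, our actual events are encoded with Protocol Buffers.

```python
# Sketch of a single-purpose worker: consume a data event, do one unit of
# work, publish a new data event for the next stage. Topic names and the
# JSON event format are illustrative placeholders.
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "ingest.files.unzipped",          # events published by the upstream stage
    bootstrap_servers="localhost:9092",
    group_id="parser-workers",
)
producer = KafkaProducer(bootstrap_servers="localhost:9092")

def parse(raw: bytes) -> dict:
    """Placeholder for this stage's single unit of work."""
    return {"record_count": len(raw.splitlines())}

for event in consumer:
    result = parse(event.value)
    # Announce that this stage is done so downstream workers can react.
    producer.send("ingest.files.parsed", json.dumps(result).encode("utf-8"))
    producer.flush()
```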

Once data has been brought into our system from the original source and parsed, the ingestion phase ends and the data is now ready to be structured. In the first part of the ETL phase, we create the reference data set, which is a structured representation of our source data. This reference data is then used to create new representations of the source data that are suitable for analysis. The output of the ETL phase is what we call our derived data set, since it is derived from the reference data.

As a general rule, we don’t expose the reference data to anyone other than team members. The derived data sets are what we make available to both internal and external stakeholders through our platform.

Our derived data sets are created based on personas that we have identified and their data access needs. In general, we have a top secret (TSEC) derived data set, which contains more privileged information than our secret (SEC) data. This allows us to govern access to the data and ensure that only users with the appropriate clearances can see more sensitive information.

Data Processing Phases

Ingestion

During the ingestion phase, workers move the data along in a series of stages from a given source to our sink, which is a data lake. There are minimal — if any — modifications made to the data at this point. Common stages in this phase of the pipeline are parsing and unzipping of source files to prepare the data for restructuring. At the end of this phase, we are ready to create the reference data set.

ETL

The ETL phase is where data gets structured for presentation and analysis. We begin by creating the reference data set. Often, our source is telematics log data, which usually arrives unstructured; the first stage structures this data in a tabular format.
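
As a simple illustration of that first structuring step, the sketch below turns a couple of made-up, semi-structured log lines into a tabular table with pandas. The log format and column names are invented for the example and are not our actual reference schema.

```python
# Illustrative only: turn hypothetical telematics log lines into tabular
# records. The log format and column names are invented for this example.
import pandas as pd

raw_lines = [
    "2023-04-01T12:00:05Z vehicle=ABC123 speed_kph=64.2 lat=39.98 lon=-83.15",
    "2023-04-01T12:00:06Z vehicle=ABC123 speed_kph=64.8 lat=39.98 lon=-83.15",
]

def to_record(line: str) -> dict:
    timestamp, *pairs = line.split()
    record = {"timestamp": timestamp}
    for pair in pairs:
        key, value = pair.split("=", 1)
        record[key] = value
    return record

reference = pd.DataFrame([to_record(line) for line in raw_lines])
print(reference)
```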

The remaining stages further clean, structure, transform, aggregate, and enrich the data from our reference set. The end result is one or more derived data sets. That data is then made accessible to the 99P Labs community through our APIs and SDK.

Maintenance

Since each worker represents a single stage of data processing, refactoring is easier. This is because we can focus on a single piece of functionality at a time. If a particular transformation needs refactoring or debugging, only that worker will be affected while the rest of the pipeline can continue processing data. Likewise, removal or modification of a stage can be done without affecting other workers in the pipeline.

Scaling

Another benefit of having workers is the ability to simply add more of them when you need additional processing capacity. Kafka helps enable horizontal scaling for workloads by partitioning topics, allowing us to attach a worker to each partition and distribute processing.
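
Concretely, this relies on Kafka consumer groups: if a topic has, say, eight partitions, we can run up to eight identical copies of a worker under the same group ID and Kafka will spread the partitions across them. The sketch below, with placeholder topic and group names, shows that the only change on the worker side is sharing a group ID.

```python
# Horizontal scaling sketch: run several identical copies of this process.
# Because they share a group_id, Kafka assigns each copy a subset of the
# topic's partitions, distributing the processing load automatically.
from kafka import KafkaConsumer

def process(payload: bytes) -> None:
    """Placeholder for the worker's single unit of work."""
    print(f"processing {len(payload)} bytes")

consumer = KafkaConsumer(
    "ingest.files.parsed",            # placeholder topic with N partitions
    bootstrap_servers="localhost:9092",
    group_id="reference-workers",     # same group across every copy
)

for event in consumer:
    process(event.value)
```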

Schema Enforcement and Evolution

To ensure that our schema remains consistent throughout the process, each worker agrees to a schema contract that is enforced by Protocol Buffers, also known as Protobufs. Since Protobufs provide backwards compatibility, we are able to evolve the schema over time without breaking older processes or requiring significant refactoring.
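
As an illustration of what such a contract can look like, the sketch below defines a hypothetical TripEvent message and shows how a worker would serialize and deserialize it. The message, file, and module names (trip_event.proto, trip_event_pb2) are invented for the example and are not our actual schema; the Python module would be generated by protoc from the .proto shown in the comment.

```python
# Hypothetical schema contract sketch with Protocol Buffers.
#
# trip_event.proto (compiled with `protoc --python_out=. trip_event.proto`):
#
#   syntax = "proto3";
#
#   message TripEvent {
#     string trip_id     = 1;
#     string vehicle_id  = 2;
#     int64  received_at = 3;
#     // Added later under a new field number; older workers that don't know
#     // about it simply ignore it, which is what keeps the schema evolvable.
#     string source_uri  = 4;
#   }

import trip_event_pb2  # assumed to be generated by protoc from the sketch above

def encode(trip_id: str, vehicle_id: str, received_at: int) -> bytes:
    """Serialize a data event before publishing it to a Kafka topic."""
    event = trip_event_pb2.TripEvent(
        trip_id=trip_id,
        vehicle_id=vehicle_id,
        received_at=received_at,
    )
    return event.SerializeToString()

def decode(payload: bytes):
    """Deserialize an event consumed from a topic; fields added by newer
    producers don't break older consumers."""
    event = trip_event_pb2.TripEvent()
    event.ParseFromString(payload)
    return event
```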

Data pipeline examples

We’ll walk through two of our data pipeline designs, starting from a simple example, then moving to a more complex pipeline.

Telematics

This is a collection of proprietary Telematics data for research use cases. The data is received from vehicles in the US and consists of trip information and related vehicle attributes.

Telematics Data Pipeline Design

Ingestion Phase

The Telematics pipeline is one of the simplest pipeline designs that we currently have. There is only one worker responsible for both ingestion and backfilling from the source data. This worker checks the source data and returns the location of any new data it finds. The locations are then published as data events that the reference worker consumes, so it can extract and load new data into the reference data set.

ETL Phase

First, the reference data set is created. Then, the next worker aggregates the reference data to create our trip summary data set. Following that, the last two workers read the new trip summary data to create their respective data sets, which can be used to enrich the trip summary. The vehicle info worker decodes VINs using the NHTSA Vehicle API to obtain additional vehicle metadata. The geographic info worker takes in coordinates from the trip summary data and performs a reverse geocode lookup, adding general location information about the trip.
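
As an illustration of the vehicle info step, the sketch below decodes a VIN against NHTSA's public vPIC DecodeVinValues endpoint. The specific endpoint and the fields kept are assumptions for the example rather than a description of our worker.

```python
# Illustrative VIN decode against the public NHTSA vPIC API. The endpoint is
# the publicly documented DecodeVinValues call; the fields kept here are an
# assumption for this example, not our worker's actual output.
import requests

def decode_vin(vin: str) -> dict:
    url = f"https://vpic.nhtsa.dot.gov/api/vehicles/DecodeVinValues/{vin}?format=json"
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    result = response.json()["Results"][0]
    return {
        "make": result.get("Make"),
        "model": result.get("Model"),
        "model_year": result.get("ModelYear"),
        "body_class": result.get("BodyClass"),
    }

print(decode_vin("5UXWX7C5*BA"))  # NHTSA's documented partial-VIN example
```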

Vehicle-to-Everything (V2X)

V2X data is collected as part of the US 33 Smart Corridor initiative. The data is a feed from V2X instrumented vehicles and captures various external interactions.

Of the two pipelines described here, V2X is the more complex. It has a lot of moving pieces, and the output is two derived data sets instead of one.

V2X Data Pipeline Design

Ingestion Phase

To begin ingestion, the Scanner worker checks the V2X data source for any new zip files. When it discovers a new file, it writes that file to our source archive and publishes a data event that includes the location of the newly written file. From there, the Unzipper, which subscribes to the notifications from Scanner, grabs and unzips the file before publishing the unzipped binary data as an event.

After receiving the unzipped binary data, the Parsing worker parses that data and sends the results to the Reference worker so that it can create the reference data set.
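
To make the Scanner-to-Unzipper handoff concrete, here is a sketch of what an Unzipper-style worker might look like with the kafka-python client: it consumes events carrying the location of a newly archived zip file, extracts its contents, and publishes the binary data for the Parsing worker. The topic names and the plain-file-path event payload are simplified assumptions for the example.

```python
# Sketch of an Unzipper-style worker. It consumes events carrying the
# location of a newly archived zip file, extracts each member, and publishes
# the binary contents for the parsing stage. Topic names and the plain-path
# event payload are simplified assumptions for this example.
import zipfile
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "v2x.source.archived",            # events published by the Scanner
    bootstrap_servers="localhost:9092",
    group_id="unzipper-workers",
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    max_request_size=10 * 1024 * 1024,  # allow larger binary payloads (broker limits also apply)
)

for event in consumer:
    zip_path = event.value.decode("utf-8")
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            payload = archive.read(member)
            # Hand the unzipped binary data to the Parsing worker.
            producer.send("v2x.files.unzipped", key=member.encode(), value=payload)
    producer.flush()
```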

ETL Phase

The Reference worker, after creating the reference data set, sends the location of new data downstream to the TSEC worker, which begins creating the TSEC data set. For V2X, this is simply a version of the reference data that is meant to be exposed to external stakeholders. The data then flows to the next worker, Trip Summary, which is responsible for aggregating and summarizing the trips that each vehicle took. After Trip Summary, the next stage is Visit Summary, which identifies the locations that a vehicle has visited, based on dwell time and ignition cycles.

Following Visit Summary, the Geofence worker establishes geofences around points that drivers frequently visit, then publishes events that are picked up by the SEC worker. SEC (short for secret) is the other derived data set that the V2X pipeline outputs, alongside TSEC.
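
The geofencing logic itself isn't covered here, but the core check, deciding whether a point falls inside a circular geofence around a frequently visited location, can be sketched with a haversine distance. The coordinates and radius below are made up for the example.

```python
# Generic point-in-geofence check using the haversine distance. The
# coordinates and radius are invented; this is an illustration of the
# underlying check, not the actual Geofence worker logic.
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in meters."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

def in_geofence(point, center, radius_m):
    return haversine_m(*point, *center) <= radius_m

frequent_stop = (40.0992, -83.1141)  # made-up center point
print(in_geofence((40.0990, -83.1139), frequent_stop, radius_m=150))  # True
```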

While TSEC includes the trip summary and other trip metadata (e.g., visits and geofences), SEC is meant for more general, public consumption. As such, we redact more data from SEC to prevent the sharing of sensitive information: it does not include a trip summary, and certain geographic information is obfuscated.
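
We won't detail the redaction rules here, but the sketch below shows the general idea on an invented record: fields that belong only in TSEC are dropped, and coordinate precision is reduced as one common way to obfuscate location. The field names and rounding rule are assumptions for the example.

```python
# Illustration only: the field names, record shape, and rounding rule are
# invented to show the general idea of deriving SEC from TSEC-level data.
tsec_record = {
    "trip_id": "t-001",
    "trip_summary": {"distance_km": 12.4, "duration_min": 23},
    "start_lat": 40.099213,
    "start_lon": -83.114142,
}

TSEC_ONLY_FIELDS = {"trip_summary"}

def to_sec(record: dict) -> dict:
    sec = {k: v for k, v in record.items() if k not in TSEC_ONLY_FIELDS}
    # Coarsen coordinates (about 1 km at 2 decimal places) to obfuscate location.
    for field in ("start_lat", "start_lon"):
        sec[field] = round(sec[field], 2)
    return sec

print(to_sec(tsec_record))
```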

Observed Benefits and Drawbacks

By making Apache Kafka a centerpiece of our data pipeline design strategy, we have realized increased throughput and reduced latency during data processing. It also enables us to scale our processes up and down according to our needs. Another benefit is that maintenance and refactoring are simpler than they were on the previous platform.

However, there are some drawbacks to using Kafka. Breaking a pipeline down into individual units of work makes future maintenance easier, but initially designing non-trivial pipelines with Kafka can be harder than with traditional batch methods. In particular, workloads that require many complex joins can be more difficult to accomplish with a Kafka-based pipeline. That being said, our experience so far is that the benefits far outweigh any drawbacks.

Conclusion

Incorporating Apache Kafka into our data pipeline design has proven to be a game-changer for handling large volumes of data at high velocities. By embracing Kafka’s distributed, fault-tolerant architecture, we’ve seen improvements in throughput, latency, and scalability.

While there are some challenges, particularly when dealing with complex joins, the benefits we’ve gained from using Kafka far outweigh any drawbacks. As we continue to evolve and optimize our data pipelines, Kafka’s role as a key component in our ingestion and ETL strategy will remain crucial to our success.

Thank you so much for reading. If you’d like to discuss collaborating on a project, don’t hesitate to reach out by following our 99P Labs blog here on Medium, connecting with us on our LinkedIn page, leaving a comment, or contacting us at research@99plabs.com.

Martin Arroyo is a Data Engineer at 99P Labs and a Data Analytics Instructor at COOP Careers.