To stream or not to stream. That is the question.

Tiểu Đông Tà
Life of a Senior Data Engineer
7 min read · May 17, 2020

Data streaming through Kafka is becoming an essential part of many data applications. We use Kafka mostly for its stability and its ability to persist messages across multiple partitions.

I will not go into the details of why we use Kafka for streaming. Instead, I want to tackle a problem that arises from streaming itself:

How do you enforce data quality on your streaming platform? More precisely, how do you deal with derived data?

Traditionally, on a batch-processing platform, you have a ton of ETL jobs to clean the data and enforce quality control. But once all the data arrives through a stream, you suddenly lose the option of sprinkling (“nasty”) ETL scripts everywhere just to enforce data quality.

Design.

We approach data quality control by using Kafka itself. Since we can create a reasonably large number of topics on Kafka, I decided to try an idea: both the “main” component that consumes the data and the “data quality” component of the system read from the same topic.

Logical layer of the system

Since one Kafka topic can have one or more partitions, we can reserve an extra partition for data quality control. It is important to note that, within a consumer group, each partition is consumed by only ONE consumer.

The idea is pretty simple:

We apply data quality rules on the fly, in parallel. That means fewer headaches and no dependency on any system other than Kafka itself.
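As a minimal sketch of this idea (using the kafka-python client, with an assumed topic name raw-data and assumed group names), two components can read the same topic independently by joining different consumer groups:

from kafka import KafkaConsumer

# The "main" processor and the "data quality" watchdog subscribe to the
# same topic under different consumer groups, so each one receives every
# message without stealing work from the other.
main_consumer = KafkaConsumer(
    "raw-data",
    bootstrap_servers="localhost:9092",
    group_id="main-processor",
)
quality_consumer = KafkaConsumer(
    "raw-data",
    bootstrap_servers="localhost:9092",
    group_id="quality-watchdog",
)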

All layers of the system

In the full system, I want to publish two things back to Kafka:

  1. Cleaned data, the result of any ETL operation. This is published back to Kafka under a new topic. Any aggregations (yes, the famous GROUP BY) can also be calculated on the fly.
  2. The result of the quality control: is the data complete? Are there any null or suspicious values? … This result is sunk back to Kafka for future consumption or processing. You may plug in a real-time dashboard if you wish.

Bare-metal implementation

So, do you now believe that we can achieve such a thing as data quality control on the stream?

Yes, you’d better believe it.

Now comes the actual implementation, at the physical level, of my prototype:

Components of the system

I have 3 separate components: Gin, Vodka and Bourbon.

Let’s take a look at them in detail.

Gin

Gin

Gin is nothing special, just a simple Kafka producer. It produces a continuous stream of JSON messages in the following format.

{
"id":1,
"first_name":"Barthel",
"last_name":"Kittel",
"email":"bkittel0@printfriendly.com",
"gender":"Male",
"ip_address":"130.187.82.195",
"date":"06/05/2018",
"country":"france"
}
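
For illustration, a minimal sketch of such a producer using the kafka-python library; the broker address and the topic name raw-data are assumptions, and the actual Gin code may look different:

import json
from kafka import KafkaProducer

# Serialize each record as JSON bytes before sending it to Kafka.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

record = {
    "id": 1,
    "first_name": "Barthel",
    "last_name": "Kittel",
    "email": "bkittel0@printfriendly.com",
    "gender": "Male",
    "ip_address": "130.187.82.195",
    "date": "06/05/2018",
    "country": "france",
}

# Publish the record to the (assumed) raw-data topic and flush.
producer.send("raw-data", value=record)
producer.flush()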

Check out the implementation of Gin

Vodka

This is Vodka

Vodka is the ETL processor. It consumes messages and generates derived streams from them. Vodka is an important component (just like vodka itself). My requirements for Vodka are:

  • Reliable
  • Scalable
  • Persistence and recovery should be easy upon failure.
  • Internal data communication inside Vodka itself should be easy. For example, I want to run several processing steps on the stream, pass it to another module for decoration, then publish it to one or several topics. This internal communication can be hard to achieve if we spawn multiple instances for scaling.

With these criteria in mind, instead of the official Kafka consumer library, I tried Faust.

What is Faust, you may ask?

I am part of that power which eternally wills evil
and eternally works good.

-Johann Wolfgang von Goethe, Faust-

Well, I like Goethe a lot, but unfortunately he is not the one helping with the creation of Vodka.

The real Faust in this case is an open-source library that provides an abstraction layer over the Kafka consumer, allowing me to easily spawn a cluster of consumers for the job.

I can easily spawn consumer instances to consume messages from the topic’s partitions, then perform ETL, sinks, etc. Faust itself handles internal data communication out of the box.
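As a rough sketch of what a Faust agent for this kind of ETL can look like (the topic names and the transformation are illustrative assumptions, not the exact Vodka code):

import faust

# A Faust app is a cluster of workers coordinated through Kafka itself.
app = faust.App("vodka", broker="kafka://localhost:9092")

raw_topic = app.topic("raw-data")        # topic names assumed for illustration
clean_topic = app.topic("cleaned-data")

@app.agent(raw_topic)
async def etl(stream):
    # Each worker consumes its share of the topic's partitions.
    async for record in stream:
        record["country"] = record["country"].capitalize()
        # Publish the derived record to a new topic.
        await clean_topic.send(value=record)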

Check out the implementation of Vodka

Bourbon

This is Bourbon

Bourbon is the data quality watchdog. It subscribes to the raw data topic and performs data quality controls over the incoming stream. It shares the same design and implementation as Vodka.

While Vodka is the processor, Bourbon is the controller. In this version, it simply checks whether the email and IP address are valid. It publishes the result in a very rudimentary format.

Given:

{
"id":1,
"first_name":"Barthel",
"last_name":"Kittel",
"email":"bkittel0@printfriendly.com",
"gender":"Male",
"ip_address":"130.187.82.195",
"date":"06/05/2018",
"country":"france"
}

Bourbon will publish the following message to the Data Quality topic:

{
"id":None,
"first_name":None,
"last_name":None,
"email":True,
"gender":None,
"ip_address":True",
"date":None,
"country":None
}

For each field, a None value indicates that no check has been performed. True or False, respectively, indicates that the check has been performed and the quality is good or bad.
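A minimal sketch of such a check, built here on Python’s re and ipaddress modules (the validate helper is hypothetical and the real Bourbon rules may differ):

import re
import ipaddress

# Rudimentary email pattern: something@something.something
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate(record):
    # None = not checked, True/False = check performed and passed/failed.
    report = {field: None for field in record}
    report["email"] = bool(EMAIL_RE.match(record["email"]))
    try:
        ipaddress.ip_address(record["ip_address"])
        report["ip_address"] = True
    except ValueError:
        report["ip_address"] = False
    return report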

Check the implementation of Bourbon

Results

I will not go into detail about how to launch the project, as the instructions are written in the README.md file of the project itself.

The project is largely inspired by https://github.com/florimondmanca/kafka-fraud-detector

The author has done a brilliant job of setting up the Apache Kafka and Apache ZooKeeper stack using docker-compose. Without this, I would have struggled a lot to set up the basic infrastructure.

Here is the full implementation of my project, with instructions on how to launch it.

Now, let’s see Gin, Bourbon and Vodka at work.

vodka_1    | [2020-05-17 08:08:50,821] [1] [WARNING] Tajikistan has now appeared 1 times
vodka_1    | [2020-05-17 08:08:50,822] [1] [WARNING] derived data:
vodka_1    | [2020-05-17 08:08:50,822] [1] [WARNING] {'id': 29, 'first_name': 'Nevin', 'last_name': 'Starmore', 'email': 'nstarmores@webnode.com', 'gender': 'Male', 'ip_address': '148.159.12.195', 'date': '2018-09-09', 'country': 'Tajikistan'}
bourbon_1  | [2020-05-17 08:08:54,481] [1] [WARNING] data quality enforced on message...
bourbon_1  | [2020-05-17 08:08:54,481] [1] [WARNING] {'id': None, 'first_name': None, 'last_name': None, 'email': False, 'gender': None, 'ip_address': True, 'date': None, 'country': None}

Here are the operational logs of Vodka and Bourbon.

You can see that Vodka is constantly doing two things:

  • Processing raw data. In the implementation, Vodka makes sure first_name, last_name, and country always start with a capital letter, and also converts the date from the EU-style to the US-style format.
  • Aggregating on country, to count how many times each country appears within a specific time window (a sketch of how this looks in Faust follows below).
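
In Faust, such a windowed count can be expressed with a tumbling table; the 60-second window and topic name below are assumptions for illustration, not necessarily what Vodka uses:

from datetime import timedelta
import faust

app = faust.App("vodka", broker="kafka://localhost:9092")
raw_topic = app.topic("raw-data")  # topic name assumed for illustration

# Tumbling windowed table: per-country counters that roll over every 60 seconds.
country_counts = app.Table(
    "country_counts", default=int
).tumbling(timedelta(seconds=60), expires=timedelta(hours=1))

@app.agent(raw_topic)
async def count_countries(stream):
    async for record in stream:
        country = record["country"].capitalize()
        country_counts[country] += 1
        print(f"{country} has now appeared {country_counts[country].current()} times")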

On Bourbon’s side, we can see that data quality is being checked constantly on the fly, although it is a very rudimentary check.

Some thoughts

After this small experiment, I personally think Kafka and Faust are a good match. While Kafka gives the impression of a production-grade streaming platform that a business can rely on, Faust gives the extra shot of power that the Kafka consumer may need. However, there are still several things I disliked:

  • Setting up Apache Kafka is extremely painful, with a lot of components to look after. Kafka does not seem to be designed to be cloud-native, the kind of application you can deploy on a Kubernetes cluster with one command. That is the kind of cloud-native experience I fell in love with at first sight in NATS (https://nats.io).
  • Faust, although a brilliant piece of work, seems to undervalue the importance of encoding/decoding standards. Avro and Protobuf are state-of-the-art encoding formats: they are not only lightweight and reliable, but also greatly help with schema evolution over the life of a project, which is why they have largely displaced JSON in serious streaming pipelines. Yet Faust does not support Avro or Protobuf out of the box, which is quite disappointing to find out.
  • Related to the previous point, Faust requires the user to declare schemas as classes. This approach does not satisfy me, as it ignores the strengths of modern encoding formats (Avro/Protobuf in mind). It feels like reinventing the wheel.

In the end, if there are any mistakes in the article or in the code (surely there are a lot), please spare me and blame my pets. One helped by chewing my keyboard and one screamed into my ears while I worked on this mini project. Here are the ones to blame:

The screaming one (left) and the one who likes chewing the keyboard (right)
