To stream or not to stream. That is the question.

Tiểu Đông Tà
Life of a Senior Data Engineer
7 min read · May 17, 2020

Data streaming through Kafka is becoming an essential part of many data applications. We use Kafka mostly for its stability and its ability to persist messages across multiple partitions.

I will not go into the details of why we use Kafka for streaming. Instead, I want to tackle a problem that arises from streaming itself:

How do you enforce data quality on your streaming platform? More precisely, how do you deal with derived data?

Traditionally, on a batch-processing platform, you have a ton of ETL jobs to clean the data and enforce quality control. But once all the data arrives through a stream, you suddenly lose the option of sprinkling (“nasty”) ETL scripts everywhere just to enforce data quality.

Design.

We approach data quality control by using Kafka itself. Since we can create a reasonably large number of topics on Kafka, I decided to try an idea: both the “main” component that consumes the data and the “data quality” component of the system read from the same topic.

Logical layer of the system

Since one Kafka topic can have one or more partitions, we can reserve an extra partition for data quality control. It is important to note that, within a consumer group, each partition is consumed by only ONE consumer.

The idea is pretty simple:

We apply data quality rules on the fly, in parallel. That means fewer headaches and no dependency on any system other than Kafka itself.
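As a minimal sketch of this idea (using the kafka-python client, with an assumed topic name raw-data and assumed group names), two components can read the same topic independently by joining different consumer groups:

from kafka import KafkaConsumer

# The "main" processor and the "data quality" watchdog subscribe to the
# same topic under different consumer groups, so each one receives every
# message without stealing work from the other.
main_consumer = KafkaConsumer(
    "raw-data",
    bootstrap_servers="localhost:9092",
    group_id="main-processor",
)
quality_consumer = KafkaConsumer(
    "raw-data",
    bootstrap_servers="localhost:9092",
    group_id="quality-watchdog",
)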

All layers of the system

In the full system, I want to publish two things back to Kafka:

  1. Cleaned data, the result of any ETL operation. This is published back to Kafka under a new topic. Any aggregations (yes, the famous GROUP BY) can also be calculated on the fly.
  2. The result of the quality control: is the data complete? Are there any null or suspicious values? … This result is sunk back to Kafka for future consumption or processing. You may plug in a real-time dashboard if you wish.

Bare-metal implementation

So, do you now believe that we can achieve such a thing as data quality control on the stream?

Yes, you’d better believe it.

Now comes the actual implementation, at the physical level, of my prototype:

Components of the system

I have 3 separate components: Gin, Vodka and Bourbon.

Let’s take a look at them in detail.

Gin

Gin

Gin is nothing special, just a simple Kafka producer. It produces a continuous stream of JSON messages in the following format.

{
"id":1,
"first_name":"Barthel",
"last_name":"Kittel",
"email":"bkittel0@printfriendly.com",
"gender":"Male",
"ip_address":"130.187.82.195",
"date":"06/05/2018",
"country":"france"
}
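
For illustration, a minimal sketch of such a producer using the kafka-python library; the broker address and the topic name raw-data are assumptions, and the actual Gin code may look different:

import json
from kafka import KafkaProducer

# Serialize each record as JSON bytes before sending it to Kafka.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

record = {
    "id": 1,
    "first_name": "Barthel",
    "last_name": "Kittel",
    "email": "bkittel0@printfriendly.com",
    "gender": "Male",
    "ip_address": "130.187.82.195",
    "date": "06/05/2018",
    "country": "france",
}

# Publish the record to the (assumed) raw-data topic and flush.
producer.send("raw-data", value=record)
producer.flush()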

Check out the implementation of Gin

Vodka

This is Vodka

Vodka is the ETL processor. It consumes messages and generates derived streams from them. Vodka is an important component (just like vodka itself). My requirements for Vodka are:

  • Reliable
  • Scalable
  • Persistence and recovery should be easy upon failure.
  • Internal data communication inside Vodka itself should be easy. For example, I want to run several processing steps on the stream, pass it to another module for decoration, then publish it to one or several topics. This internal communication can be hard to achieve if we spawn multiple instances for scaling.

With these criteria in mind, instead of the official Kafka consumer library, I tried Faust.

What is Faust, you may ask?

I am part of that power which eternally wills evil
and eternally works good.

-Johann Wolfgang von Goethe, Faust-

Well, I like Goethe a lot, but unfortunately he is not the one helping with the creation of Vodka.

The real Faust in this case is an open-source library that provides an abstraction layer over the Kafka consumer, allowing me to easily spawn a cluster of consumers for the job.

I can easily spawn consumer instances to consume messages from the topic’s partitions, then perform ETL, sinks, etc. Faust itself handles internal data communication out of the box.
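As a rough sketch of what a Faust agent for this kind of ETL can look like (the topic names and the transformation are illustrative assumptions, not the exact Vodka code):

import faust

# A Faust app is a cluster of workers coordinated through Kafka itself.
app = faust.App("vodka", broker="kafka://localhost:9092")

raw_topic = app.topic("raw-data")        # topic names assumed for illustration
clean_topic = app.topic("cleaned-data")

@app.agent(raw_topic)
async def etl(stream):
    # Each worker consumes its share of the topic's partitions.
    async for record in stream:
        record["country"] = record["country"].capitalize()
        # Publish the derived record to a new topic.
        await clean_topic.send(value=record)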

Check out the implementation of Vodka

Bourbon

This is Bourbon

Bourbon is the data quality watchdog. It subscribes to the raw data topic and performs data quality controls over the incoming stream. It shares the same design and implementation as Vodka.

While Vodka is the processor, Bourbon is the controller. In this version, it simply checks whether the email and IP address are valid. It publishes the result in a very rudimentary format.

Given:

{
"id":1,
"first_name":"Barthel",
"last_name":"Kittel",
"email":"bkittel0@printfriendly.com",
"gender":"Male",
"ip_address":"130.187.82.195",
"date":"06/05/2018",
"country":"france"
}

Bourbon will publish the following message to the Data Quality topic:

{
"id":None,
"first_name":None,
"last_name":None,
"email":True,
"gender":None,
"ip_address":True",
"date":None,
"country":None
}

For each field, a None value indicates that no check has been performed. True or False, respectively, indicates that the check has been performed and the quality is good or bad.
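A minimal sketch of such a check, built here on Python’s re and ipaddress modules (the validate helper is hypothetical and the real Bourbon rules may differ):

import re
import ipaddress

# Rudimentary email pattern: something@something.something
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate(record):
    # None = not checked, True/False = check performed and passed/failed.
    report = {field: None for field in record}
    report["email"] = bool(EMAIL_RE.match(record["email"]))
    try:
        ipaddress.ip_address(record["ip_address"])
        report["ip_address"] = True
    except ValueError:
        report["ip_address"] = False
    return report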

Check the implementation of Bourbon

Results

I will not go into detail about how to launch the project, as the instructions are written in the README.md file of the project itself.

The project is largely inspired by https://github.com/florimondmanca/kafka-fraud-detector

The author has done a brilliant job of setting up the Apache Kafka and Apache ZooKeeper stack using docker-compose. Without this, I would have struggled a lot to set up the basic infrastructure.

Here is the full implementation of my project, with instructions on how to launch it.

Now, let’s see Gin, Bourbon and Vodka at work.

vodka_1    | [2020-05-17 08:08:50,821] [1] [WARNING] Tajikistan has now appeared 1 times
vodka_1    | [2020-05-17 08:08:50,822] [1] [WARNING] derived data:
vodka_1    | [2020-05-17 08:08:50,822] [1] [WARNING] {'id': 29, 'first_name': 'Nevin', 'last_name': 'Starmore', 'email': 'nstarmores@webnode.com', 'gender': 'Male', 'ip_address': '148.159.12.195', 'date': '2018-09-09', 'country': 'Tajikistan'}
bourbon_1  | [2020-05-17 08:08:54,481] [1] [WARNING] data quality enforced on message...
bourbon_1  | [2020-05-17 08:08:54,481] [1] [WARNING] {'id': None, 'first_name': None, 'last_name': None, 'email': False, 'gender': None, 'ip_address': True, 'date': None, 'country': None}

Here are the operational logs of Vodka and Bourbon.

You can see that Vodka is constantly doing two things:

  • Processing raw data. In the implementation, Vodka makes sure first_name, last_name, and country always start with a capital letter, and also converts the date from the EU-style to the US-style format.
  • Aggregating on country, to count how many times each country appears within a specific time window (a sketch of how this looks in Faust follows below).
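
In Faust, such a windowed count can be expressed with a tumbling table; the 60-second window and topic name below are assumptions for illustration, not necessarily what Vodka uses:

from datetime import timedelta
import faust

app = faust.App("vodka", broker="kafka://localhost:9092")
raw_topic = app.topic("raw-data")  # topic name assumed for illustration

# Tumbling windowed table: per-country counters that roll over every 60 seconds.
country_counts = app.Table(
    "country_counts", default=int
).tumbling(timedelta(seconds=60), expires=timedelta(hours=1))

@app.agent(raw_topic)
async def count_countries(stream):
    async for record in stream:
        country = record["country"].capitalize()
        country_counts[country] += 1
        print(f"{country} has now appeared {country_counts[country].current()} times")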

On Bourbon’s side, we can see that data quality is being checked constantly on the fly, although it is a very rudimentary check.

Some thoughts

After this small experiment, I personally think Kafka and Faust are a good match. While Kafka gives the impression of a production-grade streaming platform that a business can rely on, Faust gives the extra shot of power that the Kafka consumer may need. However, there are still several things I disliked:

  • Setting up Apache Kafka is extremely painful, with a lot of components to look after. Kafka does not seem to be designed to be cloud-native, the kind of application you can deploy on a Kubernetes cluster with one command. That is the kind of cloud-native experience I fell in love with at first sight in NATS (https://nats.io).
  • Faust, although a brilliant piece of work, seems to undervalue the importance of encoding/decoding standards. Avro and Protobuf are state-of-the-art encoding formats: they are not only lightweight and reliable, but also greatly help with schema evolution over the life of a project, which is why they have largely displaced JSON in serious streaming pipelines. Yet Faust does not support Avro or Protobuf out of the box, which is quite disappointing to find out.
  • Related to the previous point, Faust requires the user to declare schemas as classes. This approach does not satisfy me, as it ignores the strengths of modern encoding formats (Avro/Protobuf in mind). It feels like reinventing the wheel.

In the end, if there are any mistakes in the article or in the code (surely there are a lot), please spare me and blame my pets. One helped by chewing my keyboard and one screamed into my ears while I worked on this mini project. Here are the ones to blame:

The screaming one (left) and the one who likes chewing the keyboard (right)
