How to detect the origin of data quality issues that will impact your analytics pipeline while still deploying services fast in Kafka

Michael E Colley
5 min read · Sep 9, 2021


Photo by Vitaly Vlasov from Pexels

Kafka has been around for a while and is a mainstay among some of the biggest names in tech like Netflix, LinkedIn and Uber. Its low latency allows for real-time event streaming, and its basic commit-log design makes it incredibly reliable for delivering data from A to B. The paradigm shift Kafka offered, from static to event-based datasets, is one of the reasons Data Analytics has grown as a discipline, as we’re now able to deliver granular and insightful analytics about the state of users. This in turn keeps the business units and Execs happy, and so the world continues to turn.

But things aren’t all rainbows and sunshine. At the seams between your backend and your data warehouse lies a problem. Like with most technical design decisions, you will have made trade-offs: you’ve either optimized for service deployment speed or built your data pipeline for resilience. The two feel mutually exclusive because you’ve deployed a generic service that dumps any data fed into your Producer directly into Big Query, but it arrives as JSON to spare your engineering team the burden of statically typing each event type. This works wonders for speeding up development time, but it costs your data team dearly in lost data context, making it hard to enforce schema validity across the entire pipeline.
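To make the trade-off concrete, here is a minimal sketch of what such a generic pass-through sink might look like, assuming kafkajs and @google-cloud/bigquery; the topic, dataset and consumer group names are illustrative placeholders rather than anything from a real setup.

const { Kafka } = require('kafkajs');
const { BigQuery } = require('@google-cloud/bigquery');

const kafka = new Kafka({ clientId: 'generic-sink', brokers: ['kafka:9092'] });
const bigquery = new BigQuery();

const run = async () => {
  const consumer = kafka.consumer({ groupId: 'warehouse-sink' });
  await consumer.connect();
  await consumer.subscribe({ topic: 'user_events', fromBeginning: false });

  await consumer.run({
    eachMessage: async ({ topic, message }) => {
      // No schema validation: whatever JSON the producer emitted lands in the warehouse as-is
      const row = JSON.parse(message.value.toString());
      await bigquery.dataset('analytics').table(topic).insert([row]);
    },
  });
};

run().catch(console.error);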

Having both would be having your cake and eating it too, and we know that’s not realistic. Your CTO tells you that you need to prioritize shipping new features to reach profitability, so the data team lets out a sigh and resigns themselves to stomaching the risk of dirty data pulling down the pipeline, kicking the company-wide KPI reporting out of sync, and giving the data team a bad rap. This nightmare may sound extreme and unlikely in organizations at scale, but the truth is that this was the world I lived in until we built Turnstyl.

What is Turnstyl?

Turnstyl is an NPM package built to work with Kafka and other message brokers to allow you to compare the data schema input at the producer level to what arrives in your data warehouse downstream.

The aim of Turnstyl is to monitor your pipeline by periodically polling the producer events at the top of the data pipeline and comparing them to the events arriving in the data warehouse at the bottom. Upon receiving two models whose schemas don’t match, Turnstyl will log an error to the console, to a local log file, and to Slack (in Beta) to let you and your team know that something has changed upstream. With Turnstyl alerting you to changes in your pipeline, your data team will save hours of debugging time, allowing them to respond faster to breaking changes introduced at the data source.
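For intuition, here is a rough illustration of the kind of producer-vs-warehouse comparison this involves. It is not Turnstyl’s actual implementation; the dataset and table names are placeholders, and only field names (not types) are checked here.

const { BigQuery } = require('@google-cloud/bigquery');

// Naively infer a field -> JavaScript type map from a produced event
const inferSchema = (event) =>
  Object.fromEntries(
    Object.entries(event).map(([key, value]) => [key, typeof value])
  );

const compareToWarehouse = async (event, datasetName, tableName) => {
  const bigquery = new BigQuery();
  // Fetch the destination table's schema from Big Query
  const [metadata] = await bigquery
    .dataset(datasetName)
    .table(tableName)
    .getMetadata();
  const warehouseFields = new Set(metadata.schema.fields.map((f) => f.name));

  // Flag any field produced upstream that is missing downstream
  for (const field of Object.keys(inferSchema(event))) {
    if (!warehouseFields.has(field)) {
      console.error(
        `Schema mismatch: field "${field}" produced upstream is missing in ${tableName}`
      );
    }
  }
};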

How can you use Turnstyl?

Using Turnstyl is easy, as it integrates into your existing Kafka setup without compromising on performance.

  1. In your terminal run npm install turnstyl in your target project directory
  2. Create a turnstyl.config.yaml in your project root directory, using the template below to input your Google Big Query project and dataset names, and the path to your Google API service account credentials JSON.

Example turnstyl.config.yaml file:

version: '1.0'
google_service_credentials: './GOOGLE_BIG_QUERY_SERVICE_ACCOUNT_CREDENTIALS.json'
big_query_project_name: 'BIG_QUERY_PROJECT_NAME'
big_query_dataset_name: 'BIG_QUERY_DATASET_NAME'
  3. Import the package into the service where data is being sent to your producer

const { Turnstyl } = require("turnstyl");

  4. Instantiate a new Turnstyl object and invoke the following two methods:

const newTurnstyl = new Turnstyl();

  • newTurnstyl.cacheProducerEvent(topic, message); - Caches your events where they are produced (i.e. in the producer), with reference to the topic they will be sent through, parsing the object for its data types.
  • newTurnstyl.compareProducerToDBSchema(topic); - Compares the cached producer schema to the events that have arrived in Big Query (assuming the topic name matches your target table), flagging if there is a discrepancy between the two.
Putting it together, an example producer that uses both methods looks like this:

const { Kafka } = require('kafkajs');
const { Turnstyl } = require('turnstyl');

const newTurnstyl = new Turnstyl();

/**
 * @function producer connects to Kafka, sends a message, then disconnects
 * @param producerName STRING name of the producer
 * @param message OBJECT event payload that will be sent to Kafka
 * @param topic STRING name of the topic that the message will be posted to on Kafka
 */
const producer = async (producerName, message, topic) => {
  // Declare a variable kafka assigned to an instance of Kafka (the door into the Kafka brokerage)
  const kafka = new Kafka({
    clientId: producerName,
    brokers: ['kafka:9092', 'kafka:9093', 'kafka:9095'],
  });
  const producer = kafka.producer();

  // Cache the outgoing event's schema and compare it to the warehouse schema
  newTurnstyl.cacheProducerEvent(topic, message);
  newTurnstyl.compareProducerToDBSchema(topic);

  try {
    // Connect to the producer
    await producer.connect();
  } catch (error) {
    console.log('Producer Connection error: ', error);
  }

  try {
    // Send message
    await producer.send({
      topic: topic,
      messages: [{ value: JSON.stringify(message) }],
    });
    console.log('message is:', message);
  } catch (error) {
    console.log('error in message send', error);
  }

  console.log('Data sent by producer');
  // Close connection to the broker
  await producer.disconnect();
};

module.exports = { producer };
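A hypothetical invocation of that producer might then look like this; the service name, payload and topic below are purely illustrative.

// Hypothetical call to the producer defined above; values are illustrative
producer('checkout-service', { userId: 42, total: 19.99 }, 'purchase_events');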

Current Features

  • Compare data object processed by the producer to the JSON deposited in the data warehouse
  • Send alerts to the console
  • Write logs locally to keep track of data discrepancies
  • Integration with Kafka
  • Integration with Google Cloud Platform (Big Query)

Roadmap

Head to our roadmap to see our planned upcoming features.

Contributors

Jae Kim (LinkedIn)

Yolan Arnett (LinkedIn)

Dillon Schriver (LinkedIn)

Emeric David (LinkedIn)

Michael Colley (LinkedIn, Twitter)

How to show your support for Turnstyl

