Self-serve data unification with Kafka

Dean Cornish
Wood Mackenzie
Dec 10, 2019 · 7 min read

It’s no secret in the oil and gas industry that trying to unify and make sense of data is a difficult problem. Data comes in many standards and formats, with some of it dating back over half a century in the form of handwritten, scanned documents.

At Wood Mackenzie, we source all the data we can get our hands on and attempt to make sense of it with the intent to provide insights and analytics to our customers. One of the longstanding challenges in the business has been unifying all of our sourced data, validating it and aligning it with common entity models so that we can trust the insights we are generating.

Generally, we want to “un-bake” the data spaghetti from this:

Picture of cooked bowl of spaghetti
source: https://www.pexels.com/photo/stir-fry-noodles-in-bowl-2347311/

To this:

Picture of neatly arranged uncooked spaghetti
source: https://www.pexels.com/photo/pasta-beside-brown-wooden-laddle-on-brown-knitted-textile-209569/

The second problem we face is that domain expertise and knowledge are spread across the business, with widely varying skillsets and experience. We needed the ingestion, management and ownership of new data to be accessible to data engineers and analysts across all of the horizontal and vertical business domains we work in.

Building a modern data pipeline requires a certain level of software engineering expertise, and we did not want to force everybody in the company to become an expert software engineer. To provide a solution that scales across the company, we approached the problem with an emphasis on a few key elements:

  1. Simplify and abstract the ingestion, validation and transformation of new data through the pipeline, enabling data engineers to own the data and removing the dependency on the data platform engineers to source it.
  2. Provide a real-time data transformation platform that vastly reduces the cadence between sourcing and publishing, and that validates and enriches our raw data so that the business and our customers can truly trust what we produce.
  3. Reduce ownership overhead: a single, self-serve data transformation platform, with one team owning the platform while other teams own their own data pipelines. Alerting, monitoring and standard framework functionality come out of the box.
  4. Provide a holistic method of producing comprehensive, trusted datasets derived from raw data for the business and external customers.
  5. Provide a framework for downstream consumers to consume and process the data coming out of the platform.

Unifying ALL THE THINGS

With all of this in mind, we settled on Kafka to control the flow and unification of data through the pipeline. Kafka provides a generic and consistent way of dealing with individual, sparse, raw datasets, which suited our use case well. Some bonuses:

  1. Auditability: Having a reliable way to audit exactly what happens to the raw data as it goes through the pipeline provides a layer of trust and debuggability.
  2. Strong schema enforcement with Protobuf: we want the individual attributes for an entity to be strongly typed and consistent throughout the entire pipeline, and we want to align all the data to a common model for each entity type to enforce a consistent view of data entities across the business.
  3. Kafka Streams and Faust: provide the real-time processing we wanted, bringing the data sourcing cadence down from months to minutes. Kafka Streams forms the foundation of the pipeline, whilst Faust provides a Python-based stream interface that lets non-Java developers build basic derived-data services through cookie-cut methods (a minimal sketch follows this list).
  4. Kafka Connect: we use this framework to let downstream services and users easily sink the output from various stages of the pipeline into whatever platform or system they desire (Elasticsearch, S3, Postgres, MySQL, etc.). Connect jobs are configured with an in-house Kafka Connect CLI and enforced through our CI pipeline.
  5. Kafka has all the bells and whistles of a modern open-source data framework: it is highly scalable, fault-tolerant and well supported by the open-source community.
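
To give a flavour of the Faust side, below is a minimal sketch of a derived-data service. The topic names, broker address and message shape are illustrative assumptions rather than our actual configuration, and it uses JSON instead of our Protobuf envelope to keep the example self-contained.

```python
import faust

# Hypothetical app: derives a range_miles attribute from any range_km attribute.
app = faust.App("car-range-enricher", broker="kafka://localhost:9092")

class RecordAttribute(faust.Record, serializer="json"):
    entity_type: str   # e.g. "Car"
    entity_key: str    # e.g. "tesla/model-z"
    attribute: str
    value: str
    source: str

record_attributes = app.topic("record-attributes", value_type=RecordAttribute)
derived_attributes = app.topic("derived-attributes", value_type=RecordAttribute)

@app.agent(record_attributes)
async def derive_range_miles(stream):
    # For every Car range_km attribute, emit a derived range_miles attribute.
    async for attr in stream:
        if attr.entity_type == "Car" and attr.attribute == "range_km":
            await derived_attributes.send(
                key=attr.entity_key,
                value=RecordAttribute(
                    entity_type=attr.entity_type,
                    entity_key=attr.entity_key,
                    attribute="range_miles",
                    value=str(round(float(attr.value) * 0.621371, 1)),
                    source=attr.source,
                ),
            )

if __name__ == "__main__":
    app.main()
```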

Totally not a real example

One of your clients (GenericInvestmentsLtd) messages you with a requirement: they want to invest in car companies and need data and insights on who to invest in.

You decide to ingest data about new, up-and-coming cars and run some analytics on it to sell to the investment company. The first manufacturer you target is Tesla, as they have a fancy new “Model Z” coming out that is all the buzz in the media.

There are a few problems, though:

  1. Elon is being cheeky and only drip-feeding information about the car to the media.
  2. Different media companies are reporting the car’s specs in different formats and covering different aspects.

So we find a few different sources, each reporting the car’s specs in its own shape and format.

In the meantime, a data modeller within your company creates a common model for the entity type “Car” and tells the data engineer they want one record per (Make/Model) with just the latest facts.
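
As a purely hypothetical sketch (all of the field names and values below are made up for illustration, not our real data), the raw feeds and the target Car model might look something like this:

```python
from dataclasses import dataclass
from typing import Optional

# Two hypothetical raw payloads, as scraped from different media sources.
source_a = {                     # "CarGossip Weekly"
    "maker": "Tesla",
    "model_name": "Model Z",
    "range": "600 km",
    "zero_to_sixty": "3.1s",
}
source_b = {                     # "EV Insider"
    "manufacturer": "Tesla",
    "model": "Model Z",
    "battery_kwh": 110,
    "expected_price_usd": 55000,
}

# The common model: one record per (Make/Model), holding only the latest
# known value for each fact, regardless of which source reported it.
@dataclass
class Car:
    make: str
    model: str
    range_km: Optional[float] = None
    zero_to_sixty_s: Optional[float] = None
    battery_kwh: Optional[float] = None
    price_usd: Optional[float] = None
```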

Problems with the traditional relational paradigm

  • Ingesting this data into a relational database is troublesome because the schema would have to constantly evolve to make way for new data sources with different attributes and different formats.
  • Handling the mapping at the ingest side also becomes messy: you start embedding custom business logic in and across each scraper built for each source, and if the raw data is transformed during the ingest phase we lose its auditability (see the sketch below).
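
To make that concrete, this is the kind of per-source mapping code we want to keep out of the scrapers (the function and field names are illustrative only). Once this logic is embedded at ingest time, every new source means more bespoke code, and the raw values are transformed before they can ever be audited:

```python
# Anti-pattern: each scraper re-implements its own mapping/business logic.
def map_ev_insider(payload: dict) -> dict:
    return {
        "make": payload["manufacturer"],
        "model": payload["model"],
        "battery_kwh": float(payload["battery_kwh"]),
    }

def map_cargossip_weekly(payload: dict) -> dict:
    # Duplicated, slightly differently, for every other source, and the
    # original raw values never reach storage in their reported form.
    return {
        "make": payload["maker"],
        "model": payload["model_name"],
        "range_km": float(payload["range"].replace(" km", "")),
    }
```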

So how does this look?

Basic Architecture Overview
  1. A data engineer creates a scraper for each new data source from a cookie-cut Python template that is part of a wider, abstracted scraper framework. The framework serialises each message into a generic Protocol Buffers format, encoding the actual data into an “Any” type, and the scraper forwards the data to an ingest REST API, which is the gateway into Kafka.
  2. Ingested data goes nowhere by default until the data source has been registered with the config API. The data engineer registers the source there, providing the mappings from individual raw attributes to extracted attributes, and can also define validations to be performed on each attribute (a rough sketch of steps 1 and 2 follows this list).
  3. A Kafka Streams “record service” holds a global KTable of all configured entities from the entity-config topic. It consumes from the ingest-item topic and, using the entity-config information, distils each source record down into individual attribute messages, validates the values, applies the mappings and forwards each attribute to the “record-attributes” topic.
  4. A data modeller then defines the schema for the common model for each entity and registers it in the common model schema registry. They may decide they need only a subset of the available attributes, or all of them.
  5. A Kafka Streams app takes the common model schemas and maps them onto the data from the record-attributes topic, un-pivoting it and materialising it into a compacted topic per common model.
  6. (Not shown in the figure above: the mapping between output topic and entity type for each common model, which is also configured in the config API and consumed by the common model materializer.)
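
As a rough sketch of steps 1 and 2 (the endpoints, header names and config payload below are assumptions for illustration, not our real APIs), a scraper wraps its raw payload in a generic Protobuf envelope and hands it to the ingest API, and the data engineer then registers the source and its attribute mappings with the config API:

```python
import requests
from google.protobuf import any_pb2, struct_pb2

# Step 1 (scraper side): pack the raw payload into a Protobuf Any so the
# ingest path stays agnostic to the shape of each individual source.
raw_payload = struct_pb2.Struct()
raw_payload.update({"manufacturer": "Tesla", "model": "Model Z", "battery_kwh": 110})

envelope = any_pb2.Any()
envelope.Pack(raw_payload)

requests.post(
    "https://ingest.example.com/v1/items",        # hypothetical ingest REST API
    data=envelope.SerializeToString(),
    headers={
        "Content-Type": "application/x-protobuf",
        "X-Source-Id": "ev-insider",              # hypothetical source identifier
    },
)

# Step 2 (data engineer side): register the source with the config API so the
# record service knows how to map and validate each raw attribute. Until this
# registration exists, the ingested data goes nowhere.
requests.post(
    "https://config.example.com/v1/sources/ev-insider",   # hypothetical config API
    json={
        "entity_type": "Car",
        "mappings": {
            "manufacturer": "make",
            "model": "model",
            "battery_kwh": "battery_kwh",
        },
        "validations": {
            "battery_kwh": {"type": "float", "min": 0},
        },
    },
)
```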

Things that we learnt

1. Kafka is amazingly fast. At one point we accidentally created an infinite loop and generated billions of messages in quick succession on a small cluster, and Kafka was still fine.

2. Scaling Kafka vertically is easy but scaling horizontally requires some additional work to rebalance partitions across nodes.

3. Kafka Streams is really nice and relatively easy to learn if you have previous experience with message bus frameworks and other data technology stacks like Hadoop and Spark.

4. Setting up our own Kafka infrastructure was an interesting learning experience. We are currently exploring options for Kafka as a managed service to abstract some of that complexity away from our software engineers.

5. We really like Protobuf as a serialisation format, but there are currently no schema registry solutions for Protobuf with the same capabilities that the Confluent Schema Registry offers for Avro, so we are exploring the possible use of Avro instead.

About the Author

Dean Cornish is a Principal Software Engineer working at Wood Mackenzie for the last 8 months in the data platform department. When he’s not talking about data he’s playing cricket and enjoys basking in Edinburgh’s beautiful year-round sunshine.

Dean Cornish — Principal Software Engineer @ Wood Mackenzie
