Kafka at Yelp

Here is a story from Yelp about their data pipeline. They adopted Kafka for database change capture across their internal services. With over 150 services, leveraging data across all of them was a challenge. They had tried approaches like data sharing over REST APIs and exposing SQL directly, but those did not go well. Eventually they migrated to Kafka, building on the idea of stream-table duality, which is well covered by Jay Kreps in The Log: What every software engineer should know about real-time data’s unifying abstraction.

I found an interesting story here about their schema registry. Many companies today are wrestling with how to manage data and access control across internal services. If you are facing the same problem, this story may help you.

Schematizer — a schema registry

As described above, they have a bunch of services producing data, which means a wide variety of data in a wide variety of formats. Their data also has tree-like dependencies, so downstream applications break whenever an upstream schema changes without coordination.

So they developed a schema registry called “Schematizer”.

This makes it possible to transport data without bundling schemas. Instead, every Avro-encoded payload is packed in an envelope with some metadata: a message UUID, encryption details, a timestamp, and the identifier of the schema the payload was encoded with. Applications can then dynamically retrieve schemas to decode data at runtime.
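
To make this concrete, here is a minimal sketch of the envelope idea in Python. I am assuming fastavro for the Avro encoding, and the envelope field names are illustrative, not Yelp’s actual wire format.

```python
# A minimal sketch of the envelope idea, assuming fastavro for Avro
# encoding. Field names are illustrative, not Yelp's actual format.
import io
import time
import uuid

import fastavro

# Example value schema; in Yelp's pipeline this would come from the Schematizer.
BUSINESS_SCHEMA = fastavro.parse_schema({
    "type": "record",
    "name": "Business",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
    ],
})

def pack_envelope(record: dict, schema_id: int) -> dict:
    """Avro-encode the payload and wrap it with routing metadata."""
    buf = io.BytesIO()
    fastavro.schemaless_writer(buf, BUSINESS_SCHEMA, record)
    return {
        "uuid": uuid.uuid4().bytes,     # unique message id
        "schema_id": schema_id,         # which registered schema encoded the payload
        "timestamp": int(time.time()),  # producer-side timestamp
        "payload": buf.getvalue(),      # schemaless Avro bytes, no schema attached
    }

envelope = pack_envelope({"id": 1, "name": "Cafe Juanita"}, schema_id=42)
```

The key point is that the payload itself carries no schema, only a schema identifier, which keeps messages small and lets consumers resolve the schema lazily.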

The Schematizer service is responsible for registering and validating schemas, and assigning Kafka topics to those schemas.
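
Something like the following is how I imagine registering a schema would look. The endpoint path and response fields here are hypothetical stand-ins, not the Schematizer’s actual REST API.

```python
# A hedged sketch of schema registration; the endpoint and response
# fields are hypothetical stand-ins for the Schematizer's REST API.
import json

import requests

SCHEMATIZER_URL = "http://schematizer.local:8888"  # hypothetical host

avro_schema = {
    "type": "record",
    "name": "Business",
    "namespace": "yelp.example",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
    ],
}

# Register the schema: the service validates it, assigns an id,
# and maps it to a Kafka topic so the producer never picks one itself.
resp = requests.post(
    f"{SCHEMATIZER_URL}/schemas",
    json={"schema": json.dumps(avro_schema), "source": "business_db"},
)
resp.raise_for_status()
info = resp.json()
print(info["schema_id"], info["topic"])  # hypothetical response fields
```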

Users are meant to think in terms of schemas, not topics; the topic is abstracted away. Registered schemas are kept immutable and versioned, so even if a writer/producer updates its schema without notice, a reader/consumer can still read the data by fetching the newer schema from the Schematizer.
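
The consumer side might look like this sketch, under the same assumptions as above: the schema identifier travels in the envelope, and get_schema() stands in for a cached call to the Schematizer.

```python
# A sketch of the consumer side: the schema id travels in the envelope,
# and get_schema() stands in for a (cached) call to the Schematizer.
import io
import json
from functools import lru_cache

import fastavro
import requests

SCHEMATIZER_URL = "http://schematizer.local:8888"  # hypothetical host

@lru_cache(maxsize=1024)
def get_schema(schema_id: int):
    """Fetch and parse a registered schema once; ids are immutable,
    so caching by id is always safe."""
    resp = requests.get(f"{SCHEMATIZER_URL}/schemas/{schema_id}")
    resp.raise_for_status()
    return fastavro.parse_schema(json.loads(resp.json()["schema"]))

def unpack_envelope(envelope: dict) -> dict:
    """Decode the payload with whatever schema version it was written with."""
    schema = get_schema(envelope["schema_id"])
    return fastavro.schemaless_reader(io.BytesIO(envelope["payload"]), schema)
```

Because schemas are immutable, a consumer that sees an unknown schema id simply fetches it once and keeps going, instead of breaking on the producer’s change.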

Producers and consumers are also required to register themselves with the registry whenever they produce or consume data. With this requirement in place, Yelp can easily track who would be affected by a breaking schema change, and which topics are actively used and which are not.
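
A registration call could be as simple as this hypothetical sketch; the endpoint and fields are my own invention, but the point is that the registry learns who reads what.

```python
# Hypothetical sketch of a consumer registering itself, so its team
# shows up in impact analysis when a breaking schema change is proposed.
import requests

SCHEMATIZER_URL = "http://schematizer.local:8888"  # hypothetical host

requests.post(
    f"{SCHEMATIZER_URL}/consumers",
    json={
        "team": "search-ranking",        # who to notify about schema changes
        "email": "search@example.com",
        "schema_id": 42,                 # what this job actually reads
        "job_name": "ranking-batch",
    },
).raise_for_status()
```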

Producers are also required to explicitly describe the data they produce, and the Schematizer manages that documentation. Consumers can therefore look up what kinds of data exist and whether they are available.
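
Hypothetically, discovery could then be a single catalog call; once more, the endpoint and response shape are assumptions of mine, not the real API.

```python
# Hypothetical discovery call: list documented data sources so a
# consumer can see what is available before subscribing to anything.
import requests

SCHEMATIZER_URL = "http://schematizer.local:8888"  # hypothetical host

for source in requests.get(f"{SCHEMATIZER_URL}/sources").json():
    print(source["name"], "-", source.get("doc", "no description"))
```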