The Need for a Stream Registry — Intro

Rene Parra
Expedia Group Technology
Jan 17, 2019
Stream Registry enables stream discovery and governance.

Some history and context…

Calendar from 2014–2016

In previous articles, I’ve written about Domain Events, Business Events, and Command Events, and how analytics are evolving today. In this article I want to talk a bit about how we did what we did, and then about what we are doing to improve.

Back in 2014, HomeAway spent a lot of time and resources answering basic questions about the business. So, we did what most businesses did at that time: we employed time-honored ETL and OLAP techniques to extract operational data into analytical systems.

This was slow, expensive, and often required many teams to coordinate around a single question or class of questions. We repeatedly had to “re-tool” the pipeline whenever a new analytical question came up that didn’t quite fit the previous set. So, like many other businesses at the time, we found ourselves asking, “Is there a better way?”

Was there a better way?

Back in the 2014 time frame, we found the need to implement “exploratory analytics” at scale. At that time, we were still in the data center (not in the cloud), so we explored what so many others did then as well: we took a hard look at the elephant in the room, Hadoop. Quickly, the need to enable “exploratory analytics” and “democratize data” for the masses centered on one key blocker: the ability to fill the data lake with relevant data in a format that was easily indexed and available to the relevant data tools.

We looked at a variety of methods for filling the lake. I won’t mention them all here; I will call out the one we ended up selecting, primarily because of its simplicity, its scale, and, most importantly, its potential to help revolutionize the entire business if certain operational concerns were met.

We made the bet, selected Apache Kafka to fill the lake, and were pleased with the results.

Apache Kafka — A distributed streaming platform

How NOT to get broad adoption of your distributed streaming platform

Don’t do this. Do NOT press the big red button.


Here’s what we did.

One aspect we found key to broad developer adoption is ease of use. So, we built a client library to make things easier for developers. In hindsight, several of its conveniences turned out to be mistakes.

Client Library

  • Hardcoding connections to the one shared cluster

When we started off, we had ONE shared cluster for every integration. To make things “easy” for developers, we hardcoded the configuration per environment (one for each of our lab environments and one for production). Although this was great for early adoption, it caused us pain years later when we wanted to employ multi-cluster strategies to minimize blast radius.

Don’t do this.

  • We put all schemas in a single library

We did this by requiring all schemas to be specified in a single library. Although this worked in the beginning, having one library that contained every team’s domain models was not optimal. Why? Teams were blocked on that one library, and developers wanted their models in their own repositories, deployed at their own pace.

Don’t do this.

  • Further hardcoded configuration

In order to simplify configuration, we specified the partition key for everyone. BAD IDEA. Although this helped with mass adoption, in the end we had many topics partitioned by the wrong key, which caused downstream processing headaches later.

Don’t do this.

  • Made our event header optional

We had an event header that helped track lots of operational concerns, and it later proved useful for carrying trace-context fields critical for system health monitoring. However, because we made it optional, many events didn’t use it, and those events were instantly unavailable for the “data at rest” forensics we relied on to diagnose upstream issues. Even worse, at one point we regressed our auto-magic injection of header fields and left populating them “as an exercise for the developer.” Developers could then put whatever they wanted into these fields, which invited errors and incorrect data, further complicating debugging sessions and availability.

Don’t do this.
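A sketch of the lesson, with illustrative field names (not HomeAway’s actual header schema): make the header a required, auto-populated part of the envelope so no event can be published without trace context.

```python
import time
import uuid
from dataclasses import dataclass, field

@dataclass(frozen=True)
class EventHeader:
    """Hypothetical required header; fields are illustrative only."""
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    trace_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    producer: str = "unknown-service"
    emitted_at_ms: int = field(default_factory=lambda: int(time.time() * 1000))

def publish(payload: dict, header: EventHeader) -> dict:
    # The envelope always carries the header; there is no optional path,
    # so every record at rest is traceable back to its producer.
    return {"header": vars(header), "payload": payload}
```

Because defaults are injected by the library rather than typed by hand, developers can’t forget the header or fill it with junk, which is exactly the failure mode described above.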

Manual JIRA and WIKI process for minimal governance


The other thing we did was require a JIRA ticket and a wiki entry to keep track of each topic’s owner, use case, capacity, partition size, partition key, etc. This quickly grew unwieldy and was difficult to enforce for all topics.

Don’t do this. 😁

One of the things we felt we almost did “right”.

Schema Registry

The last thing we did was have the client library require a schema registry, decoupling consumers from producers and providing a generic way for downstream consumers to get the latest schema without upgrading any libraries or code. This schema-resolution requirement, plus the need for a data-at-rest format, led us to choose Avro. Downstream consumers were free to upgrade when they wanted, and upstream producers could deploy at whatever frequency they wanted. This is the number one reason to embrace microservices: to enable independently deployable business functionality.

Please DO TRY this at home.
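The core idea can be shown with a toy in-memory registry (a real deployment would use something like Confluent Schema Registry with Avro compatibility checks, which this sketch omits): producers register new schema versions under a subject, and consumers always resolve the latest version at runtime instead of compiling a schema in.

```python
class SchemaRegistry:
    """Toy registry: subjects map to an ordered list of schema versions."""

    def __init__(self) -> None:
        self._subjects: dict[str, list[str]] = {}

    def register(self, subject: str, schema: str) -> int:
        # Append a new version; returns the 1-based version number.
        versions = self._subjects.setdefault(subject, [])
        versions.append(schema)
        return len(versions)

    def latest(self, subject: str) -> tuple[int, str]:
        # Consumers call this at read time, so a producer can evolve the
        # schema without the consumer redeploying.
        versions = self._subjects[subject]
        return len(versions), versions[-1]

registry = SchemaRegistry()
registry.register("booking-value",
                  '{"type":"record","name":"Booking","fields":[]}')
registry.register("booking-value",
                  '{"type":"record","name":"Booking",'
                  '"fields":[{"name":"id","type":"string"}]}')
version, schema = registry.latest("booking-value")
```

The subject name `booking-value` and the schemas are made up for illustration; the decoupling is the point. The producer shipped version 2, and the consumer picked it up with no code change.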

This worked great, but as we learned from the lessons above, we knew we really needed more.

The next post in this series, “The Need for a Stream Registry — Part One,” describes the high-level goals of a stream registry and the experience from a developer’s point of view.
