The Need for a Stream Registry — Part One

Stream Registry enables stream discovery and governance.

The previous blog post gave a short introduction and some context into why HomeAway needed a Stream Registry.

High Level Goals

The following is a list of goals for a hypothetical Stream Registry:

  • Provide a mechanism for centralized registration, stream discovery and governance (replacing the need for a wiki and Jira process) with company-specific tags for governance reasons and easy implementation of show-back policies, capacity planning and migration strategies.
  • Centralize specification of producer/consumer configuration by cluster so that producers need only bind to a “stream-name” and then be served their configuration from a central location. Such configuration could specify quotas for streams and default to applicable cluster-wide policies.
  • Decouple producers from consumers. By wrapping/leveraging the existing Schema Registry, this requirement is easily met.
  • Require a company-specific event header. This header should contain pure operational tags such as trace context. If the event participated in an httpRequest, then a corresponding requestContext should be provided. In the future, stream providers may make this purely operational event header disappear as metadata event processing matures.
  • Provide a discovery mechanism of the current state of producers and consumers for every stream.
  • Provide a place to discover which cluster to connect to for any given region, cloud or company. In other words, it should easily facilitate multi-cluster replication, multi-region replication, multi-cloud replication and/or multi-company replication.
  • Act as the “control plane” for all external stream sources and sinks that can be cluster-, region-, cloud- and company-specific.

This is a long list and we are pretty sure we missed a few. Please provide feedback!

The Big Idea

I believe one major component of the success of future companies will be built around their ability to harness value from their immutable streams of data characterized by their schema and other metadata.

Put simply, a stream is represented by its schema and its name — a simple string.

A schema and its stream name

Note: Supporting multiple schemas for the same stream will be supported. This is a desirable feature for clickstream, command-events and event-sourcing.

Part One :: The Developer Experience

So, as a developer, I need to produce “widgets” along with a Widget schema.

A producer begins by first defining a widget. The producer developer defines the schema that will represent the widget. A model library is then generated in the target language of choice, optionally generating a SerDe library (short for serializer and deserializer) that transforms widgets to/from their wire representation while handling version resolution. The widget schema is then bound to the stream name “widgets”, creating the underlying stream if necessary. The developer then uses a client library to register with the stream registry and receive all necessary connection bundles to connect to the target cluster. Finally the developer leverages the model library to connect to the target cluster and produce events to the stream as normal.

Whew! That was a mouthful. Lets go through that again, slowly.

Lets first introduce all the moving parts — all the stream components.

Stream components that coördinate with the stream registry

Stream Modeling Pipeline

The stream modeling pipeline performs the following functions:

  • It enforces compatibility requirements for the given schema. (backward, forward, both)
  • It generates the model library for a given version of the schema.
  • It registers the schema with the stream registry (which registers with underlying schema-registry).
  • It creates the stream if necessary.

The stream modeling pipeline generates a model library from a schema definition language.

Stream Modeling Pipeline specifies field validation and provides an event header

A function the stream modeling pipeline performs is to enforce that a ubiquitous event header is specified on every schema in the shared streaming platform. The event header has purely operational fields related to the event that are necessary to track on every event in the system.

For the most part, the developer can ignore the event header as this is added to the schema automagically. All she needs to do is define her schema and select the target language of her choice.

Finally, the last function the stream modeling pipeline serves is to register the stream, along with its schema, field validations and event header, to the stream-name string in the target environment. This has the side-effect of provisioning the underlying realization of the stream on whatever stream platform is supported by the stream registry. It also can be opinionated so that the producer may be on one cluster and consumers can be on another.

Stream modeling pipeline registers the stream

Client Library

The client library is responsible for producing and consuming from streams. Its functions are depicted below.

The client library adds an event header and is responsible for producing and consuming streams.

The client library is responsible for:

  • Adding the required event header (with relevant fields populated).
  • Requesting producer or consumer configuration from the stream registry to enable multi-cluster use cases.

The developer is now free to log as she pleases with full confidence that the configuration of the stream is located in a central location, and can easily be changed without having to make code changes at runtime.

Part One :: COMPLETE

Let’s repeat verbatim what was said at the beginning of “Level One — Developer Experience”.

A producer begins by first defining a widget. The producer developer defines the schema that will represent the widget. A model library is then generated in the target language of choice, optionally generating a serde library (short for serializer and deserializer) that transforms widgets to/from their wire representation while handling version resolution. The widget schema is then bound to the stream name “widgets”, creating the underlying stream if necessary. The developer then uses a client library to register with the stream registry and receive all necessary connection bundles to connect to the target cluster. Finally the developer leverages the model library to connect to the target cluster and produce events to the stream as normal.

TL;DR Part One

A developer who wishes to use streams need only:

  • Define a schema (with field validations)
  • Bind it to a stream name
  • Use the resulting model library in conjunction with the client library to save events to streams

The stream registry enables centralization of governance and the ability to bind producers and consumers to specific clusters enabling a multi-cluster strategy.

Stay tuned for The Need for a Stream Registry — Part Two. Whereas this blog post was more of the developer experience, part two will discuss how engineering teams can orchestrate all these streams centrally while enabling a distributed experience throughout the ecosystem.