Stream Inversion & Topology
Mapping event models to the physical world
In an event-first system all inter-service communication is performed asynchronously via event streams. Upstream services emit events as their state changes and downstream services react.
On a recent project of mine there were over 150 event types. (find function stats here) There were about 30 autonomous services. The ingress external-service-gateway (ESG) services each produced from 1 to 12 event types. Most backend-for-frontend (BFF) services produced a small handful of event types, while the most complicated produced over two dozen. Each control service consumed dozens of event types but produced only a handful of higher-order event types, which it consumed as well. The BFF services consumed a dozen event types on average. And the egress ESG services each consumed from 1 to 12 event types.
The question is, how many event streams are needed to support these event types? Logically we can think of each event type as its own pub/sub topic, but our physical model needs to be manageable and cost-effective. The technology you use for streaming will impact the physical model, but I recommend applying the ideas covered in this post regardless of the tool. (I discuss why I prefer streams over other messaging alternatives here.) Given my preference for serverless-first, I will use AWS Kinesis as an example of a physical model.
The physical model of the AWS Kinesis streaming service consists of streams and shards. There is no concept of topics. If we were to treat a stream as a topic then on my recent project we would have had over 150 streams, which would have been cost prohibitive. Plus, there is no direct concept of merging streams, so we would not be able to easily apply logic across multiple event types. And the concept of shards is not relevant here as they are just used to adjust the capacity of a specific stream.
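For a sense of the cost, here is a back-of-the-envelope comparison, assuming one shard per stream and the us-east-1 rate of roughly $0.015 per shard-hour at the time of writing (verify current pricing):

const shardHourRate = 0.015; // USD per shard-hour, us-east-1 (assumed)
const hoursPerMonth = 24 * 30;
// one stream per event type, before a single byte of traffic
const streamPerEventType = 150 * shardHourRate * hoursPerMonth; // ~$1,620/month
// versus a handful of multiplexed streams
const handfulOfStreams = 4 * shardHourRate * hoursPerMonth; // ~$43/month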
Stream Inversion
I use a technique I call stream inversion, whereby we invert the logical model and multiplex many event types through one stream. Each event is wrapped in an envelope with id, type and timestamp fields (at a minimum) and a body that is specific to the event type, such as the thing-submitted example below. (I will dive further into the shape of events in a separate post.)
{
  id: 'ab3ba422-0a2f-11ea-8d71-362b9e155667',
  type: 'thing-submitted',
  timestamp: 1574100988321,
  partitionKey: 'edac237c-2523-41a2-84ee-2b1a3baeccce',
  thing: {
    id: 'edac237c-2523-41a2-84ee-2b1a3baeccce',
    name: 'Thing One',
    description: 'This is thing one.'
  }
}
Consumers can then filter on the event type field. Kinesis does not support filtering, so we perform the filtering in our listener functions. (I discuss listener functions briefly here and I will cover them further in another post.) The following is an example of using Functional Reactive Programming (FRP) to process a micro-batch of records from a stream and act on the events of interest. (find a more complete example here) FRP is a natural match for implementing stream processors. The filter step weeds out unwanted event types. This example just performs a simple action, but we can do all kinds of interesting and complex processing. For example, we could optimize performance by grouping events on partition key across event types to increase throughput, as sketched after the example below.
import _ from 'highland'; // the stream library in use (assumed to be Highland.js)

export const handler = async (e) => {
  return _(e.Records)        // wrap the micro-batch in a stream
    .map(recordToEvent)      // decode each Kinesis record into an event
    .filter(onEventType)     // weed out unwanted event types
    .flatMap(doSomething)    // act on the events of interest
    .collect().toPromise();  // resolve once the batch is fully processed
};

const onEventType = event => event.type.match(/thing-*/);
...
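As mentioned above, we can group events on partition key across event types to increase throughput. Here is a minimal sketch of that idea, reusing recordToEvent and onEventType from the example; processGroup is a hypothetical helper that processes all of a key's events together and returns a promise.

export const handler = async (e) => {
  return _(e.Records)
    .map(recordToEvent)
    .filter(onEventType)
    .group(event => event.partitionKey) // emits one object: { [partitionKey]: [events] }
    .flatMap(groups => _(Object.entries(groups)))
    .flatMap(([key, events]) => _(processGroup(key, events))) // hypothetical helper
    .collect().toPromise();
};

Processing each partition key once, instead of each event once, reduces the number of downstream calls as the batch size grows.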
I want to include a quick note here about the importance that batch size plays in streaming. A typical assumption is that filtering events in the consumer function results in more function invocations. If the batch size equals 1, then this is definitely true. But as the batch size increases, and as the volume and mix of event types increases, it can actually be a wash. I use the average batch size as a divisor when comparing the estimated invocations resulting from different stream topologies.
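To make the comparison concrete, here is a back-of-the-envelope calculation with purely hypothetical numbers:

// one shared stream: this consumer filters out half of the events it reads
const avgBatchSize = 100;
const sharedVolume = 1000000; // events per day through the shared stream
const sharedInvocations = sharedVolume / avgBatchSize; // 10,000 per day

// a dedicated stream carrying only the 500,000 events of interest; at half
// the arrival rate the batches only fill about halfway on average
const dedicatedVolume = 500000;
const dedicatedInvocations = dedicatedVolume / (avgBatchSize / 2); // 10,000 as well, a wash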
Architecture is full of trade-offs and I have found the simplicity of consumer-side filtering to be well worth it. This should become clearer below.
Stream Topology
Stream inversion gives us a tool for controlling the number of streams. Now we need to decide how to combine event types into a stream topology.
We need to define some boundaries/bulkheads. The 150 event types I mention above were for one system, out of about a half dozen related systems in that enterprise. Each system owned its own streams and lived in its own AWS account. They were autonomous. They had proper bulkheads between them. The ingress and egress ESG services bridged the necessary events between the related systems and created an anti-corruption layer. (AWS EventBridge might be a natural fit here.)
In an event-first system we treat all events as first-class citizens. This means that all events are stored in a data lake in perpetuity. A common set of services (per system) needs access to all the streams in the system’s topology to persist and index the events, collect metrics and monitor for fault events. To minimize the churn on these services, it is preferable to have fewer streams. I recommend striving for just a handful of streams.
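As a sketch of what one of these common services might look like, here is a data lake listener that persists every event without filtering. The bucket name, key scheme, and AWS SDK v2 usage are illustrative assumptions.

import AWS from 'aws-sdk';
import _ from 'highland';

const s3 = new AWS.S3();

const recordToEvent = (record) =>
  JSON.parse(Buffer.from(record.kinesis.data, 'base64').toString());

export const handler = async (e) => {
  return _(e.Records)
    .map(recordToEvent)
    // no filter step: the lake persists every event in perpetuity
    .flatMap(event => _(s3.putObject({
      Bucket: process.env.BUCKET_NAME, // hypothetical
      Key: `${event.type}/${event.id}`, // hypothetical key scheme
      Body: JSON.stringify(event),
    }).promise()))
    .collect().toPromise();
};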
Each autonomous service in your system should emit all its events to the same stream. In theory all the events in a specific service are related, thus they are all likely of interest as a group to services downstream. Don’t make downstream services work hard to get access to all your events.
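To make this concrete, here is a minimal sketch of a service wrapping a domain object in the envelope from above and emitting it to its single stream. The STREAM_NAME variable, the uuid library, and the AWS SDK v2 usage are assumptions for illustration.

import AWS from 'aws-sdk';
import { v4 as uuid } from 'uuid';

const kinesis = new AWS.Kinesis();

export const publish = (type, thing) => kinesis.putRecord({
  StreamName: process.env.STREAM_NAME, // the service's single stream (assumed)
  PartitionKey: thing.id, // related events land in the same shard, in order
  Data: Buffer.from(JSON.stringify({
    id: uuid(),
    type,
    timestamp: Date.now(),
    partitionKey: thing.id,
    thing,
  })),
}).promise();

A call such as publish('thing-submitted', thing) would produce the thing-submitted event shown earlier.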
Things get much trickier downstream. To optimize the stream processing of a specific service it is best to consume from a single stream. However, a service has no control over which services emit events to that stream. Plus, different downstream services are often interested in different yet intersecting cross-sections of events. In all likelihood, it won't be possible to get a perfect match of producers to consumers. This is one of those trade-offs I mentioned above and one reason why I prefer streams paired with stream inversion. Tuning the batch size gives us the latitude to solve an intractable problem with a simple solution.
Here are some additional dimensions to consider when defining the streams in your topology:
- Event volume: consider separate streams for low-volume and high-volume event types, but weigh the saved function invocation cost against the cost of the low-volume stream
- Event priority: consider a separate stream for high-priority event types to minimize latency caused by consumer read throttling
- Event size: consider a separate stream for large event types, such as fault events, so they do not consume bandwidth needed by high-volume or high-priority event types
- External events: consider an ingress stream for receiving events from sister systems and an egress stream for sending events to them
The reality is that achieving an optimal stream topology is a balancing act. There are trade-offs and the right answer may not be obvious. I recommend starting simple with a single stream. Then leverage the observability of the system and tune the topology once the ebb and flow of the events becomes clear. Thanks to the stream inversion technique, this refactoring is just a matter of re-configuration.
For more thoughts on serverless and cloud-native, check out the other posts in this series and my books: Software Architecture Patterns for Serverless Systems, Cloud Native Development Patterns and Best Practices, and JavaScript Cloud Native Development Cookbook.