Event-sourcing at Nordstrom: Part 1

At Nordstrom, we’ve been exploring a specific type of near-real-time data streaming called event-sourcing through a combination of open-source projects and features shipped to production over the past four years. Along the way we’ve learned a lot and would love to share some of the opportunities we see.

Photo by Bennett Williamson on Unsplash

Event-sourcing is an exciting new/old trend in modern architectures, and this five-part series will give insight into our journey.

In part one, we introduce the concept of event-sourcing, discuss its relationship with some conceptual neighbors like Internet of Things and Serverless patterns — and define some terms. In parts two through five we’ll discuss some of the key opportunities for improvement and how we’ve approached them. If you’ve had experiences in this area we’d love to hear from you in the comments. Let’s go!

TL;DR

  • Challenge: creating a true event-sourced architecture is difficult and not all of the pieces exist or interoperate well today.
  • Opportunity: challenges solved on the way to creating true event-sourcing will significantly improve the quality of your streams and stream-processing systems. Event-sourcing as an architectural goal should be considered and preferred when assessing design options.

What is event-sourcing and why should I care?

In an event-sourced architecture, the single source of truth for the entire system is an ordered event ledger (also commonly referred to as a log, a journal, a stream (when in motion), or a table (when at rest)) that records real world events — everything observed or decided. In it’s purest form, the application state for all applications and their views can be deterministically restored using only the event ledger.

Events record facts that are true at the moment of the event, represent something that occurred in the world, and are relevant to and understood by the business. For example: “On Monday at 3:16PM, SKU J012 was sold for $200 to customer 31337.”

Consuming applications read from the ledger and record or act on the data. A common use case is the production of domain-specific, read-only aggregated views of data fronted by RESTful request-response APIs like “Inventory” or “Catalog”.

Benefits we’ve seen from event-sourced architectures include:

  • Increased data quality: since you’re using the same events for analytics as you are for feature teams, the quality of your data for machine learning and analytics is greatly improved.
  • Relatable architecture: your technology design more closely matches what occurs in the real world, making your technical architecture more aligned and more understandable to your business teams (“when this, then that”).
  • Quick reaction: react in near-real-time (~2 seconds) to anything happening anywhere in your system with reduced complexity.
  • Scalability: easily scales horizontally to hundreds of thousands of events per second and the addition of hundreds of event consumers.
  • Business continuity: letting each consumer read from the ledger at their own pace removes back-pressure and creates a natural circuit-breaker opportunity for protecting downstream systems and recovering from outages.
  • Service provisioning: strong decoupling of services reduces time to market for new services or changes to old ones, including running multiple versions or multiple complete services in parallel on live production data. This lets you easily migrate to a new service and discard the old one (“databases as cattle”).
  • Tech agnostic: events are not dependent on any technical details of their source systems and are understandable to business people, increasing autonomy by providing full system knowledge to any team (“events don’t know they came from a mainframe”).
  • Historical record: creates the opportunity to audit or replay history, allowing the recreation of historical state for debugging or disaster recovery.
  • Vendor agnostic: reduce the specter of vendor lock-in by duplicating event ledgers in another service provider’s cloud or onto/off of an on-premesis location.
  • Setting a high bar: every additional consumer drives quality into the stream by raising the requirements for producer data quality and freshness which creates a virtuous cycle that everyone benefits from.

Can you give me an example of an event-sourced system?

Sure! Consider a game of chess. A traditional database is the equivalent of a photograph of the game in progress — from that moment-in-time snapshot you can see where all of the pieces are, who is playing, and you could calculate a probability of who is going to win the game.

Photo by JESHOOTS.COM on Unsplash

Contrast this with an event-sourced chess system where the focus is on recording all individual moves onto a ledger (in chess, known as a move-recording notations). For each move we see an entry in the ledger that contains the table number, the game number, the player’s name moving the piece, the date and time of the move, and the result of that move (for example, if any pieces were captured).

Given access to the ledger of all moves, you could exactly restore the game state at any point in the game (the gold standard of an event-sourced architecture). You could also easily add functionality to your system like a chess clock, a tournament manager, cheat detection, player analytics, a chess coach, a system to detect the most interesting games for commentary or highlighting, and more.

Additionally, the burden of the move detection system (perhaps an IoT chessboard) becomes solely to detect moves and log them. Once the event is accepted onto the ledger, the event producer’s job is complete.

Event-sourcing and serverless architectures

Serverless functions are fundamentally event-driven. There are two basic ways to trigger a serverless function: directly with an invocation, or indirectly based on something that occurred in the world. This second method can be used to chain together powerful sequences of events every time something occurs — you might think of a (useful) Rube Goldberg machine.

Once you’re comfortable with an event-triggered ecosystem it’s natural to want a way of tracking, maintaining and processing those events in a reliable and robust fashion. Event ledgers (with serverless processing) give you this capability. If you use serverless systems in a robust way I predict it’s likely you’ll end up thinking deeply about event-sourced architectures very soon (if you aren’t already).

Event-sourcing and IoT (Internet of Things)

In an IoT system, sensors (like a temperature or light sensor) are event producers and actuators (like a light or a lock) are event consumers. Additionally, these sensors and actuators often rely on unreliable network connections. Ledgers are a powerful tool for handling occasionally connected (or offline first) scenarios and ensuring that systems stay aligned. Many IoT ecosystems use ledgers behind the scenes to handle offline scenarios. CRDTs (convergent data types) are another fascinating direction for ensuring convergence of data across occasionally-connected devices and are worth an entire blog post themselves.

Let’s define a few terms

We’ll be using a lot of technical terminology in this series, some of which has multiple applications. Let’s define a few terms:

Distributed ledger/unified log/journal/stream

A place where specific events (ex: At this date and time, Rob edited the document in this way.) are written down in the order they are received. The ledger is append-only (write the new event at the end) and immutable (written in “pen” and never changed). It is often distributed in that it is replicated for durability and partitioned for scalability.

Partitioning of the ledger

To divide the work of processing a stream of events, the ledger is broken into separate partitions (sometimes called shards) using a specific attribute associated with each event called the partition key.

Partition Key

A derived or assigned field within an event identifying the subject of the event for which ordering is most important. (ex: if customer name/ID is used as partition key, then all actions by the same customer will be processed in a consistent order by all event consumers)

Shards/partitions

A simple real-world partitioning example might be multiple physical tables at an old-fashioned in-person college course registration day event split out by first letter of last name (ex: A-G, H-N, O-U, V-Z). Each table (and the line leading to it) is the equivalent of a partition or shard. A partition is selected as the location to send an event by using a function on the key value of the event.

Ordering guarantees within a partition key

Within each partition key we can guarantee that all events are in order (Rob did A, then B, then C). But we can’t guarantee ordering across partition keys (Did Grace do D before Rob did C?). The ordering within a key is critical because it allows us to have multiple different readers all arrive at the same conclusion at any point in their reading of the ledger. This provides distributed consensus between consumers without requiring them to talk to one another.

Event producers and consumers

The senders and receivers of events to and from the stream, respectively. Alternatively in an Internet of Things environment the sensors (temperature, light detectors, buttons, etc) and actuators (motors, displays, lights, etc) of the system. A further alternative naming/way of thinking is the observers and interpreters of reality.

Offset/ledger position for an event consumer

The position(s) to which a consumer has read in one or more shards of the stream. It represents how far the consumer has successfully processed the stream and where it should resume processing.

Materialized view/read-only aggregate/write-through cache

When a consumer reads from the event stream and creates a stateful object (like a NoSQL database table), it is creating a “materialized view” of the event stream. An example of that would be the picture of our chessboard we discussed above. All of the moves are updated on the board from the ledger, and the picture of the board becomes one materialized view of the system. A separate materialized view might be an integer array representing the number of seconds remaining on the chess clock for each player.

Logical clock

A logical clock refers to the position of the reader in the ledger in terms of the offset of their read position from the beginning. This serves as a sense of relative, causal time. “By ledger position 2718 we had collected $1,000 for charity.” This differs from wall-clock time which uses an arbitrary notion of current time. “By 12:15PM according to my watch we had collected $1,000 for charity.” Logical clock time is deterministic on a ledger, wall-clock time can be problematic for many reasons. (time zones, drift, leap seconds, clock adjustments causing skips in time)

Vector clock

A vector clock refers to the position of a reader across many ledgers (these could be multiple partitions/shards of a single topic). One example is in Slack - the collection of message offsets to which you have read in all of the channels to which you are subscribed forms a vector clock of the content you have consumed. In this example, the vector clock is used (among other things) to determine whether you have unread messages.

Eventual consistency

A system is eventually consistent if, when you stop making changes, all participants will come to the same answer. For example, when I charge something on my credit card, the bank balance is not immediately updated but I am sure that it will be eventually. This is as opposed to strong consistency where every participant will always have the same answer.

Next up? Event Producers.

In the next part of this series, we’ll take a deeper look at event producers and some of the common challenges we’ve observed. Event producers are the observers and deciders in your system that are responsible for recording event information onto the ledger.

To learn more about what we’re doing with event-sourced architectures at Nordstrom check out this article by A Cloud Guru on event-sourcing at Nordstrom, this one on our award-winning open sourced serverless event-source project “Hello, Retail!”, and Rob Gruhl’s talk at the Emit conference.

To learn more about event-sourced architecture in general, I’d recommend Martin Fowler’s talk, Martin Kleppmann’s short ebook and full book, Greg Young’s unfinished ebook, and Jay Krepp’s blogs.

…and if at this point you’re thinking event-sourcing is a magical pattern that will solve all of life’s problems, take a look at Chris Kiehl’s article Event Sourcing Is Hard.