Scaling Event Sourcing for Netflix Downloads, Episode 2

by Karen Casella, Phillipa Avery, Robert Reta, Joseph Breuer

In the first episode of this series of posts, we introduced the Netflix downloads project and the use cases that led us to consider a solution based on the event sourcing pattern. In this post, we provide an overview of the general event sourcing pattern and how we applied it to some of these key use cases.

A Flexible Dilemma

When we first started to design the downloads licensing service, the content licensing restrictions had yet to be defined. All we knew was that they were coming, and we needed to be able to adapt to them. So, how do you start to design and implement a service when the requirements are yet to be decided? What if that service is going live in a matter of months to millions of members on a single day with a 6am global press release? You make the system flexible to change. Easy. Right?

Anyone who’s familiar with relational databases knows that the phrases ‘flexible’ and ‘easy to change’ are not overly true with regards to the underlying table schema. There are ways to change the schema, but they are not easily accessible, require a deep knowledge of SQL, and direct interaction with the database. In addition, once the data is mutated you lose valuable context to the cause of the change and what the previous state was.

Document oriented NoSQL databases are known for providing such flexibility to change, and we quickly moved in that direction as a means to provide a flexible and scalable solution. The document model provides us with the flexibility we need for the data model, but doesn’t provide us with the traceability to determine what caused the data mutation. Given the longevity of the data, we wanted the ability to look back in time to debug a change in state.

Enter Event Sourcing

Event Sourcing is an architectural pattern that is resurfacing lately as a valuable component of a modern distributed microservices ecosystem. Martin Fowler describes the basic pattern as follows: “The fundamental idea of Event Sourcing is that of ensuring every change to the state of an application is captured in an event object, and that these event objects are themselves stored in the sequence they were applied for the same lifetime as the application state itself.”

There are many excellent overviews of the Event Sourcing pattern. Two of our favorites include:

In short, Event Sourcing is an architectural pattern that maintains a complete transaction history for a data model. Instead of persisting the data model itself, you persist the events that lead to a change in the data. These events are then played in order, building up an aggregate view of the complete data domain. The ability to replay events to any point in time is also an excellent debugging tool that enables us to easily explain why a member’s account is in a particular state and allows us to easily test system variations.

The Pattern

The following diagram provides a high-level view of how we applied the Event Sourcing pattern for the Netflix system responsible for enforcing the downloads business rules, with a generic explanation of each component following.

The Event Sourcing pattern depends upon three different service layers: commands, events & aggregates.

  • A Command represents the client request to change the state of the Aggregate. The Command is used by the Command Handler to determine how to create a list of Events needed to satisfy the Command.
  • An Event is an immutable representation of a change of state for the Aggregate, i.e., the action taken to change the state. Events are always represented in the past tense.
  • An Aggregate is the aggregated representation of the current state of the domain model. The Aggregate takes a stream of events and determines how to represent the aggregated data for the requested business logic purpose.

As shown, there are a number of actors involved in implementing the pattern.

  • The REST Service is the application layer that accepts requests from the client and passes them on to the Aggregate Service.
  • The Aggregate Service handles client requests. The Aggregate Service queries for existing Aggregates and if one does not exist, can create an empty Aggregate. The Aggregate Service then generates the Command associated with the request and passes the Command, along with the Aggregate, to the Command Handler.
  • The Command Handler takes the Aggregate and the Command and assesses, based on state transition validity checks, whether or not the Command can be applied to the Aggregate in its current state. If the state transition is valid, the Command Handler creates an Event and passes the Event and Aggregate to the Event Handler.
  • The Event Handler applies the Events to the Aggregate, resulting in a new Aggregate state, and passes the list of Events to the Repository Service.
  • The Repository Service manages state by applying the newly created Events to the Aggregate. The events are then saved to the Event Store, resulting in the new state of the Aggregate to be available in our system.
  • The Event Store is an abstracted interaction for event read/write functionality with the backing database.

Netflix Downloads Use Case

When a member selects a title to download, the license lifecycle begins:

The Netflix client application first requests a license. Once the license is acquired, the Netflix client downloads the content and the member can play their newly downloaded content. Dependent on member actions, the state of the license can change throughout the lifecycle. The member may start, pause, resume or stop viewing the content. The member may remove downloaded content. Each of these actions potentially results in a state change for the license. The license is created, potentially renewed several times, and finally released (deleted), either explicitly by the member, or implicitly based on business rules.

There is a large amount of business logic involved in this lifecycle. Maintaining the license state is the job of the event-sourcing based license accounting service, which tracks a complete transaction history for the license, member’s downloaded content and device data models. This allows events to be played back in order, building up an aggregate view of the complete data objects.

Netflix Aggregates

The Netflix client application makes several different types of requests that are translated into commands, events and aggregates. To support the business requirements for license enforcement, we have three inter-related Aggregates: License, Downloaded Titles and Device. Each of these has its own service, handlers and repository. Following is a description of each of the concepts introduced above as they apply to some key use cases for the Downloads feature.

License Acquisition Use Case

Following is a simple use case of a member obtaining a license for the first time for a particular piece of content.

On the initial license request, the client sends a request to the Acquire License Endpoint, with the member identity, along with the title of the content being requested for download, to the License Service:

The License Service determines if the requested action is allowed by querying existing Aggregate data and applying appropriate business rules. Since this is the first request for the member on this content, and assuming that device and studio business rules are satisfied, the License Service creates a new empty License Aggregate and a Create License Command to pass on to the License Command Handler:

The License Command Handler applies the Create License Command to the License Aggregate and generates a License Created Event:

The License Command Handler passes the empty License Aggregate with the License Created Event to the Event Handler, which creates a new License Aggregate:

The Event Repository then persists the License Created Event in the Event Store:

Finally, the License Repository returns the new License Aggregate to the License Service, which packages the Aggregate information into the response back to the client.

License Renewal Use Case

Prior to the expiration of the license, the device may request an extension to the existing license, known as license renewal. Renewing a license is similar to the original license acquisition flow, with the major difference being that the current License Aggregate is passed to the License Command Handler along with a Renew License Command. After the License Command Handler generates the appropriate events, The License Event Handler applies the License Renewed Event to the License Aggregate as shown below. Note that the new License Aggregate has an expiration date 30 days beyond the current date. This 30 day renewal represents the business rule for license renewals currently in force. If this were to change, we would make a simple configuration change to the Event Handler.

Download Limit Rejection Use Case

Each time a device requests a new or renewed license from the License Service, the Downloaded Service retrieves the current aggregates for the member and evaluates for business rule validation. For example, one such validation requires some content to only be downloaded twice a year. When a device makes a license request, the License Service checks if the content has been previously downloaded that year. It’s possible to derive this information by retrieving all the license aggregates for the year and filtering on the content Id. This is a lot of processing, and we decided instead to de-normalize the data and create a separate aggregate for the content download, indexed on the member and content Id. This requires new Download Events, Service, Aggregate and Repository.

When we receive subsequent events for content, we can check for any previous downloads for the content. Per the sequence diagram below, if the Download Service sees that the member has exceeded the downloads for the content it can reject the request.

The End and Beyond

For our use cases, we have found Event Sourcing to be a very useful pattern for implementing a flexible and robust system. It’s not all sunshine and roses, however, and we definitely made some mistakes and have areas to improve upon (subsequent posts will go into those details). Overall, the flexibility of the architecture has given us the means to rapidly innovate and react to changing requirements, to debug issues with changing data states over time, and to get a brand new service out and serving millions of users around the world in a relatively short time.

More, Please!

In the next post of this series, we will provide a deep dive into the implementation details, including the use of data versioning and snapshotting to provide flexibility and scale. Following that, we will share our experiences implementing Event Sourcing, some of the lessons we learned (and the mistakes we made) around testing, scalability and optimization, and introduce some thoughts on future improvements and extensions that we are planning going forward.

The team recently presented this topic at QCon New York and you can download the slides and watch the video here. Join us at Velocity New York (October 2–4, 2017) for an even more technical deep dive on the implementation and lessons learned.

The authors are members of the Netflix Playback Access team. We pride ourselves on being experts at distributed systems development and operations. And, we’re hiring Senior Software Engineers! Email kcasella@netflix.com or connect on LinkedIn if you are interested.