Differences between Projections, Sagas and Aggregates in times of failure

I find the definitions of Aggregates, Projections and Sagas to be insufficient in most sources. Many approach these concepts from the point of view of normal system operation. However, a very interesting thing happens when one attempts to implement any system in real life: the whole thing can fail and abruptly shut down. What happens to Aggregates, Projections and Sagas during such a failure? What happens during a system startup or recovery? Is it any different from what goes on during normal system operation?

This post aims to provide a clarification on some concepts that I glossed over in my previous post on Long Running Processes.

In this post I will attempt to answer these questions. We will draw a distinction between Aggregates, Projections and Sagas by considering how each of them behaves during the system recovery and the normal operation phases.

Recovery vs. Normal system operation phases

We define the system recovery phase as the behaviour of the system that precedes its regular operation phase. The distinction between the two phases is purely subjective and can be drawn arbitrarily. What matters for our explanation is that such a distinction exists; how one separates the two is up to the system’s designer.

Recovery vs. Startup

Before we proceed, let us first agree that system recovery and system startup are one and the same, and that implementing them as two different routines is pointless.

Consider a system that has one specialized routine for startup and another for recovery. Now pull the power plug and start the system again. Which routine executes? Clearly, it must be the recovery routine. Now consider under which circumstances the regular startup routine can ever run. Both of these conditions must hold:

  1. The system did not fail
  2. The system was shut down properly

Considering that most systems do fail and that downtime is considered bad, it is rather unlikely that both conditions will ever be fulfilled. Even if they were, a proper shutdown would be so rare that the startup routine would seldom be exercised, and the likelihood of it working improperly would be high (unlike the recovery procedure, which is tested every time the system fails). Coupled with the fact that most systems can crash and recover faster than they can shut down properly and start up [1], it is therefore much easier, more efficient and less error-prone to design the system to always run the recovery procedure at startup, assuming that a failure occurred. In fact, a proper system shutdown may also be considered a failure in some sense, and can be treated by the same recovery logic that treats a crash.

Because of the reasoning above, from now on we will refer to system startup as system recovery.
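The argument for a single startup path can be sketched roughly as follows. This is only an illustration under stated assumptions: the `System` class, its `durable_log` argument and the string events are hypothetical names invented here, and the list stands in for whatever durable storage a real system would use.

```python
from typing import Iterable, List


class System:
    """Hypothetical system whose state is rebuilt from a durable event log."""

    def __init__(self, durable_log: Iterable[str]) -> None:
        self._durable_log = list(durable_log)  # stands in for durable storage
        self.state: List[str] = []
        self.running = False

    def recover(self) -> None:
        # Rebuild state from scratch; nothing left in memory is trusted.
        self.state = []
        for event in self._durable_log:
            self.state.append(event)

    def start(self) -> None:
        # There is deliberately no separate "clean start" branch:
        # every start runs the recovery procedure.
        self.recover()
        self.running = True

    def stop(self) -> None:
        # A proper shutdown is treated like any other stop.
        self.running = False


system = System(durable_log=["opened", "deposited"])
system.start()  # first start
system.stop()   # proper shutdown
system.start()  # takes exactly the same path as a start after a crash
```

Because `start` has only one path, that path is exercised on every start, whether or not the previous stop was a crash.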

Differences of behaviour

Events vs. State? In this article I assume that a system’s state is some sequence of events. This is true of any state, since any machine can be represented by a finite-state automaton. Any system that does not preserve events but amalgamates them into some single structure is simply losing information. We will not consider such a system here.

Let us now look at how Aggregates, Projections and Sagas behave during both the system recovery phase and the normal operation phase.

A system’s recovery routine needs to restore the system’s state. It does this by replaying events from the system’s past. This procedure may be expressed differently in various systems, but fundamentally this is what happens. One can say that a system recalls what happened to it previously. We refer to these events as past events.
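Replaying past events to restore state can be pictured as a fold over the event sequence. The sketch below is a minimal illustration only; the `Deposited` and `Withdrawn` events and the integer balance are assumptions made up for the example, not something the discussion above prescribes.

```python
from functools import reduce
from typing import Iterable, NamedTuple, Union


class Deposited(NamedTuple):  # hypothetical event
    amount: int


class Withdrawn(NamedTuple):  # hypothetical event
    amount: int


Event = Union[Deposited, Withdrawn]


def apply(balance: int, event: Event) -> int:
    """Return the state that results from applying one event."""
    if isinstance(event, Deposited):
        return balance + event.amount
    if isinstance(event, Withdrawn):
        return balance - event.amount
    return balance  # unknown events leave the state untouched


def replay(past_events: Iterable[Event]) -> int:
    """Restore state by folding all past events over the initial state."""
    return reduce(apply, past_events, 0)


past_events = [Deposited(100), Withdrawn(30), Deposited(5)]
balance = replay(past_events)
```

The state itself is never stored here; it is always derived from the events, which is the assumption this article makes.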

A system’s normal operation phase deals with events produced by the system itself or arriving from outside. We call these future events, even though some of them may be happening at the present time.

Aggregates, Projections and Sagas all react differently to past and future events.

Aggregates

Aggregates are intimately interested in the history of the system, since they uphold the system invariants. The Aggregates need to process commands and for that they need to know the full history of what occurred before.

However, Aggregates are not interested in any of the events in the future. Why? Because they themselves are responsible for producing them! If an Aggregate emits an event that is then processed by the system, that event must under no circumstances be applied to the same Aggregate again, lest a duplicate event be introduced into the system.

So, aggregates need to process past events but must not process future events.
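A minimal sketch of this behaviour, assuming the Aggregate updates its own state at the moment it emits an event. The `Account` class, its events and the "balance never goes negative" invariant are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Union


@dataclass(frozen=True)
class Deposited:  # hypothetical event
    amount: int


@dataclass(frozen=True)
class Withdrawn:  # hypothetical event
    amount: int


Event = Union[Deposited, Withdrawn]


class Account:
    """Hypothetical Aggregate guarding 'balance never goes negative'."""

    def __init__(self) -> None:
        self.balance = 0

    def replay(self, past_events: Iterable[Event]) -> None:
        # Recovery phase: rebuild state from the full history.
        for event in past_events:
            self._apply(event)

    def withdraw(self, amount: int) -> Withdrawn:
        # Normal operation: check the invariant, update own state, emit.
        if amount > self.balance:
            raise ValueError("insufficient funds")
        event = Withdrawn(amount)
        self._apply(event)  # applied exactly once, here
        return event        # published to others, never routed back to us

    def _apply(self, event: Event) -> None:
        if isinstance(event, Deposited):
            self.balance += event.amount
        elif isinstance(event, Withdrawn):
            self.balance -= event.amount


account = Account()
account.replay([Deposited(100)])  # past events
emitted = account.withdraw(30)    # produces a future event
```

If the emitted `Withdrawn` event were delivered back to `account`, the balance would be reduced twice, which is the duplicate the text warns about.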

Projections

Projections maintain an accumulated representation of some part of the system. They do this by observing what happens and saving the information that is relevant to them. Thus projections are clearly interested in the past history of the system. When a recovery mechanism provides past events to a projection, the projection is able to restore its state from them.

As new events are produced in the system, perhaps by the aggregates, projections must integrate that information as well. Therefore projections care about these events just as much as they do about the previous ones.

So, projections must process both past and future events.
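As a rough sketch, a Projection can use one and the same handler for both kinds of events. The `BalanceProjection` class and the `Deposited` event below are hypothetical names used only for this example.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Deposited:  # hypothetical event
    account: str
    amount: int


class BalanceProjection:
    """Hypothetical read model: current balance per account."""

    def __init__(self) -> None:
        self.balances: Dict[str, int] = {}

    def handle(self, event: Deposited) -> None:
        # The same code path serves recovery and normal operation.
        current = self.balances.get(event.account, 0)
        self.balances[event.account] = current + event.amount


projection = BalanceProjection()

# Recovery phase: past events are replayed into the projection.
for past_event in [Deposited("acc-1", 100), Deposited("acc-2", 20)]:
    projection.handle(past_event)

# Normal operation: a future event arrives and is integrated the same way.
projection.handle(Deposited("acc-1", 5))
```

The projection does not need to know which phase the system is in; it only needs every event delivered to it once.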

Sagas

Sagas are process managers (indeed, many prefer the term process manager; I mainly use the word Saga as it is shorter). Sagas don’t care as much about the state of the system as they do about reacting to changes in that state. Where a Saga gets the current state of the system from is irrelevant. A common approach is for a Saga to read some state from a Projection, and this is the approach we consider here.

After the Saga has analyzed the state, it may take an action, such as issuing a command. From this we may conclude that the Saga does not directly care about past events, but rather delegates the preservation of that state to specialized components, i.e. Projections.

Now, once the Saga is operational, it must absolutely react to new things happening in the system. Instead of querying the projections repeatedly, a much better and more straightforward option is for the Saga to receive events directly. Thus the Saga must be informed of new events.

In this way we can see that Sagas do not process past events but must process future events.
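The following sketch shows one way such a Saga might look. Everything named here is an assumption for illustration: the `LowBalanceSaga` class, the `Withdrawn` event, the `SendLowBalanceNotice` command, and the choice that the Projection has already been updated by the time the Saga sees the event.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Withdrawn:  # hypothetical event
    account: str
    amount: int


@dataclass(frozen=True)
class SendLowBalanceNotice:  # hypothetical command
    account: str


class LowBalanceSaga:
    """Hypothetical Saga: reacts to future events, reads state from a Projection."""

    def __init__(
        self,
        read_balance: Callable[[str], int],
        send: Callable[[SendLowBalanceNotice], None],
        threshold: int = 50,
    ) -> None:
        self._read_balance = read_balance  # query into a Projection
        self._send = send                  # command dispatcher
        self._threshold = threshold

    def handle(self, event: Withdrawn) -> None:
        # No replay method exists: the Saga keeps no history of its own.
        if self._read_balance(event.account) < self._threshold:
            self._send(SendLowBalanceNotice(event.account))


# Stand-in for a Projection that already reflects the withdrawal below.
balances: Dict[str, int] = {"acc-1": 40, "acc-2": 500}
commands: List[SendLowBalanceNotice] = []

saga = LowBalanceSaga(lambda account: balances.get(account, 0), commands.append)
saga.handle(Withdrawn("acc-1", 60))  # balance 40 is below 50: command issued
saga.handle(Withdrawn("acc-2", 10))  # balance 500 is fine: nothing issued
```

If past events were replayed into this Saga during recovery, it would re-issue commands for things that were already handled, which is why it only listens to future events.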

Comparison

Let us assign past events to the recovery phase and future events to the normal operation phase of the system. We then get the following comparison that depicts which components participate in which operational phases of the system:

Figure 1: Event handling by Aggregates, Projections and Sagas under Recovery and Normal operation

As we can see, Aggregates participate in the recovery phase only. Sagas, on the other hand, participate only in the normal operation phase. Finally, Projections handle events all the time, so they participate in both the recovery and the normal operation of the system.
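The same comparison can be written down as a small routing table. This is only a sketch of the summary above; the `Phase` enum, the `HANDLES` mapping and the `recipients` function are hypothetical names, not part of any framework.

```python
from enum import Enum
from typing import Dict, List, Set


class Phase(Enum):
    RECOVERY = "recovery"  # past events
    NORMAL = "normal"      # future events


# Which kind of component receives events in which phase.
HANDLES: Dict[str, Set[Phase]] = {
    "Aggregate": {Phase.RECOVERY},
    "Projection": {Phase.RECOVERY, Phase.NORMAL},
    "Saga": {Phase.NORMAL},
}


def recipients(phase: Phase) -> List[str]:
    """Return the component kinds that must be given events in this phase."""
    return sorted(name for name, phases in HANDLES.items() if phase in phases)


during_recovery = recipients(Phase.RECOVERY)
during_normal_operation = recipients(Phase.NORMAL)
```

An event dispatcher built along these lines would consult such a table to decide where to deliver each event, depending on the phase the system is in.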

Conclusion

In this post we tried to point out a distinction between three concepts: Aggregates, Projections and Sagas. We did so from the point of view of their behaviour during the two different phases of the system’s operation: recovery and normal operation. We showed that the recovery phase is important to both Aggregates and Projections, while the normal operation phase is important to Projections and Sagas. Hopefully this will be helpful to someone designing systems that employ these concepts.

References

  1. Crash-Only Software, George Candea and Armando Fox, Stanford University, 2003, Proceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems.