Commands and Events in a Distributed System
Understanding the difference between the two makes designing your microservices much easier
As we build out the components in our distributed systems, we can find ourselves bombarded with seemingly unique use cases. Each one can seem like a new problem that needs to be individually analyzed and debated before it can be solved.
In reality, most use cases we’ll encounter will fall under some common patterns. Instead of reinventing the wheel over and over again, we should strive to develop guidelines, or rules of thumb, that help us make quick, consistent decisions.
In fact, as complicated as distributed system designs can be, they can be boiled down to some basic concepts. One important concept is the distinction between commands and events. Nearly all interactions between individual microservices involve either one or the other. If we can recognize whether a given use case involves command-processing or event-handling, we’ll already be well on our way to a solid design that’s consistent with the rest of our system.
With that, let’s look at the difference between the two.
Commands
Commands are the first things that we as software engineers work with. From our first Hello World example to our first monolithic website backed by a single monolithic database, we’ve been unknowingly handling commands.
But what are they?
Commands are actions that a person or some other entity wishes to be performed. Importantly, this action has not yet happened. It may happen in the future or it may not happen at all.
Let’s look at a few examples. In fact, forget about programming for a moment. Instead, I’ll describe a typical interaction in my household. It’s time for dinner, but my son is playing an online video game with his friends. So I issue a command: “Turn off the video game so that you can come eat dinner.”
In general, two things can happen:
- My command could succeed. My son tells his friends goodbye, logs out of the game, and joins us at the dinner table.
- My command could fail. My son pretends not to hear me and refuses to leave the game.
Now let’s consider a typical software example. A customer has visited an online shopping website and has selected some items to buy. To complete their purchase, they enter some information into the website’s checkout form and click the “Complete Purchase” button.
Again, two things could happen:
- The command could succeed. The purchase data is committed to a database and the customer receives a response indicating that their order is en route.
- The command could fail. This could occur for any number of reasons. The billing information might be incorrect, the shipping address could be incomplete, the items in the shopping cart could be out of stock, or a technical issue could occur on the back end.
Of course, in order to even get to the checkout page, a number of other commands will have already succeeded. Even something as simple as clicking on a link to view a product’s details represents a command.
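The two outcomes above can be sketched as a minimal command handler. This is a toy illustration, not a real checkout service; the `PlaceOrder` command and its validation rules are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class PlaceOrder:
    """A command: something the customer *wants* to happen. It hasn't happened yet."""
    card_number: str
    shipping_address: str
    items: list

def handle_place_order(cmd: PlaceOrder) -> dict:
    """Process the command immediately; it either succeeds or fails."""
    if not cmd.items:
        return {"ok": False, "error": "cart is empty"}
    if not cmd.shipping_address:
        return {"ok": False, "error": "shipping address is incomplete"}
    if not cmd.card_number.isdigit():
        return {"ok": False, "error": "billing information is invalid"}
    # At this point a real service would commit the order to a database.
    return {"ok": True, "order_id": 1001}

# A failed command: nothing happened, and the caller is told why.
print(handle_place_order(PlaceOrder("4111111111111111", "", ["book"])))
```

Note that a failure leaves the system unchanged — the caller is free to correct the command and resubmit it.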
Events
As we just discussed, commands represent actions that a person or agent wishes to be performed. Events are actions that have successfully completed.
Generally, when a command succeeds, the result is a record written to some persistent datastore. Typically that means a database transaction is committed, though it might simply mean a file is saved to a filesystem.
Take our canonical Hello World web application example, where we respond to a browser request with a simple HTTP response. In that case, we were probably focused on the command — receiving the GET request — and fulfilling it. But once the request was fulfilled, it had become an event. Moreover, that event was probably recorded in the web server’s logs.
So, you can consider events to be the mirror image of commands. Or you can consider commands to be caterpillars, and events to be butterflies. The bottom line is that an action starts out as a command. Once it succeeds, it becomes an event.
Differences Between Commands and Events
So, requests that we receive start out as commands, and then become events. What’s the big deal?
Well, it turns out that this has a number of implications on how we should handle each.
Commands can change — events do not
As we’ve discussed, commands can fail.
This might happen as a result of a technical issue, in which case a re-submission of the same command might succeed.
Or, it might fail because of a data validation error. In such cases, re-submissions of the same command would continue to fail. However, the user-agent can modify the command and retry it. If our customer’s order was declined because their credit card had expired, the customer can enter a different, valid credit card number and resubmit the order.
However, once the command succeeds, there’s no turning back. The resulting event has happened and cannot be undone.
Furthermore, the event is immutable. Whatever the state of the command when it was committed, that’s the state of the event forever.
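One way to picture this immutability is an append-only event log whose entries cannot be modified. A minimal sketch, using Python's frozen dataclasses to enforce immutability (the `OrderPlaced` event is hypothetical):

```python
from dataclasses import dataclass, FrozenInstanceError

@dataclass(frozen=True)
class OrderPlaced:
    """An event: something that *has* happened. Immutable by construction."""
    order_id: int
    total_cents: int

event_log = []   # append-only: entries are never updated or deleted
event_log.append(OrderPlaced(order_id=1001, total_cents=2599))

try:
    event_log[0].total_cents = 0   # any attempt to mutate the event fails
except FrozenInstanceError:
    print("events cannot be changed once recorded")
```

Whatever state the command was in when it was committed is the state the event carries forever.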
Once an online customer completes a purchase, then that purchase has become an event. Conceptually, there is no going back in time and negating it. Of course, anything is possible with software engineering — we could simply delete our record of the event. But there are a multitude of reasons why we should never do this:
- Whether or not we delete our own records, the event will still have occurred in reality. And our record should, well, represent reality.
- It’s likely that other entities will have a record of the same event. The customer, of course, will know that they had just made a purchase. In addition, we might have contacted the customer’s credit card company, or a shipping company, or our warehouse… all of which now have a record of the event.
That isn’t to say that the purchase cannot be canceled. Our customer might find a better price elsewhere, or decide that they simply don’t need the product. That’s fine, but that cancellation would require a separate, subsequent command. And once that command succeeds, it too would become a separate, subsequent event.
Commands are unordered — events are ordered
Above, we noted that the cancellation event would be subsequent to the purchase event. This is important. Commands can be fired off and handled by our system in any arbitrary order. But once they succeed — once they become events — then their order is immutable.
Maintaining this ordering is just as important as maintaining the immutability of each event itself. If not, then our event-handling services could wind up with incorrect or corrupt data. Imagine, for example, a service that attempts to process an order-canceled event before being able to process the original order-placed event.
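We can see why ordering matters by replaying a toy event log into a projection of each order's state (the event shapes here are hypothetical):

```python
# Replaying the log in order yields a consistent view of each order's state.
def apply(state: dict, event: tuple) -> dict:
    kind, order_id = event
    if kind == "order-placed":
        state[order_id] = "placed"
    elif kind == "order-canceled":
        # This only makes sense if we've already seen the order placed:
        assert order_id in state, "canceled an order we never saw placed"
        state[order_id] = "canceled"
    return state

log = [("order-placed", 1001), ("order-canceled", 1001)]
state = {}
for event in log:             # events MUST be applied in log order
    state = apply(state, event)
print(state)                  # {1001: 'canceled'}
```

If the same two events arrived in the opposite order, the consumer would be asked to cancel an order it has never heard of — exactly the corruption described above.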
Commands can be rejected — events cannot
Services that handle commands may routinely decide to reject commands if they contain bad or invalid data. Moreover, while not ideal, it’s not the end of the world if a command fails due to a technical glitch.
However, an event-handling service must handle each event that occurs. It does not have the luxury of rejecting events, or allowing them to be dropped, for any reason.
Why is this? Recall that commands represent something that hasn’t happened yet. If a command fails for any reason, then every interested party involved — our customer, our own systems, any downstream systems — will all be in agreement that, well, nothing happened. And since our user-agent will be informed of the failure, they can then decide how to proceed (e.g. give up, try again, etc).
An event, on the other hand, represents something that has occurred. As far as our user-agent is concerned, their request was successful. So if we are a service that processes events, we cannot simply decide to reject some. Otherwise, our data will wind up in a permanently inconsistent state across our organization.
Or to provide a concrete example, imagine our unhappy customers who successfully make online purchases…but never receive their goods.
This means that data validation is the job of command handlers and not event handlers. Moreover, while we can tolerate the occasional lost command due to technical issues, it’s vital that our event-handling services are able to withstand outages to ensure no events are lost.
Commands now, events later
So, the way we process commands is fundamentally different than the way we handle events. In fact, if you take nothing else away from this article, take the following rule of thumb:
- We process commands that we are responsible for; moreover, we process them immediately, until they either succeed or fail.
- Someone else later handles the events that result from our commands; moreover, they should ensure that every event is eventually handled, regardless of how long it takes.
Commands and Events in a Distributed System
Next, let’s explore what it looks like to handle commands and events in a distributed system. To understand, let’s briefly discuss an important design pattern, the Bounded Context.
The Bounded Context
Say we ran a typical retail website. Undoubtedly, we would need some services that facilitate online purchases. For this, we might form a Checkout Bounded Context, consisting of:
- A team — front-end and back-end engineers, designers, product managers, DBAs, etc — who will be responsible for building and maintaining the checkout experience
- The services, applications, and infrastructure that the team will build
Most likely, this team would know nothing about actually fulfilling an order once it is successfully placed. For this, we would have an Order Fulfillment Bounded Context, which itself would consist of its own team, plus services/applications/infrastructure.
How do commands and events relate to Bounded Contexts?
The Checkout Bounded Context is responsible for validating and handling online-order commands. Once a command is complete, Checkout’s job is done. Any successful order must be fulfilled, of course, but that is not Checkout’s concern. Instead, the Order Fulfillment Bounded Context will be responsible for fulfilling the order. But how will it be made aware of the order in the first place?
Our first, naive answer might be that Checkout will call an endpoint — say, by posting to a REST URL hosted within Order Fulfillment.
In reality, that approach presents a few different problems:
- It couples the checkout process to the order fulfillment process. Checkout’s job is to get the order processed and committed. But in requiring it to make that POST call, it is now also responsible for kicking off the fulfillment process. What if the call to Order Fulfillment fails? Has the order command — which had previously been thought to have succeeded — now suddenly failed? Effectively, we will have chained the command that is handled by Checkout with a second command that is initiated by Checkout.
- It requires Checkout to be aware of all other Bounded Contexts that are interested in checkout events. The Checkout team will need to write and maintain the code to call Order Fulfillment. And any time a new Bounded Context (say, Analytics) becomes interested in order events, they will need to do the same.
- It can be the start of performance bottlenecks and cascading failures. As a part of handling online orders, Checkout will make synchronous calls to other Bounded Contexts, which in turn may make calls to additional resources. Even under the best circumstances, this explosion of calls can eat away at response times. Any failure can wreak havoc on the entire checkout process.
Checkout can alleviate some of these problems by making an asynchronous call to Order Fulfillment, perhaps introducing a retry mechanism in case the first call to Order Fulfillment fails.
That actually moves us in the right direction. But there’s a still better way.
Checkout needs only to make the results of the command — the resulting event — available to any other interested Bounded Contexts. In other words, Checkout should simply publish each successful command as an event. Order Fulfillment — and any other Bounded Context that needs to respond to completed orders — can subscribe to those events and consume them as they arrive.
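The key property of publish/subscribe is that the publisher never names its consumers. A minimal in-memory sketch (the topic name and subscriber behavior are illustrative, not a real broker):

```python
from collections import defaultdict

class EventBus:
    """Toy pub/sub bus: publishers don't know who (if anyone) is subscribed."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        for handler in self._subscribers[topic]:
            handler(event)

bus = EventBus()
fulfillment_queue, analytics_totals = [], []

# Order Fulfillment and Analytics subscribe independently; Checkout knows nothing of either.
bus.subscribe("online-orders", fulfillment_queue.append)
bus.subscribe("online-orders", lambda e: analytics_totals.append(e["total_cents"]))

# Checkout publishes the successful command as an event and moves on.
bus.publish("online-orders", {"order_id": 1001, "total_cents": 2599})
```

Adding a new interested Bounded Context is now a new `subscribe` call on the consumer's side; Checkout's code does not change at all.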
How are events published and consumed?
These days, we generally publish events to a so-called event bus, like Kafka.
There are plenty of other articles that explore Kafka’s details — we won’t do that here. But essentially, Kafka is a datastore which can be divided into topics (in our example, online-orders might be a topic). A single service (the publisher) would publish events (represented as messages on the Kafka topic) to a single topic. A message is simply a self-contained data structure that conforms to a predefined schema (perhaps defined via JSON or, better, a binary format like Avro), and that provides the relevant details of the event.
Any number of other services (the consumers) can then subscribe to that topic to consume those messages.
Moreover, by using an event bus like Kafka, we can be sure that:
- Messages will be consumed by each subscriber in the order in which they were produced (assuming that we are thoughtful about our design and that we configure things correctly). This is vital — remember, we wouldn’t want to process an order-canceled event before we can process the original order-placed event.
- Each consumer will receive all messages that are published to the topic
- If a consumer is temporarily down, or has a temporary issue handling a message, then message delivery to that service will be retried. In fact, each consumer manages its own offset, which tracks the last message that it successfully consumed. Unlike with synchronous calls — which would need to give up retrying in a very short period of time — message delivery can be retried for hours or days, if necessary.
- We can safely scale out the number of instances of each consumer.
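The offset mechanic in particular is worth seeing concretely. Here is a toy model of a topic and a consumer that only advances its offset after a message is handled successfully, so a transient failure simply means the same message is redelivered later (this mimics the Kafka behavior described above; it is not Kafka's actual API):

```python
class Topic:
    """Toy append-only topic."""
    def __init__(self):
        self.messages = []

    def publish(self, msg):
        self.messages.append(msg)

class Consumer:
    """Each consumer tracks its own offset into the topic."""
    def __init__(self, topic, handler):
        self.topic, self.handler, self.offset = topic, handler, 0

    def poll(self):
        """Handle pending messages; on failure the offset does not advance,
        so the same message is redelivered on the next poll."""
        while self.offset < len(self.topic.messages):
            try:
                self.handler(self.topic.messages[self.offset])
            except Exception:
                return            # retry later, from the same offset
            self.offset += 1      # commit only after successful handling

topic = Topic()
topic.publish("order-placed:1001")

seen, fail_once = [], [True]
def flaky_handler(msg):
    if fail_once and fail_once.pop():   # simulate a transient outage
        raise RuntimeError("temporarily down")
    seen.append(msg)

consumer = Consumer(topic, flaky_handler)
consumer.poll()   # first attempt fails; offset stays put
consumer.poll()   # retried and handled this time
print(seen)       # ['order-placed:1001']
```

Because the offset lives with the consumer, a service that is down for hours can come back, pick up from where it left off, and miss nothing.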
Most importantly, it ensures that one Bounded Context can focus on handling the commands for which it is responsible, and then quickly hand off the corresponding event for other interested Bounded Contexts.
Commands, Events, Coupling, and (A)Synchronicity
This leads us to an oft-asked question when building microservices: When should our services communicate synchronously and when should they communicate asynchronously?
To answer, let’s revisit the rule of thumb we’d mentioned earlier. We want to handle the commands that we’re responsible for immediately, until they’ve completed (or failed). Sometime later, someone else will handle the resulting events.
So we ask ourselves: does our use case serve to process a command, or does it serve to notify other services that an event has occurred? Odds are, if we’re processing a command, we want to do it synchronously. The agent can issue a request and wait for the response; by the time it’s returned, the command will have completed (or it will have failed).
However, if we’re processing an event, there is generally not anyone waiting for us to finish. Moreover, we don’t want anyone to wait on us. Instead, we want to process the event at our own pace, retrying if necessary. So typically we’ll publish and handle events asynchronously.
So as a rule of thumb, we can say that there is a correlation between:
- commands and synchronous communication, and
- events and asynchronous communication
Are there ever exceptions to this rule? Sure.
As an example, we sometimes need to do a lot of work before we can commit a command. Often we’ll need to validate parts of the command, and in turn, this might mean expensive calls to other systems — sometimes systems within our organization, and sometimes to third-party systems.
Now, most users are accustomed to waiting a while when making an online purchase. But in some cases, the expected wait might be unreasonably long.
In these cases, we may asynchronously hand off part of the request to another component. For example, we can make a quick check up front, which will catch some percentage of likely failures. Then we would post a message to a message queue, such as RabbitMQ. Once we’ve posted the message, we will be able to return an immediate response to the user-agent.
Meanwhile, another one of our components — which will have subscribed to RabbitMQ — will then pick up where we’d left off, and complete the processing of the command.
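This hand-off can be sketched with an in-memory queue standing in for RabbitMQ (the endpoint, payload shape, and worker are all illustrative):

```python
import queue

command_queue = queue.Queue()   # stands in for a RabbitMQ queue

def receive_checkout_request(order: dict) -> dict:
    """Quick up-front check, then hand off and respond immediately."""
    if not order.get("items"):
        return {"status": 400, "error": "cart is empty"}   # catch cheap failures now
    command_queue.put(order)                               # expensive work happens later
    return {"status": 202, "message": "order received"}    # immediate response

def worker() -> list:
    """A separate component — the queue's sole consumer — finishes the command."""
    processed = []
    while not command_queue.empty():
        processed.append(command_queue.get())              # validate, charge, commit, etc.
    return processed

resp = receive_checkout_request({"items": ["book"], "card": "4111111111111111"})
print(resp["status"], worker())
```

The HTTP 202 ("Accepted") response reflects exactly this situation: the command has been accepted for processing, but processing has not completed.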
In this scenario, we’ll have processed the command in an asynchronous manner. So we might ask: Is processing a command asynchronously the same as publishing an event?
In fact, there is a fundamental difference between the two. And it has to do with coupling.
Commands involve coupled components
The components involved in processing commands are all tightly coupled. This was true when we handled everything synchronously, and it remains true when we add the asynchronous message passing. When we publish the message to RabbitMQ, we expect it to be consumed and processed by exactly one component on the other end. In fact, in all likelihood, we’ve written and are maintaining that component ourselves.
This is why we typically use a message queue like RabbitMQ — which is more suited to point-to-point communication — for asynchronous handling of commands.
By contrast, when we publish an event, we (theoretically) have no idea who will consume the message. Maybe there will be just one consumer subscribed. Maybe there will be none. Maybe there will be a hundred. The point is, our producer and any potential consumers are decoupled.
So we might revise our rule of thumb thus:
- Processing commands is performed across coupled components, ideally but not always via synchronous communication.
- Publishing and consuming events is performed across decoupled components, nearly always via asynchronous communication.
We mentioned earlier that we’re probably accustomed to handling commands, but probably less so with handling events. So we’ll close out with a few common questions that arise as we start producing and handling events.
When should we emit events?
At this point, we might wonder: When should we publish events? We’ve already stated that events are the result of some task being completed. So does that mean any task at all? That whenever one of our services does anything at all, it should publish a corresponding event?
We can answer this question by thinking in business terms. Did we just perform a task with some business importance? If so, then in all likelihood, some other system might want to know about it. In that case, we should publish an event. Otherwise, we might not need to bother.
A previous company I’d worked at referred to published events as “BizOps” (short for “business operations”). This name was meant to solidify the idea that we wanted to publish events that had some meaning to the business.
Note that some organizations might also be interested in failed commands. For example, this might be important for analytics or auditing purposes. So it’s quite reasonable for us to publish command failures as events.
Note also that the handling of one event can introduce a second event of business importance. So event consumers will typically wind up producing events of their own. From our previous example, our Order Fulfillment Bounded Context operates solely based on event consumption — it never receives commands to process. Yet we can be fairly certain that other parts of our organization might want to be notified when an order has been fulfilled.
Can we consume our own events?
Earlier, we’d stated that we produce events for someone else to consume. But we can also consume our own events.
Why would we want to do this? Let’s take a new example to illustrate. Imagine our company offers a website allowing users to post their favorite recipes, and/or browse for recipes to make. The services that handle those activities are owned by our Recipe Inventory Bounded Context.
Within Recipe Inventory, we have a data microservice, backed by a CRUD database, to allow users to submit their recipes. To support browsing for recipes, however, we’d build a service on top of something more optimized for searching (say, Elasticsearch).
When a customer adds a new recipe, they are clearly issuing a command. We want to validate their submission and save it into the CRUD database.
We also want our search service to become aware of the new recipe. But how?
Our first thought — since we own both services — might be to make a POST from the data microservice to the search service. However, for all the reasons we’ve already discussed, we would likely be better off publishing the “add-recipe” event, then subscribing our search service to consume those events in order to add the new recipes to the search index.
The new recipe won’t wind up in the search index immediately, of course (a property known as eventual consistency). In reality, however, users almost never notice the delay. Moreover, the delay would typically be measured in mere seconds, if not milliseconds.
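A toy version of the two stores makes the eventual-consistency window visible — the CRUD database commits the command, and the search index catches up by consuming the resulting event (the store shapes and the `recipe-added` event are illustrative):

```python
recipes_db, search_index, pending_events = {}, {}, []

def add_recipe(recipe_id: int, title: str):
    """Command handler: validate, commit to the CRUD store, publish the event."""
    assert title, "recipe needs a title"
    recipes_db[recipe_id] = title
    pending_events.append(("recipe-added", recipe_id, title))

def search_indexer_poll():
    """The search service consumes events at its own pace."""
    while pending_events:
        _, recipe_id, title = pending_events.pop(0)
        search_index[recipe_id] = title.lower()   # stand-in for real index analysis

add_recipe(7, "Mushroom Risotto")
# For a moment, the recipe exists in the database but not yet in the index...
assert 7 in recipes_db and 7 not in search_index
search_indexer_poll()   # ...until the indexer catches up
assert search_index[7] == "mushroom risotto"
```

The gap between the two `assert` lines is the eventual-consistency window; in practice it is typically seconds or less.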
How do we handle events from third parties?
Throughout this article, we’ve discussed event-handling within our own systems. Yet we sometimes need to consume events that are published by third parties that are not a part of our organization. This can get tricky, for a few reasons:
- Often, we can’t simply subscribe to a third party’s Kafka cluster and consume messages the way we might internally. And while there are some emerging, cloud-specific solutions such as Google Pub/Sub, those solutions won’t be applicable to every use case.
- We will want to validate any data coming into our system from an external source. This can be at odds with how we think about handling events within our own system.
A general, commonly-used technique for event publishing across organizations is Webhooks. If you’re not familiar with the concept of Webhooks, it’s actually quite simple. Rather than subscribing to a Kafka topic, we implement an HTTP endpoint. Typically this just means implementing a POST request handler. The endpoint will expect payloads in a format defined by the third party.
Once the endpoint is implemented, we register the endpoint’s URL with the third party (generally through some out-of-band mechanism, such as a form within a secure partner portal). Now, whenever an event occurs within the third party, a corresponding POST request will be made to our endpoint.
This works. But our tendency will be to treat Webhook requests as we would any other HTTP POST; that is, as commands. In reality, the third party is most likely notifying us of an event that has occurred and, as such, we will want to do whatever we can to respond to the event.
Except…that’s at odds with our need to validate input into our system. In addition, it’s at odds with the simple fact that a temporary outage of our HTTP request handler means that we risk missing events.
To properly implement such a solution, we must ensure the following:
- If our endpoint is unavailable — or if it returns a 5xx response — the third party has a robust retry mechanism to ensure that our system will eventually consume the event.
- We perform data validation as soon as possible. That way, we can immediately either return a 4xx (Bad Request) response, or persist the message within our system (perhaps to our own Kafka topic) to ensure that we don’t lose it.
- Our third party has a mechanism to handle any 4xx (Bad Request) responses that we return, such that the message can be fixed and reposted for our later consumption.
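The decision logic of such a webhook handler can be sketched as a plain function — validate early, persist before acknowledging, and use status codes to drive the third party's (hypothetical) retry and correction mechanisms. The payload schema here is invented for illustration:

```python
import json

received_events = []   # stands in for our own durable store (e.g. our own Kafka topic)

def webhook_endpoint(body: str) -> int:
    """Toy webhook handler returning an HTTP status code.
    The third party is assumed to retry on 5xx and surface 4xx for correction."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return 400                       # malformed: reject so it can be fixed and reposted
    if "event_type" not in payload:
        return 400                       # fails our schema validation
    received_events.append(payload)      # persist *before* acknowledging
    return 200

print(webhook_endpoint('{"event_type": "order-shipped", "order_id": 1001}'))
```

Once the payload is safely persisted on our side, the slower, event-style processing can happen on our own schedule, with our own retries.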
Nearly every service we build in a microservices architecture will be involved with handling either commands or events. There are fundamental differences between these two concepts.
Understanding these differences will help us to recognize common use cases as they arise, which in turn will make it much easier for us to design more consistent, functional systems.