EDA — The paradigm shift we’ve been waiting for
Phil Karlton once said that there are only two hard things in Computer Science: cache invalidation and naming things. I want to add a third to that list: versioning. At Dolittle, we work hard at making it easier to write software; not so much the writing itself (although we believe we bring a lot to the table there as well), but the maintainability of what you write. Once the software is written and put in the hands of users, that is when the fun begins. Your code gets tried and tested, and users start asking for all the changes they want so that the system stays useful and adds more value to their businesses. But I digress; let's focus on what this article is really about: Event-Driven Architecture.
Events != Events
Words have meaning, and meaning depends on context. "Event" has been used in different contexts to describe what are really unrelated concepts. It is not an unambiguous term, and it comes with a load of conceptual baggage. For instance, in IoT technology, events are the messages coming out of sensors. We use technologies like Kafka, Microsoft's Azure Event Hub, Event Grid or Azure IoT Hub to collect these messages and transport them. IoT scenarios often feed into the concept of Big Data, and one talks about events occurring at very high resolution, potentially thousands per second. At Dolittle, those are not what we consider events; we call them data-points in a TimeSeries, produced by measurements on a sensor. Their resolution and velocity should be decided by the consumer, since the producer only produces when asked. What we typically do with these data-points is observe them. When we talk about events at Dolittle, on the other hand, we mean domain events, also called business events, or business moments as Gartner calls them:
«In the context of digital business, "business events" could be anything that is noted digitally, reflecting the discovery of notable states or state changes, for example, completion of a purchase order, aircraft landing and room noise-level signal are all digitally reported events. Some business events, or convergence of events, form "business moments", a detected situation that calls for business action.»
From this we can discern several notable features of a domain event:
- An event is something that happened. It is a fact.
- It is something that is important to the business and is identified as such.
- It usually records a state transition.
Understanding the distinction between domain events and TimeSeries data-points makes it clear that the motivations and technical challenges for each are different. They have different requirements and call for different toolsets, both mentally and in terms of technology. Although you can probably achieve a lot with the same software used in the IoT scenario, you quickly realise that the technology is built for scenarios and scale that most line-of-business applications don't really have. It is born out of the necessity to deal with a high throughput of events, while most line-of-business applications simply don't have that volume of state changes. Line-of-business applications are, however, much less forgiving about losing events or processing them out of order.
We have already stated that domain events represent the state changes in your system. With this statement comes a great opportunity to think differently about how we approach state in our applications. If we have a record of all the state changes that have occurred in our system, then the current state of our system is just the result of applying all those state changes, nothing more. Traditionally, this current state would be stored in a relational database and be considered "the truth" of our system. What if we don't rely on regular databases for representing the truth in our systems? The traditional relational database, the constraining factor in distributed, scalable systems, is demoted from the key role of "source of truth" to that of a performance optimisation: a cache. The truth can be stored in a specialised database for events: an event store. Gartner refers to this as an event ledger, drawing the obvious parallel with an accounting ledger. If you start feeling that this concept sounds familiar and has been part of the last couple of years' hype, you're right: it's the same basic idea on which Blockchain rests. An event store or event ledger is simply an immutable log, a storage mechanism that you only ever append to. This approach changes how you need to think about building your software, but it opens up massive new opportunities and can reduce your business' mean time to action.
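The idea that current state is just the result of applying all recorded state changes can be sketched in a few lines. This is a hypothetical, minimal in-memory illustration, not the Dolittle platform or any real event store; the event names and the bank-balance example are invented for the sake of the sketch.

```python
from dataclasses import dataclass, field

# A minimal append-only log: the "source of truth" the article describes.
@dataclass
class EventStore:
    _log: list = field(default_factory=list)

    def append(self, event: dict) -> None:
        self._log.append(event)   # events are only ever appended, never mutated

    def replay(self):
        return iter(self._log)    # the ordered, immutable history

# Current state is nothing more than a fold over all recorded events.
def current_balance(store: EventStore) -> int:
    balance = 0
    for event in store.replay():
        if event["type"] == "Deposited":
            balance += event["amount"]
        elif event["type"] == "Withdrawn":
            balance -= event["amount"]
    return balance

store = EventStore()
store.append({"type": "Deposited", "amount": 100})
store.append({"type": "Withdrawn", "amount": 30})
print(current_balance(store))  # 70
```

Note that the relational database never appears here: any "current state" you need is derivable, and therefore disposable, exactly as a cache is.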
The last decade has been the age of APIs. It's been all about how we can get to the perfect APIs and how we manage them well: authorisation, scale, fault handling, DDoS attacks, and good strategies for throttling and such. All of this and more boils down to the concept of API management. As we now enter a new, more reactive age, the age of events, we need to take the lessons from API management and create the patterns, tools and common understanding of what is needed to be successful when leveraging events. What we do know is that being successful with events is really not just about «fire & forget»; you need to actively manage its different aspects. Below are some of the things we believe in and work on for our platform.
We believe that a good event platform works on the concept of streams. A stream is simply a persisted, ordered series of events. The fact that the stream is persisted and ordered allows us to approach distributed development more confidently and rigorously. A stream of immutable, ordered events is infinitely cacheable. It also makes optimistic concurrency in a distributed system trivial.
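To make the optimistic-concurrency claim concrete, here is a hypothetical in-memory sketch of a stream. The `Stream` class and its version numbering are invented for illustration; a writer states which version of the stream it last observed, and an append is rejected if someone else got there first.

```python
class ConcurrencyError(Exception):
    pass

class Stream:
    """A persisted, ordered series of events (in-memory sketch)."""
    def __init__(self):
        self.events = []

    def append(self, event: dict, expected_version: int) -> int:
        # Optimistic concurrency: the caller declares the stream version it
        # observed; a mismatch means another writer appended in the meantime.
        if expected_version != len(self.events):
            raise ConcurrencyError(
                f"expected version {expected_version}, "
                f"stream is at {len(self.events)}"
            )
        self.events.append(event)
        return len(self.events)   # the new version

stream = Stream()
new_version = stream.append({"type": "OrderPlaced"}, expected_version=0)

try:
    # A second writer with a stale view of the stream is rejected.
    stream.append({"type": "OrderCancelled"}, expected_version=0)
except ConcurrencyError as error:
    print("rejected:", error)
```

Because the stream is ordered and its events immutable, no locking is needed: conflict detection reduces to a single integer comparison.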
Event Types and Schemas
How an event looks matters a lot when thinking in terms of the domain and the business: what attributes it has, what it is named, and what it means. This information is important and should be stored in a schema store for events.
There is a basic distinction between internal domain events and external integration events. In much the same way that you would not grant other applications access to your internal database, you do not allow other applications to subscribe to your internal domain events. Your external stream of integration events is your contract with the outside world: your API.
Internal domain events are generally smaller, more focused events. The name should indicate not only what happened but why. The event should contain only the information that is needed. External integration events will generally contain more contextual information while saying less about why the event occurred. For example, a warehouse application is only interested in the fact that a customer account has been suspended, not in all the various reasons for which it can be suspended; those are the concern of the Customer Management application.
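The suspended-customer example can be sketched as two event shapes. These class names and fields are invented for illustration: the internal event is named for why it happened and carries only what is needed, while the integration event published to the warehouse carries more context and no reason.

```python
from dataclasses import dataclass

# Internal domain event: small, focused, named for *why* it happened.
@dataclass(frozen=True)
class CustomerSuspendedForNonPayment:
    customer_id: str
    overdue_invoice_id: str

# External integration event: the contract other applications subscribe to.
# More context (the name), no specific reason for the suspension.
@dataclass(frozen=True)
class CustomerAccountSuspended:
    customer_id: str
    customer_name: str
    suspended_at: str

def to_integration_event(domain_event, customer_name: str,
                         when: str) -> CustomerAccountSuspended:
    # The boundary translates the focused internal event into the
    # context-rich external one, deliberately dropping the "why".
    return CustomerAccountSuspended(domain_event.customer_id,
                                    customer_name, when)

internal = CustomerSuspendedForNonPayment("cust-42", "inv-9001")
external = to_integration_event(internal, "Acme Ltd", "2019-05-01T10:00:00Z")
```

A second internal event, say `CustomerSuspendedForFraud` (hypothetical), would translate to the very same external event type, which is exactly the decoupling the warehouse wants.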
The ability to project one or more events down to aggregations or specialised read models is critical, as is the ability to do this reactively whenever an event occurs. This is supported by Event Store, the event store we have chosen and are investing our time and effort into. In addition to its built-in projections, we make it extremely easy to materialise and maintain the projected results of streams in relational or document databases, so the application can query on top of them, and so that ad-hoc queries and existing BI tools and knowledge can be put to use. Another aspect of projections is the ability to react to changes in these projections and then perform actions in your domain.
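A projection is, at its core, a fold from events into a read model that can also be fed one event at a time as events occur. The sketch below is a hypothetical stand-in, not Event Store's projection engine; the event shapes and the "open orders" read model are invented for illustration.

```python
from collections import defaultdict

class OpenOrdersProjection:
    """A specialised read model: which orders are still open per customer.

    It can be rebuilt by replaying a whole stream, or kept current
    reactively by handing it each new event as it occurs.
    """
    def __init__(self):
        self.open_orders = defaultdict(set)   # customer_id -> {order_id}

    def on(self, event: dict) -> None:
        if event["type"] == "OrderPlaced":
            self.open_orders[event["customer_id"]].add(event["order_id"])
        elif event["type"] == "OrderFulfilled":
            self.open_orders[event["customer_id"]].discard(event["order_id"])

projection = OpenOrdersProjection()
for event in [
    {"type": "OrderPlaced",    "customer_id": "c1", "order_id": "o1"},
    {"type": "OrderPlaced",    "customer_id": "c1", "order_id": "o2"},
    {"type": "OrderFulfilled", "customer_id": "c1", "order_id": "o1"},
]:
    projection.on(event)

print(sorted(projection.open_orders["c1"]))  # ['o2']
```

Materialising this into a relational or document database is then a matter of writing `projection.open_orders` out whenever `on` changes it.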
Remember that I added a third problem to Phil Karlton's claim: versioning. I would argue this is much harder than the two other problems combined. Sure, when versioning software we have concepts like SemVer that help us give meaning to versions and communicate them. But what remains hard is making sure that systems which depend on each other stay decoupled and don't end up broken as a consequence of versioning issues. The widespread approach is to keep major versions of APIs around, either as running instances or in the same codebase behind different routes. With a managed event platform, we could deal with this differently; in fact, we could take some lessons from technologies such as Active Record and its migration strategies. What if we could declaratively describe the differences between one version of an event and another? What if this could perform both an upcast and a downcast, leaving the event both backwards and forwards compatible? At Dolittle, we call this a migration, and each version a generation.
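A minimal sketch of the upcast/downcast idea, under invented assumptions: generation 1 of a hypothetical event stored a single `name` field, and generation 2 splits it into `first_name` and `last_name`. The migration is just a pair of pure functions between the two shapes.

```python
# Hypothetical migration between two generations of the same event type.

def upcast_v1_to_v2(event: dict) -> dict:
    # Old consumers wrote {"name": "..."}; new consumers expect the split form.
    first, _, last = event["name"].partition(" ")
    return {"generation": 2, "first_name": first, "last_name": last}

def downcast_v2_to_v1(event: dict) -> dict:
    # Lets an old consumer keep working against a stream of new events.
    full_name = f'{event["first_name"]} {event["last_name"]}'.strip()
    return {"generation": 1, "name": full_name}

v1 = {"generation": 1, "name": "Ada Lovelace"}
v2 = upcast_v1_to_v2(v1)
assert downcast_v2_to_v1(v2) == v1   # round-trips cleanly
```

Because both directions exist, the stored generation no longer dictates what consumers must understand: each processor can ask for the generation it was written against, which is the backwards-and-forwards compatibility the paragraph above describes.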
Events need to be processed. For those scenarios where projections aren't what you're looking for, or aren't enough, you need to respond to and process events programmatically. These processors work through the events in a stream, maintaining an offset within that stream. Processors can fail for different reasons, so it is important to understand what failure means. A processor can be quarantined, but not necessarily globally; it may be quarantined for just one particular event source, based on its id. This calls for tooling that gives the developer or owner of an application insight into the problem, so they can understand its cause, whether it is a code problem or a state problem in the system. When the problem has been solved, it should be possible to resume quarantined event processors.
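The offset-plus-quarantine behaviour can be sketched as follows. This is an invented toy, not the platform's implementation: a real system would retain the skipped events for redelivery after `resume`, which this sketch omits for brevity.

```python
class Processor:
    """Processes a stream, tracking its offset, and quarantines only the
    failing event source rather than stopping the whole stream."""

    def __init__(self, handler):
        self.handler = handler
        self.offset = 0
        self.quarantined = set()   # event-source ids currently quarantined

    def process(self, stream: list) -> None:
        for event in stream[self.offset:]:
            source = event["event_source_id"]
            if source not in self.quarantined:
                try:
                    self.handler(event)
                except Exception:
                    # Quarantine this source; events from others keep flowing.
                    self.quarantined.add(source)
            self.offset += 1

    def resume(self, source: str) -> None:
        self.quarantined.discard(source)   # after the problem is fixed

handled = []
def handler(event):
    if event["event_source_id"] == "s2":
        raise RuntimeError("bad state for s2")   # a state problem, say
    handled.append(event["n"])

processor = Processor(handler)
processor.process([
    {"event_source_id": "s1", "n": 1},
    {"event_source_id": "s2", "n": 2},
    {"event_source_id": "s1", "n": 3},
])
print(handled, processor.quarantined)  # [1, 3] {'s2'}
```

The key property is that one misbehaving event source (`s2` here) does not block progress for the others; that is what makes per-source quarantine gentler than a global one.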
When one claims that the event store/ledger is the source of truth, you also need a way to replay events for different purposes. In day-to-day operation, replaying events to a newly introduced event processor is the scenario you encounter most. This is very different from replaying to rebuild transient state in a database. A good event-based system should make the distinction and support both well. The reason to formalise the difference becomes obvious with a simple use case like sending out an email as a reaction to an event: once you've done that, you never want a replay to do it again.
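One hedged sketch of that distinction, with invented names: a projection handler is safe to re-run on every replay, while a side-effecting handler consults a record of work already done (and an explicit replay flag) so the email goes out at most once.

```python
# Replay-safe: transient state can always be recomputed from the ledger.
def rebuild_balance(events: list) -> int:
    return sum(e["amount"] for e in events if e["type"] == "Deposited")

# Side effect: must happen at most once, and never during a replay.
sent_emails = set()

def send_welcome_email(event: dict, replaying: bool) -> None:
    key = ("welcome", event["user_id"])
    if replaying or key in sent_emails:
        return                      # external effects are never replayed
    sent_emails.add(key)
    print(f"emailing {event['user_id']}")

# First delivery sends; a later replay of the same event does not.
send_welcome_email({"user_id": "u1"}, replaying=False)
send_welcome_email({"user_id": "u1"}, replaying=True)
```

Formalising this in the platform, rather than in each handler, is what lets you replay a whole stream at a new processor without fearing a flood of duplicate emails.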
Although not a criterion for success with event-driven architecture, we at Dolittle believe that an ambition of decentralising most of what we do benefits the running environment in many ways. It increases reliability and reduces the need for a centralised piece that has its own scaling rules and needs to be clustered. Instead, we spread the load across the different microservices and use a natural, formalised partitioning strategy for what is effectively a cluster of microservices rather than a cluster of well-defined centrepieces.
As this article started off with, caching is a super complex problem: how do you know when to invalidate? Is it an arbitrary expiration, or do we do it on a regular cadence? This is really one of the biggest promises of an Event-Driven Architecture: the cache problem gets solved by knowing exactly when to invalidate, namely when the event occurred. We can then leverage the cache in a reliable way and even rename it «current state». The current state will change all the time.
Of the two original problems described by Phil Karlton, we're left with naming. Naming is hard, and if that is the only problem we have left, we're in a good place; it should be hard. This is where you should invest your time: getting the language right for your software, the domain language. That is what Domain-Driven Design is all about, something we at Dolittle strongly believe in and work equally hard to make it easier to write software adhering to.
How it all fits together
At Dolittle, we take a holistic approach to what we do and make painstaking efforts thinking about how it all should fit together. The figure below shows how systems built around our concepts use business events, and how this fits together with the observational data-points thinking described earlier.