Our Component Interaction Principles
The concept of microservices has seen ever-increasing adoption over the past several years and World Nomads Group is no exception to this phenomenon. As is typical of many organisations with a non-trivial legacy codebase, the transition from monolithic applications to truly independent, decoupled microservices is a massive change, and best handled in a gradual, iterative, evolutionary fashion — especially when there are many competing shorter-term delivery goals to meet as well.
Firstly, it’s worth clarifying that by ‘microservice’ I mean a business-domain-based, logically separate, release-isolated component or set of components. In this sense, we also use the term ‘bounded context’ to define a microservice. In a lot of ways, our thinking on microservices is informed by the ideas of Udi Dahan and Sam Newman.
While there is plenty of literature on the principles of microservices, there are often different ways to adopt each of these principles, and it can be a daunting task to make architectural decisions because distributed applications are inherently complex. For example, how should microservice components talk to each other? What can or should one component assume about another? What traps need to be avoided?
To help steer us in the right direction, to make the right trade-offs, and to bring order to an otherwise chaotic landscape, we decided to create a set of concrete guidelines for microservices component interaction. These are listed below and discussed.
1. Prefer asynchronous interactions over synchronous
If we simply had a web of APIs that all performed HTTP requests between each other, we would have a highly synchronous architecture. The originator would have to wait for the entire cascade of requests triggered by their original request to complete before receiving a response. Do all of these have to happen in-band? Can some of the operations performed be off-lined? Would that improve responsiveness? What about the fact that all of these components are temporally coupled to each other? If a request cannot be processed, it is completely lost; it cannot be retried later after some sort of recovery. On the other hand, if interactions can be made asynchronous (i.e. offline) then there is far better decoupling. Business operations can be broken up into logical chunks rather than everything having to work with pinpoint precision (remember this ad?).
Now of course not everything can be offlined (e.g. when a purchase request is made, we HAVE to synchronously hit the payment gateway before a policy can be issued), but if we can pull apart the many operations that certainly can be offlined, the result will be a much tighter and cleaner synchronous business operation.
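As a rough sketch of what this separation can look like in code, the Python snippet below keeps only the payment call in-band and publishes an event for everything that can be offlined. The SNS topic, payload shape and payment gateway client are hypothetical, not our actual implementation.

```python
# A minimal sketch (Python + boto3): keep the truly synchronous step in-band,
# offload the rest as a published event. Names and payloads are illustrative.
import json
import boto3

sns = boto3.client("sns")
POLICY_EVENTS_TOPIC_ARN = "arn:aws:sns:ap-southeast-2:123456789012:policy-events"  # hypothetical

def purchase(quote, payment_details, payment_gateway):
    # Must happen in-band: no policy can be issued without a successful payment.
    payment_result = payment_gateway.charge(payment_details)

    # Everything else (confirmation email, document generation, analytics)
    # can be offlined: publish an event and return immediately.
    sns.publish(
        TopicArn=POLICY_EVENTS_TOPIC_ARN,
        Message=json.dumps({"quoteId": quote["id"], "paymentRef": payment_result.reference}),
        MessageAttributes={"topic": {"DataType": "String", "StringValue": "policy.purchased"}},
    )
    return payment_result
```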
2. Prefer topic-based publish-subscribe to polling
With asynchronous or offline-able interactions using a message queue, we gain the ability to trigger a worker when a message is published. This creates a highly responsive system when load is low and makes bottlenecks easy to identify when load is high (just look for the most backed-up queues).
If, instead, the worker has to poll a data store for new items, that creates additional complexity around the appropriate polling frequency, how much data to fetch at a time, deleting data that has been processed, and so on. Furthermore, the time it takes to process a message can vary from near real time all the way up to the poll interval.
Note that some message queue mechanisms have HTTP interfaces (such as Amazon SQS), meaning that workers need to poll for queued items, but often in these cases the queue service itself has wait features (e.g. SQS Long Polling) so that the client only needs to implement a simple perpetual loop.
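For illustration, here is a minimal sketch of that perpetual loop using Python and boto3; the queue URL and the message handler are placeholders, not our actual implementation.

```python
# A simple perpetual loop over an SQS queue using long polling.
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.ap-southeast-2.amazonaws.com/123456789012/quotes-worker"  # hypothetical

def run_worker(handle_message):
    while True:
        # WaitTimeSeconds=20 enables long polling: the call blocks until a
        # message arrives or 20 seconds pass, so the worker is not hammering SQS.
        response = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
        )
        for message in response.get("Messages", []):
            handle_message(message["Body"])
            # Only delete after successful processing; otherwise the message
            # reappears after the visibility timeout and is retried.
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
```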
While a message queue is a great asynchronous transport mechanism, it’s not quite publish-subscribe, because a queue cannot have two subscribers that perform different tasks, at least not without some major coupling and hackery. Sure, you might only have one publisher and one consumer today, but what happens when one more type of consumer is needed? Should the publisher have to know about this and publish to two queues? How do we handle scenarios where one queue publish succeeds but the other fails? Or do we just throw in the towel on independent components and put both responsibilities into the one consumer component? Clearly, we need to cater for this.
To support different types of consumers we need one published message to be replicated onto multiple queues reliably and with delivery guarantees, so that each queue can be dedicated to a consumer. Fortunately, various messaging frameworks support this out of the box, e.g. RabbitMQ with its concept of exchanges and queues, and on AWS, connecting multiple queues to an SNS topic.
Now, what if a consumer only cares about a subset of the messages published? This is quite easy to support with the concept of a message topic (a message attribute set by the publisher), with consumers opting into messages whose topic matches a particular regex pattern. On RabbitMQ this also works out of the box, but on AWS it needs a small amount of custom code on the consumer end.
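As a rough illustration of that consumer-side code, the sketch below (Python) assumes the publisher sets a ‘topic’ message attribute and the message arrives on SQS wrapped in the standard SNS envelope (i.e. raw message delivery is not enabled); the attribute name and regex pattern are hypothetical.

```python
# Consumer-side topic filtering: keep only messages whose 'topic' attribute
# matches the pattern this consumer has opted into.
import json
import re

SUBSCRIBED_PATTERN = re.compile(r"^quote\.(created|updated)$")  # hypothetical

def extract_topic(sqs_body: str) -> str:
    envelope = json.loads(sqs_body)  # the SNS notification wrapping the message
    return envelope.get("MessageAttributes", {}).get("topic", {}).get("Value", "")

def should_process(sqs_body: str) -> bool:
    return bool(SUBSCRIBED_PATTERN.match(extract_topic(sqs_body)))
```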
3. Implement Retries, Back-offs and Dead-letter queues
Distributed systems are inherently dependent on network reliability to operate, yet networks are notoriously fallible. Mitigating the impact requires us to build fault tolerance and some kind of escalation path into our standard patterns.
Firstly, on the publish side, implement a limited number of retries with exponential back-off: maybe three retries, 10, 20 and 40 seconds apart. If it’s a minor connectivity blip, the publish will just eventually succeed. Yes, there is a risk of message duplication, but as mentioned below, that situation must be handled by the downstream consumers anyway. If a message still could not be published, then we have an error that needs to be logged and alerts raised based on the criticality of the operation. This is where log aggregation and alerting off log message contents can be really useful. We implement this using Sumologic real-time searches and integration with Pager Duty, which encapsulates the escalation path.
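A minimal sketch of that publish-side policy, assuming a generic publish function and standard Python logging; the delays are just the example values mentioned above, and the escalation happens off the back of the error log entry.

```python
# Publish with a limited number of retries and exponential back-off,
# then log an error for the alerting pipeline if all attempts fail.
import time
import logging

logger = logging.getLogger("publisher")

def publish_with_retries(publish, message, delays=(10, 20, 40)):
    last_error = None
    for wait in [0] + list(delays):   # first attempt immediately, then back off
        time.sleep(wait)
        try:
            publish(message)
            return True
        except Exception as exc:      # a transient connectivity blip, we hope
            last_error = exc
    # Retries exhausted: log it so the alerting pipeline (e.g. a real-time
    # search over aggregated logs) can raise the alarm.
    logger.error("Failed to publish message after retries: %s", last_error)
    return False
```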
On the subscribe side, the approach is much the same, but the back-off waits can be longer, allowing more time for any environmental issues to resolve. The waits don’t necessarily need to be implemented in code if the messaging framework caters for retries; on AWS SQS, the visibility timeout and auto-dead-lettering features can be used to achieve this. Anything that is dead-lettered needs human intervention and therefore integration with the escalation path, in our case Pager Duty.
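On AWS, that subscribe-side behaviour is largely configuration rather than code. The sketch below (Python + boto3) attaches a redrive policy to a worker queue so that repeatedly failing messages land on a dead-letter queue; the queue URL, DLQ ARN and thresholds are hypothetical.

```python
# Configure consumer-side retries and dead-lettering on an SQS queue.
import json
import boto3

sqs = boto3.client("sqs")

WORKER_QUEUE_URL = "https://sqs.ap-southeast-2.amazonaws.com/123456789012/quotes-worker"  # hypothetical
DEAD_LETTER_QUEUE_ARN = "arn:aws:sqs:ap-southeast-2:123456789012:quotes-worker-dlq"       # hypothetical

sqs.set_queue_attributes(
    QueueUrl=WORKER_QUEUE_URL,
    Attributes={
        # How long a received-but-unacknowledged message stays invisible before
        # being redelivered; effectively the wait between consumer retries.
        "VisibilityTimeout": "300",
        # After 5 failed receives the message is moved to the dead-letter queue,
        # where it waits for human intervention via the escalation path.
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": DEAD_LETTER_QUEUE_ARN,
            "maxReceiveCount": "5",
        }),
    },
)
```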
4. Embrace eventual consistency
If we have a system where different components are subscribing to published messages and creating data independently in disparate data sources, we cannot rely on all data being consistent all the time.
Associations between data in different data sources need to be loose and our system needs to account for eventual consistency. It needs to handle states where the entire picture is not complete, as well as ensure that eventually (within some reasonable timeframe) all data WILL be consistent. This is yet another reason to implement Retries and back-offs.
5. Say NO to distributed transactions
Along the same lines as the above principle, in distributed systems it is a highly limiting and coupling move to introduce distributed transactions, i.e. any situation where a single event results in the mutation of two separate sources of data which cannot be committed atomically. Here is an excellent article about why distributed transactions are dangerous.
With microservices the situation where two separate data sources need to be committed as a result of a single business operation is very common, but the trick is in how those commits are orchestrated.
In our case, we have made the distributed transaction the message broker’s problem. Because the broker guarantees delivery to all subscribed queues, there can be multiple types of independent consumers, each taking care of its own data commit, i.e. independent transactions, with retries and idempotence (more on idempotence below) for fault tolerance and eventual consistency.
6. Idempotence to the max
Make every component as idempotent as possible. In other words, don’t trust an upstream component to only send you a message once. Most messaging mechanisms can only guarantee at-least-once delivery, which means a message may well be delivered multiple times. Even if it’s a synchronous API interaction, there’s still scope for duplicate calls (e.g. a successful response did not reach the client, so it tries again). If the data contains the information a component needs in order to handle the same message multiple times, we can avoid side-effects, which are often hard to trace.
One practice that helps with idempotence is to use system-generated IDs rather than database-generated ones. E.g. we changed the Quote ID from a database-generated auto-increment number to a GUID. Now a quote can be identified uniquely even before it has been persisted, so ‘upserting’ is possible. If, on the other hand, a Quote had a database-generated ID, then a quote without an identifier would have to be assumed to be new (or some other de-duping logic would have to be applied).
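A minimal sketch of that upsert, assuming a client-generated GUID as the quote ID and a Postgres-style table accessed through a psycopg2-like connection; the schema and connection handling are illustrative only.

```python
# Upsert keyed on a system-generated GUID: writing the same quote twice
# produces the same end state, which is exactly what idempotence needs.
import uuid

def new_quote_id() -> str:
    return str(uuid.uuid4())  # generated by the system, before persistence

def upsert_quote(conn, quote_id: str, payload: str):
    with conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO quotes (id, payload)
            VALUES (%s, %s)
            ON CONFLICT (id) DO UPDATE SET payload = EXCLUDED.payload
            """,
            (quote_id, payload),
        )
    conn.commit()
```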
7. Trust downstream components to be idempotent
This is the inverse of the above principle. Due to the fallacies of distributed systems, it is just about impossible to know for sure that a downstream component received and successfully processed a message. Retries are a logical workaround for this, but they require downstream components to be idempotent. So let’s set that expectation from the outset.
8. Handle Out-of-order Delivery
This is part of idempotence, or at least related to it. Make sure business logic is robust enough to tell what the logical order of a related group of messages needs to be. Messaging middleware usually doesn’t guarantee ordered delivery, and delayed retry mechanisms can easily cause a re-ordering of message processing. Similarly, if there are multiple consumer instances of a given type parallel-consuming a queue, then order is again lost. The idea here is not to rely on timestamps (because clocks can easily vary between different hosts), but to have some other mechanism, such as sequence numbers or explicit states, which allows the business logic to make ordering decisions. E.g. with purchases, we mandate that the purchase-initiated state always precedes the purchased state. Note that we won’t always care about order either; e.g. before purchase we don’t really care about the order of quote changes.
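A minimal sketch of ordering by explicit states rather than timestamps; the state names beyond purchase-initiated and purchased are illustrative.

```python
# Decide whether an incoming message should be applied based on explicit
# business states, not on when the message happened to arrive.
STATE_ORDER = {"quoted": 0, "purchase-initiated": 1, "purchased": 2}

def should_apply(current_state: str, incoming_state: str) -> bool:
    # Ignore (or park) a message that would move the record backwards,
    # e.g. a late-arriving 'purchase-initiated' after 'purchased'.
    return STATE_ORDER[incoming_state] > STATE_ORDER[current_state]
```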
9. Do not trust a Client to post sane data
This generally relates to components that are user-facing, e.g. an API that is exposed to UI clients that end users interact with. A client might not be something that World Nomads Group develops or has control over. Over and above this, if an API is public-facing, ANYONE can fire up Postman and create and post a request. In fact, they don’t even need Postman; they can do it all via the Swagger docs page of the API. So an API cannot trust that a client will always push sane data to it. The first line of defense (assuming the API is written in a type-safe language) is deserializing the input into a domain object. If garbage is posted, it will probably fail to deserialize and the result should rightly be a 400 Bad Request response. Assuming deserialization succeeds, the API must exhaustively run the object through all relevant validation rules. In other words, don’t assume the UI client has already checked these validation rules. It might well have, but because we can’t trust the client, we run the validation rules again. Upon validation failure, the sensible response is a 422 Unprocessable Entity. This might seem like unnecessary repetition of code (in the client and in the API), but that is a small price to pay for maintaining the integrity of the API.
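A minimal sketch of those two lines of defense in Python, with an illustrative request shape and validation rules: a deserialization failure yields a 400, a rule failure yields a 422, and the rules run regardless of what the client claims to have checked.

```python
# Never trust the client: deserialize first, then re-run all validation rules.
import json
from dataclasses import dataclass

@dataclass
class QuoteRequest:          # illustrative domain object
    destination: str
    travellers: int

def handle_post(raw_body: str):
    try:
        quote = QuoteRequest(**json.loads(raw_body))   # first line of defense
    except (ValueError, TypeError):
        return 400, {"error": "Malformed request body"}

    errors = []
    if not isinstance(quote.travellers, int) or quote.travellers < 1:
        errors.append("At least one traveller is required")
    if not quote.destination:
        errors.append("Destination is required")
    if errors:
        return 422, {"errors": errors}                 # valid shape, invalid content

    return 201, {"status": "created"}
```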
Wrapping it Up
Any distributed architecture is a double-edged sword. It brings the promise of scale, responsibility segregation, complexity control, high availability and, of course, let’s not forget, engineering cred. In reality, it reveals that there is no free lunch. One type of complexity is traded off for another, and those high-end promises require certain prerequisites to be in place. Otherwise, there is often chaos, waste, misery and that dreaded slow poison, coupling.
This is why we decided to create these principles, or rules of engagement: to bring order, some level of certainty, and common patterns that deliver consistency and reliability.