Event Driven Architecture — Part 2
Part 1: https://medium.com/@foreverdeepak/event-driven-architecture-part-1-eeed34f4bd72
Subscriptions
A subscription is a group of consumers created over a topic. You can create multiple subscriptions on a topic to handle multiple use cases. For example, you can create notification and fulfillment subscriptions over an oms-order topic: the notification subscription notifies users of order changes, while the fulfillment subscription drives order fulfillment.
A subscription must be associated with only one service and may accommodate multiple consumers depending on the scale. Messages are replicated across these subscriptions and are distributed within the subscription. For instance, the same message, E1, is processed by all subscriptions associated with the topic. However, within each subscription, only one consumer is permitted to process the message.
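The fan-out behaviour described above can be sketched in a few lines of Python. The `Topic` class, the `create_subscription` method, and the round-robin distribution below are illustrative assumptions for this article, not any vendor's API: each published event is replicated to every subscription, but within a subscription only one consumer receives it.

```python
from collections import deque

class Topic:
    """Minimal in-memory sketch: every subscription gets a copy of each
    event; within a subscription, each event goes to exactly one consumer."""
    def __init__(self):
        self.subscriptions = {}

    def create_subscription(self, name, num_consumers):
        self.subscriptions[name] = {
            "consumers": [[] for _ in range(num_consumers)],
            "next": 0,
        }

    def publish(self, event):
        # Fan out: the event is replicated to every subscription.
        for sub in self.subscriptions.values():
            # Within the subscription, round-robin to a single consumer.
            idx = sub["next"] % len(sub["consumers"])
            sub["consumers"][idx].append(event)
            sub["next"] += 1

topic = Topic()
topic.create_subscription("notification", num_consumers=2)
topic.create_subscription("fulfillment", num_consumers=2)
for e in ["E1", "E2", "E3", "E4"]:
    topic.publish(e)
```

After publishing, both the notification and fulfillment subscriptions have seen all four events, but each event was handled by exactly one consumer per subscription.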
In EDA, the ordering of events is critically important: a subscriber should receive events in the same order in which they were published.
You can use the following types of subscriptions in your application.
Shared Subscription
Shared subscriptions operate concurrently, enabling multiple consumers to simultaneously pull data from a topic for processing. Examples of services that support shared subscriptions include RabbitMQ, Google Pub/Sub, and AWS SQS. While this type of subscription functions somewhat like a FIFO queue, it differs in that when multiple consumers are assigned to a single queue, the messages within the queue are distributed across the consumers, potentially creating a race condition.
To better illustrate this race condition, let’s consider an example from an e-commerce application. Imagine there’s an ‘order’ topic owned by the Order Management System (OMS) service. This service publishes order events, either directly from the application layer or indirectly through Change Data Capture (CDC), to this ‘order’ topic. In the diagram below, the ordering key corresponds to the order ID of the event.
Service A and Service B each have two consumers, namely C1 and C2, with the expectation that a consumer processes events sequentially. For example, consumer C1 from Service A processes events E1, E3, E5, and E8 in sequence. Since Service A utilizes a shared subscription, S1, events with ordering keys O1 and O2 are distributed across consumers C1 and C2.
In real-world scenarios, such as e-commerce systems, sequentially ordered events with the same ordering key typically have reasonable time intervals between them. Consequently, even when these events are distributed among consumers, the likelihood of multiple consumers processing the same ordering key events in parallel is relatively low. However, it’s essential to acknowledge that this is not an absolute certainty, and there are still situations where you may encounter parallel processing or events being processed out of order due to various factors.
- The system itself produces events for a domain entity (same ordering key) at high frequency.
- Service downtime for an extended period can lead to a significant backlog of events in the queue. When the service becomes active again, it may result in a surge of traffic across all consumers, potentially leading to parallel processing.
- In the example above, event E6 has a later event time (T6) than E3 (T3). If consumers are slow for any reason, E6 may be processed before E3, since the two events are allocated to different consumers.
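The race condition can be made concrete with a short sketch. The round-robin distribution and the event/key names below are illustrative assumptions: a shared subscription hands events to consumers without regard to their ordering key, so two consumers can end up holding events for the same key.

```python
from collections import defaultdict

# Events as (ordering_key, name), published in this order.
events = [("O1", "E1"), ("O1", "E2"), ("O2", "E3"), ("O1", "E4")]

# A shared subscription distributes events across consumers without
# regard to ordering key (round-robin here for illustration).
consumers = defaultdict(list)
for i, (key, name) in enumerate(events):
    consumers[f"C{i % 2 + 1}"].append((key, name))

# Both C1 and C2 now hold events for ordering key O1, so E1 and E2
# can be processed in parallel -- the race condition described above.
keys_on_c1 = {k for k, _ in consumers["C1"]}
keys_on_c2 = {k for k, _ in consumers["C2"]}
```

Because O1 appears on both consumers, nothing prevents them from processing O1’s events concurrently or out of order.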
Furthermore, event categorization plays a crucial role. In some scenarios, multiple services, even from entirely different modules, may generate events for the same domain entity. It is imperative to categorize and separate such events using distinct topics for each service. For instance, producer services X and Y should utilize separate topics, such as X-topic and Y-topic, to ensure clear differentiation.
Detailed solutions for addressing duplicity, out-of-order events, and parallel processing are described in the ‘Best Practices to build the Subscribers’ section.
Ordered Subscription
Ordered subscriptions follow a sequential approach and achieve scalability by partitioning consumers based on ordering keys. When a message is published, an ordering key is attached to the event, allowing it to be allocated to a specific partition. In summary, events with the same ordering key are processed sequentially by only one consumer within the subscription, ensuring order preservation and preventing race conditions.
Kafka, Google Pub/Sub, Apache Pulsar, and AWS SQS are a few examples that support ordered subscriptions.
Different MQ vendors use different architecture to support ordered subscriptions.
Kafka provides clients with the flexibility to commit the latest offset to progress in processing. In contrast, Google Pubsub, Apache Pulsar, and AWS SQS follow the ACK/NACK approach.
Ordered subscriptions typically employ a partitioning approach in which events whose ordering keys hash to the same value are grouped within the same partition.
Kafka enforces ordering by permitting only one consumer to process events from a partition within a subscription. The responsibility of moving forward within a partition lies with the consumer, which must commit the latest offset to progress in the stream. This design shifts the complete pulling responsibility to the client layer, necessitating that the client be equipped to handle failures effectively to maintain stream progress. In cases where the client fails to recover from a problematic event, it can result in a ‘stop-the-world’ scenario within that partition, affecting not only the processing of events with the problematic ordering key but also all other ordering keys within the same partition.
In contrast, Google Pubsub, SQS, and Pulsar offer consumers the ability to move forward using cumulative ACKs. Instead of directly associating consumers with partitions, these platforms handle consumer load balancing at the broker side. These vendors ensure that once a consumer has received an event for a specific ordering key (let’s say ‘X’), it will continue to receive events with the same ordering key throughout its lifecycle. They employ various load balancing algorithms such as sticky sessions, simple hashing, or consistent hashing for distributing the workload among consumers. In the event that a client fails to handle a specific event, it doesn’t lead to a ‘stop-the-world’ scenario as seen in Kafka. Rather, it affects only the processing of events associated with that particular ordering key. Other ordering key events can continue to be processed in sequence without any interruption.
There are a few drawbacks of the ACK approach that should be taken into consideration when implementing ordered subscriptions.
- Some vendors, such as Apache Pulsar, may deliver messages out of order in the case of failover or retry, which violates the ordering guarantee of an ordered subscription.
- Some SDKs or frameworks may not be well-suited for ordered subscriptions and could lead to unintended processing of subsequent events tied to the same ordering key event that had previously failed within the same pull batch.
Best Practices to build the Subscribers
Isolation
It is crucial to prevent simultaneous processing of two events within the same category that are bound to the same ordering key. Shared subscriptions can introduce parallel processing when multiple consumers are active, underscoring the importance of maintaining proper sequencing. To ensure accurate sequencing in shared subscriptions, the use of exclusive locks is essential. It’s important to highlight that ordered subscriptions, as explained in the ‘Ordered Subscription’ section, do not necessitate locks to achieve proper sequencing.
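A per-ordering-key exclusive lock can be sketched as follows. This is a process-local illustration using threads; in a real shared-subscription deployment with multiple consumer instances, the lock would have to be distributed (for example, held in Redis or a database row), which this sketch does not implement.

```python
import threading
from collections import defaultdict

class KeyedLocks:
    """Per-ordering-key exclusive locks for a shared subscription.
    Process-local sketch; a multi-instance deployment needs a
    distributed lock instead."""
    def __init__(self):
        self._guard = threading.Lock()
        self._locks = defaultdict(threading.Lock)

    def acquire(self, ordering_key):
        with self._guard:
            return self._locks[ordering_key]

processed = []
locks = KeyedLocks()

def handle(event):
    # Only one consumer thread at a time may process events for this key.
    with locks.acquire(event["key"]):
        processed.append(event["name"])

threads = [threading.Thread(target=handle, args=({"key": "O1", "name": n},))
           for n in ("E1", "E2")]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Both events for key O1 are processed, but never concurrently, because each handler must hold the key’s lock first.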
Idempotency
Idempotency is a concept commonly used in computer science and various other fields to describe operations or functions that, when applied multiple times, produce the same result as if they had been applied only once. In other words, an idempotent operation can be repeated or retried without causing a different outcome or side effects after the initial application.
The occurrence of duplicate events within the system can result from various factors, such as double submissions or retries. While there are methods to mitigate this issue in many scenarios, achieving a 100% prevention rate may not always be feasible. The system must possess idempotent capabilities to ensure that if it processes the same event multiple times, it can effectively recognize and ignore duplicate instances of the event during subsequent processing.
In certain scenarios, a consumer might have partially processed an event but failed during the process. In such cases, if a retry is attempted, it’s crucial to ensure that the system doesn’t repeat the steps that were already processed. The consumer should be able to skip those completed steps and proceed to the next unprocessed step to successfully complete the workflow.
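Both ideas, ignoring duplicate events and skipping already-completed steps on retry, can be sketched together. The `process` function and the in-memory stores below are illustrative assumptions; in production the processed-event IDs and step checkpoints would live in a persistent store.

```python
processed_event_ids = set()   # in production: a persistent store
completed_steps = {}          # event_id -> set of finished step names

def process(event_id, steps):
    """Idempotent processing: skip events seen before, and within an
    event skip steps that already completed on a previous attempt."""
    if event_id in processed_event_ids:
        return "duplicate-ignored"
    done = completed_steps.setdefault(event_id, set())
    for name, action in steps:
        if name in done:
            continue            # finished in an earlier, failed attempt
        action()
        done.add(name)
    processed_event_ids.add(event_id)
    return "processed"

# A step that fails on its first attempt, to simulate a partial failure.
calls = []
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise RuntimeError("transient failure")
    calls.append("charge")

steps = [("reserve", lambda: calls.append("reserve")), ("charge", flaky)]
try:
    process("evt-1", steps)    # first attempt fails mid-way
except RuntimeError:
    pass
result = process("evt-1", steps)  # retry: 'reserve' is skipped, not redone
```

On retry the consumer resumes at the failed step, and a third delivery of the same event is recognized as a duplicate and ignored.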
Parking out of order
While it is highly unlikely for events to be out of order at the producer’s end, this can occasionally occur due to factors like clock skew or network traffic congestion. For instance, it is possible that an event with the state ‘ORDER_CANCELLED’ may have a timestamp earlier than that of an ‘ORDER_CREATED’ event and is pushed to the topic in a different order than expected. Alternatively, when using shared subscriptions, the system may still encounter situations where events are received out of order, even if they were generated at the correct times.
The system must have the capability to park such events and process them once the right sequence occurs. This can be achieved in the following ways.
Retry with Back off delay
Multiple brokers offer the feature of retrying with a configured frequency. It’s essential to ensure that two sequential retries do not occur with very little time between them. Consecutive retries should be exponentially delayed to prevent overloading the consumer. When retries are exhausted, the event is queued to a Dead Letter Queue (DLQ) for manual processing. However, this approach may not be ideal for high-traffic applications due to the potential volume of events in the DLQ.
In the case of an out-of-order event, you can send a negative ACK to the broker. By the time the out-of-order event is retried, you may have already processed the correct event, ensuring proper sequencing. This approach is particularly relevant for shared subscriptions.
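The delay schedule for such retries is typically exponential with jitter. The helper below is a generic sketch of that calculation (the parameter names `base` and `cap` are illustrative, not a broker setting): the upper bound doubles per attempt, and randomness spreads retries out so a burst of failures does not hammer the consumer all at once.

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter: the delay is drawn uniformly
    from [0, min(cap, base * 2^attempt)] to avoid retry storms."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))

# Upper bound of the delay window for the first five attempts.
bounds = [min(60.0, 1.0 * 2 ** a) for a in range(5)]
```

With these defaults the retry windows grow 1s, 2s, 4s, 8s, 16s and are capped at 60s, after which exhausted retries would route the event to the DLQ as described above.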
Park in DB and lookup
This approach closely resembles event sourcing, where an append-only table is used to store events in the subscriber service’s database. When an out-of-order event is received, it isn’t immediately processed but is instead saved in the event store. When the correct event is subsequently received, both events are processed in sequence. This approach is considered a safer method as it ensures that actions are taken only upon receiving the events in the correct sequence. It is well-suited for both shared and ordered subscriptions.
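The park-and-lookup flow can be sketched with a small state machine. The `SEQUENCE` of order states, the `handle` function, and the in-memory `parked`/`applied` dicts are illustrative assumptions; in the real approach the parked events would sit in an append-only table in the subscriber’s database.

```python
# Expected state order for an order entity (illustrative).
SEQUENCE = ["ORDER_CREATED", "ORDER_CANCELLED"]

parked = {}    # order_id -> states received too early
applied = {}   # order_id -> index of last applied state

def handle(order_id, state):
    """Apply events strictly in SEQUENCE order; park early arrivals and
    drain them once their predecessor has been applied."""
    expected = applied.get(order_id, -1) + 1
    if SEQUENCE.index(state) != expected:
        parked.setdefault(order_id, []).append(state)  # out of order: park
        return []
    applied[order_id] = expected
    applied_now = [state]
    # Look up parked events that are now next in sequence.
    while parked.get(order_id):
        nxt_idx = applied[order_id] + 1
        nxt = SEQUENCE[nxt_idx] if nxt_idx < len(SEQUENCE) else None
        if nxt in parked[order_id]:
            parked[order_id].remove(nxt)
            applied[order_id] = nxt_idx
            applied_now.append(nxt)
        else:
            break
    return applied_now

# ORDER_CANCELLED arrives before ORDER_CREATED.
first = handle("o1", "ORDER_CANCELLED")   # parked, nothing applied
second = handle("o1", "ORDER_CREATED")    # both applied, in order
```

When the early ‘ORDER_CANCELLED’ arrives it is parked; once ‘ORDER_CREATED’ lands, both events are applied in the correct sequence.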
Pull rather than Push
The pull approach provides clients with better control over processing events within their capacity, while the push approach can potentially overwhelm consumers when dealing with high loads.
Slow consumers
It’s essential to establish the correct heartbeat interval for consumers and the appropriate acknowledgment time to ensure that brokers don’t mistakenly assume a consumer is down or that an event couldn’t be processed. In such cases, the broker might end up resending the same batch to another (or the same) consumer.
Handling Unrecoverable Exceptions
Whenever unrecoverable exceptions occur, it’s advisable to park the events. These exceptions may include cases like receiving a 503 error from an API or database connection pool outages. If the frequency of such exceptions increases, it’s important to have the capability to pause the consumers until the underlying issue is resolved.
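The pause-on-repeated-failure behaviour is essentially a circuit breaker. The class below is a minimal sketch under simple assumptions (a plain consecutive-failure counter and manual resume); a production version would track failures over a time window and often re-probe automatically.

```python
class ConsumerCircuitBreaker:
    """Pause consumption when unrecoverable errors (e.g. repeated 503s
    or a DB connection pool outage) exceed a threshold."""
    def __init__(self, max_failures: int = 3):
        self.max_failures = max_failures
        self.failures = 0
        self.paused = False

    def record_success(self):
        self.failures = 0          # a success resets the streak

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.paused = True     # stop pulling until the root cause is fixed

    def resume(self):
        self.failures = 0
        self.paused = False

breaker = ConsumerCircuitBreaker(max_failures=3)
for _ in range(3):
    breaker.record_failure()
# breaker.paused is now True; the consumer stops pulling new events
# until an operator (or a health probe) calls resume().
```

While paused, events simply accumulate in the subscription rather than being burned through retries against a broken dependency.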