As promised, this is the second part of a two part series on key patterns of event-driven messaging designs that have been implemented at Wix and that have facilitated creating a robust distributed system that can easily handle increasing traffic and storage needs by more than 1400 microservices.
I recommend reading part 1 first, where I write about ‘consume and project’, ‘event-driven end to end’, and ‘in memory Key-Value stores’. Here are the next 3 patterns:
4. Schedule and Forget
…when you need to make sure scheduled events are eventually processed
There are many cases where Wix microservices are required to execute jobs according to some schedule.
One example is Wix Payments Subscriptions Service that manages subscription-based payments (e.g. subscription to Yoga classes).
For each user that has a monthly or yearly subscription, a renewal process with the payment provider has to take place.
To that end, a Wix custom Job Scheduler service invokes a REST endpoint pre-configured by the Payments Subscription service.
The subscription renewal process happens behind the scenes without the need for the (human) user to be involved. That is why it is important that the renewal will eventually succeed even if there are temporary errors — e.g. 3rd payment provider is unavailable.
One way to make sure this process is completely resilient is to have a frequently recurring request by the job scheduler to the Payment Subscriptions service where the current state of renewals is kept in DB and polled on each request for the expiring renewals that have yet to be extended. This will require pessimistic/optimistic locking on the DB because there could be multiple subscription extension requests for the same user at the same time (from two separate ongoing requests).
A better approach would be to first produce the request to Kafka. Why? Handling the request will be done sequentially (per specific users) by a Kafka consumer, so there is no need for a mechanism for synchronization of parallel work.
Additionally, once the message is produced to Kafka, we can make sure that it will eventually be successfully processed by introducing consumer retries. The schedule for requests can be much less frequent as well due to these retries.
In this case we want to make sure that handling order is maintained, so the retry logic can simply be sleeping between attempts with exponential backoff intervals.
Wix developers use our custom-made Greyhound consumers, so they only have to specify a BlockingPolicy with the appropriate retry intervals for their needs.
There are situations where a lag can build up between the consumer and the producer, in case of persisted error for a long time. In these cases there is a special dashboard for unblocking and skipping the message that our developers can use.
If message handling order is not mandatory, a non-blocking retry policy also exists in Greyhound that utilizes “retry topics”.
When a Retry Policy is configured, the Greyhound Consumer will create as many retry topics as retry intervals the user defined. A built-in retry producer, will upon error, produce the message to the next retry topic with a custom header specifying how much delay should take place before the next handler code invocation.
There is also a dead-letter-queue for a situation where all retry attempts were exhausted. In this case the message is put in the dead-letter-queue for manual review by a developer.
This retry mechanism was inspired by this uber article.
Wix has recently open-sourced Greyhound and it will soon be available to beta users. To find out more, you can read the github readme.
- Kafka allows for sequential processing of requests per some key (e.g. userId to have subscription renewal) that simplifies worker logic
- Job schedule frequency for renewal requests can be much lower due to implementation of Kafka retry policies that greatly improve fault tolerance.
5. Events in Transactions
…when idempotency is hard to achieve
Consider the following classic eCommerce flow:
Our Payments service produces an Order Purchase Completed event to Kafka. Now the Checkout service is going to consume this message and produce its own Order Checkout Completed message with all the cart items.
Then all the downstream services (Delivery, Inventory and Invoices) will need to consume this message and continue processing (prepare the delivery, update the inventory and create the invoice, respectively).
This implementation of this event-driven flow will be much easier if the downstream services can rely on the Order Checkout Completed event to only be produced once by the Checkout service.
Why? Because processing the same Checkout Completed event more than once can lead to multiple deliveries or incorrect inventory. In order to prevent this by the downstream services, they will need to store de-duplication state, e.g., poll some storage to make sure they haven’t processed this Order Id before.
This is usually implemented with common DB consistency strategies like pessimistic locking and optimistic locking.
Fortunately, Kafka has a solution for such pipelined events flow, where each event is handled exactly once, even when a service has a consumer-producer pair (e.g. Checkout) that both consumes a message AND produces a new message as a result.
In a nutshell, when the Checkout service handles the incoming Payment Completed event, it needs to wrap the sending of the Checkout Completed event inside a producer transaction, it also needs to send the message offsets (to allow the Kafka broker to track duplicate messages).
Any messages produced during this transaction will only be visible to the downstream consumer (of Inventory Service) once the transaction is complete.
Also, the Payment Service Producer at the start of the Kafka-based flow has to be turned into an Idempotent producer — meaning that the broker will discard any duplicated messages it produces.
For more information you can watch my short intro talk on Exactly once semantics in Kafka
6. Events Aggregation
… when you want to know that a complete batch of events have been consumed
In my last post, I described a business flow at Wix for Importing Contacts into the Wix CRM platform. The backend includes two services. A jobs service that is provided with a CSV file and produces job events to Kafka. And a Contacts Importer Service that consumes and executes the import jobs.
Let’s assume that sometimes the CSV file is very big and that it is more efficient to split the workload into smaller jobs with fewer contacts to import in each of them. This way the work can be parallelized to multiple instances of the Contacts Importer service. But how do you know when to notify the end-user that all contacts have been imported when the import work has been split into many smaller jobs?
Obviously, the current state of completed jobs needs to be persisted, otherwise in-memory accounting of which jobs have been completed can be lost to a random Kubernetes pod restart.
One way to persist this accounting without leaving Kafka, is by using Kafka Compacted Topics. This kind of topic can be thought of as a streamed KV store. I’ve mentioned them extensively in pattern 3 of the first part of this article — In memory KV stores.
In our example the Contacts Importer service (in multiple instances) will consume the jobs with their indexes. Every time it finishes processing some job it needs to update a KV store with a Job Completed event. These updates can happen concurrently, so potential race conditions can occur and invalidate the jobs completion counters.
Atomic KV Store
In order to avoid race conditions, the Contacts Importer service will write the completion events to a Jobs-Completed-Store of type AtomicKVStore.
The atomic store makes sure that all job completion events will be processed sequentially, in-order. It achieves this by creating both a “commands” topic and a compacted “store” topic.
In the diagram below you can see how each new import-job-completed “update” message is produced by the atomic store with the [Import Request Id]+[total job count] as key. By using a key, we can rely on Kafka to always put the “updates” of a specific requestId in a specific partition.
Next a consumer-producer pair that is part of the atomic store, will first listen to each new update and then perform the “command” that was requested by the atomicStore user — In this case, increment the number of completed jobs by 1 from the previous value.
Example update flow end-to-end
Let’s go back to the Contacts Importer service flow. Once this service instance finishes processing some job, it will update the Job-Completed KVAtomicStore (e.g. Import Job 3 of request Id YYY has finished):
The Atomic Store will produce a new message to the job-completed-commands topic with key = YYY-6 and Value — Job 3 Completed.
Next, the Atomic Store’s consumer-producer pair will consume this message and increment the completed jobs count for key = YYY-6 of the KV Store topic.
Exactly Once Processing
Note that processing the “command” requests have to happen exactly once, otherwise the completion counters can be incorrect (false increments). Creating a Kafka transaction (as described in pattern 4 above) for the consumer-producer pair is critical for making sure the accounting remains accurate.
AtomicKVStore Value Update Callback
And finally, once the latest KV produced value of completed jobs count matches the total (e.g. 6 completed jobs for YYY import request), the user can be notified (via web socket — see pattern 3 of the first part of the article) on import completion. The notification can happen as a side-effect of the KV-store topic produce action — i.e. invoking a callback provided to the KV Atomic store by its user.
- The completion notification logic does not have to reside in Contacts Importer service, it can be in any micro-service, as this logic is completely decoupled from other parts of this process and only depends on Kafka topics.
- No scheduled polling needs to occur. The entire process is event-driven, i.e. handling of events in a pipeline fashion.
- There is no possibility of a race condition between jobs completion notifications or duplicate updates by using key-based ordering and exactly once Kafka transactions.
- Kafka Streams API is very natural for such aggregation requirements with API features as groupBy (group by Import Request Id), reduce or count (count completed jobs) and filter (count equal to number of total jobs) followed by the webhook notification side-effect.
For Wix, using the existing producer/consumer infrastructure made more sense and was less intrusive on our microservices topology.
What I take from this
Some of the patterns here are more commonplace than others, but they all share the same principles. By using an event-driven pattern, you get less boilerplate code (and polling, and locking primitives), and more resiliency (less cascading failures, more errors and edge cases handled). In addition the microservices are much less coupled to one another (producer does not need to know who consumes its data) and scaling out is easy as adding more partitions to the topic (and more service instances).
Thank you for reading!
If you’d like to get updates on my experiences with Kafka and event driven architecture, follow me on Twitter and Medium.
You can also visit my website, where you will find my previous blog posts, talks I gave in conferences and open-source projects I’m involved with.
If anything is unclear or you want to point out something, please comment down below.