Approaches to publishing 100% of events with an “at least once” delivery guarantee

Aug 29, 2023


In a system backed by an RDBMS, generate a message to the message broker after every insert/update while providing a 100% at-least-once delivery guarantee.


Authors : Harshil Raval, Imran Ahmad

System Description :
The realtime transactional system (fintech, ecom, etc.) has a transactions table named txns. On each transaction an entry is created in the table, and on each status update of the transaction that entry is updated. There are no deletes in this table. On every insert/update we have to send an event downstream, which caters to the following post-processing use cases: notifications and transaction history (statement of records).
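For concreteness, here is a hypothetical shape of a txns row and of the downstream event; the column and field names are illustrative assumptions, not part of the original design.

```python
# Hypothetical txns row, for illustration only.
# Assumed columns: id, user_id, amount, status, version, updated_at.
txn_row = {
    "id": 42,
    "user_id": 1001,
    "amount": 250.00,
    "status": "SUCCESS",       # updated on every status change
    "version": 3,              # optimistic-locking version, bumped on each update
    "updated_at": "2023-08-29T10:15:00Z",
}

# Event sent downstream on every insert/update of the row above.
txn_event = {
    "txn_id": txn_row["id"],
    "status": txn_row["status"],
    "version": txn_row["version"],     # lets consumers re-order / de-duplicate
    "occurred_at": txn_row["updated_at"],
}
```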

Approach 1 : outbox pattern
Use the transactional capabilities of the RDBMS to store the event while performing the transaction, and use the outbox table as a DLQ.

Steps :
1. Store the transaction id from the txns table in a txns_outbox table at the time of insert/update in the txns table. [utilising the ACID properties of the framework and RDBMS]
2. Before sending the response back, generate the event and delete the entry from the txns_outbox table. (Entry deletion is a separate DB transaction from step 1; if the deletion fails after the message is sent, that's fine.)
3. Run a cron at some fixed interval (e.g., 30 mins). This cron takes entries from the txns_outbox table, looks up the corresponding rows in txns, and processes them to generate events.
4. Once the event is generated, delete the entry from the txns_outbox table. (Steps 3 and 4 ensure that events are generated at least once: if the delete operation in step 2 fails, we have sent the event but don't know it was delivered, and the cron in step 3 ends up generating that event again. A sketch of these steps follows the list.)
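A minimal sketch of steps 1-4, assuming PostgreSQL via psycopg2 and Kafka via kafka-python. The txns / txns_outbox column shapes, the txn-events topic name, and the connection details are illustrative assumptions, not part of the original design.

```python
import json
import psycopg2
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def record_txn(conn, txn):
    """Step 1: write txns and txns_outbox in the same DB transaction (update path shown)."""
    with conn.cursor() as cur:
        cur.execute(
            "UPDATE txns SET status = %s, version = %s WHERE id = %s",
            (txn["status"], txn["version"], txn["id"]),
        )
        cur.execute("INSERT INTO txns_outbox (txn_id) VALUES (%s)", (txn["id"],))
    conn.commit()  # both writes commit atomically

def publish_and_clear(conn, txn):
    """Step 2: publish the event, then delete the outbox row (best effort)."""
    producer.send("txn-events", {"txn_id": txn["id"], "status": txn["status"], "version": txn["version"]})
    producer.flush()
    with conn.cursor() as cur:  # separate DB transaction; a failure here is acceptable
        cur.execute("DELETE FROM txns_outbox WHERE txn_id = %s", (txn["id"],))
    conn.commit()

def outbox_sweep(conn):
    """Steps 3-4: cron entry point; republishes anything still left in the outbox."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT t.id, t.status, t.version FROM txns_outbox o JOIN txns t ON t.id = o.txn_id"
        )
        rows = cur.fetchall()
    if not rows:
        return
    for txn_id, status, version in rows:
        producer.send("txn-events", {"txn_id": txn_id, "status": status, "version": version})
    producer.flush()
    with conn.cursor() as cur:
        cur.execute("DELETE FROM txns_outbox WHERE txn_id = ANY(%s)", ([r[0] for r in rows],))
    conn.commit()
```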

Approach 2 : CDC
Use a WAL reader/connector to generate events with a 100% delivery guarantee.

Steps :
1. Set up a WAL / transaction-log reader on the RDBMS. (Use a Debezium connector for the Kafka ecosystem.)
2. Set up a topic/queue in the messaging middleware (Kafka): Q1. This topic receives all the events generated by the Debezium connector.
3. The Txns platform has a consumer which listens to Q1 and, based on its internal business logic, figures out which messages to transform and generate events for.
4. Set up a topic/queue in the messaging middleware (Kafka): Q2. This topic receives all the events generated by the Txns platform.
5. The Txns platform publishes this transformed event to Q2, and acknowledges the original message in Q1 only after publishing the event to Q2. (If the business logic decides the message should not be published, acknowledge it in Q1 without publishing to Q2; a consumer sketch follows the steps.)
6. The Notification Svc and Transaction History Svc listen to Q2 messages.
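A sketch of steps 3-5, assuming kafka-python with manual offset commits. The Q1/Q2 topic names, the Debezium payload shape, and the is_business_event / to_domain_event helpers are illustrative assumptions.

```python
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "q1-cdc-txns",                      # Q1: raw Debezium change events
    bootstrap_servers="localhost:9092",
    group_id="txns-platform",
    enable_auto_commit=False,           # ack manually, only after publishing to Q2
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def is_business_event(change):
    """Hypothetical business-logic filter: decide whether this CDC change
    should become a downstream event at all."""
    return change.get("op") in ("c", "u")   # Debezium create/update operations

def to_domain_event(change):
    """Hypothetical transformation from a raw CDC change into a usable event."""
    row = change.get("after", {})
    return {"txn_id": row.get("id"), "status": row.get("status"), "version": row.get("version")}

for msg in consumer:
    change = msg.value
    if is_business_event(change):
        producer.send("q2-txn-events", to_domain_event(change))  # publish to Q2 first...
        producer.flush()
    consumer.commit()  # ...then ack Q1 (or ack directly if no event is needed)
```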

Improvement opportunities in the above two solutions:
Approach 1 : outbox pattern
This clever approach exploits the RDBMS's ACID guarantees to provide a 100% message delivery guarantee. It is an elegant solution and easy to implement, but it has the following tradeoffs.
A. In this approach, we use the outbox table to keep track of messages and delete them only after a delivery ack from the broker (Kafka), which adds minor latency to the transaction flow. This added latency comes from the fact that, in the normal case, the approach sends the message during the critical transaction execution lifecycle. Hence, we are adding latency to a critical flow (the transaction) for a non-critical use case (event publication for comms).
B. Also, if message sending fails, we keep the info in the outbox table, and a cron later publishes the message and deletes the row from the outbox table. The nuance here is that the order of message delivery is not maintained. This can be easily solved by attaching a version number (the one we use for optimistic locking) to every message. Attaching a version number preserves causality, and that is more than sufficient for most systems.
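As an illustration of the versioning idea, a consumer can keep the highest version it has processed per transaction and drop anything older; the field names and in-memory store here are assumptions for the sketch.

```python
last_seen_version = {}  # txn_id -> highest version processed so far

def handle(event, process):
    """Drop messages that are older than what has already been processed."""
    txn_id, version = event["txn_id"], event["version"]
    if version <= last_seen_version.get(txn_id, -1):
        return False  # stale or duplicate delivery, safe to ignore
    last_seen_version[txn_id] = version
    process(event)    # caller-supplied downstream processing
    return True
```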

Approach 2 : CDC
This approach relies on CDC pipelines, which are a proven solution for data replication with a 100% replication guarantee. It has the following tradeoffs:
A. CDC on the whole database makes filtering messages at the first hop, before sending them out as clean, usable events, a tedious task: one must derive the business context from the technical context before sending out the message. To elaborate: CDC publishes a message on every insert/update/delete, so the first hop has to figure out whether a CDC message should result in an actual event (whether it is a transactional or non-transactional change), and to shape the CDC event into a usable event the application has to query the database again.
B. CDC cannot differentiate between a business event and a technical event.
A business event is an insert/update/delete that supports a business flow such as onboarding or a txn.
A technical event is an insert/update/delete that supports column addition/deletion, archival, or data clean-up after a bug.
For technical events, CDC will send messages that are useless for the pipeline, and application logic has to filter them out.
For example, when a column is added, all rows of the table may be updated with some default value, resulting in a flood of useless events.

Approach 3 : CDC + Outbox pattern

Core Idea :
We want to achieve the following while ensuring a 100% at-least-once delivery guarantee:
1. Minimise latency added to the critical flow to support non-critical use cases.
2. Minimise the number of resources (queues) needed to achieve the goal.
3. Minimise implementation complexity in terms of separate crons / filtering logic.

Steps :
1. Create a formatted (ready-to-consume) message in the transaction flow and persist it in txns_outbox within the same DB transaction.
2. Create a CDC pipeline on the txns_outbox table using a Debezium (or similar) connector.
3. Set up a clean-up policy for txns_outbox. We don't need indexing on txns_outbox.
4. Messages published by this CDC pipeline are ready to be consumed by consumers of non-critical flows such as txn communications to the user. (A sketch follows the steps.)
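A minimal sketch of steps 1 and 2, assuming PostgreSQL via psycopg2 and a Debezium Postgres connector registered through the Kafka Connect REST API. The payload shape, connector name, and connection values are illustrative placeholders, not part of the original design.

```python
import json
import psycopg2
import requests

def record_txn_with_outbox(conn, txn):
    """Step 1: persist the txn and a ready-to-consume message in one DB transaction."""
    message = json.dumps({
        "txn_id": txn["id"],
        "status": txn["status"],
        "version": txn["version"],   # keeps causality for consumers
    })
    with conn.cursor() as cur:
        cur.execute(
            "UPDATE txns SET status = %s, version = %s WHERE id = %s",
            (txn["status"], txn["version"], txn["id"]),
        )
        cur.execute(
            "INSERT INTO txns_outbox (txn_id, payload) VALUES (%s, %s)",
            (txn["id"], message),
        )
    conn.commit()  # txn row and outbox message commit atomically

# Step 2: register a Debezium connector that captures only the txns_outbox table.
connector_config = {
    "name": "txns-outbox-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "localhost",
        "database.port": "5432",
        "database.user": "debezium",
        "database.password": "secret",
        "database.dbname": "payments",
        "topic.prefix": "txns",
        "table.include.list": "public.txns_outbox",  # CDC only on the outbox table
    },
}
requests.post("http://localhost:8083/connectors", json=connector_config)
```

Debezium also ships an outbox event router transform (SMT) that can unwrap and route outbox rows, which may further reduce custom transformation code.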

Side effects of Approach 3 (CDC + Outbox):
Apart from solving the problems mentioned above, this approach has the following positive/negative side effects:
1. Benefit of txn ordering, because the data is inserted into txns_outbox at the time of the txn. [It's still recommended to attach a version to the message to maintain causality; this helps resolve any out-of-order issues due to consumer lag.]
2. Added latency in the critical flow to push the data into txns_outbox.
