Designing a System for Data Export: The Transactional Outbox Pattern (Part I)

Sulyz Andrey
Published in skyro-tech
6 min read · Oct 3, 2024

Recently, I have often been involved in discussions about how to implement data export to external systems. At first glance, it doesn’t seem complex, but it still sparks debate. So, I explored several implementations and decided to prepare a few articles to describe them.


In this article, we will discuss the simplest (in my eyes) implementation, using the most well-known pattern: the Transactional Outbox with PostgreSQL. I will use Java and Spring for the examples.

But before we dive into implementations, let’s discuss a system that we need to develop.

Purposes

First of all, the use cases for exporting business data should be explored. I am sure that many of us have an analytics platform in the company that consumes all business data. Additionally, if your data is important, other teams or services may use it to trigger processes, invalidate caches, etc. And for some of those processes we have SLAs and delivery semantics.

Requirements

Let’s look at several requirements for our implementations:

  • We need to send a business view of an internal event. It means that we must prepare a new message for each internal event.
  • We need to deliver a message at least once, so we can’t miss any messages.
  • We need to deliver a message as soon as possible. It means that we can not export data once an hour or less frequently.
  • We must be able to resend failed events both automatically and manually.

System Context

So, now we can look at the high-level architecture of this part of our system; each system component is explained below.

The high-level architecture
  • Business Services: typical web services that insert or update data and create an export event in the database within the same transaction (according to the Transactional Outbox Pattern).
  • Exporting Service: it reads events from the database, prepares a business view of each event, and sends it to external systems.
  • Analytics Platform/External System: clients that prepare business reports or trigger processes based on our events. They support deduplication of these events.

Okay, we understand the purposes, the requirements, and how our part of the system works. Let’s look at the design of the implementation now.

Design

For me, it’s the easiest solution if you need to develop data export the night before the release. Of course, it isn’t an optimal solution and won’t work efficiently as your data grows.

First of all, let’s look at the process by which the system will work:

  • The Business Service inserts data into the database and creates an event within the same DB transaction (it’s crucial to guarantee that we don’t lose any events or create an event for rolled-back data).
  • The Exporting Service polls the database every N seconds, finds events with a NEW status, processes them, prepares a business view, and exports it to the External Service.
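The two steps above can be sketched in plain Java as an in-memory model (no database), standing in for a single DB transaction. Class and method names (OutboxSketch, claimNewEvents, and the PROCESSED status) are illustrative, not taken from the article's repository:

```java
import java.util.*;

// A minimal in-memory sketch of the outbox flow described above.
public class OutboxSketch {

    public enum Status { NEW, PROCESSING, PROCESSED, ERROR }

    public static final class Event {
        public final long id;
        public final String payload;
        public Status status = Status.NEW;
        public Event(long id, String payload) { this.id = id; this.payload = payload; }
    }

    // "Database": business rows and outbox events written in one step,
    // standing in for a single DB transaction.
    public final List<String> businessRows = new ArrayList<>();
    public final List<Event> outbox = new ArrayList<>();
    private long nextId = 1;

    // Business Service: insert data and its export event together.
    public synchronized void insertWithEvent(String row) {
        businessRows.add(row);
        outbox.add(new Event(nextId++, row));
    }

    // Exporting Service: claim up to batchSize NEW events and mark them
    // PROCESSING (the SQL version below does this with one UPDATE).
    public synchronized List<Event> claimNewEvents(int batchSize) {
        List<Event> claimed = new ArrayList<>();
        for (Event e : outbox) {
            if (claimed.size() == batchSize) break;
            if (e.status == Status.NEW) {
                e.status = Status.PROCESSING;
                claimed.add(e);
            }
        }
        return claimed;
    }
}
```

Keeping the insert of the business row and the event in one atomic step is exactly what the real pattern gets from a database transaction.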

Now we need to create the outbox table. Business services will insert an export event into it within the same database transaction as the business tables. We can call it the Event table; the important fields are explained below.

  • Event Type: This table stores information about the type of each event. The max_retry_count column indicates how many times the Exporting Service may retry a task of a given type.
  • Event: This table stores information about individual events. The payload column contains information about business data (such as identifiers of a row in a database table) that we need to process and export. The processing_retries_count column shows how many retries have occurred. Additionally, each event has a status that indicates its current state. We should also make this table partitioned by the status field. One partition will handle intermediate statuses, and another will handle terminal statuses. This will make our queries more efficient.
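Based on the description above, the two tables might look roughly like this (column names come from the article; the exact types, the PROCESSED status name, and the partition names are my assumptions):

create table event_type (
    code            text primary key,
    max_retry_count int  not null
);

create table event (
    id                       bigserial,
    type_code                text  not null references event_type (code),
    payload                  jsonb not null,
    status                   text  not null default 'NEW',
    processing_retries_count int   not null default 0,
    updated_at               timestamptz not null default clock_timestamp(),
    -- in PostgreSQL, the partition key must be part of the primary key
    primary key (id, status)
) partition by list (status);

-- One partition for intermediate statuses, another for terminal ones.
create table event_active   partition of event for values in ('NEW', 'PROCESSING');
create table event_finished partition of event for values in ('PROCESSED', 'ERROR');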

The status model and the sequence diagram of the process are explained below.

I consider the Status Model clear, but I want to describe the PROCESSING status in more detail. This status is set before the event processing logic starts. It indicates that we have begun processing the task. If the task is not processed within 15 minutes (an arbitrary timeout, not meant for long-running tasks, of course), we consider the task processing to have failed, and we can either move the event back to NEW or set its status to ERROR.
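The timeout check itself is a one-liner; here is a sketch of it, with the 15-minute threshold mirroring the article's example value (the class and method names are illustrative):

```java
import java.time.Duration;
import java.time.Instant;

// Sketch of the PROCESSING-timeout check described above.
public class StalledCheck {

    // An event is considered stalled if it has sat in PROCESSING
    // at least as long as the configured timeout.
    public static boolean isStalled(Instant updatedAt, Instant now, Duration timeout) {
        return Duration.between(updatedAt, now).compareTo(timeout) >= 0;
    }
}
```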

Moving an event through statuses is a clear and simple solution. However, if you work with PostgreSQL, you may be aware of the issues related to frequent updates (I’m not sure about other relational databases, but some might have similar problems). If not, you should read this article about the MVCC implementation in PostgreSQL.

Now let’s look at the implementation.

Implementation

You can find the implementation in the repository. And I will describe it below.

Of course, it’s a simple solution for the initial implementation. But we know that the ‘first time’ can last a while. So, we can design this solution to be easy to scale and extend. Let’s look at the class diagram:

  • EventProcessingService is an interface for each EventType processor. It’s simple: you need to create a new enum value in the EventType class and implement the interface.
  • EventDispatcher is the main class of this implementation, which processes all events, retries them, or moves stalled events to an error state.

In my opinion, this class structure will work well, especially in the first phase of your project.
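The class structure described above can be sketched in plain Java without the Spring wiring. The EventType values and processor names here are illustrative, not from the repository:

```java
import java.util.*;

// A plain-Java sketch of the class diagram described above.
public class DispatcherSketch {

    public enum EventType { ORDER_CREATED, ORDER_CANCELLED }

    // One implementation per EventType, as in the article.
    public interface EventProcessingService {
        EventType type();
        String prepareBusinessView(String payload);
    }

    // EventDispatcher routes each event to the processor for its type.
    public static final class EventDispatcher {
        private final Map<EventType, EventProcessingService> processors =
                new EnumMap<>(EventType.class);

        public EventDispatcher(List<EventProcessingService> services) {
            for (EventProcessingService s : services) {
                processors.put(s.type(), s);
            }
        }

        public String dispatch(EventType type, String payload) {
            EventProcessingService p = processors.get(type);
            if (p == null) throw new IllegalArgumentException("No processor for " + type);
            return p.prepareBusinessView(payload);
        }
    }
}
```

With Spring, the dispatcher's constructor argument would simply be the list of all EventProcessingService beans, so adding a new event type means adding one enum value and one implementation.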

So, now we can look at the query to find new events:

update event
   set status = 'PROCESSING',
       updated_at = clock_timestamp()
 where id = any (select et.id
                   from event et
                  where et.status = 'NEW'
                    and et.type_code = :code
                  limit :batchSize
                    for no key update skip locked)
returning id, payload

Note that I use FOR NO KEY UPDATE SKIP LOCKED here to enable multiple instances of the Exporting Service to retrieve new events without blocking or double-claiming. Alternatively, you can use libraries such as ShedLock or Quartz to ensure that events are always processed by a single instance of the application.
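To see why SKIP LOCKED matters, here is a toy in-memory illustration of two pollers claiming events: a shared "locked" set stands in for row locks, and each poller skips rows already claimed by the other, so the batches never overlap (names are illustrative):

```java
import java.util.*;

// Toy model of FOR ... SKIP LOCKED: concurrent pollers claim
// disjoint batches instead of blocking on each other's rows.
public class SkipLockedSketch {
    private final Set<Long> lockedIds = new HashSet<>();
    private final List<Long> newEventIds;

    public SkipLockedSketch(List<Long> newEventIds) {
        this.newEventIds = new ArrayList<>(newEventIds);
    }

    // Claim up to batchSize events, skipping any already locked
    // by another poller.
    public synchronized List<Long> claim(int batchSize) {
        List<Long> claimed = new ArrayList<>();
        for (Long id : newEventIds) {
            if (claimed.size() == batchSize) break;
            if (lockedIds.add(id)) {  // add() returns false if already locked
                claimed.add(id);
            }
        }
        return claimed;
    }
}
```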

Also, take a look at the query to retry stalled events:

with stalled as (
    select e.id
      from event e
      join event_type et on et.code = e.type_code
     where extract(epoch from clock_timestamp() - e.updated_at) / 60 >= :passedMinutesInProcessingToRetry
       and e.status = 'PROCESSING'
       and e.processing_retries_count < et.max_retry_count
)
update event ue
   set status = 'NEW',
       updated_at = clock_timestamp(),
       processing_retries_count = processing_retries_count + 1
  from stalled se
 where ue.id = se.id
   and ue.status != 'NEW'
returning ue.id

Here, we find all events with the PROCESSING status and check how long each event has been in that status. If it has been in this status for a time greater than or equal to your custom timeout (passedMinutesInProcessingToRetry) and has remaining retry attempts, the event will be moved back to the NEW status. And if we modify this query slightly, we can create a query that moves events to the ERROR status (you can find it in the repository).
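The retry decision embedded in that query boils down to one comparison, sketched below (status names follow the article; PROCESSED as the success state is my assumption):

```java
// Sketch of the retry decision above: a stalled event goes back to
// NEW while retry attempts remain, otherwise it goes to ERROR.
public class RetryPolicy {

    public enum Status { NEW, PROCESSING, PROCESSED, ERROR }

    // Decide the next status for an event stuck in PROCESSING.
    public static Status nextStatusForStalled(int retriesSoFar, int maxRetryCount) {
        return retriesSoFar < maxRetryCount ? Status.NEW : Status.ERROR;
    }
}
```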

Conclusion

In conclusion, let us highlight some of the positive and negative aspects of this solution.

Positive

  • Easy to develop.
  • Easy to resend some events (if other teams lose your data).
  • Easy to extend with new events.

Negative

  • A lot of updating operations for each row (reminder of the problem).
  • If you have operations that generate a lot of business data, they may also create many events as separate rows in the database. However, you can avoid this by storing a chunk of business data in the payload field.
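The chunking idea from the last point can be sketched simply: instead of one outbox row per business record, group identifiers into chunks and store each chunk as one payload (chunk size and names are illustrative):

```java
import java.util.*;

// Sketch of payload chunking: fewer outbox rows for bulk operations.
public class PayloadChunker {

    // Split a list of business-record ids into fixed-size chunks;
    // each chunk would become one event's payload.
    public static List<List<Long>> chunk(List<Long> ids, int chunkSize) {
        List<List<Long>> chunks = new ArrayList<>();
        for (int i = 0; i < ids.size(); i += chunkSize) {
            chunks.add(ids.subList(i, Math.min(i + chunkSize, ids.size())));
        }
        return chunks;
    }
}
```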

So, this is a good solution at the beginning of your project, but it is not efficient once you have many events.
