Enhancing Inter-Service Communication: The Transactional Outbox Pattern Solution — Part 1

Gerardo Parra
Published in Fever Engineering · Feb 23, 2024


Nowadays, it is common to find software systems composed of multiple microservices that collaborate to carry out distributed business processes. A fundamental challenge in these systems is ensuring the consistency of the information and the reliability of the communication between the different services.

Publish/Subscribe architectures are a well-proven solution for inter-service communication in microservice systems. Message Brokers, like RabbitMQ or Kafka, provide resiliency and recoverability in communication failure scenarios, but those guarantees do not automatically extend to the services that produce or consume the information. Software design patterns must be implemented in the application itself to solve these challenges completely.

This post is the first in a series of articles explaining Fever's journey with the Transactional Outbox pattern and how it became a critical component of our system. It discusses the pattern, our initial implementation, and the scalability challenges we faced after almost a year and a half of operating it. The following post will explain how the problem was addressed by taking advantage of our Change Data Capture (you should definitely check out this post for more information on that topic).

Transactional Outbox

Transactional Outbox is a software architecture pattern that solves the transactionality problems originating from the communication errors that can happen when delivering a message to a Message Broker.

Typically, a Pub/Sub system architecture can be represented by the diagram in the following figure:

Figure 1: Typical schema for Pub/Sub architecture

Any minimally functional Message Broker will offer a delivery semantic known as "at least once." The communication between the Producer application and the Message Broker implies that a message won't be considered delivered until the Producer has received confirmation from the Broker. This communication happens over the network, which means that different error scenarios can occur.
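
To make the confirmation round trip tangible, below is a minimal sketch of publisher confirms using the pika client against a local RabbitMQ broker. It is purely illustrative: Fever uses its own internal library for RabbitMQ, and the queue name and payload are made up.

```python
# Minimal publisher-confirms example (illustrative only).
import pika
from pika.exceptions import UnroutableError

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="demo-events", durable=True)
channel.confirm_delivery()  # every publish now waits for a broker confirmation

try:
    channel.basic_publish(
        exchange="",               # default exchange: routes by queue name
        routing_key="demo-events",
        body=b'{"example": "payload"}',
        mandatory=True,
    )
    print("Message confirmed by the broker")
except UnroutableError:
    # The broker returned the message; the producer must decide what to do
    # with it (retry, store it somewhere, alert, ...).
    print("Message was returned by the broker")
finally:
    connection.close()
```

With confirms enabled, basic_publish only returns once the broker has acknowledged the message, and it is exactly this round trip that the error scenarios below can interrupt.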

Network Communication: Potential Error Scenarios

Omitted message

The message never reaches its destination.

Figure 2.1: Communication failure: omitted message

Delayed message

The message is received with a time delay, potentially impacting the order of reception.

Figure 2.2: Communication failure: delayed reception

Receiver process crash error

The receiver process crashes, so any messages sent while it is down are not received.

Figure 2.3: Communication failure: The receiver process suffers a crash error.

As mentioned earlier, the Message Broker provides a delivery guarantee for these error scenarios through a combination of communication retries and acknowledgment mechanisms.

However, there remains a challenge that falls within the responsibility of the Producer application owners. This challenge involves addressing how communication errors can affect the Use Case’s performance and information consistency.

Producer Application Responsibilities: Managing Communication Errors for Performance and Consistency

Use Case Structure

A typical Use Case will follow this structure (a code sketch follows the list):

  1. Business validations: A series of business rules checked against the input to confirm that the actions can be taken.
  2. Domain changes: A series of changes to domain entities (creations, modifications, deletions, etc.) made within a transactional scope. Performing them in a transaction is vital: the changes must be applied atomically to enforce information consistency and avoid incomplete or harmful business side-effects.
  3. Domain event publication: The final step wraps the domain information in an event and sends it to the Broker so any Consumer can process and act upon these changes.
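
To make this structure concrete, the sketch below shows such a Use Case before the Transactional Outbox is introduced, using Django's ORM. The names (Order, OrderNotAllowed, OrderPlaced, event_bus) are hypothetical and only illustrate the three steps above.

```python
from django.db import transaction

# Order, OrderNotAllowed, OrderPlaced and event_bus are hypothetical pieces of
# the domain, used only to illustrate the three steps of the Use Case.

def place_order(order_id: str, user_id: str) -> None:
    # 1. Business validations
    order = Order.objects.get(pk=order_id)
    if not order.can_be_placed_by(user_id):
        raise OrderNotAllowed(order_id)

    # 2. Domain changes, applied atomically
    with transaction.atomic():
        order.mark_as_placed()
        order.save()

    # 3. Domain event publication: a network call to the Message Broker that
    #    sits outside the transaction and can fail or block.
    event_bus.publish(OrderPlaced(order_id=str(order.id), user_id=user_id))
```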

Now, what will happen if the message publication fails?

Figure 3: Use Case: message publication fails

There are two possible outcomes:

  1. Connection retries work, and the message is finally delivered. The Use Case actions are already completed, but the execution is blocked while retrying, causing a performance penalty.
  2. Connection retries are not enough; the Broker is still out of service, so the message cannot be delivered. In this situation, not only is there a performance penalty, but the event has not been published even though the changes have been persisted, which means that distributed business processes are broken.

The Transactional Outbox pattern precisely tackles these two problems.

Transactional Outbox Structure

The pattern follows the structure shown in the next figure.

Figure 4: Transactional Outbox Structure

The pattern will divide the publication process into two parts:

  1. Use Case Transactional Publication: Introducing a new DB table to serve as an ‘Outbox’ for any event created during the Use Case execution.
  2. Message Relay Process: An independent process responsible for retrieving events from the Outbox table and facilitating their actual publication to the Message Broker.

This way, the pattern achieves the following advantages:

  • The event publication in the Use Case becomes a DB write, so the transactional context can be extended to cover it as well.
  • The performance penalty is moved out of the Use Case execution, as it is now absorbed by the Message Relay.
  • The Outbox table is a "Source of Truth," as all events created during the Use Cases are stored there. This improves the resiliency against crash-type errors in both the Producer application and the Message Broker.
  • Thanks to the previous point, the Message Relay can retry the delivery of a message multiple times, so "Eventual Consistency" is guaranteed as long as the error source is not the message itself.

The implementation presented in this article addresses only a system that uses an SQL database for persistence, although this pattern could be introduced with other persistence technologies. The baseline of the pattern is to achieve the above advantages by splitting the publication into two steps.
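
As an illustration of the first step, here is how the hypothetical Use Case sketched earlier could look once the event is written to an Outbox table within the same transaction. Again, this is a sketch with assumed names rather than Fever's actual code; OutboxRecord stands for a Django model along the lines of the one described in the next section.

```python
from django.db import transaction

# Same hypothetical names as in the earlier sketch; OutboxRecord is a Django
# model mapped to the Outbox table.

def place_order(order_id: str, user_id: str) -> None:
    # 1. Business validations
    order = Order.objects.get(pk=order_id)
    if not order.can_be_placed_by(user_id):
        raise OrderNotAllowed(order_id)

    # 2 + 3. Domain changes and event "publication" now share one transaction:
    # either both are committed or neither is.
    with transaction.atomic():
        order.mark_as_placed()
        order.save()
        OutboxRecord.objects.create(
            event_fqn="orders.order_placed",
            payload={"order_id": str(order.id), "user_id": user_id},
        )
    # No call to the Message Broker here: the Message Relay picks the record
    # up asynchronously and performs the actual publication.
```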

Fever's Transactional Outbox Implementation

Fever's design for the pattern is shown in the next figure:

Figure 5: Fever Design for the Transactional Outbox

Reading the diagram from left to right, the service implements use cases covering the first phase of the publication by making transactional writes to the DB, including both the domain changes and the event in the Outbox table. The next step is to implement an independent worker for the Message Relay. This worker retrieves events from the Outbox table that are waiting to be delivered and attempts to publish them to the Message Broker one by one.

The Outbox Record Structure

Before moving on, it is necessary to understand how the database record will store the event information in the Outbox table. As depicted in Figure 5, the service employs Django’s ORM for all persistence operations. While the pros and cons of this choice won’t be discussed here, the Outbox Record is represented as a Django model. The objective of this entity is to transform the Event information, an object provided by our internal library for operating with RabbitMQ, into a domain representation of the Service. An example of how the Outbox table would look is shown in the next table:

The attributes used are the following (a simplified model sketch follows the list):

  • Event id: A unique UUID identifier for each event.
  • Event FQN: This is an identifier for the type of event. Each event must be of one type, and that type will be identified by its fully qualified name.
  • Payload: This is the actual content of the event that will be serialized as a Postgres JSON type.
  • Created at: The timestamp representing the instant at which the event was generated.
  • Delivered at: The timestamp representing the instant at which the event was delivered to the Message Broker. The value will be NULL until the acknowledgment from the Broker is processed.
  • Delivery errors: An integer field counting the failed attempts to deliver the event to the Broker.
  • Delivery paused at: This timestamp indicates whether and when the delivery attempts for an event have been paused. This mechanism is designed to prevent the Message Relay from getting stuck at a particular event, especially in cases where the record faces serialization issues, making it impossible to send to the Broker.
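
The actual model is not shown in this post, but a simplified Django version matching the attributes above could look like the following sketch. The field names and the partial index on pending records are assumptions made for illustration.

```python
import uuid

from django.db import models


class OutboxRecord(models.Model):
    event_id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
    event_fqn = models.CharField(max_length=255)
    payload = models.JSONField()
    created_at = models.DateTimeField(auto_now_add=True)
    delivered_at = models.DateTimeField(null=True)
    delivery_errors = models.PositiveIntegerField(default=0)
    delivery_paused_at = models.DateTimeField(null=True)

    class Meta:
        indexes = [
            # Partial index so the Message Relay only scans records that are
            # still pending delivery.
            models.Index(
                fields=["created_at"],
                name="outbox_pending_idx",
                condition=models.Q(delivered_at__isnull=True),
            ),
        ]
```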

Understanding the Message Relay

The mission of this component is to fetch the events from the Outbox table and deliver them to the Broker. To accomplish that, it runs the following algorithm in an infinite loop:

  1. Get N records not delivered from the Outbox table.
  2. Process them one by one:
    a. Reconstruct the event from the DB record.
    b. Send it to RabbitMQ.
    c. Update the status of the event in the Outbox table:
    i. Delivered at is set if the publication is successful.
    ii. Delivery errors, and/or delivery paused at fields are set if the publication has failed.

The implementation of the worker is relatively simple. The entry point is a Django management command executed in its own Kubernetes deployment, so it runs independently from the Service.
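
As a rough illustration of that loop, a stripped-down management command could look like the sketch below. The batch size, the back-off threshold, and the rebuild_event and publisher helpers are assumptions, not Fever's actual implementation.

```python
import time

from django.core.management.base import BaseCommand
from django.utils import timezone

# OutboxRecord is the model sketched above; rebuild_event and publisher stand
# in for the service's event-deserialization code and its RabbitMQ client.


class Command(BaseCommand):
    help = "Message Relay: publishes pending Outbox records to the Message Broker"

    def handle(self, *args, **options):
        while True:
            pending = list(
                OutboxRecord.objects
                .filter(delivered_at__isnull=True, delivery_paused_at__isnull=True)
                .order_by("created_at")[:100]
            )
            if not pending:
                time.sleep(0.1)  # nothing to deliver, avoid hammering the DB
                continue
            for record in pending:
                try:
                    event = rebuild_event(record)  # reconstruct the event from the row
                    publisher.publish(event)       # send it to RabbitMQ
                except Exception:
                    record.delivery_errors += 1
                    if record.delivery_errors >= 5:
                        # Too many failures: pause this record so it cannot
                        # block the rest of the queue.
                        record.delivery_paused_at = timezone.now()
                    record.save(update_fields=["delivery_errors", "delivery_paused_at"])
                else:
                    record.delivered_at = timezone.now()
                    record.save(update_fields=["delivered_at"])
```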

Transactional Outbox Scalability Problems

The presented implementation covers the basics of the pattern and achieves its objectives. However, it has some flaws that must be addressed. This is a critical component of the system: business-critical events are directed through it, so the performance and reliability of the solution are a hard constraint.

There are two main scalability problems:

Database performance degradation will hurt the Message Relay latency:

The worker polls the Outbox table, which means it generates a constant load on the database. Even if this load is not very demanding, the use of indexes is essential to ensure query performance, as the worker executes this query many times per second. This practically guarantees that the query will rank among those consuming the most resources in the database.

The problem is not only that the query itself is demanding for the database, but also that it is very exposed to interference from any other query. The database serves the entire service, and services are not static: they change constantly with the addition of new features and the maintenance of existing ones. A quite plausible scenario is that a newly introduced change produces a query that is far from optimal; even if that change is rolled back or fixed, while it is active it will directly harm other queries, including the worker's. In short, the database dependency can lead to Message Relay latency degradation.

Outbox table congestion problems affect individual message delivery time:

The second flaw of this implementation is related to the number of messages generated by the service. The Outbox table is filled with every event the service generates, so a mechanism to archive old events must be implemented. A naive approach to addressing this issue is a periodic process that removes from the table any event that has already been delivered and has been there for more than 24 hours. Even with that consideration covered, the problem is not fully tackled.
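
Such a cleanup can be as simple as a small periodic job along these lines (an illustrative sketch; the 24-hour window matches the naive approach above, and OutboxRecord is the hypothetical model sketched earlier):

```python
from datetime import timedelta

from django.utils import timezone


def purge_delivered_outbox_records() -> int:
    """Delete Outbox records that were delivered more than 24 hours ago.

    Meant to run periodically (cron, Celery beat, etc.); illustrative only.
    """
    cutoff = timezone.now() - timedelta(hours=24)
    deleted, _ = (
        OutboxRecord.objects
        .filter(delivered_at__isnull=False, delivered_at__lt=cutoff)
        .delete()
    )
    return deleted
```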

Depending on the nature of the events, the load received by the table may not be linear. For instance, if the Service is receiving purchase orders for concerts while an internal reporting process is executing, some concerts will generate a massive load in a short time on top of the events coming from the internal report. During that time, the ingress rate of the table will spike, but the egress rate may continue at the same velocity.

Adding a second worker with the current implementation would be a bad idea for two reasons: first, both workers would make reads that overlap on the same records, and second, the relative order of the events would be broken, so one worker could process an event that depends on another event that has not yet been processed by the other worker. A visual representation of this is shown in the next figure:

Figure 6: Processing order broken by multiple workers

So, with this implementation, there cannot be more than one Message Relay worker, which means that if the ingress rate spikes, each event will wait longer to be delivered. For the previous example, that means a purchase order will need more time to be processed, which will probably cause timeout problems or lead the user to cancel the operation. This situation directly affects the SLOs associated with the purchase and must be considered a major issue.

Wrapping Up: Key Takeaway and a Sneak Peek into the Next Article

The key takeaway from this article is that the Transactional Outbox is indispensable in any distributed architecture. After a year and a half of operating it at Fever, we are wholly convinced of its utility. Our implementation has saved us from information loss on multiple occasions and has significantly enhanced our ability to recover from severity-1 incidents. Of course, there are alternatives to this pattern, but they don’t fit our context or solve our pain points as well as the Transactional Outbox.

The simpler alternative would be to store publication errors/timeouts somewhere and reprocess them later. However, this solution solves neither the transactionality problems in the use case nor the performance impact of the publication when there are communication problems with the Broker.

One alternative is the Two-Phase Commit (2PC), a strategy to coordinate a transaction distributed between two nodes. This strategy cannot be applied to our context at Fever or, in general, to any Publish/Subscribe architecture, because it requires both services to know each other and it blocks both services and their databases until the transaction is completed.

The other alternative is the Saga transaction pattern, a way to implement distributed transactions as local transactions in each involved service. Each transaction is only completed when the service receives a success or failure event from the rest of the services involved. While this strategy could serve us well at Fever, most of our use cases are not highly dependent on previous or subsequent transactions. In summary, the Saga pattern will serve in some cases, but it does not replace the advantages we get from the Transactional Outbox.

As Fever continues to grow, so does the scale at which the solution needs to provide service. In the next article, we will explore the evolution of this implementation and how it has enabled us to tackle and resolve the scalability issues discussed in this article.
