Designing Domain Events

Deepak Narayana Rao
CasaOne Engineering
8 min read · May 21, 2021

In a distributed system, events play a major role in decoupling components. In this post, we talk about how we improved the design of our domain events to tackle multiple problems in our system.

Context

The CasaOne leasing platform consists of multiple services, each built around its own bounded context. Last year, we rebuilt the service responsible for managing leases. We took this opportunity to address some of the issues we had with our earlier design of events.

We will discuss the following aspects of the service:

  1. Publishing events: The service should publish an event whenever a notable change occurs to the domain model. Other services in the system should be able to subscribe to these events and react as per business rules. The event should be delivered to consumers at least once.
  2. Audit trail: The service should store an audit trail of all the business actions performed in this service.

1.1 Publishing events: The past

We followed a simple approach in the past: the service published the event directly to the event bus. A business action in the service generally performed the following steps:

  • Execute a series of business logic.
  • Persist the state of domain entities in the database.
  • Publish an event on the event bus (AWS SNS topic in our case).

For example, when a customer initiates a lease, a “lease” entity is created in the database and a lease.created event is published to the SNS topic.

Note: Each consumer service has an SQS queue subscribed to the SNS topic. A worker in the consuming service receives the message from the SQS queue and performs the relevant action. For example, a message is queued in the assign-stocks-on-lease-created and send-confirmation-email-on-lease-created queues when a lease.created event is published.
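
As a rough illustration (not our actual code; queueClient, assignStocksForLease, and the queue name are hypothetical), a worker consuming one of these queues could look like this:

// Sketch: worker in a consuming service (hypothetical names)
async function pollAssignStocksQueue() {
  // Receive a batch of messages from the SQS queue
  const messages = await queueClient.receiveMessages('assign-stocks-on-lease-created');
  for (const message of messages) {
    const event = JSON.parse(message.body); // e.g. { eventType: 'lease.created', data: {...} }
    await assignStocksForLease(event.data);
    // Delete the message only after it is processed successfully,
    // so a failed attempt will be retried (at-least-once delivery)
    await queueClient.deleteMessage(message);
  }
}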

The above approach works in most cases, but depending on the implementation it leads to one of the following problems:

  1. Phantom event
  2. Lost event

Implementation 1: As shown in the pseudocode below, the service publishes the event inside the transaction that saves the entity in the database. If the transaction commit fails after the event is published (due to faults such as network issues between the service and the database, or a service crash), it leads to a “phantom event” being published to the event bus. The phantom event points to an entity that doesn’t exist in the database.

// Example for the phantom event
function createLease(lease) {
  //...
  db.transaction(() => {
    leaseRepository.create(lease);
    eventBus.publish({
      eventType: 'lease.created',
      data: lease
    });
  });
}

If such a phantom event is processed by consumers in the above example:

  • The system will reserve stock for a non-existent lease. This can cost us thousands of dollars in buying new stock that then sits unused in the warehouse.
  • The customer would be confused by a confirmation email for a lease that didn’t go through.

Implementation 1: Phantom Event

Implementation 2: To overcome the above problem, we can publish the event outside the transaction, as shown in the pseudocode below. If publishing the event fails after the DB transaction commits (due to faults such as network issues between the service and the event bus, or a service crash), it leads to a “lost event”.

// Example for the lost event
function createLease(lease) {
  //...
  db.transaction(() => {
    leaseRepository.create(lease);
  });
  // Publish the event outside the transaction
  eventBus.publish({
    eventType: 'lease.created',
    data: lease
  });
}
Implementation 2: Lost Event

The root cause of both issues is that we are performing actions across multiple transactional boundaries, i.e. the database and the event bus. Ensuring consistency across distributed services would need complex solutions like two-phase commit or compensating transactions. These solutions are feasible only if the participating services support them; in many cases, they don’t.

Let's look at how we solved this problem by eliminating the need for a transaction across services.

1.2 Publishing events: The Present

We came across the transactional outbox pattern as we started exploring solutions. In this approach,

  • The service saves domain entities and the event in the same database within a single transaction.
  • An external process will ship the events from the database to the event bus.

This ensures we will not run into consistency issues between events published and the state of the domain entities. The pseudocode for saving the domain entity and the event is shown below.

function createLease(lease) {
  //...
  db.transaction(() => {
    // Save the entity and the event in the same DB and transaction
    leaseRepository.create(lease);
    leaseEventRepository.create({
      eventType: 'lease.created',
      data: lease
    });
  });
}

In our example above, the event is stored in a table called lease_events in the same database. That was the first, and easier, part of the problem. We needed a reliable and fast way to ship the events from the events table (the transactional outbox) to the event bus. We had the following options:

  • Polling publisher: A background process polls the events table for new events and marks the event as published after publishing the event to the event bus.
  • Transaction log tailing: A background process subscribes to the database transaction logs, filters the logs for changes in the events table, and publishes the event to the event bus. This process maintains the offset of the processed transaction log to know which events have been published to the event bus so far.

Note: In both options above, a message can be published to the event bus more than once in rare cases, such as when the process publishing the event fails to mark the event as published or fails to update the offset of the last event published. This is expected as per at-least-once delivery semantics. We’ll discuss how to handle these duplicate events in the subsequent sections.
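
To make the trade-offs concrete, here is a rough sketch of what a polling publisher could look like (this is not our implementation; outboxRepository and its methods are hypothetical names):

// Sketch: polling publisher (hypothetical names)
async function publishPendingEvents() {
  // Fetch a batch of events that haven't been shipped to the event bus yet
  const events = await outboxRepository.findUnpublished({ limit: 100 });
  for (const event of events) {
    await eventBus.publish({
      eventType: event.eventType,
      data: event.data
    });
    // Marking the event as published can fail after the publish succeeded,
    // so the same event may be published again (at-least-once delivery)
    await outboxRepository.markPublished(event.id);
  }
}

// Poll periodically: a shorter interval publishes events sooner,
// but puts more load on the database
setInterval(publishPendingEvents, 1000);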

We discussed and debated the above approaches and arrived at the following conclusions:

  • The polling publisher approach needed additional tracking data for each event in the events table. We would also have to choose between a delay in publishing due to infrequent polling and increased load on the DB due to frequent polling.
  • The transaction log tailing approach is non-intrusive and can be reused across multiple services. Mainstream relational DBs provide a way to subscribe to transaction logs using the same mechanism they already use for data replication. So we went with the transaction log tailing approach using Debezium, which supports change data capture (CDC) from MySQL binlogs.

A high-level depiction of the solution is shown below

Transactional Outbox with Transaction Log Tailing

The above diagram abstracts the intricate details and challenges of shipping events from Debezium to the event bus we used, AWS SNS. The finer details deserve a separate post by my colleague Ketan Mittal :)

Transactional Outbox and CDC Pipeline

The high-level design decisions we took were:

  • Backward compatibility for the consumers: Debezium supports pushing events to Kafka or Kinesis, but not to SNS. We didn’t want to impose a new consumption pattern on the existing consumers using SQS queues. To stay backward compatible, Debezium pushed events to a Kinesis stream, and a Lambda function consumed events from the stream and published them to the relevant SNS topic.
  • Deduplication: The same event can be received multiple times from the binlogs if Debezium fails to update the log offset due to faults. We built an intermediate layer in the pipeline to deduplicate events based on the event ID (a minimal sketch of this idea follows the list).
  • Local development support: We wanted to make sure developers can test publishing and consuming events easily during development. We achieved this using a combination of LocalStack (which we were already using) and scripts to run a minimal version of the above data pipeline.
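
The deduplication idea can be sketched as follows (simplified; dedupStore, snsTopicFor, and the field names are hypothetical, and a real implementation would also expire old entries in the store):

// Sketch: deduplicate events by event ID before publishing to SNS
async function forwardEvent(event) {
  // Drop events whose ID has already been forwarded
  if (await dedupStore.exists(event.eventId)) {
    return;
  }
  await snsTopicFor(event.eventType).publish(event);
  // Remember the ID so a redelivered copy of this event is dropped;
  // if this step fails, the event may still be forwarded twice (at-least-once)
  await dedupStore.add(event.eventId);
}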

The overall solution took significant one-time effort. But it has yielded good returns: as we progressed, we onboarded more services with minimal additional effort.

We had a separate platform team (our data engineering team) focusing on the transaction log tailing component. Our product engineering teams could focus on functional aspects of the service and publish events easily.

2.1 Audit trail: The past

In the past, we had different tables and APIs for maintaining the history of changes to the aggregate root entity and the related entities in the service, for example status_history, address_history, payment_method_history, etc. This approach needed a lot of effort for tracking the history of a new entity or additional information for existing entities.

This approach was limited by the use of static columns in our transactional database (RDBMS). We wanted to leverage the JSON column support in relational DBs to create a generic audit table, and generic APIs on top of it, to minimize the repeated effort.

Eureka! The events table we created for the transactional outbox had many of the properties of an audit trail. This allowed us to enrich the events table (the transactional outbox) and use it as an audit trail. This solution is described in the next section (2.2).

2.2 Audit trail: The present

The transactional outbox for events (the audit trail table) was named <entity>_events. The high-level structure of this table is as follows; a sketch of writing such an event record appears after the list.

  • event_id: A unique identifier for the event.
  • event_type: Eg: “lease.created”, “lease.started”.
  • schema_version: Eg: “v2”.
  • created_at: Event timestamp.
  • data: A JSON column containing relevant data for the event.
  • <entity>_id: ID of the aggregate root entity. This helped in tracking and filtering the events by a specific root entity.
  • <related_entity>_id: Optional column(s) for IDs of related entities. This helped in tracking and filtering the events by a related entity.
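
Building on the earlier createLease example, writing such an enriched event record could look like the sketch below (field and helper names such as uuid and warehouseId are hypothetical):

// Sketch: enriching the outbox entry with audit-trail columns (hypothetical names)
// Called inside the same transaction that saves the lease entity
function recordLeaseEvent(eventType, lease) {
  leaseEventRepository.create({
    eventId: uuid(),                // event_id: unique identifier for the event
    eventType: eventType,           // event_type, e.g. 'lease.created'
    schemaVersion: 'v2',            // schema_version of the event payload
    data: lease,                    // JSON column with the relevant data
    leaseId: lease.id,              // <entity>_id: the aggregate root entity
    warehouseId: lease.warehouseId  // a <related_entity>_id column (hypothetical)
  });
}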

We built APIs to allow consumers to get a timeline view of the events across the service. The events could be filtered by <entity>_id, event_type, time_range, etc., and sorted in recent_first or oldest_first order based on the event timestamp.
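
For illustration, a request to such an API could look like the sketch below (httpClient, the endpoint path, and the parameter names are hypothetical, not our exact API):

// Sketch: fetching a timeline of events for a lease (hypothetical endpoint and params)
const response = await httpClient.get('/leases/42/events', {
  params: {
    event_type: 'lease.created,lease.started',  // filter by event types
    from: '2021-01-01T00:00:00Z',               // time range start
    to: '2021-03-31T23:59:59Z',                 // time range end
    sort: 'recent_first'                        // or 'oldest_first'
  }
});
// response.body is a list of events ordered by the event timestamp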

This helped us replace multiple history tables with a single audit table, as we had desired. We could now build additional history/audit-trail features in a few days instead of a week or more.

Summary

The above domain event architecture helped us to

  • Deliver events to consumers consistently, without phantom events or lost events.
  • Provide a unified audit trail to track all the actions in the service.

Conclusion

In the earlier stages of the product, we could simply start with a monolith architecture to move fast under the time and resource constraints of a startup. As the product and the team started scaling, we had to break the monolith system, and the team, into smaller units to keep moving fast.

In the early stages of breaking the monolith into services, a simple and naive approach allowed us to move fast, and we could handle the one-off errors manually. As we scaled further, those one-off errors became more expensive, and it was necessary to invest time and resources in a robust solution. This helped us reduce the time spent managing errors and spend more time building useful features for our customers.

Finally, the solution we choose depends on the scale and the needs at the time. As always, there is no silver bullet.

We would love to hear about your experience and challenges in designing domain events. Please comment here to share your thoughts and feedback.

Acknowledgments

This was a collaborative effort with many of my colleagues at CasaOne. Special credits to Shashank Teotia, Ketan Mittal, Ankit Singh for leading the important aspects of the design and implementation.

If you liked reading this, you might like working with us on challenging problems. We are hiring — click here to join us :)
