Breaking Down the Hype: Promises and Pitfalls of Event Driven Architecture

Anwesha Das
Zocdoc Engineering Blog
Jan 9, 2020

Provider world

At Zocdoc, we offer a digital marketplace that connects patients and healthcare providers. While most visitors to Zocdoc are familiar with our patient-facing products that enable people to find doctors and book appointments online, they may be less familiar with our provider-facing tools.

On the provider side of the business, we are responsible for building and maintaining the systems that allow doctors to perform a variety of tasks, such as confirming appointments, updating their visit reasons or accepted insurances, and keeping track of their performance. Our goal is to help doctors get the most out of being a part of Zocdoc’s Marketplace.

Journey into Microservices

Since beginning the transition to AWS in 2015, most teams at Zocdoc have been migrating from our monolithic application to a new microservices architecture. (The many drawbacks we have experienced while working in a monolith are well outlined here.) However, the pace of migration has lagged on the provider side due to the vast swath of tools we own and the tight coupling of our domains in the monolith.

This new architecture has shown us many benefits in terms of scalability, agility, and flexibility, so this year we made a concerted effort to accelerate our microservice adoption. As we kept building out of the monolith (OOM projects, as we fondly refer to them), we ended up with what we have coined the “macroservice-first” strategy.

Martin Fowler describes this as, “Start with just a couple of coarse-grained services, larger than those you expect to end up with. Then as boundaries stabilize, break down into finer-grained services.” This strategy worked well because it enabled us to quickly build out new experimental tools and features, starting with a simple version first to see how providers responded before committing to a fully fleshed-out, scalable product or microservice.

The woes of distributed data

We continued to evolve this microservices architecture but encountered some complexities due to the distributed nature of its data access. For example, a core piece of data that we track at the time of booking is a provider-owned property, like a doctor’s website. It is critical for us to correctly attribute this data to determine where the session was generated from, whether it be a practice’s website or a Zocdoc-owned channel like SEO, so we needed a way to stitch together our appointment-attribution and practice-website data.

However, in the microservices world, data owned by each service is private to that service and can only be accessed via the service’s API, as was the case with our appointment-attribution-service and practice-website-service. (Note: this encapsulation is necessary to ensure that the microservices are loosely coupled and can evolve independently. Otherwise, if multiple services were to share the same database, any schema updates would require time-consuming, coordinated updates to all the services, a lesson we learned the hard way in the past). In addition, our different services running in AWS often use different kinds of databases, both SQL and NoSQL (Aurora, DynamoDB, Elasticsearch), which makes data access even more complicated. This resulted in two major distributed data challenges:

1. Distributed transactions

The first challenge is maintaining data consistency for transactions that span multiple services. Unlike the monolith, where we use an ACID-compliant database, we cannot simply use local database transactions. One way we can achieve consistency in distributed systems is to use a distributed transaction protocol such as two-phase commit (2PC). However, 2PC is usually very complicated and not a viable option in modern web services. The CAP theorem essentially dictates that, in the face of a network partition, you must choose between availability and consistency. In our use case, availability was the better choice since we could not afford interruptions to the Booking flow due to a secondary operation such as appointment attribution.

2. Queries

The second challenge arises when implementing queries that retrieve data from multiple services. If any of these services require data from multiple APIs for an operation, it can lead to complicated application-side joins, typically involving many synchronous request/response calls. When microservices freely invoke other microservices as and when needed like this, it creates tight coupling and a complex dependency graph that is hard to reason about and scale (also something we have grappled with in the past). In addition, handling a failure in any of the dependent service APIs then requires careful consideration and fallback logic, as the sketch below illustrates.
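To make the problem concrete, here is a minimal sketch of such an application-side join. The hostnames, routes, and field names are hypothetical, made up purely for illustration:

```python
# Illustrative only: an application-side join across two hypothetical
# service APIs (the hostnames, routes, and field names are made up).
import requests

def get_attributed_bookings(practice_id: str) -> list:
    # Synchronous call #1: bookings from the appointment service.
    bookings = requests.get(
        f"https://appointment-attribution-service/practices/{practice_id}/bookings",
        timeout=2).json()
    # Synchronous call #2: the practice's URLs from another service.
    urls = requests.get(
        f"https://practice-website-service/practices/{practice_id}/urls",
        timeout=2).json()
    # Join the two result sets in application code. Every dependency here
    # is another API that can fail, time out, or need fallback logic.
    url_set = {u["url"] for u in urls}
    return [b for b in bookings if b.get("referrerUrl") in url_set]
```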

Evolution to Event-driven

The solution, for many applications like ours, is to use an event-driven architecture. In such an architecture, a microservice asynchronously publishes an event when the entity or domain model changes, instead of synchronously calling an API. Other microservices subscribe to those events and update their own entities, leading to subsequent events being published. This is often referred to as the choreography-based saga pattern. It enables data consistency across multiple services without using distributed transactions, but offers weaker guarantees such as eventual consistency (often referred to as the BASE model). Because event-driven systems are asynchronous by nature, they are generally more responsive than traditional REST (or API) architectures and can be activated by triggers fired on incoming events. This promotes loose coupling (and easy scalability) amongst services because the event producer does not need to know about the consumers or how the event will be processed. It also eliminates the blocking/waiting typical of request-response style APIs and frees up resources within the service.
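As a minimal sketch of the producer side, assuming an SNS topic as the event stream (the topic ARN and event shape below are illustrative, not our actual setup):

```python
# A minimal sketch of the producer side of choreography, assuming an SNS
# topic as the event stream (the topic ARN and event shape are made up).
import json
import boto3

sns = boto3.client("sns")
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:practice-website-events"

def publish_practice_website_added(practice_id: str, url: str) -> None:
    event = {
        "eventType": "PracticeWebsiteAdded",
        "practiceId": practice_id,
        "url": url,
    }
    # The producer neither knows nor cares which services are subscribed.
    sns.publish(TopicArn=TOPIC_ARN, Message=json.dumps(event))
    # Caveat: this publish is not atomic with the database write that
    # produced the event; the "Event streams and Atomicity" section below
    # covers how to close that gap.
```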

Event consumers can also maintain their own “materialized views,” or copies of the data, to allow for ad-hoc querying or joining against their own data. For example, when our Appointment Attribution Service receives a PracticeWebsiteAdded event informing it of newly added URLs, it updates its own copy of the PracticeUrl datastore, which it can use to join against Booking data. While there are many patterns that fall under the umbrella of event-driven, this particular pattern coincides with what Martin Fowler refers to as Event-Carried State Transfer. This pattern generates a lot of duplicated data, but that’s usually not a concern given the low cost of storage these days. Plus, the added benefits of reduced coupling and better resilience outweigh concerns about data redundancy.
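On the consumer side, the materialized-view update might look roughly like this sketch, assuming a DynamoDB table as the consumer’s local copy (table and attribute names are assumptions):

```python
# Sketch of the consumer side: on each PracticeWebsiteAdded event, upsert
# into a local, query-friendly copy of the data. Table and attribute
# names are assumptions for illustration.
import boto3

practice_urls = boto3.resource("dynamodb").Table("AttributionService-PracticeUrls")

def handle_practice_website_added(event: dict) -> None:
    # Keep our own copy so we can join against Booking data without a
    # synchronous call back to practice-website-service.
    practice_urls.put_item(Item={
        "practiceId": event["practiceId"],
        "url": event["url"],
    })
```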

Event streams and Atomicity

For these event-driven systems to be reliable, however, a core condition must be satisfied: the service must atomically write to its database and publish an event. This requirement leads to the following transaction-based patterns:

1. Transactional Outbox pattern

In this approach, you create an additional “outbox” table in the service’s database. Upon receiving a request to create or modify a business entity, you update your entity table and, as part of the same database transaction, also insert a record into the outbox table representing the event to be published. An asynchronous process then monitors that table for new entries and publishes these events out to a data stream or message broker.
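Here is a minimal sketch of the pattern, using Python’s built-in sqlite3 as a stand-in for the service’s real relational database (the table names and event payload are illustrative):

```python
# A minimal sketch of the Transactional Outbox pattern. sqlite3 stands in
# for the service's real database; the entity row and the outbox row
# commit or roll back together.
import json
import sqlite3
import uuid

conn = sqlite3.connect("service.db")
conn.execute("CREATE TABLE IF NOT EXISTS practice_urls (practice_id TEXT, url TEXT)")
conn.execute("CREATE TABLE IF NOT EXISTS outbox (event_id TEXT, payload TEXT)")

def add_practice_url(practice_id: str, url: str) -> None:
    with conn:  # one database transaction covering both writes
        conn.execute(
            "INSERT INTO practice_urls (practice_id, url) VALUES (?, ?)",
            (practice_id, url))
        conn.execute(
            "INSERT INTO outbox (event_id, payload) VALUES (?, ?)",
            (str(uuid.uuid4()),
             json.dumps({"eventType": "PracticeWebsiteAdded",
                         "practiceId": practice_id, "url": url})))
    # A separate relay process polls the outbox table and publishes each
    # row to the message broker, deleting (or marking) it once sent.
```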

2. ✔ Transaction log tailing or Change Data Capture (CDC)

The idea here is to instead tail the database’s transaction log and publish each change in the entity table as an event. As opposed to a polling-based approach, this type of log-based Change Data Capture (CDC) happens with very low overhead in near-real time. Debezium is a popular distributed platform that comes with CDC connectors for several databases such as MySQL, Postgres, and SQL Server. AWS provides an even simpler mechanism for CDC in the form of DynamoDB streams, if you’re using DynamoDB as your database. These streams provide a time-ordered sequence of item-level modifications in any DynamoDB table and store this information in a log for up to 24 hours.

At Zocdoc, we use DynamoDB a fair amount because it’s a fully managed serverless NoSQL database that provides seamless scalability and performance, with up to 25 GB of storage for free. Given the convenience of DynamoDB streams, it’s a perfect choice for event-driven services based on CDC. Once set up, these streams can invoke other AWS services such as the Simple Notification Service (SNS) or Lambda. From there, events can be fanned out to any number of subscribers. An extremely useful AWS serverless pattern (that we use heavily at Zocdoc) is to use SNS to publish events to one or more SQS queues. This gives us a reliable means to send events asynchronously with near-guaranteed delivery, and also provides the benefit of message throttling.
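Wiring up that fan-out with boto3 might look like the following sketch (in practice this usually lives in CloudFormation or Terraform; all names here are made up):

```python
# Sketch of the SNS -> SQS fan-out wiring with boto3 (in practice this
# is usually defined in infrastructure-as-code; all names are made up).
import boto3

sns = boto3.client("sns")
sqs = boto3.client("sqs")

topic_arn = sns.create_topic(Name="practice-website-events")["TopicArn"]
queue_url = sqs.create_queue(QueueName="attribution-service-events")["QueueUrl"]
queue_arn = sqs.get_queue_attributes(
    QueueUrl=queue_url, AttributeNames=["QueueArn"])["Attributes"]["QueueArn"]

# Each subscribing service gets its own queue; SNS delivers a copy of every
# event to each queue. (The queue also needs a policy granting SNS
# sqs:SendMessage, omitted here for brevity.)
sns.subscribe(TopicArn=topic_arn, Protocol="sqs", Endpoint=queue_arn)
```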

There is a drawback to this CDC-based approach. While we publish all the changes to our datastore in the event stream, the datastore itself only captures the latest state of the data. As a result, we lose data by design. There is no replayability or auditability, and no way to query the state of the data at a historical point in time. This was a salient limitation for us, as we needed to determine appointment attribution not only on the fly but also retroactively (in case of any corrections). So it was critically important for us to know the time the change was performed, along with the details and the user identifier. We also considered storing the generated CDC events (like Inserts, Updates, and Deletes with light transformations on top) in our Redshift analytics warehouse, but it meant that our OLAP data would have diverged from our OLTP data, and there was no way to recover the data if it was lost during one of the many steps of our data engineering pipeline.

Events as the source of truth

This led us to an increasingly popular pattern that has emerged as an alternative to a traditional CRUD application — Event Sourcing.

Event Sourcing

Event Sourcing is the modeling of changes to an entity’s state as an immutable “log” of events. This “event log” or “event store” then becomes the source of truth, and the system state is purely derived from it. Since saving an event is a single operation, this pattern is inherently atomic and minimizes chances of conflicting data updates. An event here represents something that took place in the domain, like: PracticeWebsiteAdded, PracticeWebsiteModified, PracticeWebsiteStatusChanged, AppointmentConfirmed, etc.

The event store typically publishes these events to an event stream (or message broker), allowing consumers to subscribe to these events and process them as needed. This can be achieved using CDC as mentioned above. If the semantics of the events from event sourcing are too low-level, one could consider publishing higher-level domain events instead, using an additional event handler.

CQRS

Unfortunately, once you apply the Event Sourcing pattern, the data can no longer be easily queried. This leads us to a different but closely related pattern called Command Query Responsibility Segregation, or CQRS. The idea of CQRS is to segregate the responsibility between commands (write requests) and queries (read requests) and handle them differently in the application. You can even split up the data storage, creating separate read and write datastores, allowing greater isolation and independent scaling.

Thus, in a typical event sourcing + CQRS application, you end up with an “events” table that the command side writes to in an append-only fashion, and a “projection” table (also called a “state table,” “materialized view,” or “persistent read model”) with a flexible schema to support fast and efficient queries. The service keeps this projection up-to-date by subscribing to domain events published by the event log. At Zocdoc, we’ve had good success implementing these kinds of projection stores in the past, such as with our wasabi infrastructure.

In AWS land

We were able to put event sourcing + CQRS into practice for our practice-website-service by combining DynamoDB’s flexible key-value data model with its streaming feed of item-level activity. We ended up with two different DynamoDB tables supporting different query patterns, and a DynamoDB stream linking the two. For the state table, our schema was fairly straightforward. We used practiceId as the hash (partition) key and Url as the range (sort) key, since we could only allow unique pairs of (practiceId, url) values.

For events, we couldn’t employ the same primary key, as there could be multiple events for a given (practiceId, url) combination, and (practiceId, EventType) didn’t quite work either, as there could be multiple events of type UPDATED for a single entry. The natural choice was to use (practiceId, EventId) as the primary key. We chose the Event ID to be a combination of a GUID (for uniqueness) and Timestamp (to allow sorting by EventId).
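A sketch of appending an event under this scheme follows. The exact EventId format here, a sortable millisecond-timestamp prefix plus a GUID suffix, is our illustrative assumption, as are the table and attribute names:

```python
# Sketch of appending an event to the Events table. The EventId format
# (sortable timestamp prefix plus GUID suffix) is an illustrative
# assumption, not necessarily the exact production format.
import time
import uuid
import boto3

events = boto3.resource("dynamodb").Table("PracticeWebsiteEvents")

def append_event(practice_id: str, event_type: str, payload: dict) -> None:
    # Timestamp first so EventIds sort chronologically within a practice;
    # GUID second so concurrent events never collide.
    event_id = f"{int(time.time() * 1000):015d}-{uuid.uuid4()}"
    events.put_item(Item={
        "practiceId": practice_id,  # hash (partition) key
        "eventId": event_id,        # range (sort) key
        "eventType": event_type,    # e.g. "PracticeWebsiteAdded"
        **payload,
    })
```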

We then used AWS Lambda to create triggers that responded to events in the DynamoDB stream for the Events table. One trigger processed these events and updated our projection table to reflect the latest state. The other trigger published these events to an SNS topic to stream out to subscribing services.

Each trigger receives a batch of stream records; every record carries the event name (INSERT, MODIFY, or REMOVE) along with the new and old images of the modified item.
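A minimal sketch of the projection-updating trigger, reusing the table and attribute names assumed in the earlier sketches (the second trigger would follow the same shape but publish to SNS instead):

```python
# Sketch of the projection-updating Lambda trigger. Real code would
# dispatch on the event's type (added/modified/removed); we show only
# the simplest case. Table and attribute names are assumptions.
import boto3
from boto3.dynamodb.types import TypeDeserializer

projection = boto3.resource("dynamodb").Table("PracticeWebsiteState")
deserializer = TypeDeserializer()

def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] != "INSERT":
            continue  # the Events table is append-only, so only inserts
        # Stream records carry the item in DynamoDB's wire format;
        # deserialize it into plain Python values.
        image = {k: deserializer.deserialize(v)
                 for k, v in record["dynamodb"]["NewImage"].items()}
        # Fold the domain event into the latest-state projection table.
        projection.put_item(
            Item={"practiceId": image["practiceId"], "Url": image["url"]})
```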

The good and the bad

Event sourcing has several benefits. First, it provides a reliable audit log of the data and is an excellent choice for auditing, compliance, data governance, and data analytics systems. It makes it possible to implement temporal queries that determine the state of a business entity at any point in time. As discussed above, it also solves some of the key problems in implementing an event-driven architecture: atomicity and data consistency. Plus, because it persists events rather than domain objects, it mostly avoids the object-relational impedance mismatch problem. When combined with CQRS, it provides a convenient way to independently scale your reads and writes.

However, it also has some disadvantages:

1. The programming model is more complex, with a higher learning curve. It took our team a fair bit of time to get used to the idea of modeling and persisting events rather than entity state, as in a CRUD application.

2. Your application will now have to handle eventually consistent data. With CQRS, the application can hit stale data if it reads from a projection that has not been updated yet. This can happen, for example, if the lambda that updates the projection hits the snooze button, a.k.a. the cold start problem, something we encountered ourselves. In addition, our user interface for practice websites lived on a legacy webpage that reloaded after every edit, making eventual consistency sometimes a bit too… eventual. As a workaround, we set up a Lambda warmer that invoked our Lambda every 5 minutes to reduce startup time, and were thus able to reduce our latency issues to a negligible percentage (see the warmer sketch after this list).

3. You have to employ the right level of granularity with the modeled domain events. Too coarse an event (like PracticeWebsiteEvent), and all consumers have to listen to these events and figure out whether they are affected. Too fine an event (like PracticeWebsiteUrlChanged and PracticeWebsiteDomainTypeChanged), and consumers have to listen to multiple events and combine them before they can take action. In addition, events are never deleted. Even when application-level failures or inconsistencies occur, you run compensating transactions that create further events. This can also lead to large volumes of data if the event scope is too small. Event granularity, in general, is a tricky problem to solve and involves investing a lot of time understanding the domain and talking to domain experts. The much-talked-about technique of Event Storming can help solve this problem using a more Domain-Driven Design approach.

4. Another drawback is that consuming services must detect and handle unordered, duplicate, or missing events. While DynamoDB streams in combination with Lambda, SNS, and SQS can alleviate these problems to a good extent, there are still gaps (particularly since SNS does not currently support forwarding messages to SQS FIFO queues). One common mitigation is to make each consumer idempotent, as shown in the second sketch after this list.
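For illustration, the warmer mentioned in point 2 boils down to a short-circuit in the handler, with a scheduled CloudWatch Events (EventBridge) rule invoking the function with a marker payload (the payload shape below is our own convention, not a standard):

```python
# Sketch of the warmer short-circuit. A scheduled rule (e.g. a CloudWatch
# Events rate(5 minutes) rule) invokes the function with a marker payload
# such as {"warmer": true}; the payload shape is an assumed convention.
def handler(event, context):
    if event.get("warmer"):
        return {"warmed": True}  # warm-up ping: keep the container alive
    # ... normal event processing continues here ...
```

And for point 4, one way to handle duplicate deliveries is an idempotent consumer that records each processed event ID with a conditional write; a sketch assuming a hypothetical ProcessedEvents table:

```python
# Sketch of an idempotent consumer: record each processed eventId with a
# conditional write and skip duplicates. "ProcessedEvents" is hypothetical.
import boto3
from botocore.exceptions import ClientError

processed = boto3.resource("dynamodb").Table("ProcessedEvents")

def process_once(event: dict) -> bool:
    try:
        # Fails if this eventId has already been recorded.
        processed.put_item(
            Item={"eventId": event["eventId"]},
            ConditionExpression="attribute_not_exists(eventId)")
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # duplicate delivery; safe to ignore
        raise
    # ... apply the event's effects here ...
    return True
```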

Alternative approaches

If eventual consistency is of particular concern, there is an alternative approach you can follow. Instead of using a lambda to listen to DynamoDB streams and update the projection table, you can transactionally write to both the event store and the state table upon receiving a request, thereby eliminating any delay in read-store updates. This is a hybrid between the Transactional Outbox and Event Sourcing patterns, which provides the added benefit of allowing you to validate your commands against the state table before persisting any changes.
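A sketch of this hybrid write path, using DynamoDB’s low-level transaction API (the table and attribute names are the same illustrative assumptions as before):

```python
# Sketch of the hybrid approach with DynamoDB's low-level transaction API:
# the event append and the state-table update succeed or fail together.
import time
import uuid
import boto3

client = boto3.client("dynamodb")

def add_url_transactionally(practice_id: str, url: str) -> None:
    event_id = f"{int(time.time() * 1000):015d}-{uuid.uuid4()}"
    client.transact_write_items(TransactItems=[
        {"Put": {
            "TableName": "PracticeWebsiteEvents",
            "Item": {
                "practiceId": {"S": practice_id},
                "eventId": {"S": event_id},
                "eventType": {"S": "PracticeWebsiteAdded"},
                "url": {"S": url},
            },
        }},
        {"Put": {
            "TableName": "PracticeWebsiteState",
            "Item": {"practiceId": {"S": practice_id}, "Url": {"S": url}},
            # Validate the command against current state in the same
            # transaction, e.g. reject duplicate (practiceId, url) pairs.
            "ConditionExpression": "attribute_not_exists(#u)",
            "ExpressionAttributeNames": {"#u": "Url"},
        }},
    ])
```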

While DynamoDB does provide APIs for transactionally updating multiple tables, you must use its low-level APIs instead of the more fluent Document or Object Persistence Model. With this approach, you also lose the ability to scale reads and writes independently, and it could increase the overall latency of your write API. As a result, we found it better to use the CQRS approach, but may consider pivoting in the future (particularly if the AWS SDK starts supporting transactions through the Object Persistence API).

There are other approaches to solving eventual consistency that may fit your use case, depending on your needs, such as maintaining an in-memory cache of the data and checking against the event store during reads.

Summary

Event-driven architecture is a great choice for decoupling your microservices and achieving better scalability. When designing event-driven systems, we need to model the events that cause changes to our domain entities and also publish them out atomically to event streams. In AWS, DynamoDB streams offer us a great way to do exactly that.

When we want to persist these events forever and use them as the source of truth, we can employ the Event Sourcing pattern. In this pattern, the events are persisted in an event store that acts as the system of record. It is commonly combined with the CQRS pattern to create materialized views from the stored events. While there are several benefits to this approach, like full replayability, auditability, and atomicity, it is not a silver bullet. You end up creating more complex systems and are forced to deal with eventual consistency.

About the author

Anwesha Das is an engineer on the Provider team at Zocdoc. Previously she worked in the financial industry building real-time trading applications. She has a background in Physics and a passion for all things cloud computing, system architecture and data engineering.

Originally published at https://www.zocdoc.com on January 9, 2020.
