McDonald’s event-driven architecture: The data journey and how it works

Part two of event-driven architecture post.

By Vamshi Krishna Komuravalli, Director, Principal Architect and
Damian Sullivan, Senior Manager, Software Development

Last week, in the first of a two-part post, we explained how we in Global Technology implemented an event-driven architecture. In this week’s post, we explore how that architecture actually works and how the data flows through the system.

Reliable event processing
Here is a typical data flow of how events are reliably produced and consumed from the platform:

  1. Initially, an event schema is defined and registered in the schema registry.
  2. Applications that need to produce events leverage a producer SDK to publish events.
  3. When an application starts up, an event schema is cached in the producing application for high performance.
  4. The SDK performs schema validation to ensure the event conforms to the schema.
  5. If the validation checks out, the SDK publishes the event to the primary topic.
  6. If the SDK encounters an error, such as a schema validation failure or a retriable error, the event is routed to the dead-letter topic tied to that producer.
  7. If the SDK encounters an error such as MSK being unavailable, the event is written to the DynamoDB database.
  8. Applications that need to consume events leverage a consumer SDK to do so.
  9. The consumer SDK similarly performs schema validation to ensure the events being consumed conform to the schema.
  10. A successful consumption results in the offset being committed back to MSK, and the consumer proceeds to the next event in the topic.
  11. Events within a dead-letter topic are later rectified through an admin utility and published back into the primary topic.
  12. Events produced by our partners, or “Outer Events,” are published via an event gateway.
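
The producer-side routing in steps 4 through 7 can be sketched as follows. This is a minimal illustration, not the actual SDK: the class and function names, the topic names, and the in-memory lists standing in for MSK and DynamoDB are all assumptions, and only the schema-validation error path is modeled (retriable errors would take the same dead-letter route).

```python
import json


class BrokerUnavailable(Exception):
    """Hypothetical error: the broker (MSK in the article) cannot be reached."""


class SchemaError(Exception):
    """Hypothetical error: the event does not match its schema."""


class InMemoryPlatform:
    """Stand-in for the primary topic, dead-letter topic and fallback database."""

    def __init__(self, broker_up=True):
        self.broker_up = broker_up
        self.topics = {"orders": [], "orders.dlq": []}  # assumed topic names
        self.fallback_db = []  # stands in for the DynamoDB table

    def send(self, topic, payload):
        if not self.broker_up:
            raise BrokerUnavailable(topic)
        self.topics[topic].append(payload)


def validate(event):
    # Assumed one-field contract, just enough to trigger both outcomes.
    if not isinstance(event.get("order_id"), str):
        raise SchemaError("order_id must be a string")


def publish(platform, event):
    """Return where the event ended up: primary, dead-letter or fallback-db."""
    payload = json.dumps(event)
    try:
        try:
            validate(event)
        except SchemaError:
            platform.send("orders.dlq", payload)
            return "dead-letter"
        platform.send("orders", payload)
        return "primary"
    except BrokerUnavailable:
        # Neither topic is reachable, so keep the event in the fallback store.
        platform.fallback_db.append(payload)
        return "fallback-db"
```

Calling `publish(InMemoryPlatform(), {"order_id": "A1"})` lands the event on the primary topic, while the same call against `InMemoryPlatform(broker_up=False)` stores it in the fallback list.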

Data governance
A key problem in consuming systems is data integrity. When the integrity of the data is assured, it saves a lot of time and complexity in the design of downstream systems. MSK, along with a schema registry, allows us to enforce data contracts between systems. A schema is defined that describes the expected data fields and types, along with which fields are optional and which are required. In real time, each message is checked (via serialization libraries) against this schema for validity; a message that fails the check is routed to a dead-letter topic for rectification.
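
To make the idea of a data contract concrete, here is a small sketch of a schema that lists expected fields, their types, and whether each is required, plus a check that reports violations. The schema layout, field names, and the `violations` helper are illustrative assumptions; a real deployment would rely on the serialization library and schema registry rather than hand-written checks.

```python
# Hypothetical schema: field name -> (expected type, required?)
ORDER_SCHEMA = {
    "order_id": (str, True),
    "total_cents": (int, True),
    "coupon_code": (str, False),  # optional field
}


def violations(event, schema):
    """Return a list of contract violations; an empty list means the event is valid."""
    problems = []
    for name, (expected_type, required) in schema.items():
        if name not in event:
            if required:
                problems.append(f"missing required field: {name}")
        elif not isinstance(event[name], expected_type):
            problems.append(f"wrong type for field: {name}")
    return problems
```

An event with a non-empty result from `violations` is the kind of message that would be sent to the dead-letter topic instead of the primary topic.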

The schemas are used as follows:

On startup, producers cache a list of known schemas in memory. A schema can be updated for a variety of reasons, such as adding fields or changing data types. When a producer publishes a message, versioning information is stored in the topic using a custom magic byte at the beginning of each message. Later, when the messages are consumed, the magic byte determines which schema should be used to read each message. This system helps reduce rolling updates and mixed message versions in the topics. If we need to roll back or make a new schema update, consumers can still parse each message.
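
The version-prefix idea can be illustrated with a short sketch. The article only says that a custom magic byte carries versioning information, so the exact layout here (one leading byte holding the schema version, followed by a JSON body) is an assumption, as are the `SCHEMAS` cache and the field names.

```python
import json

# Hypothetical in-memory schema cache: version -> field names in that version.
SCHEMAS = {
    1: ["order_id", "total_cents"],
    2: ["order_id", "total_cents", "currency"],
}


def encode(version, event):
    """Prefix the JSON body with a single byte holding the schema version."""
    body = json.dumps({name: event[name] for name in SCHEMAS[version]})
    return bytes([version]) + body.encode("utf-8")


def decode(message):
    """Read the version byte, then parse the body with the matching schema."""
    version = message[0]  # indexing bytes yields an int
    fields = SCHEMAS[version]  # KeyError means this consumer does not know the version
    body = json.loads(message[1:].decode("utf-8"))
    return version, {name: body[name] for name in fields}
```

Because every message names its own schema version, a consumer can read a topic that mixes version 1 and version 2 messages without knowing in advance which producer wrote them.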

Using the schema registry this way validates data contracts across disparate systems and helps ensure data integrity in our downstream analytics systems.

Cluster autoscaling
While MSK provided autoscaling of the storage attached to the brokers, a solution had to be built for expanding the cluster itself. We created an autoscaler function that is triggered when a broker’s CPU utilization exceeds a configurable threshold. It adds a broker to the MSK cluster and then triggers another Lambda function to move partitions across the brokers.
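
The decision logic behind such an autoscaler might look like the sketch below. It deliberately makes no AWS or Kafka calls: the function names, the 70 percent default threshold, the broker numbering, and the round-robin placement (which ignores replicas) are all assumptions made for illustration.

```python
def needs_new_broker(cpu_by_broker, threshold_percent):
    """True when any broker's CPU utilization exceeds the configured threshold."""
    return any(cpu > threshold_percent for cpu in cpu_by_broker.values())


def rebalance(partitions, broker_ids):
    """Spread partitions across brokers round-robin; returns partition -> broker."""
    brokers = sorted(broker_ids)
    return {
        partition: brokers[index % len(brokers)]
        for index, partition in enumerate(sorted(partitions))
    }


def autoscale(cpu_by_broker, partitions, threshold_percent=70.0):
    """Add one broker if the threshold is exceeded, then plan partition placement."""
    brokers = list(cpu_by_broker)
    if needs_new_broker(cpu_by_broker, threshold_percent):
        brokers.append(max(brokers) + 1)  # hypothetical ID for the added broker
    return rebalance(partitions, brokers)
```

For example, `autoscale({1: 85.0, 2: 40.0}, ["orders-0", "orders-1", "orders-2"])` plans for a third broker and assigns one partition to each.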

Domain-based sharding
For efficient scaling and to minimize failures, we separate the events into multiple domain-based MSK clusters. The domain of an event determines the cluster in which its topic resides, and consuming applications have the flexibility to consume events from any of the domain-based topics. The platform is set up to support global deployments across regions, with a high-availability configuration in each region.
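
One way to picture domain-based sharding is a lookup from an event's domain to the cluster that hosts its topic. In this sketch the domain names, the cluster addresses, and the `<domain>.<event>` topic naming scheme are all hypothetical; the article does not describe how the mapping is actually stored or resolved.

```python
# Hypothetical mapping of event domain -> cluster bootstrap address.
CLUSTER_BY_DOMAIN = {
    "orders": "orders-cluster.example.internal:9092",
    "loyalty": "loyalty-cluster.example.internal:9092",
}


def cluster_for_topic(topic):
    """Resolve the cluster for a topic named '<domain>.<event>' (assumed scheme)."""
    domain, _, event_name = topic.partition(".")
    if not event_name or domain not in CLUSTER_BY_DOMAIN:
        raise ValueError(f"no cluster configured for topic: {topic}")
    return CLUSTER_BY_DOMAIN[domain]


def subscriptions(topics):
    """Group topics by cluster so one consumer can read across domains."""
    grouped = {}
    for topic in topics:
        grouped.setdefault(cluster_for_topic(topic), []).append(topic)
    return grouped
```

A consuming application interested in both order and loyalty events would call `subscriptions` with its topic list and open one connection per cluster in the result.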

Looking ahead
We envision this platform growing and evolving as we continue to strengthen event processing in our architecture. We look forward to several features in our pipeline, among them support for formal specification of events, a move to a serverless-based approach for MSK, partition autoscaling, and an enhanced developer experience.
