EXPEDIA GROUP TECHNOLOGY — DATA
Be Vigilant about Time Order in Event-Based Data Processing
How to handle the timing of events
Clickstreams, user activity, time-series data, and other event data are often analyzed to add business value. As data infrastructures evolve and grow more complex, honoring the essential properties of event-based data, time and order, can become difficult. Getting them wrong can introduce data inconsistencies and may even raise legal concerns when the event data represents a user's privacy preferences. In this post, we'll look at such an example and discuss how it can be solved.
Let’s start with an example
Our data engineering team builds Vrbo's (part of Expedia Groupᵀᴹ) promotional emails, and one of our core data products provides access to recipient email activities: opens, clicks, unsubscribes, spam reports, bounces, and so on. While the other activities matter for analytics and business models, unsubscribe events are critical because they capture a user's privacy preferences, which you don't want to get wrong. Beyond email activity, certain types of web activity, such as booking requests and account sign-ups, can also trigger an automatic subscribe action to our promotional emails. A simplified subscribe/unsubscribe example can be illustrated like this:
Events come from different places, different channels, and have different delays. They can carry different timestamps and values. If not processed carefully, this may result in incorrect privacy settings as shown in the example below.
This is one example of the potential risks of processing event data that comes from different channels. What if:
- A single-stream event processing system can't guarantee the order of events?
- Event processing fails and is then retried?
- A timestamp doesn't mean what you think it does?
- You process in batches instead of streams, or vice versa? What are the caveats and advantages of each?
In the next section, we’ll discuss these different scenarios.
Things to consider
Timestamps
Timestamps are the foundation of everything here. Without reference timestamps, you lose the truth of an event. But what are the different timestamp types? What are the semantics of the timestamps in your data pipelines' upstream/downstream contracts? Consider the following field names:
- createdAt
- updatedAt
- requestedTimestamp
- timestamp
- updateDate
- createDate
- event.header.time
- event.body.timestamp
- dateid
- hourid
- …
Consider documenting somewhere what each of your timestamps means in each system. Doing so can remove ambiguity and help to prevent errors when designing and updating systems.
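As a sketch, that documentation can even live next to the code as a machine-readable registry. The field meanings below are illustrative assumptions for this post, not actual production contracts:

```python
# Hypothetical registry mapping each timestamp field to its documented meaning.
# The semantics below are illustrative assumptions, not real pipeline contracts.
TIMESTAMP_SEMANTICS = {
    "createdAt": "when the record was first written to the service database (UTC)",
    "updatedAt": "when the event was saved to stream storage (UTC)",
    "event.header.time": "when the event was fired by the user (UTC)",
    "eventCreatedTimestamp": "when the event was captured and generated (UTC)",
}

def describe(field: str) -> str:
    """Look up the documented meaning of a timestamp field."""
    return TIMESTAMP_SEMANTICS.get(field, "UNDOCUMENTED: define this before relying on it")
```

A lookup that comes back UNDOCUMENTED is a prompt to pin down the semantics before wiring the field into a new system.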
Consider this timeline of events:
While this diagram looks straightforward and streamlined, in the real world, each of these actions is executed in different systems, using different technologies, with different schedules, and maintained by different teams.
Timestamp definitions vary across each of the steps above. For instance, updatedAt in the "Processed and saved in stream" step can mean the timestamp at which the event is saved to stream storage, while event.header.time in the "Consumed and transformed" step means the timestamp at which the event was "Fired by user".
By the time an event is permanently saved at the final stage, the most important timestamp of all could be lost: the timestamp recorded when the event was captured and generated. If any of these steps fails to keep the events in order for a single user, the final destination of those events may suffer the consequences.
Batch vs. stream
There are two different approaches to processing events: batch and stream. They treat the event timestamp in different ways.
Batch
In batch processing, event handling revolves around timestamps and time-related fields such as eventCreatedTimestamp and batchLastRunWatermark. For batch-based event processing:
- Because of its timed scheduling, events are grouped together for processing. batchLastRunWatermark is important for remembering the last event processed.
- When one event fails, the entire batch needs to be retried. Batched retries inevitably send duplicate events to downstream consumers unless a cache or staged-logging implementation remembers the transactions in every batch.
- Event order can be kept by sorting the data in each batch by eventCreatedTimestamp, then sending it downstream in sequence.
- If an event with an earlier eventCreatedTimestamp arrives after the batch in which it should have been processed, that event is lost forever.
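The batch behavior above can be sketched with a minimal model, assuming each event carries eventCreatedTimestamp as epoch milliseconds (run_batch is a hypothetical helper, not our production code):

```python
def run_batch(events, batch_last_run_watermark):
    """Process one batch: drop late events, sort the rest, advance the watermark.

    events: dicts carrying eventCreatedTimestamp (epoch millis).
    batch_last_run_watermark: epoch millis of the last event already processed.
    """
    # Events at or before the watermark arrived after their batch: lost forever here.
    late = [e for e in events if e["eventCreatedTimestamp"] <= batch_last_run_watermark]
    current = [e for e in events if e["eventCreatedTimestamp"] > batch_last_run_watermark]
    # Preserve order by sorting within the batch before sending downstream in sequence.
    current.sort(key=lambda e: e["eventCreatedTimestamp"])
    new_watermark = (current[-1]["eventCreatedTimestamp"]
                     if current else batch_last_run_watermark)
    return current, late, new_watermark
```

The late list makes the last caveat concrete: without extra handling, those events simply never reach downstream consumers.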
Stream
Stream processing tries to process events in real time:
- Events are processed in the order they arrive.
- Retries won't cause duplicate events, because only the failed event is retried rather than an entire batch.
- If an event's arrival time differs from the actual event timestamp in its payload, processing will be out of order unless a local buffer or mini-batch is implemented to hold a series of events and sort them into the correct order.
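The local-buffer idea in the last point can be sketched with a small min-heap. This hypothetical ReorderBuffer assumes each event carries an eventCreatedTimestamp and a unique id:

```python
import heapq

class ReorderBuffer:
    """Hold up to `capacity` events and release them in payload-timestamp order."""

    def __init__(self, capacity: int = 3):
        self.capacity = capacity
        self._heap = []  # min-heap of (payload timestamp, event id)

    def push(self, event):
        """Buffer an arriving event; once full, emit the oldest buffered event id."""
        heapq.heappush(self._heap, (event["eventCreatedTimestamp"], event["id"]))
        if len(self._heap) > self.capacity:
            return heapq.heappop(self._heap)[1]
        return None

    def drain(self):
        """Flush the remaining events in timestamp order (e.g., on shutdown)."""
        return [heapq.heappop(self._heap)[1] for _ in range(len(self._heap))]
```

The capacity bounds how late an event can be and still get reordered: a larger buffer absorbs more out-of-order arrivals at the cost of added latency.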
The real problem and possible solution
Despite the simplified data processing model introduced above, the event order problem still hasn't been fully solved. Recall the email preference example discussed earlier: if an event arrives in the wrong order relative to what's already in the final destination database, the data update will be wrong:
To solve this problem, it is critical to carry the earliest user-fired event timestamp from beginning to end. At the end of the path, before updating the database, we check that timestamp to ensure the event being processed is newer than what's already recorded in the database. In the final stage, we can implement a timestamp check like this:
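A sketch of that final-stage check, assuming the store keeps the last applied eventCreatedTimestamp alongside each user's preference (the dict-backed db below stands in for a real database):

```python
def apply_preference_update(db, event):
    """Apply a subscribe/unsubscribe event only if it is newer than what's stored.

    db: user_id -> {"subscribed": bool, "eventCreatedTimestamp": int}
    event: carries the original user-fired timestamp end to end.
    """
    current = db.get(event["userId"])
    if current and event["eventCreatedTimestamp"] <= current["eventCreatedTimestamp"]:
        # Stale or duplicate: the database already reflects a newer user decision.
        return False
    db[event["userId"]] = {
        "subscribed": event["subscribed"],
        "eventCreatedTimestamp": event["eventCreatedTimestamp"],
    }
    return True
```

With this guard, an unsubscribe fired at t=200 that arrives before a subscribe fired at t=100 still wins: the late subscribe is rejected as stale and the user stays unsubscribed.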
To align definitions across products and teams, use consistent naming conventions, time zone alignment, and schema consistency. Here’s an example:
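One way to pin those conventions down is a shared contract that every stage validates against. The field names and units below are illustrative assumptions, not an actual shared schema:

```python
# Hypothetical shared convention: one naming scheme, one unit (epoch millis), and
# one time zone (UTC) for every timestamp, across every producer and consumer.
REQUIRED_TIMESTAMP_FIELDS = {
    "eventCreatedTimestamp",    # epoch millis, UTC: when the user fired the event
    "eventReceivedTimestamp",   # epoch millis, UTC: when the pipeline first saw it
    "eventProcessedTimestamp",  # epoch millis, UTC: when this stage handled it
}

def missing_timestamp_fields(event: dict) -> list:
    """Return required timestamp fields absent from an event, for contract checks."""
    return sorted(REQUIRED_TIMESTAMP_FIELDS - event.keys())
```

Running this check at every stage boundary catches a dropped original timestamp immediately, rather than at the final database update.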
Clickstreams and user activities take center stage in our data product lines, yet handling detailed event data processing, especially with regard to timestamps and event order, is challenging. Differing definitions of timestamps, batch vs. stream processing, and technology choices all make this problem harder to solve.
Implementing solutions requires rigor, especially when many middle layers and data products sit in between. The original timestamp is easily lost if even one step in the pipeline fails to relay it in the payload. Solving this takes precise coordination between teams, clear data product definitions, a well-unified data architecture, and, ultimately, respect for our customers' data.