EXPEDIA GROUP TECHNOLOGY — DATA
Be Vigilant about Time Order in Event-Based Data Processing
How to handle the timing of events
Clickstreams, user activity, time-series data, and other event data are often analyzed to add business value. As data infrastructures evolve and grow more complex, honoring the essential properties of event-based data, time and order, can become difficult. Getting them wrong can introduce data inconsistencies and may even raise legal concerns when the event data represents a user's privacy preferences. In this post, we'll look at such an example and discuss how it can be solved.
Let’s start with an example
Our data engineering team builds Vrbo's (part of Expedia Groupᵀᴹ) promotional emails, and one of our core data products provides access to recipient email activities: opens, clicks, unsubscribes, spam reports, bounces, and so on. While the other activities matter for analytics and business models, unsubscribe events are critical because they capture a user's privacy preferences, which you don't want to get wrong. Beyond email activity, certain types of web activity, such as booking requests and account sign-ups, can also trigger an automatic subscribe action to our promotional emails. A simplified subscribe/unsubscribe example can be illustrated like this:
Events come from different places, different channels, and have different delays. They can carry different timestamps and values. If not processed carefully, this may result in incorrect privacy settings as shown in the example below.
This is one example of the potential risks of processing event data that comes from different channels. What if:
- A single-stream event processing system can't guarantee the order of events?
- Event processing fails and is then retried?
- A timestamp doesn't mean what you think it does?
- You process in batches instead of streams, or vice versa? What are the caveats and advantages of each?
In the next section, we’ll discuss these different scenarios.
Things to consider
Timestamps
Timestamps are the foundation of everything here. Without reference timestamps, you lose the truth of an event. But what are the different timestamp types? What are the semantics of the timestamps in your data pipelines' upstream/downstream contracts? Consider the following field names:
- createdAt
- updatedAt
- requestedTimestamp
- timestamp
- updateDate
- createDate
- event.header.time
- event.body.timestamp
- dateid
- hourid
- …
Consider documenting somewhere what each of your timestamps means in each system. Doing so can remove ambiguity and help to prevent errors when designing and updating systems.
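As a sketch, that documentation can even live next to the code as a machine-readable registry. The field meanings below are illustrative assumptions for this post, not actual production contracts:

```python
# Hypothetical registry mapping each timestamp field to its documented meaning.
# The semantics below are illustrative assumptions, not real pipeline contracts.
TIMESTAMP_SEMANTICS = {
    "createdAt": "when the record was first written to the service database (UTC)",
    "updatedAt": "when the event was saved to stream storage (UTC)",
    "event.header.time": "when the event was fired by the user (UTC)",
    "eventCreatedTimestamp": "when the event was captured and generated (UTC)",
}

def describe(field: str) -> str:
    """Look up the documented meaning of a timestamp field."""
    return TIMESTAMP_SEMANTICS.get(field, "UNDOCUMENTED: define this before relying on it")
```

A lookup that comes back UNDOCUMENTED is a prompt to pin down the semantics before wiring the field into a new system.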
Consider this timeline of events:
While this diagram looks straightforward and streamlined, in the real world, each of these actions is executed in different systems, using different technologies, with different schedules, and maintained by different teams.
Timestamp definitions vary across each of the steps above. For instance, updatedAt in the "Processed and saved in stream" step can mean the timestamp at which the event is saved to stream storage, while event.header.time in the "Consumed and transformed" step means the timestamp at which the event was "Fired by user".
By the time an event is permanently saved at the final stage, the most important timestamp of all could be lost: the timestamp recorded when the event was captured and generated. If any of these steps fails to keep the events in order for a single user, the final destination of those events may suffer the consequences.
Batch vs. stream
There are two different approaches to processing events: batch and stream. They treat the event timestamp in different ways.
Batch
In batch processing, event handling revolves around timestamps and time-related fields such as eventCreatedTimestamp and batchLastRunWatermark. For batch-based event processing:
- Because of its timed scheduling, events are grouped together for processing. batchLastRunWatermark is important for remembering the last event processed.
- When one event fails, the entire batch needs to be retried. Batched retries inevitably send duplicate events to downstream consumers unless a cache or staged-logging implementation remembers the transactions in every batch.
- Event order can be kept by sorting the data in each batch by eventCreatedTimestamp, then sending it downstream in sequence.
- If an event with an earlier eventCreatedTimestamp arrives after the batch in which it should have been processed, that event is lost forever.
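The batch behavior above can be sketched with a minimal model, assuming each event carries eventCreatedTimestamp as epoch milliseconds (run_batch is a hypothetical helper, not our production code):

```python
def run_batch(events, batch_last_run_watermark):
    """Process one batch: drop late events, sort the rest, advance the watermark.

    events: dicts carrying eventCreatedTimestamp (epoch millis).
    batch_last_run_watermark: epoch millis of the last event already processed.
    """
    # Events at or before the watermark arrived after their batch: lost forever here.
    late = [e for e in events if e["eventCreatedTimestamp"] <= batch_last_run_watermark]
    current = [e for e in events if e["eventCreatedTimestamp"] > batch_last_run_watermark]
    # Preserve order by sorting within the batch before sending downstream in sequence.
    current.sort(key=lambda e: e["eventCreatedTimestamp"])
    new_watermark = (current[-1]["eventCreatedTimestamp"]
                     if current else batch_last_run_watermark)
    return current, late, new_watermark
```

The late list makes the last caveat concrete: without extra handling, those events simply never reach downstream consumers.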
Stream
Stream processing tries to process events in real time:
- Events are processed in the order they arrive.
- Retries won't cause duplicate events, because only the failed event is retried rather than an entire batch.
- If an event's arrival time differs from the actual event timestamp in its payload, processing will be out of order unless a local buffer or mini-batch is implemented to hold a series of events and sort them into the correct order.
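The local-buffer idea in the last point can be sketched with a small min-heap. This hypothetical ReorderBuffer assumes each event carries an eventCreatedTimestamp and a unique id:

```python
import heapq

class ReorderBuffer:
    """Hold up to `capacity` events and release them in payload-timestamp order."""

    def __init__(self, capacity: int = 3):
        self.capacity = capacity
        self._heap = []  # min-heap of (payload timestamp, event id)

    def push(self, event):
        """Buffer an arriving event; once full, emit the oldest buffered event id."""
        heapq.heappush(self._heap, (event["eventCreatedTimestamp"], event["id"]))
        if len(self._heap) > self.capacity:
            return heapq.heappop(self._heap)[1]
        return None

    def drain(self):
        """Flush the remaining events in timestamp order (e.g., on shutdown)."""
        return [heapq.heappop(self._heap)[1] for _ in range(len(self._heap))]
```

The capacity bounds how late an event can be and still get reordered: a larger buffer absorbs more out-of-order arrivals at the cost of added latency.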
The real problem and possible solution
Despite the simplified data processing model introduced above, the event order problem still hasn't been fully solved. Recall the email preference example discussed earlier: if an event arrives in the wrong order relative to what's already in the final destination database, the data update will be wrong:
To solve this problem, it is critical to carry the earliest user-fired event timestamp from beginning to end. At the end of the path, before updating the database, we check that timestamp to ensure the event being processed is newer than what's already recorded in the database. In the final stage, we can implement a timestamp check like this:
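A sketch of that final-stage check, assuming the store keeps the last applied eventCreatedTimestamp alongside each user's preference (the dict-backed db below stands in for a real database):

```python
def apply_preference_update(db, event):
    """Apply a subscribe/unsubscribe event only if it is newer than what's stored.

    db: user_id -> {"subscribed": bool, "eventCreatedTimestamp": int}
    event: carries the original user-fired timestamp end to end.
    """
    current = db.get(event["userId"])
    if current and event["eventCreatedTimestamp"] <= current["eventCreatedTimestamp"]:
        # Stale or duplicate: the database already reflects a newer user decision.
        return False
    db[event["userId"]] = {
        "subscribed": event["subscribed"],
        "eventCreatedTimestamp": event["eventCreatedTimestamp"],
    }
    return True
```

With this guard, an unsubscribe fired at t=200 that arrives before a subscribe fired at t=100 still wins: the late subscribe is rejected as stale and the user stays unsubscribed.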
To align definitions across products and teams, use consistent naming conventions, time zone alignment, and schema consistency. Here’s an example:
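One way to pin those conventions down is a shared contract that every stage validates against. The field names and units below are illustrative assumptions, not an actual shared schema:

```python
# Hypothetical shared convention: one naming scheme, one unit (epoch millis), and
# one time zone (UTC) for every timestamp, across every producer and consumer.
REQUIRED_TIMESTAMP_FIELDS = {
    "eventCreatedTimestamp",    # epoch millis, UTC: when the user fired the event
    "eventReceivedTimestamp",   # epoch millis, UTC: when the pipeline first saw it
    "eventProcessedTimestamp",  # epoch millis, UTC: when this stage handled it
}

def missing_timestamp_fields(event: dict) -> list:
    """Return required timestamp fields absent from an event, for contract checks."""
    return sorted(REQUIRED_TIMESTAMP_FIELDS - event.keys())
```

Running this check at every stage boundary catches a dropped original timestamp immediately, rather than at the final database update.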
Clickstreams and user activities take center stage in our data product lines, yet handling detailed event data processing, especially with regard to timestamps and event order, is challenging. Differing definitions of timestamps, batch vs. stream processing, and technology choices all make this problem harder to solve.
Implementing solutions requires rigor, especially when many middle layers and data products sit in between. The original timestamp is easily lost if even one step in the pipeline fails to relay it in the payload. Solving this takes precise coordination between teams, clear data product definitions, a well-unified data architecture, and, ultimately, respect for our customers' data.