Preparing Event-Driven Data for Analysis

Mike
Walmart Global Tech Blog
7 min read · Nov 9, 2020

Introduction

Historically, data came in nice rows that were easy to analyze. These days, event-driven architectures (EDA) can create several uncorrelated event streams, which can leave data scientists scratching their heads over the correct way to bring the data together. Often this data remains unanalyzed until it is put into a simpler format. This article demonstrates a simple process for consolidating event-driven data into data sets that can be easily analyzed in a Spark environment. The business example is to pull together several event streams into a single data set that data scientists can use to get the big picture of Zoom usage.

Image source from https://publicdomainvectors.org/en/free-clipart/Female-manager/70861.html (Text is mine)

When an engineer first looks at the Zoom data, they might naively hope to find a simple record of each meeting showing the time and attendance. Instead, they will find that Zoom meeting data is spread across many (40+) different event streams, and several of those streams must be joined together before a usable dataset can be derived.

To clearly define terms, each ‘event stream’ tracks individual events of a specific event type. Much of corporate consumer data can be thought of as event streams (events such as “Viewed Page”, “Added to Shopping Cart”, “Purchased Item”, “Created Account”, etc.), and many internal systems are rapidly becoming more event-based as well, such as service ticketing, location updates, and directory services. Corporate Zoom data is an excellent example of ‘event stream’ data. For example, the zoom_meeting_participant_joined event stream has a record for every time a person joins a meeting. So, if a person joins (then leaves) a meeting three times, there are three records in that table.

Each event stream stores data in a separate table, where each row represents one event.
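As a toy illustration of that shape (the table and column names here are assumptions for the example, not the actual Zoom schema), one person joining the same meeting three times would produce three rows:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical slice of the zoom_meeting_participant_joined stream.
# Column names and values are illustrative, not the real Zoom schema.
joins_df = spark.createDataFrame(
    [
        ("9876001", "user_42", "2020-11-02 09:00:05"),
        ("9876001", "user_42", "2020-11-02 09:12:40"),
        ("9876001", "user_42", "2020-11-02 09:31:18"),
    ],
    ("meetingID", "participantID", "join_time"),
)
joins_df.show()  # one row per join event: three joins, three rows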

The Process

Combining event streams into a single usable dataset follows a general process.

  1. Understand the business process — I know it sounds cliché, but “understanding the business process” is the first step. For the Zoom data this may be easy because we use Zoom daily. However, if we were analyzing a data set we were not familiar with, such as incident/problem tracking data, we would want to develop a flowchart that shows when the events are triggered, similar to the Zoom event flow below:

Zoom Event flow — (Simplified view. There are actually over 40+ Zoom event types for meetings and webinars.)

Image created inside Walmart by Mike Roberts

2. Figure out which data set to start with.

In order to get to “a table of who went to which meetings,” we will need to merge several different event streams. Naturally, the first question is, “Which set of events should be the base on which we build the table?”

The logical place to start would be the beginning of the event flow, when the meeting was created. While combining event streams in chronological order is a good general strategy, it is not the best approach for Zoom event data because of the event structure. If we start with meeting_created, we find many meetings are updated or deleted, which creates quality problems. So in this case, a better place to start is participant_joins_meeting, because this gives us a record of every meeting that someone attended. This table also contains a reference to the meetingID, which will help pull in data from either upstream or downstream events. For example, we can use the meetingID in the participant_joins_meeting table to filter the meeting_created table to find the host and creation date.
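As a rough sketch of that filtering step (the table and column names such as zoom_meeting_participant_joined, zoom_meeting_created, host_id, and creation_time are assumptions, not the exact Zoom schemas), the meetingID from the participant events can be used to pull in the host and creation date from the upstream stream:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical table and column names -- adjust to your actual event schemas.
participant_joins = spark.table("zoom_meeting_participant_joined")
meeting_created = spark.table("zoom_meeting_created")

# Start from the events we know happened (someone joined a meeting), then use
# meetingID to pull in the upstream creation event for just those meetings.
base = participant_joins.select("meetingID", "participantID", "join_time")
creation_info = meeting_created.select("meetingID", "host_id", "creation_time")

meetings_with_host = base.join(creation_info, on="meetingID", how="left")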

When starting the process of joining and merging different event streams, consider the following:

  • Which data set is the start of the event flow? This follows the chronological order of events.
  • Which data set has the most transactions? Starting with a base table that contains many events, more connections can be made as the base table is joined to other data sets (a quick row-count comparison, like the sketch after this list, can settle this).
  • Which data set has the most links to other types of events? Event data sets that have keys to other data sets can facilitate easy joins between data.
  • Does a particular event type align closely with the business requirement? Is there a data set that contains most of the answer to the business question?
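To weigh those questions quickly, a simple sketch like the one below can compare transaction volume across candidate base tables (the table names are placeholders for whatever event streams exist in your environment):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder names for the candidate event streams.
candidate_tables = [
    "zoom_meeting_created",
    "zoom_meeting_participant_joined",
    "zoom_meeting_participant_left",
    "zoom_meeting_ended",
]

# Compare row counts to see which stream has the most transactions.
for name in candidate_tables:
    print(name, spark.table(name).count())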

3. Putting the data together

The general problem with joining event streams is that there is not always a 1-to-1 relationship between upstream and downstream events.

In the case of Zoom data, one meeting might have many join/leave events for the same person. This event stream is further complicated by the fact that a person may join the meeting before it starts or leave the meeting after it has ended. Working with the business partners, we can develop rules to handle these kinds of situations. For example, “If a person has joined multiple times, keep only the first time a person joined the meeting. Assume that they left the meeting at their last leave time or the meeting end time, whichever is earlier.”

To meet the business requirement of “keep only the first time a person joined the meeting,” we will need to create a ‘window’ for each participant, then select the earliest record of them joining a meeting. Apache Spark Window functions allow users to calculate results over a range of input rows. They are similar to groupBy() functionality, but without aggregation. For the Zoom data, we will create a window over the participant join times and then select the earliest record. There is too much about windows to cover in this article, but here is an article to get you started on Spark windows. Generally the code has two steps: first set the window, then apply it to the dataframe. Below is a simple example that uses a window to take the first item in each group of a dataframe.

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import first

spark = SparkSession.builder.getOrCreate()

# Example dataframe
df = spark.createDataFrame(
    [("A", 1), ("A", 2), ("A", 3), ("B", 10), ("B", 20), ("B", 30)],
    ("id", "count"),
)

# Define the window for each 'id', ordered by 'count'
windowSpec = Window.partitionBy("id").orderBy("count")

# Use the window to select the first value in each window
df.select(
    "*",
    first("count").over(windowSpec).alias("first_in_window"),
).show()
+---+-----+---------------+
| id|count|first_in_window|
+---+-----+---------------+
|  B|   10|             10|
|  B|   20|             10|
|  B|   30|             10|
|  A|    1|              1|
|  A|    2|              1|
|  A|    3|              1|
+---+-----+---------------+

Once you have windowed the participant_joins_meeting table to get the first join for each person, you will need to apply similar logic to the participant_leaves_meeting table to get the last leave event. While this example shows windowing, some tables may require aggregation instead. For example, a quality_of_service event is created every time a person leaves a meeting, and it may be desirable to aggregate the quality-of-service issues for a particular person in a meeting as part of the analysis.
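A sketch of that follow-on step might look like the code below. The table and column names (participant_left events with a leave_time, a meeting_end_time assumed to be already joined in from the meeting_ended stream, and a quality-of-service table) are assumptions for illustration, not the exact Zoom schemas:

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, count, first, least

spark = SparkSession.builder.getOrCreate()

# Hypothetical tables and columns -- adjust names to your environment.
leaves = spark.table("zoom_meeting_participant_left")
qos = spark.table("zoom_meeting_participant_qos")

# Last leave per person per meeting: the same windowing idea, ordered descending.
# (meeting_end_time is assumed to have been joined in from the meeting_ended stream.)
leave_window = Window.partitionBy("meetingID", "participantID").orderBy(col("leave_time").desc())

last_leave = (
    leaves
    .select("meetingID", "participantID", "leave_time", "meeting_end_time")
    .withColumn("last_leave_time", first("leave_time").over(leave_window))
    .dropDuplicates(["meetingID", "participantID"])
    # Business rule: they left at their last leave time or the meeting end
    # time, whichever is earlier.
    .withColumn("effective_leave_time", least("last_leave_time", "meeting_end_time"))
)

# Some streams are better aggregated than windowed, e.g. counting
# quality-of-service issues per person per meeting.
qos_summary = qos.groupBy("meetingID", "participantID").agg(
    count("issue_type").alias("qos_issue_count")
)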

4. Joins and Validation

The next step of joining and merging the many tables is a standard process: join the data, keep the fields you need, and repeat until you have the final table.

Because the different event streams are not correlated, I found it useful to run various statistics on each of the joins, so I was clear about which records were falling out at each step. Event tables often have missing or duplicate values that can give strange results. We want to see the vast majority of the events joined; we would expect a few items to fall out because the different event streams are not correlated. If the events could not be joined well, then we would need to start investigating the root cause. Similarly, it can be useful to look at the records that fall out in the windowing process. For example, we know that sometimes people have difficulty joining a Zoom meeting, so it is not surprising to see 5–15% of people joining a meeting several times because of computer problems. However, if we found that the windowing process reduced the data set by 50%, we should investigate why.
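One lightweight way to do that checking is a small helper that reports the fallout after each join. This is just a sketch: the function name and the commented-out usage tables are my own hypothetical additions, not part of the original pipeline.

from pyspark.sql import DataFrame

def join_fallout_report(base: DataFrame, other: DataFrame, key: str) -> None:
    """Print how many rows of `base` find a match in `other` on `key`."""
    base_count = base.count()
    matched = base.join(other.select(key).distinct(), on=key, how="left_semi").count()
    fallout = base_count - matched
    pct = 100.0 * fallout / max(base_count, 1)
    print(f"{key}: {base_count} base rows, {matched} matched, {fallout} fell out ({pct:.1f}%)")

# Example usage with the (hypothetical) tables from the earlier sketches:
# join_fallout_report(participant_joins, meeting_created, "meetingID")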

5. Be prepared to discuss what is not in the model

When aggregating the event streams into a data set that is going to be used by others, it is important to be able to identify what did not go into the model. Business partners will need to be guided on the strengths and limitations of the data. For example, if we combine the Zoom event streams shown in the diagram above, we have a pretty good summary of “who went to which meetings.” However, this analysis does NOT tell us:

  • who was invited to a meeting but did not attend
  • how many meetings were created and then canceled
  • which meetings were scheduled but no one attended

If you are proactive in this step, you can use this analysis to propose new ideas, provide work effort estimates, and help guide the business users.

Summary

With modern application architectures producing streaming events, there is a need to combine these events into simpler tables so that they can be used by analysts and data scientists. Pulling event streams together is not ‘rocket science’ but ‘data science’, if you follow a clear process:

  • Understand the event workflow
  • Identify a dataset to use as the base, then build on it
  • Analyze the deduplication and join steps carefully
  • Document what is not known about the model


Michael Roberts is a software engineer at Walmart focused on ETL, who loves working in the garden. Views expressed are his own and do not represent Walmart.