Streamlining Machine Learning Development with a Feature Store
By John Thomas
In UFC, fighters prepare for their time in the Octagon with intense training regimens. Fighters and their teams spend months analyzing their opponent’s style for any exploitable weaknesses, fine-tuning their bodies and their technique to be at their best for the match.
Data scientists at UFC’s parent organization, Endeavor, approach machine learning predictions with no less determination. While fighters throw down on camera, the data on the fans watching the fight is analyzed just as critically — the competition to attract and retain viewers is fierce. Customers need an engaging experience that keeps them hooked, and Endeavor has become a rising leader in leveraging machine learning and digital expertise to meet that challenge. Below, we dive into how Endeavor maximizes viewership insights by using a feature store to streamline machine learning development at UFC.
Many data scientists reading this know that the bulk of the work in building machine learning models goes into feature creation. Features (i.e., predictors or attributes), such as aggregated characteristics (e.g., days since last purchase or customer country of residence), are a necessary part of predictive machine learning models. For a model to make accurate predictions, it needs high-quality features that explain sufficient variation in the output (in other words, that help determine which category your prediction belongs to). But as machine learning operations scale, maintaining the ML ecosystem becomes challenging and repetitive. Worse, your team risks definition inconsistency across your codebase by repeating the same efforts over and over. Endeavor has solved this problem with a feature store, which standardizes and centralizes features to ensure consistency, discoverability, and reusability.
Features can take significant time to develop, depending on the complexity of the calculations. You can spend weeks aggregating the raw data on hand into a digestible format that makes sense for your model. Some features are common across several use cases — lifetime ticket sales or most-used streaming devices, for example — and can be predictive for both transactional and browsing-behavior models. If multiple data scientists build models from the same source, problems like definition inconsistency emerge when the same features are created multiple times for different objective functions (i.e., the predictive questions you want answered). Data engineering efforts can be streamlined with a common hub of features shared across multiple clients and multiple models.
Enter the feature store.
A feature store is a common code repository where multiple contributors can define features for model development.
In a feature store, you can have a schema consisting of several feature tables, each describing the common identifier you are trying to predict on. We can walk through an example using the work done with Endeavor Streaming. Endeavor Streaming brings leading technologies and digital media expertise to the streaming world and provides an array of analytical capabilities for sports streaming giants. For each client (defined as a business Endeavor partners with to stream their content), Endeavor Streaming collects data on streaming and transactional behavior, which can then enhance the customer experience. Data scientists on our team use these raw data streams to provide machine learning predictions for clients. If you were doing machine learning on a video streaming data source, you could have the following breakdown, for example:
Potential Objective Functions:
- Churn — What is the likelihood a customer will cancel their subscription?
- Pay-Per-View Re-Purchase — What is the chance a customer who bought a Pay-Per-View license buys one again?
- Customer Re-Engagement — What is the probability that a customer will stream a specific piece of content?
All these questions are very different to answer as far as marketing actionability and product insights are concerned, but these questions all pertain to a common identifier: the customer. If we know enough about the customer, we can determine answers for all these objective functions using the data on hand. Below, we outline how this feature store structure could be used for the UFC and how data on viewership during the weekly fights could be leveraged accordingly. We can utilize knowledge of engagement activity in minutes for a churn problem and continue using this aggregation for a Re-Engagement problem. Continuing with the example of streaming, one of these tables could look like:
Table 1: Example Viewership Raw Data Stream for UFC
Endeavor leverages Snowflake databases to store our data, and we maximize efficiency by using DBT models to store queries of pertinent data aggregations. Using DBT, we can build test and inference datasets at the customer level (identified by the CUSTOMER_EXID) to answer the objective functions on hand.
Some features that could be derived from these datasets:
- Devices Used by Customer
- Minutes Watched by Customer
- Country of Viewership by Customer
- Content Viewed by Customer
- First Main Card Content Viewed
On the surface, you could use these traits individually and build the three models above for however many clients you have. In this example, we will assume Endeavor Streaming has two clients, Client 1 and Client 2.
This “works.” The data manipulation step of the viewership data for the two clients produces two data frames (test and inference) for each of the three objective functions. That brings us to twelve distinct data frames feeding six models, all drawing from similar sources and producing similar output. Notice that not all features are used in every model, so this must be specifically accounted for in each individual model’s creation via DBT. This, however, is inefficient for a few reasons:
- Feature Definition Drift — As the same features get recreated for each of these models, there are no mechanisms in place for the definition of that feature to remain consistent across models
- Repetition — As new clients get introduced, the process of recreating these exact models for those clients must be repeated.
- Model Upkeep — As the data changes or new depth on the target variable is exposed over time, the smallest change must be manually implemented across all models one by one.
The feature store solves these issues by storing all features in one place for all data sources to feed into and all models to draw from.
We are now able to address the above concerns by centralizing feature development in one location. All features, no matter the use case, live in the feature store. Any model can simply “check out” any feature it likes from the store by joining on the row-level identifier (customer). The beauty of this design doesn’t stop there; any client added henceforth simply follows the same logic its predecessors went through to create the features. Since the raw viewership data coming into the database has an identical structure, if we onboard a new “Client 3” we can apply the same logic as for existing clients and generate the same features for Client 3 with relatively little effort. This also enables us to quickly build models (test and inference) for these new clients.
This is useful for a number of reasons:
- Feature development work can be easily accessed by other data scientists
- Computation of features is now automated for all use cases
- Training and inference datasets are consistent in features
- Replication of models across different use cases is quickly scalable
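As a sketch of this “check out” pattern (the table contents and the helper function below are illustrative, not Endeavor’s production code), each feature table maps the row-level identifier to feature values, and assembling a dataset is just a join on that key:

```python
# Minimal sketch of "checking out" features from a feature store.
# Feature tables are dicts keyed by the common identifier (customer).
# All names and values here are invented for illustration.

feature_minutes = {
    "cust_1": {"avg_session_length": 42.5, "distinct_days_active": 3},
    "cust_2": {"avg_session_length": 10.0, "distinct_days_active": 1},
}
feature_devices = {
    "cust_1": {"most_used_device": "iOS"},
    # cust_2 never streamed, so it does not appear in this table
}

def check_out(customer_ids, *feature_tables):
    """Join the requested feature tables on the customer identifier."""
    rows = []
    for cid in customer_ids:
        row = {"customer_exid": cid}
        for table in feature_tables:
            row.update(table.get(cid, {}))  # absent customers contribute nothing
        rows.append(row)
    return rows

training = check_out(["cust_1", "cust_2"], feature_minutes, feature_devices)
```

A model that needs only minutes-based features would simply pass fewer tables to the same helper; the feature definitions themselves are never duplicated.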
The entries in the feature store are still SQL features extracted via DBT models, each living as a feature table. Each feature table holds several attributes about the customer, all at the customer level. Devices Used by Customer (feature_devices), for example, could contain features that describe the customer, like multi_device_user, etc. These are all features that can now be chosen to help predict whichever objective function you are developing. Once the features exist, we just need to join them together based on our common identifier (customer).
While data from the unique clients serviced by Endeavor Streaming exists in the same feature store, it becomes federated during model development simply by filtering the feature store by client. In the UFC example, the machine learning models can only be developed after UFC-specific training sets are generated, thereby guaranteeing no cross-client data leakage. Data security is built into the design and proves to be another benefit of the feature store here at Endeavor Digital.
We want to get all tables into a common row-level aggregation; for the work with Endeavor Streaming, we boil things down to the customer level, denoted by CUSTOMER_EXID. Aggregations of the viewership session start and end times can be very useful and are worthy of a feature table in our store.
WITH minutes AS (
    SELECT customer_exid,
        DATEDIFF('minute', session_start, session_end) AS session_length,
        DATE(session_start) AS session_date
    FROM viewership
)
SELECT customer_exid,
    AVG(session_length) AS avg_session_length,
    COUNT(DISTINCT session_date) AS distinct_days_active
FROM minutes
GROUP BY customer_exid
Table 2: Feature Table of Minutes Related Data (feature_minutes)
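To make the aggregation above concrete, here is a runnable sketch of the feature_minutes logic using Python’s built-in SQLite instead of Snowflake (SQLite has no DATEDIFF, so minutes are computed via julianday arithmetic; the sample rows are invented):

```python
import sqlite3

# Sketch of the feature_minutes aggregation on sample viewership data.
# Snowflake's DATEDIFF('minute', ...) is emulated with julianday arithmetic.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE viewership (
        customer_exid TEXT,
        session_start TEXT,
        session_end   TEXT
    );
    INSERT INTO viewership VALUES
        ('cust_1', '2023-04-01 20:00:00', '2023-04-01 20:30:00'),
        ('cust_1', '2023-04-02 21:00:00', '2023-04-02 21:10:00');
""")
rows = con.execute("""
    WITH minutes AS (
        SELECT customer_exid,
            (julianday(session_end) - julianday(session_start)) * 24 * 60
                AS session_length,
            DATE(session_start) AS session_date
        FROM viewership
    )
    SELECT customer_exid,
        AVG(session_length) AS avg_session_length,
        COUNT(DISTINCT session_date) AS distinct_days_active
    FROM minutes
    GROUP BY customer_exid
""").fetchall()
```

The two sample sessions (30 and 10 minutes, on different days) roll up to one row per customer, exactly the grain the feature store expects.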
And conversely, we can do the same for our knowledge of what devices they use to stream.
SELECT customer_exid, MODE(device) AS most_used_device
FROM viewership GROUP BY customer_exid
Table 3: Feature Table of Device Related Data (feature_devices)
Because Endeavor Streaming’s data streams are consistent in their structure, we can rely on these feature tables to compute these features automatically once they are created.
With Churn, we could have a list of a subset of historic customers (represented as CUSTOMER_EXID) along with an indicator of whether they have churned (cancelled their subscription) or not (i.e., they are currently active).
Table 4: Customer Table of Churn Status (churn_status)
Our goal with this objective function is to find quality predictors that best classify a customer’s status as active or expired. While that part of the magic happens in the actual machine learning, we first need to create a dataset. The beauty of the feature store is that we can very quickly assemble a training dataset from the tables we already have. We do not need a complicated query to build this table; we just need to join together all the features we want.
SELECT A.customer_exid, A.status,
    COALESCE(B.avg_session_length, 0) AS avg_session_length,
    COALESCE(B.distinct_days_active, 0) AS distinct_days_active,
    COALESCE(C.most_used_device, 'Invalid') AS most_used_device
FROM churn_status A
LEFT JOIN feature_minutes B ON A.customer_exid = B.customer_exid
LEFT JOIN feature_devices C ON A.customer_exid = C.customer_exid
Table 5: Training Dataset to Predict Churn Status (ToTrain_churn_status)
As you can see, we now have a dataset on which we can make predictions of status, using all our features as predictors at the CUSTOMER_EXID level. You may also notice something critical here: not all customers will appear in each of your feature tables. A customer who has an account but never viewed any content would never appear in the viewership table, or in any feature table that stems from it. As a result, you want to ensure you add COALESCE() to most if not all of your joins to account for these otherwise null values.
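A small runnable sketch (again using Python’s built-in SQLite rather than Snowflake, with invented sample rows) shows why those COALESCE() calls matter: a customer present in churn_status but absent from the feature tables still comes back with sensible defaults instead of nulls.

```python
import sqlite3

# Sketch of the churn training join. cust_2 has an account but no viewership,
# so without COALESCE the LEFT JOINs would surface NULLs for its features.
# The 'Invalid' sentinel follows the article's convention; data is invented.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE churn_status   (customer_exid TEXT, status TEXT);
    CREATE TABLE feature_minutes(customer_exid TEXT, avg_session_length REAL,
                                 distinct_days_active INTEGER);
    CREATE TABLE feature_devices(customer_exid TEXT, most_used_device TEXT);
    INSERT INTO churn_status    VALUES ('cust_1', 'active'), ('cust_2', 'expired');
    INSERT INTO feature_minutes VALUES ('cust_1', 20.0, 2);
    INSERT INTO feature_devices VALUES ('cust_1', 'iOS');
""")
rows = con.execute("""
    SELECT A.customer_exid, A.status,
        COALESCE(B.avg_session_length, 0) AS avg_session_length,
        COALESCE(B.distinct_days_active, 0) AS distinct_days_active,
        COALESCE(C.most_used_device, 'Invalid') AS most_used_device
    FROM churn_status A
    LEFT JOIN feature_minutes B ON A.customer_exid = B.customer_exid
    LEFT JOIN feature_devices C ON A.customer_exid = C.customer_exid
    ORDER BY A.customer_exid
""").fetchall()
```

Every customer in churn_status survives the join with a fully populated feature row, which is exactly what a training set needs.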
If another data scientist wants to predict the likelihood that a customer who bought a Pay-Per-View license buys one again, they can check out features that were already created in the churn use case for their objective function. This significantly reduces model development time and ensures definitions of features are common across models. Furthermore, when it comes time to productionize your models, you can pull from the feature store again when building your inference sets. In using a feature store, you optimize development of ML training and create a robust ecosystem for your data science team to move both effectively and efficiently.
Wanna chat more about data with the author? Contact John Thomas directly on LinkedIn!