Building Real-Time ML Pipelines with a Feature Store
Transition from batch to real time with an integrated feature store
The Next Stage of Feature Stores
The buzz around feature stores has increased in machine learning circles in the last few months, and the topic indeed deserves attention. The most painful challenge in the ML lifecycle is dealing with data — or in other words, feature engineering.
But feature engineering for batch and real time are two very different animals.
Real-time feature engineering is much more complex than its batch counterpart. And the business demand for real-time use cases is gaining momentum.
Feature stores must move beyond batch to keep up with the market demand. Since most feature stores are batch-oriented (they’re easier to develop!), next-level feature stores should be designed to address the growing challenge of calculating real-time features with low-latency requirements.
Those real-time use cases reach across industries. Some common examples are fraud detection, real-time product recommendation, predictive maintenance, dynamic pricing, voice assistants, chatbots and more. In this post I will revisit the concepts and the practices of building a real-time feature store.
If you’re not familiar with the concept of a feature store you can find an intro in my previous post:
What are Feature Stores and Why Are They Critical for Scaling Data Science?
What is a feature store?
First, let’s review the difference between real-time and batch feature stores:
Batch: Slow, suitable for training processes that can take hours
Real-time: Fast and time-sensitive; features must be calculated within a few milliseconds
Business characteristics of real-time use cases:
- Time sensitivity
- Customer-facing impact
- Adherence to business SLAs
- Low-latency access and processing
- Scalability: supporting lots of requests and processing
- Access control
- An easy API for creating and accessing features
- Easy-to-use interface
The data scientist’s skill set centers on understanding data and creating complex algorithms to solve business problems; they don’t need to be data engineers. Hunting for and creating features is part of their job, but the features they create are for training models in a development environment ONLY. Once the model is ready to be deployed to production, the data scientists’ training code is not ready to run in a real-time production environment. Data engineers step in to rewrite the features to make them production-ready. This is a key part of the MLOps (machine learning operationalization) process.
This siloed process creates longer development cycles and introduces the risk of training-serving skew. Those code changes can (and often do) result in a less accurate model in production.
Real-time pipelines also require an extremely fast event processing mechanism while running complex algorithms to calculate features in real time. Industries like Finance or AdTech need application response times in the range of milliseconds.
Rapid real-time event processing requires the right design and the right set of tools. ML teams cannot use the same tools for real-time processing as they do for training (e.g. Spark).
For example, a very common feature used to identify fraudulent transactions is the Z-score of the “Payment” field of a transaction. The Z-score is calculated by subtracting the mean (average) from the value and dividing the result by the standard deviation. Doing this as a batch process, even if it takes many hours, is not hard. But the operational challenge of calculating it in real time over a huge number of transactions per second, while also keeping the feature up to date for fast serving, is significant.
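To make the real-time side of this concrete, here is a minimal sketch of keeping a z-score feature up to date per event using Welford’s online algorithm, in plain Python with no framework. The class name is illustrative, not part of any feature store API.

```python
from math import sqrt

class RunningZScore:
    """Incrementally tracks mean and variance (Welford's online algorithm)
    so a z-score can be computed per event without rescanning history."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> float:
        """Fold in a new payment amount and return its z-score."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        if self.n < 2:
            return 0.0  # not enough data for a meaningful score
        std = sqrt(self.m2 / (self.n - 1))  # sample standard deviation
        return (x - self.mean) / std if std > 0 else 0.0
```

Each update is O(1) in time and memory, which is exactly the property a real-time pipeline needs: the feature stays current without recomputing over the full transaction history.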
Use a Feature Store as a Data Transformation Service for Both Training and Serving
To tackle the challenges above, we need a very powerful and fast data transformation service. A feature store is not simply a convenient catalog of features with a nice management interface; it’s fundamentally a transformation service designed to solve a complex problem of feature engineering. Ideally, it should be able to handle real-time feature engineering.
The desired approach for feature engineering, real-time or batch, is to have one single logic to generate features for training and serving. The concept should be to build it once and then use it for both offline training and online serving.
We need a unified feature store for both the training and serving layers.
Typically, the architecture of machine learning pipelines consists of two layers: one for training and one for serving, with two different engines managing features. In the common machine learning lifecycle, model training is done with an “offline” feature store, and inference runs in real time using a separate online store. Having two different engines can lead to training-serving skew and therefore poor business outcomes. So a key advantage of a modern feature store is its ability to unify the logic of generating features for both training and serving, ensuring that features are calculated the same way in both layers.
Real-time Feature Store Must-haves
There are a few showstoppers for teams who want to do real-time feature engineering.
First, some context:
There are two dimensions for feature calculation: frequency and complexity.
Frequency determines how often a given feature is updated: either in real time (event-driven) or in batch (scheduled).
Complexity describes the computational scope required to generate the feature.
For example, calculating a simple arithmetic function on a single record versus running an entire data pipeline. Features can be both real-time and complex, such as updating the analytical aggregation “standard deviation of the current price vs. the average monthly price” on every individual transaction.
Consider the following aspects when building a real-time service with low-latency response time:
- Access latency: how long does it take to query the last value of a feature?
- Feature calculation: how long does it take to calculate a feature?
Real-time processing also means that streaming becomes a first-class citizen, as it is the primary source of real-time events. However, in addition to a streaming engine (e.g. Kafka), a few essential building blocks are needed to process the events and provide the required functionality for managing features in real time:
- Streaming framework as the event source (e.g. Kafka, Azure Event Hubs, Amazon Kinesis, Iguazio streaming)
- High-speed serverless functions for reading and processing events, such as Nuclio or Apache Flink
- Fast queuing framework serving as a bus between computation processes running via serverless functions (e.g. Kafka or a v3io stream)
- Fast key-value database such as Redis, Iguazio KV, Cassandra, etc.
- Event processing engine: a framework for calculating features over events in real time with strict performance requirements and built-in HA. This processing engine enables feature extraction in real time and at scale, e.g. Iguazio Storey (an async streaming library) or Apache Beam.
- Transformation service supporting:
- Built-in aggregations: Calculate functions such as min, max, std, count, etc.
- Sliding window: Calculate data over a given time period, updated every x minutes (e.g. the average number of clicks in the last hour)
- Grouping: Group events by a certain field
- Joins: Enrich data with other sources, with an option to do an “as-of” join to handle timestamp mismatches
- Custom function: Add custom logic to the feature calculation (e.g. multiply a given field by x)
- Filter: Filter events based on a given rule
- Libraries (optional): Leverage a set of predefined libraries for common tasks such as null removal, geocoding, time conversion, etc.
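Two of the transformations above, sliding windows and as-of joins, can be illustrated with pandas (a stand-in here; a feature store would expose them through its own SDK). The click and exchange-rate data below are hypothetical.

```python
import pandas as pd

# Hypothetical click events, one row per event.
clicks = pd.DataFrame({
    "ts": pd.to_datetime(["2024-01-01 10:00", "2024-01-01 10:20",
                          "2024-01-01 10:40", "2024-01-01 11:30"]),
    "clicks": [10, 20, 30, 40],
})

# Sliding window: each row gets the mean of all events in the
# preceding hour (window keyed on the timestamp column).
clicks["clicks_1h_avg"] = clicks.rolling("1h", on="ts")["clicks"].mean()

# As-of join: enrich each event with the latest known exchange rate
# at or before the event time, tolerating the timestamp mismatch.
rates = pd.DataFrame({
    "ts": pd.to_datetime(["2024-01-01 09:00", "2024-01-01 11:00"]),
    "rate": [1.10, 1.20],
})
enriched = pd.merge_asof(clicks.sort_values("ts"), rates.sort_values("ts"),
                         on="ts", direction="backward")
```

The batch version of these operations is a few lines of pandas; the hard part a feature store solves is running the same logic incrementally, per event, under millisecond latency.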
The diagram below shows a real-time pipeline that receives streaming events, runs a function for pre-processing, captures the output as a stream and then creates features through various steps via an async event processing library. This library is built into the feature store, enabling real-time event processing. The feature store provides a layer of abstraction for data scientists: instead of writing complex SQL queries or Spark jobs, they can leverage a simple SDK to create aggregations, join with other datasets or transform the data without any heavy lifting of data engineering.
The calculated features are stored in both the real-time database and the historical database for training and serving.
Note that in order to calculate those events in real time, this library handles async events, processes the data in the memory and provides a way to create a graph that comprises various steps.
Iguazio developed an async event processing library called Storey. Here is an example of such a real-time graph:
Here’s a code example that takes a stream of events, enriches it, groups data by key and creates several aggregations and eventually writes the result to the feature store as a stream and Parquet file.
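As a rough illustration of the shape of such a pipeline, here is a minimal sketch in plain Python with asyncio. The stage names, event schema and enrichment rate are hypothetical assumptions, not Storey’s actual API; a production pipeline would also persist the results to an online stream and to Parquet.

```python
import asyncio
from collections import defaultdict

async def source(events, out):
    """Feed raw events into the pipeline, then signal end of stream."""
    for e in events:
        await out.put(e)
    await out.put(None)  # sentinel: no more events

async def enrich_group_aggregate(inp, results):
    """Enrich each event, group by key, and maintain running aggregations."""
    stats = defaultdict(lambda: {"sum": 0.0, "count": 0})
    while (e := await inp.get()) is not None:
        e["amount_eur"] = e["amount"] * 0.9          # enrichment (assumed FX rate)
        s = stats[e["user"]]                         # group by key
        s["sum"] += e["amount_eur"]
        s["count"] += 1
        results[e["user"]] = {**s, "avg": s["sum"] / s["count"]}

async def run(events):
    q = asyncio.Queue()  # the bus between pipeline steps
    results = {}
    await asyncio.gather(source(events, q), enrich_group_aggregate(q, results))
    # A real feature store would write `results` to both an online
    # stream/KV target and an offline Parquet target at this point.
    return results
```

The queue plays the role of the fast bus between computation steps, and the consumer holds per-key state in memory, the same pattern an async event processing library applies at much larger scale.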
Deliverables for Feature Stores
A typical feature store uses two types of data sources for storing features:
- A key-value database (e.g. Redis, Cassandra, Iguazio KV) for fast feature retrieval during online serving.
- A historical database or data store for fetching features for offline training as a batch operation, usually a database or a columnar format such as Parquet files.
However, we can go beyond those typical cases and add additional target stores for various purposes:
- Time series table: Features stored in a time-series database can be used to build real-time dashboards showing live features and statistics.
- Streaming: A real-time pipeline may consist of a feature calculation that is actually served as an input for another step in the pipeline. In this case, storing the feature in a stream is more efficient than storing it in a database.
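The dual-write pattern behind the two core targets can be sketched in a toy form: a dict stands in for the key-value store (latest value per key) and CSV rows stand in for the append-only Parquet history. The class and field names are illustrative.

```python
import csv
import io

class DualTargetWriter:
    """Illustrative dual-target writer: the online store keeps only the
    latest value per entity key for low-latency lookups, while the
    offline store keeps the full history for training (CSV rows stand
    in here for a Parquet file)."""

    def __init__(self):
        self.online = {}              # stand-in for Redis/Cassandra/KV
        self.offline = io.StringIO()  # stand-in for a Parquet target
        self._writer = csv.writer(self.offline)
        self._writer.writerow(["key", "feature", "value"])

    def write(self, key, feature, value):
        self.online[(key, feature)] = value           # overwrite: latest only
        self._writer.writerow([key, feature, value])  # append: full history
```

The key design point is that one write call fans out to both targets, so the serving layer and the training layer always see features computed by the same logic.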
Monitoring and Re-training Features
The Importance of Integrating a Feature Store with Monitoring and Training Processes
Model deployments are not finished when you deploy your model for the first time. Models in production need to be monitored on an ongoing basis, as their predictions may become less accurate over time. This is called model drift, and one of its causes is data drift.
Sometimes the data that was used to train the model is no longer similar to the data in production, leading to model drift. What if we could capture the feature vectors and predictions, along with their statistics, in real time and then store them back in the feature store as a “production” feature set?
We get three things by doing that:
1. Identify feature drift in real time by comparing the statistics of the trained features and the statistics of the actual features.
2. Re-train our model with fresh data. Storing the feature vector along with the predictions in the feature store for training purposes means the most recent production data is always ready for re-training.
3. Generate a real-time dashboard showing the feature vector and its statistics for monitoring and troubleshooting purposes.
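The first point above, comparing training statistics with production statistics, can be sketched as a simple check. This is a deliberately crude drift signal (a real system would use tests such as Kolmogorov-Smirnov or PSI), and the function names and threshold are assumptions.

```python
from statistics import mean, stdev

def drift_score(train_values, prod_values):
    """How many training standard deviations the production mean has
    moved away from the training mean. Crude, but cheap to compute on
    a stream of captured feature values."""
    mu, sigma = mean(train_values), stdev(train_values)
    if sigma == 0:
        return float("inf") if mean(prod_values) != mu else 0.0
    return abs(mean(prod_values) - mu) / sigma

def has_drifted(train_values, prod_values, threshold=3.0):
    """Flag a feature whose production distribution has shifted."""
    return drift_score(train_values, prod_values) > threshold
```

Because the “production” feature set lives in the same feature store as the training set, both sides of this comparison come from one place, which is exactly what makes real-time drift detection practical.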
Real-time feature engineering is crucial for any modern feature store. To address the complex challenges of creating and managing real-time features, there are certain capabilities and frameworks that must be part of such a solution.
The feature store is not stand-alone functionality. Rather, integration with other parts of the machine learning pipeline, monitoring and training in particular, is key to a comprehensive solution that can substantially ease the process of deploying new machine learning applications.