Feature Stores: The Data Side of ML Pipelines

Published in riselab · Apr 6, 2021

We need a principled way of managing state in real-time ML pipelines.

Written by Sarah Wooders, Peter Schafhalter, and Joey Gonzalez

The RISE of Feature Stores

As more models are deployed in real-world pipelines, the recurring lesson is that data and data featurization matter above all else. The last generation of big data systems scaled ML to real-world datasets, and now feature stores are quickly emerging as a new frontier for connecting models to real-time data [1].

Keeping features up-to-date is critical for model accuracy, but expensive and hard to scale.

Feature stores, as the name implies, store features derived from raw data and serve them to downstream models for training and inference. For example, a feature store might store the last few pages a user browsed (i.e., a sliding window over the clickstream) as well as the current predicted user demographics (i.e., a model prediction), both of which would be high-value features for an ad targeting model.

Unfortunately, many feature stores being built today are Frankensteinian amalgamations of batch, streaming, caching, and storage systems.

In this post, we (1) define what feature stores are and how they are used today, (2) highlight some of the design limitations of the current generation of feature stores, and (3) describe how innovation in feature store designs could transform production machine learning by managing state across training and inference pipelines in a more principled way.

Background

Why feature stores?

A simple ML pipeline trains a model from a static dataset, then serves the model to respond to user inference requests.

Predictions are generated from model parameters and request data.
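To make this baseline concrete, here is a minimal sketch (our own illustration, not code from any particular system) of the static pipeline: the model is fit once on a fixed dataset, and each prediction depends only on the learned parameters and the features carried in the request itself.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Offline: train once on a static dataset (synthetic data stands in for real logs).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 4))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
model = LogisticRegression().fit(X_train, y_train)

# Online: each prediction uses only the model parameters and the request payload;
# nothing about the current state of the world is consulted.
def handle_request(request_features):
    return int(model.predict([request_features])[0])

print(handle_request([0.5, -0.2, 1.0, 0.3]))
```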

However, in order to adapt to a continuously changing world, modern ML pipelines need to make decisions that depend on real-time data [2]. For example, a model predicting ETA might use features like the recent order fulfillment times of a restaurant, or a content recommendation model could consider a user’s most recent clicks. Model training and inference therefore rely on real-time features derived from joining, transforming, and aggregating over incoming streams of data. Because the featurization step can be expensive, features need to be pre-computed and cached to avoid redundant computation and to meet tight prediction latency requirements.

Predictions also rely on features derived from live streams of data.
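As a toy illustration of the ETA example above (the feature name, window size, and functions are entirely hypothetical), the sketch below maintains a sliding-window average of a restaurant’s recent fulfillment times as events arrive and caches the result, so inference only performs a cheap lookup instead of recomputing the aggregate on every request.

```python
import time
from collections import defaultdict, deque

# Hypothetical feature: a 10-minute sliding-window average of order fulfillment
# times per restaurant, updated as events arrive and read back at inference time.
WINDOW_SECONDS = 600
events = defaultdict(deque)   # restaurant_id -> deque of (timestamp, seconds)
feature_cache = {}            # restaurant_id -> cached average

def on_fulfillment_event(restaurant_id, fulfillment_seconds, now=None):
    now = now or time.time()
    window = events[restaurant_id]
    window.append((now, fulfillment_seconds))
    # Drop events that fell out of the window, then refresh the cached feature.
    while window and window[0][0] < now - WINDOW_SECONDS:
        window.popleft()
    feature_cache[restaurant_id] = sum(v for _, v in window) / len(window)

def get_eta_features(restaurant_id):
    # Inference reads the precomputed value instead of recomputing the aggregate.
    return {"avg_fulfillment_seconds": feature_cache.get(restaurant_id, 0.0)}

on_fulfillment_event("r42", 480)
on_fulfillment_event("r42", 520)
print(get_eta_features("r42"))   # {'avg_fulfillment_seconds': 500.0}
```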

What are feature stores?

Feature stores are used to store and serve features across multiple branches of the pipeline, allowing for shared computation and optimizations. While different feature stores vary in their functionality, they typically manage the following:

  • Serving features to meet varying query latency requirements — Features are usually placed in both a fast “online store” (to query during inference) and a durable “offline store” (to query during training).
  • Making features composable and extensible — Once a feature is defined, it should be easy to connect it to downstream models, derive additional features from it, or redefine the feature’s schema or featurization function.
  • Maintaining features derived from real-time data — Maintaining features is resource-intensive, but stale features can negatively affect prediction performance.

Certain features (e.g., a 1-minute time window aggregate) are very sensitive to staleness and need to be ultra-fresh, while others (e.g., 30-day windows) may only need periodic batch updates. As the system that interfaces with both updates to and requests for features, feature stores are well positioned to optimize the tradeoffs between freshness, latency, and cost.
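To make the online/offline split and the freshness tradeoff concrete, here is a deliberately toy sketch (ours, not any particular feature store’s API): every feature write lands both in a fast in-memory map that serves inference queries and in an append-only log that training jobs can scan for historical values.

```python
import time

class ToyFeatureStore:
    """Illustrative only: each write goes to a fast online store (latest value
    per entity, served at inference time) and to an append-only offline log
    (full history, scanned when building training sets)."""

    def __init__(self):
        self.online = {}     # (feature_name, entity_id) -> latest value
        self.offline = []    # append-only log of (timestamp, feature, entity, value)

    def write(self, feature_name, entity_id, value, ts=None):
        ts = ts or time.time()
        self.online[(feature_name, entity_id)] = value
        self.offline.append((ts, feature_name, entity_id, value))

    def get_online(self, feature_name, entity_id):
        # Low-latency lookup used when serving predictions.
        return self.online.get((feature_name, entity_id))

    def get_training_rows(self, feature_name):
        # Historical values used to construct training datasets.
        return [(ts, e, v) for ts, f, e, v in self.offline if f == feature_name]

store = ToyFeatureStore()
store.write("recent_clicks", "user_7", ["page_a", "page_b"])
print(store.get_online("recent_clicks", "user_7"))
```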

Feature Stores Today: Challenges & Limitations

Many companies today have implemented feature stores internally to make features accessible to models deployed in production.

Example workloads for feature stores

Feature stores today are built atop existing streaming, batch, caching, and storage systems. While each of these systems solves challenging problems in isolation, their constraints are problematic for feature stores.

  • Batch processing systems like Spark enable complex queries over static datasets, but introduce excessive latency when serving features and trigger total recomputations when backfilling data.
  • Streaming systems such as Flink and Spark Streaming enable low-latency pipelines, but fall short when asked to maintain large amounts of state. Lambda architectures combine both batch and streaming systems, but result in costly duplicate computation and complex maintenance of both streaming and batch codebases.
  • Streaming databases with materialized views can offer the advantages of both rapid computation and storage, but they are difficult to adapt to arbitrary featurization operations. Their query latencies may also be too high for prediction serving.
  • In-memory key-value stores like Redis provide a fast way to access features, but these are typically difficult to update in a consistent manner and expensive to scale.

Many of the requirements for feature stores can be met with a combination of these systems. However, the resulting pipeline is rigid and hard to optimize end-to-end. For example, prioritizing featurization tasks based on their impact on overall prediction accuracy would require coordination between the data store receiving queries, the streaming system pushing live updates, and the batch processing system for processing historical data. Rather than awkwardly combining multiple compute engines with multiple databases to meet multiple latency targets, feature stores should take advantage of their access to incoming events and query patterns to optimize latency, compute cost, and prediction accuracy in a centralized way.

The Future of Feature Stores

We believe feature stores can offer centralized state management for ML pipelines, and we see exciting potential for:

  1. Lineage Management: Feature stores open the door to a new, data-centric abstraction for developing and tuning machine learning pipelines. The complexity of existing machine learning pipelines often makes it difficult to ensure basic reproducibility, apply pipeline changes, or perform optimizations across the pipeline. While meticulous versioning and synchronization can solve these problems to a certain extent, applying these techniques to constantly evolving datasets and pipelines is difficult to reason about. A data-centric view of pipelines (for example, treating data pipelines as materialized views) has the potential to introduce new abstractions which simplify the process of propagating data and operator changes.
  2. End-to-End Optimization: Feature stores are well positioned to enable new end-to-end optimizations across ML data pipelines. Current systems restrict computation to running in either an event-based or request-based manner, making it difficult to schedule tasks in a way that optimizes common metrics like prediction performance and cost. Practitioners should be able to configure their pipelines to optimize for cost savings (lazy computation/updates, approximate results), inference latency (eager computation), or overall prediction performance (updating the features with the most impact), as sketched after this list.
  3. Scalable State Management: Feature stores indicate the need to scalably maintain and persist state within ML pipelines. Real-time, production ML pipelines often need to maintain tens of millions of features derived from multiple, dense incoming streams of data. Feature sets may be too large to persist in memory or to update with every incoming stream event as a stream processing system would by default, but they also need to be updated more frequently than a batch processing system allows.
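As a speculative sketch of the kind of knob such end-to-end optimization might expose (the policy names and fields below are ours, not an existing API), a feature store could let practitioners declare, per feature, whether updates are applied eagerly on every event or lazily within a staleness budget:

```python
from dataclasses import dataclass

# Hypothetical per-feature update policies, sketching a knob a feature store
# could expose to trade freshness against compute cost.
@dataclass
class UpdatePolicy:
    mode: str                 # "eager" (update on every event) or "lazy" (batch)
    max_staleness_s: float    # freshness bound the store should try to honor

policies = {
    "clicks_1min_window": UpdatePolicy(mode="eager", max_staleness_s=5),
    "orders_30day_window": UpdatePolicy(mode="lazy", max_staleness_s=6 * 3600),
}

def should_update_now(feature_name, seconds_since_last_update):
    policy = policies[feature_name]
    return policy.mode == "eager" or seconds_since_last_update >= policy.max_staleness_s

print(should_update_now("clicks_1min_window", 2))     # True: eager features always update
print(should_update_now("orders_30day_window", 120))  # False: still within the staleness budget
```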

Conclusion

We’re actively studying the design of feature store systems, so let us know if you’re interested in staying up-to-date or collaborating!

If you’d like to get involved with our research, feel free to reach out to wooders@berkeley.edu.

Notes

[1] By “real-time data”, we are referring to data that needs to be processed in real time, both in the context of online prediction serving and maintaining data freshness for features.

[2] Updates for “real-time” data typically need to be on the order of seconds, but can vary between workloads.

Acknowledgments

Thank you to Manmeet Gujral, Gaetan Castelein, and Kevin Stumpf from Tecton, as well as Joe Hellerstein, Natacha Crooks, Simon Mo, Richard Liaw, and other members of the RISELab for providing feedback on this post.

References

  1. https://nchammas.com/writing/data-pipeline-materialized-view
  2. https://scattered-thoughts.net/writing/an-opinionated-map-of-incremental-and-streaming-systems
  3. https://doordash.engineering/2020/11/19/building-a-gigascale-ml-feature-store-with-redis/
  4. https://www.tecton.ai/blog/what-is-a-feature-store/
  5. https://netflixtechblog.com/system-architectures-for-personalization-and-recommendation-e081aa94b5d8
  6. https://huyenchip.com/2020/12/27/real-time-machine-learning.html#stream_pipeline
