Jim Dowling
Feature Stores for ML
3 min read · May 19, 2020



Feature Stores have become the key piece of data infrastructure for machine learning, connecting models to their data. They manage the whole lifecycle of features: from training models, to batch inference, to providing low-latency access to features for online applications performing online inference. This article introduces the Hopsworks Feature Store, the first open-source and only Python-native Feature Store.

What is a Feature Store for ML?

The Feature Store for machine learning is a feature computation and storage service that enables features to be registered, discovered, and used both in model training pipelines and in inference pipelines. Feature Stores typically store large volumes of historical feature data for model training and provide low-latency access to features for online applications. As such, they are typically implemented as a dual-database system: a low-latency online feature store (typically a key-value store or real-time database) and a scale-out SQL database that stores large volumes of feature data for training and batch applications.
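The dual-database idea can be illustrated with a toy sketch (this is not the Hopsworks API; all names here are illustrative): a single write updates both an append-only offline history, used to build training data, and an online store that keeps only the latest value per entity key for low-latency lookups.

```python
from collections import defaultdict

class MiniFeatureStore:
    """Toy sketch of a dual-database feature store (not the Hopsworks API)."""

    def __init__(self):
        self.offline = defaultdict(list)  # key -> [(timestamp, features), ...]
        self.online = {}                  # key -> latest features only

    def write(self, key, timestamp, features):
        # A single write keeps the two stores consistent with each other.
        self.offline[key].append((timestamp, features))
        self.online[key] = features

    def get_online(self, key):
        # Low-latency lookup used by online inference pipelines.
        return self.online[key]

    def get_history(self, key):
        # Full history used to create training data.
        return sorted(self.offline[key])

store = MiniFeatureStore()
store.write("user_42", 1, {"clicks_7d": 3})
store.write("user_42", 2, {"clicks_7d": 5})
print(store.get_online("user_42"))        # latest value only
print(len(store.get_history("user_42")))  # full history for training
```

In a real system the online half would be a key-value store such as RonDB and the offline half a scale-out analytical store, but the read/write pattern is the same.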

In Hopsworks Feature Store, feature data is written to Feature Groups. Feature Groups store the data in tables for offline and/or online access. In Hopsworks, offline features are stored either in an External Table (e.g., Snowflake, BQ, Redshift, Parquet/S3) or cached as Hudi Tables on S3. Online features are stored in RonDB, a distributed, highly available database previously known as MySQL Cluster. In Hopsworks, data is read using Feature Views, each a selection of features drawn from one or more Feature Groups.

Feature, Training, and Inference Pipelines

The Feature Store enables ML systems to be refactored from a single unwieldy, monolithic ML pipeline into three maintainable, independently schedulable pipelines, making systems more flexible and understandable:

  • feature pipelines that apply model-independent transformations to raw data to compute features that are then stored in the feature store;
  • training pipelines that read features from the feature store, optionally apply model-specific transformations, and store any trained models and experiment results in a model registry;
  • batch inference pipelines that read batches of features from the feature store, download the model from the model registry, and produce predictions that are stored in a sink (e.g., a database table or object store) to be consumed later by dashboards or operational systems;
  • online inference pipelines that read features from the feature store at low latency, optionally compute features from client-supplied input parameters, and build a feature vector that is used to make predictions with the model. Predictions are returned to the client.
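The factoring above can be sketched in a few lines of plain Python, with dicts standing in for the feature store and model registry; the pipeline and feature names are hypothetical, and the "model" is just a mean threshold for illustration:

```python
# Minimal sketch of the three-pipeline factoring (illustrative names only).
feature_store = {}
model_registry = {}

def feature_pipeline(raw_rows):
    # Model-independent transformation: compute a reusable feature.
    for row in raw_rows:
        feature_store[row["id"]] = {"amount_usd": row["amount_cents"] / 100}

def training_pipeline():
    # Read features, "train" a trivial threshold model, register it.
    amounts = [f["amount_usd"] for f in feature_store.values()]
    model_registry["fraud_v1"] = {"threshold": sum(amounts) / len(amounts)}

def online_inference_pipeline(entity_id):
    # Low-latency feature lookup plus prediction with the registered model.
    features = feature_store[entity_id]
    model = model_registry["fraud_v1"]
    return features["amount_usd"] > model["threshold"]

feature_pipeline([{"id": "tx1", "amount_cents": 100},
                  {"id": "tx2", "amount_cents": 900}])
training_pipeline()
print(online_inference_pipeline("tx2"))  # True: above the mean threshold
```

Because each pipeline only communicates through the feature store and model registry, each can be scheduled, scaled, and rewritten independently of the others.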

The online feature store enables online applications to enrich feature vectors with near real-time feature data before performing inference requests. The offline feature store can store large volumes of feature data that is used to create training data for model development or by batch inference pipelines.

Feature pipelines write to Feature Groups in Hopsworks. A Feature View is the selection of features used for training and prediction with a model. Training pipelines and batch inference pipelines read from a Feature View through the Offline API; online inference pipelines read from a Feature View through the Online API. Model-independent transformations happen in feature pipelines to maximise feature reuse. Model-specific transformations are performed in training and inference pipelines and must be consistent across both, to prevent training/serving skew.
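One common way to keep model-specific transformations consistent is to define the transformation function once and call the same function, with the same training-time statistics, from both the training and inference pipelines. A minimal sketch, with illustrative names and standardisation as the example transformation:

```python
def scale_amount(amount, mean, std):
    # Model-specific transformation (standardisation), parameterised by
    # statistics computed once, on the training data.
    return (amount - mean) / std

# Training pipeline: compute the statistics, then transform the features.
train_amounts = [10.0, 20.0, 30.0]
mean = sum(train_amounts) / len(train_amounts)
std = (sum((a - mean) ** 2 for a in train_amounts) / len(train_amounts)) ** 0.5
train_features = [scale_amount(a, mean, std) for a in train_amounts]

# Online inference pipeline: the SAME function and SAME statistics are
# applied to the incoming value, so both pipelines see identical inputs
# and training/serving skew cannot creep in.
serving_feature = scale_amount(20.0, mean, std)
print(serving_feature)  # 0.0, identical to the transformed training value
```

If the serving path instead re-implemented the transformation (or recomputed the statistics on serving traffic), the model would silently receive differently scaled inputs: exactly the skew this design rules out.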

Feature Store Properties

The Feature Store solves the following problems in ML pipelines:

  • it enables reuse of features across teams and projects by sharing feature pipelines;
  • it serves features at scale and with low latency to online applications;
  • it ensures the consistency of features between feature, training, and serving pipelines: features are engineered once and reused everywhere;
  • it ensures point-in-time correctness for features: when a prediction was made and its outcome arrives later, we need to be able to query the values of different features as they were at that point in the past.
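Point-in-time correctness amounts to an "as-of" join: for each labelled prediction event, select the latest feature value observed at or before the event time, so training data never leaks information from the future. A minimal sketch in plain Python (the feature name and timestamps are illustrative):

```python
# Feature history for one user: (timestamp, clicks_7d), oldest first.
feature_history = [
    (1, 3),
    (5, 8),
    (9, 2),
]

def as_of(history, event_time):
    """Return the latest feature value at or before event_time."""
    value = None
    for ts, feat in history:
        if ts <= event_time:
            value = feat  # keep the most recent value seen so far
        else:
            break         # any later value would leak the future
    return value

print(as_of(feature_history, 6))   # 8: the value written at t=5
print(as_of(feature_history, 20))  # 2: the latest value overall
```

Production feature stores perform this join at scale (e.g., with time-travel queries over table formats such as Hudi), but the correctness rule is the same as in this loop.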

References

This article was originally written in May 2020, and edited in February 2022. The original version is available on Logical Clocks’ Blog.


@jim_dowling CEO of Logical Clocks AB. Associate Prof at KTH Royal Institute of Technology Stockholm, and Senior Researcher at RISE SICS.