The Basics of a Feature Store
Over the last year or so, the feature store has been gaining popularity in the world of artificial intelligence and machine learning. But what is a feature store? It may seem hard to define, as the trendy term gets thrown around whenever the optimization of the machine learning workflow comes up.
It is exactly what it sounds like: a storage system for the data used to train and serve models. Think of a refrigerator, bank, or granary, but for features. A feature store can turn raw data into features that models can read, keep track of that data for later, and serve it to machine learning platforms.
The feature store is becoming an important aspect of taking machine learning projects to production. It allows data scientists to reuse features within their organization and increases the efficiency of their feature engineering efforts. Similar to a model registry, the feature store is centralized and scalable, enabling AI-driven enterprises to accelerate the deployment of their machine learning models.
Feature stores are most effective for machine learning projects where models are constantly evolving. Engineered features are taken from feature stores and fed into machine learning platforms such as InfinStor, where data scientists can begin training models and running inference. The versions of a feature used in an experiment can be tracked by an MLOps platform like MLflow. Used in conjunction with a feature store, MLflow provides complete provenance of the experiments, including feature data.
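To make the idea of tracking feature versions per experiment concrete, here is a minimal sketch using only the Python standard library. It does not use MLflow's API or any feature store's API. The function name, the feature names, and the version numbers are all hypothetical. It simply derives a short, deterministic fingerprint from the feature versions an experiment used, which could then be recorded next to the run in whatever tracking tool is in place.

```python
import hashlib
import json


def feature_set_fingerprint(feature_versions):
    """Return a short, deterministic ID for a {feature name: version} mapping."""
    # Sorting the keys makes the result independent of dictionary order.
    canonical = json.dumps(feature_versions, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Hypothetical feature versions used by three experiment runs.
run_a = {"order_count": 3, "total_spend": 1}
run_b = {"total_spend": 1, "order_count": 3}  # same versions, different order
run_c = {"order_count": 4, "total_spend": 1}  # one feature was re-engineered

fingerprint_a = feature_set_fingerprint(run_a)
fingerprint_b = feature_set_fingerprint(run_b)
fingerprint_c = feature_set_fingerprint(run_c)
```

Runs that share a fingerprint were trained on the same feature versions, so a change in model quality between them has to come from somewhere else, such as the code or the hyperparameters.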
Features can be a sensitive or critical dataset, so strong authorization protection is necessary in an enterprise setting. InfinStor’s enterprise-grade MLflow service offers appropriate access control and authorization semantics for the protection of data, along with high availability and scalability.
Definition of a Feature
Features are variables that data scientists use as input for machine learning models. They are derived from raw data. These variables are interpreted by models and used to make predictions.
The quality of a model and its predictions is directly influenced by the quality of the features it is fed. These inputs are commonly in a numeric vector format, known as a feature vector.
For models to understand the data, the raw data must be converted to feature vectors through feature engineering. In most use cases involving feature engineering, the raw data is structured or semi-structured, which many ML platforms can handle. For raw unstructured data, InfinStor can come in handy.
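As an illustration of what feature engineering produces, the sketch below converts one semi-structured raw record into a feature vector. The record, its field names, the list of plans, and the chosen features are all made up for this example. Real pipelines usually rely on a dataframe or ML library rather than plain Python.

```python
from datetime import date

# Hypothetical raw record; the field names are illustrative only.
raw_customer = {
    "signup_date": date(2021, 1, 15),
    "plan": "premium",
    "purchases": [19.99, 5.00, 42.50],
}

PLANS = ["free", "basic", "premium"]  # assumed category vocabulary


def to_feature_vector(record, as_of):
    """Convert one raw record into a flat list of floats."""
    purchases = record["purchases"]
    account_age_days = float((as_of - record["signup_date"]).days)
    purchase_count = float(len(purchases))
    mean_purchase = sum(purchases) / len(purchases) if purchases else 0.0
    # One-hot encode the categorical plan field.
    plan_one_hot = [1.0 if record["plan"] == plan else 0.0 for plan in PLANS]
    return [account_age_days, purchase_count, mean_purchase] + plan_one_hot


vector = to_feature_vector(raw_customer, as_of=date(2021, 3, 1))
```

The resulting vector has six entries: account age in days, number of purchases, mean purchase amount, and three one-hot slots for the plan. A model only ever sees this list of numbers, never the original record.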
How is a feature store used?
- The first step is converting raw data into features.
- Features are used to train models and create data sets.
- Models are deployed into production and inferencing begins.
- Experiment results and outputs are reengineered into features.
- Rinse and repeat.
The way raw data is currently converted into features is outdated, and it poses challenges for the scalability of a machine learning workflow that involves numerous features.
This is where a feature store comes in, as it optimizes this process. Features are engineered once and committed to a central repository for future use. This is convenient for AI-driven enterprises: much like a model registry, the store lets all data scientists within the organization reuse existing features and engineer new ones, so features are readily available for their machine learning projects.
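The "engineer once, reuse everywhere" idea can be sketched with a toy in-memory store. This is not the API of any real feature store product. The class, its method names, and the sample data are all hypothetical, and a production system would add persistence, versioning, and access control on top.

```python
class InMemoryFeatureStore:
    """Toy central repository: features are computed once and then reused."""

    def __init__(self):
        self._definitions = {}  # feature name -> function(raw_row) -> value
        self._values = {}       # (feature name, entity id) -> stored value

    def register(self, name, fn):
        """Add a feature definition; each name may be registered only once."""
        if name in self._definitions:
            raise ValueError(f"feature {name!r} is already registered")
        self._definitions[name] = fn

    def materialize(self, raw_rows):
        """Compute every registered feature for every raw row."""
        for entity_id, row in raw_rows.items():
            for name, fn in self._definitions.items():
                self._values[(name, entity_id)] = fn(row)

    def get_vector(self, entity_id, names):
        """Return stored values in the requested order, for training or serving."""
        return [self._values[(name, entity_id)] for name in names]


store = InMemoryFeatureStore()
store.register("order_count", lambda row: len(row["orders"]))
store.register("total_spend", lambda row: sum(row["orders"]))

# Hypothetical raw data keyed by user ID.
raw = {"u1": {"orders": [10.0, 20.0]}, "u2": {"orders": [5.0]}}
store.materialize(raw)

# Two different projects read the same stored features without recomputing them.
churn_inputs = store.get_vector("u1", ["order_count", "total_spend"])
ltv_inputs = store.get_vector("u2", ["total_spend"])
```

Both projects read values that were computed by a single shared definition, which is what keeps their inputs consistent with each other.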
Benefits of a feature store
With a feature store, data scientists can track and share features in a repository available to everyone in the organization. Because the store is centralized, there is minimal data leakage among the feature values. Features are readily available for serving as soon as they are engineered.
A feature store can shorten the time to market. It makes it simple to monitor all the data going in and out. Combined with tools like InfinSlice, it can also protect against data drift and other problems in the machine learning pipeline. A clean, organized feature store is also useful for data compliance. By enabling more accurate models, higher data quality, and a unified team of data scientists and engineers, feature stores are essential for the modern AI-driven enterprise.
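For readers unfamiliar with data drift, the sketch below shows one very simple way to flag it for a single numeric feature: measure how far the mean of the values seen at serving time has moved from the training mean, in units of the training standard deviation. This is only an illustration and says nothing about how InfinSlice or any feature store detects drift. The function name, the sample values, and the threshold are assumptions.

```python
from statistics import mean, stdev


def mean_shift_in_std(training_values, serving_values):
    """How far the serving mean moved, in units of training standard deviation."""
    spread = stdev(training_values)
    if spread == 0:
        raise ValueError("training values have no spread")
    return abs(mean(serving_values) - mean(training_values)) / spread


# Hypothetical values of one feature at training time and at serving time.
training = [10.0, 12.0, 11.0, 9.0, 13.0]
stable = [11.0, 10.0, 12.0]
shifted = [20.0, 21.0, 19.0]

DRIFT_THRESHOLD = 2.0  # assumed cutoff; it would need tuning per feature

stable_drifted = mean_shift_in_std(training, stable) > DRIFT_THRESHOLD
shifted_drifted = mean_shift_in_std(training, shifted) > DRIFT_THRESHOLD
```

Because a feature store keeps both the training values and the values being served in one place, a comparison like this can run against the stored data directly.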
Examples in the industry
In the industry, pioneers who have incorporated the feature store into their machine learning tools include Databricks, Tecton, Feast, Amazon SageMaker, and Hopsworks. Descriptions are taken from the respective websites as well as Feature Store for ML.
The Databricks Feature Store UI, accessible from the Databricks workspace, lets you browse and search for existing features and displays information about feature lineage — including data sources used to compute features and the models, notebooks, and jobs that use a specific feature.
Feature Factory (Databricks): Spark/AI Summit 2019
Tecton is a fully-managed feature store built to orchestrate the complete lifecycle of features, from transformation to online serving, with enterprise-grade SLAs.
Feast is an open-source, self-managed feature store built for serving pre-computed features for training and online inference.
Feast (GoJek): HasGeek TV Talk 2019
SageMaker Feature Store provides a unified store for features during training and real-time inference without the need to write additional code or create manual processes to keep features consistent.
The Hopsworks Feature Store is a data management system for managing machine learning features, including the feature engineering code and the feature data.
Hopsworks (Logical Clocks): Bay Area AI Talk 2019