Feature Stores for MLOps Dummies
Getting started with feature stores from a data engineer’s perspective
Having a large amount of data doesn’t necessarily mean Machine Learning (ML) models can be trained on it. According to a survey cited by Forbes, data scientists spend 80% of their time on data preparation, which involves transforming or scaling input columns (a.k.a. features) to make the data usable for Machine Learning. This process of enriching raw data into features that can be fed to ML models is called feature engineering.
The feature store is where these features are stored and organized, for the explicit purpose of being used either to train models (by data scientists) or to make predictions (by applications that have a trained model).
In this article, we will see:
- What is a Feature Store?
- What are the generations of ML platforms and the need for a Feature Store?
- What are some of the best frameworks to deploy Feature Store?
What is a Feature Store?
For a basic understanding, it can be said that feature stores are like data warehouses for data science. Their primary goal is to enable data scientists to short-circuit the time it takes to go from data ingestion to ML model training. The image below depicts the layout of a feature store.
A feature store is a central vault for storing documented, curated, and access-controlled features that can be used across different ML models across the organization.
A Feature Store is a data management system that manages and serves features to ML models, and acts as a data management layer for ML features. It ingests data from various sources and executes defined transformations, aggregation, validation, and other operations to create features. A feature store registers available features and makes them ready to be discovered and consumed by ML training pipelines.
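The register/ingest/serve pattern described above can be sketched with a toy in-memory store. This is a minimal illustration of the concept, not any real product's API; all names (`register`, `ingest`, `get_features`) are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFeatureStore:
    """Toy in-memory feature store: a registry of named, documented features."""
    _registry: dict = field(default_factory=dict)  # feature name -> description
    _values: dict = field(default_factory=dict)    # (feature, entity_id) -> value

    def register(self, name, description):
        # Registration is what makes a feature discoverable by other teams.
        self._registry[name] = description

    def ingest(self, name, entity_id, value):
        # Only registered (curated, documented) features may be written.
        if name not in self._registry:
            raise KeyError(f"unregistered feature: {name}")
        self._values[(name, entity_id)] = value

    def get_features(self, names, entity_id):
        # The same lookup serves training pipelines and online prediction.
        return {n: self._values.get((n, entity_id)) for n in names}


store = ToyFeatureStore()
store.register("avg_order_value", "Mean order value over the last 30 days")
store.ingest("avg_order_value", entity_id="user_42", value=53.10)
print(store.get_features(["avg_order_value"], "user_42"))
```

Real systems add persistence, access control, and separate offline/online storage, but the core contract — register once, ingest continuously, serve consistently — is the same.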
How does a feature store differ from a traditional data warehouse?
In the table below, we see an overview of the key differences between feature stores and data warehouses.
Data warehouses can be used to store pre-computed features, but they provide little functionality beyond that for ML pipelines. When data scientists need to create train/test data using Python, or when online features (for serving features to online models) are needed at low latency, we need a feature store. Some of the existing solutions for serving data at low latency are Amazon Redshift, HBase, Cosmos DB, etc.
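One concrete reason a warehouse query is not enough for training data is point-in-time correctness: when building a train/test set, each row must use the feature value that was known at the event's timestamp, or future information leaks into the model. A minimal sketch of that lookup, using only the standard library (the data and function name are invented for illustration):

```python
from bisect import bisect_right

# Timestamped history of one feature for one entity: (unix_ts, value), sorted by time.
history = [(100, 0.2), (200, 0.5), (300, 0.9)]
timestamps = [ts for ts, _ in history]


def value_as_of(event_ts):
    """Return the latest feature value known at event_ts (point-in-time lookup).

    Using a value computed *after* the event would leak future data into training.
    """
    i = bisect_right(timestamps, event_ts)
    return history[i - 1][1] if i else None


print(value_as_of(250))  # the value written at ts=200, not the later one at ts=300
```

Feature stores perform this "point-in-time join" across many features and entities when materializing offline training sets, and serve only the latest values from a low-latency store for online inference.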
What are the generations of ML platforms and the need for a Feature Store?
If we were inclined to attempt to solve an ML problem from scratch, we would essentially need four key pieces: code, an environment to execute it in, models, and data (features and labels).
The resulting outputs are predictions, which are ultimately what the business is interested in. We can classify different generations of ML platforms based on which of the four elements listed above they focus on:
Generation 1 is code- and environment-based: The focus is on writing code, collaborating, and making it easy to execute that code. Notebooks were, and continue to be, one of the main tools that data scientists use on a day-to-day basis. Love them or hate them, they have entrenched themselves in the ML landscape in a way that no other editor technology has. Although Gen 1 ML platforms have their use in development cycles, time has proven them to be poor systems for production work.
Generation 2 is model-based: The focus is on quickly creating and tracking experiments, as well as deploying, monitoring, and understanding models. On the surface, model-based solutions look great; however, cobbling together point solutions has its pitfalls: tools are difficult to integrate, troubleshooting is hard, and a team of experts is required.
Generation 3 is data-based: The focus is on the construction of features and labels — the truly unique aspect of most use cases — and automation of the rest of the ML workflow. The idea is that AI has advanced enough that we should be able to simply provide a set of training data to our platform, along with a small amount of metadata or configuration, and the platform will be able to create and deploy our use case into production in hours, reducing the need for coding, pipelining, and managing DevOps tools.
Register the features and relationships. Automate feature engineering. Collaborate with peers so we don’t have to recreate the wheel every time we need to transform data. Let the feature store figure out how to serve data for training and inference.
“Paradoxically, data is the most under-valued and de-glamorized aspect of AI”
The urgency and opportunities that lie in Data-First AI are also reinforced by Google, whose researchers concluded in a recent paper that data is the most under-valued and de-glamorized aspect of AI.
What are some of the best frameworks to deploy Feature Store?
There are many frameworks that help automate the entire feature engineering process and produce a large pool of features in a short period, for both classification and regression tasks. The MLOps community has a great comparison of the available feature stores. We can find it here. The leaders of the market are Feast, Hopsworks, AWS SageMaker Feature Store, and Google AutoML.
Michelangelo
Uber was one of the first big companies to publish the concept of a feature store. This was a set of services that helped users 1) create and manage shared features and 2) allow for unified references to both online and offline versions of a feature to help eliminate the need to reproduce code between offline training and online serving.
Feast
Feast is an open-source feature store jointly developed by Gojek and Google Cloud. Here’s a link to get started with Feast.
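In Feast, features are declared as code in a feature repository. A sketch of such a definitions file, closely following the pattern in Feast's quickstart (the entity, file path, and column names here are illustrative, and the exact API varies between Feast versions):

```python
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

# The entity the features are keyed on (illustrative name).
driver = Entity(name="driver", join_keys=["driver_id"])

# Batch source holding the raw feature data (illustrative path/columns).
driver_stats_source = FileSource(
    path="data/driver_stats.parquet",
    timestamp_field="event_timestamp",
)

# A feature view groups related features and binds them to the entity.
driver_stats_fv = FeatureView(
    name="driver_hourly_stats",
    entities=[driver],
    ttl=timedelta(days=1),
    schema=[
        Field(name="conv_rate", dtype=Float32),
        Field(name="avg_daily_trips", dtype=Int64),
    ],
    source=driver_stats_source,
)
```

After `feast apply` registers these definitions, training pipelines retrieve point-in-time-correct offline data and online models fetch the latest values through the same feature names.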
Hopsworks
Logical Clocks added a feature store as part of their Hopsworks framework. It mostly focuses on the offline training portion, but probably has the most clearly and simply presented architecture.
Databricks
In June 2021, Databricks released a public preview of their feature store implementation, which is supported on the Azure platform.
Google Cloud AutoML
Cloud AutoML is a suite of ML products that enables developers with limited ML expertise to train high-quality models specific to their business needs. It relies on Google’s state-of-the-art transfer learning and neural architecture search technology. Cloud AutoML leverages more than 10 years of proprietary Google Research technology to help ML models achieve faster performance and more accurate predictions.
Key offerings of Google AutoML are:
- Vertex AI — Unified platform to help you build, deploy and scale more AI models.
- AutoML Tabular — Automatically build and deploy state-of-the-art ML models on structured data.
- AutoML Image — Derive insights from object detection and image classification, in the cloud or at the edge.
- AutoML Video — Enable powerful content discovery and engaging video experiences.
- AutoML Text — Reveal the structure and meaning of text through machine learning.
- AutoML Translation — Dynamically detect and translate between languages.
FeatureTools
Featuretools is a framework to perform automated feature engineering. It excels at transforming temporal and relational datasets into feature matrices for machine learning. Featuretools integrates with the ML pipeline-building tools we already have.
Some other interesting solutions for feature engineering include AutoFeat, ExploreKit, OneBM, and TsFresh.
Check out this amazing article to understand which feature store is most suitable for your use case.
Conclusion
In this blog, we have understood the importance of feature stores, their evolution over time, and why we need them to empower the next generation of ML platforms. We also went through the best-in-class frameworks that utilize feature stores to simplify ML processes and reduce time to market by accelerating ML experiments.
References
Know more about Feature Stores:
- Are Feature Stores The Next Big Thing In Machine Learning?
- Feature Store vs Data Warehouse
- Feature Store 101
- The Best Feature Engineering Tools
Latest Advancements in MLOps:
- Databricks Announces the First Feature Store Co-designed with a Data and MLOps Platform
- Alteryx announces new AutoML product and Intelligence Suite
- Splice Machine Launches the Splice Machine Feature Store to Simplify Feature Engineering and Democratize Machine Learning
- Molecula raises $17.6 million for its AI feature store technology