Feature Stores for MLOps Dummies

Yash Chaurasia
Published in Globant · Jun 17, 2022

Getting started with feature stores from a data engineer’s perspective


Having a large amount of data doesn’t necessarily mean Machine Learning (ML) models can be trained on it. According to a survey reported by Forbes, data scientists spend 80% of their time on data preparation, which involves transforming or scaling input columns (a.k.a. features) to make the data usable for machine learning. This process of enriching raw data into features that can be fed to ML models is called feature engineering.
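As a toy illustration, the hypothetical snippet below (the column names are made up) aggregates a raw transactions table into per-customer features and scales one of them so it is usable by scale-sensitive models:

```python
import pandas as pd

# Raw transactions (hypothetical columns, for illustration only).
raw = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "amount": [20.0, 35.0, 15.0, 80.0, 5.0],
})

# Feature engineering step 1: aggregate raw rows into per-customer features.
features = raw.groupby("customer_id")["amount"].agg(
    total_spend="sum",
    avg_spend="mean",
    txn_count="count",
).reset_index()

# Feature engineering step 2: scale a column for scale-sensitive ML models.
features["total_spend_scaled"] = (
    features["total_spend"] - features["total_spend"].mean()
) / features["total_spend"].std()

print(features)
```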

The feature store is where these features are stored and organized for the explicit purpose of being used either to train models (by data scientists) or to make predictions (by applications that have a trained model).

In this article, we will see:

  • What is a Feature Store?
  • What are the generations of ML platforms and the need for a Feature Store?
  • What are some of the best frameworks to deploy Feature Store?

What is a Feature Store?

For a basic understanding, feature stores can be thought of as data warehouses for data science. Their primary goal is to let data scientists short-circuit the time it takes to go from data ingestion to ML model training. The image below depicts the layout of a feature store.

A Feature Store

A feature store is a central vault for storing documented, curated, and access-controlled features that can be shared across different ML models throughout the organization.

A feature store is a data management layer that manages and serves features to ML models. It ingests data from various sources and executes defined transformations, aggregations, validations, and other operations to create features. A feature store registers the available features and makes them ready to be discovered and consumed by ML training pipelines.
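To make those responsibilities concrete, here is a deliberately simplified, hypothetical sketch of the interface a feature store exposes; the class and method names are invented for illustration, and real systems such as Feast or Hopsworks differ in naming and scope:

```python
from typing import Callable

import pandas as pd


class ToyFeatureStore:
    """Minimal in-memory sketch of a feature store's responsibilities."""

    def __init__(self):
        self._registry = {}  # feature group name -> transformation function
        self._offline = {}   # feature group name -> full DataFrame (training)
        self._online = {}    # (feature group, entity key) -> row dict (serving)

    def register(self, name: str, transform: Callable[[pd.DataFrame], pd.DataFrame]):
        """Register a feature definition so it can be discovered and reused."""
        self._registry[name] = transform

    def materialize(self, name: str, raw: pd.DataFrame, key: str):
        """Run the registered transformation and load the offline/online stores."""
        df = self._registry[name](raw)
        self._offline[name] = df
        for row in df.to_dict(orient="records"):
            self._online[(name, row[key])] = row

    def get_historical_features(self, name: str) -> pd.DataFrame:
        """Full feature table for building train/test sets (offline path)."""
        return self._offline[name]

    def get_online_features(self, name: str, entity_key) -> dict:
        """Single-entity lookup at prediction time (online path)."""
        return self._online[(name, entity_key)]
```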

How does a feature store differ from a traditional data warehouse?

In the table below, we see an overview of the key differences between feature stores and data warehouses.

Key differences between a data warehouse and a feature store

Data warehouses can be used to store pre-computed features, but beyond that they do not provide much functionality for ML pipelines. When data scientists need to create train/test data using Python, or when online features (features served to online models) are needed at low latency, we need a feature store. Some of the existing solutions for serving data at low latency are Amazon Redshift, HBase, Cosmos DB, etc.
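For intuition, the online-serving side usually boils down to a low-latency key-value lookup per entity. The hypothetical snippet below uses Redis purely to illustrate that access pattern (it assumes a local Redis instance and is not tied to any particular feature store):

```python
import redis  # assumes a locally running Redis instance; purely illustrative

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# A batch or streaming job writes precomputed features keyed by entity ID.
r.hset("features:customer:42", mapping={"total_spend": 135.0, "txn_count": 5})

# At serving time, the online model does a single low-latency lookup per request.
features = r.hgetall("features:customer:42")
print(features)  # {'total_spend': '135.0', 'txn_count': '5'}
```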

What are the generations of ML platforms and the need for a Feature Store?

If we set out to solve an ML problem from scratch, we would essentially need the four key pieces shown below:

Four key pieces of ML

The resulting output is predictions, which is ultimately what the business is interested in. We can classify the different generations of ML platforms based on which of the four elements listed above they focus on:

Generation 1 is code- and environment-based: The focus is on writing code, collaborating, and making it easy to execute that code. Notebooks were, and continue to be, one of the main tools that data scientists use on a day-to-day basis. Love them or hate them, they have entrenched themselves in the ML landscape in a way that no other editor technology has. Although Gen 1 ML platforms have their use in development cycles, time has proven them to be poor systems for production work.

Generation 2 is model-based: The focus is on quickly creating and tracking experiments, as well as deploying, monitoring, and understanding models. On the surface, model-based solutions look great; however, cobbling together point solutions has its pitfalls: tools are hard to integrate, troubleshooting is difficult, a team of experts is required, and so on.

Generation 2: Model-based AI approach

Generation 3 is data-based: The focus is on the construction of features and labels (the truly unique aspect of most use cases) and on automating the rest of the ML workflow. The idea is that AI has advanced enough that we should be able to simply provide a set of training data to our platform, along with a small amount of metadata or configuration, and the platform will be able to create and deploy our use case into production in hours, reducing the need for coding, pipelining, and managing DevOps tools.

Generation 3: Data First AI approach

Register the features and relationships. Automate feature engineering. Collaborate with peers so we don’t have to reinvent the wheel every time we need to transform data. Let the feature store figure out how to serve data for training and inference.

“Paradoxically, data is the most under-valued and de-glamorized aspect of AI”

The urgency and opportunities that lie in Data-First AI are also reinforced by Google, who concluded in a recent paper that data is the most under-valued and de-glamorized aspect of AI.

What are some of the best frameworks to deploy Feature Store?

There are many frameworks that help automate the entire feature engineering process and produce a large pool of features in a short period, for both classification and regression tasks. The MLOps community has a great comparison of the different available feature stores, which we can find here. The market leaders are Feast, Hopsworks, AWS SageMaker Feature Store, and Google AutoML.

Michelangelo

Uber was one of the first big companies to publish the concept of a feature store, as part of its Michelangelo ML platform. This was a set of services that helped users 1) create and manage shared features and 2) reference both online and offline versions of a feature in a unified way, eliminating the need to reproduce code between offline training and online serving.

Uber’s feature store — Data preparation pipelines push data into the Feature Store tables and training data repositories. (source: https://eng.uber.com/michelangelo/)

Feast

Feast is an open-source feature store jointly developed by Gojek and Google Cloud. Here’s a link to get started with Feast.

How Feast works | Image by Feast
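The snippet below is a rough sketch in the spirit of Feast’s quickstart; the feature names and file path are hypothetical, and the exact API has shifted between Feast versions, so treat it as illustrative rather than copy-paste ready:

```python
from datetime import datetime, timedelta

from feast import Entity, FeatureStore, FeatureView, Field, FileSource
from feast.types import Float32, Int64

# Entity: the key that features are joined and served on.
driver = Entity(name="driver", join_keys=["driver_id"])

# Offline source: a hypothetical Parquet file of precomputed driver stats.
source = FileSource(
    path="data/driver_stats.parquet",
    timestamp_field="event_timestamp",
)

# Feature view: a named, versioned group of features registered in the store.
driver_stats = FeatureView(
    name="driver_hourly_stats",
    entities=[driver],
    ttl=timedelta(days=1),
    schema=[
        Field(name="conv_rate", dtype=Float32),
        Field(name="avg_daily_trips", dtype=Int64),
    ],
    source=source,
)

# Register the definitions, load recent values into the online store, and serve.
store = FeatureStore(repo_path=".")  # assumes a configured feature_store.yaml
store.apply([driver, driver_stats])
store.materialize_incremental(end_date=datetime.utcnow())

online_features = store.get_online_features(
    features=["driver_hourly_stats:conv_rate"],
    entity_rows=[{"driver_id": 1001}],
).to_dict()
```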

Hopsworks

Logical Clocks added a feature store as part of their Hopsworks framework. It mostly focuses on the offline training portion but probably has the most clearly and simply presented architecture.

Source: https://www.logicalclocks.com/feature-store/

Databricks

In June 2021, Databricks released a public preview of its feature store implementation, which is supported on the Azure platform.

Google Cloud AutoML

Cloud AutoML is a suite of ML products that enables developers with limited ML expertise to train high-quality models specific to their business needs. It relies on Google’s state-of-the-art transfer learning and neural architecture search technology. Cloud AutoML leverages more than 10 years of proprietary Google Research technology to help ML models achieve faster performance and more accurate predictions.

Dataset → AutoML (automatically search through Google’s whole model zoo) → Generate Predictions with a REST API call. Image Source, Cloud AutoML
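A rough sketch of that dataset → AutoML → prediction flow with the Vertex AI Python SDK (google-cloud-aiplatform) might look like the following; the project, bucket, and column names are placeholders, so check the current SDK documentation for exact arguments:

```python
from google.cloud import aiplatform

aiplatform.init(project="my-gcp-project", location="us-central1")

# 1. Dataset: point Vertex AI at tabular training data in Cloud Storage.
dataset = aiplatform.TabularDataset.create(
    display_name="churn_training_data",
    gcs_source=["gs://my-bucket/churn.csv"],
)

# 2. AutoML: automatically search architectures and train a model.
job = aiplatform.AutoMLTabularTrainingJob(
    display_name="churn_automl",
    optimization_prediction_type="classification",
)
model = job.run(dataset=dataset, target_column="churned")

# 3. Predictions: deploy to an endpoint and call it per request (REST under the hood).
endpoint = model.deploy(machine_type="n1-standard-4")
prediction = endpoint.predict(instances=[{"tenure": "12", "plan": "basic"}])
print(prediction)
```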

Key offerings of Google AutoML are:

  • Vertex AI — Unified platform to help you build, deploy and scale more AI models.
  • AutoML Tabular — Automatically build and deploy state-of-the-art ML models on structured data.
  • AutoML Image — Derive insights from object detection and image classification, in the cloud or at the edge.
  • AutoML Video — Enable powerful content discovery and engaging video experiences.
  • AutoML Text — Reveal the structure and meaning of text through machine learning.
  • AutoML Translation — Dynamically detect and translate between languages.

FeatureTools

Featuretools is a framework for performing automated feature engineering. It excels at transforming temporal and relational datasets into feature matrices for machine learning. Featuretools integrates with the ML pipeline-building tools we already have.

Source: https://www.featuretools.com/
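As an illustration, here is a small Deep Feature Synthesis run on Featuretools’ built-in mock customer dataset; argument names vary slightly across releases (older versions use target_entity instead of target_dataframe_name):

```python
import featuretools as ft

# Load Featuretools' built-in mock relational dataset (customers, sessions,
# transactions) as an EntitySet.
es = ft.demo.load_mock_customer(return_entityset=True)

# Deep Feature Synthesis: automatically build aggregate and transform features
# for each customer from the related tables.
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mean", "sum", "count"],
    trans_primitives=["month"],
    max_depth=2,
)

print(feature_matrix.head())
```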

Some other interesting solutions for feature engineering include AutoFeat, ExploreKit, OneBM, and TsFresh.

Check out this amazing article to understand which feature store is most suitable for your use case.

Conclusion

In this blog, we covered the importance of feature stores, their evolution over time, and why they are needed to power next-gen ML platforms. We also went through best-in-class frameworks that use feature stores to simplify ML processes and reduce time to market by accelerating ML experiments.
