What are Feature Stores and Why Are They Critical for Scaling Data Science?
A data science feature that everyone needs to be aware of.
What is a feature store?
If data is the new gold (overused, but true nonetheless), I would say that features are the gold bullion and need to be treated accordingly. To get to the gold, you need to do some digging and hard work, which is also true of finding the right features.
The process of creating features is called feature engineering, which is a complicated yet critical component of any machine learning process. Better features mean better models, which in turn lead to better business outcomes.
Generating a new feature takes a tremendous amount of work, and creating the pipeline that builds the feature is just one aspect. To arrive at that stage, you probably went through a long process of trial and error with a large variety of candidate features until you were happy with a single new one. Next, you need to calculate and store it as part of an operational pipeline, which differs depending on whether the feature is online or offline.
On top of that, every data science project starts with a search for the right features. The problem is that, for the most part, there is no single, centralized place to search; features are hosted everywhere. So first and foremost, a feature store provides a single pane of glass for sharing all available features. When data scientists start a new project, they can go to this catalog and easily find the features they are looking for. But a feature store is not only a data layer; it is also a data transformation service that enables users to manipulate raw data and store it as features ready to be used by any machine learning model.
Offline and online features
There are two types of features: online and offline.
Offline features: Some features are calculated as part of a batch job, for example the average monthly spend. They are mainly used by offline processes such as model training and batch inference. Given their nature, creating these types of features can take time. Offline features are usually calculated with frameworks such as Spark or by running SQL queries against a database.
Online features: These features are a bit more complicated, as they need to be calculated very fast and are often served with millisecond latency. An example is calculating a z-score for real-time fraud detection. In this case, the pipeline calculates the mean and the standard deviation over a sliding window in real time. These calculations are much more challenging, requiring fast computation as well as fast access to the data. The data can be stored in memory or in a very fast key-value database. The process itself can be performed on various services in the cloud or on a platform such as the Iguazio Data Science Platform, which has all of these components as part of its core offering.
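To make the z-score example concrete, here is a minimal sketch of a sliding-window calculation in plain Python. The class name SlidingZScore and its update method are hypothetical names chosen for illustration, not part of any particular platform. The sketch assumes a population standard deviation and scores each new value against the values already in the window; a production pipeline would likely keep this state in memory or in a key-value store per entity and use a more numerically robust variance update.

```python
from collections import deque
import math


class SlidingZScore:
    """Hypothetical helper: z-score of each new value against a sliding window."""

    def __init__(self, window: int):
        if window < 2:
            raise ValueError("window must be at least 2")
        self.window = window
        self.values = deque()
        self.total = 0.0     # running sum of the values in the window
        self.total_sq = 0.0  # running sum of squares of the values in the window

    def update(self, x: float) -> float:
        """Score x against the current window, then add x to the window."""
        score = 0.0
        n = len(self.values)
        if n >= 2:
            mean = self.total / n
            # Population variance; clamped at 0 to absorb rounding error.
            variance = max(self.total_sq / n - mean * mean, 0.0)
            std = math.sqrt(variance)
            if std > 0:
                score = (x - mean) / std

        # Slide the window: add the new value, evict the oldest if full.
        self.values.append(x)
        self.total += x
        self.total_sq += x * x
        if len(self.values) > self.window:
            old = self.values.popleft()
            self.total -= old
            self.total_sq -= old * old
        return score


# Illustrative amounts only: four ordinary transactions, then an outlier.
zscore = SlidingZScore(window=4)
for amount in [10, 12, 10, 12]:
    zscore.update(amount)
outlier_score = zscore.update(20)
```

Because the running sums are updated incrementally, each call does a constant amount of work regardless of window size, which is the property that matters when the feature must be served in milliseconds.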
[Figure: an example of an online and offline pipeline using a feature store, designed by Uber as part of their Michelangelo platform.]
● Faster development
Ideally, data scientists should focus on what they trained for and what they are best at: building models. However, they often find themselves spending most of their time on data engineering configurations. Some features are expensive to compute and require building aggregations, while others are pretty straightforward. But this really isn’t something that should concern data scientists or stop them from harnessing the best features for their models. Hence, the idea behind a feature store is to abstract all those engineering layers and provide easy access for reading and writing features.
As mentioned previously, online and offline features have different characteristics. Under the hood, offline features are built mostly with frameworks such as Spark or with SQL, and the resulting features are stored in a database or as Parquet files. Online features, in contrast, may require data access through the APIs of streaming engines such as Kafka or Kinesis, or of fast key-value or wide-column databases such as Redis or Cassandra.
Working with a feature store abstracts this layer, so that when data scientists look for a feature, instead of writing engineering code they can use a simple API to retrieve the data they need. It could be as simple as running the following:
df = feature_store.get("transaction_volume").filter_by(transaction_id)
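The one-liner above is illustrative rather than the API of a specific product. As a rough sketch of what could sit behind such a call, the following toy implementation maps feature names to loader functions that hide the storage backend. The FeatureStore and FeatureView classes, the register method, the keyword-based filter_by signature, and the sample rows are all assumptions made for this example.

```python
from typing import Callable, Dict, List


class FeatureView:
    """Rows of one feature, with a small filtering helper (hypothetical)."""

    def __init__(self, rows: List[dict]):
        self.rows = rows

    def filter_by(self, **conditions) -> List[dict]:
        # Keep only rows whose fields match every given condition.
        return [
            row for row in self.rows
            if all(row.get(key) == value for key, value in conditions.items())
        ]


class FeatureStore:
    """Toy facade: callers ask for a feature by name, not by backend."""

    def __init__(self) -> None:
        self._loaders: Dict[str, Callable[[], List[dict]]] = {}

    def register(self, name: str, loader: Callable[[], List[dict]]) -> None:
        self._loaders[name] = loader

    def get(self, name: str) -> FeatureView:
        if name not in self._loaders:
            raise KeyError(f"unknown feature: {name}")
        return FeatureView(self._loaders[name]())


def load_transaction_volume() -> List[dict]:
    # Stand-in for a Parquet read, a SQL query, or a key-value lookup.
    # The rows are made-up sample data.
    return [
        {"transaction_id": 1, "transaction_volume": 120.0},
        {"transaction_id": 2, "transaction_volume": 75.5},
    ]


feature_store = FeatureStore()
feature_store.register("transaction_volume", load_transaction_volume)
rows = feature_store.get("transaction_volume").filter_by(transaction_id=2)
```

The point of the design is that swapping the loader for a different backend does not change the code a data scientist writes to read the feature.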
● Smooth model deployment in production
One of the main challenges in implementing machine learning in production arises from the fact that the features used to train a model in the development environment are often not the same as the features in the production serving layer. Keeping a consistent feature set between the training and serving layers therefore makes deployment smoother and ensures that the trained model reflects the way things would work in production.
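One common way to get this consistency is to define each feature transformation once and call the same definition from both the training path and the serving path. The sketch below assumes this approach; the function names and the input fields amount and avg_monthly_spend are hypothetical.

```python
import math
from typing import Dict, List


def amount_features(txn: Dict[str, float]) -> List[float]:
    """Single feature definition shared by training and serving."""
    amount = txn["amount"]
    # Guard against division by zero for accounts with no spend history.
    usual_spend = max(txn["avg_monthly_spend"], 1.0)
    return [
        math.log1p(amount),    # log-scaled transaction amount
        amount / usual_spend,  # ratio of this amount to the usual spend
    ]


def build_training_matrix(history: List[Dict[str, float]]) -> List[List[float]]:
    """Offline path: apply the definition to every historical record."""
    return [amount_features(txn) for txn in history]


def serve_features(txn: Dict[str, float]) -> List[float]:
    """Online path: apply the same definition to one live request."""
    return amount_features(txn)
```

Since both paths call amount_features, a change to the transformation reaches training and serving together, which removes one source of training/serving skew.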
● Increased model accuracy
In addition to the actual features, the feature store keeps additional metadata for each feature, for example a metric that shows the impact of the feature on the model it is associated with. This information can help data scientists tremendously when selecting features for a new model, allowing them to focus on those that have had the most impact on similar existing models.
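As a rough illustration of how such metadata might be used, the sketch below stores a per-model impact score next to each feature and ranks the catalog by average impact. The FeatureMetadata fields, the top_features helper, and the idea of averaging scores across models are assumptions for this example; a real feature store may record and expose impact differently.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FeatureMetadata:
    name: str
    description: str
    owner: str
    # Hypothetical: model name -> importance score recorded after training.
    impact_by_model: Dict[str, float] = field(default_factory=dict)

    def mean_impact(self) -> float:
        scores = list(self.impact_by_model.values())
        return sum(scores) / len(scores) if scores else 0.0


def top_features(catalog: List[FeatureMetadata], k: int) -> List[str]:
    """Names of the k features with the highest average recorded impact."""
    ranked = sorted(catalog, key=lambda meta: meta.mean_impact(), reverse=True)
    return [meta.name for meta in ranked[:k]]
```

A data scientist starting a new fraud model could call top_features on the catalog to get a shortlist of candidates before doing any feature engineering of their own.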
● Better collaboration
As the old saying goes, sharing is caring! More and more new business services are based on machine learning, so the number of projects and features keeps growing rapidly. This makes it harder to maintain a comprehensive overview of the features available, since there are just so many. Instead of developing in silos, the feature store allows us to share our features, along with their metadata, with our peers. It is becoming a common problem in large organizations that different teams end up developing similar solutions simply because they are not aware of each other’s work. Feature stores bridge that gap and enable everyone to share their work and avoid duplication.
● Track lineage and address regulatory compliance
To meet guidelines and regulations, especially where the AI models serve industries such as healthcare, financial services and security, it is important to track the lineage of the algorithms being developed. Achieving this requires visibility into the overall end-to-end data flow, to better understand how the model generates its results. Because features are generated as part of that flow, the feature generation process needs to be tracked as well. In a feature store, we can keep the data lineage of a feature. This provides the tracking information that captures how the feature was generated, along with the insight and the reports needed for regulatory compliance.
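A minimal sketch of what feature lineage could look like in code follows, assuming each feature records its inputs, a description of the transformation, and the code version that produced it. The LineageRecord and LineageLog names and their fields are hypothetical, not the schema of any particular feature store.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class LineageRecord:
    feature: str
    inputs: Tuple[str, ...]  # upstream tables or features
    transformation: str      # human-readable description of the logic
    code_version: str        # for example, a Git commit hash


class LineageLog:
    """Hypothetical in-memory lineage registry."""

    def __init__(self) -> None:
        self._records: Dict[str, LineageRecord] = {}

    def record(self, rec: LineageRecord) -> None:
        self._records[rec.feature] = rec

    def trace(self, feature: str) -> List[str]:
        """Return every upstream source of a feature, depth first."""
        seen: List[str] = []

        def visit(name: str) -> None:
            rec = self._records.get(name)
            if rec is None:
                return  # a raw source with no recorded lineage
            for source in rec.inputs:
                if source not in seen:
                    seen.append(source)
                    visit(source)

        visit(feature)
        return seen
```

With records like these, answering an auditor's question such as "which raw tables feed this feature?" becomes a lookup instead of an investigation.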
Feature store and MLOps
MLOps is an extension of DevOps in which the idea is to apply DevOps principles to machine learning pipelines. Developing a machine learning pipeline is different from developing software, mainly because of the data aspect. The quality of the model is not based only on the quality of the code. It is also based on the quality of the data (i.e. the features) used for running the model. According to Airbnb, around 60%-80% of data scientists’ time goes into creating, training and testing data. Feature stores enable data scientists to reuse features instead of rebuilding them again and again for different models, thus saving valuable time and effort. Feature stores can also automate this process: feature pipelines can be triggered by code changes pushed to Git or by the arrival of new data. This automated feature engineering is an important part of the MLOps concept.
Some of the largest tech companies that deal extensively with AI have built their own feature stores (Uber, Twitter, Google, Netflix, Facebook, Airbnb, etc.). This is a good indication to the rest of the industry of how important a feature store is as part of an efficient ML pipeline.
Given the growing number of AI projects and the complexities associated with bringing these projects to production, the industry needs a way to standardize and automate the core of feature engineering. It is therefore fair to assume that the feature store is positioned to become a first-class citizen of any machine learning pipeline.