Some of you may not be familiar with the concept of an ‘enterprise feature store’ in data science, but I am fairly sure you have either interacted with one or designed something like it to meet the requirements of a project.
Let’s take an example. One of the most common use cases in the machine learning world is building recommender systems. Users’ engagement data from various user-facing frontend applications is sent through streaming platforms such as Kafka or Kinesis, processed in real time by stream processors such as Spark or Flink, and written to an ‘online’ database. Historical and non-engagement data, such as user profiles (history), purchases, transactions and other domain-specific data, is available in the data warehouse or data lake. This ‘offline’ storage is populated from enterprise data sources, such as Oracle for transactional data or Salesforce for leads and other marketing data, which are processed in regular batches and archived.
In the diagram above, the online database does not have to be NoSQL, but I lean towards a NoSQL database such as Redis, Cassandra or MongoDB because they provide key-value based retrieval, more flexible data storage options than an RDBMS, high throughput and very low response latency.
An online store (yellow in the diagram) is low-latency storage that holds fresh data and provides it to the serving layer, whereas an offline store (green in the diagram) is a typical warehouse holding a huge volume of data for batch training, operational analytics, reporting, archiving, etc. They are updated or synced by various data pipelines across the enterprise.
So, what’s new?
Here, imagine an enterprise feature store that not only provides storage functionality but also forms an essential component of your machine learning ecosystem. Some of the important reasons why you need a feature store are:
- ML Data Store: Production models in the serving layer and CI/CD/CT pipelines need consistent access to data without worrying about changes in the source data infrastructure. In effect, this decouples ML systems from traditional infrastructure.
- Re-Implementation: Typically in an enterprise, data scientists work on feature engineering and model building by going through their train-test cycle offline. When the model is ready for production, engineers have to reimplement that feature engineering for the production environment.
- Point-in-time Data: Models need point-in-time data for better accuracy and to avoid feature leakage during training, that is, training on feature values that were not yet known at the time of the event being predicted.
- Reusability: Feature engineering is one of the most important phases of the modelling process. Features created during data exploration, selected for their importance with respect to the target variable, or derived from domain-specific expertise stay siloed within one project, which leads to duplicated effort when the same features are engineered again in a different project.
- Productivity: The lack of a way to look up existing features, friction between engineers and scientists over feature lifecycle implementation, and longer time-to-market caused by strict serving-layer requirements all reduce the productivity of the ML team.
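To make the point-in-time idea concrete, here is a minimal sketch in plain Python. The feature name, the timestamps and the helper `point_in_time_value` are all made up for illustration; a real feature store would do this as a join over large tables, not a lookup in a list.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical history of one feature (purchases in the last 30 days) for a
# single user, sorted by the time each value became known.
feature_history = [
    (datetime(2023, 1, 1), 2),
    (datetime(2023, 1, 10), 5),
    (datetime(2023, 1, 20), 9),
]


def point_in_time_value(history, as_of):
    """Return the latest feature value recorded at or before `as_of`.

    Returns None when no value existed yet, so a training row never sees
    a value from its own future (feature leakage).
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of)
    return history[idx - 1][1] if idx > 0 else None


# Timestamps of the labelled events we want to build training rows for.
label_times = [
    datetime(2022, 12, 31),  # before any feature value exists
    datetime(2023, 1, 10),   # exactly when the second value was recorded
    datetime(2023, 1, 15),   # between the second and third value
]
training_values = [point_in_time_value(feature_history, t) for t in label_times]
print(training_values)  # [None, 5, 5]
```

Note that the row for 15 January gets the value 5, not the more recent 9, because 9 was not yet known at that time.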
A feature store is a component of the ML data ecosystem that runs data pipelines, stores features and consistently serves data for training and scoring.
Components of a Feature Store:
- Feature Registry
The feature registry is a critical component of a feature store. It provides search and discovery of the features available in the feature store and maintains a centralised repository of feature definitions and metadata. The data science team uses it as a common catalog and central interface to define, publish and share new features.
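As a rough illustration of what a registry does, the sketch below keeps feature definitions in memory and supports publishing and keyword search. The class names, fields and example features are my own assumptions and do not correspond to the API of any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureDefinition:
    """Metadata describing one feature (hypothetical schema)."""
    name: str
    entity: str
    dtype: str
    description: str
    owner: str
    tags: tuple = ()


class FeatureRegistry:
    """In-memory catalog of feature definitions."""

    def __init__(self):
        self._features = {}

    def publish(self, feature):
        # Refuse duplicates so that a name always means one definition.
        if feature.name in self._features:
            raise ValueError(f"feature {feature.name!r} is already registered")
        self._features[feature.name] = feature

    def get(self, name):
        return self._features[name]

    def search(self, keyword):
        """Return sorted names of features matching the keyword."""
        keyword = keyword.lower()
        return sorted(
            f.name
            for f in self._features.values()
            if keyword in f.name.lower()
            or keyword in f.description.lower()
            or any(keyword in tag.lower() for tag in f.tags)
        )


registry = FeatureRegistry()
registry.publish(FeatureDefinition(
    "user_purchases_30d", "user", "int",
    "Number of purchases in the last 30 days", "growth-team", ("purchase",),
))
registry.publish(FeatureDefinition(
    "session_click_count", "session", "int",
    "Clicks in the current session", "recsys-team", ("engagement",),
))
print(registry.search("purchase"))  # ['user_purchases_30d']
```

A second project looking for purchase-related features can find and reuse `user_purchases_30d` instead of engineering it again.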
- Operational Monitoring
A feature store monitors data correctness and data quality. Model drift and concept drift can be analysed by comparing the data on which the model was trained with the latest feature values. Feature stores can also provide interfaces to external systems that monitor model performance in a serving environment.
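One possible way to compare training data with the latest feature values is the Population Stability Index (PSI). This is my choice for illustration, not something a feature store prescribes, and the threshold of 0.25 used below is only a commonly quoted rule of thumb. The function name and the sample data are assumptions.

```python
import math


def population_stability_index(expected, actual, bins=10):
    """PSI between a training sample (`expected`) and a recent sample (`actual`).

    Buckets are equal-width over the range of the training sample; values
    outside that range are clipped into the first or last bucket.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature

    def bucket_shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor keeps empty buckets from producing log(0).
        return [max(c / len(values), 1e-6) for c in counts]

    e = bucket_shares(expected)
    a = bucket_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


training_values = list(range(100))               # values seen at training time
serving_same = list(range(100))                  # unchanged distribution
serving_shifted = [v + 50 for v in range(100)]   # distribution moved upwards

psi_same = population_stability_index(training_values, serving_same)
psi_shifted = population_stability_index(training_values, serving_shifted)
print(psi_same, psi_shifted > 0.25)
```

An identical distribution gives a PSI of zero, while the shifted sample lands far above the rule-of-thumb threshold and would be flagged for investigation.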
- Transform
All ML applications need a data transformation pipeline that turns raw data variables into features. Feature stores manage and orchestrate these pipelines.
There are three types of transformations that mainly apply in the ML world:
1. Batch: data at rest, typically archived data in a data warehouse, such as user transaction history
2. Stream: data in motion, typically in a pub/sub engine, such as the number of clicks in the current session
3. On-Demand: data that is only available at request time from the frontend application and cannot be pre-computed, such as the user’s IP address
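The three transformation types can be sketched side by side as follows. Everything here (function names, event shape, the specific features) is hypothetical and only meant to show how the inputs differ: a full dataset, one event at a time, or a single request.

```python
import ipaddress
from collections import defaultdict


# 1. Batch: aggregate over data at rest, e.g. a full transaction history.
def batch_total_spend(transactions):
    totals = defaultdict(float)
    for user_id, amount in transactions:
        totals[user_id] += amount
    return dict(totals)


# 2. Stream: update the feature incrementally as each event arrives.
class SessionClickCounter:
    def __init__(self):
        self.clicks = defaultdict(int)

    def on_event(self, event):
        if event["type"] == "click":
            self.clicks[event["session_id"]] += 1
        return self.clicks[event["session_id"]]


# 3. On-demand: computed at request time from data only the request carries.
def is_private_ip(ip_address):
    return ipaddress.ip_address(ip_address).is_private


spend = batch_total_spend([("u1", 10.0), ("u2", 5.0), ("u1", 2.5)])

counter = SessionClickCounter()
for e in [
    {"session_id": "s1", "type": "click"},
    {"session_id": "s1", "type": "scroll"},
    {"session_id": "s1", "type": "click"},
]:
    clicks_so_far = counter.on_event(e)

print(spend)                          # {'u1': 12.5, 'u2': 5.0}
print(clicks_so_far)                  # 2
print(is_private_ip("192.168.1.10"))  # True
```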
- Storage
Feature stores provide both offline and online storage. Offline feature data is usually stored in a warehouse or data lake such as Redshift, Snowflake, S3, BigQuery or HDFS (Hive/Impala). Features used for inference, on the other hand, are stored in low-latency storage systems such as Redis, Cassandra, MongoDB, DynamoDB, Elasticsearch or Solr, which hold the latest values for the entities in the feature store.
- Serving
One of the key elements of a feature store is that it abstracts feature generation logic and processing away from consumers. For training, features are commonly accessed from Jupyter notebooks, for instance, and the feature store provides a point-in-time view of the features. For online serving, it provides a feature vector of the latest feature values, served from a low-latency database.
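The split between the two storage tiers, and the feature vector returned at serving time, can be sketched roughly like this. The classes and the `materialize` helper are illustrative assumptions; in practice the offline side would be a warehouse table and the online side a key-value database.

```python
from datetime import datetime


class OfflineStore:
    """Append-only history of feature values, used for training."""

    def __init__(self):
        self.rows = []  # (entity_id, feature_name, event_time, value)

    def write(self, entity_id, feature_name, event_time, value):
        self.rows.append((entity_id, feature_name, event_time, value))


class OnlineStore:
    """Key-value store holding only the latest value per (entity, feature)."""

    def __init__(self):
        self._latest = {}

    def upsert(self, entity_id, feature_name, event_time, value):
        key = (entity_id, feature_name)
        current = self._latest.get(key)
        # Keep the newest value even if rows arrive out of order.
        if current is None or event_time >= current[0]:
            self._latest[key] = (event_time, value)

    def get_feature_vector(self, entity_id, feature_names):
        """Return the latest values in the requested order (None if missing)."""
        return [
            self._latest.get((entity_id, name), (None, None))[1]
            for name in feature_names
        ]


def materialize(offline, online):
    """Copy the history into the online store, keeping only latest values."""
    for row in offline.rows:
        online.upsert(*row)


t1, t2 = datetime(2023, 1, 1), datetime(2023, 1, 2)
offline = OfflineStore()
offline.write("u1", "clicks_7d", t2, 7)   # newer value written first
offline.write("u1", "clicks_7d", t1, 3)   # older value arrives late
offline.write("u1", "avg_spend", t1, 12.5)

online = OnlineStore()
materialize(offline, online)
vector = online.get_feature_vector("u1", ["clicks_7d", "avg_spend", "not_registered"])
print(vector)  # [7, 12.5, None]
```

The offline store still holds all three rows for training, while the online store answers the serving request with one value per feature.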
Two very popular feature store platforms available in the market are:
- Feast, which stands for Feature Store, is an open source system for machine learning needs. It provides registry, storage and serving components. At the time of writing, it does not yet provide transformation pipelines, but it offers pluggable interfaces so that existing transformation pipelines can integrate with Feast.
- Tecton is a fully managed software-as-a-service platform built for the enterprise. It supports a transformation layer, and feature pipelines can be managed and orchestrated from within the platform.
However, various companies have also designed their own feature store implementations. A comparison of some of the popular feature stores is listed below:
More details about feature stores can be explored at https://www.featurestore.org/