Feature Store: Challenges and Considerations

Ritesh Agrawal
Engineering @Varo
May 24, 2021

Authors: Ritesh Agrawal, Brandon Lee

At Varo, our mission is to help millions of Americans achieve financial well-being by building a bank for all of us. We are investing in Varo’s AI/ML platform for building, training, and deploying models. While machine learning (ML) involves mathematics and algorithms, it is an equally challenging engineering and operational problem: putting ML models in production requires many components working in sync. A core piece of this puzzle is the feature store, which is responsible for providing input data to models both during training and in production. This post explains the need for a feature store and describes our design and implementation of it.

Why a Feature Store

Machine learning models require data to make inferences. Consider a model that detects fraudulent transactions. Some of the information this model might use includes:

1. Transaction time
2. Transaction amount
3. Vendor age, i.e. how long the vendor has existed
4. Customer age, i.e. how long the customer has been associated with the bank
5. Median 28-day account balance
6. Distance between the user’s home address and the transaction address
7. Customer transaction history

When a credit card transaction occurs, the information payload directly associated with the transaction will contain some of the above information: in this case, transaction time, transaction amount, vendor name, and customer ID. But the model also needs data that is not part of the transaction payload, possibly drawn from multiple sources such as a data lake or a third-party vendor system. Rather than dealing with each data source separately, a model should deal with a single data abstraction. This is the feature store. It is the feature store’s responsibility to provide the model with a normalized interface to the data it needs.
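To make the abstraction concrete, here is a minimal, hypothetical sketch in Python (the class and source names are illustrative, not our actual implementation): the model asks one interface for features, and the store resolves each one against whichever backing source holds it.

```python
# Hypothetical sketch of a feature store as a single lookup interface
# over multiple backing sources (a data lake, a vendor system, etc.).

class FeatureStore:
    def __init__(self, sources):
        # sources: list of dicts mapping (entity_id, feature_name) -> value,
        # standing in for the real backing systems.
        self.sources = sources

    def get_features(self, entity_id, feature_names):
        """Return the requested features, whichever source holds them."""
        result = {}
        for name in feature_names:
            for source in self.sources:
                if (entity_id, name) in source:
                    result[name] = source[(entity_id, name)]
                    break
        return result

# The model sees one abstraction regardless of where each value lives.
lake = {("cust-1", "median_28d_balance"): 1250.0}
vendor = {("cust-1", "vendor_age_days"): 730}
store = FeatureStore([lake, vendor])
features = store.get_features("cust-1", ["median_28d_balance", "vendor_age_days"])
```

In practice each "source" is a service or table with its own access path; the point is that the model only ever talks to the store.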

Another important role of the feature store is to provide consistent information during development and production. If data goes through a different process during training versus production, this will hurt the accuracy or validity of the model, and subtle differences can have a significant impact on its performance. For instance, during research and development the model used the “account balance” as of yesterday, but in production it is given the customer’s current “account balance”. In some cases, such subtle differences can cause significant changes in the prediction outcome. A feature store provides consistent feature data across research, training, and production.
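The account-balance example above comes down to point-in-time correctness: the training pipeline must look up the value a feature had at the moment of the labeled event, not its latest value. A small illustrative sketch (function and data shapes are hypothetical):

```python
# Illustrative point-in-time lookup: training and serving share the same
# "as of" logic, so the model never trains on a value the production
# system would not have had at prediction time.

def balance_as_of(history, ts):
    """history: list of (timestamp, balance) pairs sorted by timestamp.
    Returns the last balance recorded at or before ts."""
    value = None
    for t, balance in history:
        if t <= ts:
            value = balance
        else:
            break
    return value

history = [(1, 100.0), (5, 80.0), (9, 200.0)]
# A training example labeled at time 6 must use the balance as of time 6
# (80.0), not the current balance (200.0).
as_of_label_time = balance_as_of(history, 6)
```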

With these uses in mind, the next section discusses the design and implementation considerations for our feature store.

Feature Store Design

The architecture of our Feature Store: We separately compute batch and real-time features and push them to our online feature store. Changes in the online feature store are logged and sent to our data lake, which is then used for model training.

The above is a high-level architecture diagram of our feature store. Our implementation is based on the lambda data processing architecture, which uses separate systems for processing online and batch features. Further, we use Kafka as a publisher/subscriber system to keep modules loosely coupled.

Our batch features are computed at regular intervals in our data lake using SQL engines or PySpark, then stored partitioned by event date and feature name. This enables us to easily onboard any batch feature and backfill features on demand, which is essential both for new features and for correcting faulty ones. Once all of the features are computed, they enter the feature store via Kafka.
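The partitioning scheme is what makes backfills cheap: recomputing one feature for one day means rewriting one partition. A toy sketch of the idea (the key layout and storage dict are illustrative, not our actual job code):

```python
# Illustrative sketch: batch features keyed by event date and feature name,
# so a per-feature backfill is just an overwrite of one partition.

def partition_key(feature_name, event_date):
    return f"event_date={event_date}/feature={feature_name}"

def write_partition(storage, feature_name, event_date, rows):
    """Overwrite a single partition, e.g. when backfilling a corrected feature."""
    storage[partition_key(feature_name, event_date)] = rows

storage = {}
write_partition(storage, "median_28d_balance", "2021-05-01", {"cust-1": 1250.0})
# Backfill: recompute and overwrite just that one partition; every other
# (date, feature) partition is untouched.
write_partition(storage, "median_28d_balance", "2021-05-01", {"cust-1": 1300.0})
```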

Some features must be updated in real time. For instance, assume our transaction fraud detection model uses whether a customer’s credit card is locked as one of its input features. In this case, the model cannot rely on batch processing with up to a day of latency. For such features, we use a Kotlin based service that listens to different Kafka topics, computes the required features, and updates them in the feature store. These updates are again announced on Kafka and eventually consumed by our online service.
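The core of that service is a small event handler. Our actual implementation is Kotlin; the Python sketch below is only illustrative (event shape and store are hypothetical): consume an event, derive the feature, and upsert it into the online store.

```python
# Illustrative real-time path: a handler consumes an event and updates the
# online store's current state for that (customer, feature) pair.

online_store = {}

def handle_event(event):
    """e.g. a card-lock event flips the customer's card_locked feature."""
    if event["type"] == "card_lock_changed":
        online_store[(event["customer_id"], "card_locked")] = event["locked"]

handle_event({"type": "card_lock_changed", "customer_id": "cust-1", "locked": True})
```

The fraud model then reads `("cust-1", "card_locked")` from the online store at prediction time, with no dependency on the batch cycle.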

The system above is designed to support online real-time models, and as such, it leverages a simple lookup schema and maintains only the current state of each feature. However, the historical state of these features is crucial for training and debugging purposes. So the feature store leverages Debezium on Kafka, a change data capture (CDC) system, to relay all changes back to our data lake. All the changes are processed and converted into a clean table that our data scientists can use for model training and debugging.
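Conceptually, the CDC stream turns the store's "current state only" view back into a history: replaying the change log yields, for each key, every value it ever held and when. A simplified sketch of that replay (the log format is illustrative, not Debezium's actual envelope):

```python
# Illustrative replay of a CDC change log into a per-key history that a
# data scientist can query for "value as of time t" during training.

def build_history(change_log):
    """change_log: list of (ts, key, new_value) events, in commit order."""
    history = {}
    for ts, key, value in change_log:
        history.setdefault(key, []).append((ts, value))
    return history

log = [
    (1, ("cust-1", "card_locked"), False),
    (7, ("cust-1", "card_locked"), True),
]
history = build_history(log)
```

The online store only remembers `True`; the history table also remembers that the card was unlocked until time 7, which is exactly what point-in-time training data needs.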

Below is the information flow diagram for batch features. For online features, the only difference is that the features are computed in a real-time service.

Lessons Learned:

Facts versus Features: The term “feature” means different things to different teams. From the business team perspective, features are facts or attributes such as user demographics, device characteristics, etc. From the machine learning perspective, features are specific numerical representations computed by applying transformations on the above attributes. In our case, the feature store is concerned with storing attributes rather than transformations for multiple reasons. First, a single attribute, such as country code, can be converted into many different features by applying different transformations, such as one-hot encoding or target encoding. Storing different transformations of the same attribute is not space-efficient. Second, applying transformations at the feature store level breaks the encapsulation principle. Transformations are part of the model development process, and therefore it makes sense to serialize transformations along with the model parameters.
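The country-code example can be made concrete: the store holds the raw attribute, and each model applies its own transformation, which travels with the model rather than living in the store. A short illustrative sketch (the encoding values are made up for illustration):

```python
# Illustrative: the feature store holds the fact ("country_code"); each
# model applies its own transformation, serialized with the model itself.

COUNTRIES = ["US", "CA", "MX"]

# Hypothetical per-country fraud rates, for illustration only.
TARGET_ENCODING = {"US": 0.021, "CA": 0.015, "MX": 0.030}

def one_hot(country_code):
    """Model A's transformation of the stored fact."""
    return [1 if c == country_code else 0 for c in COUNTRIES]

def target_encode(country_code):
    """Model B's transformation of the same stored fact."""
    return TARGET_ENCODING.get(country_code, 0.0)

fact = "CA"  # the one value the feature store actually stores
feature_a = one_hot(fact)
feature_b = target_encode(fact)
```

Storing `fact` once and deriving both features at model time avoids duplicating transformed copies in the store and keeps each transformation encapsulated with its model.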

The general rule of thumb we use is: store the fact and not the model-specific feature. In other words, we want to store the basic fact from which many features can be derived. Sometimes this isn’t possible, and specific features must be stored. For example, when storing the average weekly and monthly transaction volume, the time horizon is baked into the feature, because otherwise you would have to store all the underlying transactions.

Prefer Batch Over Real-time Features: Managing real-time features presents several challenges. Often you have to maintain intermediate data attributes. For instance, computing the mean transaction amount over the last week requires keeping track of the total transaction amount and the number of transactions, and constantly adjusting both for a constantly shifting window of time. The second challenge is backfilling real-time features from a streaming source. Many companies have strict restrictions on sharing data across environments, introducing technical and process controls that result in extra steps and time lags. Lastly, it is more difficult to debug real-time features for obvious reasons. Some of these challenges can be addressed via a delta architecture that leverages Spark streaming, but, in general, it is more challenging to compute and maintain features in real time.
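The intermediate-state bookkeeping for the weekly-mean example looks roughly like this (a minimal sketch, not production code): keep a running sum and count, and evict transactions as they fall out of the window.

```python
# Sketch of the state a real-time windowed feature needs: a running sum
# and a queue of events, with old transactions evicted as the window shifts.

from collections import deque

class WindowedMean:
    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()   # (timestamp, amount), oldest first
        self.total = 0.0

    def add(self, ts, amount):
        self.events.append((ts, amount))
        self.total += amount
        self._evict(ts)

    def mean(self, now):
        self._evict(now)
        return self.total / len(self.events) if self.events else 0.0

    def _evict(self, now):
        # Drop every transaction older than the window and adjust the sum.
        while self.events and self.events[0][0] <= now - self.window:
            _, amount = self.events.popleft()
            self.total -= amount

DAY = 24 * 3600
w = WindowedMean(window_seconds=7 * DAY)
w.add(0, 100.0)
w.add(3 * DAY, 50.0)
weekly_mean = w.mean(8 * DAY)   # first transaction has aged out
```

Even this toy version has to hold every in-window transaction and mutate two counters on every event and every read, which is the kind of incidental state a batch job computes for free from the data lake.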

A Feature Store Isn’t a Monolithic Piece of Software: We grew our feature store organically, building and modifying it as we went along, and we are still in the process of building it. You probably have many pieces of a feature store running in your organization already. One way to think about a feature store is in terms of the abstractions and services it provides to models in production and development. It is a key component in taking the data your business has and making it available to your machine learning models, and it provides solutions to the typical problems of using that data effectively in both batch and real-time environments. We’ve found it to be an essential tool and process for running and developing useful models.

Senior Machine Learning Engineer, Varo Money; Contributor and Maintainer of sklearn-pandas library