Feature Stores — Data Engineer’s take on Feature Engineering

Guha Ayan
Slalom Australia
May 25, 2021 · 6 min read

Machine learning and AI are no longer part of some obscure academic research. Wide proliferation of tools and technologies, coupled with real investments by big technology players, has made the “..subtle science and exact art..” of machine learning commonplace.

Or has it?

There is an old saying: Garbage In, Garbage Out. In this fascinating new era of novel data applications, it has been slightly altered: Garbage (Feature) In, Garbage (Model) Out. In other words, an ML model is only as good as its features are reliable. “Good” is not the same as accuracy. It is a combination of reliability (does the model perform adequately enough to be used in production?), scalability (can we use the model for all our use cases?) and manageability (can I monitor and manage the model in a production setting?). As we will see shortly, a feature store can help in all those aspects.

But, first things first, what is a feature and why do we need a feature store?

If we visualise ML models as math equations, features are simply the independent variables. In simple words, features are things we “know” about something, and based on them we try to learn “something unknown” about it.

Let’s say we want to estimate a reading for a smart meter. For such a problem, the following features can be good choices: Actual Read Percentage Last Month, Total Consumption Last Qtr in KwH, Consumption Segment, Changed Retailer Last Year.

One thing to notice here: all the features relate to a single meter id. This is important to understand.
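The idea of features keyed to a single business identifier can be sketched as a simple record. This is an illustrative, hypothetical layout (the field names are made up for this example, not from any particular feature store):

```python
# A hypothetical feature record for one smart meter. Every feature is
# keyed by the business identifier (meter_id), plus a timestamp for when
# the features were computed.
meter_features = {
    "meter_id": "MTR-001",
    "event_timestamp": "2021-05-01T00:00:00Z",
    "actual_read_pct_last_month": 0.87,
    "total_consumption_last_qtr_kwh": 1240.5,
    "consumption_segment": "residential-medium",
    "changed_retailer_last_year": False,
}

# These are the "known" values; the model tries to predict the unknown
# (the estimated reading) from them.
```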

There can be hundreds or thousands of such features going into the equation above. Based on those features generated from past data, using ML algorithms, and with a little bit of luck, data scientists “train” a model for the given data set and the given problem.

But that is only half the problem. The whole point of training a model is the ability to predict. This is called the “inference” phase. This is the phase when the model is deployed in the wild and doing its work, often in real time.

In many real-life scenarios, training and inference systems are built differently, with different sets of constraints and almost always with different data processing capabilities. But remember this: the model needs EXACTLY the same features to be computed while producing inferences. So how do we bridge this gap?

Also, in a large enterprise, it is not unusual to run multiple, even competing, models. It is very common for a global organisation to have data scientists sprinkled across business units around the globe, each of them trying to compute exactly the same feature, often from a single data lake. It is a massive waste of both human and hardware resources. Is there any way to stop such waste?

Enter feature stores!! A feature store must solve the two problems discussed above, and some more in the bargain.

So what do we need from a feature store?

  1. Scalable and Flexible storage of features at enterprise scale
  2. Consistent access patterns during training and inference phases
  3. A register of features so data scientists can discover and reuse features
  4. Point in time correctness

Core Concepts

Offline and Online stores

As ML problems vary widely, so does the nature of features. There should not be any assumption imposed on how many or how few features can be assigned to a business identifier.

To resolve the training-and-inference processing dichotomy, it is preferable to have synchronised stores for offline and online usage. As the training process uses historical data with a heavy data processing workload profile, it is efficient to serve it from the offline store. All data required for fast inference can be served from the online store, which holds the most up-to-date data in real time or near real time.

Also, many of the fast aggregates (e.g. clicks within the last 5 minutes) lose their value as they age, so they can be transferred from the online store to the offline store.
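A minimal in-memory sketch of the synchronised dual-store idea follows. This is purely illustrative (real implementations use a key-value store online and a data lake or warehouse offline); the class and method names are invented for this example:

```python
from collections import defaultdict

class DualStore:
    """Toy sketch of synchronised online/offline feature stores.

    The online store keeps only the latest row per key for fast lookups;
    the offline store appends the full history for training and backfills.
    """

    def __init__(self):
        self.online = {}                  # key -> latest feature row
        self.offline = defaultdict(list)  # key -> full history of rows

    def ingest(self, key, timestamp, features):
        row = {"event_timestamp": timestamp, **features}
        # Online: overwrite only if this row is at least as fresh.
        current = self.online.get(key)
        if current is None or timestamp >= current["event_timestamp"]:
            self.online[key] = row
        # Offline: append everything, nothing is lost.
        self.offline[key].append(row)

store = DualStore()
store.ingest("MTR-001", "2021-05-01", {"clicks_last_5min": 3})
store.ingest("MTR-001", "2021-05-02", {"clicks_last_5min": 7})
```

Inference would read `store.online["MTR-001"]` (one fast lookup), while training would scan `store.offline["MTR-001"]` (the full history), keeping both phases fed from a single ingestion path.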

Data Structure

Feature Data Model

Typically, both the online and offline stores follow a simple data model. In the online store, we can think of the primary keys forming the lookup key. In the offline store, timestamps are typically stored in order so it is easy to query either the latest value or the value up to a past timestamp. A more concrete example below:

Typically, each business entity is stored in a separate container called a Feature Group. For example, Meters and Customers may be stored in separate feature groups.

Note: it is super important to keep the timestamps at a very granular level to accommodate any future need. Pay attention to time zones, or store data in a single time zone like UTC. Use business timestamps, not processing timestamps.
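The feature-group layout described above might be declared along these lines. This is a hypothetical configuration sketch, not the schema of any specific product:

```python
# Hypothetical feature-group registry: one container per business entity,
# each with its own primary key, an event-time column (business time,
# stored in UTC), and the list of features it holds.
feature_groups = {
    "meters": {
        "primary_key": ["meter_id"],
        "event_time": "event_timestamp",  # business timestamp, in UTC
        "features": [
            "actual_read_pct_last_month",
            "total_consumption_last_qtr_kwh",
        ],
    },
    "customers": {
        "primary_key": ["customer_id"],
        "event_time": "event_timestamp",
        "features": ["tenure_months", "consumption_segment"],
    },
}
```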

Reproducibility

A feature store should support reproducibility, i.e., model training and inference should be reproducible if we go back to a point in the past. A simple byproduct of this is not storing every feature for every time window if the value has not changed, but rather carrying values forward up to the time window of the query.

Reproducibility

As an example, in the picture above, let us assume 4 features are stored across 4 time windows (T1 → T4). Notice that Feature 3 is not stored for T2, because it has the same value at T1 and T2. In this case, if a query is issued to provide features up to the T2 window, Features 1, 2 and 4 will provide values computed at T2, while Feature 3 will produce the value computed at T1.
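The “as of” lookup described above can be sketched in a few lines. The data mirrors the figure (Feature 3 has no row at T2 because its value did not change); the function name and storage layout are invented for illustration:

```python
# Sparse feature history: feature -> {window: value}. A feature is only
# stored in a window when its value changed, as in the figure.
history = {
    "feature_1": {"T1": 10, "T2": 11},
    "feature_2": {"T1": 5, "T2": 6},
    "feature_3": {"T1": 42},            # unchanged at T2, so not stored
    "feature_4": {"T1": 0, "T2": 1},
}

WINDOWS = ["T1", "T2", "T3", "T4"]  # chronological order

def as_of(history, window):
    """Return each feature's most recent value at or before `window`."""
    cutoff = WINDOWS.index(window)
    result = {}
    for feature, values in history.items():
        # Walk backwards from `window` until a stored value is found.
        for w in reversed(WINDOWS[: cutoff + 1]):
            if w in values:
                result[feature] = values[w]
                break
    return result
```

A query `as_of(history, "T2")` returns the T2 values for Features 1, 2 and 4, and falls back to the T1 value (42) for Feature 3, exactly the behaviour the figure describes.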

Consistent Access Patterns

One key requirement for a feature store is a consistent access pattern. At the implementation level, a feature store must expose either APIs or libraries (or both) for consistent access from training and inference pipelines.

An optional access pattern of the feature store API is to represent features in a vector format. This is a fancy name for converting the required features from row format to column format. A sample of such a format is shown below:

This conversion is very handy for getting the transposed vector structure and plugging it directly into an ML “fit” method.
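The row-to-vector conversion amounts to a pivot. A minimal sketch, with made-up entity and feature names, assuming one row per (entity, feature) pair:

```python
# Row format: one record per (entity, feature) pair, as a store might
# return them.
rows = [
    {"entity_id": "MTR-001", "feature": "actual_read_pct", "value": 0.87},
    {"entity_id": "MTR-001", "feature": "consumption_kwh", "value": 1240.5},
    {"entity_id": "MTR-002", "feature": "actual_read_pct", "value": 0.55},
    {"entity_id": "MTR-002", "feature": "consumption_kwh", "value": 980.0},
]

# Fix a column order so every vector lines up the same way.
feature_names = sorted({r["feature"] for r in rows})

# Pivot: one vector per entity, columns in feature_names order.
vectors = {r["entity_id"]: [None] * len(feature_names) for r in rows}
for r in rows:
    vectors[r["entity_id"]][feature_names.index(r["feature"])] = r["value"]
```

The resulting `vectors["MTR-001"]` is a plain list of feature values in a fixed order, which is the shape a `fit(X, y)` call expects for each sample.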

Feature Register

A feature store must have a register of features with a usable search capability. The bigger your feature store and the larger your user base, the more important this discovery capability becomes.

Audit, Profile, Integration

Also, it is important to have an audit and a profile of the features themselves. This can be as simple as an audit of the most popular features, or as complex as a historical trend analysis of features to isolate any systemic problem. In addition, a seamless integration between the feature register and ML model registries would be super useful (I have not seen this in the field yet; please let me know if you know of one).

Putting everything together, a high-level view will look something like below:

Implementation

There are far too many implementations of feature stores to cover; a comprehensive list can be found here. A couple of them are mention-worthy as they are backed by major cloud providers: Feast and SageMaker.

Limitations

A feature store is super useful for large enterprises with a large number of business-entity-driven features. However, in the following cases it may not work very well:

  1. Image classification or similar where each entity may have millions or more features (like RGB value of every pixel in an image)
  2. Text and NLP scenarios where documents can have large number of features (like bigrams, word frequencies)

Also, it is good to note that feature stores typically store base features. Model-specific feature engineering transformations, e.g. producing higher-order features (like pow(x, 2), x1 * x2, or similar interactions), one-hot encoding, etc., will still remain at the model level.
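The split described above can be sketched as follows: base features come from the store, while derived terms are computed in the training pipeline. The function name is invented for this example:

```python
# Sketch: model-level feature engineering stays outside the feature store.
# x1 and x2 are base features fetched from the store; the derived terms
# are computed just before training.
def expand_features(x1, x2):
    """Derive higher-order and interaction terms from two base features."""
    return {
        "x1": x1,
        "x2": x2,
        "x1_squared": x1 ** 2,    # higher-order term, pow(x1, 2)
        "x1_times_x2": x1 * x2,   # interaction term
    }
```

Keeping these transformations at the model level means two models can share the same base features from the store while each applies its own encodings and interactions.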

In a modern cloud-based data architecture, the feature store has a distinct place. Regardless of the maturity of your ML journey, this simple but powerful addition can foster agility, consistency and real business value.

As usual, please drop a comment and/or a note in case you want to discuss further. Thanks for reading.
