Using Feature Stores to Accelerate Model Development


A wise Data Scientist once said a machine learning model is only ever as good as the data you feed into it. Hence at Gousto we’re constantly seeking ways to improve how we work with data and extract more value from it.

With this in mind we decided to build our first Feature Store on Databricks earlier this year. The results speak for themselves. Massively reduced time to deployment. Highly reusable features suited to a variety of business problems. Less waiting around for queries to run and more time to invest in model development and research. The list goes on.

In this guide I want to share with you our motivation for building a Feature Store, some points to consider before creating your own and the steps for building a Feature Store on Databricks.

What’s wrong with using SQL queries?!?

The status quo for model development at Gousto involved running large SQL queries against our databases to gather all the features and labels needed to train our models. This was often accompanied by further processing in Python before eventually handing over the clean data to a machine learning model. Whilst this may sound trivial, most Data Scientists will tell you that this process occupies the majority of their time (disappointingly, the exotic machine learning part can often be reduced to a few lines of code).

Enter Feature Stores.

Feature Stores are curated tables of model features which can be passed onto a model with little-to-no preprocessing. They are highly suited to tabular data (our primary data source at Gousto) and can be updated periodically (e.g. when our menu refreshes). Here are the main benefits we found of using Feature Stores over existing ETL methods:

1. One store, multiple problems

Features should be highly reusable across different problems and domains. To facilitate this it helps to have distinct processes for constructing features and generating labels. This way features need only be created once and can be centrally owned.

The old way of working could lead to situations where different teams were generating the same features in separately managed queries. If the logic in one query changed this would not be automatically reflected in the others, which could lead to unintended errors.

Reusable features also benefit the wider Data Science team. Features can be shared across tribes and applied to new problems by changing the labels alone. At Gousto we have a variety of machine learning use cases, such as recommending recipes, forecasting orders and predicting churn. These problems are distinguished by their labels (for example numeric targets for forecasting and binary labels for classification), yet they can all draw from the same set of features.

High level architecture demonstrating separation of labels and features

2. Up to 4x speed-up from raw data to deployment

Incorporating the Feature Store into our model development process reduced development time for some of our models from 6–8 weeks to ~2 weeks. This should come as no surprise since data wrangling is typically the most time-consuming part of a Data Science workflow and Feature Stores essentially abstract away many of these steps. This meant more time for us to focus on the fun aspects of Data Science such as model exploration, reading recent research papers and experimentation. Deployment is also made easier because the feature extraction code is largely unaltered between training and inference.

3. Less downtime

An additional benefit of separating feature construction from label generation is that the Feature Store can run independently of model training. At Gousto we have written scheduled jobs (also hosted on Databricks) which periodically update our feature tables overnight. This means that during the day, when we want to do model exploration, we have data ready to use and are not adversely affected by heavy usage of our databases by other Goustonians.

Tips for getting started

By now you should be convinced that Feature Stores are the future for Data Science teams (did I mention 4x speed improvements?).

But where to start? Here are some platform-agnostic tips for planning your feature store.

1. Document the features you want to include in your Store

Before starting make a list of all the features you want to include in your Feature Store. Think beyond just a single use case as we want features to be highly reusable across different problem spaces. Collate any historical queries which may have been used in the past and identify all the required data sources.

2. Decide on a level of aggregation for your feature tables

One of the reasons Feature Stores are so convenient to use is that tables have consistent levels of aggregation. At Gousto we like to think of things in terms of users and menu weeks (each week we have a new delicious set of recipes for customers to choose from). We therefore chose to aggregate data to one row per user_id and menu_week. This gives us the flexibility to keep the time dimension in future models (think LSTMs, Transformers) or to construct our own moving averages and collapse the data to one row per user_id.

3. Sense-check your historical queries

I can’t overstate the importance of this step. A Feature Store with inaccurate data can do more harm than good particularly if we intend to reuse features across teams. Business logic can change over time and SQL queries may need to be amended accordingly. Therefore I would recommend taking time to revisit the logic in your historical queries. Are values within their expected range? Are values strictly positive when they should be? Are there any unintended duplicates?

This is by far the most time-consuming part of creating Feature Stores but doing these checks will ultimately increase confidence in your data. It’s difficult to achieve 100% data accuracy but any errors mitigated at this stage will make downstream debugging much easier.
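The three questions above translate naturally into automated checks. Here is a minimal plain-Python sketch over a handful of in-memory rows. The column names (order_count, discount_rate) and the bounds are hypothetical, and on real tables you would probably express the same checks in PySpark or SQL.

```python
from collections import Counter


def sense_check(rows, key_cols, ranges=None, strictly_positive=()):
    """Return a list of human-readable problems found in `rows` (a list of dicts)."""
    problems = []
    ranges = ranges or {}

    # 1. Are values within their expected range?
    for col, (low, high) in ranges.items():
        bad = [r for r in rows if not low <= r[col] <= high]
        if bad:
            problems.append(f"{col}: {len(bad)} value(s) outside [{low}, {high}]")

    # 2. Are values strictly positive when they should be?
    for col in strictly_positive:
        bad = [r for r in rows if r[col] <= 0]
        if bad:
            problems.append(f"{col}: {len(bad)} non-positive value(s)")

    # 3. Are there any unintended duplicates on the table's key?
    key_counts = Counter(tuple(r[c] for c in key_cols) for r in rows)
    duplicates = [key for key, n in key_counts.items() if n > 1]
    if duplicates:
        problems.append(f"{len(duplicates)} duplicated key(s): {duplicates}")

    return problems


# Hypothetical sample rows at a user_id, menu_week level
rows = [
    {"user_id": 1, "menu_week": 400, "order_count": 2, "discount_rate": 0.1},
    {"user_id": 1, "menu_week": 400, "order_count": 2, "discount_rate": 0.1},  # duplicate key
    {"user_id": 2, "menu_week": 400, "order_count": 0, "discount_rate": 1.5},  # suspicious values
]
problems = sense_check(
    rows,
    key_cols=["user_id", "menu_week"],
    ranges={"discount_rate": (0.0, 1.0)},
    strictly_positive=["order_count"],
)
for problem in problems:
    print(problem)
```

Running checks like these as part of the scheduled job that builds the feature tables means a change in business logic is more likely to surface as a failed check than as a silently degraded model.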

4. Rewrite your SQL queries into PySpark code

This part is optional and depends on your own team’s requirements. In our case many of the historical queries we wanted to use involved multiple CTEs and countless joins, and spanned hundreds of lines of code. Queries in this form are difficult to read and can be inefficient to execute.

Using PySpark facilitates a more programmatic approach to querying databases, allowing greater modularisation and resulting in cleaner code. The good news is that if you are already a master of Pandas the jump to PySpark is very slight. Once you’ve familiarised yourself with the main commands, writing queries in PySpark is just as easy as SQL.
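To make the modularisation point concrete, here is a hedged sketch of how a query with several CTEs might be split into small PySpark functions. The orders and users DataFrames and all column names (status, order_id, order_value, signup_channel) are hypothetical stand-ins, not our real schema.

```python
def delivered_orders(orders):
    """Keep only orders that were actually delivered (would be one CTE in SQL)."""
    return orders.filter("status = 'delivered'")


def orders_per_user_week(orders):
    """Aggregate orders to one row per (user_id, menu_week)."""
    return (
        orders.groupBy("user_id", "menu_week")
        .agg({"order_id": "count", "order_value": "sum"})
        # Spark names dict-style aggregates "<function>(<column>)"
        .withColumnRenamed("count(order_id)", "order_count")
        .withColumnRenamed("sum(order_value)", "total_order_value")
    )


def build_features(orders, users):
    """Compose the small steps into the final feature DataFrame."""
    weekly = orders_per_user_week(delivered_orders(orders))
    user_attributes = users.select("user_id", "signup_channel")
    return weekly.join(user_attributes, on="user_id", how="left")
```

Each function can be read, tested and reused on its own, which is much harder to do with a CTE buried in the middle of a long query.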

Building a Feature Store with Databricks

Now for the fun part! First load your features, at your chosen level of aggregation, into a PySpark DataFrame (in the example below we call our DataFrame features). Creating a Feature Store can then be done with 2 function calls. First we instantiate the Feature Store API:
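A minimal sketch of that first call, assuming the code runs on a Databricks ML runtime where the databricks.feature_store package is installed. The wrapper function is only there so the import happens when it is called; in a notebook you would normally write the two lines directly.

```python
def get_feature_store_client():
    """Create a client for the Databricks Feature Store."""
    # Imported inside the function because the package is only available
    # on a Databricks ML runtime (assumption: that is where this runs).
    from databricks.feature_store import FeatureStoreClient

    return FeatureStoreClient()


# On a Databricks cluster:
# fs = get_feature_store_client()
```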

Next, we create a feature table using features by calling fs.create_feature_table with the following arguments:
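The call might look roughly like the sketch below. The table name, the choice of menu_week as the partition column and the description are illustrative assumptions; fs is a FeatureStoreClient and features is the PySpark DataFrame described above. Newer versions of the client replace this call with create_table, so check which version you have installed.

```python
def create_user_week_table(fs, features):
    """Register `features` (a PySpark DataFrame) as a new feature table."""
    return fs.create_feature_table(
        name="feature_store.user_week_features",  # hypothetical <database>.<table>
        keys=["user_id", "menu_week"],            # primary keys = level of aggregation
        features_df=features,                     # supplies the schema and initial data
        partition_columns=["menu_week"],          # assumed to be a common query predicate
        description="User features aggregated per menu week",
    )


# On a Databricks cluster:
# create_user_week_table(fs, features)
```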

[Hint: Choose a common query predicate as your partition for quicker execution when you retrieve your features]

Future updates to the table can be executed using fs.write_table(). Selecting mode='merge' performs an upsert: rows with new keys are appended, and rows whose keys already exist are updated if their features have changed.
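As a sketch, a scheduled refresh job could wrap the call like this. The table name is the same hypothetical one used for illustration, and new_features is assumed to be a PySpark DataFrame with the same keys and schema as the feature table.

```python
def refresh_user_week_table(fs, new_features):
    """Upsert freshly computed features into an existing feature table."""
    fs.write_table(
        name="feature_store.user_week_features",  # hypothetical table name
        df=new_features,
        mode="merge",  # append new keys, update existing ones
    )


# On a Databricks cluster, for example from a nightly job:
# refresh_user_week_table(fs, new_features)
```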

After this step you will be able to view a whole host of information and metadata about your tables in Databricks’ dedicated Feature Store UI. Tables are written in Delta format by default, which means you can also view a transaction log of any changes to tables over time and even revert to an older state if need be.

Using the Feature Store

We also make use of Databricks’ Feature Store API to retrieve features for model training. Firstly we load the labels we are trying to predict. Recall that, because of the separation between labels and features, we can load the labels from a separate table rather than from the Feature Store. As a bare minimum the labels table should also contain the lookup keys defined in your feature tables, in our case user_id and menu_week. These will serve as the joining key.
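A small sketch of this step, with a guard that fails early if the lookup keys are missing. The table name analytics.churn_labels is hypothetical, and spark is assumed to be the SparkSession that Databricks notebooks provide.

```python
LOOKUP_KEYS = ["user_id", "menu_week"]


def load_labels(spark, table_name="analytics.churn_labels"):
    """Load a labels table and check that it carries the feature lookup keys."""
    labels = spark.table(table_name)
    missing = [key for key in LOOKUP_KEYS if key not in labels.columns]
    if missing:
        raise ValueError(f"labels table {table_name} is missing keys: {missing}")
    return labels


# On a Databricks cluster:
# labels = load_labels(spark)
```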

We are now ready to retrieve features from the Feature Store. We do so by defining a list of FeatureLookup objects. Each object represents a single feature table and defines the lookup key to join on as well as the subset of features to use (note: the features here are purely for illustrative purposes).
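A sketch of what such a list could look like. Both table names and all feature names are made up for illustration. Recent versions of the client take a feature_names list, as used here, but older versions may differ, so treat the exact keyword as an assumption to check against your installed version.

```python
def build_feature_lookups():
    """Describe which features to pull from which feature tables."""
    # Imported lazily: the package is only available on a Databricks ML runtime.
    from databricks.feature_store import FeatureLookup

    lookup_key = ["user_id", "menu_week"]
    return [
        FeatureLookup(
            table_name="feature_store.user_week_features",    # hypothetical
            lookup_key=lookup_key,
            feature_names=["order_count", "total_order_value"],
        ),
        FeatureLookup(
            table_name="feature_store.user_week_engagement",  # hypothetical
            lookup_key=lookup_key,
            feature_names=["app_sessions"],
        ),
    ]
```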

Once we define our lookups we call fs.create_training_set. To load this into a PySpark DataFrame we call .load_df() on this object.
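Put together, the retrieval step could be sketched as follows. The label column name churned is an assumption for illustration; fs, labels and feature_lookups are the client, labels DataFrame and lookup list described above.

```python
def load_training_data(fs, labels, feature_lookups, label_col="churned"):
    """Join the labels with the requested features and return a PySpark DataFrame."""
    training_set = fs.create_training_set(
        df=labels,                        # must contain the lookup keys and the label
        feature_lookups=feature_lookups,  # which features to join, from which tables
        label=label_col,                  # hypothetical label column name
    )
    all_features_df = training_set.load_df()
    return all_features_df


# On a Databricks cluster:
# all_features_df = load_training_data(fs, labels, feature_lookups)
```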

At this point the data in all_features_df will be at a user_id, menu_week level. If you want to retain this time dimension you can call .toPandas() to convert it to a Pandas DataFrame and proceed with modelling. If, however, you want the data at a per-user_id level, you can treat it as a normal PySpark DataFrame and construct moving window aggregates before calling .toPandas().
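One possible way to do the second option is sketched below using Spark SQL window expressions. It assumes a hypothetical feature column order_count, that menu_week sorts chronologically, and that there is one row per user per week (the window counts rows, not calendar weeks). It keeps each user's most recent week together with a trailing average.

```python
def collapse_to_user_level(all_features_df, value_col="order_count", window_weeks=4):
    """Add a trailing moving average, then keep the latest row per user."""
    preceding = window_weeks - 1
    with_average = all_features_df.selectExpr(
        "*",
        # Trailing average over the current and previous (window_weeks - 1) rows
        f"avg({value_col}) OVER (PARTITION BY user_id ORDER BY menu_week "
        f"ROWS BETWEEN {preceding} PRECEDING AND CURRENT ROW) "
        f"AS {value_col}_avg_{window_weeks}w",
        # Rank rows so that 1 marks each user's most recent menu week
        "row_number() OVER (PARTITION BY user_id ORDER BY menu_week DESC) "
        "AS recency_rank",
    )
    latest = with_average.filter("recency_rank = 1").drop("recency_rank")
    return latest.toPandas()


# On a Databricks cluster:
# user_level_pdf = collapse_to_user_level(all_features_df)
```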

Hopefully this guide has served as a useful introduction to Feature Stores and their benefits for Data Science teams, irrespective of which platform you use. At Gousto the use of Feature Stores has vastly shortened our model development cycle, allowing us to rapidly test our products ‘in the wild’ and iterate on solutions. It is therefore an area we will continue to invest in going forward.
