Feature stores are an important new concept in data science and machine learning. They provide a way to organise much of the data preparation required when building a machine-learning model, and do it in a way that is repeatable and easy to deploy. They make data scientists more productive, and ease the path between research and production. We think they will be particularly useful when applied to time-series problems such as forecasting.
The bird’s-nest of code
Most data scientists have been there. A new project starts with neatly organised datasets and a set of scripts to train models and generate predictions from new data. As the project progresses, things get more complicated: extra datasets get bolted on, existing data modified and transformed, new ideas tested. Eventually we end up with a bird’s-nest of scripts and notebooks — and a massive task ahead of us if we want to turn this model into a production-ready system.
This is understandable: data science is an iterative process, and it is hard to know at the outset which datasets and features will be most useful. But these tangled codebases are problematic for three main reasons:
- They are hard to understand — pity the poor data scientist who has just joined the team and needs to understand how the model works;
- They are error-prone and difficult to debug; and
- They make it difficult and time-consuming to turn the model into a real-world, production-ready system, because your engineering team will need to spend months understanding and implementing all of the different data transformations that feed data into the model.
Lack of data science tools
While a lot of attention has rightly gone into developing fantastic new algorithms and techniques for machine learning, the ‘MLOps’ tools required to help data scientists and engineers build machine-learning systems have lagged behind.
Recently, however, feature stores have emerged as a new concept to help prepare data for use in machine learning. Rather than simply storing raw data in a database, a feature store is a system that organises data into pre-engineered features, ready to feed straight into a machine-learning model.
Raw data usually needs to be transformed before it is fed into a machine-learning model. The transformed data are known as features.
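To make this concrete, here is a minimal sketch of the kind of transformation involved, using pandas and an invented daily-sales dataset (the column names and values are illustrative, not from any real system):

```python
import pandas as pd

# Hypothetical raw data: daily sales for one store
raw = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-02",
                            "2024-01-03", "2024-01-04"]),
    "sales": [100.0, 120.0, 90.0, 110.0],
})

# Transform the raw column into model-ready features
features = pd.DataFrame({
    "date": raw["date"],
    "sales_lag_1": raw["sales"].shift(1),           # yesterday's sales
    "sales_pct_change": raw["sales"].pct_change(),  # day-on-day change
    "day_of_week": raw["date"].dt.dayofweek,        # calendar feature
})
```

The raw `sales` column is rarely fed to a model directly; it is the lags, changes and calendar attributes derived from it that carry the predictive signal.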
Some key advantages of this approach are:
- The code to compute features is organised and run on the feature store, meaning that this logic is kept separate from ML code;
- The same features can be served either in batches (for training) or in real-time (for production ML models), making it easier to deploy ML systems;
- Features can also be shared across different teams in an organisation, meaning that the effort spent on feature engineering can be re-used across multiple ML models.
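A toy sketch can illustrate the separation of concerns described above. The class below is entirely hypothetical (real feature stores such as Feast or Hopsworks add storage, versioning and online serving); it just shows feature logic being registered in one place and served as a batch, away from any ML code:

```python
import pandas as pd

class MiniFeatureStore:
    """Toy feature store: registers named feature-computation
    functions and serves them over a raw DataFrame."""

    def __init__(self):
        self._features = {}

    def register(self, name, fn):
        # fn maps a raw DataFrame to a feature Series; keeping these
        # definitions here keeps them out of model-training code
        self._features[name] = fn

    def get_batch(self, raw, names):
        # Serve a training batch of the requested features
        return pd.DataFrame({n: self._features[n](raw) for n in names})

store = MiniFeatureStore()
store.register("sales_lag_1", lambda df: df["sales"].shift(1))
store.register("sales_roll_7", lambda df: df["sales"].rolling(7).mean())
```

Because the definitions live in one registry, any team can request `sales_lag_1` for its own model, and the same definitions could in principle back both training batches and live lookups.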
We think feature stores will be particularly useful when working with time-series data — everything from financial market prices and energy demand to weather forecasts and retail sales information. This is because time series present some unique difficulties when trying to implement a robust machine-learning system. For example:
- The time-travel problem — how do you make sure that you only train your model on data that was available at the time a prediction would have been made? This crops up a lot when handling weather data, where historical forecasts can be difficult to handle in a reliable way.
- Complex feature engineering — for example, rolling windows of past data are often required for time-series forecasting models. Keeping this code separate from the machine learning model helps make it more maintainable and reusable.
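Both difficulties can be sketched in a few lines of pandas, using invented demand and forecast data. The point-in-time join uses `pd.merge_asof` so that each row only sees forecasts issued at or before its own timestamp, which is how the time-travel problem is avoided; a rolling window then illustrates the second point:

```python
import pandas as pd

# Hypothetical target series: hourly energy demand
demand = pd.DataFrame({
    "time": pd.date_range("2024-01-01", periods=6, freq="h"),
    "demand_mw": [50.0, 52.0, 55.0, 53.0, 51.0, 54.0],
})

# Hypothetical weather forecasts, keyed by when they were *issued*
forecasts = pd.DataFrame({
    "issued_at": pd.to_datetime(["2024-01-01 00:30",
                                 "2024-01-01 03:30"]),
    "forecast_temp": [2.0, 3.5],
}).sort_values("issued_at")

# Point-in-time join: each demand row is matched with the latest
# forecast issued at or before its timestamp, so training data
# never leaks information from the future
joined = pd.merge_asof(demand, forecasts,
                       left_on="time", right_on="issued_at",
                       direction="backward")

# Rolling-window feature: mean demand over the previous 3 hours
joined["demand_roll_3h"] = joined["demand_mw"].rolling(3).mean()
```

The first demand row has no forecast yet issued, so it correctly gets a missing value rather than a forecast from the future; a naive join on the forecast's target time would silently leak that information.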
We think that feature stores with simple, easy-to-understand APIs will allow data scientists to be much more productive when tackling time-series problems. For engineers, models built on a feature store will be much easier to turn into production-ready systems.