Making feature stores simple

Toby Coleman · Published in bytehub-ai · 3 min read · Mar 29, 2021

Data scientists are familiar with the challenges involved in accessing, preparing, and using data for analytics projects and machine-learning models. We often hear that:

  1. Raw data requires complex preparation before it can be used;
  2. Data preparation code is often difficult to share between data scientists and engineering teams; and
  3. Data pipeline tools can help solve part of the problem, but are complex to set up and maintain, and often beyond the reach of smaller data science teams.

Feature stores provide the interface between raw data and models.

In this article we’ll explore how ByteHub addresses these challenges using our python-based feature store.

Challenge 1: Simplifying data preparation

Typical data science workflows often involve creating Python notebooks which are used to:

  • Fetch and prepare raw data from wherever it is stored or from external APIs;
  • Carry out any feature-engineering steps required before it can be used; then
  • Train and validate a model.

Combining all these tasks in one notebook can result in code that is complex, hard to maintain and difficult to understand and share.

Feature stores are a technology designed to provide an interface between data and models. ByteHub’s feature store is designed to be easily accessible from a script or notebook and allows data scientists to store and organise raw data alongside preparation code. With this simple change we can dramatically simplify our model training code by eliminating the data prep and simply requesting pre-prepared training features from the feature store.
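As a rough sketch of what this looks like in practice (the FeatureStore constructor, create_namespace, create_feature, save_dataframe, load_dataframe and the transform decorator follow the ByteHub README at the time of writing; the feature names and values are made up for illustration):

    import bytehub as bh
    import pandas as pd

    # Connect to a feature store; with no arguments this creates a local,
    # SQLite-backed store suitable for experimentation.
    fs = bh.FeatureStore()

    # Namespaces group features and point at the underlying data storage.
    fs.create_namespace('demo', url='/tmp/bytehub-demo', description='Demo features')

    # Store the raw data as a named feature...
    fs.create_feature('demo/raw-temperature', description='Raw sensor readings, degC')
    raw = pd.DataFrame({
        'time': pd.date_range('2021-01-01', periods=48, freq='1h'),
        'value': [20.0 + i % 5 for i in range(48)],
    })
    fs.save_dataframe(raw, 'demo/raw-temperature')

    # ...and attach the preparation code as a transform, so every consumer
    # gets the cleaned feature without re-implementing the logic.
    @fs.transform('demo/clean-temperature', from_features=['demo/raw-temperature'])
    def clean_temperature(df):
        # Clip obviously bad sensor values.
        return df.clip(lower=-40, upper=60)

    # Model-training code now simply requests pre-prepared features.
    train = fs.load_dataframe('demo/clean-temperature',
                              from_date='2021-01-01', to_date='2021-01-02')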

Time-series datasets present specific challenges when it comes to making sure features are consistently aligned and resampled to the required time-interval. We’ve integrated these functions into our feature store, making it well-suited to a variety of machine-learning problems in energy, finance, retail, and other sectors.
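Continuing the sketch above, two features recorded at different rates can be pulled out already aligned to a common time index; the freq argument follows the ByteHub documentation, and the feature names are again hypothetical:

    # Load multiple features onto a shared hourly index; the feature store
    # handles the alignment and resampling rather than the notebook code.
    features = fs.load_dataframe(
        ['energy/demand', 'weather/temperature'],
        from_date='2021-01-01',
        to_date='2021-01-31',
        freq='1h',
    )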

Challenge 2: Sharing data

Data preparation and feature-engineering code becomes even more useful when it can be re-used and shared, for example:

  • Allowing different data scientists to save time by re-using data and features between different projects; and
  • Simplifying deployment and reducing engineering time by allowing production models to share exactly the same data preparation used during model training.

To achieve this, we allow descriptions and metadata to be attached to each feature. Data scientists can quickly search for features and pull them into different models.
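For example (the meta argument and the list_features search are assumptions based on the ByteHub README, so check the current docs for the exact interface):

    # Describe a feature when it is created, including free-form metadata...
    fs.create_feature(
        'energy/demand-forecast',
        description='Day-ahead electricity demand forecast, MW',
        meta={'source': 'grid-operator-api', 'owner': 'data-team'},
    )

    # ...so that other data scientists can find it and pull it into their models.
    matches = fs.list_features(regex='demand')
    print(matches[['name', 'description']])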

The underlying data can be stored with any cloud-storage provider, meaning that it is easy to access from both data scientists’ laptops/VMs and other cloud services/model deployments.

Challenge 3: Keeping it simple

A range of data engineering tools exist to help build data pipelines, so why the need for another? We found that many tools are complex to set up and use, making them difficult for data scientists to adopt as part of their workflow. Data science projects often start as small proofs of concept, without the support of a large engineering team to provision infrastructure and maintain a complex data engineering tool.

ByteHub’s feature store can be installed in any Python environment and run locally, getting a feature store up and running in seconds. When you outgrow this, it is simple to move to a cloud database or access the feature store as a managed service.

Once installed, ByteHub works with Pandas dataframes, allowing data to be saved, transformed and used in a familiar format.
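A minimal getting-started sketch, assuming the package name from the project repository and the same README methods as above: install from PyPI, connect to a local store, and read and write ordinary Pandas dataframes, swapping the default connection for a cloud database when needed.

    # pip install bytehub
    import bytehub as bh
    import pandas as pd

    # Local store by default; pass a database connection string
    # (e.g. a Postgres URL) to share the feature store across a team.
    fs = bh.FeatureStore()

    fs.create_namespace('sales', url='/tmp/bytehub-sales', description='Sales data')
    fs.create_feature('sales/daily-revenue', description='Daily revenue, GBP')

    # Data goes in and comes out as ordinary Pandas dataframes.
    df = pd.DataFrame({
        'time': pd.date_range('2021-03-01', periods=7, freq='1D'),
        'value': [1200, 1350, 1280, 1400, 1500, 900, 950],
    })
    fs.save_dataframe(df, 'sales/daily-revenue')
    revenue = fs.load_dataframe('sales/daily-revenue',
                                from_date='2021-03-01', to_date='2021-03-07')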

Toby Coleman

Data Scientist and ML Engineer. Interested in time-series modelling and forecasting problems. Current project: https://github.com/bytehub-ai/bytehub/