Building Real-Time ML Pipelines with a Feature Store

Transition from batch to real time with an integrated feature store

Adi Hirschtein
Jan 13 · 9 min read

The Next Stage of Feature Stores

Feature stores have drawn growing attention in machine learning circles over the last few months, and the topic deserves it. The most painful challenge in the ML lifecycle is dealing with data, or in other words, feature engineering.

  • Customer-facing impact
  • Adherence to business SLAs
  • Scalability: supporting lots of requests and processing
  • Versioning
  • Access control
  • An easy API for creating and accessing features
  • Easy-to-use interface

Key Challenges

Data scientists are skilled at understanding data and building complex algorithms to solve business problems. They don’t need to be data engineers, but hunting for and creating features is part of their job. The features they create are meant only for training models in a development environment. Once a model is ready to be deployed in production, the data scientists’ training code is not ready to run in a real-time production environment, so data engineers step in to rewrite the features and make them production-ready. This handoff is a key part of the MLOps (machine learning operationalization) process.

Solution

Use a Feature Store as a Data Transformation Service for Both Training and Serving

To tackle the challenges above, we need a very powerful and fast data transformation service. A feature store is not simply a convenient catalog of features with a nice management interface; it’s fundamentally a transformation service designed to solve a complex problem of feature engineering. Ideally, it should be able to handle real-time feature engineering.
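One way to picture a transformation service shared by training and serving is a feature defined once and called from both paths. The sketch below is only an illustration under that assumption: `click_rate`, `transform_batch`, and `transform_online` are hypothetical names, not part of any feature store API.

```python
from typing import Dict, Iterable, List


def click_rate(event: Dict[str, float]) -> Dict[str, float]:
    """Hypothetical feature logic, written once and reused everywhere."""
    impressions = event["impressions"]
    rate = event["clicks"] / impressions if impressions else 0.0
    return {"click_rate": rate}


def transform_batch(events: Iterable[Dict[str, float]]) -> List[Dict[str, float]]:
    """Offline path: build training rows from historical events."""
    return [click_rate(e) for e in events]


def transform_online(event: Dict[str, float]) -> Dict[str, float]:
    """Online path: compute the same feature for a single incoming event."""
    return click_rate(event)


history = [{"clicks": 2, "impressions": 10}, {"clicks": 0, "impressions": 0}]
training_rows = transform_batch(history)
serving_row = transform_online({"clicks": 2, "impressions": 10})
```

Because both paths call the same function, the value seen at serving time matches the value the model was trained on, and nobody has to rewrite the feature for production.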

Real-time Feature Store Must-haves

There are a few showstoppers for teams who want to do real-time feature engineering.

  • Feature calculation: how long does it take to calculate a feature?
  1. High-speed functions or stream processors for reading and processing events (e.g. Nuclio serverless functions, Apache Flink)
  2. Fast queuing framework serving as a bus between computation processes that run as serverless functions (e.g. Kafka or v3io stream)
  3. Fast key-value database such as Redis, Iguazio KV, or Cassandra
  4. Event processing engine: a framework for running calculations on events in real time with strict performance requirements and built-in HA. This engine enables feature extraction in real time and at scale (e.g. Iguazio Storey, an async streaming library, or Apache Beam).
  5. Transformation service supporting:
       • Sliding window: Calculate data over a given time period every x minutes (e.g. average number of clicks in the last hour)
       • Grouping: Group events by a certain field
       • Joins: Enrich data with other sources, with an option to do an “as-of join” to handle time mismatches
       • Custom function: Add custom logic to the feature calculation (e.g. multiply a given field by x)
       • Filter: Filter events based on a given rule
       • Libraries (optional): Leverage a set of predefined libraries for common tasks such as null removal, geocoding, time conversion, etc.
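To make the transformations listed above more concrete, here is a minimal pure-Python sketch that combines a filter, a custom function, grouping by key, and a sliding-window average. It is not how Storey, Flink, or Beam implement windows. `SlidingWindowAverage` and the event fields `key`, `ts`, and `value` are hypothetical, and the sketch assumes events arrive in timestamp order for each key.

```python
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, Optional, Tuple


class SlidingWindowAverage:
    """Per-key average over the last `window_seconds` of events."""

    def __init__(
        self,
        window_seconds: float,
        keep: Callable[[dict], bool] = lambda e: True,                  # filter rule
        value_of: Callable[[dict], float] = lambda e: e["value"],       # custom function
    ) -> None:
        self.window_seconds = window_seconds
        self.keep = keep
        self.value_of = value_of
        # Grouping: one deque of (timestamp, value) pairs per key.
        self._events: Dict[str, Deque[Tuple[float, float]]] = defaultdict(deque)

    def update(self, event: dict) -> Optional[float]:
        """Add one event and return the current window average for its key."""
        if not self.keep(event):
            return None  # filtered out, no feature emitted
        window = self._events[event["key"]]
        window.append((event["ts"], self.value_of(event)))
        # Evict events that fell out of the window (ts <= now - window).
        cutoff = event["ts"] - self.window_seconds
        while window and window[0][0] <= cutoff:
            window.popleft()
        return sum(v for _, v in window) / len(window)


# One-hour window, drop negative values, double each value before averaging.
agg = SlidingWindowAverage(
    3600,
    keep=lambda e: e["value"] >= 0,
    value_of=lambda e: e["value"] * 2,
)
first = agg.update({"key": "u1", "ts": 0, "value": 1})
second = agg.update({"key": "u1", "ts": 1800, "value": 3})
third = agg.update({"key": "u1", "ts": 3600, "value": 5})   # the ts=0 event has expired
dropped = agg.update({"key": "u2", "ts": 3600, "value": -1})
```

A production engine would also need to handle out-of-order events, persistence, and high availability, which this sketch leaves out on purpose.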

Deliverables for Feature Stores

A typical feature store uses two types of data sources for storing features:

  • Historical: A database or data store for fetching features for offline training as a batch operation. This is usually a database or a set of files in a columnar format such as Parquet.
  • Streaming: A real-time pipeline may include a feature calculation whose output serves as the input for another step in the pipeline. In this case, storing the feature in a stream is more efficient than storing it in a database.
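The two targets described above can be sketched as a single sink that writes every feature value both to a historical store and to a stream. This is an in-memory illustration only: `FeatureSink` is a hypothetical class, the sorted list stands in for a database or Parquet files, and the deque stands in for a real stream such as Kafka. The `as_of` lookup shows the point-in-time read that an as-of join relies on when building training sets.

```python
import bisect
from collections import defaultdict, deque
from typing import Deque, Dict, List, Optional, Tuple


class FeatureSink:
    """Hypothetical sink that writes each feature value to two targets."""

    def __init__(self) -> None:
        # Historical store: per-key rows kept sorted by timestamp.
        self.history: Dict[str, List[Tuple[float, float]]] = defaultdict(list)
        # Stream: in-process stand-in for a real streaming topic.
        self.stream: Deque[Tuple[str, float, float]] = deque()

    def write(self, key: str, ts: float, value: float) -> None:
        bisect.insort(self.history[key], (ts, value))
        self.stream.append((key, ts, value))

    def as_of(self, key: str, ts: float) -> Optional[float]:
        """Latest value recorded at or before `ts`, or None if there is none."""
        rows = self.history.get(key, [])
        idx = bisect.bisect_right(rows, (ts, float("inf")))
        return rows[idx - 1][1] if idx else None


sink = FeatureSink()
sink.write("user-1", 10.0, 0.20)
sink.write("user-1", 20.0, 0.35)

# Offline training: what did the feature look like at t=15?
value_at_15 = sink.as_of("user-1", 15.0)

# Streaming: the next pipeline step consumes values in arrival order.
next_item = sink.stream.popleft()
```

Reading the value as of the label's timestamp, rather than the latest value, keeps future information from leaking into the training set.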

Monitoring and Re-training Features

The Importance of Integrating a Feature Store with Monitoring and Training Processes

Deployment is not finished when you push a model to production for the first time. Models in production need to be monitored on an ongoing basis because their predictions may become less accurate over time. This is called model drift, and one of its causes is data drift: a change in the distribution of the input features relative to the data the model was trained on.
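As an illustration of how data drift might be measured, the sketch below computes the Population Stability Index (PSI) between a training sample and a live sample of one feature. PSI is one common choice among many and is not something the article prescribes. The function name `psi`, the bin count, and the 0.25 alert threshold (a widely quoted rule of thumb) are all assumptions, and both samples are assumed to be non-empty.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index of `actual` against `expected`."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant features

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Clamp values outside the training range into the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        # eps keeps the logarithm defined for empty bins.
        return [max(c / len(values), eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


train = [float(v) for v in range(100)]
live = [v + 50.0 for v in train]  # same shape, shifted upward

no_drift = psi(train, train)
drift = psi(train, live)
needs_retraining = drift > 0.25  # assumed rule-of-thumb threshold
```

If the feature store computes statistics like this on the same features it serves, a drift alert can trigger re-training on fresh data from the historical store.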

Conclusion

Real-time feature engineering is crucial for any modern feature store. To address the complex challenges of creating and managing real-time features, such a solution must include certain capabilities and frameworks.

Feature Stores for ML

AI, Data, and everything in between
