Lift your MLOps Pipeline to the Cloud with Feast and Astra DB

Author: Stefano Lottini

DataStax
Building Real-World, Real-Time AI
Jul 26, 2022


When scaling machine learning in production, using a feature store can significantly reduce repeat work and standardize feature definitions. This post looks at how to configure DataStax Astra DB as an online feature store with Feast.

Upgrade your Feast online feature store with Cassandra

It’s common for high-growth apps and services to use machine learning (ML) to deliver their core services. However, many artificial intelligence and machine learning (AI/ML) teams struggle to find an efficient, robust way to scale ML practices to production.

As a set of best practices emerged (now collectively termed “MLOps”), the concept of a feature store has appeared as a central piece of software essential to modern data architecture.

Of the several feature stores available today, Feast is gaining popularity due to its flexibility, its ease of use, and the fact that, unlike most equivalent solutions, it’s open source (under the Apache 2.0 license).

The new feast-cassandra plugin lets developers configure DataStax Astra DB, as well as any Apache Cassandra® cluster, as the online data store for Feast, an open source feature store for ML.

Feast’s modularity enables third-party database technologies to integrate with its core engine. Switching Feast to use Astra DB or Cassandra as an online store is simply a matter of editing a couple of lines in the store’s configuration file.

To get you started, here’s a brief introduction to the concept of a feature store, the Feast plugin, and how to configure it for Astra DB.

What’s a feature store?

The typical task in ML consists of building a model and training it with some initial labeled data (the learning phase). This enables the model to provide predictions when presented with new input.

Think of a model to detect fake reviews on your e-commerce website. First you “show” it many reviews along with their actual status (legitimate or fake). Once you have “trained” it well enough, you can employ it to improve the quality of your service by automatically weeding out unwanted contributions.

A crucial concept is that of a feature: a specific value that describes an input to the model. This could be, for instance, the count of all-caps words in the review text, the presence or absence of a particular adjective, or the time the review was submitted. Choosing which features to use when designing a model is essential, and a good deal of a data scientist’s time is spent on “feature engineering.”
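To make the idea concrete, here is a minimal sketch that turns one review into the three example features mentioned above. The function name, the feature names, and the sample review are made up for illustration; a real pipeline would likely use many more features.

```python
from datetime import datetime


def extract_features(text: str, submitted_at: datetime, adjective: str = "amazing") -> dict:
    """Turn one raw review into a flat dictionary of feature values."""
    # Strip common punctuation so "NOW!" is treated as the word "NOW".
    tokens = [word.strip(".,!?;:") for word in text.split()]
    # Count words written entirely in capitals (two letters or more, to skip "I" and "A").
    all_caps = sum(1 for t in tokens if len(t) > 1 and t.isalpha() and t.isupper())
    return {
        "all_caps_word_count": all_caps,
        # Presence or absence of one particular adjective, ignoring case.
        "has_adjective": adjective.lower() in {t.lower() for t in tokens},
        # Hour of day at which the review was submitted.
        "submitted_hour": submitted_at.hour,
    }


features = extract_features("AMAZING product, BUY it NOW!", datetime(2022, 7, 26, 23, 15))
print(features)
```

Each key of the resulting dictionary is one feature; a feature store manages values like these across many reviews and over time.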

Feature stores are not themselves databases, but tools that manage data stored in other database systems. A feature store aims to solve some of the most vexing problems data engineers struggle with, namely:

  • Different teams having to “reinvent the wheel” (the features and the data pipelines) and not sharing their efforts easily within an organization.
  • Lack of systematic control over the data used in training versus in running a model (“training-serving skew”).
  • The search for a unified engine to track “backfill” computations and similar ordinary data administration tasks.
  • The need for incremental updates to the data set while retaining the possibility of retrieving point-in-time data snapshots.
  • An abstract layer for data access that “speaks” the language used by the rest of the ML stack, regardless of the actual backing infrastructure.

These needs are addressed by the introduction of a feature store as the central hub for a production data pipeline. At its heart, a feature store handles the relationship between the “offline” store (which retains all data history) and the “online” store (where only the latest value of each feature is kept). The latter is typically used in providing predictions and must be accessible quickly and reliably.
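The offline/online split can be illustrated with a few lines of plain Python. This is only a conceptual sketch with hypothetical rows and names, not how Feast stores data internally: the offline side keeps every row, while the online side keeps one row per entity.

```python
from datetime import datetime

# Hypothetical offline history: (entity_id, event_timestamp, feature_value) rows.
OFFLINE_ROWS = [
    ("user_1", datetime(2022, 7, 1), 0.10),
    ("user_2", datetime(2022, 7, 2), 0.40),
    ("user_1", datetime(2022, 7, 20), 0.35),
    ("user_2", datetime(2022, 6, 15), 0.90),
]


def latest_values(rows):
    """Collapse a full history into one (timestamp, value) pair per entity."""
    online = {}
    for entity_id, ts, value in rows:
        # Keep a row only if it is newer than what is already stored.
        if entity_id not in online or ts > online[entity_id][0]:
            online[entity_id] = (ts, value)
    return online


online_store = latest_values(OFFLINE_ROWS)
print(online_store)
```

A lookup in `online_store` is a single key access, which is why key-value databases are a natural fit for the online side.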

What’s Feast?

Feast is an open-source feature store solution that can be configured to use several data sources (from databases to streams) and several online data stores.

Figure 1. Feast-at-a-glance (Source: Feast.dev)

Feast has a Python API that makes it easy to get started quickly, without having to learn any domain-specific language. Integration with existing ML tools is usually a matter of loading the features differently, and little more. Even most of the “configuration,” such as defining the features of a project, is done in the Python programming language.

Other tasks, such as syncing the offline store into the online store for the latest data, can be scheduled on a periodic basis (and simply consist of system commands to be executed).
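As a rough sketch of such a scheduled job, the snippet below builds and runs the `feast materialize-incremental` command from Python. The helper names and the idea of calling it from a scheduler are assumptions for illustration; the repository path is whatever directory holds your feature_store.yaml.

```python
import subprocess
from datetime import datetime, timezone


def materialize_command(now: datetime) -> list:
    """Build the argument list for an incremental materialization up to `now`."""
    end_ts = now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")
    return ["feast", "materialize-incremental", end_ts]


def run_materialization(repo_path: str) -> None:
    """Run the command inside the feature repository; raises if the command fails."""
    command = materialize_command(datetime.now(timezone.utc))
    subprocess.run(command, cwd=repo_path, check=True)


# Only build the command here; run_materialization() is what a scheduler would call.
command = materialize_command(datetime(2022, 7, 26, 12, 0, tzinfo=timezone.utc))
print(" ".join(command))
```

A cron entry or a workflow scheduler could call `run_materialization` at whatever interval suits the freshness needs of the model.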

Feast enables quick and standardized access to the latest features for a dataset, as well as historical retrievals (point-in-time queries, such as “what were the feature values as of last month?”), which are crucial for reproducibility, versioning, and accountability.
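The meaning of a point-in-time query can be shown with a small, self-contained sketch. The history below is invented and the lookup is a simplification: it only illustrates the semantics ("the most recent value known at a given moment"), not how Feast implements historical retrieval.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical history of one feature for one entity, sorted by timestamp.
HISTORY = [
    (datetime(2022, 5, 1), 12),
    (datetime(2022, 6, 10), 17),
    (datetime(2022, 7, 20), 25),
]


def value_as_of(history, as_of):
    """Return the most recent value recorded at or before `as_of`, or None."""
    timestamps = [ts for ts, _ in history]
    # bisect_right gives the number of rows whose timestamp is <= as_of.
    position = bisect_right(timestamps, as_of)
    return history[position - 1][1] if position else None


last_month = value_as_of(HISTORY, datetime(2022, 6, 26))
latest = value_as_of(HISTORY, datetime(2022, 7, 26))
before_any = value_as_of(HISTORY, datetime(2022, 1, 1))
print(last_month, latest, before_any)
```

Training on values obtained this way avoids leaking information from the future into the model, which is one source of training-serving skew.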

This is just the core idea. Feast ships with other components and layers making it a valuable pillar of a modern AI/ML data stack. For a deeper understanding of how Feast is structured, please refer to the concepts page in Feast’s documentation.

Online and offline feature stores

Offline storage layers store months or even years of historical feature data in bulk, for the purpose of training ML algorithms. This data is typically stored by extending existing data warehouses, data lakes and/or cloud object storage to avoid data silos.

According to Feast’s blog, “Online storage layers are used to persist feature values for low-latency lookup during inference. Responses are served through a high-performance API backed by a low-latency database. They typically only store the latest feature values for each entity, essentially modeling the current state of the world. Online stores are usually eventually consistent, and do not have strict consistency requirements for most ML use cases. They are usually implemented with key-value stores.” For more on this topic, please read this excellent explanation by Feast.

The Cassandra/Astra DB plugin for Feast

The feast-cassandra plugin enables the usage of Astra DB, or any Cassandra cluster, as an online feature store for Feast. When using Astra DB in particular, Feast users can run a model in production that relies on Astra DB’s architecture for providing the data on which predictions are evaluated. This has several advantages:

  • Astra DB is a database built on the robust Cassandra engine with its blazing fast write (and read) operations, 100% uptime and linear scalability.
  • Astra DB also inherits Cassandra’s eventually consistent database design with configurable consistency levels.
  • Astra DB supports multi-region with active-active configurations, making it possible for a single database to guarantee very low latencies from anywhere in the world.
  • Astra DB is serverless. There is no fixed cost in using it; billing is based solely on storage, network, and read/write operations. It autoscales to zero when unused, making costs extremely low on any cloud.
  • Astra DB is an intelligently auto-scaling cloud service, so there is zero operational burden. This frees busy data engineers to focus on things other than database operations or capacity planning.

As with any Cassandra-based system, you have the assurance that the size of the feature set will not be an issue, since the core design is meant to handle big data.

The feast-cassandra plugin, at its core, is an implementation of the OnlineStore interface (an “abstract base class,” in Python terms), which offers four methods to be “driven” by the core Feast runtime:

  1. update, called to alter the nature of the stored entities (e.g. when adding a new feature).
  2. teardown, called when the whole storage is to be removed.
  3. online_write_batch, called when saving features in the online store (for instance, as the result of a “materialize” operation that syncs the latest value of the features from the offline storage).
  4. online_read, used to retrieve the features associated with a given set of entities.

Figure 2. Online store lifecycle operations.
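
To show how the four methods fit together, here is a toy in-memory store with the same method names. The signatures are deliberately simplified and are not Feast’s actual OnlineStore signatures (which also receive the repository configuration and richer types); the class, table, and feature names are hypothetical.

```python
class InMemoryOnlineStore:
    """Toy stand-in for an online store; NOT Feast's real OnlineStore signatures."""

    def __init__(self):
        # table name -> {entity key -> {feature name -> value}}
        self.tables = {}

    def update(self, tables_to_keep, tables_to_delete=()):
        """Create missing tables and drop the ones no longer wanted."""
        for name in tables_to_delete:
            self.tables.pop(name, None)
        for name in tables_to_keep:
            self.tables.setdefault(name, {})

    def teardown(self):
        """Remove the whole storage."""
        self.tables.clear()

    def online_write_batch(self, table, rows):
        """Upsert (entity key, feature dict) pairs, overwriting older values."""
        for entity_key, features in rows:
            self.tables[table].setdefault(entity_key, {}).update(features)

    def online_read(self, table, entity_keys, requested_features):
        """Return one dict per entity key, or None when the entity is unknown."""
        results = []
        for key in entity_keys:
            stored = self.tables[table].get(key)
            if stored is None:
                results.append(None)
            else:
                results.append({name: stored.get(name) for name in requested_features})
        return results


store = InMemoryOnlineStore()
store.update(tables_to_keep=["review_stats"])
store.online_write_batch("review_stats", [("user_1", {"caps": 3, "hour": 23})])
store.online_write_batch("review_stats", [("user_1", {"caps": 5})])  # newer value wins
rows = store.online_read("review_stats", ["user_1", "user_9"], ["caps", "hour"])
print(rows)
store.teardown()
```

The plugin plays the same role, except that each method translates into table management or read/write statements against Cassandra or Astra DB.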

Create your first Astra DB-backed feature store

Here’s how to get started with Feast and Astra DB as its online store backend:

  1. Install Feast and the Cassandra plugin (requires Python 3.7+; we suggest using a virtual environment).
  2. Create an Astra DB instance. The plugin will manage it as far as table creation and data I/O are concerned.
  3. Take note of the Database Token (with “DB Administrator” role), created along with the DB: it will supply authentication credentials to the plugin.
  4. Download the “Secure connect bundle” to access the database.
  5. Initialize a Feast feature store.
  6. Edit the feature_store.yaml file to set Astra DB as the online store. The online_store section of the file should look like the following:
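
A minimal sketch of that section, written as a small Python template so the values can be filled in programmatically. The key names (type, secure_bundle_path, keyspace, username, password) and the plugin class path are assumptions based on the plugin’s README at the time of writing, so check them against the version you install. The path and credentials shown are placeholders.

```python
# Assumed layout of the online_store section for the feast-cassandra plugin.
ONLINE_STORE_TEMPLATE = """\
online_store:
    type: feast_cassandra_online_store.cassandra_online_store.CassandraOnlineStore
    secure_bundle_path: {bundle_path}
    keyspace: {keyspace}
    username: {client_id}
    password: {client_secret}
"""


def online_store_section(bundle_path: str, keyspace: str, client_id: str, client_secret: str) -> str:
    """Fill the template with the values collected in the previous steps."""
    return ONLINE_STORE_TEMPLATE.format(
        bundle_path=bundle_path,
        keyspace=keyspace,
        client_id=client_id,
        client_secret=client_secret,
    )


# Placeholder values; use your own bundle location, keyspace, and token fields.
section = online_store_section(
    "/path/to/secure/bundle.zip", "KeyspaceName", "Client_ID", "Client_Secret"
)
print(section)
```

The printed text can be pasted into feature_store.yaml in place of the default online_store section.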

In the above, replace the path with the location of the secure connect bundle downloaded earlier. Make sure the KeyspaceName matches what was chosen when creating the database, and provide the Client_ID and Client_Secret found in the DB token. At this point, the rest of the Feast quickstart can be executed to see Feast in action with Astra DB.

For a more in-depth walkthrough, watch the video on how to configure Feast for Astra DB.

Final notes

With that, you’re officially off and running! With Astra DB, your NoSQL skills aren’t tied to a specific cloud provider, so you only have to learn them once. Also, Cassandra-based systems combine the distributed storage and replication techniques of Amazon’s Dynamo with the data and storage engine model of Google’s Bigtable. You’re welcome to Feast on an online store that autoscales on its own, without requiring multiple other commercial cloud services!

As a final note, if you’re not using Astra DB yet, register for a free account (no credit card required) and get a monthly allowance that covers even small production workloads.

Follow the DataStax Tech Blog for more developer stories. Check out our YouTube channel for tutorials and follow DataStax Developers on Twitter for the latest news about our developer community.

Resources

  1. Configure Feast for Astra DB in 5 min
  2. feast-cassandra plugin
  3. DataStax Astra DB
  4. Awesome Astra on GitHub
  5. Python Astra SDK
  6. Free courses | DataStax Academy
  7. DataStax Developers YouTube channel
