Building machine learning features in Snowflake with open-source feature store Feast

We are excited about the recent integration announcement between Snowflake and Feast, the popular open source feature store. The integration will streamline the way in which data teams can securely and reliably store, process and manage machine learning (ML) features in Snowflake.

By bringing ML features into Snowflake, data science and ML teams can use the performance and scalability of the Snowflake engine to quickly refresh feature values, generate training data sets with point-in-time correctness, and provide models running in batch with the data they need for inference at any scale. And if you are running real-time inference for use cases such as recommendation engines or fraud detection, Feast keeps your low-latency store and Snowflake in sync.

If you are new to the concept of a feature store, I recommend you get started here.

Now that you have the basic concepts down, let me show you how you can get started. You will need:

  • A Snowflake account
  • A code/text editor to modify .yml & .py files
  • A basic understanding of pulling code from GitHub
  • Some experience in Python and installing packages
  • Some experience working with a command line tool

What this article is NOT is a primer on what a feature store is or how to use one. Please check out the main feast page for that information, including the basics.

1.) Pip install feast and set up a project

To get started with this new integration, we will need to grab the feast package. My recommendation would be to create a Python 3.9 virtual environment. In your terminal, this can be done by executing the following command: pip install 'feast[snowflake]'

Adding [snowflake] tells pip to include all the dependencies for the Snowflake integration.
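
Before moving on, you can quickly sanity check the install from Python. This assumes the package exposes a __version__ attribute (recent feast releases do):

import feast

print(feast.__version__)  # should print the installed feast version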

Once installed, several CLI commands become available for us to do things like create and tear down a feast project. In this case, we don't have a feast project yet, so we need to create one. This can be done by executing the following in your command line: feast init -t snowflake

“feast init” will create a project with a random name for you. “feast init {name}” will create a project with that name. “feast teardown” will tear down the infrastructure that was set up to operate the feature store.

feast init -t snowflake will automatically create a new randomly named folder with the main configuration files needed to operate a feature store. But first, it will prompt you to connect your Snowflake account to your feast project. Go ahead and enter your Snowflake details (I have provided sample answers). Lastly, enter Y to allow the project to upload a sample dataset that we will use to demonstrate the rest of the integration.

In this example, “balanced_pup” is the name of the feast project created. A project is a folder with a collection of files that we will further explore.

To check if the data upload was successful, you can use the Snowflake web user interface: log in to your account and look for the new table “balanced_pup_feast_driver_hourly_stats” in the database you chose, under the PUBLIC schema.
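
If you prefer to check from Python rather than the web UI, here is a minimal sketch using snowflake-connector-python, which is installed as part of feast[snowflake]. The connection details are placeholders for whatever you entered during feast init, and the table name/quoting may need adjusting to however the table appears in your account:

import snowflake.connector

# Placeholder credentials: reuse the values you gave feast init
conn = snowflake.connector.connect(
    account="xxx.us-east-1",
    user="my_user",
    password="my_password",
    database="my_database",
    warehouse="my_warehouse",
)

# Count rows in the uploaded sample table (name follows your project name;
# adjust case/quoting to match how it shows up in your account)
row_count = conn.cursor().execute(
    'SELECT COUNT(*) FROM PUBLIC."balanced_pup_feast_driver_hourly_stats"'
).fetchone()[0]
print(row_count)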

And here is a look at some of the sample data.

If you zoom in, you will see we have some columns with timestamps, floats, and regular integers.

Going back to our terminal, we can now navigate to the newly created project folder to start inspecting the pre-created artifacts. Execute the following: cd {your_project}

Then: ls

We can see that we have 1 folder, data, and 3 files: feature_store.yaml, driver_repo.py, and test.py.

  • The data folder has a sample dataset we will use for demonstration purposes, and will also contain some other local artifacts generated for this project.
  • feature_store.yaml is where we configure the infrastructure that supports the feature store operations.
  • driver_repo.py is where we define the features in our feature store.
  • test.py is a test file that checks whether all of our feast components are working with Snowflake.

To recap, we have now successfully set up a feast project, and we have uploaded sample data to a table in a database that we are going to play with later on. Before we get to that, let's look at our feature_store.yaml file, which is the link between feast and Snowflake.

2.) Instrumenting our feast project to use Snowflake as an Offline Store

To use Snowflake as our offline store, feast requires us to say so in our feature_store.yaml file, which we have already done via the template we filled out earlier. Let's take a look: nano feature_store.yaml

Snowflake object names are all UPPER CASE by default, unless specifically quoted. Options are available to define these parameters in a separate config file and point feast to that offline store config file. Other provider/online store configurations, such as AWS/Redis, will also work with Snowflake as an offline store.

Here we can see things like the project name, registry, provider, and offline store. Each of these has a specific meaning in a feast project, and fortunately they are all interoperable. We will continue with the current configuration: a Snowflake offline store, a local provider, and a local online store.
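
If you would rather poke at the configuration from Python than open it in an editor, a small sketch using the same yaml.safe_load call the template itself uses looks like this (run it from inside your project folder; the exact values reflect what you entered during feast init):

import yaml

config = yaml.safe_load(open("feature_store.yaml"))

print(config["project"])
print(config["provider"])
print(config["offline_store"])                        # Snowflake connection details
print(config.get("online_store", "default (local)"))  # may be omitted for the local default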

We are now ready to start doing feature store “things.”

3.) Declaring the features to the feature store.

The primary objective of a feature store is for you to declare your feature definitions once, such that a common feature definition can be used across both model training and model scoring scenarios.

What that means is that we need to provide configuration files that tell feast what the columns of data in our tables actually represent.

To get started, open up the driver_repo.py in your text editor:

from datetime import timedelta

import yaml

from feast import Entity, Feature, FeatureView, SnowflakeSource, ValueType

driver = Entity(
    name="driver_id",
    join_key="driver_id",
)

project_name = yaml.safe_load(open("feature_store.yaml"))["project"]

driver_stats_source = SnowflakeSource(
    database=yaml.safe_load(open("feature_store.yaml"))["offline_store"]["database"],
    table=f"{project_name}_feast_driver_hourly_stats",
    event_timestamp_column="event_timestamp",
    created_timestamp_column="created",
)

driver_stats_fv = FeatureView(
    name="driver_hourly_stats",
    entities=["driver_id"],
    ttl=timedelta(weeks=52),
    features=[
        Feature(name="conv_rate", dtype=ValueType.FLOAT),
        Feature(name="acc_rate", dtype=ValueType.FLOAT),
        Feature(name="avg_daily_trips", dtype=ValueType.INT64),
    ],
    batch_source=driver_stats_source,
)

To walk through this code: we created a new type of source object, a SnowflakeSource, which maps to the table sitting inside your Snowflake account. We created an entity object, which defines the unit of analysis for each row in the table. “Unit of analysis” can be thought of as how the data was measured. In this example, the unit of analysis, or the way we measured the data, is by someone's “driver_id,” which we can assume is a unique driver on the road. This is an important concept that is worth delving into more deeply. Lastly, we take our two objects and create a feature view, a set of features that relate to the source data and the entity (which may not always be present).

Now that we have provided a metadata layer over the raw data in our tables, we are ready to let feast handle the rest and actually start consuming the feature store from the perspective of a data scientist looking to train models.

4.) Serving features for training.

From here, we will be working in Python. You can open up a Jupyter notebook if you prefer; I like to run things straight from the command line.

To start a Python session in your command line, type: python

Let's import the packages we will need; again, make sure your working directory is still your feast project folder. You can copy/paste the following into your command line and press enter:

from driver_repo import driver, driver_stats_fv
from datetime import datetime, timedelta
import pandas as pd
from feast import FeatureStore

At this point, we have created our configuration files, but nothing has happened yet to “sync” everything up. Feast will handle this for you. First we will instantiate an “empty” feature store object and then apply() our specific configurations, which registers our project's feature definitions in the feature store. You would run this every time you add or change tables or feature definition files. Execute the following:

fs = FeatureStore(repo_path=".")
fs.apply([driver, driver_stats_fv])

That execution won't yield any output, as all we are doing is updating our metadata registry.
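
If you want to confirm what was registered, the FeatureStore object exposes list methods you can call from the same session. A quick check (method names per the feast SDK; adjust if your version differs) looks like this:

# Confirm the registry now knows about our entity and feature view
print([e.name for e in fs.list_entities()])         # expect ['driver_id']
print([fv.name for fv in fs.list_feature_views()])  # expect ['driver_hourly_stats']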

With the configuration data applied, we are finally ready to generate a training dataset. The way to think about going to your feature store and grabbing the features you care about for your machine learning use case is a little different from the traditional “just load in a CSV with the data.”

To grab the data we care about, we need to create an “entity” dataframe. This entity dataframe needs to contain the target we want to predict, the timestamps when the target was recorded, and, if they exist, an entity. This could be a simple SELECT target FROM {table} command that you move to a pandas dataframe. For the purposes of this demonstration, we are going to create our own random one. Run the following code:

entity_df = pd.DataFrame(
    {
        "event_timestamp": [
            pd.Timestamp(dt, unit="ms", tz="UTC").round("ms")
            for dt in pd.date_range(
                start=datetime.now() - timedelta(days=3),
                end=datetime.now(),
                periods=3,
            )
        ],
        "driver_id": [1001, 1002, 1003],
    }
)

If you print(entity_df), you can see we have some timestamps and associated entity IDs for the “driver_id” entity we created in our feature definition file. You actually don't need a target variable per se. The main job of the feature store is to return only feature values recorded no later than each timestamp we care about. This is all done to protect you from target leakage.
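
For completeness, here is a hedged sketch of pulling a real entity dataframe out of Snowflake instead of fabricating one. The label table and its columns are hypothetical, the connection details are the same placeholders you gave feast init, and snowflake-connector-python (with its pandas extras) ships with feast[snowflake]:

import snowflake.connector

# Placeholder connection: reuse the details you entered during feast init
conn = snowflake.connector.connect(
    account="xxx.us-east-1",
    user="my_user",
    password="my_password",
    database="my_database",
    warehouse="my_warehouse",
)

# Hypothetical label table holding the target, its timestamp, and the entity key
entity_df = conn.cursor().execute(
    "SELECT event_timestamp, driver_id, trips_completed FROM my_label_table"
).fetch_pandas_all()

# Snowflake returns UPPER CASE column names by default; rename them to match
# the names used in your feature definitions
entity_df.columns = [c.lower() for c in entity_df.columns]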

From there, the feature store will handle the rest. All we need to do now is create a list of the features in our feature store that we want to augment our spine (entity) dataframe with, and the feature store will return them for us via the get_historical_features() method. Execute the following code:

features = ["driver_hourly_stats:conv_rate", "driver_hourly_stats:acc_rate", "driver_hourly_stats:avg_daily_trips"]training_df = fs.get_historical_features(
features=features, entity_df=entity_df
).to_df()

This will just take a moment as Snowflake computes on the backend. When we print(training_df), you can see we get our original dataframe back plus the features we asked for in our list. From here, in theory, you could now go train a model and deploy it in your environment of choice, behind an API perhaps.
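
As a hedged illustration of that last point, here is a minimal sketch of training a model on training_df. scikit-learn is an assumption on my part (it is not installed by feast[snowflake]), and because our toy entity dataframe carries no real label, we fabricate a random target purely so the code runs:

import numpy as np
from sklearn.linear_model import LinearRegression

# The features returned by the feature store; in a real project your entity
# dataframe would also carry the label you want to predict
feature_cols = ["conv_rate", "acc_rate", "avg_daily_trips"]
X = training_df[feature_cols]
y = np.random.rand(len(training_df))  # placeholder target, illustration only

model = LinearRegression().fit(X, y)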

To give you an idea of what is happening behind the scenes, navigate to the “History” tab in your Snowflake account. You should see that your last query looks like the following, which feast automatically generated for you:

Beautiful, ain't it?

Here is a subset of all the SQL code generated, which you would otherwise have had to develop in order to get a point-in-time correct query. What's awesome is that Snowflake does the heavy lifting for you and then automatically suspends your virtual warehouse when it's done running the query.

5.) Serving features for inference.

The very last piece here, now that we have our theoretical model deployed, is to serve the features that model requires in production. More concretely, we need to grab the latest values of the features in our feature store related to a specific “driver id” that we care about at prediction time. Run the following code:

fs.materialize_incremental(end_date=datetime.now())

online_features = fs.get_online_features(
    features=features, entity_rows=[{"driver_id": 1001}, {"driver_id": 1002}],
).to_dict()

I am going to ignore the first line of code here; see the note below. What is awesome is that, to get the latest feature values for my model, I just need to pass the same feature list I used when creating my training dataset into the get_online_features() method, and it returns the latest feature values in a dictionary format, which a typical model routing API would expect.

materialize_incremental() is the method that moves our relevant feature data from the Offline Store (Snowflake) to the Online Store (local). Snowflake tables are not currently designed for ultra low latencies, which some ML systems require.
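
To close the loop, here is a hedged sketch of handing those online values to the toy model from the training sketch above. It assumes the returned dictionary is keyed by the plain feature names plus the driver_id join key:

import pandas as pd

# Arrange the online values in the same column order the model was trained on
feature_cols = ["conv_rate", "acc_rate", "avg_daily_trips"]
online_df = pd.DataFrame(online_features)[feature_cols]

# "model" is the illustrative LinearRegression from the training sketch
predictions = model.predict(online_df)
print(dict(zip(online_features["driver_id"], predictions)))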

Through this demo, you now have a basic understanding of how to operationalize your Snowflake data for machine learning use cases in a declarative and scalable way. As Snowflake continues to evolve with features such as support for Snowpark for Python and feast continues to add new capabilities, we look forward to having an even more comprehensive solution for feature stores.

If you are interested in a managed feature store solution, you should check out our recent announcement with Tecton’s enterprise feature store.
