Enabling a highly scalable feature store with Teradata Vantage and FEAST

Mohammad Harris Mansur
Published in Teradata
Jan 25, 2023 · 10 min read

Minimize data replication, provide reusable features, and support high concurrency and fast retrieval

Photo by Shubham Dhage on Unsplash

In this blog, you will see how to build a highly scalable feature store with Teradata Vantage and FEAST. But first, let us start with some basic concepts.

What is an Enterprise Feature Store (EFS)?

An EFS is a curated, managed repository of features that have been used in successful predictive models, materialized as tables in an analytical RDBMS. An analytical RDBMS is designed specifically to support Business Intelligence and analytical applications over large, complex volumes of data, which is why it is better suited to an EFS than a transactional database. The hard, time-consuming work of data preparation, data integration, and feature engineering can be done once, creating features that can be reused to both train and score many different models. Time and care must be taken to create features that have utility as well as predictive value, and to catalog each one, but this initial investment quickly pays off as subsequent projects can easily reuse existing, well-documented features. This is critical if machine learning and AI are to become ubiquitous and deliver their promised value: it means significantly reducing the roughly 80% of cost and effort currently consumed by data preparation and management.

Why do we need an Enterprise Feature Store?

There are many reasons to start thinking about an Enterprise Feature Store. Some of the most important ones are listed below:

  • A Feature Store enables the use of the same metrics across subsequent processes and models
  • Minimizes data replication, redundancy, and errors
  • Reduces the chance of errors from incorrectly coded logic, since features are already created, documented, and tested, which in turn reduces testing effort
  • Modeling processes can be fully accountable for the precise conditions under which a prediction was made, and those conditions can be robustly recreated
  • Creates reusable assets from analytic projects saving time, increasing productivity, and reducing time to production
  • Features are cataloged to ensure availability for reuse and research
  • Reduces overall processing cycles required for new analytics pipelines as computationally intensive features are already materialized
  • Data scientists and analysts can utilize their time on analyzing data and solving business problems, rather than repeating data preparation steps
  • An Enterprise Feature Store allows businesses to easily derive value from data by analyzing readily available information
  • Production processes are simpler as the data is already in the correct format for scoring with live data and the scoring is carried out in the same place as the live data

What is Feast?

Feast (Feature Store) is a flexible data system that utilizes existing technology to manage and provide machine learning features to real-time models, and it allows for customization to meet specific needs. It also lets us make features consistently available for training and serving, avoid data leakage, and decouple ML from data infrastructure.

Different components of Feast:

· Feature View:

Feature views contain features that are properties of a specific object. A feature view consists of a data source, entities, a name, a schema, metadata, and a TTL (time-to-live), which limits how far back Feast will look when generating historical datasets.

· Data source:

A data source, as the name implies, is where the feature data is stored. This can be a locally stored parquet file, a Teradata Vantage system, a Google Cloud Storage bucket, Snowflake, Amazon Redshift, or an Amazon S3 bucket. Feast also allows you to have multiple data sources.

· Feature Service:

Feature Services come into play when you want to create logically related groups of feature views. Hence, a feature service is an object that contains features from one or more feature views.
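As a rough sketch, a feature service that groups a single feature view might be declared like this (the driver_stats_fv variable is an assumption here, standing in for a feature view defined elsewhere in the repository, as in the example later in this post):

from feast import FeatureService

# Group one or more feature views into a logically related service.
# driver_stats_fv is assumed to be a FeatureView defined elsewhere.
driver_activity_service = FeatureService(
    name="driver_activity",
    features=[driver_stats_fv],
)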

· Entity:

A collection of semantically related features.

· Timestamps:

In the context of Feast, timestamps are used to store the features in chronological order.

Understanding the Infrastructure of Feast:

Offline store:

An offline store is where features are stored for later analysis or model training. Data can be both written to and retrieved from an offline store. It is called an 'offline' store because it lives outside the Feast environment.

Online Stores:

An online store is where features are stored for low-latency access. Because features can be retrieved quickly, online stores make inference faster.

Feature Repository:

A Feature Repository defines the feature store. The definitions include what the features are, where they are stored and how we can access or retrieve them.

Registry:

The Feast feature registry is a catalog of all feature definitions and their metadata. It can be stored either on your local machine or with your DBMS service provider.

Why Teradata as a connector to Feast?

Teradata Vantage’s ability to process massive amounts of data in a scalable and timely fashion is the single factor most relevant for an organization looking to develop and maintain a successful Enterprise Feature Store. Many systems simply aren’t capable of handling the degree of data manipulation, complex joins, and full table scan processing that are required for the generation and updating of an Enterprise Feature Store.

An Enterprise Feature Store consists of “modeled data”; that is, the data is not simply a collection of flat files, objects, or data frames. Teradata has a long history of correctly modeled data schemas, such as the Financial Services Logical Data Model. Semantic layers built on top of normalized data schemas, such as the Industry Analytic Schemas (IAS), allow the building blocks of an Enterprise Feature Store to be produced quickly in a standardized, documented way. This matters because it shows that Teradata has not only done this many times before, but also has a documented best-practice methodology for it.

Given the capabilities of the Teradata Vantage analytics platform and the advanced processing features it possesses, it doesn’t make sense to extract data and execute processing outside of the data warehouse environment, particularly when the processing involves joining large datasets and transforming them during the process of Feature Engineering.

A Feast connector to Teradata allows BI/Analytics teams to use the same features for reporting as the Data Science team and vice versa, while the IAS ensures that the most valuable data assets are available for reuse in the Feature Store.

How to use Feast with Teradata?

We will now walk you through an example of how this connector works with Feast.

To get started, install the feast-teradata library using the following command:

pip install feast-teradata

Next, run the following command:

feast-td init-repo

This command will prompt you for the name of the repository, the Teradata host URL, username, password, database name, and connection mechanism. It will then create the folder structure required for you to run the demo example.

The information you enter is automatically written to the feature_store.yml file. If you later want to fill in or change any of it, you can edit feature_store.yml directly.

The script driver_repo.py will contain the definition of our Feast feature store. Feast will use this script to register our entities, data sources, and features in a feature repository.

Our entity corresponds to the column driver_id in our dataset and has the datatype INT64. A description parameter simply stores a description of the entity for reference purposes. The join_keys parameter specifies the column used to join the dataset when fetching the relevant features, and the name parameter defines the identifier of this entity, driver.
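A minimal sketch of that entity definition (the description text here is an illustrative assumption):

from feast import Entity, ValueType

# The entity is keyed on the driver_id column of our dataset.
driver = Entity(
    name="driver",
    join_keys=["driver_id"],
    value_type=ValueType.INT64,
    description="Driver identifier",  # illustrative description text
)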

After defining the entity, we define the data source for our features; in our case, it is a Teradata Source. A Teradata Source object takes four important parameters: database, the name of the database in Teradata; table, the name of the table in which the data is stored; timestamp_field, which indicates the field holding the timestamps; and created_timestamp_column, which records when each row was written. The timestamp_field is used for point-in-time joins and for ensuring that only features within the TTL (time-to-live) are returned.
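A sketch of the source definition, following the pattern used in the feast-teradata quickstart (the table name and the environment variable are illustrative assumptions):

import os
from feast_teradata.offline.teradata_source import TeradataSource

# Feature data lives in a Teradata table.
driver_stats_source = TeradataSource(
    database=os.getenv("TERADATA_DATABASE"),  # illustrative: database name read from the environment
    table="driver_stats",                     # illustrative table name
    timestamp_field="event_timestamp",
    created_timestamp_column="created",       # when each row was written
)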

We then move on to defining the Feature View. In the code block below, we define:

  • name: the name of the feature view
  • entities: a list of entities that specifies the keys required for joining or looking up features from this feature view
  • ttl: allows eviction of keys from online stores and limits the amount of historical scanning required to retrieve historical feature values
  • schema: the list of features, defined as a schema, used both to materialize features into a store and as references when building a training dataset or serving features
  • source: the data source that holds the required data; in our case, the Teradata Source we defined earlier
  • tags: user-defined key/value pairs attached to the feature view
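Putting these pieces together, the feature view might look like the following sketch (the ttl length and tag values are illustrative assumptions; the entity and source variables come from the snippets above):

from datetime import timedelta
from feast import FeatureView, Field
from feast.types import Float32, Int64

driver_stats_fv = FeatureView(
    name="driver_hourly_stats",           # name of the feature view
    entities=[driver],                    # keys used to join or look up features
    ttl=timedelta(weeks=52),              # illustrative: limits historical scanning
    schema=[
        Field(name="conv_rate", dtype=Float32),
        Field(name="acc_rate", dtype=Float32),
        Field(name="avg_daily_trips", dtype=Int64),
    ],
    source=driver_stats_source,           # the Teradata Source defined earlier
    tags={"team": "driver_performance"},  # illustrative key/value metadata
)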

The folder structure should look like this:

demo/
    feature_repo/
        driver_repo.py
        feature_store.yml
        data/
            driver_stats.parquet
    test_workflow.py

The data folder contains a parquet file with dummy data that will be the source of all features in this example.

From within the demo/feature_repo directory, execute the following feast command to apply (import/update) the repo definition into the registry. After running this command, you will be able to see the registry metadata tables in the Teradata database.

feast apply

The above command will register feature definitions and build the infrastructure for feature storage and serving.

Feast-Teradata SQL Registry

Another important component is the SQL registry, which creates the following tables in your database:

  • entities
  • data_sources
  • feature_views
  • request_feature_views
  • stream_feature_views
  • managed_infra
  • validation_references
  • saved_datasets
  • feature_services
  • on_demand_feature_views

We use the FeatureStore class from Feast to instantiate a feature store, providing the location of the feature repository as the repo_path parameter. The location is relative to the directory where the script is run.
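For example, run from within the demo directory:

from feast import FeatureStore

# repo_path is relative to the directory the script is run from.
store = FeatureStore(repo_path="feature_repo")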

Next, we extract historical features using SQL. Since we want to extract features between particular dates, we also need to provide start-date and end-date parameters. The query is designed to be dynamic: it takes the table name and the start and end dates as parameters.

Below is the query:
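A sketch of such a query, with the table name and the date range passed in as Python variables (the values shown are illustrative assumptions; the column names follow the dataset described above):

# Parameters supplied by the caller; the values here are illustrative.
table_name = "driver_stats"
start_date = "2021-01-01 00:00:00"
end_date = "2021-12-31 23:59:59"

entity_sql = f"""
    SELECT driver_id, event_timestamp
    FROM {table_name}
    WHERE event_timestamp BETWEEN '{start_date}' AND '{end_date}'
"""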

Feast uses entity DataFrames to combine different feature views. To maintain point-in-time accuracy, Feast aligns both the entity names and the event timestamps in the feature views with those in the entity DataFrame. Feast then uses the entity keys and event timestamps in our entity DataFrame to correctly join the targets with their corresponding feature values.
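Here is a sketch of the retrieval call, using the entity SQL above and the three features defined in our feature view:

# Returns a RetrievalJob; the heavy lifting runs inside Teradata.
job = store.get_historical_features(
    entity_df=entity_sql,
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
        "driver_hourly_stats:avg_daily_trips",
    ],
)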

The get_historical_features method combines the entity dataframe containing labels with the specified features using the features parameter and returns a RetrievalJob object which can be used to access the data. The feature views and individual features are specified using the feature_view_name:feature_name format. In the example shown above, the feature view name is driver_hourly_stats and the name of the features that we want to extract are conv_rate, acc_rate, and avg_daily_trips.

Once you have obtained the RetrievalJob object, you can either save it for later use or immediately inspect the features by calling the to_df method, which returns a Pandas dataframe.
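For example:

# Materialize the retrieval into a Pandas dataframe for inspection.
training_df = job.to_df()
print(training_df.head())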

Online Store usage

There are a few mandatory steps to follow before we can test the online store:

CURRENT_TIME=$(date +'%Y-%m-%dT%H:%M:%S')
feast materialize-incremental $CURRENT_TIME

The materialize-incremental command incrementally materializes features into the online store. If there are no new features to be added, this command will essentially do nothing. With feast materialize-incremental, the start time is either now - ttl (the ttl we defined in our feature views) or the time of the most recent materialization. If you have materialized features at least once, subsequent materializations will only fetch features that weren't present in the store at the time of the previous materialization.
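With features materialized, here is a sketch of fetching an online feature vector for a single driver (driver_id 5, matching the discussion below):

# Fetch low-latency feature values from the online store.
feature_vector = store.get_online_features(
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
        "driver_hourly_stats:avg_daily_trips",
    ],
    entity_rows=[{"driver_id": 5}],
).to_dict()
print(feature_vector)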

When fetching online features, we have two parameters: features and entity_rows. The features parameter is a list and can take any number of the features present in the feature view; the example above requests all of them, but you can also request a subset. The entity_rows parameter is also a list and takes dictionaries of the form {feature_identifier_column: value_to_be_fetched}. In our case, the column driver_id uniquely identifies the different rows of the entity driver, and we are fetching the feature values where driver_id equals 5. We can also fetch multiple rows using the format: [{driver_id: val_1}, {driver_id: val_2}, ..., {driver_id: val_n}]

The get_online_features method returns an OnlineResponse object, which contains a method called to_dict that can be used to convert the feature values to a Python dictionary.

Finally, there is a feast command called feast teardown that does exactly what the name suggests: it deletes the entities and feature views that you registered using feast apply.
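Run it from the feature_repo directory when you want to reset the demo:

feast teardown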

An Enterprise Feature Store reduces time to value in key phases of the analytics workflow, improving productivity and time to market. Teradata's connector to Feast lets you leverage Teradata's massively parallel architecture in a feature store, which further improves efficiency.

Further References

Teradata Vantage Documentation (2.4): https://docs.teradata.com/r/Teradata-VantageTM-User-Guide/June-2022/Introduction-to-Teradata-Vantage/Teradata-Vantage-Overview

Feast Official Documentation: https://docs.feast.dev/

Python Library (Teradata): https://pypi.org/project/feast-teradata/

Github: https://github.com/Teradata/feast-teradata

Getting-started page: https://quickstarts.teradata.com/modelops/using-feast-feature-store-with-teradata-vantage.html

Authors

Mohammad Harris Mansur, Mohammad Taha Wahab and Will Fleury
