VertexAI’s Feature Store for dummies

How to create an E2E automated Feature Store to serve ML services

Sebastian Montero
Beamery Hacking Talent
Mar 31, 2022


Photo by Ricardo Gomez Angel on Unsplash

Within Edge AI (Beamery’s applied data science team) we tested VertexAI’s Feature Store to understand the process of serving online features to machine learning APIs used by engineering teams. Vertex AI was officially launched in May 2021, so online documentation and hands-on experience with this tool are still limited. We have therefore decided to create a document that explains the E2E process of setting up a VertexAI Feature Store and creating a pipeline that feeds data from BigQuery into the Feature Store to serve live data to AI services.

This blog will walk you through what a feature store is, how to create one, how to ingest data into one, and how to make requests using Google VertexAI’s Feature Store API. We assume the reader understands the machine learning development process, which will help them see the value of a feature store for model serving and deployment. The goal of this document is to explain why data scientists should use feature stores, describe how we are using them within Beamery, and show the reader how to use Google Cloud Platform’s (GCP) capabilities to serve features within their models.

Definitions

This section will provide an overview of the definitions of the tools and technologies mentioned throughout this article. We will also link other useful articles that were used as sources or references.

What is a feature store?

A feature store is used within an organisation to ensure there is one common repository where data scientists can publish and retrieve features during the machine learning lifecycle. A feature store differs from a typical data warehouse in that it is made up of two databases, each used at a different step of the machine learning lifecycle: an offline feature store and an online feature store.

The offline feature store allows you to serve a high volume of features in batches and it is typically used in the early stages of the machine learning development process, especially during data exploration and model training.

The online feature store, on the other hand, allows you to serve features with low latency (almost real-time). This feature store is usually used within online applications to serve the features needed to make predictions during production.

The table below compares both types of feature stores side by side.

❗ In the GCP environment, online feature stores are referred to as Streaming feature stores and offline feature stores are referred to as Batch feature stores.

Further comparisons between Data Warehouses and Feature Stores can be found in the Logical Clocks article here.

What is VertexAI?

GCP’s VertexAI was launched in May 2021 with the goal of bringing all of GCP’s ML services into one location. The goal is to “Build, deploy, and scale ML models faster, with pre-trained and custom tooling within a unified AI platform”. VertexAI is an extensive suite for all-things-ML; the feature store is just one part of this. More information about all of VertexAI’s capabilities can be found in the product overview here.

What is VertexAI’s Feature store?

VertexAI’s feature store provides a centralised repository for organising, storing and serving ML features within the Google Cloud Platform environment. VertexAI’s feature store provides different APIs to create a schema of entities and features, ingest data and make API requests served through the online or offline feature store. In the Hands-on example section we will demonstrate how to implement an online feature store.

The VertexAI Data Model

Within VertexAI’s Feature store, there are different hierarchical elements that need to be defined before ingesting and serving features. These elements follow this hierarchical structure: Feature store → EntityType → Feature.

  1. Feature store: The Feature store is the top-level container for EntityTypes and Features. Typically an organisation creates one shared feature store to be used across different teams or projects; however, separate feature stores can also be created to isolate development environments such as experimentation, testing and production.
  2. EntityType: An EntityType is a collection of related features sharing a similar level of granularity. These are defined by the data scientist; examples include all features related to users or to companies. An EntityType contains Entity instances, and each Entity requires a unique identifier that is used as a lookup key to return a vector with all of the requested features.
  3. Feature: A Feature is a property or attribute of an Entity. When you define a Feature you need to specify its value type; the available value types can be found in the API documentation here. The Feature is the container for all Feature Values for each Entity.

Further documentation about Feature Store concepts and data models can be found here.

Using Feature Stores at Beamery Edge

As of this writing, feature stores within Edge (Beamery’s Applied Data Science team) are still in their early stages of development. So far, we have benefitted from the use of online feature stores by allowing us to serve features within the deployed model container, instead of sending the features in a request payload.

Although sending features within the request payload was the original option, it increased our dependency on engineering teams external to Edge to calculate those features and ensure they were available in our production databases.

This approach lets us develop and iterate faster and makes model maintenance easier: we can adjust the underlying data definitions and feature calculations used to train a model and replicate that logic in the online feature store used to make predictions.

👾 Throughout the next sections you will see callouts with the 👾 emoji. These are insights on how we have implemented each step within Edge.

Hands-on example

In this demo we will show how to create a Feature store, EntityType and Features and then, how to call these in an API. The process is the following:

  1. Create a Feature store using cURL
  2. Create an EntityType using cURL
  3. Define the Features using cURL
  4. Ingest features for each entity (manually cURL or automated within GCP)
  5. Serve the features (in cURL or Python)

We will show you how to run through all of these steps using cURL and show you additional information needed to set user and Service Account permissions, automating the process within GCP and making requests through the Python SDK.

👾 In Edge, we use cURL requests in our development environment. The auth token can be retrieved using gcloud auth print-access-token (as seen in the scripts below), and it returns the access token for the logged-in gcloud Data Scientist with VertexAI access. For production workflows using Python requests and Cloud Scheduler jobs we load the Service Account to ensure reliability.

Permissions required

For this demo we granted roles/aiplatform.featurestoreAdmin to our personal gcloud account and to the VertexAI Service Account; however, this access can be limited using the following roles:

  • roles/aiplatform.featurestoreAdmin: Grants full access to all resources in Vertex AI Feature Store
  • roles/aiplatform.featurestoreDataViewer: When applied to an EntityType, this role provides permissions to view data of any Features in that EntityType.
  • roles/aiplatform.featurestoreDataWriter: When applied to an EntityType, this role provides permissions to read and write data for any Features in that EntityType.
  • roles/aiplatform.featurestoreInstanceCreator: Administers the Feature Store resource, but not the child resources under it.
  • roles/aiplatform.featurestoreResourceEditor: Manage all resources within a Feature Store, but cannot create or update the Feature Store itself.
  • roles/aiplatform.featurestoreResourceViewer: Viewer of all resources in Vertex AI Feature Store but cannot make changes.

Further Access Control documentation can be found here.

Accessing VertexAI’s Feature store

The VertexAI Feature store can be accessed under the Features tab inside the VertexAI product suite. The link can be found here. Once in, you can create an EntityType or see ingestion jobs; however, all Feature store, Ingestion, and Feature creation jobs need to be done using either cURL or Python.

Creating a Feature store

The first step when we get started with VertexAI’s Feature store is to create the actual Feature store. For this, we need to define the LOCATION, PROJECT, and FEATURESTORE_ID in our bash script. These variables should remain constant for all other requests. Note that the LOCATION variable is the Feature store location. As of this writing, there are four locations (regions) in which we can create feature stores.

👾 In Edge, we have created a feature store for each machine learning project, allowing us to separate the different Entities and Feature Values that will be used as feature lookups to make predictions. The benefit of this system is that we can get features in real time and we maintain the MLOps workflow within the Applied Data Science team.

The two code blocks below are the API endpoint to create a feature store and the request payload that will define the configuration of our Feature store.

01_create_featurestore.sh

01_request.json
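The original scripts were embedded as gists; below is a minimal sketch of what 01_create_featurestore.sh and 01_request.json could look like. It assumes the v1 REST API, and the region, project, and store ids are made-up placeholders. The curl call is skipped when gcloud is unavailable, since it requires an authenticated session.

```shell
#!/usr/bin/env bash
# Sketch only: creates a Feature store with a single online serving node.
set -euo pipefail

LOCATION="europe-west2"            # one of the supported Feature Store regions
PROJECT="my-gcp-project"           # hypothetical project id
FEATURESTORE_ID="talent_features"  # hypothetical store id

# 01_request.json: configure online serving capacity for the new store.
cat > 01_request.json <<'EOF'
{
  "onlineServingConfig": {
    "fixedNodeCount": 1
  }
}
EOF

URL="https://${LOCATION}-aiplatform.googleapis.com/v1/projects/${PROJECT}/locations/${LOCATION}/featurestores?featurestoreId=${FEATURESTORE_ID}"
echo "POST ${URL}"

# Requires an authenticated gcloud session; skipped when gcloud is unavailable.
if command -v gcloud >/dev/null 2>&1; then
  curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d @01_request.json "${URL}"
fi
```

Creating a store is a long-running operation, so the response returns an operation name you can poll until the store is ready.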

Creating an EntityType

The second step is to create an EntityType. The EntityType is the container for all Entities. We create the EntityType by defining an ENTITY_TYPE_ID and making a POST request to the EntityTypes url with the 02_request.json file as our payload. The payload only requires a description for our EntityType.

👾 In Edge, we have different EntityTypes that relate to the HR Tech space by project. Some examples include candidate_attributes, candidate_interactions, vacancies, and pools.

02_create_entity_type.sh

02_request.json
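As above, the gist contents are not reproduced here; this is a hedged sketch of what 02_create_entity_type.sh and 02_request.json might contain, reusing the same placeholder ids and assuming the v1 REST API.

```shell
#!/usr/bin/env bash
# Sketch only: creates an EntityType inside the Feature store created in step 1.
set -euo pipefail

LOCATION="europe-west2"
PROJECT="my-gcp-project"
FEATURESTORE_ID="talent_features"
ENTITY_TYPE_ID="candidate_attributes"  # hypothetical EntityType id

# 02_request.json: only a description is required.
cat > 02_request.json <<'EOF'
{
  "description": "Candidate-level features used for online predictions"
}
EOF

URL="https://${LOCATION}-aiplatform.googleapis.com/v1/projects/${PROJECT}/locations/${LOCATION}/featurestores/${FEATURESTORE_ID}/entityTypes?entityTypeId=${ENTITY_TYPE_ID}"
echo "POST ${URL}"

# Requires an authenticated gcloud session; skipped when gcloud is unavailable.
if command -v gcloud >/dev/null 2>&1; then
  curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d @02_request.json "${URL}"
fi
```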

Creating the Features

Third, we will add features. Features are defined in the 03_request.json payload file. For each feature we need to input the description, valueType and featureId. Value types can be found in the API documentation here. Note that in this step we are creating the feature schema with names and datatypes as placeholders but there are still no feature values within the Feature store.

👾 In Edge, we try to limit our features to BOOL, INT64, or DOUBLE, as quantitative features can be used directly in our model without additional data extraction (e.g. as we would need in NLP-focused projects).

03_add_features_to_entity_type.sh

03_request.json
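Again as a sketch of the embedded gists: 03_add_features_to_entity_type.sh can call the batchCreate endpoint, with 03_request.json listing each feature's featureId, valueType and description. The feature names below are invented for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: defines the feature schema (names and datatypes, no values yet).
set -euo pipefail

LOCATION="europe-west2"
PROJECT="my-gcp-project"
FEATURESTORE_ID="talent_features"
ENTITY_TYPE_ID="candidate_attributes"

# 03_request.json: one CreateFeatureRequest per feature.
cat > 03_request.json <<'EOF'
{
  "requests": [
    {
      "featureId": "years_experience",
      "feature": { "valueType": "INT64", "description": "Total years of work experience" }
    },
    {
      "featureId": "open_to_relocation",
      "feature": { "valueType": "BOOL", "description": "Whether the candidate is open to relocating" }
    }
  ]
}
EOF

URL="https://${LOCATION}-aiplatform.googleapis.com/v1/projects/${PROJECT}/locations/${LOCATION}/featurestores/${FEATURESTORE_ID}/entityTypes/${ENTITY_TYPE_ID}/features:batchCreate"
echo "POST ${URL}"

# Requires an authenticated gcloud session; skipped when gcloud is unavailable.
if command -v gcloud >/dev/null 2>&1; then
  curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d @03_request.json "${URL}"
fi
```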

Ingesting features from BigQuery into the Feature store

Manually creating an ingestion job

To create an ingestion job from BigQuery into the Feature store we need to define several variables in the 04_request.json payload file:

  • First we define the entityIdField that will be used as the unique id for each of our entities. Note that the unique id should not be added as a feature.
  • We can also define a featureTimeField that tells us how recent each feature value is.
  • Next we define where our data lives. In this example we are using the bigquerySource and defining a BigQuery table, but data can also be ingested from Google Cloud Storage.
  • Finally we map the features in BigQuery to the Features in the Feature store. Note that the datatypes must match between BigQuery and the Feature store. In most cases you can use the same names in both; however, the naming conventions differ slightly.

Depending on how many features you are ingesting, you can also adjust the workerCount.

04_ingest_bq_data.sh

04_request.json
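A hedged sketch of the ingestion gists, assuming the v1 importFeatureValues endpoint. The BigQuery table, column names and worker count are invented placeholders; sourceField shows how a BigQuery column can map to a differently named Feature.

```shell
#!/usr/bin/env bash
# Sketch only: batch-ingests feature values from a BigQuery table.
set -euo pipefail

LOCATION="europe-west2"
PROJECT="my-gcp-project"
FEATURESTORE_ID="talent_features"
ENTITY_TYPE_ID="candidate_attributes"

# 04_request.json: entity id column, timestamp column, source table and mappings.
cat > 04_request.json <<EOF
{
  "entityIdField": "candidate_id",
  "featureTimeField": "updated_at",
  "bigquerySource": { "inputUri": "bq://${PROJECT}.features.candidate_attributes" },
  "featureSpecs": [
    { "id": "years_experience" },
    { "id": "open_to_relocation", "sourceField": "relocation_flag" }
  ],
  "workerCount": 1
}
EOF

URL="https://${LOCATION}-aiplatform.googleapis.com/v1/projects/${PROJECT}/locations/${LOCATION}/featurestores/${FEATURESTORE_ID}/entityTypes/${ENTITY_TYPE_ID}:importFeatureValues"
echo "POST ${URL}"

# Requires an authenticated gcloud session; skipped when gcloud is unavailable.
if command -v gcloud >/dev/null 2>&1; then
  curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d @04_request.json "${URL}"
fi
```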

After a job is submitted you will see it under the Ingestion jobs tab.

Further documentation on Batch Ingestion can be found here.

Automating the ingestion jobs

👾 The example below is the process we have used in Edge to automate feature ingestion jobs from BigQuery, however depending where the source data lives this process could change. This process is managed by our Service Account to ensure reproducibility across Data Scientists.

Although we can create an ingestion job manually, in practice we want to automate this process to ensure we always have the latest data in our online feature store. The way we approach this is to automate the BigQuery query that generates our features using the BigQuery Scheduler and then automate the ingestion jobs triggers using GCP’s Cloud Scheduler. More details below.

  1. BigQuery Scheduler

To create a BigQuery scheduled query you need to click on Scheduled queries. This page allows you to create a query and transfer the results into a new BigQuery table.

2. Cloud Scheduler

In order to create a Cloud Scheduler job, you need to configure the job to send a POST request to the address used in the 04_ingest_bq_data.sh script. Then you need to define the contents of the payload in the Body section using the contents of 04_request.json. After this is done, you need to include a Service Account with the right permissions that will allow you to trigger this Cloud Scheduler job. Remember to add the Content-Type header as well.
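The Cloud Scheduler job above can also be created from the command line. The sketch below prints the gcloud command rather than executing it; the job name, schedule, and service account email are assumptions, and the uri and body mirror the 04_ingest_bq_data.sh request.

```shell
#!/usr/bin/env bash
# Sketch only: prints a gcloud command that would schedule the ingestion job daily.
set -euo pipefail

LOCATION="europe-west2"
PROJECT="my-gcp-project"
FEATURESTORE_ID="talent_features"
ENTITY_TYPE_ID="candidate_attributes"
INGEST_URL="https://${LOCATION}-aiplatform.googleapis.com/v1/projects/${PROJECT}/locations/${LOCATION}/featurestores/${FEATURESTORE_ID}/entityTypes/${ENTITY_TYPE_ID}:importFeatureValues"

# Printed for review; run the command directly in a project with gcloud configured.
cat <<EOF | tee scheduler_cmd.txt
gcloud scheduler jobs create http ingest-candidate-features \\
  --project="${PROJECT}" \\
  --schedule="0 6 * * *" \\
  --uri="${INGEST_URL}" \\
  --http-method=POST \\
  --headers="Content-Type=application/json" \\
  --message-body-from-file=04_request.json \\
  --oauth-service-account-email="feature-ingest@${PROJECT}.iam.gserviceaccount.com"
EOF
```

Using --oauth-service-account-email lets Cloud Scheduler authenticate to the Vertex AI API without embedding a token in the job.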

Making a real-time request to the online Feature store

Once the Ingestion jobs finish running, your features will be available to query in the Feature store. In the next two sections we show you how to make read requests using cURL and Python. From our experience, the cURL methodology is useful to understand how the API works; however the Python client allows you to add this module into a machine learning Python repository keeping programming language consistency across the repo.

cURL

For the cURL methodology we use the streamingReadFeatureValues endpoint to query the features in real time. The 05_request.json payload defines an array of uniqueIds that we want to query for; these are the Entity Ids. Then you need to define the feature Ids within the idMatcher using an array of feature names.

Streaming requests will return a feature vector with all the features for each Entity Id that was requested.

05_make_request.sh

05_request.json
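One more hedged sketch of the serving gists, assuming the v1 streamingReadFeatureValues endpoint; the entity ids and feature names continue the placeholders used above.

```shell
#!/usr/bin/env bash
# Sketch only: reads the latest feature vector for each requested Entity Id.
set -euo pipefail

LOCATION="europe-west2"
PROJECT="my-gcp-project"
FEATURESTORE_ID="talent_features"
ENTITY_TYPE_ID="candidate_attributes"

# 05_request.json: which entities to look up and which features to return.
cat > 05_request.json <<'EOF'
{
  "entityIds": ["candidate_123", "candidate_456"],
  "featureSelector": {
    "idMatcher": { "ids": ["years_experience", "open_to_relocation"] }
  }
}
EOF

URL="https://${LOCATION}-aiplatform.googleapis.com/v1/projects/${PROJECT}/locations/${LOCATION}/featurestores/${FEATURESTORE_ID}/entityTypes/${ENTITY_TYPE_ID}:streamingReadFeatureValues"
echo "POST ${URL}"

# Requires an authenticated gcloud session; skipped when gcloud is unavailable.
if command -v gcloud >/dev/null 2>&1; then
  curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d @05_request.json "${URL}"
fi
```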

Additional documentation on Online serving can be found here.

Python

In Python, you can follow a similar approach: load the GCP Service Account first and then start the FeaturestoreOnlineServingServiceClient. With this client, you can run the streaming_read_feature_values method, which accepts a StreamingReadFeatureValuesRequest object that includes the entity type, the entity ids and the features you want to load. This endpoint returns an iterable response_stream that will require transformation into the desired shape.

In this example the config.FEATURES variable is an array of feature names.

👾 In Edge, we have included a feature serving module within our machine learning API in order to read features in real-time. This process is managed by our Service Account to ensure reproducibility across Data Scientists.

In this Jupyter Notebook you can see examples on how to use the Python API client to manage the end to end process of Feature store, EntityType, and Feature creation, run ingest jobs and make requests.

Further documentation about the google-cloud-aiplatform Python APIs can be found here.

Final thoughts

In this article we provided an overview of how to use Feature stores in Google’s VertexAI platform and the importance of including them in your machine learning development and production workflows. The hands-on examples will also help you create your first Feature store, ingest features and serve them within your machine learning applications.

Our experience with VertexAI’s Feature Store was overall very positive. We can rest assured that the internal AI APIs that rely on this service will reliably provide the latest features for the model. Even though the system requires in-and-out knowledge of the gcloud CLI, there are myriad benefits to using this tool if your team works within the Google Cloud Platform environment.

VertexAI’s Feature store is still in beta. This means that changes could be made along the way to the overall GCP API as well as to the Python library. Any updates to the underlying libraries should be tracked to ensure that scripts are updated accordingly.

If you have any questions, comments or thoughts reach out to me on Twitter @sebastianmont__

Interested in joining our Engineering, Product & Design Team?

We’re looking for Data Scientists, Software Engineers (Front & Back End, Mid/Senior/Director), SRE Platform Engineers, Engineering Managers, Tech Leads, and Product Operations/Managers/Designers across all levels, and even more roles, in London, the USA, Berlin and remotely. Join us: apply here!
