VertexAI’s Feature Store for dummies
How to create an E2E automated Feature Store to serve ML services
Within Edge AI (Beamery’s applied data science team) we tested VertexAI’s Feature Store to understand the process of serving online features to machine learning APIs used by engineering teams. Vertex AI was officially launched in May 2021, so online documentation and hands-on experience with this tool are still limited. Therefore, we have decided to create a document that explains the E2E process of setting up a VertexAI Feature Store and creating a pipeline that feeds data from BigQuery into the Feature Store to serve live data to AI services.
This blog will walk you through what a feature store is, how to create one, ingest data into one and make requests using Google VertexAI’s Feature Store API. We expect the reader to understand the machine learning development process in order to ensure that they will find value in using a feature store when it comes to model serving and deployment. The goal of this document is to provide an overview of why data scientists should use feature stores, how we are using them within Beamery, and show the reader how to use Google Cloud Platform’s (GCP) capabilities to serve features within their models.
Definitions
This section will provide an overview of the definitions of the tools and technologies mentioned throughout this article. We will also link other useful articles that were used as sources or references.
What is a feature store?
A feature store is used within an organisation to ensure there is one common repository where data scientists can publish and retrieve features during the machine learning lifecycle. A feature store differs from a typical data warehouse in that it is made up of two databases, each used at different steps of the machine learning lifecycle: an offline feature store and an online feature store.
The offline feature store allows you to serve a high volume of features in batches and it is typically used in the early stages of the machine learning development process, especially during data exploration and model training.
The online feature store, on the other hand, allows you to serve features with low latency (almost real-time). This feature store is usually used within online applications to serve the features needed to make predictions during production.
The table below compares both types of feature stores side by side.

| | Offline (batch) feature store | Online (streaming) feature store |
| --- | --- | --- |
| Serving pattern | High volumes of features served in batches | Low-latency (near real-time) lookups |
| Typical stage | Data exploration and model training | Serving predictions in production applications |
❗ In the GCP environment, online feature stores are referred to as streaming feature stores and offline feature stores as batch feature stores.
Further comparisons between Data Warehouses and Feature Stores can be found in the Logical Clocks article here.
What is VertexAI?
GCP’s VertexAI was launched in May 2021 with the goal to bring all of GCP’s ML services into one location. The goal is to “Build, deploy, and scale ML models faster, with pre-trained and custom tooling within a unified AI platform”. VertexAI is an extensive suite for all-things-ML; the feature store is just one part of this. More information about all of VertexAI’s capabilities can be found in the product overview here.
What is VertexAI’s Feature store?
VertexAI’s feature store provides a centralised repository for organising, storing and serving ML features within the Google Cloud Platform environment. It provides different APIs to create a schema of entities and features, ingest data, and make serving requests through the online or offline feature store. In the Hands-on example section we will demonstrate how to implement an online feature store.
The VertexAI Data Model
Within VertexAI’s Feature store, there are different hierarchical elements that need to be defined before ingesting and serving features. These elements follow this hierarchical structure: Feature store → EntityType → Feature.

- Feature store: The Feature store is the top-level container for EntityTypes and Features. Typically an organisation creates one shared feature store to be used across different teams or projects; however, separate feature stores can also be created to isolate development environments such as experimentation, testing and production.
- EntityType: An EntityType is a collection of related features sharing a similar level of granularity. These are defined by the data scientist; for example, they can include all features related to users or companies. The EntityType creates Entity instances; each Entity requires a unique identifier that will be used as a lookup to return a vector with all of the requested features.
- Feature: A Feature is a property or attribute of an Entity. When you define a Feature you need to define its value type. Value types can be found in the API documentation here. The Feature is the container for all Feature Values for each Entity.
Further documentation about Feature Store concepts and data models can be found here.
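To make this hierarchy concrete, every element is addressed by a nested resource path. The sketch below is our own illustration; the project, location, and resource names are hypothetical placeholders:

```python
# Sketch of how the Feature store -> EntityType -> Feature hierarchy maps onto
# fully qualified resource names. All names below are hypothetical placeholders.

def feature_path(project: str, location: str, featurestore: str,
                 entity_type: str, feature: str) -> str:
    """Build the fully qualified resource name of a single Feature."""
    return (
        f"projects/{project}/locations/{location}"
        f"/featurestores/{featurestore}"
        f"/entityTypes/{entity_type}"
        f"/features/{feature}"
    )

print(feature_path("my-project", "europe-west4",
                   "edge_feature_store", "users", "account_age_days"))
```

Each level of the path corresponds to one level of the hierarchy, which is why a Feature cannot exist outside an EntityType, nor an EntityType outside a Feature store.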
Using Feature Stores at Beamery Edge
As of this writing, feature stores within Edge (Beamery’s Applied Data Science team) are still in the early stages of development. So far, online feature stores have allowed us to serve features from within the deployed model container instead of receiving them in a request payload.

Sending features within the request payload was the original approach, but it increased our dependency on Engineering teams external to Edge to calculate those features and ensure they were available in our production databases.

This setup has let us develop and iterate more quickly and makes model maintenance easier: we can adjust the underlying data definitions and feature calculations used to train the model, then replicate that logic in the online feature store used to make predictions.
👾 Throughout the next sections you will see callouts with the 👾 emoji. These are insights on how we have implemented each step within Edge.
Hands-on example
In this demo we will show how to create a Feature store, EntityType and Features, and then how to call these through the API. The process is the following:

1. Create a Feature store using cURL
2. Create an EntityType using cURL
3. Define the Features using cURL
4. Ingest feature values for each entity (manually with cURL or automated within GCP)
5. Serve the features (with cURL or Python)

We will show you how to run through all of these steps using cURL, along with the additional information needed to set user and Service Account permissions, automate the process within GCP, and make requests through the Python SDK.
👾 In Edge, we use cURL requests in our development environment. The auth token can be retrieved using gcloud auth print-access-token (as seen in the scripts below), which returns the access token for the logged-in gcloud Data Scientist with VertexAI access. For production workflows using Python requests and Cloud Scheduler jobs, we load the Service Account to ensure reliability.
Permissions required
For this demo we granted roles/aiplatform.featurestoreAdmin to our personal gcloud account and to the VertexAI Service Account; however, this access can be limited using the following roles:

- roles/aiplatform.featurestoreAdmin: Grants full access to all resources in Vertex AI Feature Store.
- roles/aiplatform.featurestoreDataViewer: When applied to an EntityType, grants permission to view data of any Features in that EntityType.
- roles/aiplatform.featurestoreDataWriter: When applied to an EntityType, grants permission to read and write data for any Features in that EntityType.
- roles/aiplatform.featurestoreInstanceCreator: Administers the Feature Store resource, but not the child resources under it.
- roles/aiplatform.featurestoreResourceEditor: Manages all resources within a Feature Store, but cannot create or update the Feature Store itself.
- roles/aiplatform.featurestoreResourceViewer: Views all resources in Vertex AI Feature Store but cannot make changes.
Further Access Control documentation can be found here.
Accessing VertexAI’s Feature store
The VertexAI Feature store can be accessed under the Features tab inside the VertexAI product suite. The link can be found here. Once in, you can create an EntityType or see ingestion jobs; however, all Feature store, ingestion, and Feature creation jobs need to be done either through cURL or Python.
Creating a Feature store
The first step when we get started with VertexAI’s Feature store is to create the actual Feature store. For this, we need to define the LOCATION, PROJECT, and FEATURESTORE_ID in our bash script. These variables should remain constant for all other requests. Note that the LOCATION variable is the Feature store location. As of this writing, there are four locations (regions) in which we can create feature stores.
👾 In Edge, we have created a feature store for each machine learning project, allowing us to separate the different Entities and Feature Values that will be used as feature lookups to make predictions. The benefit of this system is that we can get features in real time while keeping the MLOps workflow within the Applied Data Science team.
The two code blocks below are the API endpoint to create a feature store and the request payload that defines the configuration of our Feature store.
01_create_featurestore.sh
LOCATION="featurestore_location"
PROJECT="project-name"
FEATURESTORE_ID="feature_store_name"

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  -d @01_request.json \
  "https://${LOCATION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT}/locations/${LOCATION}/featurestores?featurestoreId=${FEATURESTORE_ID}"
01_request.json
{
"online_serving_config": {
"fixed_node_count": 1
},
"labels": {
"environment": "feature_store_name"
}
}
Creating an EntityType
The second step is to create an EntityType. The EntityType is the holder for all Entities. We create the EntityType by defining an ENTITY_TYPE_ID and making a POST request to the entityTypes URL with the 02_request.json file as our payload. The payload only requires a description for our EntityType.
👾 In Edge, we have different EntityTypes that relate to the HR Tech space by project. Some examples include candidate_attributes, candidate_interactions, vacancies, and pools.
02_create_entity_type.sh
LOCATION="featurestore_location"
PROJECT="project-name"
FEATURESTORE_ID="feature_store_name"
ENTITY_TYPE_ID="entity_name"

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  -d @02_request.json \
  "https://${LOCATION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT}/locations/${LOCATION}/featurestores/${FEATURESTORE_ID}/entityTypes?entityTypeId=${ENTITY_TYPE_ID}"
02_request.json
{
"description": "EntityType description",
"monitoringConfig": {
"snapshotAnalysis": {
"monitoringInterval": "3600s"
}
}
}
Creating the Features
Third, we will add features. Features are defined in the 03_request.json payload file. For each feature we need to input the description, valueType and featureId. Value types can be found in the API documentation here. Note that in this step we are creating the feature schema, with names and datatypes as placeholders; there are still no feature values within the Feature store.
👾 In Edge, we try to limit our features to BOOL, INT64, or DOUBLE, as quantitative features can be used directly in our model without additional data extraction (e.g. as we would need in NLP-focused projects).
03_add_features_to_entity_type.sh
LOCATION="featurestore_location"
PROJECT="project-name"
FEATURESTORE_ID="feature_store_name"
ENTITY_TYPE_ID="entity_name"

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  -d @03_request.json \
  "https://${LOCATION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT}/locations/${LOCATION}/featurestores/${FEATURESTORE_ID}/entityTypes/${ENTITY_TYPE_ID}/features:batchCreate"
03_request.json
{
"requests": [
{
"feature": {
"description": "Feature 01 Description",
"valueType": "BOOL",
"monitoringConfig": {
"snapshotAnalysis": {
"monitoringInterval": "3600s"
}
}
},
"featureId": "feature_01_name"
},
{
"feature": {
"description": "Feature 02 Description",
"valueType": "INT64",
"monitoringConfig": {
"snapshotAnalysis": {
"monitoringInterval": "3600s"
}
}
},
"featureId": "feature_02_name"
},
{
"feature": {
"description": "Feature 03 Description",
"valueType": "INT64",
"monitoringConfig": {
"snapshotAnalysis": {
"monitoringInterval": "3600s"
}
}
},
"featureId": "feature_03_name"
}
]
}
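Hand-writing this JSON gets repetitive as the feature list grows. As a sketch, the same payload can be generated from a list of (id, value type, description) tuples; the helper below is our own illustration, not part of the VertexAI SDK:

```python
import json

def build_batch_create_payload(features, monitoring_interval="3600s"):
    """Generate the features:batchCreate request body from (id, type, description) tuples."""
    return {
        "requests": [
            {
                "feature": {
                    "description": description,
                    "valueType": value_type,
                    "monitoringConfig": {
                        "snapshotAnalysis": {"monitoringInterval": monitoring_interval}
                    },
                },
                "featureId": feature_id,
            }
            for feature_id, value_type, description in features
        ]
    }

payload = build_batch_create_payload([
    ("feature_01_name", "BOOL", "Feature 01 Description"),
    ("feature_02_name", "INT64", "Feature 02 Description"),
    ("feature_03_name", "INT64", "Feature 03 Description"),
])
print(json.dumps(payload, indent=2))  # same structure as 03_request.json above
```

Keeping the feature list in code like this also makes it easier to keep the creation payload and the ingestion mapping in sync.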
Ingesting features from BigQuery into the Feature store
Manually creating an ingestion job
To create an ingestion job from BigQuery to the Feature store we need to define several variables in the 04_request.json payload file:

1. First we define the entityIdField that will be used as the unique ID for each of our entities. Note that the unique ID should not be added as a feature.
2. We can also define a featureTimeField that tells us how recent each feature value is.
3. After this we define where our data lives. In this example we are using the bigquerySource and defining a BigQuery table, but data can also be ingested from Google Cloud Storage.
4. Finally, we map the feature columns in BigQuery to the Features in the Feature store. Note that the datatypes must match between BigQuery and the Feature store. In most cases you can use the same names in both; however, the naming conventions differ slightly.

Depending on how many features you are ingesting, you can also adjust the workerCount.
04_ingest_bq_data.sh
LOCATION="featurestore_location"
PROJECT="project-name"
FEATURESTORE_ID="feature_store_name"
ENTITY_TYPE_ID="entity_name"

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  -d @04_request.json \
  "https://${LOCATION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT}/locations/${LOCATION}/featurestores/${FEATURESTORE_ID}/entityTypes/${ENTITY_TYPE_ID}:importFeatureValues"
04_request.json
{
"entityIdField": "uniqueId_field",
"featureTimeField": "timestamp_field",
"bigquerySource": {
"inputUri": "bq://beamery-data.big_query_dataset.big_query_table"
},
"featureSpecs": [
{
"id": "feature_01_name",
"sourceField": "bq_column_01_name"
},
{
"id": "feature_02_name",
"sourceField": "bq_column_02_name"
},
{
"id": "feature_03_name",
"sourceField": "bq_column_03_name"
}],
"workerCount": 3
}
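A note on the datatype requirement above: mismatches are easy to catch before submitting the job by validating the BigQuery column types against the declared Feature value types. The mapping and helper below are our own sketch for the scalar types mentioned in this post, not part of any GCP API:

```python
# BigQuery standard SQL type -> compatible Feature store value type.
# Our own illustrative mapping for the scalar types used in this post.
BQ_TO_FEATURE_TYPE = {
    "BOOL": "BOOL",
    "INT64": "INT64",
    "FLOAT64": "DOUBLE",
    "STRING": "STRING",
}

def check_feature_specs(feature_types, bq_schema, feature_specs):
    """Return the feature ids whose declared valueType does not match the BQ column type.

    feature_types: {feature_id: valueType}   (as created in 03_request.json)
    bq_schema:     {column_name: BigQuery type}
    feature_specs: [{"id": ..., "sourceField": ...}]  (as in 04_request.json)
    """
    mismatched = []
    for spec in feature_specs:
        expected = feature_types[spec["id"]]
        actual = BQ_TO_FEATURE_TYPE.get(bq_schema[spec["sourceField"]])
        if expected != actual:
            mismatched.append(spec["id"])
    return mismatched

mismatches = check_feature_specs(
    {"feature_01_name": "BOOL", "feature_02_name": "INT64"},
    {"bq_column_01_name": "BOOL", "bq_column_02_name": "FLOAT64"},
    [{"id": "feature_01_name", "sourceField": "bq_column_01_name"},
     {"id": "feature_02_name", "sourceField": "bq_column_02_name"}],
)
print(mismatches)  # ['feature_02_name']: an INT64 feature fed from a FLOAT64 column
```

Running a check like this before each scheduled ingestion is cheap insurance against silent job failures.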
After a job is submitted you will see it under the Ingestion jobs tab.
Further documentation on Batch Ingestion can be found here.
Automating the ingestion jobs
👾 The example below is the process we have used in Edge to automate feature ingestion jobs from BigQuery; however, depending on where the source data lives, this process could change. The process is managed by our Service Account to ensure reproducibility across Data Scientists.
Although we can create an ingestion job manually, in practice we want to automate this process to ensure we always have the latest data in our online feature store. The way we approach this is to automate the BigQuery query that generates our features using the BigQuery Scheduler and then automate the ingestion jobs triggers using GCP’s Cloud Scheduler. More details below.
1. BigQuery Scheduler

To create a BigQuery scheduled query you need to click on Scheduled queries. This page allows you to create a query and transfer the results into a new BigQuery table.
2. Cloud Scheduler
To create a Cloud Scheduler job, you configure the job to send a POST request to the address used in the 04_ingest_bq_data.sh script. Then you define the payload in the Body section using the contents of 04_request.json. After this is done, you need to attach a Service Account with the right permissions to trigger this Cloud Scheduler job. Remember to add the Content-Type header as well.
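To make the Cloud Scheduler configuration concrete, the three pieces of the job (target URL, headers, body) are the same ones the manual cURL request used. A sketch assembling them, with the placeholder names from the scripts above:

```python
import json

LOCATION = "featurestore_location"
PROJECT = "project-name"
FEATURESTORE_ID = "feature_store_name"
ENTITY_TYPE_ID = "entity_name"

# Target URL: the same :importFeatureValues endpoint as 04_ingest_bq_data.sh.
url = (
    f"https://{LOCATION}-aiplatform.googleapis.com/v1beta1"
    f"/projects/{PROJECT}/locations/{LOCATION}"
    f"/featurestores/{FEATURESTORE_ID}/entityTypes/{ENTITY_TYPE_ID}"
    ":importFeatureValues"
)

# Headers: the Content-Type mentioned above. Auth comes from the Service
# Account attached to the Scheduler job, not from a header you set yourself.
headers = {"Content-Type": "application/json; charset=utf-8"}

# Body: the contents of 04_request.json, serialised as the job's payload.
body = json.dumps({
    "entityIdField": "uniqueId_field",
    "featureTimeField": "timestamp_field",
    "bigquerySource": {"inputUri": "bq://beamery-data.big_query_dataset.big_query_table"},
    "featureSpecs": [{"id": "feature_01_name", "sourceField": "bq_column_01_name"}],
    "workerCount": 3,
})
```

These three values map one-to-one onto the URL, Headers and Body fields of the Cloud Scheduler job form.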
Making a real-time request to the online Feature store
Once the ingestion jobs finish running, your features will be available to query in the Feature store. In the next two sections we show how to make read requests using cURL and Python. From our experience, the cURL approach is useful for understanding how the API works; however, the Python client lets you add this module to a machine learning Python repository, keeping programming-language consistency across the repo.
cURL
For the cURL approach we use the streamingReadFeatureValues endpoint to query the features in real time. The 05_request.json payload defines an array of entity IDs that we want to query for. Then you define the feature IDs within the idMatcher using an array of feature names.
Streaming requests will return a feature vector with all the features for each Entity Id that was requested.
05_make_request.sh
LOCATION="featurestore_location"
PROJECT="project-name"
FEATURESTORE_ID="feature_store_name"
ENTITY_TYPE_ID="entity_name"

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  -d @05_request.json \
  "https://${LOCATION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT}/locations/${LOCATION}/featurestores/${FEATURESTORE_ID}/entityTypes/${ENTITY_TYPE_ID}:streamingReadFeatureValues"
05_request.json
{
"entityIds": [
"id_01",
"id_02",
"id_03",
"id_04",
"id_05"
],
"featureSelector": {
"idMatcher": {
"ids": [
"feature_01_name",
"feature_02_name", "feature_03_name"
]
}
}
}
Additional documentation on Online serving can be found here.
Python
In Python, you can follow a similar approach: load the GCP Service Account first, then start the FeaturestoreOnlineServingServiceClient. With this client, you can run the streaming_read_feature_values method, which accepts a StreamingReadFeatureValuesRequest object that includes the entity type, the entity IDs and the features you want to load. This endpoint returns an iterable response_stream that will require transformation into the desired shape.
In this example the config.FEATURES variable is an array of feature names.
👾 In Edge, we have included a feature serving module within our machine learning API in order to read features in real-time. This process is managed by our Service Account to ensure reproducibility across Data Scientists.
from typing import Iterable
import os

from google.cloud.aiplatform_v1beta1 import (
    FeaturestoreOnlineServingServiceClient,
    FeaturestoreServiceClient,
)
from google.cloud.aiplatform_v1beta1.types import featurestore_online_service
from google.cloud.aiplatform_v1beta1.types import FeatureSelector, IdMatcher
import pandas as pd
import numpy as np

from src.config import config

GBQ_CRED_PATH = "creds/credentials_vertex_featurestore.json"
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = GBQ_CRED_PATH


def _make_featurestore_request(
    id_array: list,
) -> Iterable[featurestore_online_service.ReadFeatureValuesResponse]:
    API_ENDPOINT = "europe-west4-aiplatform.googleapis.com"
    client = FeaturestoreOnlineServingServiceClient(
        client_options={"api_endpoint": API_ENDPOINT}
    )
    admin_client = FeaturestoreServiceClient(
        client_options={"api_endpoint": API_ENDPOINT}
    )

    feature_selector = FeatureSelector(id_matcher=IdMatcher(ids=config.FEATURES))

    region = "featurestore_location"
    project = "project-name"
    featurestore_id = "feature_store_name"
    entity_type_id = "entity_name"

    response_stream = client.streaming_read_feature_values(
        featurestore_online_service.StreamingReadFeatureValuesRequest(
            entity_type=admin_client.entity_type_path(
                project, region, featurestore_id, entity_type_id
            ),
            entity_ids=id_array,
            feature_selector=feature_selector,
        )
    )
    return response_stream
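The first message in the stream carries a header with the feature descriptors; each subsequent message carries one entity’s values. The shaping step is sketched below with plain dicts standing in for the protobuf messages, so the logic is runnable without a live Feature store; the real messages expose the same fields as attributes (response.header, response.entity_view) rather than dictionary keys:

```python
# Sketch: shaping a streaming read response into {entity_id: {feature: value}}.
# Plain dicts stand in for the protobuf messages here so the parsing logic is
# visible without a live Feature store.

def stream_to_dict(response_stream):
    feature_names = None
    rows = {}
    for message in response_stream:
        if message.get("header"):  # first message: feature descriptors only
            feature_names = [d["id"] for d in message["header"]["feature_descriptors"]]
            continue
        view = message["entity_view"]  # following messages: one entity each
        values = [d.get("value") for d in view["data"]]
        rows[view["entity_id"]] = dict(zip(feature_names, values))
    return rows

mock_stream = [
    {"header": {"feature_descriptors": [{"id": "feature_01_name"}, {"id": "feature_02_name"}]}},
    {"entity_view": {"entity_id": "id_01", "data": [{"value": True}, {"value": 7}]}},
    {"entity_view": {"entity_id": "id_02", "data": [{"value": False}, {"value": 3}]}},
]
print(stream_to_dict(mock_stream))
```

From a dict of this shape it is one more line to a pandas DataFrame if your model expects a feature matrix.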
In this Jupyter Notebook you can see examples of how to use the Python API client to manage the end-to-end process of Feature store, EntityType, and Feature creation, run ingestion jobs, and make requests.
Further documentation about the google-cloud-aiplatform Python APIs can be found here.
Final thoughts
In this article we provided an overview of how to use Feature stores in Google’s VertexAI platform and why you should include them in your machine learning development and production workflows. The hands-on examples will also help you create your first Feature store, ingest features, and serve them within your machine learning applications.
Our experience with VertexAI’s Feature Store was overall very positive. We can rest assured that the internal AI APIs that rely on this service will provide the latest features for the model in a reliable manner. Even though the system requires in-depth knowledge of how to use the gcloud CLI properly, there are a myriad of benefits to using this tool if your team works within the Google Cloud Platform environment.
VertexAI’s Feature store is still in beta. This means that changes could be made along the way to the overall GCP API as well as to the Python library. Any updates to the underlying libraries should be tracked to ensure that scripts are updated accordingly.
If you have any questions, comments or thoughts reach out to me on Twitter @sebastianmont__
Interested in joining our Engineering, Product & Design Team?
We’re looking for Data Scientists, Software Engineers (Front End & Back End, Mid/Senior/Director), SRE Platform Engineers, Engineering Managers, Tech Leads, and Product Operations/Managers/Designers across all levels, plus even more roles across London, Remote, USA & Berlin, to join us. Apply here!