Transform once, use everywhere; from computer science to data science with SageMaker Feature Store.

Kalin Vera
Tech @shift.com
4 min read · Aug 29, 2021

A machine learning model is only as good as its input data. The majority of our internal products share roughly 60% of the same input datasets. This means we can speed up development if we process the input data once and then give other data science models access to it.

Other data science and engineering teams can leverage a shared, input-centric feature store to expedite the development of their products. Analytics, BizOps, and Engineering can all benefit enormously from relying on the same foundation of data cleanup, preprocessing, and normalization. The basic idea of implementing once and reusing everywhere benefits many teams, and therefore the whole technology organization.

A new data science model should not start out of the gate with a focus on data cleanup, preprocessing, and normalization; the focus should instead be on delivering business value as a functional product, gathering inference insights, and exposing them to the business. The input data (data attributes, a.k.a. model features) needs to be preprocessed, and some normalization usually takes place for any ML model as part of the normal pipeline. Data preprocessing and validation take time, and a common feature store expedites delivery and reuse. Common tasks like handling String-typed data and feature engineering need to be built once and used everywhere (a standard software engineering practice that applies just as well to data science). In this article we will focus on SageMaker functionality and how to leverage it. Please refer to the AWS SageMaker documentation for more details.

Fast Model Development. Three things to focus on:

  1. Preprocess input data, fill in missing values, normalize, and feature-engineer once, then leverage the results for other use cases
  2. Do not build custom solutions if the industry solution exists
  3. Track feature drift to manage a model with the help of a central feature store

We need online and offline feature stores, easy access from models deployed and hosted on AWS SageMaker, data versioning, and the ability to time travel for model management.

Typically, we see teams rely on Airflow for data preprocessing, with Redis as an online feature store and potentially Redshift or S3 as an offline feature store.

If you look at the AWS SageMaker Feature Store, you'll notice it instead leverages data versioning and exposes data attributes via SQL-like functionality with Athena.

You can also use AWS Lambda orchestration to trigger a scheduled model retrain, as illustrated in the diagram below.

Architecture Diagram
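
As a rough illustration of the Lambda orchestration piece, here is a minimal handler sketch; it assumes the retrain job is wrapped in a SageMaker Pipeline, and the pipeline name is a placeholder:

import boto3

sagemaker_client = boto3.client("sagemaker")

def lambda_handler(event, context):
    # Invoked on a schedule (for example, an EventBridge rule) to kick off retraining.
    response = sagemaker_client.start_pipeline_execution(
        PipelineName="your-retrain-pipeline"  # placeholder pipeline name
    )
    return {"PipelineExecutionArn": response["PipelineExecutionArn"]}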

Now we will walk you through our feature store process. As we build a sample model, these are the steps we took:

  1. Connect to Redshift (because of our specific use case, you can certainly retrieve your input from Postgres, MySQL, S3 or other analytics data storage)
  2. Run a select statement to extract the needed dataset into pandas data frame
  3. Fill in missing values
  4. Normalize certain columns
  5. Leverage one-hot encoding for categorical variables (steps 1–5 are sketched in the pandas snippet after the connection code below)
  6. Use boto3 library to connect to SageMaker Feature Store
  7. Get a handle to a SageMaker client
  8. Get a handle to the SageMaker Runtime client from the SageMaker session
import boto3
import sagemaker
from sagemaker.session import Session

# Handles to the SageMaker and Feature Store runtime clients (steps 6-8 above).
region = boto3.Session().region_name
boto_session = boto3.Session(region_name=region)
sagemaker_client = boto_session.client(service_name="sagemaker", region_name=region)
featurestore_runtime = boto_session.client(service_name="sagemaker-featurestore-runtime", region_name=region)

feature_store_session = Session(
    boto_session=boto_session,
    sagemaker_client=sagemaker_client,
    sagemaker_featurestore_runtime_client=featurestore_runtime,
)
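
Steps 1–5 above are standard pandas work. Here is a minimal sketch, assuming a SQLAlchemy-style Redshift connection and illustrative column names (not our real schema):

import pandas as pd
from sqlalchemy import create_engine

# Placeholder Redshift connection string; any SQL source pandas can read works here.
engine = create_engine("postgresql+psycopg2://user:password@redshift-host:5439/dbname")
df = pd.read_sql("SELECT record_id, f2, f3, f4, f8 FROM analytics.sample_table", engine)

# Fill in missing values.
df["f4"] = df["f4"].fillna(df["f4"].median())

# Normalize a numeric column to zero mean / unit variance.
df["f3"] = (df["f3"] - df["f3"].mean()) / df["f3"].std()

# One-hot encode a categorical (String) column.
df = pd.get_dummies(df, columns=["f8"])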

Dive deep into the Feature Store

  1. Specify the S3 bucket name for the feature store
  2. Attach policy to the SageMaker ARN role (this is done outside your notebook)
  3. Create SageMaker feature group (think of this as a database table)
  4. Define a unique identifier for the feature record (like customer id, order id, etc)
  5. Load feature definitions into a group (table schema definition)
  6. Ingest your dataset into the feature group defined above
your_model_feature_group.ingest(data_frame=df[limit_cols], max_workers=1, wait=True)
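
Putting steps 3–5 together before the ingest call above, a minimal sketch (the group name, S3 bucket, and IAM role are placeholders, and df must already contain an EventTime column):

from sagemaker.feature_store.feature_group import FeatureGroup

# Placeholder names; substitute your own group name, bucket, and role ARN.
your_model_feature_group = FeatureGroup(name="your-model-features", sagemaker_session=feature_store_session)

# Infer the feature definitions (table schema) from the pandas dtypes.
your_model_feature_group.load_feature_definitions(data_frame=df[limit_cols])

your_model_feature_group.create(
    s3_uri="s3://your-feature-store-bucket/your-model-features",
    record_identifier_name="record_id",
    event_time_feature_name="EventTime",
    role_arn="arn:aws:iam::123456789012:role/YourSageMakerRole",
    enable_online_store=True,
)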

Data Retrieval from the Feature Store and Data Exposure via Athena

  1. Leverage Feature Store runtime client
  2. Call as_hive_ddl() to create an Athena table
  3. Leverage athena_query() method to run select like operations on the feature store data
  4. Retrieve dataset back with as_dataframe()

Some code snippets below:

record_identifier_value = str(200)
featurestore_runtime.get_record(
    FeatureGroupName=your_model_feature_group_name,
    RecordIdentifierValueAsString=record_identifier_value,
)

with the response

{'RecordIdentifierFeatureName': 'record_id',
 'EventTimeFeatureName': 'EventTime',
 'FeatureDefinitions': [
   {'FeatureName': 'record_id', 'FeatureType': 'Integral'},
   {'FeatureName': 'f2', 'FeatureType': 'Integral'},
   {'FeatureName': 'f3', 'FeatureType': 'Integral'},
   {'FeatureName': 'f4', 'FeatureType': 'Fractional'},
   {'FeatureName': 'f5', 'FeatureType': 'Fractional'},
   {'FeatureName': 'f6', 'FeatureType': 'Fractional'},
   {'FeatureName': 'f7', 'FeatureType': 'Fractional'},
   {'FeatureName': 'f8', 'FeatureType': 'String'},
   {'FeatureName': 'EventTime', 'FeatureType': 'Fractional'}]}
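
For offline access, the athena_query() helper on the feature group runs SQL against the offline store; a minimal sketch, assuming a placeholder S3 output location:

# Query the offline store (S3/Glue) via Athena and pull the result into pandas.
athena_query = your_model_feature_group.athena_query()
table_name = athena_query.table_name

athena_query.run(
    query_string=f'SELECT * FROM "{table_name}" LIMIT 100',
    output_location="s3://your-feature-store-bucket/athena-query-results/",
)
athena_query.wait()
features_df = athena_query.as_dataframe()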

Feature Retrieval in Athena

Our Features are on S3

Conclusion

The above approach provides fast data exchange between models, exposes model features to other teams, allows model health monitoring, and makes it easy to capture any potential model drift. Above all, this build-once-and-reuse framework reduces development time and lets data scientists focus on product development and delivery.
