Turn Data into Feature Groups with IBM Cloud Pak for Data

Create, share and reuse curated feature data across your data science projects

Alexander Lang
5 min read · Mar 3, 2023

Managing machine learning features in Cloud Pak for Data today

A feature is a piece of data that serves as the input or target for a machine learning model. Data scientists spend a lot of effort constructing features such as In-Store Customer Revenue Last Month from ‘raw’ data sources. These features can be reused in multiple machine learning models — if they’re discoverable across the enterprise.

IBM Cloud Pak for Data provides a wide variety of tools to construct features, including Jupyter notebooks, SPSS Modeler flows, and Watson Pipelines. With its many data connectors, it’s easy to store these features in the backend of your choice; you’re not tied to the backend of a particular “feature store”. And once you add these features to a catalog in Watson Knowledge Catalog, all data scientists can find them and use them in projects. So, we’re done, right? Not quite…

Feature Groups: Curated Data Assets for Machine Learning

We just enhanced feature management in IBM Cloud Pak for Data as a Service with Feature Groups. A Feature Group is a data asset that contains features for machine learning. Feature Groups extend the existing feature management capabilities in three ways:

  1. Capture feature metadata that helps build models
  2. Easily find features across the platform
  3. Use feature metadata in Jupyter notebooks

Creating feature groups in projects and catalogs

When you open a data asset in one of your projects or catalogs, you will notice a new Feature group tab, marked with a small “beta” icon. Initially, this tab is empty. Click the New feature group button, select the data asset columns, enter some feature metadata, and your screen will look like this:

Feature metadata overview for a feature group

There are various metadata attributes you can set for each feature. I want to highlight three of them. See our documentation for the full list.

Role: Most features serve as Input for a machine learning model. If your feature group is curated training data, specify one or more features as Target, to make the “goal” of this feature group clear. Use the Identifier role for features that are not to be used as input for modeling, but that identify a particular row of data.
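To make the three roles concrete, here is a minimal sketch in plain Python of how role metadata could drive the split between model inputs and targets. It does not use the AssetFrame API; the column names, the ROLES dictionary, and both helper functions are hypothetical and exist only for illustration.

```python
# Hypothetical role metadata for a small feature group.
# Plain Python for illustration only; this is not the AssetFrame API.
ROLES = {
    "CustomerID": "Identifier",
    "Age": "Input",
    "LoanAmount": "Input",
    "Risk": "Target",
}

rows = [
    {"CustomerID": 101, "Age": 34, "LoanAmount": 2500, "Risk": "No Risk"},
    {"CustomerID": 102, "Age": 51, "LoanAmount": 9000, "Risk": "Risk"},
]


def columns_with_role(roles, role):
    """Return the column names that carry the given role, in declaration order."""
    return [name for name, assigned in roles.items() if assigned == role]


def split_by_role(rows, roles):
    """Build model inputs X and targets y; identifier columns end up in neither."""
    inputs = columns_with_role(roles, "Input")
    targets = columns_with_role(roles, "Target")
    X = [[row[c] for c in inputs] for row in rows]
    y = [[row[c] for c in targets] for row in rows]
    return X, y


X, y = split_by_role(rows, ROLES)
print(X)  # [[34, 2500], [51, 9000]]
print(y)  # [['No Risk'], ['Risk']]
```

Because CustomerID carries the Identifier role, it stays out of the training inputs while remaining available to trace a prediction back to its row.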

Recipe: If your feature represents a complex formula (such as customer lifetime value) or is the result of some advanced data wrangling, store the actual formula or the code snippet in the recipe. That way, others can reproduce your feature, and it’s transparent how a particular feature was derived.
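As an illustration of what a recipe might hold, the sketch below derives a feature like In-Store Customer Revenue Last Month from raw transactions. The transaction fields (customer_id, date, channel, amount) and the function name are assumptions made for this example, not part of any Cloud Pak for Data schema or API.

```python
from collections import defaultdict
from datetime import date


def in_store_revenue_last_month(transactions, today):
    """Sum in-store revenue per customer for the calendar month before `today`."""
    # Determine the previous calendar month, handling the January -> December rollover.
    if today.month > 1:
        year, month = today.year, today.month - 1
    else:
        year, month = today.year - 1, 12

    totals = defaultdict(float)
    for t in transactions:
        d = t["date"]
        if t["channel"] == "in-store" and (d.year, d.month) == (year, month):
            totals[t["customer_id"]] += t["amount"]
    return dict(totals)


# Hypothetical raw data: only the two in-store February purchases should count.
transactions = [
    {"customer_id": 1, "date": date(2023, 2, 3), "channel": "in-store", "amount": 40.0},
    {"customer_id": 1, "date": date(2023, 2, 17), "channel": "online", "amount": 25.0},
    {"customer_id": 1, "date": date(2023, 2, 20), "channel": "in-store", "amount": 10.0},
    {"customer_id": 2, "date": date(2023, 1, 30), "channel": "in-store", "amount": 99.0},
]

revenue = in_store_revenue_last_month(transactions, today=date(2023, 3, 3))
print(revenue)  # {1: 50.0}
```

Storing a snippet like this alongside the feature answers questions a column name alone cannot, such as whether online purchases are included or how "last month" is defined.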

Fairness information: For input features that may be susceptible to bias (such as Age or Gender), provide reference and monitored groups. For target features, provide favorable and unfavorable outcomes. IBM AutoAI and open source frameworks such as AIF360 use this information to detect and mitigate bias in the feature group and in the models built from it.
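To show why these four pieces of information matter, here is a rough sketch of one common bias check, the disparate impact ratio: the rate of favorable outcomes in the monitored group divided by the rate in the reference group. This is plain Python written for illustration; it is not the AutoAI or AIF360 API, and the sample rows and function names are hypothetical.

```python
def favorable_rate(rows, feature, group, target, favorable):
    """Share of rows in `group` (values of `feature`) whose target is a favorable outcome."""
    members = [r for r in rows if r[feature] in group]
    if not members:
        raise ValueError(f"no rows found for group {sorted(group)}")
    return sum(r[target] in favorable for r in members) / len(members)


def disparate_impact(rows, feature, monitored, reference, target, favorable):
    """Favorable-outcome rate of the monitored group divided by that of the reference group."""
    reference_rate = favorable_rate(rows, feature, reference, target, favorable)
    if reference_rate == 0:
        raise ValueError("reference group has no favorable outcomes")
    return favorable_rate(rows, feature, monitored, target, favorable) / reference_rate


# Hypothetical data: 2 of 4 monitored rows and 4 of 4 reference rows are favorable.
rows = (
    [{"Gender": "Female", "Risk": "No Risk"}] * 2
    + [{"Gender": "Female", "Risk": "Risk"}] * 2
    + [{"Gender": "Male", "Risk": "No Risk"}] * 4
)

ratio = disparate_impact(
    rows,
    feature="Gender",
    monitored={"Female"},
    reference={"Male"},
    target="Risk",
    favorable={"No Risk"},
)
print(ratio)  # 0.5
```

A ratio close to 1 means both groups receive the favorable outcome at similar rates; a value well below 1 suggests the monitored group receives it less often. Without the reference and monitored groups and the favorable outcomes recorded in the feature group, a check like this has nothing to compare.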

Metadata details for Feature Age, including Fairness Information

Finding feature groups

When searching for data assets in Cloud Pak for Data, you will notice a new filter option: Contains Feature Group. Use this filter to narrow your results from dozens of data assets that contain your search term down to feature groups, as shown below:

Restricting asset search results to feature groups

Starting with AutoAI, you will be able to select feature groups as input in Watson Studio model building tools:

Zeroing in on feature groups when selecting input data

The AssetFrame: create and use feature groups in Notebooks

Metadata is most useful in the tool you’re currently working with. That’s why we’ve created a new library in our Python notebooks called AssetFrame. It provides easy access to data assets and their feature group metadata. Here’s a quick look at what it can do:

Initialize an AssetFrame with the name of a data asset:

from assetframe_lib import AssetFrame
af = AssetFrame.from_data_asset("Retail Customers")

Show a sample of the data asset:

Feature group — metadata in context

Have you noticed? The syntax is similar to pandas dataframes, but the result has two key improvements:

  • You don’t need to load the full data asset into your notebook first — the AssetFrame retrieves the data “on the fly”, using the data connectivity capabilities of Cloud Pak for Data.
  • The data is enriched with feature metadata: each column shows its role, and monitored and reference groups are highlighted, as are favorable and unfavorable outcomes.

Create a feature group from a pandas data frame:

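# Wrap the feature-engineered pandas dataframe in an AssetFrame
af = AssetFrame.from_pandas(dataframe=credit_risk_df, name="Retail Customers")
# Turn every column into a feature
af.create_default_features()

# Mark the Risk column as the target
risk_feat = af.get_feature("Risk")
risk_feat.set_roles(["Target"])

# Declare the favorable and unfavorable outcomes of the target
risk_feat.set_favorable_labels("No Risk")
risk_feat.set_unfavorable_labels("Risk")

# Save the data and its feature group metadata as a data asset in the project
af.to_data_asset()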

The code snippet above takes your feature-engineered pandas dataframe credit_risk_df, turns all columns into features, and provides additional metadata for the target column Risk. It then creates a new data asset in your project containing the feature group metadata and your pandas dataframe.

Next steps

Should you turn all your data assets into feature groups? No. Here are good candidates for feature groups:

  • Data for a specific use case (for example, manually annotated training data for supervised learning). You plan to reuse the data to build multiple models for the same use case.
  • Data about a specific entity (for example, a customer, product information, or monthly sales). You plan to reuse the data to build models for different use cases, typically joining it with other data first.
  • Data that is the result of feature construction, and contains “higher-level” attributes such as customer lifetime value.

To get started with feature groups, see our documentation on the Feature Group UI and the AssetFrame to learn about their full capabilities.

We have also added a new Sample Project in the Watson Studio Gallery: Creating and using feature store data. In this project, you will find more examples of the AssetFrame library, including how to use a Feature Group to train an AutoAI model.

Start making your feature data “stand out” through feature groups — we’d love to hear your feedback!

Alexander Lang

Architect in the IBM Watson Studio Team. Experience in Data Science, NLP and Social Media Analytics