Requirements for your Feature Store


Written by Iago Brandão, revised by Mônica Borges and Pedro Carvalho

Introduction

Recently there has been a lot of talk about the Feature Store, a relatively new concept in the market: a centralized repository of features for machine learning models. It brings several benefits, the main one being the reuse of features.

During the definition and implementation of the Databricks Feature Store here at ViaHub, we realized that we need clear processes and requirements for the Feature Store, so that its adoption and evolution can be as fluid as possible.


In this article, we share some tips and lessons learned to help make your process of adding new features to the Feature Store a success.

Context

To get the most out of these tips, it is worth saying a bit more about the tools and technologies involved in the Feature Store we use.

In the Databricks Feature Store, we deal with feature tables, that is, tables with various features that can be reused by Data Scientists and machine learning models. To save features into feature tables, we run computation on clusters and use Spark as the processing engine.
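As a minimal sketch (with hypothetical table, column and DataFrame names), registering a feature table with the Databricks Feature Store client looks like this:

```python
from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

# features_df is a Spark DataFrame produced by your ETL (hypothetical name)
fs.create_table(
    name="feature_store.client_spending_features",  # placeholder table name
    primary_keys=["client_id"],
    df=features_df,
    description="Aggregated client spending features",
)
```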

In addition, it is important to mention that features can be generated at any time, mainly by Data Scientists and Data Analysts, but they only become productive after being deployed to the Feature Store, hence the need to standardize and request pre-deployment requirements.


Prerequisites for the Feature Store

Considering this context, here are some tips and lessons learned to facilitate this process of inserting new features in the Feature Store.

Use incremental writes

Prefer writing in append mode. You don’t want a black and white world: a full overwrite will, quite literally, overwrite your data, while incremental writes keep it colorful and vivid. If you overwrite your data, you erase it and put other data in its place, impacting scientists, analysts and Machine Learning models, besides making analyses and backtests with historical data unfeasible.
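In plain Spark/Delta terms, the difference looks like the sketch below (the table name is hypothetical; for managed Databricks feature tables, the incremental equivalent is writing with fs.write_table using mode="merge"):

```python
# incremental write: new rows are appended, history is preserved
features_df.write.format("delta").mode("append").saveAsTable(
    "feature_store.client_spending_features"
)

# what to avoid: a full overwrite replaces everything that was there before
# features_df.write.format("delta").mode("overwrite").saveAsTable(
#     "feature_store.client_spending_features"
# )
```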

Finish and validate the ETL before including it in the Feature Store

Even though validating that the ETL works correctly and generates the expected content is an obvious and essential point, it is important to make it explicit to everyone who wants to deliver features to the Feature Store.
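The checks themselves depend on each ETL, but a minimal sketch of the kind of validation we have in mind (with hypothetical table and key names) could be:

```python
from pyspark.sql import functions as F

candidate = spark.table("analytics.client_spending_candidate")  # hypothetical source table

# the ETL must not produce an empty table
assert candidate.count() > 0, "candidate feature table is empty"

# primary keys must be present and unique
assert candidate.filter(F.col("client_id").isNull()).count() == 0, "null client_id found"
assert (
    candidate.count()
    == candidate.select("client_id", "reference_date").distinct().count()
), "duplicate (client_id, reference_date) keys found"
```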

Version the ETL code that creates your features

You will often need to ensure that the features in production are calculated as expected, avoiding a mismatch between the expected code and the production code that generates the feature; to report how a given feature was calculated; or even to roll back and recalculate one or more features in their previous format. For all of that, version the code that generates your features. Using GitHub is a very reasonable option.

Specify the path/directory for your table

This is important both for Feature Store feature tables and for features that are candidates to enter the Feature Store. If you are using Spark and Delta/Parquet tables, do not forget to save your data in a directory that can be accessed; otherwise, Hive will save the data in its default location, which may make it impossible to query and consume it from other workspaces without further configuration.
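For example, assuming a hypothetical storage path, you can pin the table to an explicit location so it does not end up in Hive's default warehouse directory:

```python
(
    features_df.write.format("delta")
    .mode("append")
    .option("path", "s3://my-company-datalake/feature_store/client_spending_features")  # hypothetical path
    .saveAsTable("feature_store.client_spending_features")
)
```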

Describe each feature, you will need it

Preferably, ask whoever created the feature table to describe what each feature represents; it is possible that only that person knows essential details, such as the business knowledge and logic involved. This prevents the ML team from writing a description so generic that it keeps the feature from being reused.
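A lightweight way to keep these descriptions next to the data, sketched here with Spark SQL and hypothetical table/column names, is to add column comments to the Delta table:

```python
spark.sql("""
    ALTER TABLE feature_store.client_spending_features
    ALTER COLUMN cli_avg_spending_30d
    COMMENT 'Average client spending over the last 30 days, excluding cancelled orders'
""")
```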

Value governance, prefer Delta tables

Prefer the Delta table format over the Parquet format; the additional governance layer is of great value when you need to check what changed in the data or go back to a previous version of the data if necessary. This helps you deal with unexpected events related to the data, e.g. someone running a DELETE command without a WHERE clause.
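As an illustration, Delta lets you inspect the change history of a table and read an older snapshot of it; the table name and version number below are hypothetical:

```python
# list the operations performed on the table (writes, deletes, schema changes, etc.)
spark.sql("DESCRIBE HISTORY feature_store.client_spending_features").show(truncate=False)

# time travel: read the data as it was at a previous version
previous = spark.sql(
    "SELECT * FROM feature_store.client_spending_features VERSION AS OF 42"
)
```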

Specify the feature write periodicity

The person who created the feature table must indicate the best periodicity to load these features into the Feature Store, even if this periodicity is null, as in the case of features that will be written only once. This way, whoever consumes the data will know which periods will have data available for their analyses and ML models.
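One way to record this next to the table itself, assuming a table-property convention of our own rather than anything the Feature Store requires, is:

```python
# document the expected refresh cadence as table metadata (hypothetical property name)
spark.sql("""
    ALTER TABLE feature_store.client_spending_features
    SET TBLPROPERTIES ('refresh_periodicity' = 'daily')
""")
```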

Use a naming convention

With each new feature or feature table, you will realize the need for a naming convention; otherwise it becomes harder to use and join features when many different names refer to the same information, e.g. client_id, id_cli, cli_avg_spending, CliAvgSpend, and so on.
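Whatever convention you choose, apply it before writing to the Feature Store; a small sketch with hypothetical column names, standardizing to snake_case with entity, metric and window:

```python
standardized = (
    raw_features_df
    .withColumnRenamed("id_cli", "client_id")
    .withColumnRenamed("CliAvgSpend", "cli_avg_spending_30d")
)
```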

I believe you can already tell that here at ViaHub we are passionate about high performance, autonomy and participation, right? These are some of the pillars of ViaHub, a tech company! If you liked this and are interested in joining such a team, just sign up on our job portal at https://viahub.gupy.io/. Also learn more about ViaHub and our tech culture at https://www.viahub.com.br/!
