A Feature Store to Enable Machine Learning Innovation at BFA Industries

IPSY Technology
CODE + CONTOUR by IPSY
Apr 2, 2021 · 5 min read

Contributors: Atishay Jain, James Faghmous, Jeremiah Gaw

Machine learning is a field of computer science that develops algorithms whose performance improves as more data are observed. Historically, machine learning has been considered a subfield of artificial intelligence focused on predictive models: given a set of historical data, make a prediction for new, unseen data. This focus on prediction distinguished it from early artificial intelligence, which centered on reasoning and cognition.

Today, most companies are racing to adopt machine learning, and some are even reorganizing to become "AI-first" organizations. Why such intense interest? First, recent advances in computer hardware (and cloud computing) have made adopting machine learning and artificial intelligence extremely affordable; in the 80s and 90s, you would have needed a supercomputer at a university or government agency to build AI algorithms. Second, AI has proven extremely adept at certain tasks, matching and sometimes outperforming humans at image classification, task automation, and speech recognition.

But what does that have to do with beauty, you might ask? At BFA, we are committed to empowering everyone to express their unique beauty. This means that every month we work across technology, merchandising, operations, and customer service to deliver our subscribers the best possible bag for them to express their unique beauty. Machine Learning and AI play a critical role in delivering our core value proposition from sourcing the right items, to sending the right email, to building the perfect bag based on a subscriber’s personalized preferences.

Innovative Machine Learning Model: A Feature Store

At BFA, machine learning is at the core of delighting our subscribers. Each month, several machine learning models are used to give our subscribers the best products, at the best time, and at the right price.

On any given day, our data scientists are building models to improve the member experience, reduce subscriber churn, and help source the best items. So a significant effort goes into training models and exploring new model ideas by gathering data and building prototypes.

Given the sheer number of models we rely on and explore, we have begun looking at improving efficiencies in our model development pipeline. One bottleneck we identified was feature generation. It takes an immense amount of time and effort to identify relevant data and convert them to features. Moreover, given the sheer number of models being built, several members were building similar features multiple times.

Furthermore, we had no standard procedure for generating features, so two team members might compute the same feature in subtly different ways, producing inconsistent results. This created significant business risk in the form of irreproducibility, potential errors, and redundancy. Recognizing these risks led us to develop our feature store to simplify and streamline feature generation and model training.

Development of Machine Learning-Based Feature Store

When we started developing the feature store, we had the following requirements for any potential solution:

  • Reusable Modeling Datasets (context-free)
  • Incrementally updated datasets
  • Backfill across a range of dates
  • No circular dependencies
  • Code sharing via a single library
  • Dataset metadata handling
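To make the "backfill across a range of dates" requirement concrete, here is a minimal sketch of iterating the historical partitions a backfill job would need to compute. The helper name and daily granularity are assumptions for illustration; the feature store's actual backfill scheduler is not shown in this post.

```python
from datetime import date, timedelta

def backfill_dates(start, end):
    """Yield each daily partition date in [start, end], inclusive.

    Hypothetical helper: a backfill job would recompute the feature
    for every one of these historical partitions.
    """
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)

# e.g. backfilling four days of a feature's history
dates = list(backfill_dates(date(2019, 4, 15), date(2019, 4, 18)))
```

In practice the same code path that performs the daily incremental update can be invoked once per partition date, which keeps backfills and regular updates consistent.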

Most of our features are time-based: their value can change depending on the timestamp at which they are queried. A simple example helps illustrate this.

Suppose a user joins us on 15th April 2019 and enters an address. Then, on 14th December 2019, they update their address on our website. Depending on when we calculate their location feature, we will get two different values for the same feature.

Any query between 15th April and 14th December should return the old address, while any query on or after 14th December should return the updated address. So the feature is generated twice for the same user, each with a different timestamp, and the data scientist selects the value from the timestamp relevant to her analysis.
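The lookup described above is a point-in-time ("as of") query: of all recorded values for a feature, return the one with the latest timestamp that does not exceed the query time. This is an illustrative sketch over an in-memory list, not the store's actual lookup code.

```python
from datetime import datetime

def feature_as_of(history, query_time):
    """Return the feature value valid at query_time.

    `history` is a list of (effective_timestamp, value) pairs,
    e.g. one row per time the user edited their address.
    """
    valid = [(ts, v) for ts, v in history if ts <= query_time]
    if not valid:
        return None  # no value had been recorded yet
    # the most recently effective value wins
    return max(valid, key=lambda pair: pair[0])[1]

address_history = [
    (datetime(2019, 4, 15), "old address"),
    (datetime(2019, 12, 14), "new address"),
]
feature_as_of(address_history, datetime(2019, 6, 1))  # "old address"
feature_as_of(address_history, datetime(2020, 1, 1))  # "new address"
```

Using the query time rather than "now" is what prevents label leakage: a model trained on April data only ever sees the address that was known in April.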

To build the feature store, our machine learning science and machine learning platform teams worked together to build the right infrastructure and datasets. We divided all our features between team members who would code the feature, get it reviewed by another team member, set up backfilling for the feature, and set up automated jobs that would update the feature regularly.

Thanks to the feature store, it is now much easier to quickly build a feature set to train a model. Suppose we need a given feature: we simply query the feature store with the context and timestamp and grab the most recent value of the feature as of that timestamp. A data scientist no longer has to spend time querying, cleaning, and processing the raw data. For example, if we are interested in the location feature, we would simply run:

location_df = spark.table('database.location_table')
context.join(location_df, ['userId']).filter(context.timestamp > location_df.feature_end_time)

Every dataset that gets created in the feature store has metadata associated with it which gives information about the dataset and how to access it. A sample of metadata stored looks like this:

Dataset(
    table_name,
    s3_base_path,
    partition_timestamp_column,
    partition_column,
    data_format,
    owner,
)
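As a sketch of how such a metadata record might be represented, here is a plain dataclass with the fields listed above. The field values below (table name, S3 path, owner) are invented examples; the real feature store's class and naming may differ.

```python
from dataclasses import dataclass

@dataclass
class Dataset:
    """Illustrative metadata record for one feature store dataset."""
    table_name: str
    s3_base_path: str
    partition_timestamp_column: str
    partition_column: str
    data_format: str
    owner: str

# hypothetical entry for the location feature used earlier
location = Dataset(
    table_name="database.location_table",
    s3_base_path="s3://feature-store/location/",
    partition_timestamp_column="feature_end_time",
    partition_column="dt",
    data_format="parquet",
    owner="ml-platform",
)
```

Keeping this metadata alongside each dataset lets tooling discover where a feature lives and how to read it without consulting the feature's author.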

Improving the Model Development Process

Once we developed our feature store, we began to see new opportunities to further optimize our model development process. For example, shortly after launching the feature store, we noticed that we had multiple notebooks that gathered different features and joined them into one so that we had a dataset ready for model development.

Looking at our current pipeline, we realized that we could improve our process further by creating a way to interact with the feature store, enabling us to do all these steps in a single notebook (in parallel). With this idea, we developed our feature store API. The feature store API allows us to load a dataset, attach it to our context, do any post-processing and build the dataset.

Below is a simple example of how we use the API in practice.

Suppose we have three features that we need for a model. We can use the feature store, as highlighted earlier, to create them. The API then enables us to simply add the features to our context, apply any post-processing, and return the final dataset.

# load the features
feature_1_df = Feature_1()
feature_2_df = Feature_2()
feature_3_df = Feature_3()
# join features to context
context.add_feature(feature_1_df)
context.add_feature(feature_2_df)
context.add_feature(feature_3_df)
# any post-processing if necessary
context.add_postprocessing(feature.post_processing)
# build the dataset by joining all the features
features_df = context.build_dataset()

And that's it. The resulting 'features_df' is the final data frame that can be used for model development or analysis.
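To make the add_feature / add_postprocessing / build_dataset flow above concrete, here is a toy stand-in for the context object. The real API operates on Spark DataFrames; this sketch joins lists of dicts on 'userId' purely to illustrate the shape of the flow, and the 'num_bags' feature is an invented example.

```python
class Context:
    """Toy illustration of the feature store API's build flow."""

    def __init__(self, rows):
        self.rows = rows              # e.g. [{"userId": 1}, ...]
        self.features = []
        self.postprocessors = []

    def add_feature(self, feature_rows):
        # index the feature table by userId for a left join
        self.features.append({r["userId"]: r for r in feature_rows})

    def add_postprocessing(self, fn):
        self.postprocessors.append(fn)

    def build_dataset(self):
        out = []
        for row in self.rows:
            merged = dict(row)
            for feature in self.features:
                match = feature.get(row["userId"], {})
                merged.update({k: v for k, v in match.items() if k != "userId"})
            out.append(merged)
        # run post-processing steps over the joined rows
        for fn in self.postprocessors:
            out = [fn(r) for r in out]
        return out

context = Context([{"userId": 1}, {"userId": 2}])
context.add_feature([{"userId": 1, "num_bags": 5}])
features = context.build_dataset()  # user 2 simply has no num_bags value
```

Note that the join is a left join on the context: every context row survives, and users missing a feature value just come back without that column, mirroring how a Spark left join would behave.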

Final Remarks

Machine Learning and Artificial Intelligence have huge potential to transform commerce and the beauty industry. At BFA, we leverage AI for numerous business-critical applications. The machine learning team focuses on building models to delight our subscribers as well as the infrastructure to deliver new models frequently, easily, and reliably. Our feature store is one such infrastructure investment, and we have seen outsized returns in the form of convenience, efficiency, and reproducibility of the features we use to train our models.
