Feature Store Design at Constructor

How we compute, store, and deliver ranking signals for our ML

Published in Constructor Engineering · 15 min read · Jan 18, 2023

One of the central components in product ranking at Constructor is a Feature Store, a centralized storage of ML features. The Feature Store has two APIs: offline and online. The offline API is used during model training and experiments, while the online API is called by our production Machine Learning models to retrieve features before making a prediction.

In this article, we’ll discuss feature challenges in real-time Machine Learning, walk through the current design of the Feature Store in Constructor with code examples, go over key motivations and decision drivers behind it, show how Data Science experimentation workflow benefits from the Feature Store API, and discuss our future plans.

Table of Contents
· A Recap of Product Ranking at Constructor
· Feature Challenges in Real-Time ML
∘ Motivations
∘ E-Commerce ML Ranking Overview
· Existing Approaches for Feature Delivery
∘ Features Logging
∘ Feature Store
· Constructor’s Feature Store
∘ Requirements
∘ Offline Batch Layer
∘ Online Serving Layer
∘ Ranking Service
· Results
· Future Plans

A Recap of Product Ranking at Constructor

If you haven’t read Product Ranking at Constructor, please go ahead and read it!

Terminology Disclaimer

Customer refers to a business that uses Constructor, i.e. our client
Shopper or user refers to a particular user on a website performing some product discovery activity

Serving HTTP requests at Constructor has a lot of requirements, such as real-time personalization (no caching), high volume, high performance, and unique data sets per customer.

Here is the high-level flow when we receive a search HTTP request:

Retrieve a set of matching products with an initial ranking
Send top-K candidates to the Ranking Service for enhanced ML ranking and reranking
Apply Business/Merchandising rules to results

First, the Retrieval System takes a query and returns a matching set of candidate products with an initial lightweight ranking. Second, the top-K candidates are sent to the Ranking Service. The Ranking Service calculates all the features needed for an ML model, including dozens of computationally complex ones, and runs model inference to score the candidates. Predictions are returned to the Backend Service for subsequent reranking. Finally, Business/Merchandising rules (slotting, boosting, and burying) are applied, affecting the order of results for the last time. Merchants create rules to infuse business specifics into product rankings, for example to handle sales or promotions or to increase the visibility of new products.

In the rest of this article, we’ll concentrate on what happens inside the Ranking Service. In particular, we’ll discuss common feature challenges in online ML, serving hundreds of models with varying configurations, and how we solved feature calculation, delivery, and storage via our Feature Store.

Feature Challenges in Real-Time ML

Motivations

As mentioned in Product Ranking at Constructor, when deciding to build the Ranking Service, we had several motivations to address:

Improve time-to-market for ranking improvements (from months to days)
Support more ranking signals (<10 to 10–100)
Support almost any ML ranking model (e.g., Gradient Boosted Trees, Linear Models, Neural Networks)
Have a flexible, customer-specific model configuration system
Have a unified way to handle customer-provided ranking factors

These motivations arose for various reasons. The implementation of the “ranking part” in the retrieval system was built with specific ranking factors in mind, such as token-to-product relevance, and it lacked extensibility. Flexible configuration was needed because Constructor serves many customers, and we wanted to serve many models per customer (e.g., for A/B testing or serving in different contexts). Moreover, every customer has a unique combination of domain specifics and patterns in shopper behavior, which leads to different optimal model configurations and feature sets.

E-Commerce ML Ranking Overview

Let’s take a concrete case as an example.

We want to train an ML model that predicts the probability of a conversion event like add-to-cart or purchase. A training dataset is constructed using request logs (search results shown in the past) along with behavioral logs (user activity such as searching, clicking, or adding a product to cart). Each dataset row is a logged product shown in a specific request (the request has an associated timestamp, search query, user ID, and other info). The label is a binary indicator of whether an interaction happened within T seconds after the request was made.

*E-Commerce ML Ranking Dataset Example*
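The labeling rule described above can be sketched in a few lines. This is a minimal illustration that assumes small in-memory lists instead of real log tables; the `ShownProduct` and `BehavioralEvent` schemas and the `label_rows` helper are hypothetical names, not Constructor's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShownProduct:
    """One dataset row: a product shown in a specific request (hypothetical schema)."""
    request_id: str
    timestamp: float  # request time, seconds since epoch
    user_id: str
    query: str
    product_id: str


@dataclass(frozen=True)
class BehavioralEvent:
    """A conversion event such as add-to-cart or purchase (hypothetical schema)."""
    timestamp: float
    user_id: str
    product_id: str


def label_rows(rows, events, window_seconds):
    """Return (row, label) pairs.

    The label is 1 if the same user interacted with the same product
    within `window_seconds` after the request was made, else 0.
    """
    # Index event timestamps by (user, product) for quick lookup.
    by_key = {}
    for event in events:
        by_key.setdefault((event.user_id, event.product_id), []).append(event.timestamp)

    labeled = []
    for row in rows:
        times = by_key.get((row.user_id, row.product_id), [])
        hit = any(0 <= t - row.timestamp <= window_seconds for t in times)
        labeled.append((row, int(hit)))
    return labeled
```

In a real pipeline the same rule would be expressed as a join between request logs and behavioral logs; the window length T is a modeling choice.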

When training the model against the dataset, we want to utilize various kinds of features that our data can express. For example:

Product metadata (categories, facets, sizes, stock levels, colors, etc.)
Behavioral statistics over any period of time (global product CTRs, query-product CTRs, query-category CTRs, etc.)
Real-time user history
User metadata (geographical region, device)

Once a model is trained by an offline batch process, we save the artifact to cloud storage like S3 or GCS. The Ranking Service regularly picks up new artifacts for serving.

When it comes to online serving of ranking models, a few questions arise:

How do we get features for every request? Where do we store & retrieve them from?
How do we know what features the model needs? We have many customers and hundreds of models served in parallel, and the set of features varies for every model, so it can’t be hardcoded.
How do we ensure consistency between the feature values the model observed during training and what we see now in serving?

In order to address all the motivations and answer questions above, we adopted some ideas from the Feature Store concept and implemented a solution that covered all our needs at that point.

Existing Approaches for Feature Delivery

Let’s discuss two approaches for feature delivery that are popular in the industry. We don’t use them as is, but talking through them will help us understand the design choices made for Constructor’s Feature Store.

Features Logging

One approach is what we call Features Logging. In this approach, the Feature Calculation code lives on the backend side, close to model inference. Usually, that code uses additional features data (e.g., product CTRs or user history) saved in an external storage such as MySQL or Redis. When a request comes in, features are calculated and then sent to request logs. Since the training dataset is constructed from request logs, it has features available straight away. Once the model is picked up for serving, the backend runs the same feature calculation code and calls model inference.

In experiments, data scientists have two options for adding features to the model:

Use features that are already available in request logs
Construct new features using custom offline code

With the first approach, you are limited to only the features that are already logged and their derivations. With the second approach, when moving to production you’ll need to either rewrite offline code for serving or have duplicate feature logic for training and serving. Both options come with disadvantages. When re-implementing a feature for serving, you have double the work and backfilling problems (like needing to run code live for weeks to collect logged feature values). In the case of duplicate logic, there’s always a chance for train-inference mismatch that might ruin online model quality.

Feature logging is a good approach when you have a bunch of heuristics already in your codebase and want to apply ML using them as features — or when you want to use request-time information for features. Log them and use them for training. As you scale to more features, challenges with train-inference mismatch, backfilling, and double work arise, making the feature logging approach less attractive.

We opted for a feature logging approach for real-time features, i.e. the ones that have sub-hour update frequency.

Feature Store

Another approach that’s become popular in the last few years is the Feature Store. There are several differences from the feature logging approach:

Responsibility for the feature calculation is shifted from the online to offline environment. This allows feature code to be written once in an offline environment, and its result will be available for both training and serving.
Features are decoupled from particular log streams or datasets. Features are defined for particular entities and have specific time granularity. This allows joining features to any dataset during training with point-in-time correctness. This also allows backfilling features.
Feature Storage is centralized. This allows features to be reused by different consumers for different tasks.

One of the core pieces in the Feature Store that allows having a source of truth in an offline environment is a streaming engine. In order to support real-time features such as up-to-date user history for any point in time, events need to be processed by a streaming engine such as Apache Spark or Apache Flink.

In the rest of this article, we’ll go through Constructor’s Feature Store design and discuss our hybrid approach that has allowed us to utilize most of the FS principles while supporting real-time features without having a streaming engine in our architecture.

Constructor’s Feature Store

Requirements

Here’s a list of our key decision-making drivers for the Feature Store solution:

Usability

Minimized code duplication (reducing chances of train-inference mismatch)
Convenient DS experiment workflow (maximize iteration speed from idea to prod) for both ad-hoc experiments and scheduled runs
Convenient configuration engine that supports many customers and many models per customer

We focused on usability since our ML teams are small but we have many customers. We need to iterate very quickly to deliver massive value across the whole customer base.

Efficiency

Features are stored once, then reused by multiple models
Store 100s of millions of feature values with minimal increase in infra costs
Read latency for 1K feature values: P99 < 10ms

We focused on efficiency because our Backend Service sets a high standard for response time percentiles. Our customers have already gotten used to Constructor’s great performance, and we didn’t want to change expectations as we rolled out the Ranking Service. We set the SLO for Ranking Service response time at P99 < 100ms. Since the Feature Store lookup is just one step in Ranking Service processing, it has to be blazingly fast so that most of the latency budget can go to the model.

Considerations

Few real-time features, many daily-updated features
Reuse existing infrastructure if possible

We were ready to depart from the state of the art and take advantage of the specifics of our existing setup in order to bring the Ranking Service to life as soon as possible.

Offline Batch Layer

Entity, Feature Spec

Every feature definition consists of three components:

A List of Entities
Type (e.g. float, int, str)
Name

Entities are dictated by business domain, and in our case include but are not limited to:

user
query
browse page
product
product variation

Every instance of an entity has a join_key, a column that is used as an identifier (e.g. user_id for user, product_id for product, the query itself for query).

Such design allows for declarative definitions of features. This ultimately helps us to talk about features in the code, communicating what features a model needs, what features to retrieve from FS, and what features are available in FS.

As an example, a feature like “product CTR over the last 7 days” is defined for a single “product” entity, has type “float,” and has a name like “product_ctr_7_days”.
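A declarative feature definition of this kind could look like the sketch below. The class names `Entity` and `FeatureSpec` and the field layout are assumptions made for illustration; only the three components (entities, type, name), the `join_key` idea, and the `product_ctr_7_days` example come from the text. The second feature name is borrowed from the configuration shown later, and reading `qp` as "query-product" is also an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Entity(Enum):
    """Business entities; each member's value is its join_key column."""
    USER = "user_id"
    QUERY = "query"
    PRODUCT = "product_id"

    @property
    def join_key(self) -> str:
        return self.value


@dataclass(frozen=True)
class FeatureSpec:
    """Declarative feature definition: name, type, and a list of entities."""
    name: str
    dtype: type
    entities: Tuple[Entity, ...]

    @property
    def join_keys(self) -> Tuple[str, ...]:
        # Columns used to match feature values with dataset rows.
        return tuple(entity.join_key for entity in self.entities)


# The example from the text: product CTR over the last 7 days.
product_ctr_7_days = FeatureSpec(
    name="product_ctr_7_days", dtype=float, entities=(Entity.PRODUCT,)
)

# A feature defined for a pair of entities (assumed to be query and product).
qp_click_rate = FeatureSpec(
    name="qp_view_to_click_rate_30d",
    dtype=float,
    entities=(Entity.QUERY, Entity.PRODUCT),
)
```

Because a spec is plain data, the same object can describe what a model needs, what to retrieve from the FS, and what is available in it.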

Feature Registry

A set of batch jobs (Luigi PySpark tasks) calculates features in the offline environment on a daily basis. Each job’s output is a Parquet table with multiple feature columns. For example, a product CTR job saves CTRs for different actions (clicks, add-to-carts, purchases) into separate columns. Such groups of features are called Feature Groups.

Feature Registry is a Python class that keeps track of Feature Groups. Every day, a batch job goes through all feature groups in the registry and runs corresponding feature tasks for the current day’s partition. Thus offline, we have feature values available for the whole timeline and can backfill features for older dates.

All feature groups are registered in a declarative Python file (example here).
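As a rough illustration of the registry idea, the sketch below keeps feature groups in a Python class and runs them per daily partition. It is not the actual registry: `FeatureGroup`, `FeatureRegistry`, their methods, and the feature column names are hypothetical, and a plain callable stands in for the Luigi PySpark task that would write a Parquet partition.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class FeatureGroup:
    """Output of one batch job: a table with several feature columns."""
    name: str
    features: Tuple[str, ...]
    # Stand-in for a batch task: computes the partition for one day.
    compute: Callable[[date], str]


class FeatureRegistry:
    """Keeps track of feature groups and runs them per daily partition."""

    def __init__(self) -> None:
        self._groups: Dict[str, FeatureGroup] = {}

    def register(self, group: FeatureGroup) -> None:
        if group.name in self._groups:
            raise ValueError(f"duplicate feature group: {group.name}")
        self._groups[group.name] = group

    def group_for(self, feature_name: str) -> FeatureGroup:
        """Find the group that produces a given feature column."""
        for group in self._groups.values():
            if feature_name in group.features:
                return group
        raise KeyError(feature_name)

    def run_day(self, day: date) -> List[str]:
        """Run every registered group for one day's partition."""
        return [group.compute(day) for group in self._groups.values()]

    def backfill(self, start: date, end: date) -> List[str]:
        """Run all groups for each day in [start, end], inclusive."""
        outputs: List[str] = []
        day = start
        while day <= end:
            outputs.extend(self.run_day(day))
            day += timedelta(days=1)
        return outputs


registry = FeatureRegistry()
registry.register(FeatureGroup(
    name="product_ctr",
    features=("product_click_ctr", "product_add_to_cart_ctr", "product_purchase_ctr"),
    # Here the "job" only returns the partition path it would write.
    compute=lambda day: f"product_ctr/day={day.isoformat()}",
))
```

Calling `registry.backfill(date(2023, 1, 1), date(2023, 1, 3))` would run the group for three daily partitions, which is the property that makes older dates recoverable offline.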

Feature Joiner

Now, as we have all features registered in a single place, we need the Offline API to retrieve and join features to the dataset. For ranking, we’d have a dataset of logged results pages with an attached label indicating behavioral action presence. We want to enrich the dataset with features, e.g. product metadata, query-product CTRs, etc.

Since all the features conform to the same interface, it’s possible to write generic logic that adds any set of features to a dataset. It takes a dataset and feature specs as input, and adds actual feature values with point-in-time correctness. Remember how we used join_key in the entity definition? It is now used to match feature values with rows in a dataset. The component that does this join is called the Feature Joiner and is also implemented as a PySpark Luigi job.

As an example, suppose we have a ranking training dataset of the form (timestamp, request_id, user_id, query, product_id, position, label). If we have a product CTR feature in the registry defined for the PRODUCT entity, the joiner would add it as follows:

For the dataset, calculate a day column based on timestamp
Calculate min(day) and max(day) in the dataset
Read feature group data for a period from min(day) to max(day)
Left join dataset with features data on (product_id, day)
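The four steps above can be sketched in plain Python. The real joiner is a PySpark Luigi job; here lists of dicts stand in for DataFrames, and `join_feature`, `to_day`, and the `read_partition` callback are hypothetical names. How a day's partition is aligned with request time (for example, whether it only contains data from before that day) is outside the scope of this sketch.

```python
from datetime import date, datetime, timedelta, timezone


def to_day(timestamp):
    """Map a Unix timestamp to its UTC calendar day (the partition key)."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).date()


def join_feature(dataset, read_partition, feature_name, join_key="product_id"):
    """Left-join one daily feature onto dataset rows by (join_key, day).

    `dataset` is a list of dicts with a "timestamp" column.
    `read_partition(day)` returns {join_key value: feature value} for that day.
    """
    # Step 1: derive the day column from the timestamp.
    rows = [dict(row, day=to_day(row["timestamp"])) for row in dataset]
    if not rows:
        return rows
    # Step 2: find the period covered by the dataset.
    min_day = min(row["day"] for row in rows)
    max_day = max(row["day"] for row in rows)
    # Step 3: read feature data only for days in [min_day, max_day].
    features = {}
    day = min_day
    while day <= max_day:
        features[day] = read_partition(day)
        day += timedelta(days=1)
    # Step 4: left join on (join_key, day); unmatched rows get None.
    for row in rows:
        row[feature_name] = features[row["day"]].get(row[join_key])
    return rows


# Toy feature data: one partition per day.
partitions = {
    date(2023, 1, 1): {"p1": 0.10},
    date(2023, 1, 2): {"p1": 0.20},
}
dataset = [
    {"timestamp": datetime(2023, 1, 1, 12, tzinfo=timezone.utc).timestamp(), "product_id": "p1"},
    {"timestamp": datetime(2023, 1, 2, 12, tzinfo=timezone.utc).timestamp(), "product_id": "p1"},
    {"timestamp": datetime(2023, 1, 2, 13, tzinfo=timezone.utc).timestamp(), "product_id": "p2"},
]
enriched = join_feature(dataset, lambda d: partitions.get(d, {}), "product_ctr_7_days")
```

The same product gets a different feature value depending on the day of the request, and a product with no feature data keeps an empty value instead of dropping the row.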

Enriched with features, the dataset is final and can be used for training.

*E-Commerce ML Ranking Dataset enriched with Features*

Model Interface

We are now at the bridge to the online part of the system, and the questions we need to answer at this point are:

After the model is picked up in the Ranking Service, how does it know what features the model needs?
How do we make the Ranking Service model agnostic to allow as much flexibility for DS as possible?

To answer these questions, we defined a simple model interface that all actual model implementations inherit from.

Knowledge about features is carried in the model artifact. The Ranking Service doesn’t need to know any implementation details of the model; it only needs to be able to a) load the artifact from cloud storage, b) get the required features list, and c) call the predict method.

This interface lives in a shared library and is used both during offline model training and in the Ranking Service.
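A minimal sketch of such an interface is shown below. The method names `get_required_features` and `predict` follow the names used in this article; the exact signatures, the use of feature names as strings, and the `LinearModel` toy implementation are assumptions for illustration. Artifact loading is left out because the artifact format is not described here.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Sequence


class ModelInterface(ABC):
    """Contract shared by offline training code and the Ranking Service."""

    @abstractmethod
    def get_required_features(self) -> List[str]:
        """Names of the basis features the model needs as input."""

    @abstractmethod
    def predict(self, features: Sequence[Dict[str, float]]) -> List[float]:
        """Score candidates; `features` holds one feature dict per candidate."""


class LinearModel(ModelInterface):
    """Toy implementation: a weighted sum of its required features."""

    def __init__(self, weights: Dict[str, float]) -> None:
        self.weights = dict(weights)

    def get_required_features(self) -> List[str]:
        # The feature list travels with the model instance (the artifact).
        return list(self.weights)

    def predict(self, features: Sequence[Dict[str, float]]) -> List[float]:
        return [
            sum(weight * row[name] for name, weight in self.weights.items())
            for row in features
        ]
```

A caller that only knows `ModelInterface` can serve a gradient boosted model, a linear model, or a neural network the same way, which is what keeps the serving side model-agnostic.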

Basis and Processed Features

In experiments, DS might want to apply some transformation on a feature and see if it improves metrics. For example, we might want to apply min-max normalization to a numerical feature before training the model, or calculate textual distance features based on query and product title.

How does this play with the Feature Store? Do we store transformation outputs in the FS as a separate feature or do we make this processing a part of the model itself?

We decided to keep both options open. On one hand, if reuse or better performance is needed, it makes sense to put the transformed value into the FS as a separate feature. On the other hand, some transformations are only possible at request-time, such as calculating textual distance mentioned above.

Features stored in the Feature Store are also referred to as basis features (returned by ModelInterface.get_required_features(), serving as input in ModelInterface.predict()), while features obtained by post-processing basis features are called processed features. For the examples above, the numerical feature is a basis feature, while min-max normalized feature is a processed one; product title is a basis feature, while textual distance is a processed one.
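To make the distinction concrete, here is a small sketch of turning basis features into processed ones inside the model's own code. The `price` feature, the normalization bounds, and the token-level Jaccard distance are illustrative assumptions; the article does not say which numerical features or which textual distances are actually used.

```python
def min_max_normalize(value, lo, hi):
    """Scale a value into [0, 1] using bounds fitted on the training set."""
    if hi == lo:
        return 0.0
    return (value - lo) / (hi - lo)


def token_jaccard_distance(query, title):
    """A simple textual distance: 1 - Jaccard similarity of lowercase tokens."""
    q_tokens = set(query.lower().split())
    t_tokens = set(title.lower().split())
    if not q_tokens and not t_tokens:
        return 0.0
    return 1.0 - len(q_tokens & t_tokens) / len(q_tokens | t_tokens)


def process_features(basis, price_bounds):
    """Turn basis features (from the FS or the request) into model inputs."""
    lo, hi = price_bounds
    return {
        # Basis: numerical feature; processed: its min-max normalized value.
        "price_minmax": min_max_normalize(basis["price"], lo, hi),
        # Basis: query and product title; processed: textual distance,
        # which can only be computed once the query is known at request time.
        "query_title_distance": token_jaccard_distance(
            basis["query"], basis["product_title"]
        ),
    }
```

Keeping this step inside the model means training and serving run the same transformation code on the same basis values.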

Configuration

As a consequence of bringing features to a uniform interface, we get quite a nice experience for DS. They can define a config file for any experiment with all the features to use in training. Once satisfied with the result, the same config is scheduled for production runs. All the configs are stored and versioned with the rest of the codebase.

model:
    type: boosting
    n_estimators: 500
    max_depth: 10
transforms:
    - name: text_distances
      input_columns: ['product_title', 'query']
dataset:
    name: main_ranking_dataset
    period: 30d
    sampling: uniform
basis_features:
    - 'product_title'
    - 'qp_view_to_click_rate_30d'
    - 'qp_view_to_add_to_cart_rate_30d'
    - ...<DS can add / remove features here>...

Online Serving Layer

When it comes to the online part, the central question is which low-latency storage to use for request-time feature retrieval.

Index Service

The Index Service is an in-house service that manages multiple memory-mapped data structures. Data lives on disk and, thanks to memory mapping, doesn’t eat up a lot of RAM. The Index Service performs well when there’s a need for fast low-latency reads, but not for quick writes or updates. Data structures are updated regularly in the background by Index Builders and put on disk for serving. One type of data structure the Index Service supports is a simple key-value store, which is enough for online Feature Store use cases.
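The internals of the Index Service are not described here, so the following is only a toy illustration of the general idea: a read-only key-value structure that is built offline, written to disk, and then served through a memory map. The file format (fixed-width records sorted by key, looked up with binary search) and all names are assumptions.

```python
import mmap
import os
import struct
import tempfile

KEY_SIZE = 16
# Fixed-width record: 16-byte key (null-padded) followed by one float64 value.
RECORD = struct.Struct("<16sd")


def build_index(path, items):
    """Builder side: write key/value pairs sorted by key for binary search."""
    encoded = []
    for key, value in items.items():
        raw = key.encode("utf-8")
        if len(raw) > KEY_SIZE:
            raise ValueError("key too long for this toy format")
        encoded.append((raw.ljust(KEY_SIZE, b"\0"), value))
    encoded.sort()
    with open(path, "wb") as f:
        for raw, value in encoded:
            f.write(RECORD.pack(raw, value))


class MmapKV:
    """Serving side: read-only lookups over a memory-mapped file."""

    def __init__(self, path):
        self._file = open(path, "rb")
        # Pages are loaded lazily by the OS, so the file need not fit in RAM.
        self._mm = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        self._count = len(self._mm) // RECORD.size

    def get(self, key, default=None):
        target = key.encode("utf-8").ljust(KEY_SIZE, b"\0")
        lo, hi = 0, self._count
        while lo < hi:
            mid = (lo + hi) // 2
            raw, value = RECORD.unpack_from(self._mm, mid * RECORD.size)
            if raw == target:
                return value
            if raw < target:
                lo = mid + 1
            else:
                hi = mid
        return default

    def close(self):
        self._mm.close()
        self._file.close()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "features.idx")
    build_index(path, {"p1": 0.12, "p2": 0.30, "p3": 0.05})
    kv = MmapKV(path)
    found = kv.get("p2")
    missing = kv.get("p9")
    kv.close()
```

Updates in this style are done by building a new file and swapping it in, which matches the trade-off described above: fast reads, slow writes.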

We compared Index Service with Redis, and both options had similar read performance, so for us the deciding factor was pricing. Based on our back-of-the-envelope calculations, going with Redis would mean an increase in total production costs of more than 10%. With the Index Service all data lives on disk, so cost increase was negligible.

After going to production with the Feature Store based on the Index Service, we saw even better read times than in our benchmark. Read latencies for production requests have P99 < 25ms while we store 10x more feature values and request 10x more feature values from the FS on every call.

Daily Batch Ingestion

Every day, a Luigi task goes through all model configs scheduled for production, collects all the latest feature values needed by models, and pushes them to the cloud storage. Index Builders pick up new data and put prepared indexes in the Online Feature Store for serving.
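The collection step of this daily task can be sketched as follows. The config shape mirrors the `basis_features` key from the configuration example above; the function names and the in-memory `latest_values` mapping are hypothetical stand-ins for the Luigi task and the offline feature tables.

```python
def collect_required_features(model_configs):
    """Union of basis features across all production configs, deduplicated."""
    required = set()
    for config in model_configs:
        required.update(config["basis_features"])
    return sorted(required)


def build_export(required, latest_values):
    """Select the latest values only for features that some model needs.

    `latest_values` maps feature name -> {join key value: feature value}.
    """
    return {
        name: latest_values[name] for name in required if name in latest_values
    }
```

Because the union is taken across all configs, a feature shared by several models is exported once, and features that no production model uses are not pushed to the online store.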

Real-Time Features Support

Everything we have discussed so far works for features with batch updates. Such features are usually based on a long aggregation period or cover cases where we don’t need to react to changes immediately. For example, product CTRs based on the last 30 days of data can be updated daily.

But what about features where we do need to propagate updates within seconds?

For such features, we decided to leave the “features logging” path open. Our source of truth lives not in the Offline Feature Store, but rather in external storage. Once a shopper performs some activity, our behavioral API receives an event and updates Redis. This way, feature updates are propagated to the model almost immediately (with a lag of milliseconds to seconds). The Ranking Service, in addition to calling the Online Feature Store (Index Service), also checks Redis for real-time features.
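The two read paths can be sketched as below. To keep the example self-contained, a small in-memory class stands in for both Redis and the Index Service; the key layout, the `history:` prefix, and the function names are assumptions, not the production schema.

```python
class InMemoryStore:
    """Stand-in for a key-value backend (Redis or the Index Service)."""

    def __init__(self, data=None):
        self._data = dict(data or {})

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


def on_behavioral_event(realtime_store, user_id, product_id, max_history=50):
    """Behavioral API side: update the shopper's recent history on each event."""
    key = f"history:{user_id}"
    history = realtime_store.get(key) or []
    # Most recent interaction first, truncated to a fixed length.
    realtime_store.set(key, ([product_id] + history)[:max_history])


def fetch_features(batch_store, realtime_store, user_id, product_id):
    """Ranking Service side: combine batch and real-time features."""
    return {
        # Updated daily through batch ingestion.
        "product_ctr_7_days": batch_store.get(f"product_ctr_7_days:{product_id}"),
        # Updated within moments of the shopper's activity.
        "user_recent_products": realtime_store.get(f"history:{user_id}") or [],
    }
```

The model sees one flat set of features regardless of which store each value came from.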

Another type of real-time feature is quickly changing dimensions in the product catalog, such as stock levels. Customers regularly (sometimes every 5 minutes) send catalog updates which are picked up by Index Builders and then served by the Index Service.

To summarize the different types of features and how we handle them, the table below can be helpful.

Historical vs Real-Time Features Comparison

A cleaner way to handle both historical and real-time features would be a Lambda or Kappa architecture, but given the split between historical and real-time features and the fact that Redis and the Index Service were already part of our architecture, we decided to leave streaming for future improvements.

Below, you can see the overall system architecture diagram which covers all the components we have discussed.

Constructor’s Feature Store: Overall Architecture

Ranking Service

The design described so far allowed us to make the Ranking Service completely model-agnostic. There is no need to write custom code in the Ranking Service for a particular model or customer. All implementation details are encapsulated in the model interface and in the uniform feature design.

Here’s what happens when a request comes into the Ranking Service:

RS looks up a requested model by (customer_id, model_name)
RS calls model.get_required_features() to receive required feature specs
RS looks at additional features passed in the request
RS calls FS for the rest of features (present in required specs but not passed in the request)
RS calls model.predict() with retrieved basis feature values
RS returns predicted scores back to Backend Service
RS logs request data along with all the feature values
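Put together, the request flow above could look roughly like this. It is a sketch under several assumptions: the request is a plain dict, the Feature Store is a dict keyed by (feature name, product ID), missing values default to 0.0, and `ToyModel` is a made-up model that exposes the two interface methods named in this article.

```python
class ToyModel:
    """Hypothetical model exposing the two interface methods."""

    def get_required_features(self):
        return ["product_ctr_7_days", "is_mobile"]

    def predict(self, rows):
        return [row["product_ctr_7_days"] + 0.5 * row["is_mobile"] for row in rows]


def handle_request(models, feature_store, request, request_log):
    # 1. Look up the requested model by (customer_id, model_name).
    model = models[(request["customer_id"], request["model_name"])]
    # 2. Ask the model which feature specs it requires.
    required = model.get_required_features()
    # 3. Look at additional features passed in the request.
    passed = request.get("features", {})
    # 4. Call the FS only for required features not passed in the request.
    missing = [name for name in required if name not in passed]
    rows = []
    for product_id in request["product_ids"]:
        row = {name: passed[name] for name in required if name in passed}
        for name in missing:
            # Defaulting to 0.0 for absent values is an assumption of this sketch.
            row[name] = feature_store.get((name, product_id), 0.0)
        rows.append(row)
    # 5. Run inference on the retrieved basis feature values.
    scores = model.predict(rows)
    # 7. Log request data along with all the feature values
    #    (done before returning here, for simplicity).
    request_log.append({"request": request, "features": rows, "scores": scores})
    # 6. Return predicted scores to the Backend Service.
    return scores


models = {("acme", "default"): ToyModel()}
feature_store = {
    ("product_ctr_7_days", "p1"): 0.25,
    ("product_ctr_7_days", "p2"): 0.5,
}
request = {
    "customer_id": "acme",
    "model_name": "default",
    "product_ids": ["p1", "p2", "p3"],
    "features": {"is_mobile": 1.0},
}
request_log = []
scores = handle_request(models, feature_store, request, request_log)
```

Nothing in `handle_request` refers to a specific model or customer, so adding a new model only requires registering another object that implements the same two methods.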

Results

By incorporating best industry practices for ML feature operations and adjusting for the specifics of our existing architecture, we have been able to get a lot of benefits:

Drastically improved time-to-market for new features
A convenient DS workflow
No code duplication

And all that comes with a minimal increase in infrastructure costs. Online Feature Store performance satisfies our needs at the moment, allowing us to put all the complexity into the model. Our design supports both batch features and real-time features.

Future Plans

While the described solution covers most of our needs, there are always things to improve. A few directions we’re exploring at the moment include:

Hourly updates for batch features. Features that don’t really need online updates can still be updated more often than daily via the offline-to-online ingestion dataflow.
With more online features to support, explore Spark streaming (Lambda or Kappa architecture)

Dmitry Persiyanov is a data science team lead at Constructor where our data science and engineering teams collaborate on e-commerce search and discovery and serve billions of requests each year for retail’s top companies.

Interested in joining the future of search? We’re hiring.