Feature calculation and serving are the heart of every classical machine learning system.
The goal of these components is to capture analytical signals in the data, and reduce them to the form of features which allow the model to predict the outcome accurately.
Calculating and serving features at “descent” scale requires domain and feature specific tradeoffs between system constraints and analytical functionality
The goal of the post is to review some of these system design tradeoffs and understand their implications on common analytical use-cases.
This post is aimed at teams who are in the process of building or evaluating a feature calculation and serving platform at “descent” scale.
The feature store paradigm
In the past few years, feature stores have gained a huge amount of interest from the machine learning community. They offer a number of useful abstractions that cover:
- A contract for feature retrieval
- Train/serve equivalence
- Feature re-use
- Feature lifecycle
- And more
Do I need a feature store?
Feature stores often require a considerable level of investment in infrastructure, and thus fit use-cases that meet certain scale requirements, e.g.:
- The # of features you plan to build is approaching triple digit
- The # of models you plan to build is approaching double digit with potential re-use of features, perhaps across multiple teams
- Your models require high throughput low-latency inference (tens or more QPS, sub-second latency)
- Overall you have TBs of raw data or more
- Individual features depend on large amounts of raw data, require high degree of freshness, or both
- Training your models requires point-in-time accuracy
A good example for such use-cases are companies in the e-commerce, fintech, or consumer facing services with 100K or more DAUs running multiple ML use-cases, from fraud detection to recommendation to time-series forecasting.
If your problems are more modest in scale and complexity, it’s possible that a full-blown feature store is an overkill
Having said that, some vendors offer low-friction solutions for “reasonable scale” cases where the tradeoffs and assumptions matter less.
Will feature store X be suitable for my use-case?
Feature store solutions are there to serve your analytical needs while meeting your system constraints.
Unfortunately, feature store solutions are usually described exclusively in the system technical domain — online, offline, key-value, batch, stream, API, etc.
As a results, teams who want to adopt feature stores often have to reverse engineer the assumptions, tradeoffs and the limitations they impose on the analytical functionality itself.
Hopefully, this post can shed some lights on the common tradeoffs in feature store design.
Note: this post is not about comparing individual feature store solutions or creating feature store checklists — but focus on deeper system design considerations.
Feature store vs. Feature pipelines
Some feature stores focus exclusively on storing and serving feature values. Others also offer a framework for defining and running the feature calculation itself — i.e. feature pipelines.
Feature stores and pipelines must be designed and optimized together in order to meet system design goals
Feature pre-computing paradigm
Feature stores are all about storing feature values in a database, and retrieving them during inference.
But what is the motivation here? And is this the only way to compute features?
In low-latency inference environments (e.g. user-blocking request/response), we have no more than a few 10’s of ms to prepare the features for prediction, and sometimes a lot less (e.g. in adtech use-cases).
At the same time, we often want to create features that depend on a large number of raw data points.
Consider the following feature for a fraud detection system:
“total payment volume in the account in the last 90 days”.
This feature value is derived from 90 days of transaction data.
In some cases (e.g. business accounts), this can mean hundreds of thousands of records. Now multiply thousands of records X number of features X requests per second…
Achieving the aggregated throughput of processing millions of events per second during inference, while keeping the latency of individual requests to tens of milliseconds, is usually too difficult and expensive to pull off.
Pre-computation allows us to perform most of the heavy lifting ahead of time, and keep inference latency down
Workflow of pre-computing feature values
- Feature pipeline aggregate data as it becomes available, e.g. in batch or in stream.
- Feature values are stored in a feature store
- A request is made to load features
- Features are read from the feature store
- Feature values are returned to the consumer
The advantages of pre-computing features
- Feature pipelines can be more throughput oriented and less latency sensitive
- Reading features remains low-latency
- Feature producers are decoupled from feature consumers
Precomputing is not a silver bullet
The pre-computing paradigm solves many challenges, but it also introduces a number of analytical compormises and constrains:
- Some features must perform part of their computation during inference
- Some analytical functionality simply cannot be pre-computed
- Precomputing means compromising on feature value accuracy
- Breaking feature calculation into several moving components in production adds complexity both in online and in offline.
Let’s dive right in.
Not all features can be fully pre-computed
Can all feature values be precomputed in full, ahead of the request, and just loaded as-is from the feature store?
Turns out that in many cases, the answer is “no”, or at least, “it would be extremely wasteful to do so”.
“Last mile” calculations
Consider the following feature: “distance in miles between current user location and known home address”.
Since the “current user location” is only known at the request time, the distance cannot be fully calculated ahead of time with any level of accuracy.
One of the ways around this issue is to combine pre-computation and “last mile calculation”
- Precompute home long/lat — geo location translation may depend on expensive API calls or lookups. As a result, it makes sense to perform the conversion between the user’s provided address to long/lat ahead of time
- Store the location as “features” on the user entity — i.e. as two float values , “home_long”, and “home_lat”
- Receive a request containing the current user id and long/lat
- Load the values from the store by user id
- Perform a “last mile” calculation in memory — between the location sent on the inference request, and the location from the DB
- Return the result
Note: “last mile” features are sometimes referred to as “on-demand”
Generally, features that depend on extremely-high cardinality data elements that are only known at request time, are virtually impossible to calculate efficiently ahead of time. Arguably this includes features that depend on the request time.
At the time of writing this post, some of the most prominent feature store solutions do not support last mile calculations
If we treat the home location, the current location and the distance as independent features, we can say that the distance feature depends on the home location feature and the current location feature.
To generalize further, it is useful sometimes to define a DAG of feature dependencies.
Feature DAGs allow us to defined features whos value depend on other features to allow re-use and encaplusation of logic
Last mile calculations challenge the feature store abstractions
Feature stores promise data scientists a clean interface for loading features in online and in offline.
When dealing with fully pre-computed features, the API we use to load values can a generic library + metadata (i.e. feature store client + registry).
Last mile features require the “feature store clients” in both online and offline to call pieces of imperative logic which can change frequently
In addition, these “last mile” calculations have to be optimized to run on both online and offline platforms efficiently, including replicating “DAGs” of feature dependencies in the offline workflow orchestration.
Features that cannot be pre-computed
Not all features can be efficiently pre-computed to start with. For example, graph-based queries for discovering relationships based on information from the current request cannot be efficiently pre-computed at all.
Building such features requires re-framing the problem entirely, or alternatively use non-feature store solutions like graph or vector databases etc.
Pre-computation vs. feature accuracy
Pre-computing is a very useful paradigm, but it comes with multiple tradeoffs.
As we saw, there are some features that cannot be pre-computed easily. In addition, by moving feature calculation to happen ahead of the inference, we also trade off their accuracy in several important ways.
Feature value staleness
Pre-calculation has intrinsic latency. If we rely mostly on pre-calculated values, we may not be able to consider datapoints that occurred very near the point of inference.
In a sense, pre-computation introduces a “event horizon” — the closer the event is to an inference request, the smaller the chance that this event will manage to influence the prediction.
Some features can tolerate hours or even days of staleness, and some cannot tolerate even a few milliseconds.
The fresher we want to value to be, the more expensive it is to manage the infrastructure that can (pre-)compute it.
Trading off feature freshness and operational cost is a domain-specific and use-case-dependent decision
More on this in the section below.
Time-windows design tradeoffs
Time window features are one of the most commonly used feature styles in many domain areas, and yet are notoriously tricky to engineer for.
An example of such a feature can be: “how many items has the user purchased within the last 7 days”.
Generally, time-window features perform :
- Aggregations of some value (# items purchased)
- Based on a key (user id)
- And time window (7 days).
Typically, the end of the window is the request’s event time.
Naively, to calculate the above example feature we need to load or scan 7 days worth of purchases in real time. However, this is unrealistic in most cases. It follows that we need to at least partially pre-compute them.
Time window features 101
Consider the following scenario:
- A user performs an a few item purchases on days 1, 2, 3, 4.
- We get inference requests on day 6 and on day 10.
- First, we notice that the value of the feature depends strongly on the point in which we are required to calculate it, i.e. it cannot be fully pre-computed at the time of observing new events.
- Knowing the feature’s aggregated value at a previous point in time is often not enough in order to derive its value at a later point in time.
These innocent-looking properties send us down a rabbit hole of complexity and tradeoffs — that involve terms like “tumbling windows”, “watermarks”, “tiles”, “monads”, “reversible trees” and “sketch algorithms”, to name a few.
Let’s take a closer look.
Main analytic considerations for time-window features
First, we need to consider the functional dimensions of the problem:
- What aggregations can I perform on time window data?
- For hard-to-calculate aggregation operations, are there “good enough” approximations?
- What are the sizes of windows supported?
- What is the accuracy at the “head” of the window (i.e. how fresh are the events considered in the results?)
- What is the accuracy at the “tail” of the window (i.e. the deeper past)?
- Can the computation method guarantee lack of train/serve skew?
Note that Inside a single ML system, we may have features that prefer different tradeoffs — for example:
- Detecting emerging fraud trends may need a short window with a high degree of freshness at the “head”
- Evaluating long term shopping habits will prefer counting over longer windows, but as a trade off, it can live well without the most recently browsed items
Sometimes, the technical solution needs to consider factors that are domain specific, like key cardinality and data skews.
For example, aggregating events for IP address needs to be carefully implemented in case we have individual addresses that may perform millions of events per minute.
Lift the hood of virtually any feature store out there, and you’ll find a different solution for time-windows, that makes different tradeoffs and assumptions about the features it will be serving.
Below are a few design alternatives for performing time window calculations, and the tradeoffs they offer on accuracy vs. complexity
Computation platform selection
Generally if we need to support both high freshness and deep history, we are looking at running multiple compute platforms.
If we have features that require both of these properties at the same time, we are looking at stitching together multiple compute platforms for one feature.
This is more complex operationally, and usually has implications about train/serve skew.
Time window design pattern 1 — tiles
- Use batch/stream/both to precompute “tiles” — sub-aggregation of the data for smaller time units
- A “tile” always refers to a key + fixd time range tied to an absolute time anchor, e.g.
“account ID=1234, time_anchor=2022–01–01 00:00 TO 2022–01–01 01:00”
- We can calculate multiple granularities of tiles — from daily to hourly to minutely etc. — some in batch and some in stream
- The finest granularity of tile (smallest time unit) should be determined by the level of freshness we need to achieve.
- During request time, we query the tiles that can cover the window we are interested in
- Then, we perform the final rollup to derive the value as a “last mile” calculation (in memory)
Tile granularity at the tail
Say we need to calculate “90 days rolling count of number of transactions performed by the account”.
Usually it’s cheaper and acceptable to count 90 days such that:
- We use fine grained tiles (e.g. minutes) for the recent history
- We use coarse grained tiles (e.g. days) for the “tail” (distant history)
Requiring minute-level granularity of tiles at both the “head” and the “tail” of the window can mean X1000 more records in the database.
Time window design pattern 2 — Tiles + events at head/tail/both
In this solution, we calculate tiles like before, and in addition, we :
- Stream the events into the DB with very low latency
- During query time, considers the tiles + the events at the “head” and/or “tail” to get milli-second accuracy
- Even writing events to the DB involves some milli-second latency
- Key and data skew can be catastrophic — imaging an IP address with thousands of events per minute
- Again, maintaining fine granularity at the tail should be avoided since it increases the data size by several orders of magnitude.
Time window design pattern 3 — timer-based recalculation
In this approach, the feature pipeline itself triggers a re-calculation and aggregates the value of the time window every <window size> seconds.
At inference time, the full window value is already pre-computed in the database (up to <window size> freshness)
For high cardinality, low-activity keys, this is an extremely wasteful approach.
Consider a feature calculated on e.g. user cookies.
Say the system contains 1 billion unique keys, but 900,000 of which are active once or twice and never appear again.
Eagerly calculating and pushing updates for these keys on an hourly basis requires orders of magnitude more resources than the tile-based approach.
On the other hand, if key cardinality is relatively low and key activity is relatively uniform, using this technique can simplify the calculation during feature retrieval.
Summary — time window design patterns
This is by far not an exhaustive review of time window calculation styles, which keep appearing in the industry.
Below is a summary of the difference between different styles of time window aggregations.
Some of the cell values depend on lower-level implementation details and hence this is not a definitive comparison.
Accuracy-scale tradeoff — approximation algorithms
Sometimes, we need to go further and change the core behavior of the aggregation so that it can scale to the problem.
Consider the following fraud detection feature: “count the number of distinct accounts this IP address accessed in the last 30 days”.
Unfortunately, storing the distinct count results in tiles and adding them up to a window will not provide the correct result.
A naive approach would be to simply store the values (in this case, account id’s) in tiles, and then load them and do one last “distinct” during inference. However, this solution has no fixed space and time requirements — think about very active keys, such as individual IP addresses which for one reason or another access many accounts.
We may decide to:
- Employ sketch algorithms like HyperLogLog which can use bounded space with tradeoff on accuracy
- Set a hard bound for the feature value per tile e.g. “>100”
Whatever we decide to do, we are trading off the accuracy of the feature with the amount of storage and IO we can consume to pre-calculate it.
The decision and method for approximation is very feature-and domain- specific.
Interestingly, many feature stores don’t yet support approximate counts out of the box.
In some cases, even approximated features are complex to implement.
As an exercise, consider the following feature for an e-commerce website with billions of items and highly-active accounts that perform many purchases: “given an item the user is currently browsing, what is the number of days since the user first purchased the item?”
Pre-calculation and train/serve skew
So far, we covered multiple methods designed to enable complex calculations while meeting the latency requirements of the online serving environment.
Pre-calculating features may cause train/serve skew — i.e. make training data different from prediction data.
Ingestion time vs. event time skew
Say you have a feature that you pre-aggregate using a batch process, and then write the results to some online store from which you will serve it.
The point in time in which the new values are made available to inference may vary based on the schedule in which you run your jobs, and the time it takes to ETL the data to production.
Simulating this lag accurately during training is not at all straight forward.
In addition, it’s relatively easy to design features that can work well in the online environment, but are decidedly not batch-friendly, and/or do not offer a point-in-time at all.
Which leads us to the next section.
Analytical tradeoffs driven by offline constraints
In offline, our main task is to prepare datasets for training and evaluation. Here, we typically have a “driverset” — a list of labeled examples we would like to train on, and the goal is to attach feature values to these examples, in a point-in-time accurate manner.
During this process, we need to optimize for:
- High throughput —attaching the values of 100’s of features, to 100K+ or even millions of training examples, in a “reasonable” time (say < few hours)
- Achieving training / serving parity — this includes point-in-time correctness of all features we are training on
- Simplicity — making it accessible to Data scientists, specifically around trying out new features
These considerations lead us towards a few tradeoffs:
Naive solution — fast forwarding data through a production replica
Naively, to achieve parity with serving, all we need is to deploy the exact same stack we have in production, and “fast forward” all the data through it, storing all the interim values and performing last mile calculation using the same mechanisms we use for serving.
However, computing a year worth of features in hours, using the same stack we use in production, would require an environment which is 1000X larger, just in order to create a training dataset.
As a result, in the offline environment we usually attempt to leverage a batch-oriented compute platform that is optimized for throughput.
Having different computation platforms in online and offline introduces the risk of drift, and introduces constraints on the feature logic.
When a data scientist comes up with a new feature idea, or solves a bug in a feature, the optimal flow would be for her to quickly calculate the new feature and attach it to a training set, to see if it helps.
In some cases, when the data is small and/or the feature logic is limited, this is very possible to do. For example — for features that depend only on the data from the inference event payload.
In other cases where we need to consider large amounts of historical data, computing the feature value from scratch may take a long time — and makes little sense for 20 Data scientists to recompute it from scratch for every experiment.
As a result, for large scale training data with hard-to-compute features, we typically attempt to calculate the feature data once, and re-use it every time we want to create a training set.
Backfilling features is both expensive, slow, and introduces further risk of skew.
Avoiding backfilling cost and reducing risk of skew
For large enough datasets and complex enough features, it’s often necessary to retrieve feature values from multiple sources.
The idea is to use whatever feature values we have already reliably computed, with a strong preference to sources that represent values calculated as close to the point of inference as possible.
Using features that were calculated near the point of inference reduces cost, time and chances of skew.
The diagram below demonstrates a selection of feature data sourcing alternatives. The darker the color, the more performant and reliable is the source.
Logged values from inference requests
The most reliable source of features will always be data we logged from within the inference service, after feature loading.
Not only are the features already attached to the example we want to train on, they are 100% guaranteed to be identical to the correct value the models can expect (setting aside operational changes since logging time).
Most feature stores do not have native support for leveraging logged features during offline dataset building.
Values computed by batch/stream in production
This is useful when we want to re-use mature features for a new model, performing a different task, potentially in a separate inference service.
This approach also helps us avoid event/ingestion time skew.
Previously backfilled features
Extending on the previous case, any backfilled values we already created for the feature is certainly a valid source for training.
After re-using all of the above, we may still be in a position where we have a few features we’d like to add, that are new and do not yet have values.
Here we still have some choices:
- If the features are important and/or easy to backfill — we can fill them
- If the features may not be that important and/or hard to backfill — we may decide to deploy them to production and log them for a while prior to consuming them in training.
There are multiple options for sourcing feature data such that we can save time, reduce cost and avoid the risk of skew. The decision for how to best source each feature requires analytical tradeoffs.
In this post, we reviewed several system design considerations for feature calculation and serving.
Like many areas of engineering, from a certain scale of features and data, this quickly becomes a difficult problem to solve, with subtle tradeoffs between system requirements and analytical functionality.
If you are building or adopting a feature store solution at scale, it’s critical to look at some of these considerations and understand the tradeoffs.
If you already have a set of features you want to support, it’s worth doing the exercise of attempting to re-design a representative from each feature style on the new platform to ensure the design meets your analytical requirements.
Hopefully we will see more direct discussion around the functional tradeoffs in the larger feature store community.