or, Let the Punching Begin
Feature engineering is a critical component in any machine learning system. As the basic input into a predictive model, the quality of features can make or break the overall performance of the model.
Feature engineering also takes a tremendous amount of work. If a new machine learning application requires one new feature, it probably means that the ML engineers discarded ten other features and tried ten variations of each candidate feature. Features have to be computed, versioned, backfilled, and shared. Finally, the methods for storing and accessing features differ fundamentally between offline and online contexts.
This means that as more and more companies scale up their machine learning applications, they are finding it extremely difficult to manage this combinatorial explosion of needs. It’s become a big enough problem that many of the biggest tech companies have invested significant resources into creating specialized systems to manage model features.
Uber was one of the first big companies to publish the concept of a feature store, as part of its Michelangelo platform. This was a set of services that 1) helped users create and manage shared features and 2) allowed unified references to both the online and offline versions of a feature, eliminating the need to duplicate code between offline training and online serving.
Airbnb built their own as part of their Zipline framework. In addition to setting up the basic stores and services, Zipline also focused on features to deal with timeseries alignment using “label offset” (pictured below), backfilling training datasets, combining batch and streaming data, etc.
Neither Uber’s nor Airbnb’s project is open source, which was part of what made Feast interesting. Feast is an open-source feature store jointly developed by Gojek and Google Cloud. Besides being open source, it also differs from Uber’s system by including the ingestion component. This means that features can be specified in a YAML configuration file:
description: number of times the word appears
Most recently, Logical Clocks added a feature store as part of their Hopsworks framework. It mostly focuses on the offline training portion, but it probably has the clearest and simplest presentation of its architecture.
So what’s the problem?
Let me preface the rest of this article with this disclaimer: each of these systems has been tested in production and works for its respective use cases. So until I’ve actually built and tested the ideas I’m about to discuss, you probably shouldn’t believe anything I say.
Having said that, I’ve found it difficult to think about these existing systems coherently. In general the focus has been “well, we needed to satisfy requirement X so we implemented Y in this way” (e.g., Zipline’s backfilling or Michelangelo’s online store). But I haven’t really seen a conceptual framework that truly helps us think about how all of these different pieces fit together. Without one, the architecture can look drastically different depending on which use cases you decide to add or remove. For anyone trying to select or design a feature store, this is not very encouraging.
I think we can do better.
Be like Joe
Wikipedia defines a feature as “an individual measurable property or characteristic of a phenomenon being observed”. I think we’ve been focusing too much on the outcome of observation and not enough on the process of observation. Focusing on the outcome is static and narrow in scope. Focusing on the process gives us a coherent framework that provides natural ways to think about and design solutions for feature engineering issues.
Joe Armstrong talked about Erlang programs being structured as processes that generate messages to communicate with each other. This was in stark contrast to the traditional OO paradigm and a primary factor that makes Erlang unique and successful. I’d like to propose an analogous statement about features:
Features are data-generating processes. Processes take input and generate output. And in different contexts, processes may have different concrete implementations.
In particular, let’s discuss several unique benefits of this mental model:
Versioning and provenance
Feature processes are represented by code artifacts that can be versioned. When a particular version of the artifact is executed to materialize the feature, the output data is point-in-time and can also be versioned. Feature processes must also have input in order to generate output. We can then follow the input feature references to trace the full provenance of a particular feature. As you can see, versioning, point-in-time correctness, and lineage are no longer just extra appendages you attach to a static dataset. Instead they’re an integral part of the concept of a feature itself.
Some features are expensive to compute (e.g., embeddings) and you’ll likely want to always materialize the output. Other features (e.g., log(engage)) are fast to compute from the raw data, so there might not be any need to materialize them at all. But that is a level of detail that should be completely hidden from the data scientist or developer doing feature engineering. By focusing on features as processes, we can enable the user to specify just the feature identifier, regardless of when the feature is materialized or persisted. So the feature store user can write code like:
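To make the versioning and provenance idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the `FeatureProcess` class, the `code_version` field (assumed to be something like a commit hash of the feature's code artifact), and the feature names are illustrations, not the API of any existing feature store.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class FeatureProcess:
    """Hypothetical model of a feature as a process rather than a dataset."""
    name: str
    code_version: str  # assumed: identifier of the code artifact, e.g. a commit hash
    inputs: Tuple["FeatureProcess", ...] = ()

    def version(self) -> str:
        # The version covers this process's code and every upstream version,
        # so a change anywhere upstream changes all downstream versions.
        parts = [self.name, self.code_version] + [i.version() for i in self.inputs]
        return hashlib.sha256("|".join(parts).encode()).hexdigest()[:12]

    def lineage(self) -> List[str]:
        # Follow input references to list every upstream feature, nearest first.
        seen: List[str] = []
        for parent in self.inputs:
            for name in [parent.name] + parent.lineage():
                if name not in seen:
                    seen.append(name)
        return seen


# Made-up feature graph: raw events -> view time -> log of view time.
raw_views = FeatureProcess("raw_views", "a1b2c3")
view_time = FeatureProcess("user_view_time", "d4e5f6", inputs=(raw_views,))
log_view_time = FeatureProcess("log_user_view_time", "0a9b8c", inputs=(view_time,))

print(log_view_time.lineage())  # ['user_view_time', 'raw_views']
```

In this sketch, lineage falls out of the input references and the version is derived rather than assigned, which is the sense in which both become part of the feature itself.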
df = feature_store.get("user_view_time").materialize(start, end)
And not have to worry about what’s being fetched from a database vs computed in memory. Moreover, we can also chain lazy feature processes together into a pipeline and enable a feature compiler to optimize the underlying computations on the fly.
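As a rough sketch of how one `materialize` call could hide the difference between persisted and on-the-fly features, consider the toy store below. The `Feature` and `FeatureStore` classes, the `persist` flag, and the in-memory cache are all assumptions made for illustration; a real store would persist to a database and would not use integer day ranges.

```python
import math
from typing import Callable, Dict, List, Tuple

Compute = Callable[[int, int], List[float]]


class Feature:
    """Hypothetical feature process; `persist` decides whether output is stored."""

    def __init__(self, name: str, compute: Compute, persist: bool = False):
        self.name = name
        self.compute = compute
        self.persist = persist
        self.compute_calls = 0  # only here to show when work actually happens
        self._cache: Dict[Tuple[int, int], List[float]] = {}

    def materialize(self, start: int, end: int) -> List[float]:
        key = (start, end)
        if self.persist and key in self._cache:
            return self._cache[key]  # served from storage
        self.compute_calls += 1
        values = self.compute(start, end)  # computed on demand
        if self.persist:
            self._cache[key] = values
        return values


class FeatureStore:
    def __init__(self) -> None:
        self._features: Dict[str, Feature] = {}

    def register(self, feature: Feature) -> None:
        self._features[feature.name] = feature

    def get(self, name: str) -> Feature:
        return self._features[name]


RAW_ENGAGE = {1: 10.0, 2: 100.0, 3: 1000.0}  # made-up data: day -> engagement

feature_store = FeatureStore()
# Treated as expensive here, so its output is persisted.
feature_store.register(Feature(
    "engage", lambda s, e: [RAW_ENGAGE[d] for d in range(s, e)], persist=True))
# Cheap to derive, so it is recomputed from its input on every request.
feature_store.register(Feature(
    "log_engage",
    lambda s, e: [math.log(v) for v in feature_store.get("engage").materialize(s, e)]))

first = feature_store.get("log_engage").materialize(1, 4)
second = feature_store.get("log_engage").materialize(1, 4)
```

The caller uses the same call for both features; after the two requests above, `engage` has been computed once and `log_engage` twice, but nothing in the calling code reveals that.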
Online vs offline
Processes don’t exist in isolation. Instead, every process must run inside a particular context. When that context changes, the specific implementation of the process within the context may also change. For example, the code for getting offline data vs online data for the same feature might look very different. The offline code might run a SQL query or call a Spark API. The online code might call a ScyllaDB API, a Redis API, or some custom real-time data source. In both environments, we’d be able to write the same code to get the feature values:
df = (feature_store.get("user_hist")
And the offline version would get executed as perhaps some Spark code snippet whereas the online version would add the user_id filter into the request body for an API call to a user-profile service that updates in real-time.
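One way to picture context-dependent implementations is a store that binds a feature name to one implementation per context. The sketch below is an assumption-heavy toy: `FeatureStore`, `build_store`, the two `user_hist` functions, and the in-memory data stand in for a warehouse job and a real-time profile service, and are not drawn from any of the systems above.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, object]

# Made-up data. The list stands in for a warehouse table, the dict for a
# key-value profile service that is updated in real time.
OFFLINE_TABLE: List[Row] = [
    {"user_id": 1, "title": "Alien"},
    {"user_id": 2, "title": "Heat"},
    {"user_id": 1, "title": "Fargo"},
]
ONLINE_PROFILES: Dict[int, List[str]] = {1: ["Alien", "Fargo"], 2: ["Heat"]}


def offline_user_hist(user_id: Optional[int] = None) -> List[Row]:
    # Stand-in for a SQL or Spark job: scan everything, filter if asked.
    return [r for r in OFFLINE_TABLE if user_id is None or r["user_id"] == user_id]


def online_user_hist(user_id: Optional[int] = None) -> List[Row]:
    # Stand-in for an API call: the filter becomes the lookup key.
    if user_id is None:
        raise ValueError("online lookups require a user_id")
    return [{"user_id": user_id, "title": t} for t in ONLINE_PROFILES.get(user_id, [])]


class FeatureStore:
    """Hypothetical store bound to one execution context."""

    def __init__(self, context: str):
        self.context = context
        self._impls: Dict[str, Dict[str, Callable[..., List[Row]]]] = {}

    def register(self, name: str, **impls: Callable[..., List[Row]]) -> None:
        self._impls[name] = impls

    def get(self, name: str) -> Callable[..., List[Row]]:
        # Same feature name, different concrete process per context.
        return self._impls[name][self.context]


def build_store(context: str) -> FeatureStore:
    store = FeatureStore(context)
    store.register("user_hist", offline=offline_user_hist, online=online_user_hist)
    return store


def user_history(store: FeatureStore, user_id: int) -> List[Row]:
    # The calling code is identical in both contexts.
    return store.get("user_hist")(user_id=user_id)


offline_rows = user_history(build_store("offline"), 1)
online_rows = user_history(build_store("online"), 1)
```

Only `build_store` knows which context is active; `user_history` is written once and returns the same rows either way for this toy data.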
No data scientist is an island
All of the systems I described before have collaboration as a central goal. While they solve for visibility and discovery of features, none of them solves the third critical problem in collaboration: trust. By focusing on the data-generating process instead of the static dataset, we make it possible to verify how a feature is generated, which is much better than our current reputational paradigm of trust. In addition, this makes it a lot easier to debug complex feature pipelines because we can verify each stage separately.
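A small sketch of what verifying each stage separately might look like: every stage publishes a digest of its output, and anyone can re-run a single stage on its recorded input and compare. The helper names (`digest`, `run_stages`, `verify_stage`) and the two toy stages are assumptions for illustration only.

```python
import hashlib
import json
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[List[float]], List[float]]]


def digest(values: List[float]) -> str:
    # Content hash of a stage's output, used as a published fingerprint.
    return hashlib.sha256(json.dumps(values).encode()).hexdigest()


def run_stages(source: List[float], stages: List[Stage]):
    """Run stages in order, recording (name, input, output digest) for each."""
    records = []
    data = source
    for name, fn in stages:
        output = fn(data)
        records.append((name, data, digest(output)))
        data = output
    return records


def verify_stage(fn, stage_input: List[float], published_digest: str) -> bool:
    # Re-run just this stage and compare against what was published.
    return digest(fn(stage_input)) == published_digest


def clip_negative(xs: List[float]) -> List[float]:
    return [max(x, 0) for x in xs]


def total(xs: List[float]) -> List[float]:
    return [sum(xs)]


stages: List[Stage] = [("clip_negative", clip_negative), ("total", total)]
records = run_stages([3, -1, 4], stages)

# Check the second stage in isolation, without re-running the first.
name, stage_input, published = records[1]
ok = verify_stage(total, stage_input, published)
```

If a stage's code had changed since its digest was published, `verify_stage` would return `False` for that stage alone, which narrows down where a pipeline went wrong.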
But also be like Mike
A good mental model lets us describe a very complex system or phenomenon with as few distinct concepts as possible. Different features, use cases, and functionality should all fit together into an integrated whole. This helps us design architectures that make sense and can stand up to the test of time. Of course, I fully expect my thinking to evolve a lot in the process of actually building our feature store. As Mike Tyson once said, “Everybody has a plan until they get punched in the mouth”.
Most importantly, this is a beginning rather than an end. We’re currently building out our next-generation machine learning platform at Tubi. If you’re passionate about creating tools and products to enable ML-driven companies, come join us and build this platform with me from vision to execution.
And let the punching begin.