Why your first ML project should be a Feature Store

Chris Kenwright
Feature Stores for ML
8 min read · Sep 15, 2021

There are plenty of tutorials about getting started with Machine Learning, and a recurring theme is the difficulty of getting Machine Learning models into production. These notes reflect some of the struggles in serving ML models, the slippery slope of complexity, and how to approach serving recommendations from a data-first, rather than model-first, perspective. It doesn’t feel like something that gets talked about a lot, so hopefully it is of some value.

The basic tutorial pattern is: build your example model with scikit-learn, tensorflow or insert-framework-of-your-choice, deploy it to a Flask server … and then get caught up in all the monitoring, deployment and updating that surrounds the production system. And then realise that half the “simple” features you want are just not that simple for anyone calling your service 😔

Rules of ML

Stepping back a few paces, Google offer an excellent guide to building ML systems, called the Rules of ML. There are 43 of them at the time of writing.

Here are the first three:

  • Rule #1: Don’t be afraid to launch a product without machine learning.
  • Rule #2: First, design and implement metrics.
  • Rule #3: Choose machine learning over a complex heuristic.

Basically, your first steps are going from a heuristic (some intuitive or simple rules) to basic ML, using some metric you can actually measure. And I would suggest that the single biggest step in all the Rules of ML lies here: moving from heuristic to first ML.

A simple example — building a movie recommender (because there aren’t enough movie recommenders. OK, there are 😀 but I want a simple, relatable example that dodges the really tricky issue of actually working out how and where to effectively deploy ML in your organisation). Our assumption is that getting the “play next” suggestion correct results in higher retention.

[Diagram: five movie recommendation processes, from suggesting a random movie, to a movie the user has not seen, the most popular movie the user has not seen, the most popular unseen movie in the user’s favourite genre, and finally an ML-based “most likely to watch next”.]
Left to right, we improve a simple model by improving the suggested next movie to watch.

To get from “random movie selection” to recommending “something we haven’t seen”, a software engineer only really needs a simple query on an operational database.

*By “Operational”, I mean the systems directly supporting your app or site, typically the systems that feed into your data infrastructure.

-- Any movie that hasn't been watched by a user
select movie
from movies
where movie_id not in (
    select movie_id from user_movies where user_id = <user>
)
limit 1

Heuristic 1 is pretty easy! … with scope to add an index or table optimisation to ensure a quick response.
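
As a minimal sketch of that optimisation, assuming the schema implied by the query above (a user_movies table keyed by user_id and movie_id), a composite index lets the “already watched” subquery resolve without scanning the whole table:

-- Hypothetical index to speed up the "movies this user has watched" subquery;
-- exact syntax and naming depend on your operational database
create index idx_user_movies_user_movie
on user_movies (user_id, movie_id)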

Once we start to include the most popular movies, we don’t really want to be running aggregate queries on the operational table, so we might kindly ask a software engineer to aggregate data and build out a new service:

-- Hourly, nightly or through a dedicated counting service 
-- rank movies by popularity
select movie_id, count(*) as user_views
from user_movies
group by 1

Now we have two operational stores, one for the underlying service and one for the aggregate data, probably with separate APIs.
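
With those two stores in place, the next heuristic, “the most popular movie the user has not seen”, is a join across them. A rough sketch, assuming the aggregate above is materialised into a table called movie_popularity (a name I am inventing for illustration):

-- Most popular movie the user hasn't watched yet
-- (movie_popularity holds the materialised output of the aggregate above)
select movie_id
from movie_popularity
where movie_id not in (
    select movie_id from user_movies where user_id = <user>
)
order by user_views desc
limit 1

In practice the two stores sit behind separate APIs, so this “join” may well end up living in application code rather than a single query, which is exactly where the complexity starts to creep in.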

And then we add a third data store, holding the most popular movies per genre, to cover the next heuristic:

-- we can now get the most popular movies within a genre
with top_movies as (
    select genre, movie_id, count(*) as view_count
    from user_movies join movies using (movie_id)
    group by 1, 2
)
select
    genre,
    movie_id,
    row_number() over (
        partition by genre
        order by view_count desc
    ) as popularity
from top_movies

By the time we get to predicting “the movie the user is most likely to watch next” and deploying ML models, the data scientist is requesting multiple aggregates in operational stores … six months later we have dozens of data stores, supporting multiple different front-end teams, each trying to solve their own use cases.

As more heuristics start to appear — we shouldn’t show horror flicks at lunchtime to minors, which feels pretty reasonable — we start to get multiple data-serving APIs appearing. If we have separate teams focused on Mobile (perhaps because customers consume different content on Mobile than on their 60-inch widescreen?), then complexity increases again.

Keeping simple heuristics in the operational tier can get us a long way, but we see an expanding number of silos and pipelines needing support. Every team is working independently to service their own needs and data is at risk of becoming inconsistent and unfindable.

There is also increasing complexity to address in delivering an ML system: not just the complexity of the algorithm, but the number of teams interacting and the number of systems involved.

Even keeping recommendations to simple heuristics, perhaps some centralisation would actually be helpful…

Enter the Feature Store

Feature Stores felt confusing to me because they always seemed to be pitched at organisations that are really mature in their ML efforts and deployments. They are always sold as “expert level” data science 🥷. This is why I am writing now: they feel so much more fundamental to the process of serving.

Feature Stores are data stores that aim to unify offline batch data and provide online access, so you get the same features for offline training as you get for online serving. By features, we really mean measures like customer aggregate data, or anything calculated offline and pumped into a datastore to be made available online. (In some cases, this also adds the ability to ingest and aggregate real time data).
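
A minimal sketch of what that unification buys you, sticking with the movie example and a hypothetical user_features table (all names here are illustrative): the offline training query and the online serving lookup read the same rows, so the model sees identical feature values in both contexts.

-- Offline: build a training set by joining features to historical labels
select f.user_id, f.favourite_genre, f.movies_watched_30d, l.watched_next
from user_features f
join training_labels l using (user_id);

-- Online: serving reads the same features as a single-key lookup
select favourite_genre, movies_watched_30d
from user_features
where user_id = <user>;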

The actual mechanism for deployment depends on your team and requirements (right now we sync some BigQuery tables into automatically generated MySQL stores … imperfect, but very workable at the required size).
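
For what it’s worth, here is a very rough sketch of that kind of sync, with entirely hypothetical table, bucket and file names: export the feature table from BigQuery to Cloud Storage, then reload it into the MySQL store that backs the online lookups.

-- BigQuery side: export the feature table to Cloud Storage
export data options (
    uri = 'gs://example-feature-bucket/user_features/*.csv',
    format = 'CSV',
    overwrite = true
) as
select user_id, favourite_genre, movies_watched_30d
from analytics.user_features;

-- MySQL side: reload the serving copy from the exported file
-- (after copying it down from Cloud Storage)
load data local infile '/tmp/user_features.csv'
into table user_features
fields terminated by ','
lines terminated by '\n';

The real version will inevitably be wrapped in scheduling and error handling, but the shape is the same: batch compute in the warehouse, then push a copy somewhere cheap and fast to query by key.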

There are some great links at the bottom for more detail. The practical article comparing BigQuery + MemoryStore with Feast is a great read.

Why a Feature Store First?

The key reasons to get your feature store off the ground early, in no particular order:

Ownership by the Data Team

Bringing data pipelines under the ownership of your data team leads to a much more reliable system. Data engineers and data scientists have more experience building reliable pipelines, and moving that work also has the performance benefit of removing aggregate workloads from operational stores.

This single point of access centralises the data stores, introduces more accountability and makes quality more manageable, rather than dispersed across multiple software engineering teams.

Directly involving the data team in the delivery of production data comes with four potential benefits:

  1. If your data team mostly deal with supporting analytics, this is a chance for them to directly impact customer experiences. From a team motivation perspective, there is much more engagement.
  2. The primary consumers, data scientists, get a direct hand in ensuring they are building a complete product. No more leaving software engineers to figure it out: as a data scientist, you need to make sure all the data you need is available in a serving context.
  3. Data scientists gain the flexibility to extend features based on keys and referenced data, rather than only consuming features directly. Our first ML attempts ended up deliberately limited to features we knew were available inside the client, without calling any remote data sources. A simple lookup service would have opened up so many more opportunities to make decisions.
  4. The data team can collect varied data much more easily than operational teams, which tend to work in narrower contexts, so a much richer set of data can be made available to support models.

Reusable Components

There is an XKCD for everything…

All too often, small changes in requirements, or just plain new ideas, lead to the need to completely re-architect some part of the ML pipeline. Getting re-use in the infrastructure where you can is a big win for rapid testing and experimentation.

xkcd #1425 (https://xkcd.com/1425/): in Data Science, a simple feature request can radically change the infrastructure.

ML Solutions often don’t have lots of reusable code — new features, a new library, my first simple heuristic doesn’t look very similar to my second simple heuristic — so re-use of work where possible is really at a premium and should be maximised.

Feature stores can be used across the full ML complexity arc, from simple heuristic to complex model. Your complex model will likely need features, so why not test them as a simple heuristic first? Simple lookups can make the customer experience better (looking at you, anonymous retail chain who can simply never give me a list of my nearest stores…) and more model complexity may not be cost effective.
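
As a concrete example of testing a feature as a heuristic first, the “most popular unseen movie in the user’s favourite genre” rule from earlier only needs two feature lookups. A sketch against hypothetical feature-store tables (user_favourite_genre, plus the genre_popularity ranking computed above):

-- Serve the favourite-genre heuristic straight from precomputed feature tables
select p.movie_id
from genre_popularity p
join user_favourite_genre g
    on p.genre = g.favourite_genre
where g.user_id = <user>
  and p.movie_id not in (
      select movie_id from user_movies where user_id = <user>
  )
order by p.popularity
limit 1

If that heuristic moves your metric, the same two tables are already in place as candidate features for the eventual model.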

Multiple Model Implementations are all supported by a feature store, from simple rules and heuristics to deep learning models, allowing you to progress as you learn what works for your business. Offline models, like customer classification or nightly recommendations, can be fed into a feature store for serving (and online model use).
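
A sketch of that pattern for the nightly recommendation case, with illustrative table and column names: a batch job scores users offline and materialises the top pick per user into a serving table, so the online path is just a key lookup.

-- Nightly batch job: materialise the model's top recommendation per user
-- (model_scores is the assumed output of the offline model run;
--  create-table syntax varies by warehouse)
create or replace table user_next_movie as
select user_id, movie_id as recommended_movie, score
from (
    select user_id, movie_id, score,
           row_number() over (partition by user_id order by score desc) as rnk
    from model_scores
) ranked
where rnk = 1;

-- Online serving is then a single-row lookup
select recommended_movie
from user_next_movie
where user_id = <user>;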

Path to more ML

While I have been working through an online use-case, there is no reason not to re-use the feature store for batch ML processes, too.

If we can expose an RFM score (recency, frequency, monetary value) for a customer to the customer support team, then why not also re-use that information for email segmentation AND the pay-per-click team?
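
As a hedged sketch, assuming an orders table with user_id, order_date and order_value columns, the classic RFM quintiles are one aggregate plus a few window functions, and the result is an obvious candidate for a feature-store table:

-- Score each customer 1-5 on recency, frequency and monetary value
-- (higher is better on all three dimensions)
with customer_totals as (
    select user_id,
           max(order_date)  as last_order_date,
           count(*)         as order_count,
           sum(order_value) as total_spend
    from orders
    group by 1
)
select user_id,
       ntile(5) over (order by last_order_date) as recency_score,
       ntile(5) over (order by order_count)     as frequency_score,
       ntile(5) over (order by total_spend)     as monetary_score
from customer_totals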

A batch-first approach shouldn’t be an issue (Spotify calculated all their recommendations overnight), so you can get comfortable with building and deploying ML, and make the results available via a simple lookup service, without having to break a sweat over realtime systems.

(And there is the obligatory “benefit 5”: every other benefit that the expert proponents of feature stores espouse; see any other feature store article.)

Quick caveat — context

I’ll put a caveat here: I feel lots of “data science” posts come without context … the problems and opinions here have been formed from implementing Rules 1–3 in a customer-facing, e-commerce context. My objectives tend to be transaction value, conversion, click-through and retention.

I have been trying to leverage ML and find that access to simple data, for simple decisions, in a live context … is actually really hard. Especially once it grows beyond the first use-case.

But if you aren’t working in an e-commerce context and are instead building really complex expert systems, like automatic translation or image processing … you might have a different problem entirely! Caveat lector.

Conclusion

Overall, I have found that making features (aggregates) available online, and managing them, is really one of the foundation layers for deploying machine learning and impacting user experience. We started out thinking model-first, but got stuck on what data was actually available.

If there is a hierarchy of building machine learning, an if-else statement over static data is the foundation to build on. Easily deploying heuristics should be table stakes.

Nothing here is to say that you can’t keep heuristics self-contained in operational systems — that is very much a judgement call. But, if you find yourself repeating the pattern, reusing metrics, or getting blocked by an operational team with their own backlog to work through, then a feature store might be a good step to take.

And if you are deeply involved in improving customer experience, then rather than starting with the deployment and monitoring of complex ML models, focusing on a centralised datastore that supports your benchmark heuristics could be massively helpful, delivering multiple simple wins.

References

The following are great sources for learning more about feature stores

https://www.featurestore.org/

AWS video on SageMaker Feature Store: https://www.youtube.com/watch?v=pEg5c6d4etI

The End
