Feature Store for Data Science: Upward Journey Continues

Vikulp Sharma
Published in Geek Culture
6 min read · Oct 21, 2021
Photo by Luke Chesser on Unsplash

Why a feature store:

To understand feature stores, we first need to understand what features are. The data we feed to machine learning models are known as features. Raw data generated in the real world does not always carry predictive power on its own. You need aggregations, transformations, and joins to create new features based on your business, domain, and statistical knowledge. I strongly believe this is where data scientists differ from each other: their experience, expertise, and creativity play a major role here. I have written another article on this topic in detail.

To illustrate with a simple example, suppose you work for a cycle manufacturing company and are tasked with forecasting future cycle sales. You have extensive domain experience in your business, and you know that cycle sales increase whenever there is good weather or a health campaign in the locality. You must include these features in your model. You have put a lot of effort into integrating external paid APIs and calculating those features at the zip code level, and finally you have a great working ML model.

Now other data scientists in your company need to build an ML model that predicts store footfall on a given day. They also realize that weather and health campaigns play a major role in increasing footfall. They would have to spend similar compute and storage costs to get those features into their model. But you have just created those features, and you both work for the same company. Can they reuse them? How would data scientist “A” know about features created by data scientists “B to Z” before creating his or her own?

So far I have assumed that you have great domain or functional knowledge of your business and can create new features out of the blue. What if you have no domain knowledge, and you went through many iterations and used extensive statistical techniques to arrive at the most prominent features for those ML models? Wouldn’t it be a good idea for your colleagues to build on top of your expertise? After all, there is a reason for the saying “standing on the shoulders of giants”.

There is more to the need for feature stores, but I think the above example captures the main motivation. To answer the questions above, we introduce a feature bank, a feature store, or whatever other fancy name you prefer. This is where it fits in.

What is a feature store:

I do not want to introduce yet another definition of a feature store here. It is a repository of features that allows data scientists to Compute, Store, Update, Log, Monitor, Discover, and Serve machine learning features. A feature store is a place where data scientists can collaborate, discover each other’s work, and reduce development time and cost for their organization. For more details, I would recommend reading these articles (1,2).

Important prerequisites for a feature store:

Now you might be wondering what major aspects a GREAT feature store should have.

The image below describes the various capabilities we should have in a feature store.

Data Scientist interacting with Feature Store, Image from Author, Inspired by Feast

The above architecture summarizes the important aspects of a feature store. Let us discuss the prerequisites for a feature store in detail:

Compute, Store, Update, Log, Monitor, Discovery, and Serving

The capabilities in this section (Compute, Store, Update, Log, Monitor, Discovery, and Serving) are the bare minimum a feature store should offer. If a feature store is missing any of them, it becomes very difficult to justify it over existing data warehouses, data platforms, and so on.

The first important aspect of a feature store is that it allows computing new features from the data pipelines created by data scientists. It should then not only store those features along with metadata and an attribute key, but also allow updating and versioning them.
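
As a rough illustration of the compute, store, and version steps, here is a minimal in-memory sketch in plain Python. It is not the API of Feast or any other product: `VersionedFeature`, `avg_temperature_by_zip`, and the sample rows are hypothetical names and data chosen to match the cycle sales example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Optional


@dataclass
class VersionedFeature:
    """Hypothetical container: one named feature plus every computed version."""
    name: str
    entity_key: str  # the attribute key the values are indexed by, e.g. "zip_code"
    versions: List[Dict[str, float]] = field(default_factory=list)

    def compute(self, rows: Iterable[dict],
                transform: Callable[[Iterable[dict]], Dict[str, float]]) -> int:
        """Run the data scientist's transform and store its output as a new version."""
        self.versions.append(transform(rows))
        return len(self.versions)  # version numbers start at 1

    def get(self, version: Optional[int] = None) -> Dict[str, float]:
        """Return the requested version, or the latest one when none is given."""
        if version is None:
            version = len(self.versions)
        if not 1 <= version <= len(self.versions):
            raise LookupError(f"{self.name}: no version {version}")
        return self.versions[version - 1]


def avg_temperature_by_zip(rows: Iterable[dict]) -> Dict[str, float]:
    """Illustrative transform: average temperature per zip code."""
    totals: Dict[str, tuple] = {}
    for row in rows:
        total, count = totals.get(row["zip_code"], (0.0, 0))
        totals[row["zip_code"]] = (total + row["temp_c"], count + 1)
    return {zip_code: total / count for zip_code, (total, count) in totals.items()}


# Made-up sample data, for illustration only.
rows_v1 = [
    {"zip_code": "10001", "temp_c": 20.0},
    {"zip_code": "10001", "temp_c": 24.0},
    {"zip_code": "94105", "temp_c": 18.0},
]
rows_v2 = rows_v1 + [{"zip_code": "94105", "temp_c": 22.0}]

feature = VersionedFeature(name="avg_temp_c", entity_key="zip_code")
first_version = feature.compute(rows_v1, avg_temperature_by_zip)
second_version = feature.compute(rows_v2, avg_temperature_by_zip)

print(feature.get(first_version))  # {'10001': 22.0, '94105': 18.0}
print(feature.get())               # {'10001': 22.0, '94105': 20.0}
```

Keeping every version, instead of overwriting, means a model trained on version 1 can still be reproduced after the feature is updated.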

Another important aspect is how other teams in the same organization find those features. There should be a way to search features by their metadata, so that other teams can easily discover and reuse them for their own business problems. To make this possible, data science teams register features with appropriate metadata (feature description, primary key, details, etc.), and other data scientists in the organization later use this metadata for feature discovery.
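
To make the register-then-discover flow concrete, here is a small sketch of a metadata registry with keyword search. It only illustrates the idea. `FeatureMetadata`, `FeatureRegistry`, the owner names, and the tags are all hypothetical and do not come from any particular feature store.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class FeatureMetadata:
    """Hypothetical metadata record attached to a registered feature."""
    name: str
    description: str
    primary_key: str
    owner: str
    tags: Tuple[str, ...] = ()


class FeatureRegistry:
    """Hypothetical registry: teams register metadata, others search it."""

    def __init__(self) -> None:
        self._entries: Dict[str, FeatureMetadata] = {}

    def register(self, metadata: FeatureMetadata) -> None:
        if metadata.name in self._entries:
            raise ValueError(f"feature {metadata.name!r} is already registered")
        self._entries[metadata.name] = metadata

    def search(self, keyword: str) -> List[str]:
        """Return names of features whose name, description, or tags contain the keyword."""
        needle = keyword.lower()
        return sorted(
            m.name
            for m in self._entries.values()
            if needle in m.name.lower()
            or needle in m.description.lower()
            or any(needle in tag.lower() for tag in m.tags)
        )


registry = FeatureRegistry()
registry.register(FeatureMetadata(
    name="avg_temp_c",
    description="Average daily temperature per zip code",
    primary_key="zip_code",
    owner="sales-forecasting",
    tags=("weather",),
))
registry.register(FeatureMetadata(
    name="health_campaign_active",
    description="Whether a health campaign is running in the locality",
    primary_key="zip_code",
    owner="sales-forecasting",
    tags=("campaign",),
))

# The footfall team looks for existing features before building its own.
print(registry.search("weather"))  # ['avg_temp_c']
print(registry.search("health"))   # ['health_campaign_active']
print(registry.search("fraud"))    # []
```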

Online, Offline Features

Another important aspect of a feature store is storing and maintaining both online and offline features. Offline features are used to train ML models or to serve batch models, while online features are used for real-time predictions. Online features have much stricter latency requirements than offline features, so the feature store should support databases suited to each set of latency and storage requirements. For example, the Databricks feature store launched this year (2021) can support both online and offline features on Azure, with storage choices such as Delta Lake, Azure MySQL, etc.
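
The split between the two stores can be sketched as follows, under simplifying assumptions: both stores are plain dictionaries, the offline side keeps the full timestamped history for training and batch use, and the online side keeps only the latest value per key so that a real-time lookup is a single dictionary read. `OnlineOfflineStore` and its methods are hypothetical, and the "as of" lookup is my own illustration of why training reads history rather than the latest value.

```python
from datetime import datetime
from typing import Dict, List, Optional, Tuple


class OnlineOfflineStore:
    """Hypothetical dual store: full history offline, latest value online."""

    def __init__(self) -> None:
        self._offline: Dict[str, List[Tuple[datetime, float]]] = {}
        self._online: Dict[str, Tuple[datetime, float]] = {}

    def write(self, key: str, timestamp: datetime, value: float) -> None:
        # Offline: append every observation so training can replay the past.
        self._offline.setdefault(key, []).append((timestamp, value))
        # Online: keep only the most recent observation for fast serving.
        latest = self._online.get(key)
        if latest is None or timestamp >= latest[0]:
            self._online[key] = (timestamp, value)

    def get_online(self, key: str) -> Optional[float]:
        """Real-time path: one dictionary lookup, no scan of the history."""
        entry = self._online.get(key)
        return None if entry is None else entry[1]

    def get_as_of(self, key: str, as_of: datetime) -> Optional[float]:
        """Training path: the latest value known at or before `as_of`."""
        history = [(ts, v) for ts, v in self._offline.get(key, []) if ts <= as_of]
        if not history:
            return None
        return max(history, key=lambda item: item[0])[1]


# Made-up sample data, for illustration only.
store = OnlineOfflineStore()
store.write("10001", datetime(2021, 10, 1), 18.0)
store.write("10001", datetime(2021, 10, 2), 21.0)

print(store.get_online("10001"))                          # 21.0
print(store.get_as_of("10001", datetime(2021, 10, 1, 12)))  # 18.0
print(store.get_as_of("10001", datetime(2021, 9, 30)))      # None
```

In a real deployment the two sides would typically live in different systems, which is why the choice of backing database matters.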

We should not confuse feature stores with data warehouses; there are differences between the two. I would recommend reading this article for more details on this topic. To quickly summarize, a feature store requires many additional components (as highlighted above) on top of a typical data warehouse. We can, however, build our own feature store using cloud-native building blocks, e.g. DVC, compute, storage, IAM, ML services, and so on.

Open Sourced, Managed

When it comes to completely open-source feature stores, options are very few. Feast is one of them, and it is a great open-source feature store that is natively available for both AWS and GCP. It was jointly developed by Google and Gojek (read here). More recently, we are seeing implementations on Azure and on-prem environments; for more details, I would suggest going through this April 2021 article from the Feast team, in which they call out that they are now shipping support for Azure and on-premise deployments as well.

A few startups are now developing commercially available feature stores, and Tecton is one of them. As the image in the next section shows, Uber’s Michelangelo platform was the first platform to have a feature store. Tecton was founded by members of the Michelangelo team, and it is also a core contributor to the Feast open-source library.

Various cloud providers also offer feature stores as managed services. These can be easy to use and to integrate with the same cloud’s compute, storage, and networking resources, which could make them a natural choice for many enterprises, depending on their needs. For example, Google’s Vertex AI has its own feature store, and AWS added a feature store in 2020 as part of its SageMaker platform.

Access control

Another important aspect of a feature store is the security of the data. It should provide an identity and access control system to manage authenticated access to features. This is where cloud-native feature stores score big, as you can easily enable this within the existing cloud IAM infrastructure.
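
As a toy illustration of the idea, the sketch below checks a caller's role against a grant table before returning a feature. It is an assumption-laden stand-in for what a cloud IAM system would do, not a model of any provider's IAM. `GuardedFeatureStore`, the role names, and the grant table are all hypothetical.

```python
from typing import Dict, Set


class GuardedFeatureStore:
    """Hypothetical store that checks a role's grants before every read."""

    def __init__(self, features: Dict[str, Dict[str, float]],
                 grants: Dict[str, Set[str]]) -> None:
        self._features = features  # feature name -> values keyed by entity
        self._grants = grants      # role -> feature names that role may read

    def read(self, role: str, feature_name: str) -> Dict[str, float]:
        # Deny by default: a role with no grant entry can read nothing.
        if feature_name not in self._grants.get(role, set()):
            raise PermissionError(f"role {role!r} may not read {feature_name!r}")
        return self._features[feature_name]


guarded = GuardedFeatureStore(
    features={"avg_temp_c": {"10001": 22.0}},
    grants={"sales-forecasting": {"avg_temp_c"}},
)

allowed = guarded.read("sales-forecasting", "avg_temp_c")

try:
    guarded.read("marketing", "avg_temp_c")
    denied = False
except PermissionError:
    denied = True

print(allowed)  # {'10001': 22.0}
print(denied)   # True
```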

Feature Store Journey:

To see how feature stores have evolved and how big technology companies are driving that evolution, I would recommend referring to the image below, taken from this article.

You can draw several important insights from the above journey map, such as how feature store adoption is increasing post-2020. That is the reason we are now seeing cloud-native feature stores becoming available as managed services.

Enthusiasm around feature stores has reached the point where the world’s first feature store summit took place in Oct 2021. See the details below for the agenda, speakers, and topics.

If you are curious to read more about the architecture, implementation, and usage of the feature stores built by these tech companies (Netflix, Uber, Airbnb, Gojek, DoorDash, Lyft, LinkedIn), I would recommend going through the GitHub link below, which collects links for all of them. Nothing beats hearing directly from their technology teams.

Conclusion:

In this article I have penned down my thoughts on feature stores. Please feel free to add your own thoughts in the comments section.

Vikulp Sharma
Geek Culture

Lifelong learner. Love Philosophy, Maths, ML, AI, Cloud, Digital, Data, Astrophysics. Opinions expressed are my own & don’t express the views of my employer.