Data Lineage Doesn’t Have To Be Hard

Feature Stores Have End-to-End Data Lineage Baked In

Jack Ploshnick
Mar 6 · 4 min read
Image via Alex/Adobe Stock under license to Zer0 to 5ive

Your boss comes up to you in the middle of the day and asks, “Your fraud detection algorithm has consistently been flagging transactions from one particular zip code as fraud. What’s going on?”

That should be an easy enough question to answer, right? You can use model interpretability tools, such as the popular open-source packages LIME or SHAP, to peer into your model and better understand how it works.

But that’s only part of the story. Even if you understand how your model works, you will quickly want to uncover what data was used to train it, as well as the individual features served to the deployed model. Even if you are lucky enough to have robust data pipelines, tracking down old training sets, re-engineering feature pipelines, and searching through API logs to identify past predictions will be difficult, if not impossible.

The Increasing Need for Data Lineage

Data-driven organizations now have the desire, and often the regulatory requirement, to keep better track of their data and explain how it is used inside their company. A Feature Store is built to solve exactly this problem of tracking data lineage and ensuring data governance. Tech giants like Uber and Airbnb introduced the idea of a Feature Store, and companies of all sizes are now putting them into practice.

How Feature Stores Simplify Data Lineage

Image by Author

A Feature Store replaces that mess of pipelines with a single, central feature repository. All data used for analytics, whether it comes from batch inserts out of your data warehouse or from real-time pipelines updated multiple times a second, feeds into the Feature Store. Here, features are engineered, versioned (so you can identify past values), and shared across multiple deployed models. If you want to know what data was used to train a model years ago, the Feature Store can rebuild the exact training set used, even if that model was trained on data from many different sources.
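To make the idea of rebuilding a past training set concrete, here is a minimal sketch of a point-in-time lookup over versioned features. It assumes a hypothetical `feature_history` table in which every feature value carries a `valid_from` timestamp; the table, column, and function names are illustrative only and are not taken from any particular Feature Store product.

```python
import sqlite3

# Hypothetical schema: one row per feature value, stamped with when it became valid.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE feature_history ("
    "entity_id TEXT, feature_name TEXT, value REAL, valid_from TEXT)"
)
conn.executemany(
    "INSERT INTO feature_history VALUES (?, ?, ?, ?)",
    [
        ("cust_1", "txn_count_7d", 3.0, "2021-01-01"),
        ("cust_1", "txn_count_7d", 9.0, "2021-02-01"),
        ("cust_1", "avg_amount_30d", 42.5, "2021-01-15"),
    ],
)


def build_training_set(conn, label_rows):
    """For each (entity_id, label_ts, label), fetch feature values as of label_ts."""
    query = """
        SELECT fh.feature_name, fh.value
        FROM feature_history AS fh
        WHERE fh.entity_id = ?
          AND fh.valid_from = (
              SELECT MAX(valid_from)
              FROM feature_history
              WHERE entity_id = fh.entity_id
                AND feature_name = fh.feature_name
                AND valid_from <= ?
          )
    """
    rows = []
    for entity_id, label_ts, label in label_rows:
        # Each result row is a (feature_name, value) pair, so dict() works directly.
        features = dict(conn.execute(query, (entity_id, label_ts)))
        rows.append({"entity_id": entity_id, "as_of": label_ts, **features, "label": label})
    return rows


# Two labels for the same customer, observed at different times.
training_set = build_training_set(
    conn, [("cust_1", "2021-01-20", 0), ("cust_1", "2021-02-10", 1)]
)
for row in training_set:
    print(row)
```

Because each lookup only sees values whose `valid_from` is on or before the label timestamp, the January row gets the older `txn_count_7d` value and the February row gets the newer one. ISO-formatted date strings are used here so that plain string comparison orders them correctly.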

Image by Author

While Feature Stores are powerful on their own, they are particularly powerful when paired with scalable, searchable, and persistent storage of machine learning predictions. An increasingly popular way to do this is to store the input features coming from the Feature Store and the output predictions of the machine learning model in a single database. This is often called database deployment. Because the training features, serving features, and model predictions are all linked, you can monitor feature drift from training to deployment more easily than before.
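As a rough illustration of what monitoring drift can look like once training and serving features sit side by side, the sketch below flags features whose serving mean has moved away from the training mean. The standardized mean shift used here is just one simple check chosen for brevity, and the feature names, sample values, and threshold are all made up for the example.

```python
from statistics import mean, stdev


def mean_shift(training_values, serving_values):
    """Shift of the serving mean from the training mean, in training standard deviations."""
    spread = stdev(training_values)
    if spread == 0:
        raise ValueError("training values have no variance")
    return (mean(serving_values) - mean(training_values)) / spread


def drifted_features(training, serving, threshold=2.0):
    """Return {feature_name: shift} for features whose absolute shift exceeds the threshold."""
    shifts = {name: mean_shift(training[name], serving[name]) for name in training}
    return {name: shift for name, shift in shifts.items() if abs(shift) > threshold}


# Hypothetical feature values pulled from the training set and from serving logs.
training = {
    "txn_amount": [20.0, 22.0, 19.0, 21.0, 18.0],
    "txn_count_7d": [3.0, 4.0, 5.0, 4.0, 4.0],
}
serving = {
    "txn_amount": [30.0, 31.0, 29.0, 32.0, 28.0],
    "txn_count_7d": [4.0, 4.0, 5.0, 3.0, 4.0],
}

drifted = drifted_features(training, serving)
print(drifted)
```

In this toy data only `txn_amount` is flagged, since its serving mean sits several training standard deviations above the training mean while `txn_count_7d` has not moved. A production check would more likely compare full distributions rather than means alone.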

Image by Author

Moreover, if you want to get a more granular view and see the individual features used to generate a particular prediction, that information is just a query away. You can then use the Feature Store to see where these features came from.
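The sketch below shows what "just a query away" might look like when predictions, the features served for them, and feature lineage live in one database. The three tables (`predictions`, `prediction_features`, `feature_lineage`) and their contents are hypothetical and exist only to illustrate the join; a real Feature Store will have its own schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE predictions (
        prediction_id INTEGER PRIMARY KEY, model_version TEXT, score REAL);
    CREATE TABLE prediction_features (
        prediction_id INTEGER, feature_name TEXT, value REAL);
    CREATE TABLE feature_lineage (
        feature_name TEXT PRIMARY KEY, source_table TEXT, transform TEXT);

    INSERT INTO predictions VALUES (1, 'fraud_v3', 0.97);
    INSERT INTO prediction_features VALUES
        (1, 'txn_count_7d', 9.0),
        (1, 'avg_amount_30d', 42.5);
    INSERT INTO feature_lineage VALUES
        ('txn_count_7d', 'stream.transactions', 'COUNT(*) over 7 days'),
        ('avg_amount_30d', 'warehouse.transactions', 'AVG(amount) over 30 days');
    """
)


def explain_prediction(conn, prediction_id):
    """Return each feature served for a prediction alongside where it came from."""
    return conn.execute(
        """
        SELECT pf.feature_name, pf.value, fl.source_table, fl.transform
        FROM prediction_features AS pf
        LEFT JOIN feature_lineage AS fl ON fl.feature_name = pf.feature_name
        WHERE pf.prediction_id = ?
        ORDER BY pf.feature_name
        """,
        (prediction_id,),
    ).fetchall()


rows = explain_prediction(conn, 1)
for feature_name, value, source_table, transform in rows:
    print(f"{feature_name}={value} <- {source_table} ({transform})")
```

The `LEFT JOIN` is a deliberate choice: a feature with no recorded lineage still shows up in the result, with empty source columns, which makes gaps in lineage visible rather than silently dropping them.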

Image by Author

In my conversations with data scientists, I often hear that end-to-end lineage and transparency are things they’d like to have, but they don’t have the time or resources to build out the necessary infrastructure. With a Feature Store, coupled with database deployment, no extra engineering effort is needed for complete data lineage; it’s all baked in.

The next time your boss comes to you with a question about your model, you can use a Feature Store to find the exact training set used to train your model, the code used to generate the features in that training set, and the features served to the model, in seconds.

RESOURCES: To learn more about Feature Stores, check out Splice Machine’s quick demonstration of our Feature Store, or our technical requirements for a Feature Store.

DISCLAIMER: I am a Data Scientist at Splice Machine, a single engine Feature Store for Machine Learning.

