Data Lineage Doesn’t Have To Be Hard
Your boss comes up to you in the middle of the day and asks, “Your fraud detection algorithm has consistently been flagging transactions from one particular zip code as fraud. What’s going on?”
That should be an easy enough question to answer, right? You can use model interpretability tools, such as the popular open-source packages LIME and SHAP, to peer into your model and better understand how it makes its decisions.
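LIME and SHAP each ship as their own package; as a dependency-light sketch of the same idea, here is permutation importance from scikit-learn on a toy fraud-style dataset. The feature names and data below are invented for illustration — the point is simply that interpretability tooling can tell you which input the model leans on.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
# Toy "transactions": amount, hour of day, and an encoded zip code.
X = rng.normal(size=(1000, 3))
# Fraud label driven mostly by the zip-code feature (column 2).
y = (X[:, 2] + 0.1 * rng.normal(size=1000) > 1).astype(int)

model = RandomForestClassifier(random_state=0).fit(X, y)

# Permutation importance: how much does shuffling each feature hurt accuracy?
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
for name, imp in zip(["amount", "hour", "zip_code"], result.importances_mean):
    print(f"{name}: {imp:.3f}")
```

Running this, the zip-code column dominates the importance scores — the tooling answers "which feature?" but, as the next paragraph argues, not "which data?".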
But that’s only part of the story. Even if you understand how your model works, you will quickly want to uncover what data was used to train it, as well as the individual features served to your deployed model. Even if you are lucky enough to have robust data pipelines, tracking down old training sets, re-engineering feature pipelines, and searching through API logs to identify past predictions will be difficult, if not impossible.
The Increasing Need for Data Lineage
And so enters my personal favorite buzzword (or perhaps buzzwords) of 2021: data lineage. What exactly is data lineage? In the machine learning context, data lineage is a complete history of your data, from raw ingest to features used for model training and served to deployed models.
Data-driven organizations now have the desire, and oftentimes the regulatory requirement, to keep better track of their data and explain how it is used inside of their company. One increasingly popular solution for tracking data lineage and ensuring data governance is a Feature Store. Tech giants like Uber and Airbnb introduced us to the idea of a Feature Store, and companies of all sizes are putting them into practice.
How Feature Stores Simplify Data Lineage
How exactly does a Feature Store help you explain your models? It all starts with feature engineering: turning raw data into the inputs of your model. Without a Feature Store, you have to build a separate feature engineering pipeline for each model you want to deploy. Duplicate pipelines don’t just lead to unnecessary compute costs and engineering effort; they lead to a data lineage nightmare. One pipeline, weekly_trans_agg, might aggregate weekly transactions starting on Sunday, while another, weekly_trans_agg_m, aggregates weekly transactions starting on Monday. Figuring out which pipeline, potentially out of hundreds or thousands, was used to train which model months after the model was trained… I wouldn’t want to be the data scientist asked to do that.
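The Sunday-versus-Monday trap is easy to reproduce. This hypothetical pandas sketch (the pipeline names and data are invented) aggregates the same fourteen transactions with two different week anchors and reports different numbers under what looks like the same metric:

```python
import pandas as pd

# Toy transaction log: one transaction per day for two weeks (amounts 1..14).
tx = pd.DataFrame({
    "ts": pd.date_range("2021-05-02", periods=14, freq="D"),  # starts on a Sunday
    "amount": range(1, 15),
})

# Pipeline 1: weeks ending Saturday, i.e. starting Sunday.
weekly_sun = tx.groupby(pd.Grouper(key="ts", freq="W-SAT"))["amount"].sum()
# Pipeline 2: weeks ending Sunday, i.e. starting Monday.
weekly_mon = tx.groupby(pd.Grouper(key="ts", freq="W-SUN"))["amount"].sum()

# Same raw data, same "weekly transactions" metric, different numbers.
print(weekly_sun.tolist())  # [28, 77]
print(weekly_mon.tolist())  # [1, 35, 69]
```

The totals agree, but the week-by-week values a model trains on do not — exactly the kind of silent divergence that makes retroactive lineage painful.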
A Feature Store eliminates that mess of pipelines; it is a single, central feature repository. All data used for analytics, whether batch inserts coming from your data warehouse or real-time pipelines updated multiple times a second, feeds into the Feature Store. Here, features are engineered, versioned (so you can identify past values), and shared across multiple deployed models. If you want to know what data was used to train a model years ago, the Feature Store can rebuild the exact training set used, even if that model was trained on data from many different sources.
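The "rebuild the exact training set" trick rests on point-in-time correctness: each training example must see the latest feature value written at or before that example’s timestamp, never a future one. A minimal pandas sketch of the idea, with hypothetical tables and column names standing in for a real Feature Store:

```python
import pandas as pd

# Hypothetical versioned feature table: every value ever written, timestamped.
feature_log = pd.DataFrame({
    "customer_id": [1, 1, 2, 2],
    "updated_at": pd.to_datetime(
        ["2021-01-01", "2021-03-01", "2021-01-01", "2021-04-01"]),
    "avg_weekly_spend": [100.0, 150.0, 80.0, 95.0],
}).sort_values("updated_at")

# Labels, with the time each training example was observed.
labels = pd.DataFrame({
    "customer_id": [1, 2],
    "label_time": pd.to_datetime(["2021-02-15", "2021-02-15"]),
    "is_fraud": [0, 1],
}).sort_values("label_time")

# Point-in-time join: for each label, take the latest feature value
# written at or before label_time -- never a value from the future.
training_set = pd.merge_asof(
    labels, feature_log,
    left_on="label_time", right_on="updated_at",
    by="customer_id")
print(training_set[["customer_id", "avg_weekly_spend", "is_fraud"]])
```

Here both customers pick up their January values, not the later updates, so the reconstructed training set matches what the model actually saw in February.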
While Feature Stores are incredibly powerful on their own, they are particularly powerful when paired with scalable, searchable, and persistent storage of machine learning predictions. An increasingly popular way to do this is to store the input features coming from the Feature Store and the output predictions of the machine learning model in a single database, an approach often called database deployment. Because the training features, serving features, and model predictions are all linked, you can monitor feature drift from training to deployment more easily than ever before.
Moreover, if you want to get a more granular view and see the individual features used to generate a particular prediction, that information is just a query away. You can then use the Feature Store to see where these features came from.
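As a sketch of what database deployment can look like in practice (the table layout and column names are invented for illustration), each row stores the serving-time feature values next to the model’s output, so auditing a single prediction is one SQL query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE predictions (
        prediction_id INTEGER PRIMARY KEY,
        amount REAL,          -- serving-time feature values...
        zip_code TEXT,
        fraud_score REAL,     -- ...and the model's output, in one row
        model_version TEXT
    )""")
conn.executemany(
    "INSERT INTO predictions VALUES (?, ?, ?, ?, ?)",
    [(1, 120.50, "10001", 0.03, "v7"),
     (2, 9800.00, "60629", 0.91, "v7")])

# Auditing a specific prediction is a single query.
row = conn.execute(
    "SELECT amount, zip_code, fraud_score FROM predictions "
    "WHERE prediction_id = ?", (2,)).fetchone()
print(row)  # (9800.0, '60629', 0.91)
```

From there, the feature names in the row are the handle back into the Feature Store, which knows how and from what raw data each one was computed.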
In my conversations with data scientists, I often hear that end-to-end lineage and transparency is something they’d like to have, but they don’t have the time or resources to build out the necessary infrastructure. With a Feature Store coupled with database deployment, no extra engineering effort is needed for complete data lineage; it’s all baked in.
The next time your boss comes to you with a question about your model, you can use a Feature Store to find the exact training set behind it, the code used to generate the features in that training set, and the features served to the deployed model, all in seconds.
DISCLAIMER: I am a Data Scientist at Splice Machine, a single engine Feature Store for Machine Learning.