Data Drift Detection In Kubeflow Pipelines

It is important to identify drift as early as possible and take the necessary steps to mitigate it.

Varun Mallya
DKatalis
3 min read · May 30, 2022


We use Kubeflow Pipelines to orchestrate our ML pipelines, which cover training, inference, and backfilling of feature tables. Data drift is a change in the distribution of inference data relative to the data used during model training. It is important that we identify it as early as possible and take the necessary steps to mitigate it.

One of the open-source tools used to identify data drift is Evidently. It runs statistical tests in the background to detect drift and provides a simple interface for running them. Evidently also produces interactive reports, which help us with debugging.

Our objective was to build a Kubeflow Pipelines component that we could reuse across all our batch inference pipelines. This component takes two inputs:

  1. Reference Dataset
  2. Inference Dataset

This component’s output is a report that we can visualize in the Kubeflow pipelines UI.
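As a rough illustration of how an HTML report can be surfaced in the UI, here is a minimal sketch that assumes the KFP v1 output viewer convention, where a component writes a small JSON file declaring an inline `web-app` output. The helper names `build_ui_metadata` and `write_ui_metadata` are hypothetical, and the actual component in the repo may do this differently.

```python
import json
from pathlib import Path


def build_ui_metadata(report_html: str) -> dict:
    """Wrap an HTML report in the KFP v1 output-viewer metadata structure."""
    return {
        "outputs": [
            {
                "type": "web-app",    # render the source as a static HTML page
                "storage": "inline",  # the HTML is embedded, not read from a bucket
                "source": report_html,
            }
        ]
    }


def write_ui_metadata(report_html: str, metadata_path: str) -> None:
    """Write the metadata JSON to the path the component exposes as its output."""
    path = Path(metadata_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(build_ui_metadata(report_html)), encoding="utf-8")
```

In this sketch, `report_html` would be the HTML exported by the drift report, and `metadata_path` would be the output path that the pipeline assigns to the component (conventionally named `mlpipeline-ui-metadata` in KFP v1).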

You can find the component and the pipeline in this repo.

The component checks whether data drift has occurred by running the Kolmogorov-Smirnov (K-S) test, which compares the distributions of the reference and inference datasets. The comparison is performed feature by feature. If a significant share of the features has drifted (in our case, 50% of them), we conclude that there is significant data drift and that we may need to retrain the model.
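To make the decision rule concrete, here is a simplified standard-library sketch of the idea, not the Evidently implementation. It assumes each dataset is a dict mapping a feature name to a list of numeric values, uses the asymptotic approximation for the K-S p-value, and treats a feature as drifted when its p-value falls below an assumed significance level of 0.05. Whether the 50% threshold is inclusive is also an assumption here. In practice a library routine such as `scipy.stats.ks_2samp` would normally replace the hand-written test.

```python
import math
from bisect import bisect_right


def ks_statistic(reference, inference):
    """Two-sample K-S statistic: the largest gap between the empirical CDFs."""
    ref, inf = sorted(reference), sorted(inference)
    points = sorted(set(ref) | set(inf))
    return max(
        abs(bisect_right(ref, x) / len(ref) - bisect_right(inf, x) / len(inf))
        for x in points
    )


def ks_p_value(d, n, m, terms=100):
    """Asymptotic two-sided p-value for K-S statistic d with sample sizes n, m."""
    n_eff = n * m / (n + m)
    lam = (math.sqrt(n_eff) + 0.12 + 0.11 / math.sqrt(n_eff)) * d
    if lam < 0.05:
        return 1.0  # distributions are practically indistinguishable
    total = sum(
        (-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
        for j in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2.0 * total))


def detect_dataset_drift(reference, inference, alpha=0.05, drift_share=0.5):
    """Return (dataset_drifted, per_feature_flags).

    alpha and the inclusive drift_share comparison are assumptions of this sketch.
    """
    drifted = {}
    for feature, ref_values in reference.items():
        inf_values = inference[feature]
        d = ks_statistic(ref_values, inf_values)
        p = ks_p_value(d, len(ref_values), len(inf_values))
        drifted[feature] = p < alpha
    share = sum(drifted.values()) / len(drifted)
    return share >= drift_share, drifted
```

Calling `detect_dataset_drift(reference, inference)` returns a dataset-level flag together with a per-feature breakdown, which is the kind of result a downstream pipeline step could act on.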

When we run this component in a pipeline, the drift dashboard appears under the Kubeflow Pipelines visualisation for that run.

We can use the results of this component to conditionally execute the rest of the pipeline if required.

Our next blog post will showcase how we use Evidently for “real-time monitoring”, mainly for our models deployed as a FastAPI service.

Meanwhile, if you find this article helpful and interesting, then maybe you would make a great fit for our team! Join us and be a Katalis!
