TensorFlow Extended (TFX) — Data Analysis, Validation and Drift Detection — Part 2

Srivatsan Srinivasan
DataDrivenInvestor


In the previous article, we saw how TensorFlow, along with TensorFlow Extended (TFX), provides functionality for developing and deploying end-to-end ML pipelines. In case you missed Part 1 of this series, the link to my article is below.

A quick recap of Part 1...

Model algorithms and training are a relatively small fraction of the entire end-to-end machine learning life cycle. Data collection, data engineering, data analysis and validation, feature engineering, model performance monitoring, and model deployment are where a typical data engineer and data scientist spend 90+% of their time.

Most deep learning frameworks today focus only on model training. TensorFlow and its ecosystem cover the entire ML life cycle mentioned above.

In this second part of the TFX series, we will focus on the TensorFlow Data Validation (TFDV) component. In future parts, we will cover TensorFlow Transform and TensorFlow Model Analysis.

If you want to jump directly into the code, the link to the GitHub repo is towards the end of this article.

TensorFlow Data Validation (TFDV) is a library for analyzing, visualizing, and validating data used for machine learning models. TFDV provides insight into three key questions in the data analysis process:

  • What are the characteristics of my data, and what does it look like?
  • Are there any errors in the data?
  • Has the data evolved (drifted) away from the assumptions the model was trained on?

Specifically on drift, I wanted to highlight a few important concepts.

Data and businesses evolve over time, causing even the most predictive model to become less accurate. The machine learning life cycle does not end with deployment; rather, it requires continuous model and data monitoring post-deployment. Depending on how rapidly the underlying assumptions of the business process change, models have to be re-calibrated and re-deployed just as frequently.

Take the case of financial fraud detection: fraudsters innovate faster than banks. The introduction of the EMV chip did reduce card-present fraud, but fraudsters were quick to shift their fraud online, to the card-not-present channel. Fraud detection models might therefore require frequent re-calibration compared to, say, models that identify employee churn.

In a data-driven approach, treat data as you treat code. Catching errors early is critical.

A key aspect a data scientist needs to worry about is identifying concept drift, where the data changes unpredictably over time. Check the wiki link below if you need more details on concept drift:

https://en.wikipedia.org/wiki/Concept_drift

One thing a data scientist needs to understand is how to differentiate anomalous drift from natural drift. Natural drift in data can be modeled using feature engineering.

Examples of natural drift include the seasonality of a business and the correlation of the economy with employment rates or housing prices, among others.

Another kind of drift is schema (structural) drift, which is a change in the structure of the source data itself.

TFDV helps to catch errors early as well as to identify and flag anomalous drift. Some of its key capabilities are:

  • Computes and visualizes summary statistics across all features with two lines of code
  • Compares multiple datasets, helping identify data and distribution skew between training, evaluation, and serving datasets
  • Automatically generates a schema from the underlying data, and uses that schema to inspect out-of-time datasets
  • Identifies anomalies such as missing features and out-of-range values, among others
  • Detects data drift by looking at a series of datasets

While there are plenty of frameworks with similar functionality, one of the benefits of TFDV is its use of Apache Beam for computation. This makes TFDV scalable and performant across thousands of features and on large datasets. It also helps TFDV run consistently on streaming as well as batch pipelines.
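As a minimal sketch of what that looks like in practice: generate_statistics_from_csv accepts an optional pipeline_options argument (an Apache Beam PipelineOptions object), so the same statistics computation can be pointed at a distributed runner. The runner flag below is just the local default, and OUTPUT_FILE is the CSV path used in the code later in this article.

import tensorflow_data_validation as tfdv
from apache_beam.options.pipeline_options import PipelineOptions

# Swapping DirectRunner for a distributed runner (e.g. DataflowRunner)
# scales the same computation out; sketch only
beam_opts = PipelineOptions(['--runner=DirectRunner'])
stats = tfdv.generate_statistics_from_csv(data_location=OUTPUT_FILE,
                                          pipeline_options=beam_opts)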

Let us now look at the code and the output produced by the TFDV components. I will be using the Watson Telecom churn dataset to demonstrate TFDV's capabilities.

Producing the Data Analysis output takes just two lines of code:

import tensorflow_data_validation as tfdv

train_stats = tfdv.generate_statistics_from_csv(data_location=OUTPUT_FILE)
tfdv.visualize_statistics(train_stats)

The Data Analysis component produces quantiles, equi-width histograms, means, standard deviations, and other summaries for continuous features, and top-k values by frequency for discrete features. There are three variations of univariate plots one can generate using TFDV: box plots, quantile plots, and value list lengths.

Now we will infer a schema from the incoming dataset. The schema defines the constraints on the data that are relevant for ML. Example constraints include the data type of each feature, whether it is numerical or categorical, and the frequency of its presence in the data. For categorical features, the schema also defines the domain: the list of acceptable values.

# Infer a schema (types, domains, presence constraints) from the statistics
schema = tfdv.infer_schema(statistics=train_stats)
tfdv.display_schema(schema=schema)

If you look at the output above, infer_schema not only infers data types but also the possible values for categorical features. This schema can further be used to validate evaluation or serving datasets and identify schema drift, as well as to identify categories present in the serving dataset that were missing during training.
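Because the inferred schema is just a schema protocol buffer, it can also be curated by hand before being used for validation. A minimal sketch, using the 'MultipleLines' feature from this dataset; the appended value and the 90% presence threshold are purely illustrative:

# Add an accepted value to the categorical domain (illustrative value)
tfdv.get_domain(schema, 'MultipleLines').value.append('No phone service')

# Require the feature to be present in at least 90% of examples (illustrative)
tfdv.get_feature(schema, 'MultipleLines').presence.min_fraction = 0.9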

So far we have visualized each dataset individually. It is important that our evaluation data is consistent with our training data, including that it uses the same schema. It is also important that the evaluation data includes roughly the same ranges of values for our numerical features as our training data, so that our coverage of the loss surface during evaluation is roughly the same as during training. The same is true for categorical features. Otherwise, we may have training issues that are not identified during evaluation, because we did not evaluate part of our loss surface.
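To compare the two splits we first need statistics for each. A quick sketch, assuming the train and eval splits live in separate CSVs (TRAIN_FILE and EVAL_FILE are hypothetical paths):

# Hypothetical paths to the train/eval splits of the churn CSV
train_df_stats = tfdv.generate_statistics_from_csv(data_location=TRAIN_FILE)
eval_df_stats = tfdv.generate_statistics_from_csv(data_location=EVAL_FILE)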

Notice in the resulting chart that each feature now includes statistics for both the training and evaluation datasets. The charts for the two datasets are overlaid, making it easy to compare them. If you look at the 'MultipleLines' categorical feature, a new category comes in during evaluation that is not present in training.

tfdv.visualize_statistics(lhs_statistics=eval_df_stats,
                          rhs_statistics=train_df_stats,
                          lhs_name='EVAL_DATASET',
                          rhs_name='TRAIN_DATASET')

If you want to highlight anomalies alone, this can be done via the display_anomalies method. This takes the schema inferred above from the training dataset and validates the evaluation dataset against it:

# Check the eval statistics against the schema inferred from training data
eval_anomalies = tfdv.validate_statistics(statistics=eval_df_stats, schema=schema)
tfdv.display_anomalies(eval_anomalies)
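Once anomalies are reviewed, the usual follow-up is to either fix the data or relax the schema and re-validate. A sketch with an illustrative threshold: lowering min_domain_mass lets a small fraction of values fall outside the recorded domain.

# Allow up to 10% of 'MultipleLines' values outside the training-time domain
tfdv.get_feature(schema, 'MultipleLines').distribution_constraints.min_domain_mass = 0.9
updated_anomalies = tfdv.validate_statistics(statistics=eval_df_stats, schema=schema)
tfdv.display_anomalies(updated_anomalies)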

Finally, we will look at how to set skew and drift comparator thresholds to identify concept and data drift between model datasets, as well as to continuously monitor the serving dataset.

# Skew comparator: flags train/serving divergence for 'MultipleLines'
multiple_lines_skew = tfdv.get_feature(schema, 'MultipleLines')
multiple_lines_skew.skew_comparator.infinity_norm.threshold = 0.001

# Drift comparator: flags change between successive spans of 'TotalCharges'
totalcharges_comp = tfdv.get_feature(schema, 'TotalCharges')
totalcharges_comp.drift_comparator.infinity_norm.threshold = 0.001

skew_anomalies = tfdv.validate_statistics(eval_df_stats, schema,
                                          previous_statistics=train_df_stats)
tfdv.display_anomalies(skew_anomalies)
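Note that the drift comparator fires when previous statistics are supplied, as above, while the skew comparator only fires when serving statistics are supplied alongside the schema. A sketch of the serving-side check, where serving_df_stats is a hypothetical statistics proto computed from serving logs:

# serving_df_stats: hypothetical statistics computed from serving-time data
serving_anomalies = tfdv.validate_statistics(statistics=train_df_stats, schema=schema,
                                             serving_statistics=serving_df_stats)
tfdv.display_anomalies(serving_anomalies)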

Drift is expressed in terms of the L-infinity distance, and you can set a threshold distance so that you receive warnings when the drift is higher than what is acceptable.
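To make that concrete, here is a tiny sketch (plain Python with made-up frequencies, not TFDV internals) of what the check computes: the L-infinity distance between two normalized value-frequency vectors is simply the largest absolute difference, compared against the threshold.

# Normalized category frequencies in two spans of data (made-up numbers)
train_freq = {'Yes': 0.48, 'No': 0.52}
serve_freq = {'Yes': 0.40, 'No': 0.55, 'No phone service': 0.05}

keys = set(train_freq) | set(serve_freq)
l_inf = max(abs(train_freq.get(k, 0.0) - serve_freq.get(k, 0.0)) for k in keys)
print(l_inf)          # 0.08
print(l_inf > 0.001)  # True: drift exceeds the threshold, so it gets flagged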

You can read more about the L-infinity (Chebyshev) distance at the wiki link below:

https://en.wikipedia.org/wiki/Chebyshev_distance

GitHub repo: https://github.com/srivatsan88/Tensorflow_Extended_Notebook/blob/master/TFX_Visualize_Distribution.ipynb (you can open the notebook in Google Colab using the link provided within the notebook)

The official documentation of TensorFlow Extended's TensorFlow Data Validation component can be found at the link below:

https://www.tensorflow.org/tfx/guide/tfdv

Stay tuned and keep watching this space for the next part of the TensorFlow Extended (TFX) series...
