Build an MLOps Pipeline to Automatically Relabel Data using Cleanlab and DVC

Cleaning data and purifying labels are always challenging tasks, but a good experimentation framework makes things easier.

Todd Cook
Sage Ai
11 min read · May 24, 2023


Repeatable, independently verifiable experiments are the building blocks of scientific progress. Recent advancements in ML tools make reproducibility easier than ever before.

DVC stands for Data Version Control, and it’s a Python utility library that does more than help you access, version, and manage large datasets. It also lets you build pipelines and track and compare experiments easily, without any external server dependencies. It’s easy to get started, and I’ll show you how to leverage its power with a working, non-trivial example. Our dataset and problem come from NLP, but the principle of cleaning data by using Cleanlab to analyze cross-validation prediction probabilities applies to all types of data.

Let’s take a quick peek at the demo code of this article in action: here’s an asciicast of fetching and comparing the demo’s experiments using git and DVC:

Cleaning Data Labels — A Problem for Today and Tomorrow

Casually labeled data is commonly available in many sources: Wikidata, Kaggle datasets, academic data papers, individual blog posts, etc. We describe it as “casual” because often the quality of the labels can range from okay to questionable, and typically, no quality metrics are available.

Auto-ML that automatically categorizes data correctly will happen tomorrow — as in “Free Food Tomorrow” — i.e. probably never and certainly never free! However, we can push a little in that direction: randomly sampling the data and labels is a poor selection mechanism, and human evaluation is tedious and expensive. A better approach is to use unsupervised learning to assess and rank the data and labels according to their predicted probabilities, a.k.a. confidence scores. However, since we don’t have an a priori gold label set against which to evaluate those confidence scores, we’ll use Cleanlab’s processing of the predicted probabilities from cross-validation folds to arrive at a reasonable approximation of a gold label set. Thus we can reduce the downstream burden of human evaluation by finding and relabeling the worst performers automatically.

Once the data is cleaned, the classes and labels may be useful for training NER models, ML-assisted data entry, text classification, or trend detection, such as spotting emerging employers and occupations, etc.

Goals of the demo and this article

  • Demonstrate reproducible ML
  • Use DVC to build a pipeline and track experiments
  • Automatically relabel noisy data labels using Cleanlab
  • Use FastText subword embeddings for unsupervised classification
  • Tune hyperparameters with experiment tracking
  • Prepare casually labeled data for models and human evaluation

The Problem and The Data

For our working demo, we will purify some of the slightly noisy and dirty labels found in Wikidata’s people entries, namely the attributes for Employers and Occupations. Our initial data labels have been harvested using a json dump of Wikidata, the Kensho Wikidata dataset, and this notebook script for extracting the data.

The dataset is similar to many real-world problems: noisy labels; and foggy boundaries of the label classes. The labels are noisy because they are typically entered by Wikidata volunteers, or by engineers importing raw data into Wikidata from legacy sources, without much cleaning. Casual web browsers and encyclopedia readers are tolerant of errors and foggy boundaries, but ML models suffer when trained with erroneous data. The easiest way to increase a model’s performance and robustness is to train with better data and fewer errors.

Our main class labels, employers and occupations, have foggy boundaries: a baker (occupation) could hire a carpenter (occupation), and then it’s technically true that this baker is also an employer. But an example like this is really just a word game; in the real world we typically expect people to describe their occupation in the most common, general terms, and the same goes for employers: a bakery (employer) typically employs bakers.

Morphology presents another foggy boundary. In the example above, a baker is employed by a bakery; the word endings hint at which is the occupation and which is the employer. But a carpenter is not employed by a “carpentry”, and many other examples break the morphology pattern. Thus we will use contextual word embeddings to harvest boundary information so that we can better cluster, classify, and clean the noisy labels.

Data Input Format

Tab-separated CSV files, with the following fields (see the loading sketch after this list):

  • text_data — The item that is to be labeled (single word or short group of words)
  • class_type — The class label
  • context — Any text that surrounds the text_data field in situ, or defines the text_data item in other words.
  • count — The number of occurrences of this label; how common it appears in the existing data.
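For concreteness, here’s a minimal sketch of loading a file in this format with pandas; the file name is hypothetical, and I’m assuming the files have no header row (adjust if yours do):

# Sketch: loading a raw input file (hypothetical file name, assumed headerless)
import pandas as pd

df = pd.read_csv(
    "data/raw/occupations.csv",                            # hypothetical example file
    sep="\t",                                              # tab-separated values
    names=["text_data", "class_type", "context", "count"],
)
print(df.head())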

Initial Data: Occupations and Employers

Sample Wikidata Occupations data:

This small sample of labeled occupation data looks fine, but the file has 12,755 entries, and we’ll soon see where things can go wrong.

Sample of the Wikidata Employers data:

Even with a small sample of employer data, we can see problems: neither bibliography nor computer science nor poetry are employers, and classification system occurs only once, so it isn’t even well supported. Single-instance examples are almost always suspect, since they are outliers. They could be valid, but more often they are errors in judgment.

Using the instance count as a rough heuristic for the truthiness of an item is not robust in isolation, but it’s a feature worth considering. The context field comes from the description of the text_data field, or for many datasets it could be a sentence where the text_data item was used. One might try concatenating the context field with the text_data field to build a stronger signal to use during labeling and clustering of the class types.

Data Output format

All of the fields from the data input format, plus:
  • class_type — Updated label
  • previous_class_type — The previous class_type label
  • mislabeled_rank — the Cleanlab confidence rank prior to re-label
  • date_updated — When the label was updated

The Pipeline Stages

A DVC pipeline is made up of stages defined in a file dvc.yaml and configured by a file params.yaml. The pipeline is run, start to finish, simply by invoking dvc repro (follow along with the code repo’s README.md). If any of the parameters in the params.yaml file are changed, or if any dependencies (source code or data) have been updated, then that stage is run again from scratch. Let’s take a look at our stages:

  • fetch.py — This program is called to download models and data. The FastText embedding model is several gigabytes.
  • prepare.py — This program is called to process the CSV files in the directory data/raw using the downloaded embeddings and a language detection model.
  • train.py — This program builds a classifier to cluster label the data.
  • relabel.py — This program uses the hyperparameters tuned in the train stage, the data output by the prepare stage, and the Cleanlab utility library to train models with cross validation and predict which instances are mislabeled. The data is relabeled and exported to the directory data/final.

The pipeline stages are connected by the dvc.yaml file configuration. We’ll look at each stage shortly, and examine how they’re configured. You can see a visual representation of this if you type: dvc dag

The output of the command: dvc dag (image by author)

Fetch Stage

Usually DVC projects store large files in cloud storage and bring them down locally for processing. However, since one of the goals of this demo pipeline is to determine which embedding works best, I’ve made the data access step its own automated stage, instead of requiring users to set up cloud storage access and import the files into their DVC config.

# Fetch stage of dvc.yaml
fetch:
  cmd:
    - python src/fetch.py --param_group fetch_data
    - python src/fetch.py --param_group fetch_model
  always_changed: True
  deps:
    - src/fetch.py
  params:
    - fetch_data.file
    - fetch_data.uri
    - fetch_model.file
    - fetch_model.uri

Note: you can call a command with more than one target and pass different parameters, as done above. However, these parameters aren’t meant to be updated here; all parameter changes should happen in the params.yaml file.
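For reference, here’s a minimal sketch of how a stage script might read those values from params.yaml using PyYAML; the demo’s actual scripts may load them differently (e.g. via argparse or dvc.api):

# Sketch: reading stage parameters from params.yaml (one possible approach)
import yaml

with open("params.yaml") as f:
    params = yaml.safe_load(f)

fetch_data = params["fetch_data"]        # the fetch_data.* parameter group
print(fetch_data["file"], fetch_data["uri"])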

Prepare Stage

In the Prepare stage we collect our text_data items and class_type labels and generate and save an embedding representation for each item. Our FastText subword embeddings are generated using the binary FastText embedding model (i.e. crawl-300d-2M-subword.bin instead of crawl-300d-2M-subword.vec). This allows us to generate embeddings based on the character sequences/subwords that make up a word. In practical terms, if I encountered a neologism like “bakeryarama”, the FastText binary model would create an embedding very similar to the word “bakery”, with some additional embedding values found near words that end in “arama”, figuratively speaking. Using a binary embedding model like FastText makes the preparation stage more robust: the generated embeddings are somewhat tolerant of misspellings and neologisms, which can help our model generalize and perform better on unseen data.
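As a rough illustration of that subword behavior, here’s a minimal sketch using the fasttext Python package and the crawl-300d-2M-subword.bin model mentioned above; “bakeryarama” is the made-up word from the paragraph:

# Sketch: subword embeddings from the binary FastText model
import fasttext
import numpy as np

model = fasttext.load_model("crawl-300d-2M-subword.bin")

vec_bakery = model.get_word_vector("bakery")
vec_neologism = model.get_word_vector("bakeryarama")   # out-of-vocabulary neologism

# The subword model still places the neologism close to "bakery"
cosine = np.dot(vec_bakery, vec_neologism) / (
    np.linalg.norm(vec_bakery) * np.linalg.norm(vec_neologism)
)
print(f"cosine similarity: {cosine:.3f}")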

If there are multiple words in a text_data item we:

  • drop any stopwords
  • weight the individual word embeddings using an IDF dictionary (if a tfidf_dict_pickle_file parameter has been provided in params.yaml).
  • average the weighted embeddings

This approach is explained in detail in the paper A Simple but Tough-to-Beat Baseline for Sentence Embeddings.
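A minimal sketch of that multi-word procedure, assuming the FastText model from the previous sketch and a hypothetical idf dictionary (word → IDF weight) loaded from the tfidf_dict_pickle_file pickle; this is an approximation, not the exact code in prepare.py:

# Sketch: IDF-weighted average embedding for a multi-word text_data item
import numpy as np

STOPWORDS = {"the", "of", "and", "a", "in"}   # stand-in for a real stopword list

def embed_phrase(phrase, model, idf=None, dim=300):
    """Drop stopwords, weight each word vector by its IDF, then average."""
    words = [w for w in phrase.lower().split() if w not in STOPWORDS]
    if not words:
        return np.zeros(dim)
    vectors = [(idf.get(w, 1.0) if idf else 1.0) * model.get_word_vector(w)
               for w in words]
    return np.mean(vectors, axis=0)

# e.g. embed_phrase("chief of police", model, idf=idf_dict)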

# Prepare stage of dvc.yaml
prepare:
  cmd: python src/prepare.py
  deps:
    - src/prepare.py
    - data/raw
  outs:
    - data/prepared/data.all.csv
  metrics:
    - reports/prepare.metrics.json:
        cache: false
  params:
    - fetch_data.file
    - prepare.embeddings_dim
    - prepare.filter_lang_bool
    - prepare.filter_lang
    - prepare.lang_detect_model
    - prepare.stopwords_filter_lang
    - prepare.filter_count_gteq
    - prepare.tfidf_dict_pickle_file

An output has been specified for the prepare stage; if that output file exists when the stage starts, it is deleted. If the file has not been recreated when the cmd completes, the stage has failed, and DVC will try to run it again on the next invocation of dvc repro.

Metrics are also reported: the reports/prepare.metrics.json file must be generated or the stage is marked as failed. The stage’s dependencies (deps) are only the source file and the data/raw directory. Normally the data files would be listed here as well, and if they changed, DVC would compare checksums and rerun the prepare stage if necessary; however, since the fetch stage gives users the flexibility to specify whatever files they want, the data file dependency is not hardcoded.

The prepare stage also uses a FastText language model to detect and filter by whatever language you specify. The parameter filter_count_gteq is an integer used to filter out entries whose count attribute is less than this value; if a data row has no count attribute, the filter doesn’t apply.

Last but perhaps most importantly, the prepare stage marks any duplicate text_data items as unknown, using the UNK tag. This allows us to get away from the false dichotomy that occurs when you have only two labels in the data, but in the real world you find items outside of either class type, e.g. there are many things that are neither an occupation nor an employer. Marking collisions as unknown will allow the model and cross validation steps to have a safe bucket for corralling items of low confidence.
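Roughly, the filtering and de-duplication might look like the following sketch; the lid.176.bin language-ID model, the file name, and the threshold value are assumptions here, and treating cross-class collisions as UNK is one plausible reading of the prepare stage (the real logic lives in prepare.py and is driven by params.yaml):

# Sketch: language filter, count threshold, and collision handling (assumptions noted above)
import fasttext
import pandas as pd

UNK = "UNK"
lang_model = fasttext.load_model("lid.176.bin")   # FastText language-ID model (assumed)

def detect_lang(text: str) -> str:
    labels, _ = lang_model.predict(text.replace("\n", " "))
    return labels[0].replace("__label__", "")

df = pd.read_csv("data/raw/employers.csv", sep="\t",
                 names=["text_data", "class_type", "context", "count"])

# Keep only rows in the configured language (prepare.filter_lang)
df = df[df["text_data"].map(detect_lang) == "en"]

# Drop rows whose count is below prepare.filter_count_gteq (when a count is present)
df = df[df["count"].isna() | (df["count"] >= 2)]

# A text_data item that appears under more than one class_type gets relabeled UNK
collisions = df.groupby("text_data")["class_type"].transform("nunique") > 1
df.loc[collisions, "class_type"] = UNK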

Train Stage

The train stage allows you to specify a classifier: Support Vector or KNeighbors (others could be added). The stage’s output lets the user inspect and adjust model hyperparameters. For our demo we build a Support Vector classifier, chosen because SVMs work well at discovering boundaries and maximum margins between classes (for more information, consult the papers listed at the end of this article).

# train stage of dvc.yaml
train:
  cmd: python src/train.py
  deps:
    - data/prepared/data.all.csv
    - src/train.py
  outs:
    - model/svm.model.pkl
  metrics:
    - model/train.metrics.json:
        cache: false
  plots:
    - model/class.metrics.csv:
        cache: false
  params:
    - train.class_types
    - train.svm_degree
    - train.svm_dual
    - train.svm_gamma
    - train.svm_kernel
    - train.svm_loss
    - train.model_type
    - train.svm_penalty
    - train.svm_regularization_C
    - train.seed
    - train.split
    - train.knc_n_neighbors
    - train.knc_weights
    - train.knc_algorithm
    - train.knc_leaf_size
    - train.knc_p
    - train.knc_metric
    - train.num_components
    - train.use_pca

One minor annoyance is that it’s difficult to group parameters that belong together. For example, the models LinearSVC and SVC both take a C regularization parameter, but LinearSVC also takes penalty, loss, and dual parameters, while SVC takes kernel, degree, and gamma. There’s no easy way to group them; in the demo code we let experimenters toggle between the models by specifying the model_type parameter.
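A minimal sketch of how train.py might map those parameter groups onto scikit-learn estimators via model_type; the parameter names mirror params.yaml, but the dictionary p and the accepted model_type values are assumptions:

# Sketch: building the classifier from the train.* parameters (illustrative)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC

def build_classifier(p: dict):
    """p is the 'train' section of params.yaml, already loaded as a dict."""
    if p["model_type"] == "SVC":
        return SVC(C=p["svm_regularization_C"], kernel=p["svm_kernel"],
                   degree=p["svm_degree"], gamma=p["svm_gamma"],
                   probability=True, random_state=p["seed"])
    if p["model_type"] == "LinearSVC":
        return LinearSVC(C=p["svm_regularization_C"], penalty=p["svm_penalty"],
                         loss=p["svm_loss"], dual=p["svm_dual"],
                         random_state=p["seed"])
    return KNeighborsClassifier(n_neighbors=p["knc_n_neighbors"],
                                weights=p["knc_weights"],
                                algorithm=p["knc_algorithm"],
                                leaf_size=p["knc_leaf_size"],
                                p=p["knc_p"],
                                metric=p["knc_metric"])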

Relabel Stage

The Relabel Stage uses the Cleanlab library and all of the training stage settings.

# The relabel stage of dvc.yaml
relabel:
  cmd: python src/relabel.py
  deps:
    - src/relabel.py
    - data/prepared/data.all.csv
    - model/svm.model.pkl
  outs:
    - data/final/data.csv
  metrics:
    - reports/relabel.metrics.json:
        cache: false
  plots:
    - data/final/class.metrics.csv:
        cache: false
  params:
    - prepare.unknown_tag
    - train.class_types
    - train.svm_degree
    - train.svm_dual
    - train.svm_gamma
    - train.svm_kernel
    - train.svm_loss
    - train.model_type
    - train.svm_penalty
    - train.svm_regularization_C
    - train.seed
    - train.split
    - train.knc_n_neighbors
    - train.knc_weights
    - train.knc_algorithm
    - train.knc_leaf_size
    - train.knc_p
    - train.knc_metric
    - relabel.num_crossval_folds
    - relabel.min_distance_decision
    - relabel.max_distance_decision
Using the Cleanlab library to purify the labels requires that we produce probability values for predictions, such as those provided by a scikit-learn classifier’s predict_proba method. The mechanics of Cleanlab’s algorithm are beyond the scope of this article, but curious minds will want to familiarize themselves with the Confident Learning paper and the other papers listed in the Cleanlab GitHub repo.

The Results

To relabel, we first use cross_val_predict to compute the out-of-sample predicted probability, P(s=k|x), for every example in X using cross validation. Then we use Cleanlab’s pruning method, find_label_issues, which returns the indices of the most likely (confident) label errors in ŷ.
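Putting those two calls together, here is a minimal sketch using scikit-learn’s cross_val_predict and Cleanlab 2.x’s find_label_issues; the file paths and the choice of SVC are illustrative, not the demo’s exact code:

# Sketch: out-of-sample predicted probabilities + Cleanlab label-issue ranking
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC
from cleanlab.filter import find_label_issues

X = np.load("data/prepared/embeddings.npy")    # hypothetical path: embedding matrix
y = np.load("data/prepared/labels.npy")        # hypothetical path: integer class labels

clf = SVC(kernel="rbf", probability=True, random_state=0)
pred_probs = cross_val_predict(clf, X, y, cv=5, method="predict_proba")

# Indices of likely label errors, most severe first
issue_idx = find_label_issues(
    labels=y,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)
print(f"{len(issue_idx)} suspected label issues; worst 20: {issue_idx[:20]}")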

When the relabel.py program runs, it prints out the top 20 items that Cleanlab returns, in order of increasing confidence (that is, the first is the worst). Since the random seeds are fixed and the experiments are repeatable, they provide interesting and enlightening comparisons.

Here we see that each kernel type (linear, polynomial, radial basis) featurizes the embedding vector in a different manner and learns different boundaries, so each model provides a different way to look at the data. For our particular data set (with UNK being the obviously unbalanced class), it seems acceptable that the models tend to mark more of their uncertain predictions as UNK, which is preferable to hallucinating the wrong label.

In running our demo code, we see our initial data set of Wikidata occupations and employers thinned by data quality heuristics, language detection, count thresholds, and the discarding of duplicates. Employers shrank from 66,880 to 27,734 entries, Occupations shrank from 10,975 to 6,280 entries, and the raw labeled data was de-duplicated, resulting in 33,103 data items and labels. Our pipeline and Cleanlab’s algorithm detected between 1,354 and 1,993 label issues (depending on the classifier used), which were then relabeled or moved to the unknown category for further inspection.

Further Observations

  • For relabeling and cleaning, it’s important to have more than two labels and to specify an UNK label for unknown items, labels spanning multiple groups, or labels with low-confidence support; a binary set of labels is often too limiting to capture the current state of the information we have.
  • Standardizing the input data formats allows users to flexibly use many different data sources.
  • Language detection is an important part of data cleaning, but it is problematic because:
    - Modern languages sometimes “borrow” words from other languages (but not just any words!)
    - Language detection models perform inference poorly with limited data, especially just a single word.
    - Normalization utilities such as unidecode aren’t helpful (the wrong word in more readable letters is still the wrong word).
  • Experimentation parameters often have co-dependencies that make a simple combinatorial grid search inefficient.

Conclusion

The demo pipeline produces enough automatically re-labeled data that the output of multiple models should be evaluated and compared, and this is left as an exercise for the reader. Each classifier likely provides useful information, or perhaps a different custom kernel may produce better results on your data. The classifiers could vote together and the votes could be weighted. The finer points of using ensemble methods are beyond the scope of this article, but interested readers may want to start with Kunapuli’s practical guide Ensemble Methods for Machine Learning. There are many options to consider for the next stage of your auto-labeling pipeline.
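If you do go the ensemble route, a weighted soft vote over the per-kernel classifiers is one simple starting point; this sketch uses scikit-learn’s VotingClassifier with illustrative weights and is not part of the demo pipeline:

# Sketch: weighted soft voting across the three SVC kernels (illustrative only)
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC

ensemble = VotingClassifier(
    estimators=[
        ("linear", SVC(kernel="linear", probability=True)),
        ("poly", SVC(kernel="poly", probability=True)),
        ("rbf", SVC(kernel="rbf", probability=True)),
    ],
    voting="soft",            # average the predicted probabilities
    weights=[1.0, 1.0, 2.0],  # illustrative weights favoring the rbf kernel
)
# ensemble.fit(X, y); ensemble.predict_proba(X_new)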

Sources

https://github.com/todd-cook/auto-label-pipeline

https://fasttext.cc/docs/en/english-vectors.html
Confident Learning: Estimating Uncertainty in Dataset Labels by Curtis G. Northcutt, Lu Jiang, Isaac L. Chuang, 31 Oct 2019, [arxiv]
A Simple but tough-to-beat baseline for sentence embeddings by Sanjeev Arora, Yingyu Liang, Tengyu Ma, ICLR 2017, [paper]
Support Vector Clustering by Asa Ben-Hur, David Horn, Hava T. Siegelmann, Vladimir Vapnik, November 2001 Journal of Machine Learning Research 2 (12):125–137, DOI:10.1162/15324430260185565, [paper]
SVM Clustering by Winters-Hilt, S., Merat, S. BMC Bioinformatics 8, S18 (2007). [link], [paper]
