Data Harmonization using Machine Learning and Language Models

Marina Sedinkina
Published in DataReply · Jul 21, 2023

Introduction and Motivation

In today’s data-driven world, organizations are accumulating vast amounts of information from diverse sources across multiple countries and regions. However, one significant challenge that arises in such scenarios is the inconsistency and lack of standardization in how data is tagged and labeled. This variability hinders the ability of data scientists to effectively utilize the data for building machine learning (ML) use cases, resulting in lost opportunities for valuable insights and actionable outcomes.

The conventional approach to tackling this challenge has been rule-based Extract, Transform, Load (ETL) processes. However, these methods often require manual intervention and depend heavily on explicit rules, which makes them time-consuming and hard to scale. To overcome these limitations, a more efficient solution is needed — one that leverages the power of artificial intelligence (AI) and machine learning.

Let’s explore the key reasons why AI and ML techniques are the preferred choice in harmonizing data with multiple labels:

  1. Scalability and Efficiency: Manual data harmonization processes are labor-intensive and time-consuming, especially when dealing with a large number of labels. Machine learning offers the potential to automate this process, significantly reducing the time and effort required to normalize the data.
  2. Adaptability to Changing Data: As companies expand their operations globally, the diversity of data sources, languages, and variations in label representation increases. Machine learning models can adapt to these variations and learn patterns from the data, making them robust and flexible in handling different labeling conventions.
  3. Improved Accuracy and Consistency: Manual harmonization processes are prone to human errors and inconsistencies, especially when dealing with a large number of labels. Machine learning models, on the other hand, can learn from the existing labeled data and apply statistical techniques to harmonize new labels automatically. This leads to improved accuracy, consistency, and reliability in the harmonized data.
  4. Language and Contextual Understanding: Labels in datasets may be written in different languages or represent similar concepts differently. Machine learning models can utilize Natural Language Processing (NLP) techniques to understand the semantics and context of the labels. This empowers companies to efficiently process and utilize data from diverse sources, regardless of the language barrier.
  5. Cost Reduction and Time Savings: Implementing an unsupervised data harmonization approach based on machine learning can result in significant cost savings for organizations. By automating the process, companies can reduce manual effort, increase efficiency, and free up valuable resources for other critical tasks.

Data harmonization using machine learning holds tremendous potential for organizations facing the challenge of inconsistent and non-standardized labeling across diverse data sources. Let’s explore the construction of data harmonization models by utilizing an unsupervised approach using Natural Language Processing (NLP) techniques.

Dataset Generation

In order to simulate a data harmonization scenario, we will construct a dataset that showcases the challenges associated with harmonizing data sources. To accomplish this, we will use WordNet, an extensive lexical database that organizes words into semantic hierarchies and groups them into synsets based on shared meanings. WordNet also captures relationships between words, including synonymy, antonymy, hyponymy, and hypernymy.

WordNet

These synsets serve as proxies for different labeling conventions or variations in how data is tagged, reflecting the challenges faced during data harmonization. With this simulated dataset, we can explore and develop unsupervised data harmonization approaches and evaluate the effectiveness of different machine learning models and techniques.
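As a quick illustration of the relations WordNet exposes, the snippet below inspects a single synset. It is a minimal sketch and assumes the NLTK WordNet corpus has been downloaded.

import nltk
from nltk.corpus import wordnet

nltk.download("wordnet")  # one-time download of the WordNet corpus

# inspect a single synset: its synonyms (lemma names) and its hypernyms
dog = wordnet.synset("dog.n.01")
print(dog.lemma_names())                    # e.g. ['dog', 'domestic_dog', 'Canis_familiaris']
print([h.name() for h in dog.hypernyms()])  # e.g. ['canine.n.02', 'domestic_animal.n.01']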

We are going to create a dataset by iterating through all the noun synsets available in WordNet, which comprises a total of 82,115 synsets.

from nltk.corpus import wordnet
all_noun_synsets = list(wordnet.all_synsets('n'))
print(len(all_noun_synsets))
>>> 82115

During each iteration, we extract all the synonyms associated with each synset. We then keep only the 1,000 synsets with the highest number of synonyms. This yields 1,000 synonym lists, each containing between 6 and 28 items.

For instance, consider the concept of “noon.” The synonyms associated with this concept include “twelve_noon,” “high_noon,” “midday,” “noonday,” and “noontide.” These synonyms exemplify the variety of terms that are captured within a synset, showcasing the range of linguistic expressions associated with a particular concept.

# extract the synonyms (lemma names) of every synset
all_lemmas = [synset.lemma_names() for synset in all_noun_synsets]
# keep the 1,000 synsets with the most synonyms
most_synonyms = sorted(all_lemmas, key=lambda x: len(x), reverse=True)[:1000]
print(max([len(el) for el in most_synonyms]))
>>> 28
print(min([len(el) for el in most_synonyms]))
>>> 6
print(most_synonyms[-2])
>>> ['noon', 'twelve_noon', 'high_noon', 'midday', 'noonday', 'noontide']

To ensure there are no overlaps among synonyms, we apply a filtering step: each synonym is kept only the first time it appears, and any later duplicates are dropped. This guarantees that every synonym in the dataset is unique and avoids redundancy.

Thus, we create a dataset that maps each concept to its corresponding list of potential synonyms. This serves as a useful reference, enabling easy access and retrieval of the synonyms associated with a specific concept.

dataset, existing = [], set()
for syn_list in most_synonyms:
    keep = []
    for syn in syn_list:
        if syn not in existing:
            keep.append(syn)
            existing.add(syn)
    if keep:
        dataset.append((keep[0], keep))
print(dataset[-2])
>>> ('noon', ['noon', 'twelve_noon', 'high_noon', 'midday', 'noonday', 'noontide'])

Here are some additional examples from the dataset:

Snippet of the harmonization dataset created using WordNet

These examples provide a glimpse into the dataset, showcasing the variety of concepts and their corresponding synonyms. The dataset captures the diverse range of terms associated with different concepts, highlighting the challenges of data harmonization.

Unsupervised (zero-shot) harmonization ML model

Now, we can construct an unsupervised harmonization model utilizing a pre-trained Sentence Transformers model. The model enables us to perform the following steps:

Firstly, we embed our labels by leveraging the capabilities of the Sentence Transformers model. This process involves representing each label as a dense vector that captures its semantic meaning and context.

Secondly, we store these label embeddings, employing a library such as FAISS for efficient storage and retrieval.

Finally, we proceed to normalize our synonyms by calculating the cosine similarity score between the vector representation of the label and the vector representation of each synonym. This similarity score quantifies the degree of semantic similarity between the label and each synonym. Consequently, we identify and select the label with the highest similarity score as the harmonized representation.

Now, let’s examine the results of applying this approach to the concept of “Dostoyevsky”:

All synonyms of the concept “Dostoyevsky” are predicted to map to the label “Dostoyevsky”.

Here is the code snippet to perform the classification:

# choose the model
from sentence_transformers import SentenceTransformer

def generate_embedding(model, sentences):
    return model.encode(sentences, batch_size=len(sentences), show_progress_bar=True)

model = SentenceTransformer('sentence-transformers/paraphrase-mpnet-base-v2')

# create index
import faiss
import numpy as np

def create_index(labels):  # normalization function
    # normalize with faiss: 1. embed labels
    gold_vectors = generate_embedding(model, labels)
    print("shape of vectors:", gold_vectors.shape)
    # normalize with faiss: 2. build the index with the normalized label vectors
    gold_vectors_norm = gold_vectors / np.expand_dims(np.linalg.norm(gold_vectors, axis=1), axis=1)
    index = faiss.index_factory(gold_vectors.shape[1], "Flat", faiss.METRIC_INNER_PRODUCT)
    index.add(gold_vectors_norm)
    return index

# set of labels
labels = list(dict(dataset).keys())
index = create_index(labels)

# run inference for test words
test_words = ['Dostoyevsky', 'Dostoevski', 'Dostoevsky', 'Feodor_Dostoyevsky',
              'Fyodor_Dostoyevsky', 'Feodor_Dostoevski', 'Fyodor_Dostoevski',
              'Feodor_Dostoevsky', 'Fyodor_Dostoevsky', 'Feodor_Mikhailovich_Dostoyevsky',
              'Fyodor_Mikhailovich_Dostoyevsky', 'Feodor_Mikhailovich_Dostoevski',
              'Fyodor_Mikhailovich_Dostoevski', 'Feodor_Mikhailovich_Dostoevsky',
              'Fyodor_Mikhailovich_Dostoevsky']
test_embeddings = generate_embedding(model, test_words)
test_embeddings_norm = test_embeddings / np.expand_dims(
    np.linalg.norm(test_embeddings, axis=1), axis=1
)

# predictions for test words
cos_dist, indices = index.search(test_embeddings_norm, 1)
cos_dist = cos_dist.flatten()
indices = indices.flatten()
threshold = 0.8  # a cut-off that could be used to reject weak matches (not applied in this snippet)
for i, idx in enumerate(indices):
    cos_sim = round(cos_dist[i], 3)
    print(cos_sim, test_words[i], "->", labels[idx])
>>>
1.0 Dostoyevsky -> Dostoyevsky
0.726 Dostoevski -> Dostoyevsky
0.954 Dostoevsky -> Dostoyevsky
0.83 Feodor_Dostoyevsky -> Dostoyevsky
0.856 Fyodor_Dostoyevsky -> Dostoyevsky
0.586 Feodor_Dostoevski -> Dostoyevsky
0.724 Fyodor_Dostoevski -> Dostoyevsky
0.81 Feodor_Dostoevsky -> Dostoyevsky
0.833 Fyodor_Dostoevsky -> Dostoyevsky
0.794 Feodor_Mikhailovich_Dostoyevsky -> Dostoyevsky
0.794 Fyodor_Mikhailovich_Dostoyevsky -> Dostoyevsky
0.623 Feodor_Mikhailovich_Dostoevski -> Dostoyevsky
0.666 Fyodor_Mikhailovich_Dostoevski -> Dostoyevsky
0.78 Feodor_Mikhailovich_Dostoevsky -> Dostoyevsky
0.789 Fyodor_Mikhailovich_Dostoevsky -> Dostoyevsky

For the example of “Dostoyevsky,” we achieved 100% accuracy. However, when running predictions for all the labels and comparing them to our gold labels, we encounter a number of challenging cases:

Set of challenging synonyms

When we ran the model over all 1,000 labels, we obtained an accuracy of approximately 43%. This relatively poor performance is not surprising: no supervision was used, and some labels are difficult to predict with a pre-trained model alone. In an attempt to improve the results, we narrowed the task down to a smaller set of 162 labels that are comparatively easy to predict. With this subset, we achieved an accuracy of around 53%. Despite this improvement, the results are still not satisfactory, and the harmonization model needs further enhancements.
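For reference, such an accuracy figure can be computed with a loop like the one below. This is a sketch that reuses `dataset`, `labels`, `index`, `model`, and `generate_embedding` from above; the author's actual evaluation code is not shown in the article.

# sketch: score the zero-shot model on every synonym in the dataset
from sklearn.metrics import accuracy_score

all_words = [w for gold, words in dataset for w in words]
all_gold = [gold for gold, words in dataset for _ in words]

word_embeddings = generate_embedding(model, all_words)
word_embeddings_norm = word_embeddings / np.expand_dims(
    np.linalg.norm(word_embeddings, axis=1), axis=1
)
_, nearest = index.search(word_embeddings_norm, 1)
zero_shot_preds = [labels[idx] for idx in nearest.flatten()]
print(accuracy_score(all_gold, zero_shot_preds))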

Few-shot harmonization ML model

Language models are also few-shot learners!

We will utilize the SetFit model, which leverages few-shot learning with Sentence Transformers, enabling high accuracy even with limited labeled data. Notably, with only a handful of labeled examples per class, SetFit is competitive with fine-tuning RoBERTa Large on a full training set of 3,000 examples. We are eager to explore its potential for our harmonization task.

To train the ML model, we only require 8 samples per class, making it a few-shot learning approach. To adapt our dataset for training, we will treat the problem as a multi-class classification task and make slight modifications accordingly.

An additional advantage of SetFit is its multilingual support. It can be utilized in conjunction with any Sentence Transformer available on the Hub, including multilingual variants. This feature proves beneficial in the context of the harmonization problem, especially when dealing with labels present in multiple languages.
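For example, a multilingual backbone can be plugged in with a single line. The snippet below is a sketch only; the multilingual checkpoint is not used in the experiments that follow.

from setfit import SetFitModel

# any multilingual Sentence Transformer from the Hub can serve as the backbone
multilingual_model = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"
)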

By harnessing the capabilities of SetFit and its multilingual support, we aim to enhance the performance and effectiveness of our harmonization task, utilizing a few-shot learning approach with limited labeled data.

Let’s proceed with modifying our dataset by constructing training, evaluation, and test data. In this particular experiment, we will work with a total of 162 labels. To split our synonym list for each label into training and evaluation sets, we will assign the first 8 synonyms for training and the last 8 synonyms for evaluation.

It is important to note that the synonym lists vary in length: when a list has fewer than 16 entries, the first 8 and last 8 synonyms overlap, so some evaluation samples also appear in the training set. This closely reflects real-world situations, where evaluation samples can coincide with training data. At the same time, to mimic real-world conditions accurately, the evaluation set must also contain samples the model has never encountered before.

For testing, we will utilize the entire dataset, encompassing the labels included in both the training and evaluation sets. This approach aims to simulate real-world scenarios where the model encounters both known labels (from training) and unseen labels during inference. It is essential to ensure that the model can handle normalizing both familiar and previously unseen labels effectively.

Taking the example of “Dostoyevsky,” we observe only one overlap between the training and evaluation sets: “Feodor_Dostoevsky” appears in both.
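The training and evaluation splits are prepared offline and saved to disk. Below is a minimal sketch of how they could be constructed from the `dataset` list, assuming it has already been restricted to the 162 labels used in this experiment; the exact preprocessing script, including the contents of the `idx` column, is an assumption here.

from datasets import Dataset

train_rows, eval_rows = [], []
for label, synonyms in dataset:
    # first 8 synonyms go to training, last 8 to evaluation (short lists may overlap)
    for syn in synonyms[:8]:
        train_rows.append({"idx": len(train_rows), "sentence": syn, "label": label})
    for syn in synonyms[-8:]:
        eval_rows.append({"idx": len(eval_rows), "sentence": syn, "label": label})

Dataset.from_list(train_rows).save_to_disk("my_dict_train")
Dataset.from_list(eval_rows).save_to_disk("my_dict_eval")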

from datasets import Dataset
train_dataset = Dataset.load_from_disk("my_dict_train")
eval_dataset = Dataset.load_from_disk("my_dict_eval")

train_dataset
>>>
Dataset({
    features: ['idx', 'sentence', 'label'],
    num_rows: 1071
})

train_dataset.data.to_pandas().head(8)
>>> idx sentence label
0 Dostoyevsky Dostoyevsky
1 Dostoevski Dostoyevsky
2 Dostoevsky Dostoyevsky
3 Feodor_Dostoyevsky Dostoyevsky
4 Fyodor_Dostoyevsky Dostoyevsky
5 Feodor_Dostoevski Dostoyevsky
6 Fyodor_Dostoevski Dostoyevsky
7 Feodor_Dostoevsky Dostoyevsky

eval_dataset.data.to_pandas().head(8)
>>> idx sentence label
0 Feodor_Dostoevsky Dostoyevsky
1 Fyodor_Dostoevsky Dostoyevsky
2 Feodor_Mikhailovich_Dostoyevsky Dostoyevsky
3 Fyodor_Mikhailovich_Dostoyevsky Dostoyevsky
4 Feodor_Mikhailovich_Dostoevski Dostoyevsky
5 Fyodor_Mikhailovich_Dostoevski Dostoyevsky
6 Feodor_Mikhailovich_Dostoevsky Dostoyevsky
7 Fyodor_Mikhailovich_Dostoevsky Dostoyevsky

First, we load the SetFit model from the Hub, which provides us with the pre-trained weights and architecture necessary for few-shot learning. We then instantiate a Trainer object, which handles the training process and model evaluation.

Next, we train the model using our modified dataset, which includes the training and evaluation sets. The Trainer orchestrates the training procedure, fine-tuning the SetFit model on our specific task of data harmonization.

Once the model is trained, we can utilize it for inference. Let’s take the example of predicting the label for the synonym “Fyodor_Mikhailovich_Dostoevsky”. We input this synonym into the trained model, and it will output the predicted label associated with it.

from setfit import SetFitModel, SetFitTrainer
from sentence_transformers.losses import CosineSimilarityLoss

model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")

trainer = SetFitTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss_class=CosineSimilarityLoss,
    metric="accuracy",
    batch_size=16,
    num_iterations=20,  # the number of text pairs to generate for contrastive learning
    num_epochs=1,  # the number of epochs to use for contrastive learning
    column_mapping={"sentence": "text", "label": "label"}  # map dataset columns to the names expected by the trainer
)
trainer.train()

# Run inference
preds = model(["Fyodor_Mikhailovich_Dostoevsky"])
preds
>>> array(['Dostoyevsky'], dtype='<U38')

Now, let’s proceed with running the inference for all synonyms across our 162 labels and calculate the corresponding accuracy:

total_words = [w for gold, words in dataset for w in words]
gold_labels = [gold for gold, words in dataset for _ in words]

# Run inference
preds = model(total_words)

#Calculate accuracy
from sklearn.metrics import accuracy_score
acc = accuracy_score(gold_labels, preds)
print(acc)
>>> 0.9270568278201866

The results demonstrate a substantial improvement over the zero-shot approach: accuracy rises from roughly 53% to about 93% on the same set of 162 labels.

Supervised harmonization ML model

Finally, it is important to mention the supervised approach, which applies when training data is available. It can yield high accuracy, but it requires a sufficient amount of labeled data. In this scenario, the problem is framed as in the few-shot setup: a multi-class classification task with 162 or even 1,000 labels to predict. Because of the large label space, this setting is sometimes referred to as “extreme classification.”

One advantage of the supervised approach is the ability to utilize both textual data and the values associated with the labels as features for the model. This means that numerical features can be extracted and incorporated into the model, enabling the development of a powerful model that can learn not only from synonyms but also from the characteristics of the labels themselves. When training data is readily available, this approach is highly recommended for achieving optimal results.
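As an illustration, a simple version of such a model could concatenate sentence embeddings of the synonyms with whatever numeric features are available and train a standard multi-class classifier on top. The snippet below is a hedged sketch: the feature names and the choice of LogisticRegression are assumptions, not a setup prescribed by the article.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/paraphrase-mpnet-base-v2")

def build_features(texts, numeric_features):
    # concatenate text embeddings with per-record numeric features (hypothetical, one row per text)
    text_vectors = encoder.encode(texts)
    return np.hstack([text_vectors, numeric_features])

# hypothetical usage (train_texts, train_numeric, train_labels, etc. are assumed to exist):
# X_train = build_features(train_texts, train_numeric)
# clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
# predictions = clf.predict(build_features(test_texts, test_numeric))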

By leveraging the supervised approach, organizations can benefit from a robust model that exploits both textual and numerical features, ensuring accurate predictions and comprehensive insights.

Summary

Data harmonization is a critical task in organizations where data from various sources, regions, or countries need to be normalized to ensure consistency and enable effective analysis. Traditional rule-based and ETL approaches are time-consuming and not scalable for large datasets, calling for more advanced machine learning (ML) techniques.

Unsupervised ML approaches, such as leveraging NLP techniques and semantic similarities, offer automated solutions for data harmonization. By using language models and similarity scores, labels and synonyms can be harmonized, enabling effective data normalization.

Few-shot learning models, like SetFit, excel in scenarios with limited labeled data. These models achieve high accuracy by leveraging the concept of transfer learning and adapting to new label sets with minimal examples.

Supervised ML approaches, on the other hand, are suitable when training data is available. They treat data harmonization as a multi-class classification problem, and by incorporating textual and numerical features, these models can learn from both synonyms and label characteristics, resulting in powerful models.

Additionally, multilingual language models can be advantageous in data harmonization tasks where labels exist in multiple languages, allowing for cross-lingual harmonization.

Choosing the appropriate ML approach for data harmonization depends on the specific requirements, available data, and scalability needs. By leveraging ML techniques, organizations can automate the harmonization process, improve data quality, and enhance decision-making based on harmonized and normalized data.

Data harmonization, the process of standardizing and unifying diverse datasets, holds immense potential for addressing the challenges of data compliance and standardization. By automating the process of standardization, machine learning can reduce manual effort, enhance data accuracy, and streamline regulatory reporting. As organizations face increasing demands for data compliance, leveraging machine learning for data harmonization can greatly reduce the effort required to become standards compliant.
