AI-Powered Schema Mapping

Shrinath Suresh
8 min read · Sep 30, 2023

Introduction

Large Language Models (LLMs) possess the ability to perform various tasks, including machine translation, summarization, text generation, and many others. One particularly intriguing task they excel at is schema mapping.

Schema mapping finds practical applications in several industries, notably healthcare, insurance, finance, and manufacturing.

Problem Statement

Within healthcare Electronic Health Record (EHR) systems, patient data often exists in various formats across different EHR systems, making the task of consolidating this data into a unified schema highly complex.

For instance, consider the differentiation in how gender is labeled: in one EHR system, it might be denoted as “MALE,” while in another, it could be recorded as “MAN.” Identifying these columns as representing the same information programmatically presents significant challenges.

But, what if we could leverage the power of Large Language Models (LLMs) to address this issue? LLMs possess the capacity to recognize such semantic similarities, potentially offering a solution to this intricate problem.

Sample Use Case

Patient information from one EHR system — say "Source 1".

Patient information from another EHR system — say "Source 2".

The objective is to extract information from both Source 1 and Source 2 and produce output in a unified target format with the columns ID, Full Name, Gender, Age, DOB, Mobile, and Email.

Indeed, it's evident that the patient's name, gender, and date of birth from the two distinct sources are identical but presented in varying formats.

To successfully transform this information into a unified structured table, we need to address several key considerations:

1. Name Field Identification and Merging:
  • Properly identify the name fields in both Source 1 and Source 2.
  • Merge the names into a single consistent format, possibly including initials.

2. Date Formatting:
  • Reformat all dates to adhere to the yyyy-mm-dd format for consistency.

3. Gender Column Identification:
  • Identify and standardize the gender column, ensuring it's marked appropriately (e.g., as "Male" or "Female").

4. Mobile Number Standardization:
  • Normalize the mobile numbers if they are in different formats, ensuring they all follow a consistent format.

By addressing these considerations, we can create a structured table that aligns with the target format in the Centralized EHR system; a minimal pandas sketch of these transformations follows below.
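To make these considerations concrete, here is a minimal pandas sketch of the four transformations. The toy rows are hardcoded for illustration, using column names borrowed from the CSV snippets later in this post:

```python
import re

import pandas as pd

# Toy rows using column names from the sources shown later in this post.
df = pd.DataFrame({
    "FirstName": ["John"], "LastName": ["Smith"], "Sex": ["M"],
    "DateOfBirth": ["15-07-1987"], "Phone": ["(555) 123-4567"],
})

# 1. Merge the name fields into a single "Full Name".
df["Full Name"] = df["FirstName"] + " " + df["LastName"]

# 2. Reformat dates to yyyy-mm-dd.
df["DOB"] = pd.to_datetime(df["DateOfBirth"], dayfirst=True).dt.strftime("%Y-%m-%d")

# 3. Standardize the gender column.
df["Gender"] = df["Sex"].map({"M": "Male", "F": "Female"})

# 4. Normalize phone numbers to xxx-xxx-xxxx.
digits = df["Phone"].map(lambda x: re.sub(r"\D", "", x))
df["Mobile"] = digits.str.replace(r"(\d{3})(\d{3})(\d{4})", r"\1-\2-\3", regex=True)

print(df[["Full Name", "Gender", "DOB", "Mobile"]])
```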

Let’s explore various approaches to tackle this challenge and determine which one offers a more efficient and straightforward solution for implementation.

Direct SQL Transformation

The first and most straightforward approach is to pinpoint the disparities between Source 1 and Source 2, and then write SQL code that transforms one format into the other, as sketched below.
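For instance, a hand-written query for Source 2 alone might look like this sketch (SQLite-style string functions; column names taken from the Source 2 table shown later in this post):

```sql
-- Hypothetical hand-crafted mapping for Source 2 only; Source 1 and any
-- future source would each need their own query like this.
SELECT
    PatientID AS ID,
    Name AS "Full Name",
    CASE Sex WHEN 'M' THEN 'Male' WHEN 'F' THEN 'Female' END AS Gender,
    Age,
    -- dd/mm/yyyy -> yyyy-mm-dd
    substr(DOB, 7, 4) || '-' || substr(DOB, 4, 2) || '-' || substr(DOB, 1, 2) AS DOB,
    Telephone AS Mobile,
    Email
FROM source2;
```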

Pros:

  1. Simplicity of implementation.

Cons:

  1. Lack of generality: The solution is tailored specifically to the current Source 1 and Source 2 configurations.
  2. Susceptibility to changes: Any alterations in Source 1 or Source 2 would render the code ineffective, necessitating frequent adjustments.

Using LLMs for Schema Mapping

To address this challenge, Large Language Models (LLMs) can be employed to grasp the mapping intricacies between Source 1 and Source 2. We can formulate a straightforward prompt instructing OpenAI's GPT-3.5 Turbo to find the distinctions and generate the desired result in the target format.

Install the necessary packages

! pip install -qq langchain openai

# imports
import json

import pandas as pd
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.chat_models import ChatOpenAI

Create dummy CSV files with the table content

source1 = """PatientID,FirstName,LastName,Gender,Age,DateOfBirth,Phone,Email
101,John,Smith,Male,35,15-07-1987,(555) 123-4567,john.smith@email.com
102,Mary,Johnson,Female,28,22-03-1995,(555) 987-6543,mary.j@email.com
103,David,Williams,Male,45,10-12-1978,(555) 555-5555,david.w@email.com
104,Sarah,Brown,Female,52,05-09-1971,(555) 111-2222,sarah.b@email.com
105,Michael,Davis,Male,30,20-11-1992,(555) 333-4444,michael.d@email.com
"""

source2="""PatientID,Name,Sex,Age,DOB,Telephone,Email
101,John Smith,M,35,15/07/1987,(555) 123-4567,john.smith@email.com
102,Mary Johnson,F,28,22/03/1995,(555) 987-6543,mary.j@email.com
103,David Williams,M,45,10/12/1978,(555) 555-5555,david.w@email.com
104,Sarah Brown,F,52,05/09/1971,(555) 111-2222,sarah.b@email.com
105,Michael Davis,M,30,20/11/1992,(555) 333-4444,michael.d@email.com
"""

sample="""ID,Full Name,Gender,Age,DOB,Mobile,Email
104,Sarah Brown,Female,52,1971-09-05,555-111-2222,sarah.b@email.com
105,Michael Davis,Male,30,1992-11-20,555-333-4444,michael.d@email.com
"""

with open("source1.csv", "w") as fp:
fp.write(source1)

with open("source2.csv", "w") as fp:
fp.write(source2)

with open("sample.csv", "w") as fp:
fp.write(sample)
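Read the files back into dataframes; the chain calls below use these variable names:

```python
source1_df = pd.read_csv("source1.csv")
source2_df = pd.read_csv("source2.csv")
sample_df = pd.read_csv("sample.csv")
```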

Next, build a prompt that takes in a row from Source 1 and a row from Source 2, along with a sample row that defines the desired target format, and instructs the OpenAI GPT model to generate its output in that target format.

template = """ 
Transform the data in Source 1 and Source 2 to match the format of the Sample row. Here are the contents of Source 1 and Source 2:

Source 1 - {source1_row}
Source 2 - {source2_row}

Please create a new row JSON object in the format of the Sample row:

Sample - {sample_row}

Pick the value from Source 1 or Source 2. Apply transformation so that the value changes to Sample format.
JSON object:"""

Create the LangChain LLM chain

OpenAI_API_KEY = "<SET-YOUR-OPEN-AI-KEY-HERE>"
prompt = PromptTemplate(
    template=template, input_variables=["source1_row", "source2_row", "sample_row"]
)

llm = ChatOpenAI(openai_api_key=OpenAI_API_KEY, model="gpt-3.5-turbo")
llm_chain = LLMChain(prompt=prompt, llm=llm, verbose=True)
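Before spending API calls, it can help to sanity-check what the model will actually see by rendering the filled-in template for one row (no API call involved):

```python
# Render the prompt for the first row of each source.
print(prompt.format(
    source1_row=source1_df.iloc[0].to_json(),
    source2_row=source2_df.iloc[0].to_json(),
    sample_row=sample_df.iloc[-1].to_json(),
))
```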

Use the chain to transform each row and store the result, iterating over all the rows.

openai_result_list = []
for i in range(len(source1_df)):
    source1_row = source1_df.iloc[i].to_json()
    source2_row = source2_df.iloc[i].to_json()
    sample_row = sample_df.iloc[-1].to_json()
    # The chain takes exactly the three variables declared in the template.
    response = llm_chain.run(
        {"source1_row": source1_row, "source2_row": source2_row, "sample_row": sample_row}
    )
    target_row_dict = json.loads(response)
    openai_result_list.append(target_row_dict)

Generate the result in dataframe format

target_df = pd.DataFrame(openai_result_list)
target_df

Sample output (reconstructed here from the inputs; the actual LLM output can vary from run to run):

ID,Full Name,Gender,Age,DOB,Mobile,Email
101,John Smith,Male,35,1987-07-15,555-123-4567,john.smith@email.com
102,Mary Johnson,Female,28,1995-03-22,555-987-6543,mary.j@email.com
103,David Williams,Male,45,1978-12-10,555-555-5555,david.w@email.com
104,Sarah Brown,Female,52,1971-09-05,555-111-2222,sarah.b@email.com
105,Michael Davis,Male,30,1992-11-20,555-333-4444,michael.d@email.com

As observed, the LLM has demonstrated its capability to identify mappings and perform value transformations effectively. However, this approach has certain drawbacks:

1. Occasional Mapping Misses:
  • The LLM may occasionally overlook certain mappings; in our run, one row came back with an incorrect phone number format.

2. Time-Consuming Process:
  • Scaling this approach to a large dataset, such as millions of rows, is slow, since it requires a corresponding number of API calls to obtain results.

3. Cost Implications:
  • Even with batched inputs and long-context models, a dataset of a million rows implies a substantial number of API calls, which significantly increases the overall cost of the operation.

These limitations need to be carefully considered when opting for this method.
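To put the time and cost points in rough numbers, here is a back-of-envelope sketch; the token counts, latency, and per-token prices are all assumptions, so substitute current figures:

```python
# Back-of-envelope estimate for one-API-call-per-row mapping.
# All numbers are illustrative assumptions, not measurements.
rows = 1_000_000
input_tokens_per_call = 300    # assumed: template + source rows + sample row
output_tokens_per_call = 100   # assumed: one JSON row back
input_price, output_price = 0.0015, 0.002  # assumed USD per 1K tokens
seconds_per_call = 2           # assumed average latency

cost = rows * (input_tokens_per_call / 1000 * input_price
               + output_tokens_per_call / 1000 * output_price)
hours = rows * seconds_per_call / 3600
print(f"~${cost:,.0f} and ~{hours:,.0f} hours if calls run sequentially")
```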

To address these challenges, let’s explore the next phase of the solution. It’s evident that LLMs possess the capability to discern the patterns between the source and target tables. The primary challenge lies in ensuring consistency.

Mapping Schemas using Code Generation

To overcome this hurdle, we can leverage LLMs to discover the mapping just once and subsequently utilize it for all rows by employing code generation.

Rather than processing each row individually, we can adapt the prompt to instruct GPT-4 to produce Python code that generates the target table.

template = """You are an assistant to generate code. 

Let's think step by step

1. You are given three tables. Source1, Source2 and Sample.
2. Task is to generate a target table which has exactly the same number of columns as sample table and same number of rows as source1 table
3. For each column in the sample table, identify which column matches from source1 or source2 table and find the transformation needed from source to sample table
4. Use pandas built-in functions or regex and transform the column into the sample table format.
5. Apply mobile number transformations (xxx-xxx-xxxx) similar to the sample table format.
6. Always transform dates into yyyy-mm-dd format
7. Do not change the source1, source2 and sample table values. Instead, find the transformations and apply it on the target table.
8. Do not perform merge or concat, as the tables are huge.
9. The column names in the sample table might not match exactly in the source1 and source2 table. Identify the columns based on the column values.
10. Generate python code to create target table by reading source1.csv, source2.csv, sample.csv.

Few rows of Source1, Source2 and Sample tables:

Source 1 - {source1_row}
Source 2 - {source2_row}
Sample - {sample_row}

Python Code:
"""

Now, let's use LangChain to gather the result

prompt = PromptTemplate(
    template=template, input_variables=["source1_row", "source2_row", "sample_row"]
)

llm = ChatOpenAI(openai_api_key=OpenAI_API_KEY, model="gpt-4")
llm_chain = LLMChain(prompt=prompt, llm=llm, verbose=True)

source1_row = source1_df.iloc[:2].to_json()
source2_row = source2_df.iloc[:2].to_json()
sample_row = sample_df.iloc[:2].to_json()

response = llm_chain.run(
    {"source1_row": source1_row, "source2_row": source2_row, "sample_row": sample_row}
)
print(response)

Response from GPT-4

```python
import pandas as pd
import re

# Load data
source1 = pd.read_csv('source1.csv')
source2 = pd.read_csv('source2.csv')
sample = pd.read_csv('sample.csv')

# Create a target DataFrame with the same number of rows as source1 and columns as sample
target = pd.DataFrame(index=range(len(source1)), columns=sample.columns)

# Map source column names to sample column names based on the content
column_map = {
    'PatientID': 'ID',
    'FirstName': 'Full Name',
    'LastName': 'Full Name',
    'Name': 'Full Name',
    'Gender': 'Gender',
    'Sex': 'Gender',
    'Age': 'Age',
    'DateOfBirth': 'DOB',
    'DOB': 'DOB',
    'Phone': 'Mobile',
    'Telephone': 'Mobile',
    'Email': 'Email'
}

# Apply transformations
for src_col, tgt_col in column_map.items():
    if src_col in source1.columns:
        if tgt_col == 'Full Name' and src_col in ['FirstName', 'LastName']:
            if tgt_col not in target.columns:
                target[tgt_col] = source1['FirstName'] + ' ' + source1['LastName']
            continue
        if tgt_col == 'Gender' and src_col == 'Sex':
            target[tgt_col] = source1[src_col].map({'M': 'Male', 'F': 'Female'})
            continue
        if tgt_col == 'DOB':
            target[tgt_col] = pd.to_datetime(source1[src_col], dayfirst=True).dt.strftime('%Y-%m-%d')
            continue
        if tgt_col == 'Mobile':
            # regex=True is required on pandas >= 2.0 for pattern replacement
            target[tgt_col] = source1[src_col].apply(lambda x: re.sub(r'\D', '', x)).str.replace(r'(\d{3})(\d{3})(\d{4})', r'\1-\2-\3', regex=True)
            continue
        target[tgt_col] = source1[src_col]
    elif src_col in source2.columns:
        if tgt_col == 'Full Name' and src_col == 'Name':
            target[tgt_col] = source2['Name']
            continue
        if tgt_col == 'Gender' and src_col == 'Sex':
            target[tgt_col] = source2[src_col].map({'M': 'Male', 'F': 'Female'})
            continue
        if tgt_col == 'DOB':
            target[tgt_col] = pd.to_datetime(source2[src_col], dayfirst=True).dt.strftime('%Y-%m-%d')
            continue
        if tgt_col == 'Mobile':
            target[tgt_col] = source2[src_col].apply(lambda x: re.sub(r'\D', '', x)).str.replace(r'(\d{3})(\d{3})(\d{4})', r'\1-\2-\3', regex=True)
            continue
        target[tgt_col] = source2[src_col]

# Save target to CSV
target.to_csv('target.csv', index=False)
```

Let's run the Python code and print the target dataframe.
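One minimal way to do that is shown below; it assumes the response is a fenced Python block like the one above, and in a real pipeline you would review generated code before executing it:

```python
# Strip markdown fences, then execute the generated script.
code = response.strip().removeprefix("```python").removesuffix("```")
exec(code)  # risky on untrusted output; inspect the code first

target_df = pd.read_csv("target.csv")  # the generated script writes this file
print(target_df)
```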

Indeed, as demonstrated, a single API call allowed us to efficiently generate the desired output. However, it’s important to note that this approach relies on uniformity in the data. Any outliers or irregularities may pose challenges, as the generated code may not be equipped to apply the required transformations in such cases.
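A cheap mitigation is to validate the generated table before trusting it. Here is a sketch with a few hand-written checks, to be extended per column as needed:

```python
import pandas as pd

target = pd.read_csv("target.csv")

# Fail loudly if any row escaped the expected formats.
assert target["DOB"].str.match(r"\d{4}-\d{2}-\d{2}$").all(), "bad date format"
assert target["Mobile"].str.match(r"\d{3}-\d{3}-\d{4}$").all(), "bad phone format"
assert target["Gender"].isin(["Male", "Female"]).all(), "unexpected gender value"
```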

You can run the full example yourself in the Google Colab notebook provided at this link: Google Colab Notebook.

Conclusion

In conclusion, we’ve explored various approaches to tackle the complex task of schema mapping between disparate data sources. While leveraging Large Language Models (LLMs) for automated mapping and code generation offers significant advantages in terms of efficiency and scalability, it’s crucial to acknowledge the limitations.
