Four Data Cleaning Techniques to Improve Large Language Model (LLM) Performance

Published in Intel Tech · 11 min read · Apr 1, 2024

Unlock more accurate and meaningful AI outcomes with RAG (retrieval-augmented generation).

Photo by No Revisions on Unsplash

By Eduardo Rojas Oviedo and Ezequiel Lanza

The retrieval-augmented generation (RAG) process has gained popularity because it can improve the responses of large language models (LLMs) by supplying them with context, which helps prevent hallucinations. The RAG process involves several steps, from ingesting documents in chunks to extracting context to prompting the LLM with that context. While RAG is known to significantly improve predictions, it can occasionally lead to incorrect results. The way documents are ingested plays a crucial role in this process. For instance, if our “context documents” contain typos or characters that are unusual for an LLM, such as emojis, they could confuse the LLM’s understanding of the provided context.

In this post, we’ll demonstrate the use of four common natural language processing (NLP) techniques to clean text before it’s ingested and converted into chunks for further processing by the LLM. We’ll also illustrate how these techniques can significantly enhance the model’s response to a prompt.

The steps of the RAG process, adapted from RAG-Survey.
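To make these steps concrete, here is a minimal, dependency-free sketch of the flow: split a document into chunks, pick the chunk that best matches the question, and build a prompt around it. The word-overlap score is only a stand-in for the embedding similarity a real RAG system would use, and the function names (`chunk_by_sentence`, `overlap_score`, `build_prompt`) are our own illustrative choices.

```python
import re

def chunk_by_sentence(text: str) -> list[str]:
    """Split text into one chunk per sentence (a deliberately simple chunker)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def overlap_score(chunk: str, question: str) -> int:
    """Count distinct words shared by chunk and question.
    Assumption: this stands in for embedding-based similarity."""
    words = lambda s: set(re.findall(r"\w+", s.lower()))
    return len(words(chunk) & words(question))

def build_prompt(question: str, chunks: list[str], top_k: int = 1) -> str:
    """Place the top_k best-matching chunks in front of the question."""
    best = sorted(chunks, key=lambda c: overlap_score(c, question), reverse=True)[:top_k]
    context = "\n".join(best)
    return f"Context:\n{context}\n\nQuestion: {question}"

document = (
    "Adversarial training aims to mitigate bias in generative models. "
    "Topic modeling groups documents by theme. "
    "Translation brings multilingual text into one language."
)
chunks = chunk_by_sentence(document)
prompt = build_prompt("How does topic modeling group documents?", chunks)
print(prompt)
```

Because retrieval depends on matching words in the chunks, typos or noise in the ingested text directly affect which chunk ends up in the prompt.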

Why Is It Important to Clean Your Documents?

It’s standard practice to clean up text before feeding it into any kind of machine learning algorithm. Whether you’re using supervised or unsupervised algorithms, or even crafting context for your generative AI (GAI) model, getting your text in good shape helps to:

· Ensure accuracy: By getting rid of mistakes and making everything consistent, you’re less likely to confuse the model or end up with model hallucinations.

· Improve quality: Cleaner data ensures that the model works with reliable and consistent information, helping our models to infer from accurate data.

· Facilitate analysis: Clean data is easy to interpret and analyze. For example, a model trained with plain text may struggle to comprehend tabular data.

By cleaning our data — especially unstructured data — we provide the model with reliable and relevant context, which improves generation, reduces the probability of hallucinations, and improves GAI speed and performance, as large volumes of information lead to longer wait times.

How Do We Achieve Data Cleaning?

To help you build your data cleaning toolbox, we’ll explore four NLP techniques and how they help the model.

Step 1: Data Cleaning and Noise Reduction

We’ll start by removing symbols or characters that don’t carry meaning, such as HTML tags (in the case of scraped pages), leftover XML or JSON markup, emojis, and hashtags. Unnecessary characters often confuse the model and increase the number of context tokens, and therefore the computational cost.
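As a rough illustration of this kind of noise removal on a raw string, the sketch below strips HTML tags, decodes HTML entities, drops emojis, and keeps the word behind a hashtag. The emoji pattern covers only a few common Unicode blocks, which is an assumption made for brevity, and `strip_noise` is a hypothetical helper name.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")
HASHTAG_RE = re.compile(r"#(\w+)")
# Assumption: rough emoji coverage (common pictograph and symbol blocks only)
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def strip_noise(raw: str) -> str:
    text = TAG_RE.sub(" ", raw)           # drop HTML/XML tags
    text = html.unescape(text)            # decode entities such as &amp;
    text = EMOJI_RE.sub("", text)         # drop emojis
    text = HASHTAG_RE.sub(r"\1", text)    # keep the hashtag word, drop '#'
    return re.sub(r"\s+", " ", text).strip()

raw = "<p>I love coding! 😊 #PythonProgramming is fun &amp; useful 🐍✨</p>"
cleaned = strip_noise(raw)
print(cleaned)
print(len(raw), "->", len(cleaned), "characters")
```

Comparing the two lengths gives a quick sense of how much context budget the noise was consuming.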

Recognizing that there’s no one-size-fits-all solution, we’ll adapt our methods to different problems and text types using common cleaning techniques:

· Tokenization: Split the text into individual words or tokens.

· Remove noise: Eliminate unwanted symbols, emojis, hashtags, and Unicode characters.

· Normalization: Convert the text to lowercase for consistency.

· Remove stop words: Discard common or repeated words that do not add meaning, such as “a,” “in,” “of,” and “the.”

· Lemmatization or stemming: Reduce words to their base or root form.

Let’s take this tweet for example:

I love coding! 😊 #PythonProgramming is fun! 🐍✨ Let’s clean some text 🧹

While the meaning is clear to us, let’s simplify it for the model by applying common techniques in Python. The following code snippet and all others in this post were generated with the help of ChatGPT.

import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads of the NLTK resources used below
# (newer NLTK releases may also need "punkt_tab")
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

# Sample text with emojis, hashtags, and other characters
text = "I love coding! 😊 #PythonProgramming is fun! 🐍✨ Let’s clean some text 🧹"

# Tokenization
tokens = word_tokenize(text)

# Remove noise (anything that is not a word character or whitespace)
cleaned_tokens = [re.sub(r"[^\w\s]", "", token) for token in tokens]

# Drop tokens that noise removal left empty (punctuation, emojis)
cleaned_tokens = [token for token in cleaned_tokens if token]

# Normalization (convert to lowercase)
cleaned_tokens = [token.lower() for token in cleaned_tokens]

# Remove stop words
stop_words = set(stopwords.words("english"))
cleaned_tokens = [token for token in cleaned_tokens if token not in stop_words]

# Lemmatization
lemmatizer = WordNetLemmatizer()
cleaned_tokens = [lemmatizer.lemmatize(token) for token in cleaned_tokens]

print(cleaned_tokens)

# output:
# [‘love’, ‘coding’, ‘pythonprogramming’, ‘fun’, ‘clean’, ‘text’]

The process has removed irrelevant characters and left us with clean and meaningful text our model can understand: [‘love’, ‘coding’, ‘pythonprogramming’, ‘fun’, ‘clean’, ‘text’].
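If you want to try the same idea without installing NLTK, the following sketch wraps the steps into one function that returns a string ready for chunking. It uses a regular expression as the tokenizer and a tiny hand-picked stop word list, both of which are simplifying assumptions; NLTK’s tokenizer and stop word list are far more complete, and lemmatization is omitted here.

```python
import re

# Assumption: a tiny illustrative stop word list, far shorter than NLTK's
STOP_WORDS = {"a", "an", "and", "i", "in", "is", "of", "s", "some", "the", "to"}

def clean_text(text: str) -> str:
    """Lowercase, keep only word tokens, and drop stop words."""
    tokens = re.findall(r"\w+", text.lower())   # emojis and punctuation never match \w
    kept = [token for token in tokens if token not in STOP_WORDS]
    return " ".join(kept)

tweet = "I love coding! 😊 #PythonProgramming is fun! 🐍✨ Let's clean some text 🧹"
cleaned = clean_text(tweet)
print(cleaned)
# prints: love coding pythonprogramming fun let clean text
```

Note that “let” survives because it is not in this small stop word list; which words count as stop words is a choice you should tune to your own documents.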

Step 2: Text Standardization and Normalization

Next, we should always prioritize consistency and coherence across the text. This is crucial for ensuring accurate retrieval and generation. In the following Python example, let’s scan our text input for spelling errors and other inconsistencies that could lead to inaccuracies and decreased performance.

import re

# Sample text with spelling errors
# (the misspelled forms are illustrative examples)
text_with_errors = """But it's not everthing about more language refinment.
Other important aspect is ensuring accurte retrievel by corecting product name spellings.
Additionally, refning descriptions enhnces the coherense of the contnt."""

# Function to correct spelling errors
def correct_spelling_errors(text):
    # Define dictionary of common spelling mistakes and their corrections
    spelling_corrections = {
        "everthing": "everything",
        "refinment": "refinement",
        "accurte": "accurate",
        "retrievel": "retrieval",
        "corecting": "correcting",
        "refning": "refining",
        "enhnces": "enhances",
        "coherense": "coherence",
        "contnt": "content",
    }

    # Iterate over each key-value pair in the dictionary and replace the
    # misspelled words (matched as whole words) with their correct versions
    for mistake, correction in spelling_corrections.items():
        text = re.sub(rf"\b{re.escape(mistake)}\b", correction, text)

    return text

# Correct spelling errors in the sample text
cleaned_text = correct_spelling_errors(text_with_errors)

print(cleaned_text)
# output
# But it's not everything about more language refinement.
# Other important aspect is ensuring accurate retrieval by correcting product name spellings.
# Additionally, refining descriptions enhances the coherence of the content.

With a cohesive, coherent text representation, our model can now generate accurate and contextually relevant responses. This process also helps semantic search retrieve the most relevant context chunks, which matters particularly for RAG.
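A hard-coded dictionary only catches the mistakes you anticipated. As one possible alternative, the sketch below uses Python’s standard `difflib` module to snap unknown words to the closest entry in a known vocabulary. The vocabulary and the 0.8 similarity cutoff are assumptions chosen for this example, not recommended values.

```python
import difflib
import re

# Hypothetical domain vocabulary; in practice it could come from a glossary or product catalog
VOCABULARY = ["accurate", "retrieval", "content", "coherence"]

def correct_word(word: str, cutoff: float = 0.8) -> str:
    """Return the closest vocabulary entry if one is similar enough, else the word unchanged."""
    lower = word.lower()
    if lower in VOCABULARY:
        return word
    matches = difflib.get_close_matches(lower, VOCABULARY, n=1, cutoff=cutoff)
    return matches[0] if matches else word

def correct_text(text: str) -> str:
    """Apply correct_word to every alphabetic token, leaving punctuation untouched."""
    return re.sub(r"[A-Za-z]+", lambda match: correct_word(match.group()), text)

fixed = correct_text("Ensuring accurte retrievel of contnt.")
print(fixed)
# prints: Ensuring accurate retrieval of content.
```

Fuzzy matching can also “correct” valid words that merely resemble a vocabulary entry (for example, a plural form), so it is worth reviewing the replacements on a sample of your documents before applying it at scale.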

Step 3: Metadata Handling

Metadata collection, such as identifying important keywords and entities, makes it easy for us to recognize elements in the text that we can use to improve semantic search results, especially in enterprise applications such as content recommendation systems. This process provides the model with additional context, often required to improve RAG performance. Let’s apply this step to another Python example.

import spacy
import json

# Load English language model
# (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

# Sample text with metadata candidates
text = """In a blog post titled 'The Top 10 Tech Trends of 2024,'
John Doe discusses the rise of artificial intelligence and machine learning
in various industries. The article mentions companies like Google and Microsoft
as pioneers in AI research. Additionally, it highlights emerging technologies
such as natural language processing and computer vision."""

# Process the text with spaCy
doc = nlp(text)

# Extract named entities and their labels
meta_data = [{"text": ent.text, "label": ent.label_} for ent in doc.ents]

# Convert metadata to JSON format
meta_data_json = json.dumps(meta_data)

print(meta_data_json)

# output (exact entities and labels depend on the model version)
# [
# {"text": "2024", "label": "DATE"},
# {"text": "John Doe", "label": "PERSON"},
# {"text": "Google", "label": "ORG"},
# {"text": "Microsoft", "label": "ORG"},
# {"text": "AI", "label": "ORG"},
# {"text": "natural language processing", "label": "ORG"},
# {"text": "computer vision", "label": "ORG"}
# ]

The code shows how spaCy’s named entity recognition identifies dates, persons, organizations, and other important entities in the text. This helps RAG applications better understand context and relationships between words.
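One way to put this metadata to work is to store it next to each chunk and use it to narrow the candidates before semantic search runs. The sketch below assumes entities are stored in the same `{"text": ..., "label": ...}` shape produced above; the `Chunk` class and `filter_by_entity` helper are hypothetical names for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    # Entities in the same {"text": ..., "label": ...} shape as the spaCy example
    entities: list[dict] = field(default_factory=list)

def filter_by_entity(chunks: list[Chunk], label: str, value: str) -> list[Chunk]:
    """Keep only chunks tagged with an entity of the given label and (case-insensitive) text."""
    value = value.lower()
    return [
        chunk for chunk in chunks
        if any(e["label"] == label and e["text"].lower() == value for e in chunk.entities)
    ]

chunks = [
    Chunk(
        "Google and Microsoft are pioneers in AI research.",
        [{"text": "Google", "label": "ORG"}, {"text": "Microsoft", "label": "ORG"}],
    ),
    Chunk(
        "John Doe discusses the rise of machine learning.",
        [{"text": "John Doe", "label": "PERSON"}],
    ),
]

matches = filter_by_entity(chunks, "ORG", "google")
print([chunk.text for chunk in matches])
```

Filtering on metadata first keeps the similarity search focused on chunks that mention the entity the user asked about.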

Step 4: Contextual Information Handling

When working with LLMs, you’ll often deal with multiple languages or with long documents that cover many topics, both of which can be hard for your model to comprehend. Let’s look at two techniques that can help your model better understand the data.

Let’s start with language translation. Using googletrans, an unofficial Python client for Google Translate, the code translates the original text, “Hello, how are you?” from English to Spanish.

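from googletrans import Translator

# Original text
text = "Hello, how are you?"

# Translate text
# Note: depending on the installed googletrans version, translate() may be
# a coroutine that has to be awaited instead of called directly.
translator = Translator()
translated_text = translator.translate(text, src="en", dest="es").text

print("Original Text:", text)
print("Translated Text:", translated_text)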
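In an ingestion pipeline, you would typically translate only the documents that are not already in the target language, and avoid paying for the same translation twice. The sketch below shows that routing with a pluggable `translate_fn`. Everything here is hypothetical scaffolding: the document dictionaries with a `lang` field, the function names, and the stand-in translator, which simply looks up a fixed phrase so the example runs without network access.

```python
def normalize_language(docs, translate_fn, target="en"):
    """Return document texts in the target language, translating only when needed.
    Identical (language, text) pairs are translated once and cached."""
    cache = {}
    normalized = []
    for doc in docs:
        if doc["lang"] == target:
            normalized.append(doc["text"])
            continue
        key = (doc["lang"], doc["text"])
        if key not in cache:
            cache[key] = translate_fn(doc["text"], doc["lang"], target)
        normalized.append(cache[key])
    return normalized

# Stand-in translator (an assumption for this example); swap in a real client here
calls = []
def fake_translate(text, src, dest):
    calls.append(text)
    return {"Hola, ¿cómo estás?": "Hello, how are you?"}[text]

docs = [
    {"lang": "es", "text": "Hola, ¿cómo estás?"},
    {"lang": "en", "text": "Good morning"},
    {"lang": "es", "text": "Hola, ¿cómo estás?"},
]
result = normalize_language(docs, fake_translate)
print(result)
print("translation calls:", len(calls))
```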

Topic modeling, which includes techniques like clustering, is like organizing a messy room into neat categories: it helps your model identify the topic of a document and sort through lots of information quickly. Latent Dirichlet allocation (LDA), one of the most popular techniques for automating topic modeling, is a statistical model that finds hidden themes in text by looking closely at word patterns.

In the following example, we’ll use sklearn to process a set of documents and identify key topics.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Sample documents
documents = [
"Machine learning is a subset of artificial intelligence.",
"Natural language processing involves analyzing and understanding human languages.",
"Deep learning algorithms mimic the structure and function of the human brain.",
"Sentiment analysis aims to determine the emotional tone of a text."
]

# Convert text into numerical feature vectors
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

# Apply Latent Dirichlet Allocation (LDA) for topic modeling
lda = LatentDirichletAllocation(n_components=2, random_state=42)
lda.fit(X)

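# Display the top five terms for each topic
# (get_feature_names_out() replaces get_feature_names(), which newer scikit-learn releases removed)
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    print("Topic %d:" % (topic_idx + 1))
    print(" ".join([feature_names[i] for i in topic.argsort()[:-5 - 1:-1]]))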

# output
#
#Topic 1:
#learning machine subset artificial intelligence
#Topic 2:
#processing natural language involves analyzing understanding
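Once the topics are fitted, a natural next step is to tag each document with its dominant topic, which can then be stored as metadata for retrieval. The sketch below repeats the setup so it runs on its own and uses `fit_transform`, which returns one row of topic weights per document. The topic numbers it prints depend on the fitted model, so no expected output is shown.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

documents = [
    "Machine learning is a subset of artificial intelligence.",
    "Natural language processing involves analyzing and understanding human languages.",
    "Deep learning algorithms mimic the structure and function of the human brain.",
    "Sentiment analysis aims to determine the emotional tone of a text.",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents)

lda = LatentDirichletAllocation(n_components=2, random_state=42)
# One row per document, one column per topic; each row sums to 1
doc_topics = lda.fit_transform(X)

for doc, weights in zip(documents, doc_topics):
    dominant = weights.argmax() + 1   # 1-based, to match the "Topic 1/2" labels
    print(f"Topic {dominant} ({weights.max():.2f}): {doc}")
```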

If you’d like to explore more topic modeling techniques, we recommend starting with these:

· Non-negative matrix factorization (NMF) is great for things like images where negative values don’t make sense. It’s handy when you need clear, understandable factors. For instance, in image processing, NMF helps extract features without the confusion of negative values.

· Latent semantic analysis (LSA) shines when you have a large volume of text spread across multiple documents and want to find connections between words and documents. LSA uses singular value decomposition (SVD) to identify semantic relationships between terms and documents, helping to streamline tasks like sorting documents by similarity and detecting plagiarism.

· Hierarchical Dirichlet process (HDP) helps you quickly sort through mountains of data and identify topics in a document when you’re unsure how many there are. As an extension of LDA, HDP allows for infinite topics and greater flexibility in modeling. It identifies hierarchical structures in text data for tasks like understanding the organization of topics in academic papers or news articles.

· Probabilistic latent semantic analysis (PLSA) helps you figure out how likely it is for a document to be about certain topics, which can be useful when building a recommendation system that provides personalized recommendations based on past interactions.
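As a starting point for the LSA option in the list above, here is a minimal sketch using scikit-learn’s `TfidfVectorizer` and `TruncatedSVD`, then comparing documents with cosine similarity in the reduced space. The sample documents and the choice of two components are assumptions made to keep the example small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Machine learning is a subset of artificial intelligence.",
    "Natural language processing involves analyzing and understanding human languages.",
    "Deep learning algorithms mimic the structure and function of the human brain.",
    "Sentiment analysis aims to determine the emotional tone of a text.",
]

# Weight terms by TF-IDF, then project documents onto two latent dimensions with SVD
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(documents)
svd = TruncatedSVD(n_components=2, random_state=42)
doc_vectors = svd.fit_transform(X)

# Pairwise similarity between documents in the latent space
similarity = cosine_similarity(doc_vectors)
print(similarity.round(2))
```

With a corpus this small the similarities are only indicative; LSA becomes more useful as the number of documents and shared terms grows.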

DEMO: Cleaning a GAI Text Input

Let’s put it all together with an example. In this demo, we’ve used ChatGPT to generate a conversation between two technologists. We’ll apply basic cleaning techniques to the conversation to show how these practices enable reliable and consistent results.

synthetic_text = """
Sarah (S): Technology Enthusiast
Mark (M): AI Expert
S: Hey Mark! How's it going? Heard about the latest advancements in Generative AI (GA)?
M: Hey Sarah! Yes, I've been diving deep into the realm of GA lately. It's fascinating how it's shaping the future of technology!
S: Absolutely! I mean, GA has been making waves across various industries. What do you think is driving its significance?
M: Well, GA, especially Retrieval Augmented Generative (RAG), is revolutionizing content generation. It's not just about regurgitating information anymore; it's about creating contextually relevant and engaging content.
S: Right! And with Machine Learning (ML) becoming more sophisticated, the possibilities seem endless.
M: Exactly! With advancements in ML algorithms like GPT (Generative Pre-trained Transformer), we're seeing unprecedented levels of creativity in AI-generated content.
S: But what about concerns regarding bias and ethics in GA?
M: Ah, the age-old question! While it's true that GA can inadvertently perpetuate biases present in the training data, there are techniques like Adversarial Training (AT) that aim to mitigate such issues.
S: Interesting! So, where do you see GA headed in the next few years?
M: Well, I believe we'll witness a surge in applications leveraging GA for personalized experiences. From virtual assistants to content creation tools, GA will become ubiquitous in our daily lives.
S: That's exciting! Imagine AI-powered virtual companions tailored to our preferences.
M: Indeed! And with advancements in Natural Language Processing (NLP) and computer vision, these virtual companions will be more intuitive and lifelike than ever before.
S: I can't wait to see what the future holds!
M: Agreed! It's an exciting time to be in the field of AI.
S: Absolutely! Thanks for sharing your insights, Mark.
M: Anytime, Sarah. Let's keep pushing the boundaries of Generative AI together!
S: Definitely! Catch you later, Mark!
M: Take care, Sarah!
"""

Step 1: Basic Cleanup

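First, let’s apply the same cleanup as before to the conversation: strip punctuation and other non-word characters, lowercase the text, remove stop words, and lemmatize.

# Reuses the imports and NLTK resources from the first example

# Tokenization
tokens = word_tokenize(synthetic_text)

# Remove noise (anything that is not a word character or whitespace)
cleaned_tokens = [re.sub(r'[^\w\s]', '', token) for token in tokens]

# Drop tokens that noise removal left empty
cleaned_tokens = [token for token in cleaned_tokens if token]

# Normalization (convert to lowercase)
cleaned_tokens = [token.lower() for token in cleaned_tokens]

# Remove stop words
stop_words = set(stopwords.words('english'))
cleaned_tokens = [token for token in cleaned_tokens if token not in stop_words]

# Lemmatization
lemmatizer = WordNetLemmatizer()
cleaned_tokens = [lemmatizer.lemmatize(token) for token in cleaned_tokens]

print(cleaned_tokens)

# Assumption: the cleaned context passed to the model later
# (new_content_string) is the cleaned tokens rejoined with spaces
new_content_string = ' '.join(cleaned_tokens)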

Step 2: Prepare Our Prompt

Next, we’ll craft a prompt, asking the model to respond as a friendly customer service agent based on information it gleaned from our synthetic conversation.

MESSAGE_SYSTEM_CONTENT = """You are a customer service agent that helps
a customer with answering questions. Please answer the question based on the
provided context below.
Make sure not to make any changes to the context if possible,
when prepare answers so as to provide accurate responses. If the answer
cannot be found in context, just politely say that you do not know,
do not try to make up an answer."""

Step 3: Prepare the Interaction

Let’s prepare our interaction with the model. In this example, we’ll use GPT-4.

from openai import OpenAI

# Reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI()

def response_test(question: str, context: str, model: str = "gpt-4"):
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": MESSAGE_SYSTEM_CONTENT,
            },
            {"role": "user", "content": question},
            {"role": "assistant", "content": context},
        ],
    )

    return response.choices[0].message.content
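The function above passes the context as an assistant message. Another layout you may come across places the retrieved context and the question together in the user message. The helper below sketches that alternative without calling any API; `build_messages` is a hypothetical name, and this is not the layout used for the results reported in this post.

```python
def build_messages(system_prompt: str, question: str, context: str) -> list[dict]:
    """Build a chat message list with context and question combined in the user turn."""
    user_content = f"Context:\n{context}\n\nQuestion: {question}"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
    ]

messages = build_messages(
    "Answer only from the provided context.",
    "What is Adversarial Training used for?",
    "adversarial training at aim mitigate bias",
)
for message in messages:
    print(message["role"], "->", message["content"][:40])
```

The resulting list can be passed as the `messages` argument of a chat completion call if you want to compare both layouts on your own data.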

Step 4: Prepare the Question

Finally, let’s ask the model a question and compare the results before and after cleaning.

question1 = (
    "What are some specific techniques in Adversarial Training (AT) "
    "that can help mitigate biases in Generative AI models?"
)

Before cleaning, our model generates this response:

response = response_test(question1, synthetic_text)
print(response)

#Output
# I'm sorry, but the context provided doesn't contain specific techniques in Adversarial Training (AT) that can help mitigate biases in Generative AI models.

After cleaning, the model generates the following response. With enhanced understanding enabled by basic cleaning techniques, the model can provide a more thorough answer.

response = response_test(question1, new_content_string)
print(response)
#Output:
# The context mentions Adversarial Training (AT) as a technique that can
# help mitigate biases in Generative AI models. However, it does not provide
# any specific techniques within Adversarial Training itself.
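A single question is a thin basis for comparison. If you want to repeat the before-and-after check across several questions, a small harness like the one below can help. It takes any function with the same `(question, context)` signature as `response_test`; the `compare_contexts` helper and the stub used here in place of a real model call are assumptions for illustration.

```python
def compare_contexts(questions, contexts, ask_fn):
    """Ask every question against every named context and collect the answers."""
    results = {}
    for question in questions:
        results[question] = {
            name: ask_fn(question, context) for name, context in contexts.items()
        }
    return results

# Stub in place of a real model call, so the sketch runs offline
def stub_ask(question, context):
    return f"(stub) context has {len(context.split())} words"

report = compare_contexts(
    ["What is AT?"],
    {"raw": "S: Hey Mark! How's it going?", "cleaned": "hey mark going"},
    stub_ask,
)
for question, answers in report.items():
    for name, answer in answers.items():
        print(f"{question} [{name}]: {answer}")
```

Swapping `stub_ask` for `response_test` would run the same comparison against the model, with the raw and cleaned conversation as the two contexts.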

A Brighter Future for AI-Generated Outcomes

RAG models offer several advantages, including enhanced reliability and coherence of AI-generated results by providing relevant context. This contextualization significantly improves the accuracy of AI-generated content.

To get the most out of your RAG models, robust data cleaning techniques are essential during document ingestion. These techniques address discrepancies, imprecise terminology, and other potential errors within textual data, significantly improving the quality of input data. When operating on cleaner, more reliable data, RAG models deliver more accurate and meaningful results, enabling AI use cases with better decision-making and problem-solving capabilities across domains.

Have you explored additional methods to improve RAG model performance? Let us know as we continue to refine and improve its capabilities.

About the Authors

Eduardo Rojas Oviedo, Platform Engineer, Intel

Eduardo Rojas Oviedo is a dedicated RAG developer within Intel’s dynamic and innovative team. With a specialization in cutting-edge developer tools for AI, Machine Learning, and NLP, he is passionate about leveraging technology to create impactful solutions. Eduardo’s expertise lies in building robust and innovative applications that push the boundaries of what’s possible in the realm of artificial intelligence. His commitment to sharing knowledge and advancing technology drives his ongoing pursuit of excellence in the field.

Ezequiel Lanza, Open Source AI Evangelist, Intel

Ezequiel Lanza is an open source AI evangelist on Intel’s Open Ecosystem team, passionate about helping people discover the exciting world of AI. He’s also a frequent AI conference presenter and creator of use cases, tutorials, and guides to help developers adopt open source AI tools. He holds an MS in data science. Find him on X at @eze_lanza and LinkedIn at /eze_lanza
