Enhancing Retrieval-Augmented Generation: Tackling Polysemy, Homonyms and Entity Ambiguity with GLiNER for Improved Performance

Mollel Michael (PhD)
9 min read · Mar 16, 2024

Large Language Models (LLMs) have captured people's attention with their ability to solve many general problems. Still, when it comes to emerging knowledge or domain-specific tasks, LLMs tend to fail, either by hallucinating or by not giving the correct answer. Retrieval-Augmented Generation (RAG) is one of the preferred solutions to this problem.

The code for this article is available in the GitHub repo.

Nevertheless, RAG has its challenges. One significant hurdle is accurately searching for specific content within the supplemental data. Issues such as polysemy, homonyms, and entity ambiguity can exacerbate the problem, producing results that confuse more than they clarify. To elucidate these concepts further:

  • Homonyms refer to words that are identical in spelling or pronunciation but diverge in meaning and origin. For instance, “bank” can denote either a financial institution or the side of a river, while “bat” might refer to a nocturnal flying mammal or a piece of sports equipment.
  • Polysemy involves a single word or phrase that carries multiple related meanings. An example is the word “head,” which can signify the physical part of a body, the leader of an organization, or the foremost part of something (like the head of a line). Here, the different senses of “head” are interconnected, stemming from a common semantic root of being uppermost, leading, or principal.
  • Entity Ambiguity arises when a term could point to several distinct entities within a text, making it unclear which one is intended.

Both polysemy and homonyms significantly contribute to entity ambiguity. In RAG, such ambiguity within a query or its context can severely impede a Large Language Model’s (LLM’s) ability to comprehend the inquiry and deliver an accurate response.

Consider these examples:

  • “I saw her duck under the table.” Does “duck” refer to the animal or the action of crouching?
  • “The lead mine was profitable.” Is “lead” the metal, or the verb meaning “to guide”?

However, it’s crucial to understand that entity ambiguity isn’t solely the result of polysemy and homonymy. This issue can also emerge when a name or phrase could refer to several distinct real-world entities of the same category (for example, different individuals or cities sharing the same name). This form of ambiguity generates challenges in accurately identifying the intended entity, irrespective of the name being polysemous or a homonym.

In this article, I will walk you through developing a simple RAG system. The system uses GLiNER, an open-source Named Entity Recognition (NER) model developed by Urchade Zaratiana and colleagues. We'll also use LlamaIndex as the framework for the LLM chat pipeline, with Mistral (mistral-7b-instruct-v0.1.Q4_K_M.gguf) as the LLM. Please consult the link in my previous articles for guidance on configuring your Windows environment for this setup.

The process is outlined as follows:

1. Begin with a user input query that states the information to retrieve:

query = """
"In the recent advancements and initiatives related to water conservation
and sustainability, how has Jordan's work influenced policies in the
Jordan Valley, and what role does the Jordan brand play in these
environmental efforts, considering the support from the Jordan River
Foundation?"
"""

2. The user’s query is processed through GLiNER, which identifies entities and embeds their labels in the query text. This involves loading your preferred GLiNER model and rewriting the query with the identified entities.

2.1 Start by loading the GLiNER model of your choice. You can load it directly from HuggingFace or download it for local execution. If you use a local copy, set the parameter local_files_only to True. For instructions on how to use GLiNER, please refer to the following Link

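# With the pip package (pip install gliner). If you work from a clone of the
# GLiNER repository instead, the import may be `from model import GLiNER`.
from gliner import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_base", local_files_only=True)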

2.2 I have created nearly 20 entity labels, but you can add more according to your needs. These labels are saved in a JSON file. Once the labels are loaded, the model predicts which spans of the query match them. A new query is then generated by inserting the identified entity labels into the original query.
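
For illustration, a label file of this kind might be created as follows. This is only a sketch: `Concept`, `Organizations`, and `Locations` are the labels that appear in the output later in this article, while the remaining labels and the two helper functions are hypothetical placeholders, not the author's actual list.

```python
import json
import os
import tempfile

# Hypothetical label set: the first three labels appear in this article's
# output; the rest are illustrative guesses, not the author's actual list.
labels = [
    "Concept", "Organizations", "Locations",
    "Person", "Date", "Event", "Product",
]

def save_labels(labels, path):
    """Write the entity labels to a JSON file as a plain list of strings."""
    with open(path, "w", encoding="utf-8") as file:
        json.dump(labels, file, indent=2)

def load_labels(path):
    """Read the entity labels back from the JSON file."""
    with open(path, "r", encoding="utf-8") as file:
        return json.load(file)

# Round-trip through a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    label_path = os.path.join(tmp_dir, "label_entity.json")
    save_labels(labels, label_path)
    loaded_labels = load_labels(label_path)

print(loaded_labels)
```

Keeping the labels in a separate file means the label set can be extended without touching the pipeline code.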

import json

# Load the labels stored in the label_entity JSON file
with open('label_entity.json', 'r') as file:
    labels = json.load(file)

# Predict the entities using GLiNER
entities = model.predict_entities(query, labels)

# Function to refine the query by adding entity labels
def refine_text_with_entities(text, entities):
    # Replace from the last entity to the first so earlier offsets stay valid
    sorted_entities = sorted(entities, key=lambda x: x['start'], reverse=True)
    for entity in sorted_entities:
        text = text[:entity['start']] + f"[{entity['text']}: {entity['label']}] " + text[entity['end']:]
    return text

# Refine the original text
query_en = refine_text_with_entities(query, entities)
query_en = """
"In the recent advancements and initiatives related to [water conservation: Concept] and [sustainability: Concept] , how has [Jordan: Organizations] 's work
influenced [policies: Concept] in the [Jordan Valley: Locations] , and what role does the [Jordan brand: Organizations] play
in these [environmental efforts: Concept] , considering
the support from the [Jordan River Foundation: Organizations] ?"
"""
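
The reverse sort in refine_text_with_entities matters: each replacement lengthens the text, so tagging from the last span to the first keeps the earlier character offsets valid. The sketch below shows this on a short string. The entity dictionaries are written by hand with the keys the function reads (`text`, `label`, `start`, `end`); no model is run, and the `make_entity` helper is a hypothetical convenience for this example.

```python
def refine_text_with_entities(text, entities):
    """Same logic as the function above: tag each entity span in place."""
    # Work from the last span to the first so that earlier offsets stay valid
    # while the text grows.
    for entity in sorted(entities, key=lambda x: x["start"], reverse=True):
        tag = f"[{entity['text']}: {entity['label']}] "
        text = text[:entity["start"]] + tag + text[entity["end"]:]
    return text

def make_entity(text, surface, label):
    """Build a hand-written entity dict; offsets are found with str.find."""
    start = text.find(surface)
    return {"text": surface, "label": label,
            "start": start, "end": start + len(surface)}

sample = "policies in the Jordan Valley and the Jordan brand"
entities = [
    make_entity(sample, "Jordan Valley", "Locations"),
    make_entity(sample, "Jordan brand", "Organizations"),
]

refined = refine_text_with_entities(sample, entities)
print(refined)
```

Each tag ends with a space, which is why the refined query above shows a gap before punctuation such as commas and question marks.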

Notice how the word “Jordan” appears four times, each time as part of a different entity, tagged as either an organization or a location. To evaluate the significance of this distinction, we passed the query to mistral-7b-instruct-v0.1.Q4_K_M.gguf and compared the results without named entities to those with named entities included.
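
To make the distinction concrete, the tagged query can be scanned for every tag that contains a given word. The sketch below assumes the `[text: label]` format produced by the refinement step; the `senses_of` function and the regular expression are illustrative and not part of GLiNER or LlamaIndex. The query string is a shortened copy of the refined query.

```python
import re
from collections import defaultdict

# Matches tags of the form "[entity text: Label]" produced by the refinement step.
TAG_PATTERN = re.compile(r"\[([^\[\]:]+): ([^\[\]]+)\]")

def senses_of(term, tagged_text):
    """Group tagged entities whose text contains `term` by their label."""
    senses = defaultdict(list)
    for entity_text, label in TAG_PATTERN.findall(tagged_text):
        if term in entity_text:
            senses[label].append(entity_text)
    return dict(senses)

query_en = (
    "how has [Jordan: Organizations] 's work influenced [policies: Concept] "
    "in the [Jordan Valley: Locations] , and what role does the "
    "[Jordan brand: Organizations] play, considering the support from the "
    "[Jordan River Foundation: Organizations] ?"
)

jordan_senses = senses_of("Jordan", query_en)
print(jordan_senses)
```

A term that shows up under more than one label, or in several distinct entities under the same label, is a candidate for the kind of ambiguity discussed earlier.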

3. Load LLM and embedding
In this phase, you load the Large Language Model (LLM) and set up the embedding model. You can use the default OpenAI model for the LLM or choose a different one that suits your preferences. I chose Mistral for the LLM and TogetherEmbedding for the embeddings; LlamaIndex accommodates either choice.

# Import paths assume llama-index >= 0.10 with the llama-cpp and Together
# integration packages installed
from llama_index.core import Settings
from llama_index.embeddings.together import TogetherEmbedding
from llama_index.llms.llama_cpp import LlamaCPP
from llama_index.llms.llama_cpp.llama_utils import (
    completion_to_prompt,
    messages_to_prompt,
)

path_to_local_GGUF = "mistral-7b-instruct-v0.1.Q4_K_M.gguf"

llm = LlamaCPP(
    # You can pass in the URL to a GGML model to download it automatically
    # model_url=None,
    # optionally, you can set the path to a pre-downloaded model instead of model_url
    model_path=path_to_local_GGUF,
    temperature=0.1,
    max_new_tokens=500,
    # keep the context window below the model's limit to allow for some wiggle room
    context_window=2000,
    # kwargs to pass to __call__()
    generate_kwargs={},
    # kwargs to pass to __init__()
    # set to at least 1 to use GPU
    model_kwargs={"n_gpu_layers": 20},
    # transform inputs into Llama2 format
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)

TOGETHER_API_KEY = "XXXXX"  # placeholder, replace with your own key
Settings.embed_model = TogetherEmbedding(
    model_name="togethercomputer/m2-bert-80M-8k-retrieval", api_key=TOGETHER_API_KEY
)
Settings.llm = llm
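
The API key above is a placeholder. One common alternative to hard-coding a key is to read it from an environment variable, sketched below. The variable names and the `read_api_key` helper are assumptions made for this example; nothing in the article or in the Together client requires them.

```python
import os

def read_api_key(env_var="TOGETHER_API_KEY"):
    """Return the API key stored in an environment variable.

    Raises a clear error instead of silently passing an empty key along.
    """
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return api_key

# Demonstration only: a dummy value in a separate variable stands in for a real key.
os.environ["DEMO_TOGETHER_API_KEY"] = "dummy-key-for-illustration"
api_key = read_api_key("DEMO_TOGETHER_API_KEY")
print(len(api_key))
```

The returned value can then be passed as the api_key argument in place of the placeholder string.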

4. Data load and inference:

4.1 So that you can reproduce the findings, I’ve included the dataset I loaded into the RAG system in the file named “data.pdf.” Among the various texts in this document, the specific paragraph from which I expected the answer to be derived is the following:

In the same year that Jordan, the country, navigated through complex 
diplomatic waters to secure a groundbreaking agreement with Israel,
Jordan, a leading environmental scientist from the University of Amman,
made headlines with his innovative research on desalination processes.
His work, coincidentally funded by the Jordan River Foundation, not only
highlighted the potential for significant advancements in water purification
techniques but also sparked interest among policymakers in the Jordan Valley
region. This interest came at a time when Jordan, the footwear brand,
released its sustainability report, showcasing efforts to reduce water
usage in its manufacturing processes, drawing accolades from environmental
groups like the Jordan Conservation Coalition, named after the Jordan River,
not the country or the scientist.

4.2 All the data in the data directory is loaded using the default chunk size. Two responses are then produced: one from the original query and the other from the refined query.

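from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load every file in the "data" directory and build a vector index over it
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine()

# Answer generated from the entity-refined query
answer_refined_query = query_engine.query(query_en)

# Answer generated from the original query
answer_original_query = query_engine.query(query)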
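
Because both answers depend on which chunks the retriever returns, it can help to look at those chunks directly. The sketch below assumes the response object exposes a `source_nodes` list whose items carry a `score` and a `node` with a `get_content()` method, as LlamaIndex responses generally do; please verify this against your installed version. The `summarize_sources` helper is hypothetical, and stub objects with made-up scores stand in for real responses so the snippet runs without a model.

```python
from types import SimpleNamespace

def summarize_sources(response, max_chars=80):
    """Return (score, text preview) pairs for the chunks behind a response."""
    summary = []
    for source in response.source_nodes:
        content = source.node.get_content()
        # Collapse whitespace so multi-line chunks fit on one line.
        preview = " ".join(content.split())[:max_chars]
        summary.append((source.score, preview))
    return summary

def make_stub_response(chunks):
    """Build a stand-in response from (score, text) pairs, for demonstration only."""
    nodes = [
        SimpleNamespace(
            score=score,
            node=SimpleNamespace(get_content=lambda text=text: text),
        )
        for score, text in chunks
    ]
    return SimpleNamespace(source_nodes=nodes)

# Stub data: the scores are invented and the texts are shortened from data.pdf.
stub = make_stub_response([
    (0.82, "Jordan, the footwear brand,\nreleased its sustainability report."),
    (0.64, "Policymakers in the Jordan Valley region took interest."),
])

sources = summarize_sources(stub)
for score, preview in sources:
    print(f"{score:.2f}  {preview}")
```

With real results, calling the helper on answer_refined_query and answer_original_query shows whether the two queries retrieved different chunks.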

Results

After completing the inference, we compared the answers generated from the original and the refined queries, shown below. The comparison indicates how adding entity labels to the query affects the precision and relevance of the answer. As a next step, one can also apply this trick to the raw data source.

answer_refined_query
In recent advancements and initiatives related to water conservation and 
sustainability, Jordan's work has influenced policies in the Jordan Valley
in several ways. The Ministry of Water and Irrigation has been actively
involved in reducing water loss from the network due to leakage and theft,
which would save two percent annually.
The country also plans to expand water harvesting programs and dams.
Additionally, the Jordan brand has played a significant role in these
environmental efforts by encouraging citizens to build tanks to collect
rainwater. The Jordan River Foundation has also supported these
initiatives by providing funding and resources to promote sustainable
water management practices in the Jordan Valley.

Jordan's water deficit in the summer period is estimated at about 450 million
cubic metres annually, while its need is estimated at one and a half billion
cubic metres. In response to this deficit, Jordan signed an agreement with
Israel in 2021 to purchase an additional 50 million cubic metres of water,
in addition to what was stipulated in the peace agreement signed between
the two countries in 1994. This agreement was subject to technical
arrangements and should not be subject to any political dimensions outside
the framework of the peace treaty. Jordan's efforts towards water conservation
and sustainability have had a positive impact on policies related to these
issues in the Jordan Valley.

answer_original_query
Jordan has been working on several sustainable projects related to water 
conservation and sustainability, including the national carrier for desalinating
the Red Sea water and transporting 300 million cubic metres of desalinated
water from the Gulf of Aqaba on the Red Sea to all regions of the Kingdom
annually. These projects aim to reduce water loss from the network due to
leakage and theft, which would save two percent annually, and expand water
harvesting programs and dams. Additionally, Jordan has programs to encourage
citizens to build tanks to collect rainwater and treats the water from
sewage plants to use for agricultural and industrial purposes with a
capacity of 200 million cubic metres annually.

The Jordan brand plays an important role in these environmental efforts,
considering the support from the Jordan River Foundation.
The Jordan River Foundation is a non-governmental organization that aims to
promote sustainable development in Jordan and the region. The foundation
has been working on several projects related to water conservation and
sustainability, including the development of water harvesting systems,
the promotion of rainwater harvesting, and the protection of water resources.
The foundation has also been supporting the implementation of the
Jordanian government's policies related to water conservation and
sustainability.

In response to Israeli media reports, Jordanian Minister of Government
Communications and Cabinet Spokesman Muhannad Mubaideen told the
US-funded Al-Hurra TV network that Jordan buys a set quantity of water
from Israel and pays for it. He added that Jordan asked to study the
details of the agreement and based on it, either Israel sells water or does
not sell it. Mohammad Momani, a Jordanian member of the Senate and a former
Minister of State for Media Affairs, told The New Arab that Jordan is part
of a purchase agreement with Israel, and this deal is subject to technical
arrangements and should not be subject to any political dimensions outside
the framework of the peace treaty. He added that there should be cooperation
between the two countries, and if there are political issues they should be
brought to the political table. Israel must first control the statements of
its ministers that are devoid of all international and moral standards.

Conclusion

Based on the question’s focus on the influence of Jordan’s work on policies in the Jordan Valley, the role of the Jordan brand, and the involvement of the Jordan River Foundation in environmental efforts, answer_refined_query provides a more direct and relevant answer to the question. It succinctly addresses the key points:

  1. Jordan’s Influence on Policies: answer_refined_query outlines how the Ministry of Water and Irrigation has been involved in initiatives like reducing water loss and expanding water harvesting programs, directly correlating to policy influence in the Jordan Valley.
  2. Role of the Jordan Brand: It mentions the Jordan brand’s contribution by encouraging citizens to build tanks to collect rainwater, directly linking the brand to environmental efforts.
  3. Support from the Jordan River Foundation: It highlights the Foundation’s role in providing funding and resources to promote sustainable water management practices in the Jordan Valley.

While answer_original_query provides detailed information about the projects and the diplomatic communications regarding water agreements with Israel, it connects the Jordan brand’s activities and the Jordan River Foundation’s support less explicitly to the question’s focus on policies in the Jordan Valley and the brand’s role in environmental efforts. Therefore, answer_refined_query answers the question more directly and relevantly within the given context.
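
The Results section suggests applying the same trick to the raw data source. A minimal sketch of that idea follows. The `annotate_chunks` function and the stub predictor are hypothetical; in practice the predictor would be something like `lambda text: model.predict_entities(text, labels)`, and the annotated chunks would be indexed in place of the raw text. Whether this improves retrieval is not evaluated here.

```python
def refine_text_with_entities(text, entities):
    """Tag each entity span in place, working from the last span to the first."""
    for entity in sorted(entities, key=lambda x: x["start"], reverse=True):
        tag = f"[{entity['text']}: {entity['label']}] "
        text = text[:entity["start"]] + tag + text[entity["end"]:]
    return text

def annotate_chunks(chunks, predict_entities):
    """Apply the entity tagging to every text chunk before indexing."""
    return [
        refine_text_with_entities(chunk, predict_entities(chunk))
        for chunk in chunks
    ]

def stub_predictor(text):
    """Stand-in for a NER model: tags fixed phrases via a lookup table."""
    known = {
        "Jordan Valley": "Locations",
        "Jordan River Foundation": "Organizations",
    }
    entities = []
    for surface, label in known.items():
        start = text.find(surface)
        if start != -1:
            entities.append({"text": surface, "label": label,
                             "start": start, "end": start + len(surface)})
    return entities

chunks = [
    "His work was funded by the Jordan River Foundation.",
    "It sparked interest among policymakers in the Jordan Valley region.",
    "No named entities here.",
]

annotated = annotate_chunks(chunks, stub_predictor)
for chunk in annotated:
    print(chunk)
```

Passing the predictor in as a function keeps the sketch independent of any particular NER model, so the same loop works with GLiNER or with another tagger.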
