Automated Knowledge Graph Construction with Large Language Models — Part 2

Harvesting the Power and Knowledge of Large Language Models

Research Graph
6 min read · May 13, 2024
Image generated with Google’s Gemini, 12 May 2024.

Introduction

Knowledge graphs (KGs) are a structured, graph-based representation of data, in which entities are represented by nodes connected by edges that capture the relationships between them. They have been employed across numerous domains, such as retail, healthcare, and search engines. However, one critical factor limiting their adoption is the difficult and costly knowledge graph construction (KGC) process, which involves multiple steps and substantial human annotation or guidance.

In a previous article, Automated Knowledge Graph Construction with Large Language Models, we reviewed four methods that leveraged large language models (LLMs) for information extraction and ontology creation in KGC. In this article, we will follow up by reviewing three newer models that, unlike the models in Part 1, prompt LLMs to generate KG triples directly when given data.

TKGCon

The first method is TKGCon, a framework for theme-specific knowledge graph (ThemeKG) construction. In contrast to open-world KGs, which contain general knowledge across a broad range of domains, domain-specific KGs contain specialised facts about particular topics. Even domain-specific KGs, however, lack fine-grained information, since they must cover a topic comprehensively from many sources. The need for human input in KGC also slows construction, so KGs fail to keep up with evolving world knowledge. TKGCon addresses both concerns with an automatic framework that produces fine-grained theme KGs. It comprises two stages: theme ontology construction and theme KG construction.

The TKGCon framework, which consists of two main steps, theme ontology construction and theme KG construction. Source: Ding et al. (2024), https://doi.org/10.48550/ARXIV.2404.19146
1. Theme Ontology Construction

  • Given a specific theme (such as “Electric Vehicle Battery”), an entity ontology is obtained from the hierarchies of entity categories on Wikipedia.
  • For every pair of entity categories, an LLM is prompted to generate potential relations between them, forming the relation ontology.

2. Theme KG Construction

  • Given theme-specific documents, entity mentions in the text are identified (entity recognition).
  • Entity mentions are mapped to the closest category in the entity ontology (entity typing).
  • For pairs of entity mentions, candidate relations are retrieved from the relation ontology based on their entity categories (candidate relations retrieval).
  • Contextual information, in the form of neighbouring sentences, is given to the LLM together with the candidate relations, and the LLM selects the most suitable relation (relation extraction). A minimal sketch of this stage follows the list.
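
To make the second stage concrete, here is a minimal Python sketch of the relation extraction step. The call_llm() helper (a thin wrapper around a chat model), the toy relation ontology, and the function names are illustrative assumptions, not the authors' implementation.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def call_llm(prompt: str) -> str:
    """Single-turn call to a chat model; swap in any LLM backend you prefer."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

# Toy relation ontology: (head category, tail category) -> candidate relations.
relation_ontology = {
    ("battery material", "battery property"): ["improves", "degrades", "has no effect on"],
}

def extract_relation(head, head_cat, tail, tail_cat, context):
    """Pick the most suitable relation for an entity pair from its candidate set."""
    candidates = relation_ontology.get((head_cat, tail_cat), [])
    prompt = (
        f"Context: {context}\n"
        f"Head entity: {head} (category: {head_cat})\n"
        f"Tail entity: {tail} (category: {tail_cat})\n"
        f"Candidate relations: {', '.join(candidates)}\n"
        "Reply with the single most suitable relation from the candidates."
    )
    return call_llm(prompt)

relation = extract_relation(
    "silicon anode", "battery material",
    "energy density", "battery property",
    context="Silicon anodes can significantly increase the energy density of lithium-ion cells.",
)
print(("silicon anode", relation, "energy density"))
```
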
GPT-4 can produce up-to-date information when augmented with Theme KGs. Source: Ding et al. (2024), https://doi.org/10.48550/ARXIV.2404.19146

The resultant ThemeKGs can also be used for retrieval augmentation of LLMs. In their experiment, Ding et al. (2024) found that GPT-4 lacked up-to-date information, and still missed information when used with retrieval-augmented generation (RAG). Augmenting GPT-4 with ThemeKGs produced the best results, as information could be integrated directly from triples without summarising or reasoning over documents.
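
As a rough illustration of what such augmentation could look like in practice (not the authors' exact setup), the sketch below retrieves triples whose entities appear in the question and prepends them to the prompt. The theme_kg list and answer_with_theme_kg() are hypothetical names, and call_llm() is the assumed chat helper from the previous sketch.

```python
# Assumes the call_llm() chat helper defined in the previous sketch.
theme_kg = [
    ("solid-state battery", "uses", "solid electrolyte"),
    ("solid electrolyte", "improves", "thermal stability"),
]

def answer_with_theme_kg(question: str) -> str:
    """Prepend triples whose entities appear in the question, then ask the LLM."""
    q = question.lower()
    relevant = [t for t in theme_kg if t[0] in q or t[2] in q]
    facts = "\n".join(f"({h}, {r}, {t})" for h, r, t in relevant)
    prompt = (
        "Use the following knowledge graph triples as facts when answering.\n"
        f"{facts}\n\n"
        f"Question: {question}"
    )
    return call_llm(prompt)

print(answer_with_theme_kg("Why might a solid-state battery offer better thermal stability?"))
```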

CodeKGC

Bi et al. (2023) framed KGC differently, recasting it as a code generation task. They postulated that code LLMs could model the structure of graphs more effectively than natural-language LLMs, especially since code LLMs have been used successfully for complex reasoning, structured commonsense reasoning, and structured prediction tasks. CodeKGC uses two main components: schema-aware prompt generation and rationale-enhanced generation.

Overview of CodeKGC. A schema prompt is defined to guide the LLM to convert natural language to a code format. Source: Bi et al. (2023), https://doi.org/10.48550/ARXIV.2304.09048

Using a schema prompt containing entities, relations, properties, and constraints, natural language is transformed into a code format that can be fed into a code LLM. Part of the schema prompt consists of a base definition containing Relation and Entity classes, from which the specific entities in the schema inherit.
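
A rough Python-style sketch of what such a schema prompt might contain is shown below; the class names and the in-context example are simplified illustrations in the spirit of CodeKGC, not prompts copied from the paper.

```python
# Base definitions included in the schema prompt.
class Entity:
    def __init__(self, name: str):
        self.name = name

class Relation:
    def __init__(self, name: str):
        self.name = name

class Triple:
    def __init__(self, head: Entity, relation: Relation, tail: Entity):
        self.head, self.relation, self.tail = head, relation, tail

# Schema-specific classes inherit from the base definitions.
class Person(Entity): ...
class Organisation(Entity): ...
class WorksFor(Relation): ...

# In-context demonstration pairing text with its code-style extraction:
# "Alice is a software engineer at Acme Corp."
demonstration = [
    Triple(Person("Alice"), WorksFor("works for"), Organisation("Acme Corp")),
]
```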

Rationale-enhanced generation includes three steps: relationship identification, entity extraction, and final KG construction.

Optional rationale enhancement can also be employed in the generation step, as guiding an LLM step by step has been shown to improve its reasoning abilities. Walking the LLM through relationship identification, entity extraction, and finally KG construction yields higher-quality triples.
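
A hedged sketch of how that rationale might appear inside the code prompt, with the intermediate steps written as comments and variables (illustrative wording, reusing the class names from the previous sketch):

```python
# A code-style prompt fragment in which the rationale appears as intermediate steps;
# the LLM is asked to complete analogous steps for new input text.
rationale_demonstration = '''
# Text: "Alice is a software engineer at Acme Corp."
# Step 1 - relationship identification: the text expresses an employment relation.
candidate_relations = [WorksFor]
# Step 2 - entity extraction: find the entities taking part in that relation.
candidate_entities = [Person("Alice"), Organisation("Acme Corp")]
# Step 3 - knowledge graph construction: combine them into triples.
triples = [Triple(Person("Alice"), WorksFor("works for"), Organisation("Acme Corp"))]
'''
```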

Extract-Define-Canonicalise (EDC)

Zhang and Soh (2024) identified several other issues with LLM-based KGC techniques. Firstly, such models typically do not scale easily: the KG schema has to be included in the LLM prompt, but prompts are limited by the size of the context window. Secondly, redundancy and ambiguity can creep into the KG, since LLMs can introduce multiple equivalent relations expressed in different ways.

Three subtasks in the Extract-Define-Canonicalise (EDC) framework. Source: Zhang & Soh (2024), https://doi.org/10.48550/ARXIV.2404.03868

Hence, they proposed the Extract-Define-Canonicalise (EDC) framework, which works whether or not a predefined schema is available. In EDC, KGC is performed through three subtasks that target these challenges, followed by an optional fourth step to further improve performance:

  1. Open Information Extraction

LLMs are tasked with extracting triples from text. As this step is independent of any schema, the results may contain redundant information that is eliminated in the subsequent steps.
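
For illustration, a minimal open-extraction prompt might look like the sketch below, which assumes the same kind of call_llm() chat wrapper as the earlier sketches; the prompt wording and the open_extract() name are assumptions, not the authors' prompt.

```python
import json

# Assumes the call_llm() chat helper from the earlier TKGCon sketch.
def open_extract(text: str) -> list:
    """Ask the LLM for schema-free (subject, relation, object) triples as JSON."""
    prompt = (
        "Extract all factual (subject, relation, object) triples from the text below. "
        "Return only a JSON list of 3-element lists.\n\n"
        f"Text: {text}"
    )
    return json.loads(call_llm(prompt))

triples = open_extract("Marie Curie received the Nobel Prize in Physics in 1903.")
# e.g. [["Marie Curie", "received", "Nobel Prize in Physics"],
#       ["Marie Curie", "received prize in", "1903"]]
```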

2. Schema Definition

In this step, LLMs are prompted to define schema components. For example, the LLM could be asked to write definitions for each relation present in the extracted triples. These definitions are fed into the next step as supporting information for canonicalisation.
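
A minimal sketch of this step, again assuming the call_llm() helper from earlier; define_relation() and the prompt wording are illustrative:

```python
# Assumes the call_llm() chat helper from the earlier sketches.
def define_relation(relation: str, example_triple: tuple) -> str:
    """Ask the LLM for a one-sentence definition of an extracted relation."""
    prompt = (
        f"The relation '{relation}' appears in the triple {example_triple}. "
        "Write a one-sentence definition of what this relation means."
    )
    return call_llm(prompt)

definitions = {
    "received": define_relation("received", ("Marie Curie", "received", "Nobel Prize in Physics")),
}
```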

3. Schema Canonicalisation

This is the step that reduces redundancies in triples. The model can take one of two paths, depending on whether a predefined schema is provided:

  • Target alignment: If a schema is provided, generated triples should conform to it. Hence, if no semantic equivalent within the provided schema can be found for a triple’s components, the triple is excluded.
  • Self-canonicalisation: If no such schema is available, the model constructs one dynamically and references that. Starting with an empty schema, it uses vector similarity and LLM verification to merge similar schema components; components that cannot be merged are added to the schema to expand it (see the sketch after this list).
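
The self-canonicalisation path could be sketched roughly as follows, using sentence-transformers for embeddings and the assumed call_llm() helper for verification; the similarity threshold and prompt wording are illustrative, not the paper's values.

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def canonicalise(relation: str, definition: str, schema: dict, threshold: float = 0.7) -> str:
    """Map a relation to an existing schema relation if similar enough, else add it."""
    if schema:
        names = list(schema)
        query = embedder.encode(definition, convert_to_tensor=True)
        targets = embedder.encode([schema[n] for n in names], convert_to_tensor=True)
        scores = util.cos_sim(query, targets)[0]  # similarity to each existing definition
        best = int(scores.argmax())
        if float(scores[best]) >= threshold:
            candidate = names[best]
            # LLM verification; call_llm() is the assumed chat helper from earlier sketches.
            verdict = call_llm(
                f"Do the relations '{relation}' ({definition}) and "
                f"'{candidate}' ({schema[candidate]}) mean the same thing? Answer yes or no."
            )
            if verdict.strip().lower().startswith("yes"):
                return candidate          # merge into the existing schema component
    schema[relation] = definition         # otherwise expand the schema
    return relation

schema: dict = {}
canonical = canonicalise("received", "The subject was awarded the object.", schema)
```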

4. Refinement (Optional)

This final step enhances the quality of triples by repeating the first three EDC steps and providing a “hint”. Relations and entities extracted by EDC in the previous iteration are included in the hint to give the LLM a bigger pool of candidates to choose from, augmenting the LLM in a retrieval-like fashion.
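
A brief sketch of how such a hint might be assembled (the prompt wording and refined_extract() are illustrative assumptions, reusing the call_llm() helper from earlier):

```python
import json

# Assumes the call_llm() chat helper from the earlier sketches.
def refined_extract(text: str, prev_relations: list, prev_entities: list) -> list:
    """Re-run extraction with previously found schema components offered as a hint."""
    prompt = (
        "Extract (subject, relation, object) triples from the text as a JSON list of lists.\n"
        f"Hint - relations found in a previous pass: {', '.join(prev_relations)}\n"
        f"Hint - entities found in a previous pass: {', '.join(prev_entities)}\n"
        "Prefer these candidates when they fit, but you may introduce new ones.\n\n"
        f"Text: {text}"
    )
    return json.loads(call_llm(prompt))
```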

Using EDC, or EDC+R with refinement, high-quality triples can be extracted with little redundancy by a technique that scales to large schemas.

Conclusion

As the second part of Automated Knowledge Graph Construction with Large Language Models, this article continues to focus on the challenges of knowledge graph construction (KGC) and the use of large language models (LLMs) to automate it. Specifically, we reviewed three models that prompt LLMs to generate KG triples directly. The first was TKGCon, which focuses on constructing theme-specific KGs. This was followed by CodeKGC, which reframes KGC as a code generation task. Finally, we reviewed the EDC framework, which scales to large schemas and reduces redundancy in KGs.

As LLMs become more powerful, their impact on fields beyond natural language processing continues to grow, and KGC is one example. However, LLMs are still susceptible to hallucination, which can trickle down and affect the quality of automatically generated KGs. Moreover, larger LLMs like GPT-4 typically outperform smaller ones, at the expense of higher monetary costs and longer processing times. We can expect that, as LLMs become more widely adopted for KGC, newer techniques will address these challenges.

References

Bi, Z., et al. (2023). CodeKGC: Code Language Model for Generative Knowledge Graph Construction. https://doi.org/10.48550/ARXIV.2304.09048

Ding, L., et al. (2024). Automated Construction of Theme-specific Knowledge Graphs. https://doi.org/10.48550/ARXIV.2404.19146

Zhang, B., & Soh, H. (2024). Extract, Define, Canonicalize: An LLM-based Framework for Knowledge Graph Construction. https://doi.org/10.48550/ARXIV.2404.03868