LLMs for Knowledge Graph Construction and Reasoning

NLPer
Jun 21, 2023 · 15 min read


A Knowledge Graph (KG) is a semantic network comprising entities, concepts, and relations, which can catalyze applications across various scenarios such as recommendation systems, search engines, and question-answering systems. Large language models (LLMs), such as GPT-4 released by OpenAI, demonstrate extremely powerful general knowledge and problem-solving capabilities thanks to pre-training on large amounts of data [1][2][3][4]. Despite numerous studies on LLMs, a systematic exploration of their application in the KG domain is still limited. Can large models such as GPT-4 facilitate efficient KG construction and reasoning?

To answer this question, we investigate the potential applicability of LLMs, exemplified by ChatGPT and GPT-4. The main contents of this blog are:

1. Analysis of GPT-4’s extraction and reasoning capabilities for different types of KGs (e.g., factual and event knowledge) and across different domains, including general and vertical (domain-specific) knowledge.

2. Comparison of the extraction and reasoning capabilities between GPT-4 and ChatGPT, along with an analysis of error cases.

3. Analysis of GPT-4’s generalization capabilities for extracting unseen knowledge.

4. Prospects for new approaches to KG construction and reasoning in the era of large language models.

Paper: https://arxiv.org/abs/2305.13168

GitHub: https://github.com/zjunlp/AutoKG

LLMs for KG Construction and Reasoning

As we do not have access to the GPT-4 API, we use the interactive interface of ChatGPT Plus to evaluate GPT-4’s zero-shot and one-shot extraction and reasoning abilities in terms of entities, relations, and events. To perform the evaluations, we randomly sample test/validation set data and compare the results with those of ChatGPT and a fully supervised baseline model.

For the KG construction task, we select DuIE2.0 [5], Re-TACRED [6], MAVEN [7], and SciERC [8] as the datasets for our experiments. Since some of these datasets do not provide entity types, the instruction prompts uniformly list only the desired relation/event types, without explicitly specifying the entity types to be extracted. Additionally, for the KG reasoning and question-answering tasks, we conduct experiments on the FB15k-237 [9], ATOMIC 2020 [10], FreebaseQA [11], and MetaQA [12] datasets.
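
To make this setup concrete, here is a minimal sketch of how such an instruction prompt might be assembled, assuming a simple textual template; the exact wording used in our experiments may differ, and the relation types shown are just an illustrative subset.

```python
# Illustrative sketch only: a zero-shot instruction prompt for triple extraction
# that lists candidate relation types but (as in our setup) no entity types.
# The template wording and the relation subset are assumptions for illustration.

RELATION_TYPES = ["org:alternate_names", "per:origin", "per:title"]  # example subset

def build_zero_shot_prompt(sentence: str, relation_types: list[str]) -> str:
    """Build an instruction prompt that provides only the desired relation types."""
    return (
        "Extract all (head entity, relation, tail entity) triples from the sentence.\n"
        f"Candidate relation types: {', '.join(relation_types)}.\n"
        "Return one triple per line in the form (head, relation, tail).\n\n"
        f"Sentence: {sentence}"
    )

print(build_zero_shot_prompt(
    "Helen Keller International (HKI) combats the causes of blindness.",
    RELATION_TYPES,
))
```

For the one-shot setting, the same template is simply preceded by one solved example (a sentence plus its gold triples) drawn from the training set.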

By randomly sampling data for testing, we find that GPT-4 achieves relatively good performance on multiple academic benchmark datasets, in both the zero-shot and one-shot settings, and shows improvements compared to ChatGPT. Introducing one demonstration in the prompt further improves performance over the zero-shot scenario. This indicates, to some extent, that GPT-4 possesses the ability to extract knowledge of different types and from different domains.

However, we also observe that the current performance of GPT-4 on KG construction tasks is still not as good as that of fully supervised smaller models. This finding is consistent with previous related works [2][4]. Moreover, in the KG reasoning task, all the large models in the one-shot setting, and GPT-4 even in the zero-shot setting, approach or achieve state-of-the-art performance. It is worth noting that these results are based on random-sampling tests and interactive-interface evaluation (rather than the API), which may be influenced by the data distribution and sample selection of the test set.

Furthermore, the design of prompts and the complexity of the datasets themselves also have a significant impact on the results of this experiment. Specifically, we find that the evaluation results of ChatGPT and GPT-4 on the 8 datasets may be influenced by the following factors:

Dataset: The presence of noise and unclear data types in some datasets (such as the absence of entity types for head and tail entities, complex contexts, etc.).

Prompt Design: Insufficiently semantic-rich prompts can affect extraction performance (e.g., incorporating relevant in-context learning [13] can improve performance; Code4Struct [14] found that leveraging code structure can facilitate structured information extraction — see the sketch after this list). It should be noted that, because some datasets lack entity types, the prompt instructions in the KG construction tasks do not specify entity types, so as to ensure a fair comparison of the capabilities of different models across the datasets; this may also affect the experimental results to some extent.

Evaluation Methodology: Existing evaluation methods may not be well-suited for assessing the extraction capabilities of large models like ChatGPT and GPT-4. For example, the provided labels in the datasets may not fully cover the correct answers, and some results that go beyond the answers may still be correct (e.g., due to synonymy).
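
As a concrete illustration of the Code4Struct-style idea referenced above, a target schema can be presented to the model as code rather than prose, so that the model fills in a structured object. The class and field names below are hypothetical, not taken from Code4Struct or from our actual prompts.

```python
# Hypothetical illustration of a code-structured prompt in the spirit of
# Code4Struct [14]: the event schema is shown as a dataclass, and the model is
# asked to instantiate it from the sentence instead of answering in free text.

from dataclasses import dataclass

@dataclass
class BecomingAMember:
    """Event: someone joins a group or organization."""
    member: str
    group: str

SENTENCE = "Now an established member of the line-up, he agreed to sing it more often."

# The prompt would contain the dataclass definition, the sentence, and a request
# such as "Instantiate the event class from the sentence", expecting e.g.
#   BecomingAMember(member="he", group="the line-up")
```
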

Ability Comparison and Error Case Study

Table 1: KG Construction tasks (F1 score).

Entity and Relation Extraction

We conduct experiments on SciERC, Re-TACRED, and DuIE2.0, each involving 20 samples from the test/valid sets, and report outcomes using the standard micro-F1 score. As shown in Table 1, GPT-4 achieves relatively good performance on these academic benchmark extraction datasets in both zero-shot and one-shot manners. It also makes some progress compared to ChatGPT, even though its performance has not yet surpassed that of fully supervised small models.
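
For reference, the scoring can be thought of as standard micro-F1 over predicted triples; the sketch below assumes exact matching of (head, relation, tail) strings, whereas the actual evaluation may apply additional normalization.

```python
# A minimal micro-F1 sketch over extracted triples, assuming exact string
# matching of (head, relation, tail); the paper's matching rules may differ.

def micro_f1(predicted: list[set[tuple]], gold: list[set[tuple]]) -> float:
    tp = sum(len(p & g) for p, g in zip(predicted, gold))  # correctly extracted triples
    n_pred = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

pred = [{("Helen Keller International", "org:alternate_names", "HKI")}]
gold = [{("Helen Keller International", "org:alternate_names", "HKI")}]
print(micro_f1(pred, gold))  # 1.0
```
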

Zero-shot

The zero-shot performance of GPT-4 shows advancements across all three investigated datasets, with the most notable improvement observed in DuIE2.0. Here, GPT-4 achieves a score of 31.03, in contrast to ChatGPT’s score of 10.3. This result provides further evidence of GPT-4’s robust capacity for generalization in the extraction of complex and novel knowledge.

Figure 1: Examples of ChatGPT and GPT-4 on the RE datasets. (1) Zero-shot on the SciERC dataset. (2) Zero-shot on the Re-TACRED dataset. (3) One-shot on the DuIE2.0 dataset.

As illustrated by (1) and (2) in Figure 1, GPT-4 exhibits a more substantial improvement than ChatGPT in extracting head and tail entities. Specifically, in the example sentence from Re-TACRED, the target triple is (Helen Keller International, org:alternate_names, HKI). ChatGPT, however, fails to extract this relation, possibly due to the close proximity of the head and tail entities and the ambiguity of the predicate in this instance.

In contrast, GPT-4 successfully extracts the “org:alternate_names” relationship between the head and tail entities, completing the triple extraction. This also demonstrates, to a certain extent, GPT-4’s enhanced language comprehension (reading) capabilities compared to ChatGPT.

One-shot

Meanwhile, optimizing the text instructions, for example by adding a demonstration, can further enhance GPT-4’s performance.

Using the DuIE2.0 dataset as an example, consider the sentence: “George Wilcombe was selected for the Honduras national team in 2008, and he participated in the 2009 North and Central America and Caribbean Gold Cup with the team”. The corresponding triple should be (George Wilcombe, Nationality, Honduras).

Although this information is not explicitly stated in the text, GPT-4 successfully extracts it. This outcome is not solely attributed to the single training example providing valuable information; it also stems from GPT-4’s comprehensive knowledge base. In this case, GPT-4 infers George Wilcombe’s nationality based on his selection for the national team.

Event Extraction

We conduct event detection experiments on 20 random samples from the MAVEN dataset, using the F1 score as the evaluation metric. GPT-4 achieves commendable results on this task even without the provision of a demonstration.

Zero-shot

Table 1 shows that GPT-4 outperforms ChatGPT in the zero-shot manner. Furthermore, for the example sentence “Now an established member of the line-up, he agreed to sing it more often.”, ChatGPT generates only the result Becoming_a_member, while GPT-4 identifies three event types: Becoming_a_member, Agree_or_refuse_to_act, and Performing.

It is worth noting that in this experiment, ChatGPT frequently provides answers with only one event type. In contrast, GPT-4 is more adept at acquiring contextual information, yielding more diverse answers, and extracting more comprehensive event types.

Consequently, GPT-4 achieves superior results on the MAVEN dataset, whose sentences may inherently contain one or more event types.

One-shot

During the one-shot experiments, we observe that ChatGPT’s performance improves more significantly under the same settings, whereas GPT-4 experiences a slight decline. The inclusion of a single demonstration rectifies erroneous judgments ChatGPT made under the zero-shot setting, thereby enhancing its performance. In this context, we concentrate on analyzing GPT-4.

Figure 2: An example of event detection by GPT-4 from the MAVEN dataset.

As depicted in Figure 2, the example sentence reads, “The final medal tally was led by Indonesia, followed by Thailand and host Philippines”. The event types provided in the dataset are Process_end and Come_together. However, GPT-4 generates three results: Comparison, Earnings_and_losses, and Ranking.

During the process, GPT-4 indeed notices the hidden ranking and comparison information within the sentence but overlooks the trigger word final corresponding to Process_end and the trigger word host corresponding to Come_together.

At the same time, our observations indicate that under the one-shot setup, GPT-4 tends to produce a higher number of erroneous responses when it is unable to correctly identify the type. This partly contributes to GPT-4’s performance decrease on these samples.

We hypothesize this might be due to the types provided in the dataset not being explicit. Furthermore, the presence of multiple event types within a single sentence further adds complexity to such tasks, leading to suboptimal outcomes.

Link Prediction

Table 2: KG Reasoning (Hits@1 / BLEU-1) and Question Answering (AnswerExactMatch).

The link prediction task involves experiments on two distinct datasets, FB15k-237 and ATOMIC 2020. For the former we use a random sample of 25 instances, whereas for the latter we use 23 instances covering all possible relations.

Zero-shot

In Table 2, GPT-4 outperforms both the text-davinci-003 and ChatGPT models in the zero-shot link prediction task. Notably, on FB15k-237, GPT-4’s Hits@1 score reaches a state-of-the-art level, surpassing the performance of the fine-tuned model.

Regarding ATOMIC 2020, while GPT-4 still exceeds the other two models, a considerable gap remains between its BLEU-1 score and the fine-tuned SOTA.

Upon closer examination of the zero-shot responses, it is apparent that ChatGPT tends to withhold a direct answer when the predicted link is ambiguous, proactively requesting additional contextual information to resolve the ambiguity. This behavior is less frequently observed in GPT-4, which tends to provide an answer directly. This observation implies a potential difference in the reasoning and decision-making processes the two models employ.

One-shot

Optimizing the text instructions has also proven fruitful in enhancing the GPT series’ performance on the link prediction task. The empirical evaluation reveals that one-shot GPT-4 yields improved results on both datasets; the demonstration aids in accurately predicting the tail entity of the triple.

Figure 3: An Example of Link Prediction.

In the example of Figure 3, (Primetime Emmy Award for Outstanding Guest Actress — Comedy Series, award award category category of, [MASK]), the target [MASK] is Primetime Emmy Award. In the zero-shot setting, GPT-4 fails to comprehend the relation properly, leading to an incorrect response of Comedy Series. However, when a demonstration is incorporated, GPT-4 successfully identifies the target tail entity.
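
As an illustration only, a one-shot link prediction query of this kind might be phrased as below; the demonstration triple and the template are hypothetical, not the exact prompts used in our evaluation.

```python
# Hypothetical one-shot link prediction prompt: a single demonstration triple
# followed by the query triple whose tail entity is masked.

def build_link_prediction_prompt(demo: tuple, query: tuple) -> str:
    (dh, dr, dt), (qh, qr, _) = demo, query
    return (
        "Predict the [MASK] entity so that the triple holds.\n"
        f"Example: ({dh}, {dr}, [MASK]) -> {dt}\n"
        f"Query: ({qh}, {qr}, [MASK]) ->"
    )

demo = ("Academy Award for Best Picture",
        "award award category category of", "Academy Award")
query = ("Primetime Emmy Award for Outstanding Guest Actress - Comedy Series",
         "award award category category of", "[MASK]")
print(build_link_prediction_prompt(demo, query))
```
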

Question Answering

We perform the evaluation on two widely used Knowledge Base Question Answering datasets: FreebaseQA and MetaQA. We randomly sample 20 instances from each dataset. For MetaQA, which consists of questions with different numbers of hops, we sample according to their proportions in the dataset. The evaluation metric we use for both datasets is AnswerExactMatch.
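
The metric can be sketched as follows, assuming simple lowercasing and whitespace normalization; the actual AnswerExactMatch implementation may normalize differently or handle multiple gold answers per question.

```python
# A minimal exact-match sketch for KBQA answers; the normalization details are
# assumptions and may differ from the AnswerExactMatch used in the paper.

def normalize(text: str) -> str:
    return " ".join(text.lower().strip().split())

def answer_exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    return normalize(prediction) in {normalize(a) for a in gold_answers}

print(answer_exact_match("The Godfather", ["the godfather"]))  # True
print(answer_exact_match("Godfather II", ["the godfather"]))   # False
```
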

Zero-shot

As shown in Table 2, text-davinci-003, ChatGPT, and GPT-4 perform identically on the FreebaseQA dataset, and all of them surpass the previous fully supervised SOTA by 16%; GPT-4 therefore shows no advantage over either text-davinci-003 or ChatGPT on this dataset. As for MetaQA, there is still a large gap between the large language models and the supervised SOTA.

One possible reason for this gap may be the presence of questions with multiple answers, as opposed to FreebaseQA, where questions typically have only one answer. Furthermore, the knowledge graph provided by MetaQA cannot be fed into the LLMs due to the input token length limitation, so the LLMs can only rely on their internal knowledge to conduct multi-hop reasoning and therefore often fail to cover all the correct answers.

Figure 4: An Example of Question Answering.

Nevertheless, on MetaQA, GPT-4 outperforms text-davinci-003 and ChatGPT by 29.9 and 11.1 points respectively, which indicates the superiority of GPT-4 on more challenging question-answering tasks.

Specifically, the example question in Figure 4 from the MetaQA dataset is: “when did the films written by [A Gathering of Old Men] writers release ?”. Answering this question requires a multi-hop reasoning process that involves connecting movies to writers to other movies and finally to the year of release. Remarkably, GPT-4 is able to answer this question correctly, providing both the 1999 and 1974 release dates. In contrast, ChatGPT and text-davinci-003 fail to provide the correct answer, highlighting the superior performance of GPT-4 in multi-hop question-answering tasks.

One-shot

We also conduct experiments under the one-shot setting by randomly sampling one example from the train set as the in-context exemplar. Results in Table 2 demonstrate that only text-davinci-003 benefits from the prompt, while both ChatGPT and GPT-4 suffer a performance drop. This can be attributed to the notorious “alignment tax”, whereby models sacrifice some of their in-context learning ability in order to align with human feedback.

Generalizability Analysis: Virtual Knowledge Extraction

Drawing from previous experiments, it is apparent that large models are adept at swiftly extracting structured knowledge from minimal information. This observation raises a question regarding the origin of the performance advantage in large language models: is it due to the substantial volume of textual data utilized during the pre-training stage, enabling the models to acquire pertinent knowledge, or is it attributed to their robust inference and generalization capabilities?

To delve into this question, we design a Virtual Knowledge Extraction task, intended to assess the large language models’ capacity to generalize and extract unfamiliar knowledge.

Data Collection

Recognizing that current datasets are inadequate for our requirements, we introduce a novel virtual knowledge extraction dataset, VINE. Specifically, we construct entities and relations that do not exist in the real world and organize them into knowledge triples. We then instruct the model to extract this virtual knowledge, and the effectiveness of this extraction gauges the large model’s capability to handle unseen knowledge. We build VINE on the test set of the Re-TACRED dataset; the core idea of the construction process is to replace existing entities and relations in the original dataset with unseen ones.

Given the vast amount of training data available to models like GPT-4, it is difficult to find words they are unfamiliar with. We therefore use responses from participants in two vocabulary challenge competitions hosted by The New York Times in January 2022 [15] and February 2023 [16] as one data source, aiming to discover creative and memorable new words that fill obvious gaps in the English language. Additionally, to increase the diversity of the data sources, we generate a portion of new words by randomly sampling letter sequences of 7 to 9 characters (drawn from the 26 English letters and the symbol “-”) and randomly appending common noun suffixes (a sketch of this procedure follows).
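
A minimal sketch of this word-generation step is shown below; the suffix list and the decision of when to append a suffix are assumptions for illustration.

```python
# Sketch of the random virtual-word generation described above: a stem of 7-9
# characters drawn from the 26 letters plus "-", optionally followed by a
# common noun suffix. The suffix list here is a hypothetical example.

import random
import string

ALPHABET = string.ascii_lowercase + "-"
NOUN_SUFFIXES = ["ness", "tion", "ment", "ity", "er"]  # assumed examples

def make_virtual_word(rng: random.Random) -> str:
    stem = "".join(rng.choice(ALPHABET) for _ in range(rng.randint(7, 9)))
    if rng.random() < 0.5:  # randomly decide whether to append a suffix
        stem += rng.choice(NOUN_SUFFIXES)
    return stem

rng = random.Random(0)
print([make_virtual_word(rng) for _ in range(3)])
```
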

Preliminary Results

We randomly select ten sentences for evaluation, each featuring a different relation type. We assess the performance of ChatGPT and GPT-4 on these ten test samples after they are given two demonstrations of the same relation.

Our findings reveal that ChatGPT significantly underperforms GPT-4 in virtual knowledge extraction after exposure to a certain amount of virtual knowledge. GPT-4, in contrast, can accurately extract knowledge about never-before-seen entities and relations based on the instructions. Notably, GPT-4 successfully extracts 80% of the virtual triples, while ChatGPT’s accuracy is only 27%.

Figure 5: An Example of Virtual Knowledge Extraction.

In the example shown in Figure 5, we provide the large models with triples composed of virtual relation types and virtual head and tail entities — [Schoolnogo, decidiaster, Reptance] and [Intranguish, decidiaster, Nugculous] — along with the respective demonstrations. The results show that GPT-4 effectively completes the extraction of the virtual triple.

Consequently, we tentatively conclude that GPT-4 exhibits relatively strong generalization ability and can rapidly acquire the capability to extract new knowledge through instructions, rather than relying solely on memorized knowledge (related work [17] has empirically found that large models have extremely strong instruction-generalization capabilities).

Future Opportunities: Automatic KG Construction and Reasoning with Multiple Agents of LLMs

Recently, LLMs have garnered considerable attention and demonstrated proficiency in a variety of complex tasks. Nonetheless, the success of technologies such as ChatGPT still predominantly depends on substantial human input to guide the generation of conversational text. From a model development perspective, this process remains labor-intensive and time-consuming. Consequently, researchers have begun investigating the potential for enabling large models to autonomously generate guided text.

For instance, AutoGPT [18] can independently generate prompts and carry out tasks such as event analysis, marketing plan creation, programming, and mathematical operations. Concurrently, CAMEL [19] explores the potential for autonomous cooperation between communicative agents and introduces a novel cooperative agent framework called role-playing, which uses inception prompting to keep the agents aligned with human intentions. Building upon this research, we further ask: is it feasible to utilize communicative agents to accomplish KG construction and reasoning tasks?

Figure 6: Illustration of AutoKG, which integrates KG construction and reasoning by employing GPT-4 and communicative agents based on ChatGPT.

In this experiment, we utilize the role-playing method from CAMEL.

As depicted in Figure 6, the AI assistant is designated as a consultant and the AI user as a KG domain expert. Upon receiving the prompt and the specified role assignments, the task-specifier agent provides a detailed description to make the task concrete.

Following this, the AI assistant and the AI user collaborate through multi-turn dialogue to complete the specified task until the AI user confirms its completion. The experimental example indicates that the knowledge graph related to the film Green Book is constructed more effectively and comprehensively with the multi-agent approach. This result also underscores the superiority of LLM-based agents in constructing and completing knowledge graphs.
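
The loop below is a highly simplified sketch of this role-playing setup, assuming a generic chat-completion backend; the system prompts, stopping signal, and `chat` helper are placeholders rather than the actual AutoKG implementation (see the GitHub repository linked above for the released code).

```python
# Simplified sketch of a CAMEL-style role-playing loop between an AI user
# (KG domain expert) and an AI assistant (consultant). `chat` is a placeholder
# for any chat-completion API; prompts and stopping rule are assumptions.

def chat(system_prompt: str, user_message: str) -> str:
    """Placeholder: send one turn to an LLM and return its reply."""
    raise NotImplementedError("plug in a chat-completion API here")

def role_play(task: str, max_turns: int = 10) -> list[str]:
    user_system = (
        "You are a KG domain expert. Give the consultant one concrete instruction "
        f"per turn to accomplish the task: {task}. Reply <TASK_DONE> when finished."
    )
    assistant_system = "You are a consultant. Carry out each instruction you receive."
    transcript, last_reply = [], "Ready to start."
    for _ in range(max_turns):
        instruction = chat(user_system, last_reply)       # AI user speaks
        if "<TASK_DONE>" in instruction:
            break
        last_reply = chat(assistant_system, instruction)  # AI assistant responds
        transcript += [instruction, last_reply]
    return transcript

# Example (requires a real `chat` backend):
# role_play("Construct a knowledge graph about the film Green Book.")
```
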

Conclusion and Future Work

In this paper, we seek to preliminarily investigate the performance of LLMs, exemplified by the GPT series, on tasks such as KG construction and reasoning.

While these models excel at such tasks, we pose the question: does LLMs’ advantage in extraction tasks stem from their vast knowledge base or from their potent in-context learning capability? To explore this, we devise a virtual knowledge extraction task and create a corresponding dataset for experimentation. The results indicate that large models indeed possess robust in-context learning abilities.

Furthermore, we propose an innovative method for accomplishing KG construction and reasoning tasks by employing multiple agents. This strategy not only alleviates manual labor but also compensates for the dearth of human expertise across various domains, thereby enhancing the performance of LLMs.

While our research has yielded some results, it also possesses certain limitations.

1. As previously stated, the inability to access the GPT-4 API has necessitated reliance on an interactive interface for conducting experiments, which undeniably inflates workload and time costs.

2. Moreover, as GPT-4’s multimodal capabilities are not currently available to the public, we are temporarily unable to delve into its performance and contributions to multimodal processing.

We look forward to future research opportunities that will allow us to further explore these areas.

References

[1] Reasoning with Language Model Prompting: A Survey. 2022.

[2] Zero-Shot Information Extraction via Chatting with ChatGPT. 2023.

[3] Large Language Model Is Not a Good Few-shot Information Extractor, but a Good Reranker for Hard Samples! 2023.

[4] Exploring the Feasibility of ChatGPT for Event Extraction. 2023.

[5] DuIE: A Large-Scale Chinese Dataset for Information Extraction. NLPCC 2019.

[6] Re-TACRED: Addressing Shortcomings of the TACRED Dataset. AAAI 2021.

[7] MAVEN: A Massive General Domain Event Detection Dataset. EMNLP 2020.

[8] Multi-Task Identification of Entities, Relations, and Coreference for Scientific Knowledge Graph Construction. EMNLP 2018.

[9] Representing Text for Joint Embedding of Text and Knowledge Bases. EMNLP 2015.

[10] (Comet-) Atomic 2020: On Symbolic and Neural Commonsense Knowledge Graphs. AAAI 2021.

[11] FreebaseQA: A New Factoid QA Data Set Matching Trivia-Style Question-Answer Pairs with Freebase. NAACL-HLT 2019.

[12] Variational Reasoning for Question Answering With Knowledge Graph. AAAI 2018.

[13] A Survey for In-context Learning. 2022.

[14] Code4Struct: Code Generation for Few-Shot Structured Prediction from Natural Language. 2022.

[15] https://www.nytimes.com/2022/01/31/learning/february-vocabulary-challenge-invent-a-word.html

[16] https://www.nytimes.com/2023/02/01/learning/student-vocabulary-challenge-invent-a-word.html

[17] Larger Language Models Do In-Context Learning Differently. 2023.

[18] https://github.com/Significant-Gravitas/Auto-GPT

[19] CAMEL: Communicative Agents for “Mind” Exploration of Large Scale Language Model Society. 2023.

[20] https://github.com/zjunlp/EasyInstruct
