Question Answering over Hyper-relational Knowledge Graphs

Ahmad Alzeitoun
12 min read · Nov 21, 2023
The concept of KGQA

This work explores core chain approaches for the Knowledge Graph Question Answering (KGQA) task. We adopt a hyper-relational KG (Wikidata) as a new domain for previous work and focus on pre-trained language models. Building on Sentence-BERT (SBERT), a state-of-the-art sentence and text embedding model, we propose a method for core chain ranking on the QA dataset LC-QuAD 2.0 over the Wikidata knowledge graph. Our system generates core chains from a natural language question (NLQ) and then ranks them in order to build the actual SPARQL query. In addition, we explore the question's intention, treat this as a text classification task, and use a pre-trained BERT model to accomplish it. Finally, we exposed the application as an API using FastAPI.

Fig.1 — A) Triple-based facts. B) Hyper-relational facts. Entities are represented by ovals; the blue arrows and blue ovals together form the qualifiers.

Question Answering is commonly investigated for two types of questions: Simple Questions and Complex Questions. The latter are attracting considerable attention in KGQA systems because the knowledge and reasoning required to answer them are more complex. The motivation of this work is to apply new techniques to improve the accuracy of answering Complex Questions over a hyper-relational KG.

A Knowledge Graph stores information as triple-based facts <subject, predicate, object>, where the subject and object are entities and the predicate is the relation between them. Hyper-relational facts attach additional information (qualifiers) to the triple; figure 1 B) illustrates an example of a Wikidata triple and its qualifiers.

For a simple question, KGQA needs to ground one entity and one relation in order to build the formal query. For Complex Questions, more than one entity and relation can be grounded in the input question, with a consequent increase in the complexity of building the formal query.

Referring to figure 1, an example of a simple question is: "What university did Stephen Hawking attend?" This question can be answered from a single triple (Stephen Hawking, educated at, University College Oxford). If the question refers to more than one fact, for instance one triple and its qualifier, or two triples, it is considered a Complex Question. Based on the above example, a Complex Question could be: "What did Stephen Hawking study at Oxford University?" The answer to this question cannot be found in the same triple alone.

The field of Knowledge Graph Question Answering (KGQA) aims to understand the Natural Language Question (NLQ), reformulate it into a semantic representation, and then query a Knowledge Graph (KG) to retrieve the most appropriate answer. Figure 2 shows the basic concept of KGQA.

Fig.2 — The basic concept of KGQA system.

Problem Statement: we define the problem statement as shown in figure 3. Given a natural language question Q, the task is to retrieve a set of answers a ∈ A based on the topic entity TE of Q on the given KG, then apply a neural ranking model NRM on A, followed by predicting the intent int of the question (such as count, true/false, or a set of values), and finally constructing a formal query FQ.

Fig.3 — Problem Statement.
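Since the figure is an image, the pipeline it describes can be restated compactly using only the symbols introduced above (the exact notation in Fig. 3 may differ):

```latex
% Restatement of the problem statement pipeline (notation approximate):
A       = \mathrm{retrieve}(KG, TE_Q)      % candidate answers around the topic entity of Q
\hat{A} = \mathrm{NRM}(Q, A)               % neural ranking of the candidates
int     = \mathrm{classify}(Q)             % intent: count, true/false, or set of values
FQ      = \mathrm{construct}(\hat{A}, int) % final formal (SPARQL) query
```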

Core Chain Approach: the core refers to the Topic Entity, and the chain refers to the path that can be attached to the core; every core chain represents a candidate answer. In this work we limit core chains to a length of two hops and mark each predicate (edge) as follows:

  • + for an outgoing predicate.
  • - for an incoming predicate.
  • * for a hyper-relational predicate.
  • , to indicate the combination of two predicates within the same hop of one core chain.

These signs are used later to build the final representation of the question and to generate the corresponding SPARQL query.
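As an illustration of this notation (not necessarily the exact serialization used in the implementation), a core chain can be stored as a plain string of marked predicate IDs; the Wikidata properties below, P69 ("educated at") and P512 ("academic degree"), are examples only:

```python
# Illustrative serialization of core chains with the +, -, * and "," markers.
# Property IDs are examples: P69 = "educated at", P512 = "academic degree".

def serialize_core_chain(hops):
    """Join marked predicates, e.g. [('+', 'P69'), ('*', 'P512')] -> '+P69 *P512'."""
    return " ".join(f"{sign}{pid}" for sign, pid in hops)

# One-hop chain with a single outgoing predicate ("educated at").
print(serialize_core_chain([("+", "P69")]))                 # +P69
# The same hop extended with a hyper-relational (qualifier) predicate.
print(serialize_core_chain([("+", "P69"), ("*", "P512")]))  # +P69 *P512
```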

Below is an example of all the predicates that can be collected to form the core chains. The topic entity is Albert Einstein.

Fig.4 — Sub-graph for the topic entity ”Albert Einstein”, based on Wikidata KG.

We create a set of core chain candidates {C1, C2, .., Cn} by collecting all paths of up to two hops around the topic entity in the KG; we differentiate outgoing and incoming predicates by adding + and - respectively, and mark hyper-relational predicates with *. Table 1 demonstrates all eight core chain templates {C1, C2, .., C8}, where C1 to C4 are outgoing core chains and C5 to C8 are incoming core chains. {C1, C5} represent one-hop core chains with a single predicate, {C2, C6} show one-hop core chains with a hyper-relational predicate, and two-hop core chains are represented by {C3, C4, C7, C8}.

Table.1 — Core chain templates that are generated for the topic entity ”Albert Einstein” from fig.4, based on Wikidata KG.
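A minimal sketch of how one-hop candidates could be collected from the public Wikidata endpoint with SPARQLWrapper is shown below; the queries and the entity ID Q937 (Albert Einstein) are illustrative, not the exact implementation used in this work:

```python
# Sketch: collect one-hop outgoing and incoming predicates around a topic entity
# from the public Wikidata endpoint and tag them with the +/- markers defined above.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://query.wikidata.org/sparql", agent="kgqa-example/0.1")
sparql.setReturnFormat(JSON)

def one_hop_predicates(entity_id):
    sparql.setQuery(f"SELECT DISTINCT ?p WHERE {{ wd:{entity_id} ?p ?o . }} LIMIT 200")
    outgoing = [b["p"]["value"] for b in sparql.query().convert()["results"]["bindings"]]

    sparql.setQuery(f"SELECT DISTINCT ?p WHERE {{ ?s ?p wd:{entity_id} . }} LIMIT 200")
    incoming = [b["p"]["value"] for b in sparql.query().convert()["results"]["bindings"]]

    # Keep only the property ID (last URI segment) and prepend the direction marker.
    return ([f"+{p.rsplit('/', 1)[-1]}" for p in outgoing]
            + [f"-{p.rsplit('/', 1)[-1]}" for p in incoming])

print(one_hop_predicates("Q937")[:10])  # Q937 = Albert Einstein
```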

Figure 5 draws the representations of the additional core chain templates, where TE refers to the Topic Entity, blue dotted lines are qualifiers, and the Wikidata predicate identifiers are shown above the edges. The question mark ? marks the answer to the question. We form 24 templates to create the core chains based on the QA dataset LC-QuAD 2.0.

Fig.5 — Core chain templates. A) When a one-hop chain consists of two predicates, we add a comma to differentiate it from a two-hop chain with two predicates. B) The last two predicates in this two-hop chain are on the same hop level. C) We adopted this template; it does not exist in LC-QuAD 2.0.

We need to prepare a core chain dataset for all questions in the QA dataset LC-QuAD 2.0 so that it can be used to train and fine-tune the SBERT transformer.

After creating all core chains for the questions in the QA dataset LC-QuAD 2.0, and since we know the correct answer for each question, we assign a score to each core chain of a given question: score 1 if the core chain holds the correct answer of the question and score 0 if not. See Table 2, where the core chain C2 holds the correct answer for question Q1.

Table.2 — Example how to score the core chains for the respective question.
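A minimal sketch of how these labeled (question, core chain) pairs can be turned into SBERT training examples; the rows below follow the scoring scheme of Table 2 but are illustrative, not actual dataset entries:

```python
# Sketch: build labeled (question, core chain) pairs as SBERT training examples.
# Score 1.0 if the chain holds the correct answer, 0.0 otherwise (as in Table 2).
from sentence_transformers import InputExample

rows = [  # illustrative (question, core chain, score) rows
    ("What did Stephen Hawking study at Oxford University?", "+P69 *P512", 1.0),
    ("What did Stephen Hawking study at Oxford University?", "+P69", 0.0),
]

train_examples = [
    InputExample(texts=[question, core_chain], label=float(score))
    for question, core_chain, score in rows
]
```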

SentenceBert (SBERT):

Also called SentenceTransformers, SBERT is a BERT-based model that modifies the pre-trained BERT network to derive semantically meaningful sentence embeddings that can be compared efficiently. We use SBERT to perform the task of Semantic Textual Similarity (STS), since we have a corpus consisting of many sentences and we want to compute the similarity between pairs of sentences. We divided the core chain dataset into train and test datasets.

SBERT uses siamese and triplet network structures to derive fixed-size sentence embedding vectors that can be compared using similarity measures like cosine similarity or Manhattan/Euclidean distance to find semantically similar sentences.

In our work, the input pair for the SBERT network consists of (q, ci): the given question and a core chain from Table 2 above.

Figure 6 shows the SBERT architecture with the regression objective function, where SBERT produces contextualized word embeddings that are passed through a mean-pooling layer to obtain fixed-size vectors for q and Ci. The loss function is important for fine-tuning our model, as it defines how well the model is performing. We scale the result of the cosine similarity function to a score in [0, 1], where 1 means very similar. For training, we need to tell the network which core chain candidates are similar to the given question and which are dissimilar, based on the labels in Table 2, and compare this with the output of the cosine similarity objective.

Fig.6 — SBERT architecture with regression objective function.
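A minimal fine-tuning and ranking sketch with the sentence-transformers library, reusing the train_examples built above; the hyperparameters shown are illustrative choices within the ranges reported later, not the exact training configuration:

```python
# Sketch: fine-tune SBERT with a cosine-similarity regression objective (as in Fig. 6)
# and rank candidate core chains for a question by cosine similarity.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, losses, util

model = SentenceTransformer("distilbert-base-uncased")

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=5, warmup_steps=100)

# Ranking: encode the question and all candidate chains, then sort by similarity.
question = "What did Stephen Hawking study at Oxford University?"
candidates = ["+P69 *P512", "+P69", "-P69"]
q_emb = model.encode(question, convert_to_tensor=True)
c_emb = model.encode(candidates, convert_to_tensor=True)
scores = util.cos_sim(q_emb, c_emb)[0]
print(sorted(zip(candidates, scores.tolist()), key=lambda x: x[1], reverse=True))
```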

The intention of the Question:

In this task, we focus on predicting the intent of the question so that we have all the information we need to construct the SPARQL query. In the previous task, we found the correct core chain with the highest rank that holds the correct answer; this core chain represents the statement in the SPARQL query structure, as shown in figure 7. In order to find the other elements of the SPARQL query, we need to identify the intention of the question, and we treat this as a text classification task.

We divide this task based on the SPARQL query elements into three classification tasks as shown below, and we use these classifications to determine the intent elements on the left side and right side of the SPARQL query shown in figure 7.

  1. Type of the question: we consider three types of question based on the SPARQL query structure: A) a boolean question, which asks about the validity of a fact; B) a question that asks about the cardinality of a variable mentioned in the question; C) a question that asks for a set of values.
  2. Constraints of the question: we check whether the question has constraints like FILTER or ORDER; the classification task here is to decide whether such a constraint exists in the question or not.
Fig.7 — SPARQL query structure, for the classification task.
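A minimal sketch of such an intent classifier with a pre-trained BERT model; the three label names and the example question are assumptions for illustration, and the fine-tuning loop on labeled questions is omitted:

```python
# Sketch: question-intent classification with a pre-trained BERT model.
# Illustrative labels: 0 = boolean, 1 = count/cardinality, 2 = set of values.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["boolean", "count", "set_of_values"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS)
)  # fine-tune on labeled questions before relying on the predictions

question = "How many universities did Stephen Hawking attend?"
inputs = tokenizer(question, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print(LABELS[int(logits.argmax(dim=-1))])
```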

Evaluation: in this section, we outline the training details and the datasets used in our experiments, analyze several pre-trained transformer models, and present the performance of our system.

Datasets: QA datasets based on knowledge graphs boost progress on the KGQA task. In our study, we trained and evaluated over two datasets: we first used the SimpleQuestions dataset and then performed our experiment on the LC-QuAD 2.0 dataset.

Table 3 shows metrics about the generated core chains for each dataset: the number of generated core chains, the number of unique predicates, the maximum core chain (CC) length in hops, and whether they include qualifiers.

Table.3 — Core chain datasets metrics.

Training details:

  • Splits: We split our data as follows: 70% for training, 15% for validation, and 15% for testing. We apply these percentages to both datasets, which means that core chains belonging to the same question may be annotated for training while others end up in the validation or test data (a minimal split sketch follows this list).
  • Epochs and Batches: We use several pre-trained models and, for each, train for a different number of epochs; in general, we train our models for 1 to 10 epochs. We also train with various batch sizes: 16, 32, and 64.
  • Other implementation details: The server we use provides an NVIDIA GeForce GTX 1080 GPU, Python 3.7.3, PyTorch 1.7.1, CUDA v10.0, Transformers 4.9.0, and SBERT 2.0.0.
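A minimal sketch of the 70/15/15 split, assuming a flat list of (question, core chain, score) rows; splitting at the row level reproduces the behavior described above, where core chains of the same question can end up in different splits (the placeholder data is illustrative):

```python
# Sketch: 70% train / 15% validation / 15% test split over core chain rows.
from sklearn.model_selection import train_test_split

all_rows = [(f"question {i}", f"+P{i}", i % 2) for i in range(100)]  # placeholder data

train_rows, rest = train_test_split(all_rows, test_size=0.30, random_state=42)
val_rows, test_rows = train_test_split(rest, test_size=0.50, random_state=42)
print(len(train_rows), len(val_rows), len(test_rows))  # 70 15 15
```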

The time required for core chain generation and ranking is summarized in table 4:

Table.4 — Max, Min and Avg time to create & rank core chains, (time is in seconds).

Evaluation criteria:

In this section, we present the evaluation of our system, applying the following criteria:

  • Core chain Accuracy (CCA): CCA depends on the number of questions for which the system retrieved the correct core chain, annotated in the formula below as CCQ, and the total number of questions in the test dataset, annotated as TQ.
  • Mean reciprocal rank (MRR): a measure that evaluates the ranked list of core chains returned for each question; rank_i in the formula below refers to the rank of the correct core chain for each question in the test dataset.
  • Precision (P): the ratio of correct answers retrieved to the total answers retrieved for a given question.
  • Recall (R): the ratio of correct answers retrieved to the total number of correct answers for that question.
  • F-measure (F1): a common metric for classification problems, widely used in QA; F1 is based on precision and recall.
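Since the original formula images are not reproduced here, the standard definitions these criteria refer to can be written as follows (reconstructed from the descriptions above):

```latex
CCA = \frac{CCQ}{TQ}, \qquad
MRR = \frac{1}{TQ} \sum_{i=1}^{TQ} \frac{1}{rank_i}, \qquad
F_1 = \frac{2 \cdot P \cdot R}{P + R}

P = \frac{\text{correct answers retrieved for the question}}{\text{answers retrieved for the question}}, \qquad
R = \frac{\text{correct answers retrieved for the question}}{\text{total correct answers for the question}}
```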

Experiments (Models):

In this section, we list all the ranking models we used and present a comparison of their metrics; we also plot some of these metrics to give a better understanding of our results. Below we list the pre-trained sentence embedding models we used.

  • distilbert-base-uncased
  • stsb-mpnet-base-v2
  • paraphrase-mpnet-base-v2

We first trained on the distilbert-base-uncased model on LC-QuAD 2.0, varying the number of epochs from 1 to 10. Table 5 below shows some metrics, where training for 5 epochs registers the highest score.

Table.5 — “distilbert-base-uncased” model performance on LC-QuAD 2.0

In our experiment we go deeper into the analysis of LC-QuAD 2.0. LC-QuAD 2.0 supports 22 unique SPARQL templates that cover 10 types of questions, such as boolean, two intentions, single fact, fact with qualifiers, dual fact, and others. Two of these templates have two representations each, so we consider them as distinct templates, which gives 24 templates in total. In figure 8 below we depict how the “distilbert-base-uncased” model performs for each template; we picked the 5 highest results from Table 5 to compare them across these templates. The model clearly registers high precision on the templates that represent complex questions, while it drops for templates 18, 23, and 24.

Fig.8 — “distilbert-base-uncased” model performance among LC-QuAD 2.0 question templates.

We picked the highest-performing epoch for “distilbert-base-uncased” and compared it with the other pre-trained models “paraphrase-mpnet-base-v2” and “stsb-mpnet-base-v2”. We also fine-tuned the model previously trained on the SimpleQuestions dataset by training it further on LC-QuAD 2.0, and we trained the “distilroberta-base” model using another encoding approach, the cross-encoder.
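A minimal cross-encoder sketch with the sentence-transformers CrossEncoder class, reusing the train_examples, question, and candidates from the earlier sketches; unlike the bi-encoder, it scores each (question, core chain) pair jointly (hyperparameters are illustrative):

```python
# Sketch: cross-encoder ranking, scoring each (question, core chain) pair jointly.
from torch.utils.data import DataLoader
from sentence_transformers import CrossEncoder

cross_model = CrossEncoder("distilroberta-base", num_labels=1)
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
cross_model.fit(train_dataloader=train_dataloader, epochs=5, warmup_steps=100)

# At inference time, every candidate chain is paired with the question and scored.
pairs = [[question, cc] for cc in candidates]
scores = cross_model.predict(pairs)
print(sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True))
```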

We noticed that the best-performing epoch of “distilbert-base-uncased” still scores the highest among these pre-trained models. We summarize the metrics of these models in Table 6 below:

Table.6 — A comparison between pre-trained models on their performance on LC-QuAD 2.0 dataset.

We apply the same experiment to the cross-encoder model to see how it performs across the 24 LC-QuAD 2.0 question templates. The performance of the cross-encoder model is notable on the question templates with less complexity, such as 18, 23, and 24, where it outperforms the other models; figure 9 shows these metrics.

Fig.9 — Cross-encoder model performance among LC-QuAD 2.0 question templates.

We also compare the performance of our trained model with the slot-matching approach of Gaurav et al., whose study used an older version of the LC-QuAD dataset. We put our scores and their scores together in the table below:

Table.7 — A comparison between our pre-trained models and Slot Matching model.

On the other hand, we fine-tuned our model on the SimpleQuestions dataset, where every question can be answered with a single fact (triple). Below we summarize some metrics of our model's performance on this dataset.

Table.8 — Fine-tuning our model performance on SimpleQuestions dataset.

Regarding the intention of the question, as mentioned above, we treat this task as text classification: another pre-trained BERT model predicts the intent of the question, i.e., whether it is boolean or asks for values. Table 9 presents the results.

Table 10 shows the metrics of another text classification model that predicts whether the question has a constraint function or not.

Table.9 — Metrics for the intent of the question: whether it is boolean or asks for values.
Table.10 — Metrics on whether the question has a constraint function or not.

Conclusions:

  • The research addressed the task of Knowledge Graph Question Answering (KGQA) over Hyper-relational KG.
  • Our study highlights the key role that the QA datasets play in improving question answering systems.
  • The main focus of our work was on transformers and their novel architecture, which aims to solve sequence-to-sequence tasks, particularly sentence embeddings.
  • SBERT provides effective solutions and high performance compared to traditional approaches.

Future work:

  • Explore the use of graph embedding.
  • Apply our approach to other KGs and QA datasets.
  • Improve the techniques for constructing the formal query from the question.
  • Exploring data augmentation methods using generative AI and LLMs.
  • Develop a graphical user interface and an API.

Thank you for taking the time to read my article! :)
