Re-evaluation of Knowledge Graph Completion Methods

7 min readMar 25, 2020

Despite their large sizes, knowledge graphs are far from complete. Knowledge graph completion methods that aim to add missing facts to a knowledge graph have become a hot topic. Among them, knowledge graph representation learning models or embedding models are extensively studied with new embedding models being proposed rapidly. We did an experimental study to evaluate these methods and our paper will appear in SIGMOD 2020. Briefly speaking, these are the results we obtained:

The data redundancy and test leakage in widely-used benchmark datasets caused an overestimation of the accuracy in the range of 19% -175% for many models.
We identified Cartesian product relations which also lead to unrealistic evaluation performance.
Many test cases (e.g., 70% in one dataset) used to evaluate the models are unrealistic and nonexistent in real-world scenarios.

Knowledge Graphs

Knowledge graphs are rapidly growing in popularity. They store real-world facts in the form of triples. For example, the fact that Bill Gates founded Microsoft will be stated as (Bill Gates, founded, Microsoft). Companies and corporations such as Google, Microsoft, LinkedIn, Siemens, Thomson Reuters, etc. are exploiting knowledge graphs to manage and easily navigate highly connected, multi-relational data.

Gartner hype cycle for emerging technologies of 2019 has also highlighted knowledge graphs as one of the emerging technologies with a significant impact on business, society, and people over the next 5 to 10 years. A brief look at the big conferences such as ACL 2019, NeurIPS 2019, and AAAI 2020 also shows the growth in knowledge graph-related researches.

What’s our paper about?

Our paper investigates the true effectiveness of knowledge graph completion methods and the defects that exist in extensively-used benchmark datasets FB15k (a subset of Freebase) and WN18 (extracted from WordNet) as well as the recent dataset YAGO3–10 (a subset of YAGO3). These datasets that have been used to train and evaluate many of the embedding models contain a large number of reverse and duplicate triples. Our paper shows how data redundancy and test leakage existing in these datasets affect the embedding models. Another problem that we looked at is the existence of Cartesian product relations in FB15k. I will talk about these relations at the end of this post.

The bottom line is that all of the mentioned problems that exist with the datasets cause an unrealistic inflate to models’ accuracy. Moreover, training a knowledge graph completion model using these datasets is a form of overfitting and the learned model is optimized for the aforementioned triples which cannot be generalized to realistic settings.

Embedding models and the datasets used to train them

Embedding models learn the multi-dimensional representations h, r, and t of a triple (head entity (subject), relation, tail entity(object)) in a knowledge graph. Triples will be denoted as(h,r,t) in the rest of this post.

As we know, datasets play an important role in training a machine learning model. In the case of knowledge graph embedding models, the datasets that were widely used to train and test them have different problems. As a result, we have models that won’t be effective when used to do real-world knowledge graph completion.

FB15k & FB15k-237

FB15k contains many reverse triples. It includes many pairs of (h,r,t) and (h,r⁻¹,t) where r and r⁻¹ are reverse relations:

(avatar, film/directed_by, James Cameron)

(James Cameron, director/film, Avatar)

Freebase actually denotes reverse relations explicitly using a special relation reverse_property:

(film/directed_by, reverse_property, director/film)

About 70% of triples in the training set of FB15k form reverse pairs and, also for 70% of triples in the test set of FB15k, their reverse triples exist in the training set.

These data characteristics suggest that embedding models would have been biased toward learning reverse relations for link prediction. More specifically, the task can largely become inferring whether two relations r₁ and r₂ form a reverse pair. Given the abundant reverse triples in the dataset, this goal could potentially be achieved without using a machine learning approach based on complex embeddings of entities and relations. Instead, we can derive simple rules of the form (h,r₁,t)⇒ (t,r₂,h) using statistics about the triples in the dataset. In fact, we generated such a simple model that attained 71.6% for FB15k using FHits@1↑, a common accuracy measure for embedding models (the higher the better). Based on our results, the best performing embedding model has a FHits@1↑ of 73.8% on FB15K.

link prediction scenario, given such data, is non-existent in the real-world.

It’s important to notice that link prediction scenario, given such data, is non-existent in the real-world at all. With regard to FB15k, the redundant reverse relations, coming from Freebase, were just artificially created. New facts were always added to Freebase as a pair of reverse triples, denoted explicitly by the relation reverse_property. For such intrinsically reverse relations that always come in pair, we never need to predict a triple while its reverse is already in the knowledge graph. Training a knowledge graph completion model using FB15k is thus a form of overfitting in that the learned model is optimized for the reverse triples which cannot be generalized to realistic settings.

Toutanova and Chen noted the aforementioned problem with FB15k and created FB1k-237 by removing such redundancy. To investigate the effect of redundant data available in FB15k, we did some experiments to compare the results of several embedding models on FB15k vs. FB15k-237 and the following table displays the results for some of the methods, using different popular metrics. It is worth mentioning that by definition, higher Hits@1↑ (FHits@1↑), Hits@10↑ (FHits@10↑) and MRR↑ (FMRR↑), and lower MR↓ (FMR↓) indicate better accuracy.

Link prediction results on FB15k and FB15k-237

The overall observation from our experiments is that:

The performance of all methods worsens considerably after the removal of reverse relations. As you can see in the following radar chart, the performance of embedding models has decreased largely on FB15k-237. This result verifies that embedding-based methods may only perform well on reverse relations. However, a straightforward approach based on detection of reverse relations can achieve comparable or even better accuracy.

Performance of embedding models on FB15k vs. FB15k-237

Many successors of TransE (one of the first embedding models) were supposed to significantly outperform it. This was verified by our experiment results on FB15k. However, on FB15k-237, their margin over TransE became much smaller. We hypothesize that these models improved the results mostly on reverse and duplicate triples and hence, after removing those triples, they do not show a clear advantage. This hypothesis can be verified by our finding that most of the test triples on which these models outperformed TransE have reverse or duplicate triples in the training set.

WN18 & WN18RR

WN18 also suffers from data leakage, as 14 out of its 18 relations form 7 pairs of reverse relations, e.g., (europe, has_part, republic_of_estonia) and (republic_of_estonia, part_of, europe) are two reverse triples in reverse relations has_part and part_of. There are also 3 self-reciprocal (symmetric) relations: verb_group, similar_to, derivationally_related_form. Around 93% of triples in training sets are such triples and for 93% of triples in test set their reverse triples exist in the training set.

To remove the WN18 reverse relations, Dettmers et al. created WN18RR by keeping just one relation from each pair of reverse relations. We compared the results of embedding models on WN18 vs. WN18RR as well and reached the same conclusion as what was observed on FB15k vs. FB15k-237.

Link prediction results on WN18 and WN18RR

YGO3–10 & YAGO3–10-DR

YAGO3–10 has 37 relations and two relations isAffiliatedTo (r1) and playsFor (r2) account for 35% and 30% of its training triples respectively. Although r1 semantically subsumes r2 in the real world, they appear as duplicate relations in this particular dataset as their (subject, object) pairs overlap substantially. Based on our experiments, various models achieved much stronger results on r1 and r2 than other relations. We created another dataset called YAGO3–10-DR by removing the redundancies from YAGO3–10. On these datasets, we reached the same conclusion as what we observed on the other datasets.

Cartesian Product Relations

We also discovered another issue (called Cartesian product relations) with FB15k which makes existing performance measures of embedding models unrealistic. For a Cartesian relation, the subject-object pairs from all instance triples of the relation form a Cartesian product. In other words, there are a set of subjects and a set of objects, and the relation exists from every subject in the first set to every object in the second set. One example Cartesian product relation is climate since (a, climate, b) is a valid triple for every possible city a and month b. Another example is position, since every team in a certain professional sports league has the same set of positions. The link prediction problem for such relations thus becomes predicting whether a city has a climate in, say, January, or whether an NFL team has the quarter-back position. The existence of these relations also unrealistically inflates a model’s link prediction accuracy. Moreover, Such prediction tasks are not very meaningful. The same as what we observed for reverse relations, the existence of Cartesian product relations in FB15k is quite artificial. In fact, 60% of them are due to special “mediator nodes”. Even if we want to perform link prediction on Cartesian product relations, a simpler approach can be more effective than learning complex embedding models. We implemented a simple method to find Cartesian product relations and perform link prediction on them. Our experiments on 9 Cartesian product relations in FB15k obtained an average FHits@10↑ of 98.3% using this method, which is higher than the 96.3% FHits@10↑ of TransE on these relations.

You can find more details about this study and its results in the paper. All codes, experiment scripts, datasets, and results are in a public repository.