Empowering Knowledge Graph Link Prediction through In-Depth Analysis of Data Modeling
This article summarizes a recent paper, “Comprehensive Analysis of Freebase and Dataset Creation for Robust Evaluation of Knowledge Graph Link Prediction Models,” which was published at ISWC 2023. The authors of this paper are researchers at the Innovative Data Intelligence Research Laboratory (IDIR) at the University of Texas at Arlington. The methodology, code, datasets, and experiment results produced from this work are available at https://github.com/idirlab/freebases.
Knowledge graphs (KGs) encode semantic, factual information as triples of the form (subject s, predicate p, object o). KGs are widely used in industry: they are behind products at Google, Amazon, Microsoft, and others, and have become an essential asset for a wide variety of tasks and applications in artificial intelligence and machine learning, including natural language processing, search, question answering, and recommender systems.
To develop and robustly evaluate models and algorithms for tasks on KGs, access to large-scale KGs is crucial. However, publicly available KG datasets are often much smaller than what real-world scenarios require.
Freebase is amongst the largest public cross-domain knowledge graphs. It possesses three main data modeling idiosyncrasies:
1) It has a strong type system
2) Its properties are purposefully represented in reverse pairs
3) It uses mediator objects to represent multiary relationships
These design choices are important in modeling the real world. But they also pose nontrivial challenges for research on embedding models for knowledge graph completion, especially when models are developed and evaluated without regard to these idiosyncrasies.
This paper lays out a comprehensive analysis of the challenges associated with the idiosyncrasies of Freebase and measures their impact on KG Link Prediction (LP), the task of predicting the missing s in a triple (?, p, o) or the missing o in (s, p, ?). LP has been studied extensively in the past couple of years. The first paper on the topic has received more than seven thousand citations. In 2022 alone, 53 papers across all top conferences studied LP, 48 of which used the Freebase dataset. Three of these papers used a problematic large-scale dataset, and the rest used only a very small subset of Freebase. However, none of these datasets considered how the data is modeled in Freebase.
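To make the task definition concrete, here is a minimal sketch of how a single triple is typically turned into the two queries (?, p, o) and (s, p, ?) and how the true answer is ranked among candidate entities. The scoring function `toy_score`, the helper names, and the placeholder entity "Some Other Film" are assumptions for illustration only; they are not taken from the paper or its repository.

```python
from typing import Callable, Iterable

# A scorer maps (s, p, o) to a plausibility score; higher means more plausible.
Scorer = Callable[[str, str, str], float]

def tail_rank(s: str, p: str, o: str, entities: Iterable[str], score: Scorer) -> int:
    """Rank of the true object o among candidates for the query (s, p, ?)."""
    true_score = score(s, p, o)
    # Rank = 1 + number of other candidates that score strictly higher.
    return 1 + sum(1 for e in entities if e != o and score(s, p, e) > true_score)

def head_rank(s: str, p: str, o: str, entities: Iterable[str], score: Scorer) -> int:
    """Rank of the true subject s among candidates for the query (?, p, o)."""
    true_score = score(s, p, o)
    return 1 + sum(1 for e in entities if e != s and score(e, p, o) > true_score)

# Hypothetical toy scorer: 1.0 for triples in a tiny lookup set, 0.0 otherwise.
KNOWN = {("James Ivory", "/film/director/film", "A Room With A View")}

def toy_score(s: str, p: str, o: str) -> float:
    return 1.0 if (s, p, o) in KNOWN else 0.0

ENTITIES = ["James Ivory", "A Room With A View", "Some Other Film"]

if __name__ == "__main__":
    triple = ("James Ivory", "/film/director/film", "A Room With A View")
    print("tail rank:", tail_rank(*triple, ENTITIES, toy_score))
    print("head rank:", head_rank(*triple, ENTITIES, toy_score))
```

A real embedding model would replace `toy_score` with a learned scoring function; the ranking logic around it stays the same.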
This paper makes available several variants of the Freebase dataset by inclusion and exclusion of the data modeling idiosyncrasies and provides a Freebase type system that is extracted to supplement the variants. The paper fills an important gap in dataset availability. To the best of our knowledge, ours is the first-ever publicly available full-scale Freebase dataset that has gone through proper preparation. The paper also fills an important gap in our understanding of embedding models for knowledge graph link prediction. Such models were seldom evaluated using the full-scale Freebase. When they were, the datasets used were problematic, leading to unreliable results.
Our Datasets
We created four variants of the Freebase dataset by inclusion/exclusion of reverse triples and CVT nodes. The type system we created is also provided as auxiliary information. Metadata and administrative triples are removed, and thus the variants only include subject matter triples.
Reverse Triples
When a new fact was included in Freebase, it would be added as a pair of reverse triples (s, p, o) and (o, p’, s), where p’ is the reverse of p. For instance, /film/film/directed_by and /film/director/film are reverse relations. Thus, (James Ivory, /film/director/film, A Room With A View) and (A Room With A View, /film/film/directed_by, James Ivory) in the figure above form reverse triples.
The pitfalls associated with reverse triples in a dataset can be summarized as follows.
1) Link prediction becomes much easier on a triple if its reverse triple is available. Hence, reverse triples lead to substantial overestimation of model accuracy.
2) Instead of complex models, one may achieve similar results by using statistics of the triples to derive simple rules of the form (s, p1, o) ⇒ (o, p2, s), where p1 and p2 are reverse. Such rules are highly effective given the prevalence of reverse relations.
3) The link prediction scenario for such data is non-existent in the real world.
This is a case of excessive data leakage — the model is trained using features that otherwise would not be available when the model needs to be applied for real inference.
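The leakage described above can be illustrated with a small sketch of such a rule-based baseline. It derives rules of the form (s, p1, o) ⇒ (o, p2, s) from co-occurrence statistics in the training triples and then answers a query (s, p, ?) by lookup. The function names, the `threshold` parameter, and the placeholder entities ("Film B", "Director B", and so on) are assumptions made for this example, not the procedure used in the paper.

```python
from collections import defaultdict

def find_reverse_rules(triples, threshold=0.5):
    """Return a set of (p1, p2) rules meaning (s, p1, o) => (o, p2, s).

    A rule is kept when at least `threshold` of p1's (s, o) pairs also
    appear flipped under p2. Symmetric relations (p1 == p2) are skipped.
    """
    pairs_by_rel = defaultdict(set)
    for s, p, o in set(triples):
        pairs_by_rel[p].add((s, o))
    rules = set()
    for p1, pairs1 in pairs_by_rel.items():
        for p2, pairs2 in pairs_by_rel.items():
            if p1 == p2:
                continue
            flipped = sum(1 for (s, o) in pairs1 if (o, s) in pairs2)
            if flipped / len(pairs1) >= threshold:
                rules.add((p1, p2))
    return rules

def predict_tail(train, rules, s, p):
    """Answer (s, p, ?) using only training triples (x, p1, s) with rule p1 => p."""
    answers = set()
    for p1, p2 in rules:
        if p2 != p:
            continue
        answers.update(h for (h, r, t) in train if r == p1 and t == s)
    return answers

DIRECTED_BY = "/film/film/directed_by"
DIRECTOR_FILM = "/film/director/film"

# Toy training set; the reverse of the last triple is held out as the test fact.
TRAIN = [
    ("A Room With A View", DIRECTED_BY, "James Ivory"),
    ("James Ivory", DIRECTOR_FILM, "A Room With A View"),
    ("Film B", DIRECTED_BY, "Director B"),
    ("Director B", DIRECTOR_FILM, "Film B"),
    ("Film C", DIRECTED_BY, "Director C"),
]

if __name__ == "__main__":
    rules = find_reverse_rules(TRAIN)
    print(predict_tail(TRAIN, rules, "Director C", DIRECTOR_FILM))
```

In this toy setting the held-out fact (Director C, /film/director/film, Film C) is recovered without any learned model, which is why results on datasets that retain reverse triples can overstate model accuracy.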
Mediator Nodes
Mediator nodes, also called CVT nodes, are used in Freebase to represent n-ary relationships. For example, the figure above shows a CVT node connected to an award, a nominee, and a work. This or a similar approach is necessary for accurate modeling of the real world. Note that one may convert an n-ary relationship centered at a CVT node into binary relationships between every pair of entities by concatenating the edges that connect the entities through the CVT node.
When multiary relationships (i.e., CVT nodes) are present, link prediction could become more challenging, as CVT nodes are long-tail nodes with limited connectivity. Nonetheless, the impact of CVT nodes on the effectiveness of current link prediction approaches is unknown. For the first time, this paper presents experiment results in this regard on full-scale Freebase datasets.
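The conversion from a CVT-centered n-ary relationship to binary relationships can be sketched as follows. This is a simplified illustration, not the authors' implementation: it joins each edge entering a CVT node with each edge leaving it, and it assumes a "." separator for the concatenated relation name. The CVT identifier `cvt1` and the short relation names are hypothetical placeholders.

```python
from collections import defaultdict

def flatten_cvt(triples, cvt_nodes):
    """Replace edges through CVT nodes with concatenated binary triples.

    For every pair (e1, p1, c) and (c, p2, e2) with c a CVT node, emit
    (e1, p1 + "." + p2, e2). Triples that do not touch a CVT node are kept.
    """
    incoming = defaultdict(list)  # cvt -> [(entity, relation)] for edges into the CVT
    outgoing = defaultdict(list)  # cvt -> [(relation, entity)] for edges out of the CVT
    result = []
    for s, p, o in triples:
        if o in cvt_nodes:
            incoming[o].append((s, p))
        elif s in cvt_nodes:
            outgoing[s].append((p, o))
        else:
            result.append((s, p, o))
    for c in cvt_nodes:
        for e1, p1 in incoming[c]:
            for p2, e2 in outgoing[c]:
                if e1 != e2:  # skip paths that loop back to the same entity
                    result.append((e1, p1 + "." + p2, e2))
    return result

# Toy example mirroring the award / nominee / work case (placeholder names).
TRIPLES = [
    ("Nominee A", "nominations", "cvt1"),
    ("cvt1", "award", "Award X"),
    ("cvt1", "nominated_for", "Work Y"),
]

if __name__ == "__main__":
    for t in flatten_cvt(TRIPLES, {"cvt1"}):
        print(t)
```

This sketch only connects an entity with an edge into the CVT node to an entity with an edge out of it. When both directions of each edge are present, as with reverse triples, the same join reaches every pair of entities around the CVT node.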
Thank you for reading this article. Interested in learning more? Feel free to check out our paper and our GitHub repository! Should you have any further questions, please contact us at idirlab@uta.edu.