What's new for KGs in Graph ML?

Machine Learning on Knowledge Graphs @ NeurIPS 2020

Your guide to the KG-related research in NLP, December edition

Michael Galkin
The Startup


NeurIPS is a major venue covering a wide range of ML & AI topics. Of course, there is something interesting for Graph ML aficionados and knowledge graph connoisseurs. Tune in to find out!

This year, NeurIPS had 1900 accepted papers, and 100+ of them are on graphs. Add to that several prominent workshops like KR2ML, DiffGeo4ML, and LMCA; be sure to check their proceedings, as workshop papers are likely to appear at future venues like ICLR, ICML, or ACL. Furthermore, the GML Newsletter #4 by Sergey Ivanov presents an overview of Graph ML papers from NeurIPS, including theory, oversmoothing, scalability, and much more, so check it out as well.

In this post, I'd like to put the emphasis on a particular type of graph, knowledge graphs (KGs), and explore with you 10 papers that might be quite influential in 2021. NeurIPS papers are often a bit more high-level and theory-heavy than the NLP applications at *CL conferences, so I'd summarize them as:

Behind the curtain of transductive link prediction we step into the valley of logical reasoning tasks for KGs to excel at

Get some coffee, tea, or even some Glühwein - today on our agenda:

  1. Query Embedding: Beyond Query2Box
  2. KG Embeddings: NAS, Boxes vs Hyperbolic, Meta-Learning
  3. SPARQL and Compositional Generalization
  4. Benchmarking: OGB, GraphGym, KeOps
  5. Wrapping Up

Query Embedding: Beyond Query2Box

Query embedding (QE) is about answering queries against a KG directly in the embedding space, without any SPARQL or graph database engines. Given that most KGs are sparse and incomplete, query embedding algorithms can infer missing links (with a certain probability). This is one of the hottest topics in Graph ML right now!
In the ICLR 2020 post, we covered Query2Box, a strong QE baseline capable of answering logical queries with conjunction (∧), disjunction (∨), and existential quantifiers (∃) by modeling queries as d-dimensional boxes.

Ren and Leskovec (the authors of the original Query2Box) finally add a negation operator (¬) in the BetaE framework. Neither points nor boxes have a usable notion of negation, so BetaE models entities and queries as Beta distributions. Projection and intersection are modeled nicely with Beta distributions as well, and negation is a distribution with reciprocal alpha and beta parameters. In addition to DNF, we can use De Morgan's law to replace disjunctions with negations and conjunctions. Check out the nice illustration of the approach below.
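
To make the negation and intersection operators a bit more concrete, here is a minimal sketch (my own simplification, not the official BetaE code) of how they can be expressed over Beta-distribution embeddings in PyTorch; the paper's attention-based intersection weights are replaced here by a plain softmax over scores.

```python
# A minimal sketch of BetaE-style negation and intersection over Beta embeddings.
import torch

def beta_negation(alpha, beta):
    # Negation: element-wise reciprocal of the Beta parameters, turning density
    # concentrated near 1 into density concentrated near 0.
    return 1.0 / alpha, 1.0 / beta

def beta_intersection(alphas, betas, scores):
    # Intersection: a weighted combination of the input Beta parameters
    # (the paper computes these weights with attention; here a plain softmax).
    w = torch.softmax(scores, dim=0)
    return (w[:, None] * alphas).sum(dim=0), (w[:, None] * betas).sum(dim=0)

# Two toy d=4 query embeddings; Beta parameters must stay positive.
a1, b1 = torch.rand(4) + 0.1, torch.rand(4) + 0.1
a2, b2 = torch.rand(4) + 0.1, torch.rand(4) + 0.1
na, nb = beta_negation(a1, b1)
ia, ib = beta_intersection(torch.stack([a1, a2]), torch.stack([b1, b2]),
                           scores=torch.zeros(2))
# The differential entropy of the resulting Beta distribution is what the
# authors correlate with the cardinality of the answer set.
print(torch.distributions.Beta(ia, ib).entropy())
```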

In experiments, BetaE slightly outperforms Query2Box on existing query patterns while also introducing new patterns with negation that no existing QE approach can answer yet. Two more key differences from Q2B: BetaE better captures query uncertainty (the correlation between the differential entropy of the Beta embedding and the cardinality of the answer set is up to 77% higher), and it can estimate whether a given query has zero answers.

Answering a FOL query "List the presidents of European countries that have never held the World Cup" with conjunction, disjunction, and negation operators. Source: Ren and Leskovec

On the other hand, Sun et al. identified that Q2B and other systems are not logically faithful, that is, not all logically entailed query answers can be retrieved by the QE system. To bridge this gap, the authors introduce EmQL (Embedding Query Language). EmQL still embeds entities into a d-dimensional space and supports ∧, ∨, and ∃, but takes a different approach to modeling sets. Instead of boxes or Beta distributions, the authors encode each set X as a pair (a_x, b_x), where a_x is a weighted centroid of the set elements and b_x is a count-min sketch (CM sketch) built from D hash functions of width W (thus a D×W matrix; the authors use 20×2000).

How does it work?
* Using centroids, a top-k MIPS a_x.T.mm(E) retrieves k candidate entities that may belong to X;
* Importantly, CM sketches come with a differentiable retrieval operator CM(i, b_x) that returns the weight of entity i in the set X;
* We can then combine MIPS retrieval with CM-based filtering;
* Finally, the authors define ∧, ∨, and ∃ as operators over centroids and CM sketches (a simplified sketch of the retrieval step follows this list).
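
To make the retrieval recipe above more tangible, here is a simplified, self-contained sketch of centroid-based MIPS followed by count-min filtering. The hash family, the sizes, and the hard filtering threshold are illustrative assumptions rather than EmQL's exact (differentiable) formulation.

```python
# Simplified EmQL-style retrieval: top-k MIPS with a set centroid, then
# re-weighting the candidates with a count-min (CM) sketch.
import torch

N, d, D, W, k = 1000, 64, 20, 2000, 10
E = torch.randn(N, d)                       # entity embedding matrix

def cm_hash(entity_ids, D, W):
    # Toy hash family: (a * id + b) mod W per sketch row.
    a = torch.arange(1, D + 1).unsqueeze(1)
    b = torch.arange(D).unsqueeze(1)
    return (a * entity_ids.unsqueeze(0) + b) % W           # (D, |ids|)

def cm_add(sketch, entity_ids, weights):
    cols = cm_hash(entity_ids, *sketch.shape)
    for row in range(sketch.shape[0]):
        sketch[row].index_add_(0, cols[row], weights)
    return sketch

def cm_lookup(sketch, entity_ids):
    cols = cm_hash(entity_ids, *sketch.shape)
    rows = torch.arange(sketch.shape[0]).unsqueeze(1)
    return sketch[rows, cols].min(dim=0).values             # CM estimate per id

# Encode a set X = {3, 7, 42}: centroid a_x plus CM sketch b_x of its members.
members = torch.tensor([3, 7, 42])
a_x = E[members].mean(dim=0)
b_x = cm_add(torch.zeros(D, W), members, torch.ones(len(members)))

# Retrieval: top-k MIPS with the centroid, then CM-based filtering.
scores = E @ a_x
topk = scores.topk(k).indices
weights = cm_lookup(b_x, topk)
print(topk[weights > 0])                    # candidates the sketch says are in X
```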

In experiments, the authors probe EmQL on generalization (answering queries, the standard QE task) and entailment (when the full KG is given and no link prediction is required). On average, EmQL outperforms Q2B by 10-15 H@3 points on FB15k-237 and NELL on the generalization task, and completely dominates on the entailment task (94.2 vs 36.2).
Furthermore, EmQL was tested on multi-hop QA benchmarks like MetaQA and WebQSP, where it outperforms even the recent EmbedKGQA from ACL 2020.
Note that EmQL does not support the negation (¬) allowed by BetaE. Yet?

Source: Sun et al

KG Embeddings: NAS, Boxes vs Hyperbolic, Meta-Learning

There is something really interesting at NeurIPS this year that goes beyond 'yet another KG embedding algorithm'. You've probably heard about Neural Architecture Search (NAS) and its successes in computer vision: for instance, recent architectures like EfficientNet are not designed by humans. Instead, a NAS system generates a neural net from a bunch of smaller building blocks while optimizing certain metrics. Can we have a NAS system generate efficient architectures for KG-related tasks?

When NAS for KG embeddings actually works

Zhang et al. say yes! They propose Interstellar, an RNN-based NAS approach for relational paths. Interstellar first samples paths from a KG (biased random walks in this case), and those paths are then fed into an RNN. The whole RNN (cell and weights) is subject to NAS. The process is split into two parts: macro-level (e.g., scoring functions) and micro-level (activations and weights), both governed by a controller. As Hits@10 and related metrics are non-differentiable, the authors resort to policy gradients to optimize the controller.
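
For intuition on how such a controller can be trained, here is a hedged, minimal REINFORCE sketch: the controller samples discrete architecture choices, receives a non-differentiable reward (think validation Hits@10), and is updated with policy gradients. The search space and the reward function below are placeholders, not Interstellar's actual macro/micro spaces.

```python
# Minimal policy-gradient (REINFORCE) loop over a toy architecture search space.
import torch

search_space = {"score_fn": 3, "activation": 4}     # number of options per choice
logits = {k: torch.zeros(n, requires_grad=True) for k, n in search_space.items()}
optimizer = torch.optim.Adam(list(logits.values()), lr=0.05)

def evaluate_architecture(choices):
    # Placeholder: train the sampled architecture and return validation Hits@10.
    return torch.rand(1).item()

baseline = 0.0
for step in range(100):
    dists = {k: torch.distributions.Categorical(logits=v) for k, v in logits.items()}
    choices = {k: d.sample() for k, d in dists.items()}
    reward = evaluate_architecture(choices)
    baseline = 0.9 * baseline + 0.1 * reward         # moving-average baseline
    log_prob = sum(d.log_prob(choices[k]) for k, d in dists.items())
    loss = -(reward - baseline) * log_prob           # REINFORCE objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```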

In experiments, Interstellar is probed on link prediction and entity matching tasks, showing competitive results. Each task requires certain seed architectures (like in the picture below), and finding a good net might take a while (about 30 hours of search and 70 hours of fine-tuning for FB15k-237), but it is the first step showing that NAS is generally applicable to KG-related tasks and can create new RNN architectures!
Besides, let's see how fast Nolan's latest movie, Tenet, gets some traction in the model-naming world.

Source: Zhang et al

Geometric embedding models enjoy ever-increasing attention from the community! Last year, in the NeurIPS'19 post, we noticed a surge in approaches that use hyperbolic geometry for graph representation learning. This year, we have a new strong geometric competitor: hyper-rectangles, a.k.a. boxes!

Whereas Query2Box used boxes for query embedding, Abboud et al. develop the idea further and design BoxE, a provably fully expressive KG embedding model where entities are points in a vector space and relations are boxes. Each relation is modeled with as many boxes as its arity, e.g., for a binary predicate capitalOf(Berlin, Germany) there are two boxes, one for head and one for tail entities, and for n-ary predicates there are n boxes. Each entity, in addition to its base position, has an additional parameter, a translational bump, which aims at bringing entities occurring in the same relation closer together (check out the illustrated example below).
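
As a rough illustration of the scoring idea (my simplification, not the authors' implementation), the sketch below models two entities as base points with translational bumps and a binary relation as a pair of boxes; a fact scores well when each bumped point falls inside its corresponding box.

```python
# Simplified BoxE-style scoring for a binary fact r(h, t).
import torch

d = 8
base = {"Berlin": torch.randn(d), "Germany": torch.randn(d)}
bump = {"Berlin": torch.randn(d), "Germany": torch.randn(d)}

# capitalOf gets two boxes (head box, tail box), each with lower/upper corners.
lo = torch.randn(2, d)
capital_of_boxes = {"lower": lo, "upper": lo + torch.rand(2, d)}

def distance_to_box(point, lower, upper):
    # Zero when the point is inside the box, grows with the violation outside.
    below, above = (lower - point).clamp(min=0), (point - upper).clamp(min=0)
    return (below + above).sum()

def score(head, tail, boxes):
    # Each entity is bumped by the other entity's bump, pulling co-occurring
    # entities toward each other's relation boxes.
    h_point = base[head] + bump[tail]
    t_point = base[tail] + bump[head]
    return -(distance_to_box(h_point, boxes["lower"][0], boxes["upper"][0]) +
             distance_to_box(t_point, boxes["lower"][1], boxes["upper"][1]))

print(score("Berlin", "Germany", capital_of_boxes))
```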

The authors do invest in theory and prove several important properties of BoxE: it can model many inference patterns except composition, it allows for rule injection (and hence for injecting ontological axioms), and it is fully expressive. However, it is fully expressive only when the embedding dimension is |E|x|R| for binary relations and |E|^(n-1)x|R| for n-ary predicates, which is, hmm, a bit too much (interestingly, the authors of Query2Box also showed that you need about |E| embedding dimensions to model an arbitrary FOL query).

โš—๏ธ BoxE was evaluated on triple-based benchmarks like FB15k-237 as well as on n-ary graphs like JF17K. Although the embedding dimension varies in the range 200โ€“1000 (not 15000x237 as theory needs for FB15k-237, for example), BoxE is still quite competitive and performs on par โš–๏ธ with current SOTA on the graphs w/o many compositional patterns. The authors also compiled a nice experiment on injecting logical rules over the NELL-sports dataset and showed impressive >25 MRR points gains ๐Ÿ’ช.

As 2020 is the year of boxes, do not miss the work of Dasgupta et al., also published at NeurIPS, who study boxes more deeply on the subject of local identifiability and come up with the idea of using Gumbel distributions to model box parameters.

Source: Abboud et al

We also remember E2R from NeurIPS 2019, a KG embedding model based on quantum logic with interesting properties (either very high or very low performance). At that time, E2R only worked in the transductive setup (meaning the whole graph is seen during training). This year, Srivastava et al. extend the model further and come up with IQE (Inductive Quantum Embedding). Essentially, IQE now accepts node features, so an entity embedding has to correlate with its feature vector. Furthermore, IQE is optimized with a novel Alternating Minimization scheme, which the authors find to be approximately 9 times faster than vanilla E2R. The authors also provide a solid theoretical justification of the model's properties and of when one should expect the underlying problem to be NP-hard.

๐Ÿ‘ฉโ€๐Ÿ”ฌ Conceptually, the model supports binary predicates, but the authors concentrate on the fine-grained entity typing task (FIGER, Ontonotes, TypeNet) using BiLSTM as a context encoder. Note that IQE needs only about 6 epochs to converge (on FIGER โ€” in comparison, E2R required 1000 iterations)! Qualitatively, IQE outperforms the original transductive model by 25โ€“30 accuracy and F1 points ๐Ÿ“ˆ

Continuing with inductive tasks, Baek et al. study two particular link prediction setups: 1) given a training (seen) graph, a new unseen node arrives, and you need to predict its connections to seen nodes (unseen-to-seen); 2) several unseen nodes arrive, and you need to predict links among the unseen nodes themselves (unseen-to-unseen). Sounds pretty complex, right? Usually, in transductive tasks, a model learns entity and relation embeddings of all seen nodes, and inference is performed over that same set of seen nodes. Here, we have unseen nodes, often without node features.

Source: Baek et al

The authors resort to meta-learning and propose Graph Extrapolation Networks (GEN), designed to extrapolate knowledge from seen entities to unseen ones. Furthermore, the authors define the task in the few-shot setting, that is, unseen new nodes might have only K (3-5) links to existing nodes or to other unseen nodes.

The meta-learning ๐Ÿ‘ฉโ€๐Ÿซ task for GEN relies mostly on relations: given a support set of K triples for an unseen node e_i, apply neighborhood aggregation through a learnable relation-specific weight Wr. In fact, ๐Ÿ‘‰ any relation-aware GNN architecture might be plugged in here. In other words, we meta-learn an embedding of an unseen entity using its neighbors' representations. To cater for the uncertainty of the few-shot scenario, the authors stochastically embed unseen entities as a distribution which parameters are learned with 2 GEN layers through MC sampling (somewhat resembles GraphVAEs).

GEN has been evaluated on 1- and 3-shot LP tasks on FB15k-237 and NELL-995, yielding significant improvements when considering unseen-to-unseen links. In addition, GEN has been applied to relation prediction on the DeepDDI and BioSNAP-sub datasets with impressive gains over baselines, e.g., 0.708 vs 0.397 AUPRC on DeepDDI.

Overall, NeurIPS'20 opened up several prospects in the KG embedding area: Neural Architecture Search works, meta-learning works, and quantum and box models are getting more expressive! Thanks to that, we can now solve much more complex tasks than vanilla transductive link prediction.

SPARQL and Compositional Generalization

๐Ÿ“ In question answering over KGs (KGQA), semantic parsing transforms a question into a structured query (say, in SPARQL) which is then executed against a database. One of the ๐Ÿ”‘ problems there is compositional generalization, that is, can we build complex query patterns after observing simple atoms? In the ICLRโ€™20 post, we reviewed a new large-scale dataset Complex Freebase Question (CFQ) (letโ€™s forgive them for ๐ŸงŸโ€โ™‚๏ธ Freebase) that was designed to measure compositional generalization capabilities of NL 2 SPARQL approaches. Notably, baselines like LSTMs and Transformers perform quite poorly: <20% accuracy on average ๐Ÿ˜•

Guo et al. present a thorough study of potential caveats; one of the biggest issues is sequential decoding, or any kind of ordering bias when generating queries or logical forms, including tree decoding. Instead, they propose to leverage partially ordered sets (posets) and, accordingly, Hierarchical Poset Decoding (HPD). Posets allow us to enforce permutation invariance in the decoding process (for instance, predicting the two branches of a logical AND operator independently) so that a model can concentrate on generalization. Posets can be represented as DAGs, and the components of such a DAG can be predicted by a simple RNN (which the authors resort to).

However, direct prediction of posets does not bring benefits (it works even worse than LSTMs and Transformers). The essential part is hierarchical decoding (check the figure below), which consists of 4 steps. First, we predict a poset sketch (a de-lexicalized DAG). Second, we independently predict the primitives of our query (a sort of entity and relation recognition). Third, we fill the primitives into the poset sketch in all possible permutations, and fourth, we predict which particular paths actually belong to the correct target poset.
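
A toy, non-neural illustration of steps 3 and 4: take a de-lexicalized sketch path, fill in the predicted primitives in all possible ways, and keep only the candidates that a (here stubbed) classifier accepts. All identifiers below are made up, not taken from CFQ.

```python
# Toy fill-in-and-filter illustration of HPD steps 3-4.
from itertools import product

sketch_paths = ["?x [predicate] [entity]"]          # step 1: poset sketch path
primitives = {"[predicate]": ["ns:film.director.film", "ns:film.actor.film"],
              "[entity]": ["m.some_movie"]}         # step 2: predicted primitives

def fill_in(path, primitives):
    # Step 3: enumerate every way of plugging primitives into the placeholders.
    slots = [slot for slot in primitives if slot in path]
    for combo in product(*(primitives[slot] for slot in slots)):
        filled = path
        for slot, value in zip(slots, combo):
            filled = filled.replace(slot, value)
        yield filled

def in_target_poset(candidate):
    # Step 4: in HPD this is a learned classifier; here just a stub.
    return "director" in candidate

candidates = list(fill_in(sketch_paths[0], primitives))
print([c for c in candidates if in_target_poset(c)])
# ['?x ns:film.director.film m.some_movie']
```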

Experimentally, HPD performs surprisingly well: on average, 70% accuracy on the 3 MCD splits, compared to 20% by the Universal Transformer and roughly 40% by the mighty T5-11B. Ablations show that seq2seq and seq2tree sketch prediction only worsens performance, and the hierarchical component is crucial (removing it costs about 50 accuracy points). Hopefully, this work will inspire more research on compositional generalization and complex KGQA!

Source: Guo et al

๐Ÿ‹ Benchmarking: OGB, GraphGYM, KeOps

Tired of seeing Cora/Citeseer/Pubmed in every other GNN paper? You should be: they are small, expose certain biases, and models' performance on them has pretty much saturated. Time for a big change!

Source: Hu et al

Open Graph Benchmark (OGB) (paper by Hu et al.) is a great new effort by the Graph ML community to create a set of complex and diverse tasks on different kinds of graphs (leaderboards included). OGB offers node classification, graph classification, and link prediction tasks on graphs of various sizes (as of now, the biggest graph contains ~100M nodes and ~1.6B edges) and domains (KGs are here too: Wikidata-based and BioKG link prediction datasets).
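
Getting started takes a few lines, assuming the ogb package is installed; the snippet below loads the BioKG link prediction dataset and its official evaluator (dataset names follow the OGB website at the time of writing).

```python
# Loading an OGB knowledge graph dataset together with the official evaluator.
from ogb.linkproppred import LinkPropPredDataset, Evaluator

dataset = LinkPropPredDataset(name="ogbl-biokg")
split = dataset.get_edge_split()            # standardized train/valid/test splits
graph = dataset[0]                          # the graph itself (dict of arrays)
evaluator = Evaluator(name="ogbl-biokg")    # computes the leaderboard metric
print(split.keys())                         # dict_keys(['train', 'valid', 'test'])
```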

OGB leaderboards have already generated several Twitter storms: for instance, a simple label propagation algorithm with 10K-100K parameters suddenly outperforms big and slow GNNs with 1M+ parameters by a large margin on transductive node classification tasks. Clearly, there is still plenty of unexplored territory in the capabilities and limitations of GNNs. Could Cora/Citeseer/Pubmed demonstrate that? Probably not.
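
For reference, label propagation really does fit in a few lines; this is the generic textbook variant (propagate label distributions over a normalized adjacency matrix and clamp the training nodes), not the exact leaderboard submission.

```python
# A minimal label propagation baseline.
import torch

def label_propagation(adj_norm, y_onehot, train_mask, alpha=0.9, iters=50):
    # adj_norm: (N, N) normalized adjacency matrix (row- or symmetrically normalized)
    # y_onehot: (N, C) one-hot labels, zero rows for nodes without labels
    # train_mask: (N,) boolean mask of nodes whose labels are known
    y = y_onehot.clone()
    for _ in range(iters):
        y = alpha * (adj_norm @ y) + (1 - alpha) * y_onehot
        y[train_mask] = y_onehot[train_mask]    # clamp the known labels
    return y.argmax(dim=1)                      # predicted class per node

# Toy usage: 5 nodes on a path graph, 2 classes, labels known for nodes 0 and 4.
A = torch.tensor([[0, 1, 0, 0, 0], [1, 0, 1, 0, 0], [0, 1, 0, 1, 0],
                  [0, 0, 1, 0, 1], [0, 0, 0, 1, 0]], dtype=torch.float)
A_norm = A / A.sum(dim=1, keepdim=True)
Y = torch.zeros(5, 2); Y[0, 0] = 1; Y[4, 1] = 1
mask = torch.tensor([True, False, False, False, True])
print(label_propagation(A_norm, Y, mask))
```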

Okay, we now have a big variety of tasks. On the other hand, we have dozens of GNN architectures and hundreds of hyperparameters to tune. Is there a sweet spot, a good starting point for digging into a certain task? The space is huge! You, Ying, and Leskovec tackle exactly this problem of exploring the design space of GNNs and introduce GraphGym, a comprehensive suite for creating and evaluating GNNs (and flexing your GNN muscles). The authors define GNN design and task spaces, each consisting of fine-grained details, e.g., 12 design dimensions: batch norm, dropout rates, aggregation functions, activation functions, node feature pre-/post-processing layers, number of message passing layers, skip connections, batch size, learning rate, optimizer, and training epochs. Couple that with dozens of tasks, and the Cartesian product of possible combinations surpasses 10M options!
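
As a back-of-the-envelope check on that number, even a handful of options per design dimension multiplies out to hundreds of thousands of architectures before crossing with tasks; the option counts below are illustrative, not GraphGym's exact grid.

```python
# How quickly a GNN design space explodes with just a few options per dimension.
from math import prod

design_dimensions = {
    "batch_norm": 2, "dropout": 3, "aggregation": 3, "activation": 4,
    "pre_layers": 3, "post_layers": 3, "message_passing_layers": 4,
    "skip_connections": 3, "batch_size": 3, "learning_rate": 3,
    "optimizer": 2, "epochs": 3,
}
print(prod(design_dimensions.values()))  # 419904 designs; crossed with dozens of tasks, well past 10M
```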

Across a rich experimental agenda, the authors find the best-working combinations that you could adopt as good starting points and produce very insightful charts. The repo is openly available, so you can start experimenting pretty much right away!

By the way, if you're looking for something similar in the KG embedding domain, our team has recently completed a huge survey of models and hyperparameters concentrating on the link prediction task.

96 setups sampled from 10M possible combinations. Source: You, Ying, and Leskovec

โšก๏ธFinally, I would like to outline the work by Feydy et al on KeOps, a blazing fast kernel operations library with NumPy, PyTorch, R, and Matlab bindings. In addition to widely used dense and sparse matrices, the authors support symbolic matrices (where ij-th member is computed via a certain formula F, often a matrix reduction formula). Symbolic matrices are computed on the fly ๐Ÿš€ and optimized for CUDA computations. The authors do invest into benchmarking: on a rather standard server workstation with a 8-core Xeon, 128 Gb RAM, RTX 2080 Ti, KeOps outperforms is 5x โ€” 20x faster than PyTorch implementation on the same tasks (then PyTorch crashes with OOM while KeOps works fine).

  • You can also perform kNN search and be competitive with FAISS!
  • Some implementations in PyTorch Geometric already work well with KeOps

Personally, I've been using PyKeOps since the summer and find it extremely helpful when working with large-scale KGs. Besides, I once compiled the library on a PowerPC + CUDA cluster, so please feel my pain.
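
To give a flavor of the API, here is a small PyKeOps kNN example in the spirit of the bullet above; it assumes pykeops is installed and a GPU is available, and the sizes are arbitrary.

```python
# kNN search with PyKeOps LazyTensors: the distance matrix stays symbolic.
import torch
from pykeops.torch import LazyTensor

x = torch.randn(10**6, 64).cuda()          # database points
q = torch.randn(10**4, 64).cuda()          # query points

q_i = LazyTensor(q[:, None, :])            # (M, 1, d) symbolic
x_j = LazyTensor(x[None, :, :])            # (1, N, d) symbolic
D_ij = ((q_i - x_j) ** 2).sum(-1)          # (M, N) symbolic distances, never materialized
knn_idx = D_ij.argKmin(10, dim=1)          # indices of the 10 nearest neighbors
print(knn_idx.shape)                       # torch.Size([10000, 10])
```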

KeOps uses symbolic matrices! Source: Feydy et al

Wrapping Up

NeurIPS concludes this year's line-up of top AI conferences, but ICLR 2021 scores are already out there. If you want to stay up to date on Graph ML topics, you can subscribe to the regular newsletter by Sergey Ivanov or join the GraphML Telegram channel!

Merry Christmas, happy New Year, and stay safe!
