## What’s new for KGs in Graph ML?

# Machine Learning on Knowledge Graphs @ NeurIPS 2020

## Your guide to the KG-related research in NLP, December edition

NeurIPS is a major venue covering a wide range of ML & AI topics. Of course, there is something interesting for Graph ML aficionados and knowledge graph connoisseurs 🧐. Tune in to find out!

This year, NeurIPS had 1900 accepted papers 😳 and 100+ among them are on graphs. Plus take into account several prominent workshops like KR2ML, DiffGeo4ML, and LMCA. Be sure to check their proceedings as such workshop papers are likely to appear at future venues like ICLR, ICML, or ACL. Furthermore, 👉 the GML Newsletter #4 👈 by Sergey Ivanov presents an overview of Graph ML papers from NeurIPS including theory, oversmoothing, scalability, and many more, so check it out as well.

In this post, I’d like to put an emphasis on a particular type of graphs, *knowledge graphs (KGs)*, and explore with you 10 papers that might be quite influential in 2021. NeurIPS papers are often a bit more high level with more theory than NLP applications in *CL conferences, so I’d summarize them as:

Behind the curtain of transductive link prediction we step into the valley of logical reasoning tasks for KGs to excel at

Get some ☕️, 🍵, or even some Glühwein - today in our agenda:

- Query Embedding: Beyond Query2Box
- KG Embeddings: NAS, 📦 vs 🔮, Meta-Learning
- SPARQL and Compositional Generalization
- Benchmarking: OGB, GraphGYM, KeOps
- Wrapping Up

# Query Embedding: Beyond Query2Box

Query embedding (QE) is about answering queries against a KG directly in the embedding space without any SPARQL or graph database engines. Given that most KGs are sparse and incomplete, query embedding algorithms are able to infer missing links (with a certain probability). This is one of the hottest topics in Graph ML so far! 🔥

In the ICLR 2020 post, we covered **Query2Box**, a strong QE baseline capable of answering logical queries with *conjunction* (∧), *disjunction* (∨), and *existential quantifiers* (∃) by modeling entities as d-dimensional boxes 📦.

**Ren and Leskovec** (the authors of the original Query2Box) finally add a *negation* operator (¬) in the **BetaE** framework. Neither points nor boxes have a usable notion of negation, so *BetaE* models entities and queries as Beta distributions. Projection and intersection are modeled nicely with Beta distributions as well, and negation is a distribution with reciprocal alpha and beta parameters. In addition to DNF, we can use De Morgan’s laws to replace disjunctions with negations and conjunctions. Check out the nice illustration of the approach below 👇
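
To make the reciprocal trick a bit more tangible, here is a minimal sketch in plain Python. It only illustrates the negation rule described above; the toy parameter values and the helper names (`beta_pdf`, `negate`) are my own assumptions, not the authors’ code.

```python
import math

def beta_pdf(x, alpha, beta):
    """Density of Beta(alpha, beta) at x in (0, 1)."""
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
    return math.exp(log_norm + (alpha - 1) * math.log(x) + (beta - 1) * math.log(1 - x))

def negate(params):
    """Negation as described above: take the reciprocal of every (alpha, beta) pair."""
    return [(1.0 / a, 1.0 / b) for a, b in params]

# A toy 2-dimensional query embedding: one Beta distribution per dimension.
query = [(4.0, 2.0), (3.0, 5.0)]
not_query = negate(query)

a, b = query[0]
na, nb = not_query[0]
mode = (a - 1) / (a + b - 2)  # where the original density peaks (0.75 here)

# Original: more density at the mode than near the edge of the interval.
original_peaks_at_mode = beta_pdf(mode, a, b) > beta_pdf(0.01, a, b)
# Negated: the parameters drop below 1, so density moves towards the edges.
negated_avoids_mode = beta_pdf(mode, na, nb) < beta_pdf(0.01, na, nb)
print(original_peaks_at_mode, negated_avoids_mode)
```

Regions that were likely under the original distribution become unlikely under the negated one, which is the behaviour you would want from ¬.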

🧪 *BetaE* slightly outperforms *Query2Box* on existing query patterns 📈 while deriving and experimenting on new patterns with negations that can not be answered by any existing QE approach yet 💀. Two more 🔑 differences to *Q2B*: *BetaE* better captures query uncertainty (correlation between the differential entropy of the Beta embedding and the cardinality of the answer set, up to 77% better), and can estimate if a given query has zero answers.

On the other hand, **Sun et al** identified that Q2B and other systems are not *logically faithful*, that is, not all *logically entailed* query answers can be retrieved by the QE system. To bridge this 🕳 gap, the authors introduce **EmQL** (*Embedding Query Language*). *EmQL* still embeds entities into a d-dimensional space and supports ∧, ∨, and ∃, but takes a different approach to modelling sets 🤔. Instead of boxes or Beta distributions, the authors encode each set *X* with a pair `(a_x, b_x)`, where `a_x` is a weighted centroid of set elements and `b_x` is a count-min sketch (CM sketch). Each sketch consists of *D* hash functions, each mapping to *W* buckets (thus a *D×W* matrix; the authors select 20×2000).

How does it work?

* Using centroids, top-k MIPS `a_x.T.mm(E)` retrieves *k* possible candidate entities that belong to *X*;
* Importantly, for CM sketches we have a differentiable retrieval operator `CM(i, b_x)` that returns the weight of entity *i* in a set *X*;
* We can then combine MIPS with CM-based filtering 🥂;
* The authors then define ∧, ∨, and ∃ as operators over centroids and CM sketches.
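
The centroid-plus-sketch idea can be sketched in a few lines of plain Python. This is a simplified, non-differentiable illustration and not the EmQL implementation: the hash functions are simulated with `hashlib`, MIPS is brute force, and the toy embeddings are made-up values.

```python
import hashlib

D, W = 20, 2000  # number of hash functions and buckets per function, as quoted above

def bucket(entity_id, row):
    """Deterministic stand-in for the row-th hash function."""
    digest = hashlib.sha256(f"{row}:{entity_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % W

def make_sketch(weighted_set):
    """Build b_x: add each entity's weight to one bucket per row."""
    b_x = [[0.0] * W for _ in range(D)]
    for entity_id, weight in weighted_set.items():
        for row in range(D):
            b_x[row][bucket(entity_id, row)] += weight
    return b_x

def cm(entity_id, b_x):
    """CM(i, b_x): the minimum over rows never underestimates the true weight."""
    return min(b_x[row][bucket(entity_id, row)] for row in range(D))

def top_k_mips(a_x, E, k):
    """Brute-force maximum inner product search over the entity matrix E."""
    scores = [sum(a * e for a, e in zip(a_x, vec)) for vec in E]
    return sorted(range(len(E)), key=lambda i: -scores[i])[:k]

# Toy entity embeddings (hypothetical values) and the set X = {1, 3}.
E = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.2], [0.8, 0.1], [-1.0, 0.0]]
X = {1: 1.0, 3: 1.0}
a_x = [sum(w * E[i][d] for i, w in X.items()) for d in range(2)]  # weighted centroid
b_x = make_sketch(X)

candidates = top_k_mips(a_x, E, k=3)                 # the centroid also pulls in entity 2
answers = [i for i in candidates if cm(i, b_x) > 0]  # the sketch filters it out
print(candidates, answers)
```

The centroid alone cannot tell a true member from a close neighbour in the embedding space, while the sketch alone has no notion of similarity. Combining the two is what makes the set representation both soft and (almost) exact.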

🧪 In experiments, the authors probe *EmQL* on *generalization* (answering queries, standard QE task) and *entailment* (when a full KG is given, no link prediction required). On average, *EmQL* outperforms *Q2B* by 10–15 H@3 points on FB15k-237 and NELL on the generalization task, and completely dominates on the entailment task (94.2 vs 36.2) 👀.

Furthermore, EmQL was tested on multi-hop QA benchmarks like MetaQA and WebQSP where it outperforms even the recent EmbedKGQA from ACL 2020 💪

Note that *EmQL* does not support negations (¬) allowed by *BetaE*. Yet? 😉

# KG Embeddings: NAS, 📦 vs 🔮, Meta-Learning

Something really interesting happened at NeurIPS this year, going beyond ‘*yet-another-KG-embedding-algorithm*’. You’ve probably heard about **Neural Architecture Search (NAS)** and its successes in computer vision: for instance, recent architectures like EfficientNet are not designed by humans 🤖. Instead, a NAS system generates a neural net from a bunch of smaller building blocks 🧱, optimizing certain metrics. Can we have a NAS to generate efficient architectures for KG-related tasks?

**Zhang et al** say yes! They propose **Interstellar**, an RNN-based NAS approach for relational paths. *Interstellar* first requires sampling paths from a KG (biased random walks in this case), and then those paths are fed into an RNN. The whole RNN net (cell and weights) is a subject 🎯 for NAS. The process is split into two parts: macro-level (e.g., scoring functions) and micro-level (activations and weights) which are governed by a controller 🤖. As Hits@10 and related metrics are non-differentiable, the authors resort to policy gradients to optimize the controller.
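
Why policy gradients? Because the controller only needs the *value* of the reward, not its gradient. Below is a minimal REINFORCE sketch over a toy search space. Everything in it (architecture names, fake Hits@10 scores, learning rate, baseline) is a hypothetical stand-in and has nothing to do with the actual Interstellar search space.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, action, reward, baseline, lr=0.5):
    """One policy-gradient update: d log pi(action) / d logit_k = 1[k == action] - pi_k."""
    probs = softmax(logits)
    advantage = reward - baseline
    return [
        logit + lr * advantage * ((1.0 if k == action else 0.0) - probs[k])
        for k, logit in enumerate(logits)
    ]

# Hypothetical search space and validation scores; in a real search each reward
# would come from training the sampled architecture and measuring Hits@10.
ARCHITECTURES = ["cell_a", "cell_b", "cell_c"]
FAKE_HITS_AT_10 = {"cell_a": 0.31, "cell_b": 0.48, "cell_c": 0.42}

def search(steps=200, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * len(ARCHITECTURES)
    baseline = 0.0
    for _ in range(steps):
        action = rng.choices(range(len(ARCHITECTURES)), weights=softmax(logits))[0]
        reward = FAKE_HITS_AT_10[ARCHITECTURES[action]]  # a plain number, no gradient needed
        logits = reinforce_step(logits, action, reward, baseline)
        baseline = 0.9 * baseline + 0.1 * reward  # moving average to reduce variance
    return softmax(logits)

print(dict(zip(ARCHITECTURES, search())))
```

The reward is looked up (or, in reality, measured after hours of training) and only scales the gradient of the log-probability, so a ranking metric is as good a reward as any differentiable loss.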

🧪 In experiments, *Interstellar* is probed on link prediction and entity matching tasks showing competitive results. Each task requires certain seed architectures (like in the picture 👇), and finding a good net might take a while ⏳ (about 30 hours on search and 70 hours on fine-tuning for FB15k-237), but look, it’s the first step showing that NAS is generally applicable for KG-related tasks and can create new RNN architectures! 🤩

Besides, let’s see how fast the next Nolan movie, Tenet, will get some traction in the model-naming world 😉

Geometric embedding models enjoy ever-increasing attention 👀 from the community! Last year, in the NeurIPS’19 post, we noticed a surge in approaches that use hyperbolic geometry 🔮 for graph representation learning. This year, we have a new strong geometric competitor: hyper-rectangles, aka boxes 📦 !

Whereas Query2Box used boxes for query embedding, **Abboud et al** develop the idea further and design **BoxE**, a provably fully-expressive KG embedding model where entities are points in a vector space and relations are boxes 📦. Each relation is modeled with as many boxes as its *arity*, e.g., for a **binary** predicate `capitalOf(Berlin, Germany)` there will be *two* boxes for head and tail entities, and for *n-ary* predicates, there will be *n* boxes. Each entity, in addition to the base position, has an additional parameter, the *translational bump*, which aims at bringing entities occurring in the same relation closer 🎳 (check out an illustrated example 👇).

The authors do invest into theory 📚 and prove several important properties of *BoxE*: it can model many inference patterns except composition, it allows for rule injection 💉 (and hence, for injecting ontological axioms), and it is fully expressive. However, it’s fully expressive only when the embedding dimension is **|E|x|R|** for binary relations and **|E|^(n-1)x|R|** for n-ary predicates, which is, hmm, a bit too much 😕 (interestingly, the authors of Query2Box also showed that you need about **|E|** embedding dimension for modeling an arbitrary FOL query).
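
Here is a rough sketch of how points, bumps, and boxes interact for a binary fact. It is a simplification under my own assumptions: the real model scores facts with a distance function rather than a hard inside/outside check, and all coordinates below are hand-picked toy values.

```python
def inside(point, box):
    """True if the point lies within the axis-aligned box given as (lower, upper) corners."""
    lower, upper = box
    return all(lo <= p <= hi for p, lo, hi in zip(point, lower, upper))

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def holds(head, tail, relation, base, bump, boxes):
    """A binary fact holds if each bumped entity falls into the box for its argument slot."""
    head_box, tail_box = boxes[relation]
    head_point = add(base[head], bump[tail])  # the head is shifted by the tail's bump
    tail_point = add(base[tail], bump[head])  # and vice versa
    return inside(head_point, head_box) and inside(tail_point, tail_box)

# Hypothetical 2-d parameters, picked by hand.
base = {"Berlin": [0.0, 0.0], "Germany": [2.0, 2.0], "Paris": [5.0, 5.0]}
bump = {"Berlin": [0.5, 0.5], "Germany": [1.0, 1.0], "Paris": [0.0, 0.0]}
boxes = {
    "capitalOf": (
        ([0.5, 0.5], [1.5, 1.5]),  # head box
        ([2.0, 2.0], [3.0, 3.0]),  # tail box
    )
}

print(holds("Berlin", "Germany", "capitalOf", base, bump, boxes))
print(holds("Paris", "Germany", "capitalOf", base, bump, boxes))
```

Note how Berlin’s base position alone is outside the head box: it is Germany’s bump that moves it in, which is exactly how co-occurring entities get pulled together.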

⚗️ *BoxE* was evaluated on triple-based benchmarks like FB15k-237 as well as on n-ary graphs like JF17K. Although the embedding dimension varies in the range 200–1000 (not 15000x237 as theory needs for FB15k-237, for example), *BoxE* is still quite competitive and performs on par ⚖️ with current SOTA on the graphs w/o many compositional patterns. The authors also compiled a nice experiment on injecting logical rules over the NELL-sports dataset and showed impressive gains of more than 25 MRR points 💪.

As 2020 is the year of boxes 📦, do not miss the work of **Dasgupta et al** published here at NeurIPS who study boxes deeper on the subject of local identifiability and come up with an idea of using Gumbel distributions to model box parameters.

We also remember E2R from NeurIPS 2019, a KG embedding model based on quantum logic with interesting properties (either very high 👍 or very low 👎 performance). By that time, E2R only worked in the transductive setup (which means the whole graph is seen during training). This year, **Srivastava et al** further extend the model and come up with **IQE (Inductive Quantum Embedding)**. 🔑 Essentially, *IQE* now accepts node features so that an entity embedding has to correlate with its feature vector. Furthermore, *IQE* is now optimized with a novel *Alternating Minimization* scheme which the authors find to be approximately 9 times faster 🚀 than vanilla E2R. The authors also provide a solid theoretical justification of the model’s properties and when one should expect the model to be NP-hard.

👩‍🔬 Conceptually, the model supports binary predicates, but the authors concentrate on the fine-grained entity typing task (FIGER, Ontonotes, TypeNet) using BiLSTM as a context encoder. Note that IQE needs only about 6 epochs to converge on FIGER (in comparison, E2R required 1000 iterations)! Quantitatively, IQE outperforms the original transductive model by 25–30 accuracy and F1 points 📈

Continuing on inductive tasks, **Baek et al** study two particular link prediction setups: 1) given a training seen graph, a new *unseen* 👻 node arrives, and you need to predict its connections to *seen* 👓 nodes (👻 -> 👓); 2) more *unseen* nodes arrive and you need to predict links among *unseen* nodes themselves (👻 -> 👻). Sounds pretty complex, right? Usually, in transductive tasks, a model learns entity and relation embeddings of all seen nodes, and inference is performed on a set of seen nodes. Here, we have unseen nodes, and, often, without node features.

The authors resort to **meta-learning** and propose **Graph Extrapolation Networks (GEN)** designed to *extrapolate* the knowledge from the seen entities to unseen ones. Furthermore, the authors define the task in the *few-shot* setting, that is, unseen new nodes might have 3–5 (**K**) links to existing nodes or between other unseen nodes 🤔.

The meta-learning 👩‍🏫 task for GEN relies mostly on **relations**: given a support set of **K** triples for an unseen node *e_i*, apply neighborhood aggregation through a learnable relation-specific weight **Wr**. In fact, 👉 any relation-aware GNN architecture might be plugged in here. In other words, we meta-learn an embedding of an unseen entity using its neighbors' representations. To cater for the uncertainty of the few-shot scenario, the authors stochastically embed unseen entities as a distribution whose parameters are learned with 2 GEN layers through MC sampling (somewhat resembles GraphVAEs).
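
The aggregation step can be written down in a few lines. This is only a sketch of the deterministic part under my own assumptions (mean aggregation, made-up 2-d embeddings and weights); the meta-learning loop and the stochastic embedding are left out.

```python
def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def embed_unseen(support, W_r, seen_emb):
    """Average W_r @ e_neighbor over the K support triples of an unseen entity."""
    dim = len(next(iter(seen_emb.values())))
    total = [0.0] * dim
    for relation, neighbor in support:
        message = matvec(W_r[relation], seen_emb[neighbor])
        total = [t + m for t, m in zip(total, message)]
    return [t / len(support) for t in total]

# Hypothetical 2-d embeddings of seen entities and relation-specific weights.
seen_emb = {"aspirin": [1.0, 0.0], "ibuprofen": [0.0, 1.0]}
W_r = {
    "interacts_with": [[2.0, 0.0], [0.0, 2.0]],
    "similar_to": [[1.0, 0.0], [0.0, 1.0]],
}

# Support set of an unseen entity with K = 2 links to seen entities.
support = [("interacts_with", "aspirin"), ("similar_to", "ibuprofen")]
unseen_embedding = embed_unseen(support, W_r, seen_emb)
print(unseen_embedding)
```

The unseen entity never gets its own trained embedding row; its vector is assembled from neighbours and relation weights, which is what makes the setup inductive.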

🧪 GEN has been evaluated on 1- and 3- shot LP tasks on FB15k-237 and NELL-995 yielding significant 👍 improvements when considering unseen-to-unseen links. In addition, GEN has been applied to relation prediction on *DeepDDI* and *BioSNAP-sub* datasets with impressive gains over baselines, e.g., 0.708 vs 0.397 AUPRC on DeepDDI.

🔥 Overall, NeurIPS’20 opened up several prospects in the KG embedding area: look, Neural Architecture Search 🔎 works, Meta-Learning works, Quantum and 📦 models are getting more expressive! Thanks to that, we can now solve much more complex tasks than vanilla transductive link prediction.

# SPARQL and Compositional Generalization

📝 In question answering over KGs (KGQA), semantic parsing transforms a question into a structured query (say, in SPARQL) which is then executed against a database. One of the 🔑 problems there is compositional generalization, that is, can we build complex query patterns after observing simple atoms? In the ICLR’20 post, we reviewed a new large-scale dataset *Complex Freebase Questions* (**CFQ**) (let’s forgive them for 🧟‍♂️ Freebase) that was designed to measure compositional generalization capabilities of NL 2 SPARQL approaches. Notably, baselines like LSTMs and Transformers perform quite poorly: <20% accuracy on average 😕

🚒 **Guo et al** present a thorough study of potential caveats: one of the biggest issues is sequential decoding ⛓ or any kind of ordering bias when generating queries or logical forms, including tree decoding. Instead, they propose to leverage **partially ordered sets (posets)** and, accordingly, **Hierarchical Poset Decoding (HPD)**. *Posets* allow us to enforce permutation invariance in the decoding process (for instance, predicting two branches of a *logical AND* operator independently) so that a model can concentrate on generalization. Posets can be represented as DAGs, and components of that DAG can be predicted by a simple RNN (which the authors resort to).

However, direct prediction of posets does not bring benefits (works even worse than LSTMs and Transformers 📉). The essential part is hierarchical decoding (check the 🖼 below) which consists of 4 steps. 1️⃣ First, we predict a poset sketch (de-lexicalized DAG). 2️⃣ Independently, we predict primitives of our query (sort of entity and relation recognition). 3️⃣ Then, we fill in the primitives into the poset sketch in all possible permutations, and 4️⃣, predict which particular paths actually do belong to the correct target poset.
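
Steps 3️⃣ and 4️⃣ boil down to "enumerate, then filter", which a toy example makes clearer. The sketch path, the primitives, and the `keep` function below are all hypothetical placeholders: in the paper the first two come from neural predictors and the filter is a learned classifier.

```python
from itertools import product

def candidate_paths(sketch_path, primitives):
    """Fill every placeholder slot of a de-lexicalized path with every matching primitive."""
    options = [primitives.get(slot, [slot]) for slot in sketch_path]
    return [tuple(choice) for choice in product(*options)]

# Steps 1 and 2 are assumed to be done by upstream models; these are made-up outputs.
sketch_path = ("?x0", "PRED", "ENT")
primitives = {"PRED": ["directed_by", "edited_by"], "ENT": ["M0", "M1"]}

# Step 3: enumerate all ways to lexicalize the sketch path.
candidates = candidate_paths(sketch_path, primitives)

# Step 4: a stand-in for the learned path classifier (here a fixed lookup).
def keep(path):
    return path in {("?x0", "directed_by", "M0"), ("?x0", "edited_by", "M1")}

predicted = [path for path in candidates if keep(path)]
print(len(candidates), predicted)
```

Since each candidate path is judged on its own, the order in which the branches of a conjunction are produced stops mattering.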

🧪 Experimentally, **HPD** performs surprisingly well 👀: on average, 70% accuracy on 3 MCD splits compared to 20% by Universal Transformer and 40%-ish by the mighty T5-11B. Ablations show that seq2seq and seq2tree sketch predictions only worsen the performance, and the hierarchical component is crucial (otherwise minus 50% accuracy). 🔥 Hopefully, this work will inspire more research on compositional generalization and complex KGQA!

# 🏋 Benchmarking: OGB, GraphGYM, KeOps

Tired of seeing Cora/Citeseer/Pubmed in every other GNN paper? You should be: they are small, expose certain biases, and models’ performance has pretty much saturated. Time for a big change! ☄️

**Open Graph Benchmark (OGB)** (**paper by Hu et al**) is a great new effort by the Graph ML community to create a set of complex and diverse tasks on different forms of graphs (leaderboards included 🏆). OGB offers *node classification*, *graph classification*, *link prediction* tasks on graphs of various sizes (as of now, the biggest graph contains ~100M nodes and ~1.6B edges) and domains (KGs are here, too 😍: Wikidata-based and BioKG link prediction datasets).

🔥 OGB leaderboards have already generated several Twitter storms: for instance, suddenly, a simple label propagation algorithm of 10K-100K parameters outperforms big and slow GNNs of 1M+ parameters by a large margin on transductive node classification tasks 🙈. Clearly, the capabilities and limitations of GNNs are still far from fully explored. Could Cora/Citeseer/Pubmed demonstrate it? Probably not 🤷‍♀️.

Okay, we have such a big variety of tasks now! On the other hand, we have dozens of GNN architectures and hundreds of hyperparameters to tune. Is there a sweet spot, a good starting point to dig into a certain task? The space is so large! 🤯 **You, Ying, and Leskovec** tackle exactly this problem of exploring design spaces of GNNs and introduce **GraphGYM**, a comprehensive suite for creating and evaluating GNNs (and flexing your GNN muscles 💪). The authors define GNN design and task spaces, each consisting of fine-grained details, e.g., 12 design dimensions: batch norm, dropout rates, aggregation functions, activation functions, node features pre-/postprocessing layers, number of message passing layers, skip-layers, batch size, learning rate, optimizers, and training epochs. Couple that with dozens of tasks, and a Cartesian product of possible combinations surpasses 10M options! 👀
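
To get a feeling for how fast such a design space blows up, here is a tiny sketch with `itertools`. The dimensions and values are an illustrative slice I made up, not the actual GraphGYM configuration, and random sampling is just one possible way to avoid the full grid.

```python
import math
import random
from itertools import product

# A hypothetical slice of a design space; the values are illustrative only.
design_space = {
    "batch_norm": [True, False],
    "dropout": [0.0, 0.3, 0.6],
    "aggregation": ["mean", "max", "sum"],
    "activation": ["relu", "prelu", "swish"],
    "message_passing_layers": [2, 4, 6, 8],
}

# The grid size is the product of the number of options per dimension.
grid_size = math.prod(len(values) for values in design_space.values())

# Enumerating the grid is still feasible for 5 dimensions, but not for a dozen plus tasks.
full_grid = list(product(*design_space.values()))

def sample_configs(space, n, seed=0):
    """Draw n random configurations instead of enumerating the full grid."""
    rng = random.Random(seed)
    return [{name: rng.choice(values) for name, values in space.items()} for _ in range(n)]

print(grid_size)  # 2 * 3 * 3 * 3 * 4 = 216
print(sample_configs(design_space, n=2))
```

Five small dimensions already give 216 combinations; add the remaining dimensions and dozens of tasks, and millions of options no longer look surprising.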

In the rich experimental agenda, the authors find the best working combinations that you could adopt as good starting points and produce very insightful charts 👇. The repo is openly available, so you can start experimenting pretty much right away!

😉 By the way, if you’re looking for something similar in the KG embeddings domain, our team has recently completed a huge survey of models and hyperparameters concentrating on the link prediction task.

⚡️ Finally, I would like to outline the work by **Feydy et al** on **KeOps**, a blazing fast kernel operations library with NumPy, PyTorch, R, and Matlab bindings. In addition to widely used dense and sparse matrices, the authors support *symbolic matrices* (where the *ij-th* member is computed via a certain formula *F*, often a matrix reduction formula). Symbolic matrices are computed on the fly 🚀 and optimized for CUDA computations. The authors do invest into benchmarking: on a rather standard server workstation with an 8-core Xeon, 128 GB RAM, and an RTX 2080 Ti, KeOps is **5x to 20x faster** than PyTorch implementations of the same tasks (and at some point PyTorch crashes with OOM while KeOps keeps working).

- You can also perform a kNN search and be competitive with FAISS!
- Some implementations in PyTorch-Geometric already work well with KeOps
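
If the notion of a symbolic matrix sounds abstract, the plain-Python sketch below shows the idea only. It does not use the KeOps API at all: the function names are mine, and the Gaussian kernel is just one example of a formula *F*.

```python
import math

def gaussian_kernel(x, y):
    """The formula F(x_i, y_j) that defines entry (i, j) of the symbolic matrix."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)))

def lazy_matvec(xs, ys, b):
    """Compute sum_j F(x_i, y_j) * b_j row by row, never storing the N x M matrix."""
    return [sum(gaussian_kernel(x, y) * b_j for y, b_j in zip(ys, b)) for x in xs]

def dense_matvec(xs, ys, b):
    """Reference version that materializes the full matrix first."""
    K = [[gaussian_kernel(x, y) for y in ys] for x in xs]
    return [sum(k * b_j for k, b_j in zip(row, b)) for row in K]

xs = [[0.0, 0.0], [1.0, 0.0]]
ys = [[0.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [1.0, 2.0, 3.0]

lazy_result = lazy_matvec(xs, ys, b)
dense_result = dense_matvec(xs, ys, b)
print(lazy_result)
```

Both versions return the same numbers, but the lazy one needs memory for a single row of the reduction at a time. Doing this on a GPU with compiled reductions is where the speed-ups and the absence of OOM errors come from.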

Personally, I’ve been using PyKeOps since summer and find it extremely helpful when working with large-scale KGs. Besides, I compiled the library on a PowerPC + CUDA cluster, please feel my pain 😅

# Wrapping Up

NeurIPS concludes the line-up of top AI conferences, but ICLR 2021 scores are already out there 😉. If you want to keep updated on Graph ML topics, you could subscribe to the regular newsletter by Sergey Ivanov or join the Telegram GraphML channel!

Merry Christmas, happy New Year, and stay safe 🤗