If you still had any doubts — it’s time to admit. Machine Learning on Graphs becomes a first-class citizen at AI conferences while being not that mysterious as you might have imagined 🧙. Let’s check out the goodies brought by NeurIPS 2019 and co-located events!
Graphs were well represented at the conference. The main venue alone had more than 100 graph-related publications, and even more were available at three workshops: Graph Representation Learning (about 100 more papers), Knowledge Representation & Reasoning Meets Machine Learning (KR2ML) (about 50 papers), Conversational AI. Nice 👀 So we’ll consider all events jointly.
In this (by no means exhaustive) post we will discuss general trends in ML on graphs as well as touch upon Knowledge Graphs (KGs) as it is what we are doing at Fraunhofer IAIS (end of shameless self-promotion 😊). Here is the clickable outline:
Hyperbolic Graph Embeddings 🔮
Traditional embedding algorithms place the learned vectors in Euclidean “flat” space of possibly high dimensionality (50–200) where distance between two vectors can be expressed with Euclidean geometry. In contrast, hyperbolic algorithms employ Poincare balls and hyperbolic space. Applied to embeddings, an easy analogy represents Poincare balls 🔮 as a continuous version of trees 🌲where a root is in the middle of the ball and branches & leaves are placed closer to the “circumference” of the ball (check out the illustration). So that hyperbolic embeddings can model hierarchies much better AND use less dimensions compared to Euclidean spaces. On the other hand, training and optimization of hyperbolic nets is arguably hard. NeurIPS’18 presented several papers with deep theoretical studies of building hyperbolic neural nets. At NeurIPS’19 we see applications of hyperbolic geometry to graphs! 😍
Chami et al present Hyperbolic Graph Convolutional Neural Networks (HGCN) and Liu et al propose Hyperbolic Graph Neural Networks (HGNN). Conceptually similar, both papers aim at merging benefits of hyperbolic spaces with expressivity of GNNs but through different models. Chami et al (HGCN) focus on node classification and link prediction tasks achieving impressive error reduction rates compared to Euclidean methods especially on the datasets with lower Gromov hyperbolicity score, i.e., how tree-like a graph is (0 is for trees). Liu et al (HGNN) put an emphasis on graph classification tasks. Reference implementations of HGCN and HGNN are available on GitHub 👍
Balažević et al (creators of TuckER model from EMNLP 2019) apply hyperbolic geometry to knowledge graph embeddings in their Multi-Relational Poincaré model (MuRP). Intuitively, correct objects of triples should fall within a certain hypersphere around the subject and those decision boundaries are guided by learned parameters. The authors also apply Riemannian SGD to optimize the model (**MATH ALERT HERE 🤯**). On two standard benchmark datasets WN18RR and FB15k-237 MuRP outperforms compared algorithms on WN18RR since it is more “hyperbolic” and tree-like (actually would have been interesting to compute their Gromov hyperbolicity score as from the above papers). What is even more interesting, MuRP yields comparably accurate numbers already with 40D embeddings whereas standard Euclidean models usually employ 100D or 200D vectors! Clearly, hyperbolic models can save you some space and memory without losing in accuracy.
And our local winner in the hyperbolic KG embedding challenge is RotationH presented in another paper by Chami et al (yes, the authors of HGCN). The model employs hyperbolic rotations (conceptually somewhat close to the RotatE model but the latter one works in complex space) and learnable curvature. RotationH achieves new SOTA on WN18RR and also performs well in low dimensional settings, e.g., 32D RotationH is comparable to 500D RotatE. Impressive! 😍
If you were always wondering why you studied at Uni all those hyperbolic operations like
sinh, Poincaré balls, Lorentz hyperboloids, and where to apply them: here you go, hyperbolic geometry on graphs works 😉
Logics & Knowledge Graph Embeddings
If you keep an eye on Arxiv or proceedings of AI conferences you’d notice that every new year brings us more and more sophisticated KG embedding models that move a bit further SOTA numbers. Is there a theoretical expressivity limit or study what can be modelled and what can not? Luckily, we have some!
Chen Cai studies KG embeddings from the group theory perspective. It is shown that you can model Abelian groups in complex space and demonstrated that RotatE (that employs rotations in complex space) is able to represent any finite Abelian group 🤔. Additionally, the papers contains a gentle introduction of groups and their representations even if you didn’t study it before. However, right now there is a gap when modelling 1-N or N-N relationships. The author then hypothesizes what if we move from complex space C to quaternion field H and …
… and it was done here at NeurIPS’19 by Zhang et al! They present QuatE: a quaternion KG embedding model. Briefly, complex numbers have one real and one imaginary component, e.g., a+ib. Quaternion numbers have one real and three imaginary components, e.g., a+ib+jc+kd giving two more degrees of freedom and being more computationally stable compared than rotation matrices. QuatE models relations as rotations in 4D (hypercomplex) space generalizing ComplEx and RotatE. So in RotatE you have one plane of rotation, but in QuatE you have two! 👀 Moreover, the capability of modelling symmetry, antisymmetry and inversion is still preserved, and QuatE requires 80% less free parameters to train on FB15k-237 compared to RotatE. No analysis from the group theory perspective though, you can step in 😉
🔥Garg et al present Embed2Reason (E2R): a quantum KG embedding approach inspired by Quantum Logic which allows embedding classes (concepts), relations, and instances. Cool down, there is no quantum computing here 😃 The theory of Quantum Logic (QL) was originally proposed by Birkhoff and von Neumann in 1936 to describe subatomic processes, and the authors of E2R leverage it to preserve logical structure of KGs. In QL (hence in E2R) all unary, binary and compound predicates are actually subspaces of some complex vector space, so that entities and their combinations in a certain relation fall within a special subspace, e.g., check the illustration where Alice, Bob, and Sam belong to
physician subspace and Alex is in
dentist subspace. Originally, the distributive law
a AND (b OR c)=(a AND b) OR (a AND c) does not work in QL, but the authors apply a clever trick to bypass the issue 🤓. They also show how terms from Description Logics (DL) such as inclusion, negation, and quantifiers can be modelled with QL! The experimental results are really interesting: E2R yields whopping 96.4% of Hits@1 (hence H@10 as well) on FB15K 👀 albeit less effective on WN18. Turns out that E2R ranks correct facts either on the first place or faaar below top10, that’s why H@1 is equal to H@10 in all experiments.
Additionally, the authors used LUBM as a benchmark for deductive reasoning that contains an ontology with classes and their hierarchy. This is actually one of my concerns regarding standard benchmarking datasets FB15K(-237) and WN18(RR) that they contain exclusively instances and relations without any class attributions. Though clearly large knowledge graphs have thousands of types and tackling this information can potentially improve the link prediction and reasoning performance. Glad to see more and more approaches like E2R that advocate for including symbolic information in embeddings. 👍
🔎 Continuing on logical expressiveness of Graph Neural Nets, Barceló et al extensively study which GNN architectures able to capture which logic level. So far the study is limited to two-variable fragment
FOC_2of first-order logic since
FOC_2 is connected to the Weisfeiler-Lehman (WL) test for checking graph isomorphism. The authors demonstrate that expressiveness of Aggregate-Combine GNNs (AC-GNNs) corresponds to the description logic ALCQ [recall that OWL 2 is based on SROIQ(D)], a subset of
FOC_2. It is further proved, that if we add a readout component (proposed by Battaglia et al in their seminal paper on relational inductive biases) turning a GNN into an Aggregate-Combine-Readout GNN (ACR-GNN), then each formula in
FOC_2 can be captured by an ACR-GNN classifier. Great work! 👍
Xie et al propose LENSR - Logic Embedding Network with Semantic Regularization to embed logical rules in d-DNNF (decision-Deterministic Negation Normal Form) with Graph Convolutional Nets (GCN). The authors focus on propositional logic (in contrast to more expressive Description Logic from the above papers) and show that it is enough to add two regularization components for AND and OR to a loss function in order to embed such rules. The framework is then applied to the Visual Relation Prediction task when given an image 🖼 you need to predict correct relations between two objects (that you obtained, say, as bounding boxes from you favorite Computer Vision CNN pipeline). Top-5 accuracy grows from previous SOTA of 84.3% to all new shiny 92.77% 💪!
Markov Logic Networks Strike Back
Markov Logic Networks aim at combining first-order logic rules with probabilistic graphical models. However, direct applications of MLNs are limited by scalability issues and computational complexity of the inference process. In recent years there is a noticeable increase in improving MLNs in the context of neural approaches. And this year we have a bunch of promising architectures combining symbolic rules with probabilistic models!
🔥 Qu and Tang propose pLogicNet: a model for KG reasoning where KG embeddings are combined with logic rules. The model is trained using the variational EM algorithm (actually, there is also a surge in using EM for training & optimization in recent papers which deserves a separate article). In a nutshell, you have an MLN that defines a joint distribution over triples in your KG (of course with certain limitations on unobserved triples since enumerating all possible triples of all entities and relations is intractable) and assigns logic rules a certain weight, and you have pre-trained KG embeddings of your choice (e.g., TransE or ComplEx, literally any will do). In the E-step (inference) the model uses rules and KG embeddings to find missing triples, and in the M-step (learning) the weights of the rules are updated based on seen and inferred triples. pLogicNet demonstrates competitive results on standard link prediction benchmarks 💪. Would be interesting to see what happens if you’d plug in some fancy KG embedding models, e.g., GNNs 🤔
Next, Marra and Kuželka present Neural Markov Logic Networks, a super-class of MLNs without explicit first-order logic rules, but with a neural potential function that is supposed to encode inherent rules in a vector space. The authors also employ a min-max entropy for model optimization which is another clever (but not widely used) move. The downside is scalability as the authors report experimental results on relatively small datasets and accept the challenge as future work 👍
Finally, Zhang et al study expressivity of GNNs compared to MLNs in logic reasoning and probabilistic inference. Their analysis suggests that vanilla 🍦GNN embeddings are able to encode latent facts in KGs but fail to model dependencies between predicates a-la posterior parametrization in MLNs. To address this issue, the authors present ExpressGNN architecture with additional layers of tunable embeddings that perform hierarchical encoding of entities in a KG.
Conversational AI & Graphs
Let’s have a short break from hardcore ML algorithms and talk about NLP applications a bit. Workshops co-located with the main conference contain an interesting line-up of ConvAI+Graphs papers!
🔥 Zhou and Small propose a Dialogue State Tracking via Question Answering (DSTQA) model for task-oriented dialogue systems within MultiWOZ setup where you need to help a user achieve a certain goal among 5 domains with 30 slots and 4500+ values. A general framework is Question Answering (QA) where a system always asks questions having some slots and values in mind, and user answers the questions confirming or changing the proposed values. Another relevant hypothesis says that slots and values within one dialogue are not totally independent, i.e., when you book a 5-star hotel and then ask for a restaurant you’d probably want smth from moderate to high price range. The architecture diagram below is pretty cumbersome 😬 so let’s find key novelties:
- First, the authors model dialogue state as a dynamic knowledge graph evolving over time based on the context. Nodes are taken from domains, slots and values, relations are constructed according to the hypothesis above that slots might have same/overlapping/correlating values
- Second, the Graph Attention Net (GAT) learns to weight nodes in the graph and its output is then fed into the gating mechanism that decides ‘how much of a graph’ 🍰you want in the question context 🎂
- The model is trained jointly on system and user utterances thanks to the role embeddings
- Finally, both CharCNN and ELMO embeddings are used to encode the dialogue context
DSTQA yields SOTA 💪 results on MultiWOZ 2.0 and MultiWOZ 2.1 (97% of slot accuracy and 51% of joint accuracy) while performing on par with the best approaches on WOZ 2.0 (still 90+% of joint accuracy). According to the error analysis, most of the errors come from ground truth annotations which, unfortunately, often happens in large-scale crowdsourcing tasks 😞
Neelakantan et al present Neural Assistant, the architecture for dialogue systems that takes into account conversation history together with facts from a knowledge base. An extension of the Transformer architecture, the approach first encodes tokens from the conversation history. The KB contains triples in words like
(restaurantX, price, cheap) (no fancy KG schemas like Wikidata), and the triples get encoded by the Transformer as well. Finally, the decoder attends to both history and KB facts in order to generate an output utterance and optionally the next action to take. 💡Instead of applying a softmax function over all KB triples (quite inefficient for any reasonably large KB), the authors apply weak supervision based on the presence of entities from the KB in the ground truth responses. The architecture outperforms vanilla Transformers in a MultiWOZ setup showing 90+% F1 score on predicting actions and involved entities. However, further analysis shows that the accuracy deteriorates rather quickly already when you have about 10K facts, and has memory issues at the point of 30K facts. So if you wanted to plug in the whole Wikidata with 7B triples: well, not yet 🤔
When working with task-oriented systems you need to retrieve data stored in an external database which content you can’t keep in memory. Those can be graph databases, e.g., that work with SPARQL or Cypher, or classical SQL databases. For the latter case there is a bunch of tasks appeared recently and WikiSQL was among the first tasks that gained academic attention. Now we can say that less than in 2 years this dataset is pretty much solved and neural approaches achieve super-human 🏆 performance! Hwang et al present SQLova, a semantic parsing model that uses BERT to encode questions and table headers, and attention-based decoder that produces SQL constructs (for instance, SELECTs, WHERE clauses, aggregation functions, etc) that are later ranked and evaluated. The authors also outline that brute-force BERT decoding without semantic parsing works much worse, so use language models wisely 🧙♂️. Test accuracy reaches 90% (there is one more system X-SQL that yields almost 92%) where humans show only 88%, and the systems pretty much hit the annotation error wall (similar to the MulitWOZ case above).
And shortly some more NLP-related papers I’d encourage you to explore: Vivona and Hassani propose a relational GNN with attention for solving Question Answering task both on text and KG having WebQuestionsSP as a target dataset. Dash et al tackle simultaneous relation extraction from text and immediate fact checking of candidates in an underlying KG via pre-trained KG embeddings. The approach is scalable to KGs with millions of triples (as to Common Crawl — DBpedia corpus with >6M triples). Razumovskaia and Eskenazi study how rules can be incorporated into end-to-end dialogue systems and their context such that generated utterances are more diverse, e.g., if a database has been queried you wouldn’t ask for a slot or greet a user again. The best performing option encodes rules together with the dialogue context. The approach is pretty generic and can be used with any response generation-style architecture 👍
Pre-training and Understanding Graph Neural Nets
In this section I’d like to point out several papers that study GNNs on a more general level including some research on explainability of GNN models.
🔥 Hu, Liu et al propose and explain one of the first frameworks for pre-training GNNs. Conceptually it is somewhat similar to pre-training language models on huge amounts of text and then fine-tuning the model on a certain task. The authors question whether it is possible to apply the same approach for graphs. Short answer: Yes 😍 but you need to use it wisely. The authors also provide valuable insights on pre-training methods that should capture structural and domain knowledge on both node level (e.g., node classification) and graph level (e.g., graph classification). That is, on the node-level for learning structural properties the Context Prediction task aims at predicting surroundings of a node based on the embeddings with the help of negative sampling (resembles word2vec training, right?), whereas Masking randomly masks node/edge attributes and asks to predict them.
The authors show why Aggregate-Combine-Readout GNN architectures (ACR-GNNs) are preferred since they allow to obtain the entire representation of a graph with a permutation invariant pooling function. Experiments show that applying only graph-level supervised pre-training brings negative transfer in downstream tasks so you need to combine both node-level and graph-level representations. Such a combination of features yields 6 to 11% improve in ROC-AUC score across 40 different prediction tasks. Transfer learning on graphs is officially open? 🤔 Would great folks from Hugging Face 🤗 also provide the best library to spin up pre-trained GNN models? 😉
Yun et al propose Graph Transformer architecture for heterogeneous graphs, i.e., graphs that have multiple types of nodes and edges. Graph Transformer Networks (GTNs) employ 1x1 convolutions to obtain representations of meta-paths (chain of edges). The idea is to generate then a set of new meta-paths (meta-meta-paths? 🤔) of arbitrary length (defined by the number of transformer layers) that supposedly encode more valuable signal for downstream tasks. Experimental results prove GTNs achieve SOTA results on node classification tasks though the number are close to Graph Attention Nets (GATs).
🔎 Last, but not least, Ying et al tackle an important task of explaining outcomes of GNNs and present GNN Explainer — a model-agnostic framework that outputs interpretations for predictions of any graph-based model on any task! That is, you have node/graph classification problem with, say, Graph Attention Nets, and you’d like to see explainable results of your problem — spin up GNN Explainer. Conceptually, GNN Explainer maximizes mutual information between model predictions and possible subgraph structures combining graph and node features (of course with clever optimization tricks as testing all possible subgraphs is intractable). As an explanation the framework returns a subgraph with the most important pathways and features which can be easily interpreted by humans. Have a look 👀 at the illustrative example nearby from the paper. Hands down, great work!
Machine Learning on graphs works! And works in different domains, i.e., from CV to NLP and Reinforcement Learning. As NeurIPS is huge I’m sure we’ll see even more exciting reviews and detailed insights from the community. I’m quite sure some of the workshop papers we explored will be accepted at the upcoming ICLR 2020 😉.
Thank you for reading!