Graph Machine Learning Meets Graph Databases

Published in ArangoDB · 15 min read · Jun 18, 2021

In this blog post, we take a deep dive into the graph machine learning universe. I start with a personal account of how I came to know this exciting area, then motivate why the graph data structure is so important. Afterward, we will look at some applications of graph machine learning in various domains from both academic and industry perspectives. I will also talk about the recent Open Graph Benchmark: Datasets for Machine Learning on Graphs. The last section is about graph databases and where they currently stand in the universe of graph machine learning. In brief, here is the outline of the blog:

Motivation
Applications of Graph ML from various Perspectives
Open Graph Benchmark: Datasets for Machine Learning on Graphs
The Intersection of Graph ML and Graph Databases

Let’s start a journey together (image credit)

Motivation

I started my research project (as an M.Sc. student) in the area of Graph Machine Learning (GML) at DFKI in 2019, when the PyTorch Geometric library was not yet an official part of the PyTorch ecosystem and there were no dedicated courses teaching machine learning with graphs. My project was based on the intersection of Convolutional Neural Networks and Graph Neural Networks, applied to medical image diagnosis. In a very short span of time (from 2019 to 2021), the field of Graph Machine Learning has seen massive success in numerous applications such as Computer Vision, Natural Language Processing, Social Network Analysis, Recommendation Systems, Algorithmic Reasoning, Computer Networks, Event Graphs, Economics, Internet of Things, Code Graphs, Life Sciences, Knowledge Graphs, Physics (Physical System Modelling), Graph Mining, Traffic Networks, Graph Generation, and Automated Botnet Detection (detecting DDoS attacks and spam).

Euclidean (left) Vs Non-Euclidean (right) Space (image credit)

Therefore, in this blog post, I will motivate you with some examples of how we can leverage the expressive power of Graph Neural Networks in traditional machine learning/deep learning methods of Computer Vision and NLP. Afterward, I will talk about the recent research work/applications of GML in multiple domains. We will also investigate in brief the Open Graph Benchmark which is a collection of large-scale diverse benchmark datasets specifically designed to perform machine learning on graphs. One of the authors of this paper is Matthias Fey who is the developer of the PyTorch Geometric library. In the end, I have provided an introductory study of the intersection between Graph ML, Graph Analytics, and Graph Databases.

Note: I will not go into the detail of the working process of various application scenarios of Graph ML but rather motivate you about Graph ML and its applications in various domains.

Why should you care about Graphs and Graph Machine Learning?

The answer has a few parts. First, one of the things that makes graphs fascinating is their ability to connect isolated data points in space, which is missing in other forms of data representation like 2D grid structures (used to represent image pixels), 1D sequences (used for text, speech, or time series), or even 3D structures (like point clouds). Second, graphs allow us to store relational knowledge among interacting entities, i.e. how different sets of entities are related to each other based on some measure.

All the aforementioned representations except the graph lie in Euclidean space, i.e. they are regular, grid-like data. The well-known method for extracting meaningful patterns from data in the Euclidean domain is the Convolutional Neural Network (CNN). However, in the real world we are surrounded by graph data, where nodes represent real-world entities and edges represent the relationships between those entities.

Some of the examples of graph data (image credits: Thomas Heuer, Gusi Te et al, Social Networks, Nouran Amin, Axway)

As a non-Euclidean data structure, a graph cannot be processed directly by CNNs and requires special methods to handle its irregular structure, which led to the recent progress in the area of Graph Neural Networks (GNNs). In the machine learning universe, GNNs exploit graph data structures to perform tasks such as node classification, link prediction, graph classification/generation, and clustering.

Applications of Graph Machine Learning from Various Perspectives

Graph Machine Learning applications can be mainly divided into two scenarios: 1) Structural scenarios, where the data already exists in graph form, i.e. as nodes and edges. This type of scenario occurs in scientific research (graph mining, modeling physical systems, and life sciences) and industrial applications (knowledge graphs, traffic networks, and recommendation systems). 2) Non-structural scenarios, where the data does not come in graph form, e.g. images (computer vision) and text (NLP).

A diagrammatic illustration of various applications in which GML can be used (image credit)

Scene Graph Generation using Graph Machine Learning

Current state-of-the-art object detection models like YOLO or R-CNN localize one or more objects in an image by drawing bounding boxes around them and then classify each localized object into one of several classes. However, if a machine is to comprehend a visual scene in context rather than just recognize objects in isolation, it also needs to extract rich semantic information about the scene (using the relationships among objects).

Let's understand this with an example:

Detecting objects on two images (image credit)

In the above images, the object detection model understands each scene by just detecting the two objects in it, i.e. a horse and a man. However, the semantics of the two images are totally different. In the left image, a man is standing beside the horse, and in the right image, a man is feeding the horse. Therefore, what is missing in object detectors is the ability to comprehend how different objects are interacting in a visual scene. To overcome this limitation, scene graphs were introduced, which explicitly model objects and their relationships. Given an input image, scene graph generation produces a visually-grounded graph in which the nodes represent object instances in the image and the edges represent the pairwise relationships among those object instances.

A visual example of scene graph generation from the input image (image credit)

The above image illustrates that a scene graph does not just perceive the objects in isolation but also models their pairwise interactions. Unlike an object detector, it does not just tell you that there is a man and a horse; it also captures semantic information, e.g. the man is feeding the horse, the man is wearing glasses, etc. Application scenarios of scene graph generation include but are not limited to Visual-Textual Transformer, Image Retrieval, Visual Question Answering, Image Understanding and Reasoning, 3D Scene Graphs, and Human-Object Interaction.
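To make the idea of nodes as objects and edges as relationships concrete, here is a minimal sketch of how the two example scenes could be stored as (subject, predicate, object) triples. The object IDs, the predicate names, and the helper functions are hypothetical and only serve as an illustration; a real scene graph generator would predict them from the image.

```python
from collections import defaultdict

# Hypothetical detections: both images contain the same object classes.
left_objects = {0: "man", 1: "horse"}
right_objects = {0: "man", 1: "horse", 2: "glasses"}

# Relationships as (subject_id, predicate, object_id) triples.
left_relations = [(0, "standing beside", 1)]
right_relations = [(0, "feeding", 1), (0, "wearing", 2)]


def build_scene_graph(objects, relations):
    """Return an adjacency list: subject label -> list of (predicate, object label)."""
    graph = defaultdict(list)
    for subject_id, predicate, object_id in relations:
        graph[objects[subject_id]].append((predicate, objects[object_id]))
    return dict(graph)


def describe(scene_graph):
    """Turn every edge of the scene graph into a short readable phrase."""
    return [
        f"{subject} {predicate} {obj}"
        for subject, edges in scene_graph.items()
        for predicate, obj in edges
    ]


left_graph = build_scene_graph(left_objects, left_relations)
right_graph = build_scene_graph(right_objects, right_relations)
print(describe(left_graph))
print(describe(right_graph))
```

An object detector would report "man" and "horse" for both images, while the two scene graphs differ in their edges, which is exactly the information that separates the two scenes.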

Note:

1) To read in detail about how it works, the reader can go through Scene Graphs: A Survey of Generations and Applications (March 2021) by Xiaojun Chang et al.
2) Scene Graph Generation by Iterative Message Passing by Danfei Xu et al.

Graph Machine Learning for Interpretability in NLP Tasks

Interpretability is defined as the degree to which a human can comprehend why a machine learning model has made a specific decision. Graph Machine Learning can help NLP practitioners find out which linguistic information a given model encodes and how that encoding happens, in particular which parts of the graph (e.g. syntactic trees or co-reference structures) contribute to a prediction. To this end, the authors of the paper Interpreting Graph Neural Networks for NLP with Differentiable Edge Masking developed a technique for interpreting the predictions of Graph Neural Networks (GNNs) that identifies unnecessary edges. The method is applied to an already trained GNN: on top of it, a simple classifier is learned that predicts, for every edge in each layer, whether that edge can be removed. Put another way, the classifier identifies which edges of the graph the GNN depends on and at which layer they are used. This classifier can be trained in an end-to-end fashion.

The approach to interpretation is based on erasure search, where we search for a maximal subset of features that can be entirely removed without affecting model predictions. When applied to GNNs, erasure search would involve searching for the largest subgraph that can be completely removed. However, erasure search is not tractable, which makes it infeasible for this problem. Therefore, to overcome this limitation in a scalable manner, the authors of the above-mentioned paper introduced the GraphMask approach.

GraphMask uses vertex hidden states and messages at layer k (left) as input to a classifier g that predicts a mask z. (image credit)

GraphMask can be perceived as a differentiable form of subset erasure, where, instead of finding an optimal subset to erase for every given example, we learn an erasure function that predicts for every edge at every layer k whether that connection should be retained.
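The gating idea can be sketched in a few lines of plain Python. This is only an illustration under assumptions, not the authors' implementation: the gate below is a hypothetical logistic-regression stand-in for the classifier g, and I assume that a dropped message is replaced by a baseline vector. The details of training (the relaxation that keeps the mask differentiable and the sparsity objective) are described in the paper and are not reproduced here.

```python
import math


def gate_probability(h_source, h_target, message, weights, bias):
    """Hypothetical stand-in for the classifier g.

    It looks at the hidden states of both endpoints and the message sent
    along the edge, and returns the probability of keeping that edge.
    """
    inputs = h_source + h_target + message  # list concatenation
    logit = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def masked_message(message, z, baseline):
    """Keep the message when z = 1, substitute the baseline when z = 0.

    Assumption: a masked edge contributes a baseline vector instead of
    its original message.
    """
    return [z * m + (1 - z) * b for m, b in zip(message, baseline)]


message = [1.0, 2.0]
baseline = [0.5, 0.5]
kept = masked_message(message, 1, baseline)
dropped = masked_message(message, 0, baseline)
print(kept, dropped)
```

With a binary z per edge and per layer, the set of edges with z = 1 is the part of the graph that is retained and can be inspected afterwards.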

The authors of the paper showed that a large number of edges can be removed from the graph without affecting the performance of the model, and that the remaining edges can be used for interpreting the model predictions.

Graph Machine Learning for Life Sciences

Recently, GML has made a lot of breakthroughs in various subfields of life sciences such as genetics, protein-protein interactions, protein folding, drug discovery, molecular graphs, biological networks, and healthcare data mining. In addition, there are now some good libraries that can be used right away for converting raw input data from the life sciences domain into a format suitable for machine learning, as well as for graph construction, evaluation, model architectures, training, etc. Here are some popular libraries that a reader can go through: DeepChem, RDKit, and dgl-lifesci.

Now, let's deep dive and discuss some of the latest research work around the subfields of Life Sciences using GML:

Leveraging graph‑based hierarchical medical entity embedding for healthcare applications by Tong Wu et al.: In this work, the authors propose ME2Vec, a method for learning continuous low-dimensional embedding vectors of the most common entities in electronic health records (EHR), such as diagnoses, prescriptions, lab test results, medical procedures, doctor profiles, and patient demographics. EHR data can be a mixture of tabular values, text notes, and medical codes. We can represent these heterogeneous medical entities and the complex relationships between them as a graph-based data structure. Once the graph is formed, we can exploit the rich interactions (like patients to doctors, doctors to services, and patients to services) between different types of medical entities. For example, patients sharing the same doctors across multiple medical treatments tend to have similar profiles of disease progression, which can be leveraged for improved diagnosis prediction and risk reduction.

The above figure shows that the medical graph created from EHR consists of three clusters of nodes representing different entities (patients, doctors, and medical services) as well as edges connecting nodes that denote intra-cluster or inter-cluster relations.

Hence, the question of how to model EHR data with graphs and leverage the expressive power of GML to learn powerful representations of medical entities has attracted a lot of researchers and engineers from academia and the healthcare industry.

Towards multi-modal causability with Graph Neural Networks enabling information fusion for explainable AI by Andreas Holzinger et al.: This paper emphasizes the importance of explainable AI (xAI) in the complex domain of medicine. Artificial Intelligence (AI) is bringing a revolution to the healthcare industry. Decisions made by the medical diagnostic models can be responsible for human life, hence we need to be confident enough to treat patients with a trustworthy AI (transparent and explainable) rather than the black-box model. One crucial characteristic of the medical domain is that multiple heterogeneous modalities (images, text, genomics data) can contribute to one single result and can produce more effective decisions than using a single source of input data.

Fusing four different data modalities i.e. time-series, histopathological images, knowledge database, and patient histories, and representing them as an interaction and correspondence graph. (image credit)

Therefore, in this work the authors address the question of using Graph Neural Networks to construct a multi-modal feature representation space (information fusion), using knowledge bases as a starting point for the development of an effective, explainable human-AI interface.

A Novel Method to Predict Drug-Target Interactions Based on Large-Scale Graph Representation Learning by Bo-Wei Zhao et al.: Prediction of drug-target interactions (DTIs) is an important part of drug discovery and repositioning. Drug repositioning investigates new uses of existing drugs beyond their original medical indications. Efficient prediction of DTIs can improve the robustness of clinical drug trials, which decreases the risk of experiments.

A) Represents associations between drugs and targets. B) An example of the graph embedding in drug-target interactions (DTIs) C) An example of the graph convolutional network (image credit)

Hence, the authors proposed a new method called LGDTI to predict DTIs based on large-scale graph representation learning. LGDTI extracts both the local structural information of the graph (using a GCN to aggregate first-order neighbor information) and the global structural information (high-order neighbor information of nodes, obtained with DeepWalk). In the end, the two sets of features are given to a random forest classifier to train and predict probable DTIs.
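To give a feel for how DeepWalk reaches beyond first-order neighbors, the sketch below generates truncated random walks over a toy drug-target graph. In DeepWalk, such walks are then treated like sentences and fed to a skip-gram model to learn node embeddings; that second stage is omitted here. The graph, the node names, and the parameter values are made up for illustration and are not taken from the LGDTI paper.

```python
import random

# Toy bipartite drug-target graph as an adjacency list (hypothetical data).
toy_graph = {
    "drug_1": ["target_1", "target_2"],
    "drug_2": ["target_2"],
    "target_1": ["drug_1"],
    "target_2": ["drug_1", "drug_2"],
}


def random_walk(graph, start, length, rng):
    """Walk `length` nodes from `start`, picking a uniformly random neighbor each step."""
    walk = [start]
    while len(walk) < length:
        neighbors = graph[walk[-1]]
        if not neighbors:  # dead end: stop early
            break
        walk.append(rng.choice(neighbors))
    return walk


def generate_walks(graph, walks_per_node, length, seed=0):
    """Start `walks_per_node` walks from every node, visiting nodes in shuffled order."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        nodes = list(graph)
        rng.shuffle(nodes)
        for node in nodes:
            walks.append(random_walk(graph, node, length, rng))
    return walks


walks = generate_walks(toy_graph, walks_per_node=2, length=4, seed=1)
for walk in walks[:3]:
    print(" -> ".join(walk))
```

A walk of length 4 that starts at drug_2 can reach target_1 through drug_1, even though the two are not directly connected; this is the kind of high-order neighborhood information that complements the first-order aggregation of a GCN.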

Large-scale graph representation learning (LGDTI) framework. (image credit)

In contrast to previous GNN-based methods for DTIs (mentioned in [1]), LGDTI focuses on comprehending the relationships in networks of known drugs and targets in a detailed fashion, using two different graph-based representation learning techniques.

Program Execution with Instruction Pointer Attention Graph Neural Networks

Model path comparison. The edges in these graphs represent a possible set of paths along which information may propagate in each model. The pills indicate the positions where a model makes a learned branch decision. (image credit)

By taking advantage of program structures (like control flow graphs, parse trees, and data flow graphs), GNNs have emerged as an influential method for learning software engineering tasks like bug finding [2], code completion [3], and program repair [4]. All the mentioned tasks are forms of static analysis (the process of analyzing programs without executing them). However, tasks like program execution require more sequential reasoning steps than the number of GNN propagation steps, so GNNs alone do not perform well on them. On the other hand, Recurrent Neural Networks (RNNs) handle long sequential chains of reasoning quite well, but they lack information about program structure. Hence, David Bieber et al. from Google proposed an architecture that combines the advantages of both models (GNNs and RNNs) and called it the Instruction Pointer Attention Graph Neural Network (IPA-GNN). The aim of the architecture is to imitate the structure of an interpreter while remaining closely related to both of these model families.

The authors observed that, after training, IPA-GNN makes discrete branch decisions most of the time, and that it has in fact learned to execute by taking shortcuts, using fewer steps to execute programs than the ground truth trace.

Graph Machine Learning for Recommendation Systems

Recently, recommender systems have played a key role in various industries, for example product suggestions on e-commerce websites (e.g. Amazon, eBay) or playlist generators for video and music services (e.g. YouTube, Netflix, and Spotify). The main goal of a recommender system is to learn user and item representations from their existing relationships. We can model this problem with a graph data structure and apply GNNs to it to find meaningful graph representations.

Let's see one of the interesting applications:

Food Discovery with Uber Eats:

Uber Eats uses the power of Graph ML to suggest to its users the dishes, restaurants, and cuisines they might like next. To make these recommendations, Uber Eats uses the GraphSAGE algorithm because of its inductive nature and its ability to scale up to a billion nodes.
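The inductive nature of GraphSAGE comes from the fact that it learns how to aggregate a node's neighborhood instead of learning one fixed embedding per node, so the same function can be applied to nodes that appear after training. Below is a minimal sketch of a single mean-aggregation step with neighbor sampling. It deliberately leaves out the learned weight matrix and non-linearity of the real algorithm, and the users, dishes, and feature values are hypothetical; this is not Uber's implementation.

```python
import random


def sage_embed(node, graph, features, num_samples, rng):
    """One simplified GraphSAGE-style step: sample neighbors, average, concatenate.

    A real layer would additionally multiply the result by a learned
    weight matrix and apply a non-linearity; both are omitted here.
    """
    neighbors = graph[node]
    if len(neighbors) > num_samples:
        # Fixed-size sampling keeps the cost per node bounded on huge graphs.
        neighbors = rng.sample(neighbors, num_samples)
    dim = len(features[node])
    if neighbors:
        aggregated = [
            sum(features[n][i] for n in neighbors) / len(neighbors)
            for i in range(dim)
        ]
    else:
        aggregated = [0.0] * dim
    return features[node] + aggregated  # list concatenation: [self | neighborhood]


# Hypothetical user-dish graph with 2-dimensional input features.
graph = {"user_1": ["dish_a", "dish_b"], "dish_a": ["user_1"], "dish_b": ["user_1"]}
features = {"user_1": [0.5, 0.5], "dish_a": [1.0, 0.0], "dish_b": [0.0, 1.0]}
rng = random.Random(0)
print(sage_embed("user_1", graph, features, num_samples=2, rng=rng))

# A user who joins later can be embedded with the very same function.
graph["new_user"] = ["dish_a"]
graph["dish_a"].append("new_user")
features["new_user"] = [0.0, 0.0]
print(sage_embed("new_user", graph, features, num_samples=2, rng=rng))
```

Because nothing in `sage_embed` is tied to a particular node ID, a new user or dish only needs input features and edges to receive an embedding.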

So far, we have discussed how different research communities and industries are leveraging the power of Graph Machine Learning. Let us now turn to the available open-source benchmark datasets that can be used to facilitate GML research.

Open Graph Benchmark: Datasets for Machine Learning on Graphs

The Open Graph Benchmark (OGB) library provides a comprehensive list of ready-to-use benchmark datasets for various GML tasks. It comprises a diverse set of challenging and practical datasets covering social, information, and biological networks, molecular graphs, source code ASTs, knowledge graphs, etc. OGB is fully compatible with PyTorch and its associated graph libraries, PyTorch Geometric and Deep Graph Library. The library also provides its own easy-to-use data loaders and task-specific evaluators.
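As a rough sketch of what the data loader and evaluator look like in practice, the function below loads one node-level dataset through the PyTorch Geometric interface of OGB. It assumes that the `ogb`, `torch`, and `torch_geometric` packages are installed; the choice of `ogbn-arxiv` and the `root` folder are my own picks for illustration, and the first call downloads the dataset.

```python
def load_ogbn_arxiv(root="dataset"):
    """Load the ogbn-arxiv dataset together with its split and evaluator."""
    # Imported lazily so that merely defining this function does not require ogb.
    from ogb.nodeproppred import Evaluator, PygNodePropPredDataset

    dataset = PygNodePropPredDataset(name="ogbn-arxiv", root=root)
    split_idx = dataset.get_idx_split()  # dict with "train", "valid", "test" indices
    graph = dataset[0]  # a single PyTorch Geometric Data object
    evaluator = Evaluator(name="ogbn-arxiv")  # task-specific metric for this dataset
    return graph, split_idx, evaluator
```

After calling `graph, split_idx, evaluator = load_ogbn_arxiv()`, a model's predictions can be scored with `evaluator.eval({"y_true": ..., "y_pred": ...})`, so that results are computed the same way for everyone who reports on the dataset.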

Features of OGB datasets:

Large scale: The datasets come in 3 size ranges, i.e. small (up to 100 thousand nodes), medium (more than 1 million nodes), and large (up to 100 million nodes).
Diverse domains: The datasets are collected from diverse domains, such as an Amazon product co-purchasing network, a protein-protein association network, a paper citation network, a drug-drug interaction network, abstract syntax trees of source code, etc. In addition, the dataset splits are domain-dependent, i.e. based on time, species, molecule structure, etc.
Multiple task categories: The datasets cover Graph ML tasks at 3 different levels of a graph, i.e. node, link, and entire-graph prediction.

Where Graph ML and Graph Databases Intersect

Graph query vs Graph Analytics vs Graph ML (image credit)

Graph databases are databases that store schema-free, otherwise isolated pieces of information (nodes) and connect them through meaningful relationships (edges). Edges can be undirected or can express a directed from-to relationship between the nodes. This contrasts with relational databases, which handle vast amounts of records but do not store the interconnections between those records as first-class elements. Graph databases are typically used (but not only) in scenarios where the data is highly connected (e.g. social networks, IoT, transportation networks), where the data model is very dynamic (the schema changes frequently), or where we need to perform graph analytics over the stored relationships (e.g. graph traversals, pattern matching, clustering, PageRank). Recently, we have seen a surge in the development of powerful graph databases like ArangoDB, Amazon Neptune, Neo4j, and graph layers built on top of Cassandra (to name a few), which help you store and exploit the complex relationships between entities. In this blog post, I will give you a brief introduction to getting started with graph analytics using ArangoDB, which is a free and open-source native multi-model database system.

But before moving forward, you might be wondering what is so fascinating about this popular graph database system known as ArangoDB.

The answer is that ArangoDB supports three data models (key/value, documents, graphs) with one database core and a unified query language, AQL (ArangoDB Query Language). The query language is declarative and allows the combination of different data access patterns in a single query. ArangoDB is a NoSQL database system, but AQL is similar in many ways to SQL. (reference: Wikipedia)
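As an illustration of combining access patterns, the sketch below builds an AQL query that traverses a graph (from a user along rating edges to movies) and then sorts and projects document attributes in the same statement. The collection names `Users` and `Ratings` and the attributes `title` and `rating` are hypothetical. `run_query` assumes a database handle from the python-arango driver, whose `db.aql.execute` method accepts a query string and `bind_vars`.

```python
def top_rated_movies_query(user_key, limit=5):
    """Build an AQL query and bind variables for a user's best-rated movies."""
    aql = """
    FOR movie, r IN 1..1 OUTBOUND @start Ratings
        SORT r.rating DESC
        LIMIT @limit
        RETURN { title: movie.title, rating: r.rating }
    """
    # Bind variables keep user input out of the query text itself.
    bind_vars = {"start": f"Users/{user_key}", "limit": limit}
    return aql, bind_vars


def run_query(db, aql, bind_vars):
    """Execute the query on a python-arango database handle and collect the rows."""
    return list(db.aql.execute(aql, bind_vars=bind_vars))


aql, bind_vars = top_rated_movies_query("42", limit=3)
print(aql)
print(bind_vars)
```

The `FOR ... IN 1..1 OUTBOUND` part is the graph traversal, while `SORT`, `LIMIT`, and `RETURN` work on the attributes of the edge and vertex documents, much like in a document query.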

ArangoDB is the foundation for Graph ML (image credit)

Graph Analytics

Once the graph has been constructed by carefully designing meaningful relationships among the node entities, graph analytics comes into the picture. Graph analytics helps you leverage the full potential of the associations among entities in a graph, including their direction (a graph can be directed or undirected).

For example, let us say we construct a graph in which the nodes are either users or movies and the edges between them are the ratings given by users to movies (in effect, a bipartite graph). Graph analytics can then help answer questions like the following:

Is the graph connected? (In the context of this application, a disconnected graph implies that there are users in the graph who may not have rated a movie in common.)
Which users are the most influential ones in the graph? (In this case, which users have rated the most movies; we can compute this via different node centrality measures.)
Which movies have been rated by the most users?
Which users are similar to a user who has viewed a specific movie (e.g. Titanic)?
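The questions above can be tried out on a tiny in-memory graph before moving to a real dataset. The following sketch uses plain Python, made-up ratings, and degree as the simplest centrality measure; the last question is approximated by listing the users who rated the same movie. In a graph database, the same questions would be answered with queries or built-in graph algorithms.

```python
from collections import deque

# Hypothetical (user, movie, rating) records.
ratings = [
    ("alice", "Titanic", 5),
    ("bob", "Titanic", 4),
    ("bob", "Inception", 5),
    ("carol", "Inception", 3),
    ("dave", "Up", 4),
]


def build_graph(ratings):
    """Build an undirected bipartite graph; nodes are ("user", name) or ("movie", title)."""
    graph = {}
    for user, movie, _ in ratings:
        graph.setdefault(("user", user), set()).add(("movie", movie))
        graph.setdefault(("movie", movie), set()).add(("user", user))
    return graph


def is_connected(graph):
    """Breadth-first search from an arbitrary node; connected if every node is reached."""
    if not graph:
        return True
    start = next(iter(graph))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return len(seen) == len(graph)


def highest_degree(graph, kind):
    """Names of the nodes of the given kind ("user" or "movie") with the most edges."""
    degrees = {name: len(nbrs) for (k, name), nbrs in graph.items() if k == kind}
    top = max(degrees.values())
    return sorted(name for name, degree in degrees.items() if degree == top)


def users_who_rated(graph, movie):
    """All users connected to the given movie node."""
    return sorted(name for _, name in graph.get(("movie", movie), ()))


graph = build_graph(ratings)
print(is_connected(graph))              # False: dave and "Up" form their own component
print(highest_degree(graph, "user"))    # ['bob']
print(highest_degree(graph, "movie"))   # ['Inception', 'Titanic']
print(users_who_rated(graph, "Titanic"))  # ['alice', 'bob']
```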

If you want to get some hands-on experience applying the above example to a real-world dataset, you can explore the already created Google Colab notebook, which helps you find the answers to the above questions.
Note: In addition to the above example, there are a lot of other interesting graph analytics/ML examples (like NLP with ArangoSearch, Fraud Detection, Graph Embeddings, etc.) which exploit the potential of graphs using a graph database such as ArangoDB (ArangoDB Interactive Tutorials).

Conclusion

Phew, that was a long journey! If you have followed along till this section, then congrats: today you have learned a lot about the graph universe. We started from a general motivation for graphs and GML, then saw how different research communities are leveraging the expressive power of graphs, and also briefly discussed the standard benchmark datasets for performing GML research. In the end, we were introduced to ArangoDB, one of the popular graph database systems, which is a scalable, fully-managed graph database, document store, and search engine in one place.

If you are interested in reading more articles on graph machine learning and its intersection with graph databases, you can follow me here on Medium or on LinkedIn.

Acknowledgments

I would like to thank the whole ML team of ArangoDB for providing me with valuable feedback on this blog.

Learning Material to Get Started

CS224W: Machine Learning with Graphs (most recommended course to get started)
Graph Representation Learning Book by William L. Hamilton
Geometric Deep Learning
Graph Convolutional Networks
PyTorch Geometric
Deep Graph Library
ArangoDB Interactive Tutorials