A Journey into Graph ML Techniques

Filippo Minutella
LARUS
Jun 21, 2021

What is a graph?

Source: Wikipedia

A graph is a data structure composed of a set of nodes and a set of edges (mathematically G = (V, E), where G is the graph, V the vertices and E the edges). This representation makes the connections among data explicit; we’ll see in more detail what that means, but first let’s list the main kinds of graph:

  • Directed or undirected: relationships can be directed, meaning they have a direction (for example, when representing transactions between subjects, the direction of a relationship can be the one the money flows in), or undirected (for example, when modelling friendship in a social network, no direction is expressed).
  • K-partite graph: a graph can contain several types of nodes, with connections allowed only between nodes of different types. For example, when modelling an e-commerce site for recommendation we have 2 types of nodes, the buyers and the products, and relationships are allowed only between buyers and products (this particular case is called a bipartite graph).
  • Weighted graph: a graph where a single weight is associated with every relationship. For example, we can model a graph representing transactions among subjects by making the direction explicit and using the amount of the transaction as the weight.
  • Heterogeneous graphs: this is not a classification found in mathematics but one that comes from real-world graphs. We often face graphs that connect nodes of different types, whose number and distribution of features can differ widely between the nodes themselves; moreover, nodes can be connected by several types of relationships, each with its own attributes. For this kind of graph we will see that some GNNs were built precisely to improve performance in these situations, since they are very common: when modelling a system it is very easy to end up with nodes of several types and with various attributes.

There are many more classifications and types that take into consideration how nodes are connected, whether all nodes are connected or not, and so on; on Wikipedia (https://en.wikipedia.org/wiki/Graph_(discrete_mathematics)) you can find others, but for the reasoning we want to do and for the purpose of this article they are not relevant.

After these first notions on graphs, I would like to introduce some of the possible tasks on a graph (avoiding the standard examples such as IMDB, DBLP and ACM, which are very common in papers):

Car liability insurance

A graph of accidents with the subjects involved, the various insurance companies, the vehicles and the witnesses who saw the accident. It is clear from the very beginning that the graph is heterogeneous, but let’s add some details:

  • An accident is represented as a node with its own features; connected to it are the people involved in the accident, with several types of relationships representing how each subject is connected to it (for example, directly involved or a witness). The vehicles are also directly connected to the accident.
  • Subjects have their own features and connected vehicles; moreover, they are connected to the insurance company they hold a policy with.
  • Witnesses are connected to many accidents.

The goal is to classify accidents as fraudulent or non-fraudulent. It is clear that by utilizing the graph I can certainly reach better performance; also, just by looking at the graph and exploring it, I could spot with the naked eye people with strange behaviour, for example people involved in many accidents or who often appear as witnesses (we will come back to this consideration later).

The task is hence to classify accidents, which are nodes, and it is called Node Classification.

Product Recommendation

Let’s imagine having my customers’ purchases and the products that were bought: this is, as we said before, a bipartite graph, because the clients are connected only to products.

To make recommendations on what a customer could buy, I need to predict the future links that may form between a client and a product; this task is called Link Prediction.

Proteins Classification

Each protein can be represented as the set of its amino acids connected to one another, so the protein is a graph that needs to be classified; this task is called Graph Classification.
(For those who want more details, this dataset was introduced by the following paper: https://academic.oup.com/bioinformatics/article/21/suppl_1/i47/202991)

Before jumping into the techniques, let’s add a few more brief notions and considerations. The graph, as we have seen, is a structure that links nodes to one another; often these nodes have features, and in many applications the features are collected in a matrix X of shape N×d, where N is the number of nodes and d the number of features per node: every node is represented by a row and every attribute by a column.
Another very important matrix is the adjacency matrix: how can I mathematically represent a graph and its relationships? With a square N×N matrix, where N is the number of nodes; this matrix (in the case of unweighted graphs) contains 1 if two nodes are connected, otherwise 0. Let’s look at an example:

Some interesting properties of the adjacency matrix:

  • Usually it is a sparse matrix, meaning it has few 1s and many 0s. In the example, for space reasons, it is hard to notice, but imagine having thousands of nodes each connected to just a few others: it soon becomes clear that most of the values are 0.
  • It grows quadratically with the number of nodes (the matrix has size N×N).
  • The sum of a row tells me how many other nodes the corresponding node is connected to (this count is also called the Node Degree).
  • If I want to find which nodes I can reach in 2 hops (for example, from A I can reach C passing through B), what can I do? I can multiply the adjacency matrix by itself and I obtain:

Now the matrix counts the number of length-2 paths between nodes i and j. Let’s take A as an example, the first cell at the top left: there are 2 paths of length 2, going to B and back or going to D and back.

If, for example, we take the blue cell corresponding to D and C, it tells me there are 2 paths, and indeed they are D — B — C and D — E — C. This property extends to higher powers too: multiplying the matrix by itself once raises it to the 2nd power; multiplying it one more time raises it to the 3rd power and gives the number of length-3 paths among nodes.
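The 2-hop counting above can be tried directly in numpy. The adjacency values below assume the example graph has edges A-B, A-D, B-C, B-D, C-E and D-E, which is what the paths mentioned in the text suggest (the original figure is not available):

```python
import numpy as np

# Adjacency matrix of the example graph (nodes A, B, C, D, E), assuming
# edges A-B, A-D, B-C, B-D, C-E, D-E as the paths in the text suggest.
A = np.array([
    [0, 1, 0, 1, 0],  # A
    [1, 0, 1, 1, 0],  # B
    [0, 1, 0, 0, 1],  # C
    [1, 1, 0, 0, 1],  # D
    [0, 0, 1, 1, 0],  # E
])

A2 = A @ A  # entry (i, j) = number of length-2 paths between nodes i and j

print(A2[0, 0])  # from A back to A: A-B-A and A-D-A -> 2
print(A2[3, 2])  # from D to C: D-B-C and D-E-C -> 2
```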

Let’s assume we have 3 features per node, so we have an X matrix like this:

If I multiplied the adjacency matrix by the matrix above, I would obtain:

What does this matrix mean? It is nothing more than the sum of the neighbours’ features for every node; for example, A = [2 0 1], which is the sum of the rows of B and D.
The neighbours of a node are the nodes it is connected to.
(To try this procedure you can use numpy.dot(Adjacency, X).)
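As a concrete sketch of this aggregation: the actual X values were shown in an image that is not available, so the rows below are made up, chosen only so that A’s aggregated row comes out [2 0 1] as stated in the text:

```python
import numpy as np

# Example adjacency (edges A-B, A-D, B-C, B-D, C-E, D-E).
A = np.array([
    [0, 1, 0, 1, 0],
    [1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 1, 1, 0],
])

# Hypothetical 3-feature matrix X (values invented for illustration,
# picked so that A's aggregated row matches the article's [2 0 1]).
X = np.array([
    [1, 0, 0],  # A
    [1, 0, 1],  # B
    [0, 1, 0],  # C
    [1, 0, 0],  # D
    [0, 1, 1],  # E
])

agg = np.dot(A, X)  # row i = sum of the feature vectors of node i's neighbours
print(agg[0])       # A's neighbours are B and D: [1 0 1] + [1 0 0] = [2 0 1]
```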

The last two operations with the adjacency matrix are used in some approaches to work with the graph.

Now that we have some notions, we can get to the heart of the methodologies used, up to graph neural networks. Below, always trying to use practical examples, are some of the techniques, together with the insights behind them and the reference papers.

The problem is simple: I want to apply ML algorithms, for example to classify some nodes or to perform the other tasks we saw, by utilizing the graph. What can we do?

A first idea is to use the adjacency matrix. For example, if I wanted to do Link Prediction or Node Similarity in a naive way, I could take two rows, which represent which nodes each is connected to, and calculate a score between them, for example using cosine similarity (or counting the neighbours in common). Clearly this does not use information coming from more distant nodes and is not very informative; moreover, the adjacency matrix grows with the number of nodes, which is not a good property if, for example, we want to recalculate the scores as nodes are added.
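A minimal sketch of this naive similarity, reusing the same example adjacency matrix as before:

```python
import numpy as np

# Rows of the adjacency matrix describe who each node is connected to,
# so comparing two rows gives a naive similarity score between two nodes.
# Example adjacency (edges A-B, A-D, B-C, B-D, C-E, D-E).
A = np.array([
    [0, 1, 0, 1, 0],
    [1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 1, 1, 0],
], dtype=float)

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

score = cosine_similarity(A[0], A[2])  # similarity between A and C
common = int(np.dot(A[0], A[2]))       # neighbours in common (here: node B)
```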

Feature engineering using the graph structure

An approach, used more in the past than today, was to calculate various indicators from the graph and add them to the node features. For example, from the graph I could extract this information:

  • Node Degree: the number of incoming and/or outgoing relationships.
  • Centrality: how central a node is to the graph, for example how many paths between pairs of nodes pass through it (the more a node is crossed to reach other nodes, the more central it is).
  • Community Detection: understand which communities exist inside the graph and add the community a node belongs to as a feature. A community is a set of nodes strongly connected to the nodes inside the community and less connected to external nodes.

There are many techniques and methodologies to extract indicators from a graph, and the list above is not exhaustive; for a clearer list, and a direct way to apply them, the Neo4j database can be a great starting point (https://neo4j.com/docs/graph-data-science/current/algorithms/).
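As a small illustration of these three indicators, here is a sketch using networkx instead of Neo4j, on the example graph used earlier (a toy script, not a production pipeline):

```python
import networkx as nx

# The example graph from earlier (edges A-B, A-D, B-C, B-D, C-E, D-E).
G = nx.Graph()
G.add_edges_from([("A", "B"), ("A", "D"), ("B", "C"),
                  ("B", "D"), ("C", "E"), ("D", "E")])

degree = dict(G.degree())                   # Node Degree
betweenness = nx.betweenness_centrality(G)  # Centrality: shortest paths through a node
communities = list(nx.algorithms.community.greedy_modularity_communities(G))

# Attach the indicators to each node as extra features:
# [degree, centrality, id of the community the node belongs to].
node_features = {
    n: [degree[n], betweenness[n],
        next(i for i, c in enumerate(communities) if n in c)]
    for n in G.nodes
}
```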

To apply these algorithms it is worth making one remark: instead of applying them to the whole graph, I could apply them to some graph projections. A projection is a simplification of the graph itself. Let’s take the liability insurance case as an example: counting the degree of accidents on the whole graph as it is (heterogeneous) would give me the people linked to the accident, plus the expert (or experts, if there is more than one), plus the vehicles involved, and having this information aggregated may not be useful. Hence I could consider filtering the node types before calculating the degree, or do something even more elaborate, that is, create a projection where accidents are linked directly: I take only the accidents and insert an edge between two accidents if a subject was involved in both. The new graph is a projection of the first, simpler but one from which I can extract more relevant information.

Projections are reused in other contexts too, and in the real world they can be necessary to make the problem manageable or to study the graph from a specific point of view.
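A minimal sketch of such a projection, using networkx’s bipartite utilities; the accident and subject names below are hypothetical:

```python
import networkx as nx
from networkx.algorithms import bipartite

# Hypothetical accident/subject bipartite graph.
B = nx.Graph()
accidents = ["acc1", "acc2", "acc3"]
B.add_nodes_from(accidents, bipartite=0)
B.add_nodes_from(["alice", "bob", "carol"], bipartite=1)
B.add_edges_from([("acc1", "alice"), ("acc1", "bob"),
                  ("acc2", "bob"), ("acc3", "carol")])

# Projection: two accidents become directly linked if and only if
# they share at least one subject.
P = bipartite.projected_graph(B, accidents)
print(sorted(P.edges()))  # acc1 and acc2 share bob
```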

This method is somewhat reminiscent of the path taken by computer vision: earlier approaches tried to extract from images some “significant” points or a set of indicators, and then reached the goal with data science algorithms. It is evident that in computer vision the end-to-end learning offered by CNNs (which are able to perform feature selection autonomously) has overtaken the “old” approach in many applications.

Extract embeddings

(GNNs could also be covered in this chapter, but I preferred to set them aside.) The idea of these methods is to utilize the network structure to learn dense vectors (embeddings), that is, informative vectors of fixed size.

Here we have methods such as DeepWalk (https://arxiv.org/pdf/1403.6652.pdf) and Node2Vec (https://arxiv.org/pdf/1607.00653.pdf); Node2Vec can be seen as an evolution of DeepWalk. They reuse the concept of SkipGram: some random walks are defined on the graph, where starting from a node I randomly move to another one (how long each walk is and how many walks to start from the same node are algorithm parameters).

Let’s take an example:

By starting from A and choosing a maximum path length of 4, we could have:

A — B — C — E

A — D — A — D

A — B — D — E

And so on.

Once the desired number of random walks is completed, every random walk can be seen as a sentence, with the nodes as its words.

The idea is in fact to start with a random dense vector for every node and optimize them so that certain properties hold: usually you want the scalar product between 2 nodes belonging to the same random walk to be higher than the one between nodes belonging to different random walks.
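A minimal DeepWalk-style sketch of the walk-generation step; the parameter names are illustrative (the real algorithms expose equivalent hyperparameters), and the adjacency lists encode the example graph above:

```python
import random

# The example graph as adjacency lists (edges A-B, A-D, B-C, B-D, C-E, D-E).
graph = {
    "A": ["B", "D"], "B": ["A", "C", "D"], "C": ["B", "E"],
    "D": ["A", "B", "E"], "E": ["C", "D"],
}

def random_walks(graph, walks_per_node=10, walk_length=4, seed=42):
    """Generate uniform random walks: from each node, repeatedly hop
    to a random neighbour until the walk reaches walk_length nodes."""
    rng = random.Random(seed)
    walks = []
    for start in graph:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                walk.append(rng.choice(graph[walk[-1]]))
            walks.append(walk)
    return walks

walks = random_walks(graph)
# Each walk is then treated as a "sentence" of node "words" and fed to a
# SkipGram model, e.g. gensim's Word2Vec(walks, vector_size=64, sg=1).
```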

Another example of an algorithm to generate these embeddings is FastRP (https://arxiv.org/pdf/1908.11512.pdf). This algorithm takes a completely different approach from the first two: it is based on creating an information matrix and reducing its dimensionality. The matrix is created starting from the adjacency matrix, but it is built so that it becomes more informative, meaning, for example, that it can contain information even about distant nodes (we have seen that multiplying the adjacency matrix by itself has an interesting property, for example), and the dimensionality reduction does not follow standard approaches such as SVD but is done precisely with Random Projection. FastRP’s mathematical details go deeper than this, but there is one important thing to notice: by not having a learning phase and only performing dimensionality reduction, the algorithm is able to extract useful embeddings incredibly faster than DeepWalk.
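A rough numpy sketch of the FastRP idea. It is heavily simplified: the paper uses sparse random projections and specific degree-based normalization, while here plain Gaussian projections and simple row normalization are used, only to show the “no learning, just propagation and projection” structure:

```python
import numpy as np

rng = np.random.default_rng(0)

# Example adjacency (edges A-B, A-D, B-C, B-D, C-E, D-E).
A = np.array([
    [0, 1, 0, 1, 0],
    [1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 1, 1, 0],
], dtype=float)

A_norm = A / A.sum(axis=1, keepdims=True)  # row-normalized adjacency

d = 8                                # embedding dimensionality
R = rng.normal(size=(A.shape[0], d))  # random projection matrix (no learning)

# Mix successive powers of the normalized adjacency so that more distant
# nodes also contribute to a node's embedding (weights are hyperparameters).
weights = [0.0, 1.0, 1.0]
M = np.zeros((A.shape[0], d))
P = R
for w in weights:
    P = A_norm @ P  # one more propagation step (next power)
    M += w * P

# L2-normalize the rows to obtain the final embeddings.
embeddings = M / np.linalg.norm(M, axis=1, keepdims=True)
```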

There are many other methods to do this; here I wanted to present 2 “extremes”. Before moving on to the next section, we need to note 2 things regarding these algorithms:

  • They don’t use node attributes.
  • To manage heterogeneous graphs we need to make some additional considerations.

Graph Neural Network

Finally, here we are. “Finally” because the article was originally meant to be only about Graph Neural Networks, but then I saw that there are plenty of articles on them and I realized it would have been yet another one, perhaps a useless one (because there are some really good ones).

The idea behind GNNs is quite simple and consists of 3 steps:

  • The first is to understand how to “aggregate” the information of the node in question and of the ones nearby. We have seen above an example of how to perform feature aggregation with an X matrix and the adjacency matrix, so one idea is to sum the features of the neighbours.
  • Once the features of the neighbours and of the node in question are aggregated, understand how to combine them to create new node embeddings.
  • Do it iteratively: at the first step I aggregate the information of my direct neighbours plus my own; at the second step I again take the information of my neighbours plus mine, but now my neighbours’ information also contains information about their own neighbours, so the more we iterate, the more distant the information a node is able to collect.
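The three steps above can be sketched as a minimal GCN-like layer in numpy. Randomly initialized weights stand in for learned ones; this is an illustration of the framework, not any specific published architecture:

```python
import numpy as np

rng = np.random.default_rng(1)

# Example adjacency (edges A-B, A-D, B-C, B-D, C-E, D-E) and 3 node features.
A = np.array([
    [0, 1, 0, 1, 0],
    [1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 1, 1, 0],
], dtype=float)
X = rng.normal(size=(5, 3))

# Step 1: aggregate "my neighbours plus myself" -> add self-loops, then average.
A_hat = A + np.eye(A.shape[0])
A_hat = A_hat / A_hat.sum(axis=1, keepdims=True)

# Step 2: combine the aggregated features with a weight matrix (randomly
# initialized here; in practice it would be learned by gradient descent).
def layer(H, W):
    return np.maximum(A_hat @ H @ W, 0.0)  # aggregate, combine, ReLU

# Step 3: iterate -> each extra layer lets information travel one hop further.
H1 = layer(X, rng.normal(size=(3, 4)))   # 1-hop information
H2 = layer(H1, rng.normal(size=(4, 4)))  # 2-hop information
```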

With the above framework in mind, many GNNs differ in how they perform the aggregation steps. An article that I really liked for clarifying these steps is the following: https://tkipf.github.io/graph-convolutional-networks/

Compared to the methods above, Graph Neural Networks are able to “use” the attributes as well, but we have not yet solved the problem of heterogeneous graphs, which can be tackled with many solutions.
For example, HAN (https://arxiv.org/pdf/1903.07293.pdf) is based on the concept of the meta path, where a meta path is a projection that connects nodes of the same type. For example, in the IMDB dataset we want to classify movies; from one movie I can reach another by directly connecting movies that have at least one actor in common (metapath 1), or by directly connecting movies that have the director in common (metapath 2); a “normal” GNN is then run on these metapaths.
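The metapath projection can be sketched with a hypothetical movie/actor incidence matrix: multiplying it by its transpose counts shared actors, producing the movie-to-movie adjacency of the “movie-actor-movie” metapath:

```python
import numpy as np

# Hypothetical movie-actor incidence matrix: rows = 3 movies, cols = 3 actors.
M = np.array([
    [1, 1, 0],  # movie 0 features actors 0 and 1
    [0, 1, 1],  # movie 1 features actors 1 and 2
    [0, 0, 1],  # movie 2 features actor 2 only
])

# M @ M.T counts the actors two movies share; thresholding it yields the
# movie-movie adjacency of the "movie-actor-movie" metapath.
shared = M @ M.T
np.fill_diagonal(shared, 0)              # drop self-connections
metapath_adj = (shared > 0).astype(int)  # link movies with >= 1 common actor
```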

For heterogeneous graphs there are many other GNNs available. At Larus, together with our partner Fujitsu, we use an insight data platform we built called Galileo XAI, and to extract the most from data we use Fujitsu’s Deep Tensor solution (if you want to know more: https://galileox.ai/).

Conclusion

This article did not mean to go into the details of any algorithm or methodology; it wanted to highlight the potential and possible applications of graphs, some ways to work with them, and give a hint about the hot topic of GNNs, trying to illustrate it with examples of real cases and problems that arise in the real world. If you liked it, clap! For any doubt, clarification, or even just to exchange a couple of words (I enjoy talking about this :)), you can write in the comments or reach me by email: filippo.minutella@larus-ba.it
