Detour: Graph Analysis [1]

Simon Tse
Learn about Cancer with Code
Mar 29, 2023
Credit: https://en.wikipedia.org/wiki/Semantic_network

Background

In the last post, I covered the steps to convert a sentence into a graph. In this post, I will cover how to analyse the graph in order to extract meaningful relationships.

Approach

Since I have converted a sentence into a directed graph (see the example below), I need some way to turn the graph into a format that can be digested by a machine learning algorithm.

The first approach employs a random walk [1]. There are two options I can choose from: a pure random walk, in which the probability of moving to each possible next state is roughly equal, and a weighted random walk, in which the probabilities of jumping from the current state to each potential new state are loaded with weights. The output from the random walks can then be used to construct a correlation matrix whose conditional probabilities push/pull each token closer to the tokens it co-occurs with most often.
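The difference between the two options boils down to how the next state is drawn. A minimal sketch (the neighbour lists and weights here are illustrative, not taken from the actual sentence graph):

```python
import random

def next_state_uniform(neighbours):
    """Pure random walk: every neighbour is equally likely."""
    return random.choice(neighbours)

def next_state_weighted(neighbours, weights):
    """Weighted random walk: jump probabilities are loaded by weight."""
    return random.choices(neighbours, weights=weights, k=1)[0]

# A node with three neighbours; the weighted walk favours 'G2/M'.
step1 = next_state_uniform(["G1", "G2/M", "arrest"])
step2 = next_state_weighted(["G1", "G2/M", "arrest"], weights=[1, 5, 1])
```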

The second algorithm that comes to mind is word2vec [1]. It was created, patented, and published in 2013 by a team of researchers at Google. Since then, the model has been extended to other uses [2] whenever one wants to convert discrete items into vector embeddings that reside in the real-valued domain. The embeddings of each token can then be used for further analysis.

In this post, I will focus on the random walk.

A. Graph Construction

In order to construct a random walk generator, I will make use of the spanner function to tokenise the sentence. For the details of the spanner function, you can refer to the post below.

And you can refer to the post below for the graph construction function.

I am using the following sentence to demonstrate the construction.

In contrast, the product of the human CDKN2A beta transcript, p14ARF, activates a p53 response manifest in elevated levels of MDM2 and p21CIP1 and cell cycle arrest in both G1 and G2/M.

You will see that the special character ‘/’ is used. It requires special care because the default tokeniser will split ‘G2/M’ into ‘G2’, ‘/’ and ‘M’, which does not acknowledge the biomedical meaning of the term.
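To make the problem concrete, here is a toy illustration (not the spanner function itself) of how a naive tokeniser breaks ‘G2/M’ apart, and how protecting known multi-part terms before splitting keeps them whole:

```python
import re

SENTENCE = "cell cycle arrest in both G1 and G2/M"

# A naive tokeniser splits on word/punctuation boundaries,
# breaking the biomedical term 'G2/M' into three pieces.
naive = re.findall(r"\w+|[^\w\s]", SENTENCE)

# Matching a list of protected terms first keeps them intact.
PROTECTED = ["G2/M"]
pattern = "|".join(map(re.escape, PROTECTED)) + r"|\w+|[^\w\s]"
protected = re.findall(pattern, SENTENCE)
```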

When I use the spanner function to tokenise the sentence, I get the following output.

Created by author

It’s pretty straightforward: changes are made only to those entities picked up by the biomedical NER processor, and the overall sentence stays close to the raw tokeniser version. Using this set of tokens to construct a graph looks like the figure below.

Created by author

There is not much to tell from this graph because every node is plotted according to how many connections it has. Sparsely connected nodes are pushed to the periphery while densely connected ones are placed closer to the centre. So this network is plotted against connectivity, and that is just a direct result of how I construct my graph from the dependency parse and the linear chain!
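The dependency-plus-linear-chain construction can be sketched with networkx. The dependency arcs below are a hypothetical fragment for part of the example sentence, stated as plain tuples to keep the sketch self-contained (in the posts they come from the parser):

```python
import networkx as nx

# Hypothetical dependency arcs (head -> child) for a sentence fragment.
dep_edges = [("activates", "product"), ("activates", "response"),
             ("response", "p53"), ("product", "p14ARF")]
tokens = ["product", "p14ARF", "activates", "p53", "response"]

G = nx.DiGraph()
G.add_edges_from(dep_edges)                  # dependency arcs
G.add_edges_from(zip(tokens, tokens[1:]))    # linear chain (word order)

# Spring layouts place highly connected nodes centrally, which is why
# the raw plot mostly reflects connectivity.
most_connected = max(dict(G.degree), key=dict(G.degree).get)
```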

What I want to know is how often one token co-occurs with another. It’s the frequency that matters to the analysis.

And that’s where the random walk comes into the picture!

B. Random Walk

I have written the following helper functions to generate random walks for this purpose.
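The embedded code may not render here, so below is a minimal sketch of what the two key helpers might look like. The function names match the ones discussed in this post, but the bodies are my reconstruction from the description (default weights of 1.0 and the dead-end restart are assumptions):

```python
import random
import networkx as nx

def get_Weighted_Neighbours(G, node):
    """Transition probabilities from `node` to each neighbour:
    node weight x neighbour weight x edge weight, normalised to sum to 1."""
    nbrs = list(G.neighbors(node))
    w_u = [G.nodes[node].get("weight", 1.0)] * len(nbrs)        # current node
    w_v = [G.nodes[v].get("weight", 1.0) for v in nbrs]         # neighbours
    w_uv = [G.edges[node, v].get("weight", 1.0) for v in nbrs]  # edges
    raw = [a * b * c for a, b, c in zip(w_u, w_v, w_uv)]        # elementwise product
    total = sum(raw)
    return nbrs, [r / total for r in raw]

def get_random_walk(G, start, n_steps=50):
    """Generate one weighted random walk of length n_steps from `start`."""
    walk = [start]
    for _ in range(n_steps):
        nbrs, probs = get_Weighted_Neighbours(G, walk[-1])
        if not nbrs:                 # dead end: restart from the initial node
            walk.append(start)
            continue
        walk.append(random.choices(nbrs, weights=probs, k=1)[0])
    return walk
```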

There are quite a number of helper functions here, and some need explanation.

The first function, ‘get_Weighted_Neighbours’, calculates the transition probability from the current node to its neighbours, and it is the cornerstone of my random walk algorithm. This function captures the following calculations.

Created by author

Equation 1 is just the weight of the current node, populated into a vector whose length equals the number of neighbours of the current node. Equation 2 is the weights of the neighbouring nodes of the current node. Equation 3 is the edge weights between the current node and its neighbouring nodes. Equation 4 is the Hadamard product [4] of all the weight vectors. Equation 6 is the transition probability from the current node to any one of its neighbouring nodes.
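Since the equation image may not render here, the calculations just described can be written out as follows. This is my reconstruction, with $u$ the current node, $N(u) = \{v_1, \dots, v_k\}$ its neighbours and $w$ the node/edge weights; the numbering follows the text above, which skips from Equation 4 to Equation 6:

```latex
\begin{align}
\mathbf{w}_u &= (w_u, \dots, w_u) \in \mathbb{R}^k \tag{1} \\
\mathbf{w}_N &= (w_{v_1}, \dots, w_{v_k}) \tag{2} \\
\mathbf{w}_E &= (w_{u v_1}, \dots, w_{u v_k}) \tag{3} \\
\mathbf{r} &= \mathbf{w}_u \odot \mathbf{w}_N \odot \mathbf{w}_E \tag{4} \\
P(v_i \mid u) &= \frac{r_i}{\sum_{j=1}^{k} r_j} \tag{6}
\end{align}
```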

The reason I built this function is that I will run another simulation with different weights assigned to nodes and edges. I want to find out how that changes the output of the random walk.

For the function ‘get_random_walk’, one parameter is worth some explanation. The parameter ‘n_steps’ controls the length of each random walk. I experimented with different numbers, and the observation is that the number of steps cannot be too small.

After some trial and error, I observed that the number of steps has to be at least twice the number of distinct tokens in the sentence. I will cover how I came to this conclusion in a separate post.
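The rule of thumb is simple enough to state as a one-liner (the helper name is my own):

```python
def min_walk_steps(tokens):
    """Heuristic from the experiments above: each walk should be at
    least twice as long as the number of distinct tokens."""
    return 2 * len(set(tokens))

# e.g. a sentence with 25 distinct tokens needs walks of at least 50 steps
```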

I will generate 100 random walks per token, with each walk having 50 steps. This generates over 2,000 rows of data with tokens as column headings. I have the following code snippet to collate the data collected from this process and generate another network.

Created by author
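As the snippet is embedded as an image, here is a minimal sketch of what ‘generateCORRnetwork’ might look like: collate the walks into per-walk token counts, correlate the columns with pandas, and keep strongly correlated token pairs as edges. The internals and the `threshold` parameter are my own illustrative assumptions:

```python
import pandas as pd
import networkx as nx

def generateCORRnetwork(walks, tokens, threshold=0.3):
    """Collate random walks into token-count rows, correlate the
    columns, and keep only strongly correlated token pairs as edges."""
    # One row per walk: how often each token appears in that walk.
    rows = [{t: walk.count(t) for t in tokens} for walk in walks]
    df = pd.DataFrame(rows, columns=tokens)
    corr = df.corr().fillna(0.0)

    G = nx.Graph()
    G.add_nodes_from(tokens)
    for i, u in enumerate(tokens):
        for v in tokens[i + 1:]:
            if corr.loc[u, v] > threshold:
                G.add_edge(u, v, weight=corr.loc[u, v])
    return G
```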

Running on the original graph generated by spanner, the ‘generateCORRnetwork’ function gives the following new network.

Created by author

We obtain another graph of the sentence representation, this time based on the correlation matrix. You will find it very different from the one generated from the tokenisation output alone. Tokens are ‘pushed’ into two distinct communities, with two critical nodes connecting them.

My hypothesis is that the dependency parsing result actually segregates the sentence into distinct segments that are self-consistent in terms of relatedness.
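One way to make the two-community observation concrete is to run a modularity-based community detection algorithm over the network. The graph below is a small stand-in mimicking two dense clusters joined by a bridge, not the actual sentence network:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Stand-in graph: two dense clusters joined by a short bridge,
# mimicking the 'activates'/'MDM2' link between the two halves.
G = nx.Graph()
G.add_edges_from([("p14ARF", "product"), ("product", "transcript"),
                  ("transcript", "human"), ("p14ARF", "transcript")])
G.add_edges_from([("p53", "response"), ("response", "arrest"),
                  ("arrest", "G1"), ("G1", "G2/M"), ("p53", "arrest")])
G.add_edges_from([("p14ARF", "activates"), ("activates", "MDM2"),
                  ("MDM2", "p53")])

communities = list(greedy_modularity_communities(G))
print(len(communities))  # number of detected communities
```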

Let’s take a look at the cluster in the upper half of the graph. It contains keywords like ‘p53’, ‘G1’, ‘G2/M’, ‘p21CIP1’, ‘cell’, ‘cycle’, ‘arrest’, ‘elevated’, ‘levels’, ‘response’, ‘manifest’, and other connecting words. Basically, it’s the second half of the sentence.

In contrast, the product of the human CDKN2A beta transcript, p14ARF, activates a p53 response manifest in elevated levels of MDM2 and p21CIP1 and cell cycle arrest in both G1 and G2/M.

The cluster in the lower half of the graph contains keywords like ‘p14ARF’, ‘CDKN2A beta’, ‘human’, ‘transcript’, ‘product’ and other connecting words. It is the first part of the sentence.

In contrast, the product of the human CDKN2A beta transcript, p14ARF, activates a p53 response manifest in elevated levels of MDM2 and p21CIP1 and cell cycle arrest in both G1 and G2/M.

And the two communities are connected by ‘activates’ and ‘MDM2’.

In contrast, the product of the human CDKN2A beta transcript, p14ARF, activates a p53 response manifest in elevated levels of MDM2 and p21CIP1 and cell cycle arrest in both G1 and G2/M.

It looks like, after this ‘enhancement’ step, the sentence is properly segregated into Subject (i.e. the product of the human CDKN2A beta transcript, p14ARF), Verb (i.e. activates) and Object (i.e. a p53 response manifest in elevated levels of MDM2 and p21CIP1 and cell cycle arrest in both G1 and G2/M), from which a clear interaction/relationship is extracted with specific conditions spelled out (e.g. in elevated levels of MDM2 and p21CIP1).

Intermission

Now we have found a way to analyse each sentence. In the next post, I will explore how reliable this approach is when the sentence is simpler or more complex, and whether the performance stays steady and consistent when the parameters of the functions are changed.

Stay tuned. Ciao.
