Detour: Graph Analysis [2]

Simon Tse
Learn about Cancer with Code
Apr 12, 2023
Credit: https://en.wikipedia.org/wiki/Semantic_network

Background

In the last post, I covered my approach to converting a sentence into a network, and used random walks to generate a new network that reflects the connectivity of each token based on their relative positions in the sentence.

In this post, I am going to step back a little and review how different parameters in my functions change their output.

Setup

In the last post, I introduced a function named “generateCORRnetwork”, which I used to obtain the correlation matrix among the tokens. I have updated it to expose the two hidden parameters that control the random walk. Based on the sample sentence below, I generated the following correlation network.

In contrast, the product of the human CDKN2A beta transcript, p14ARF, activates a p53 response manifest in elevated levels of MDM2 and p21CIP1 and cell cycle arrest in both G1 and G2/M.
(Figures: correlation network generated from the sample sentence. Created by author.)

“Loops” represents the number of simulations I run for each token. “Steps” represents the number of steps taken in each walk. I have turned both into input arguments of the function to see how they affect the resulting correlation graph.
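The post does not show the updated function body, so here is a minimal sketch of what the exposed signature might look like. The internals (unweighted walks whose per-walk visit counts are correlated token against token) are my reading of the previous post, not the author’s verbatim code, and the default for `steps` anticipates the finding in section A below.

```python
import numpy as np
import networkx as nx

def generateCORRnetwork(G, loops=100, steps=None):
    """Sketch: run unweighted random walks on the token graph G and
    correlate per-walk visit counts to build the correlation network.
    loops = walks started from each token; steps = hops per walk."""
    tokens = list(G.nodes())
    index = {t: i for i, t in enumerate(tokens)}
    if steps is None:
        steps = 2 * len(tokens)  # heuristic justified in section A
    rng = np.random.default_rng()
    visits = np.zeros((loops * len(tokens), len(tokens)))
    row = 0
    for start in tokens:
        for _ in range(loops):
            node = start
            for _ in range(steps):
                visits[row, index[node]] += 1
                node = rng.choice(list(G.neighbors(node)))
            row += 1
    corr = np.corrcoef(visits, rowvar=False)  # token-by-token correlation
    corr_G = nx.from_numpy_array(np.nan_to_num(corr))
    return nx.relabel_nodes(corr_G, dict(enumerate(tokens)))
```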

A. Steps

The first experiment looks at how the number of steps changes the result. I ran the following script to see how the eigenvector_centrality [1], a centrality measure that scores a node based on the centrality of its neighbors, changes with longer walks. I picked some tokens in the sample sentence to gauge the impact. I increased the loops to 1,000 from the default 100 because I run the steps from 10 to 100 in increments of 10. At the extreme, I still maintain 220 samples per token when the walk length is 100 (i.e. 1,000 walks x 22 tokens / 100 steps = 220 walks/token). That should provide enough statistical power to discern any pattern.

I ran the following script to collect the data. It took 2h 25min 42s to run on an M1 Ultra machine.

(Screenshot: data-collection script. Created by author.)
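The script itself is only available as a screenshot, so the following reconstruction is a guess based on the prose; `sentence_graph` (the token network from the last post) and the data layout are assumptions on my part.

```python
import networkx as nx
import pandas as pd

# Hedged reconstruction of the sweep over walk lengths; `sentence_graph`
# is the token network from the last post and is assumed here.
records = []
for steps in range(10, 101, 10):  # walk lengths 10, 20, ..., 100
    corr_G = generateCORRnetwork(sentence_graph, loops=1000, steps=steps)
    centrality = nx.eigenvector_centrality(corr_G, max_iter=1000)
    for token, score in centrality.items():
        records.append({"steps": steps, "token": token, "score": score})

df = pd.DataFrame(records)  # one row per (steps, token), ready to plot
```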

Based on the data collected, I generated the following graphs for six tokens of interest, namely “activates”, “MDM2”, “p53”, “p21CIP1”, “CDKN2A” and “p14ARF”.

(Figures: eigenvector centrality vs. walk length for the six tokens. Created by author.)

From the above six graphs, you will notice a clear trend: the centrality score of each token settles into a tight range once the walk exceeds 50 steps. Since this particular sample sentence has 22 tokens, 50 steps is just over twice the number of tokens. This is why I set the number of steps to at least twice the number of tokens, to take advantage of the convergence.

A side note: I am surprised to see that the token “p21CIP1” has such a high centrality score compared with the other tokens. When I look at the correlation graph again, I notice it sits in the largest cluster.

In contrast, the product of the human CDKN2A beta transcript, p14ARF, activates a p53 response manifest in elevated levels of MDM2 and p21CIP1 and cell cycle arrest in both G1 and G2/M.

This becomes obvious when you notice that the noun phrase serving as the object of the sentence (i.e. elevated levels of MDM2 and p21CIP1 and cell cycle arrest in both G1 and G2/M) is much longer than the noun phrase serving as the subject (i.e. the product of the human CDKN2A beta transcript). The connections within this longer noun phrase boost the centrality of its tokens.

Hypothesis: an unweighted random walk on a sentence may be able to segregate tokens into clusters, each comprising a self-contained, self-consistent noun phrase.
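One way to probe this hypothesis (my suggestion, not something the post does) is to run community detection on the correlation network and check whether the clusters line up with noun-phrase boundaries; negative-correlation edges are dropped first, since modularity expects nonnegative weights.

```python
from networkx.algorithms.community import greedy_modularity_communities

# `corr_G` is the correlation network from the sketch above; keep only
# positively correlated token pairs before clustering.
positive = [(u, v) for u, v, w in corr_G.edges(data="weight") if w > 0]
H = corr_G.edge_subgraph(positive)
for i, cluster in enumerate(greedy_modularity_communities(H, weight="weight")):
    print(f"cluster {i}: {sorted(cluster)}")
```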

B. Loops

The next experiment is on the number of loops for a given number of steps. I fix the walk length at 50, as established above, and vary the number of walks per token. The heuristic suggests the centrality measure should converge and settle down once enough samples (i.e. walks) have been collected.
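As with section A, the loop-sweep script is not shown in text; here is a sketch under the same assumptions, with the walk length pinned at 50 and the range of loop counts guessed from the graphs.

```python
import networkx as nx
import pandas as pd

# Walk length fixed at 50; only the number of walks per token varies.
records = []
for loops in range(100, 1001, 100):  # assumed sweep: 100, 200, ..., 1000
    corr_G = generateCORRnetwork(sentence_graph, loops=loops, steps=50)
    centrality = nx.eigenvector_centrality(corr_G, max_iter=1000)
    for token, score in centrality.items():
        records.append({"loops": loops, "token": token, "score": score})
df = pd.DataFrame(records)
```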

(Figures: eigenvector centrality vs. number of walks for the tokens of interest. Created by author.)

From the above graphs, it is clear that the number of walks does not affect the centrality score much. Most tokens have a very stable measure and stay within a tight range irrespective of the number of walks in the simulation. I conclude that as long as the number of walks is over 100, the centrality measure converges to a tight range.

Intermission

This post studied the impact of the two parameters controlling the generation of random walks, and it is clear that walk length controls how quickly the centrality measure converges.

In the next post, I will study how weight assignment changes the correlation network.

Stay tuned!
