
Analyzing Disease Co-occurrence Using NetworkX, Gephi, and Node2Vec

Analyzing Electronic Health Records (EHR) of ICU patients and developing machine learning models

Jinhang Jiang
Analytics Vidhya
May 5, 2020


This analysis is part of a project focusing on analyzing Electronic Health Records (EHR) of ICU patients and developing machine learning models for the early prediction of diseases. In this article, we show how to create a network of diseases from EHR records and generate network embeddings from the adjacency matrix or an edge list of the disease network. We use Python, R, and Gephi, with Node2Vec, NetworkX, and K-means for the analysis, and RStudio, Spyder, and Jupyter Notebook as IDEs.

(Some minor changes made on May 4th, 2021 due to the most recent updates in node2vec)

Preview of the Dataset

The raw data contain 2,710,672 patient visit records with 3,933 unique diagnoses. Out of these observations, 2,193,860 rows were labeled with valid icd10 codes.

ICD10 (https://icdcodelookup.com/icd-10/codes) is the 10th revision of the International Statistical Classification of Diseases and Related Health Problems (ICD), a medical classification list by the World Health Organization (WHO), indicating diseases, signs and symptoms, abnormal findings, complaints, social circumstances, and external causes of injury or diseases.

These codes are input during or after a patient visit in hospital by hospital staff, nurses, and coders, and have multiple uses such as for documentation and filing medical insurance claims.

Codes:

Phase One — Prepare Data:

Firstly, we have to clean up the data and transform the table to a sample edge list, which looks like the following (Figure 1):

Figure 1 — The Sample Edge List

The “pid” column holds the patients’ IDs, and the “icd10” column holds the codes representing the diseases. One patient may have multiple diseases at the same time, and the same disease may be recorded for a patient more than once. This table gives all the diseases or conditions each patient had from the point of admission to the point of readmission, and then to the point of discharge.
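As a rough illustration of this cleaning step, the sketch below builds a pid/icd10 edge list from a tiny made-up table. The sample rows and the decision to drop repeated (pid, icd10) pairs are assumptions for illustration only; they are not the actual preprocessing used for this project.

```python
import pandas as pd

# Hypothetical raw visit records; real data would come from the EHR export.
raw = pd.DataFrame(
    {
        "pid": [1, 1, 1, 2, 2, 3],
        "icd10": ["A09", "A04.7", "A09", "A09", None, "I10"],
    }
)

# Keep only rows with a recorded code, then drop repeated (pid, icd10) pairs
# so each disease appears at most once per patient.
edge_list = (
    raw.dropna(subset=["icd10"])
    .drop_duplicates(subset=["pid", "icd10"])
    .reset_index(drop=True)
)

print(edge_list)
```

Dropping duplicates here is optional, because the adjacency-matrix step only records whether a patient ever had a code.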

Phase Two — Make the Adjacency Matrix:

An adjacency matrix is a square matrix that displays the potential co-existence relationship between each disease and all the others (including itself). Since the sample edge list contained 821 unique icd10 codes, we expected an 821*821 adjacency matrix.

The code to do so is provided below:

```
import os

import numpy as np
import pandas as pd

print(os.getcwd())
os.chdir('Your File Location')
print(os.getcwd())

diag = pd.read_csv('pid_icd10.csv')

## create the matrix: one row per pid, one 0/1 column per icd10 code
## (.groupby(level=0).max() replaces .max(level=0), which was removed in pandas 2.0;
##  dtype=int keeps the indicators numeric so the product below yields counts)
matrix = (
    pd.get_dummies(diag.set_index('pid')['icd10'].astype(str), dtype=int)
    .groupby(level=0)
    .max()
    .sort_index()
)

## transpose the matrix
diag_matrix = matrix.to_numpy()
diag_matrix_transpose = diag_matrix.transpose()

## multiply the matrices
final_matrix = diag_matrix_transpose.dot(diag_matrix)
network_table = pd.DataFrame(final_matrix)

## append index name (the dummy columns are the sorted unique codes)
icd10 = list(matrix.columns)
network_table.index = icd10
network_table.columns = icd10
```

After running the code, we got an 821*821 adjacency matrix (Figure 2):

Figure 2 — Partial View of The Adjacency Matrix

Here, the diagonal elements indicate the prevalence of the diseases in the dataset, i.e., in how many patient records (distinct pid values) a particular icd10 code appeared. The off-diagonal element E_xy indicates the co-occurrence of R_x and C_y, where R_x is the disease in the xth row and C_y is the disease in the yth column, counted across all patient records. For example, A09 (infectious gastroenteritis and colitis) and A04.7 (enterocolitis due to Clostridium difficile) have co-occurred in 29 patient visits in the dataset.
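To make the reading of the matrix concrete, here is a minimal sketch on a made-up edge list with three patients. The values are invented for illustration, and `pd.crosstab` is used here only as a compact alternative way to build the patient-by-disease indicator matrix.

```python
import pandas as pd

# Toy edge list (made-up values): patients 1 and 2 have both codes,
# patient 3 has only A09.
diag = pd.DataFrame(
    {
        "pid": [1, 1, 2, 2, 3],
        "icd10": ["A09", "A04.7", "A09", "A04.7", "A09"],
    }
)

# Patient-by-disease indicator matrix: 1 if the patient ever had the code.
indicator = pd.crosstab(diag["pid"], diag["icd10"]).clip(upper=1)

# Disease-by-disease co-occurrence counts: indicator^T . indicator
cooccurrence = indicator.T.dot(indicator)

print(cooccurrence)
```

In this toy matrix the diagonal entry for A09 is 3 (three patients had it), and the A09/A04.7 entry is 2 (two patients had both).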

More information about how to generate such disease networks is explained in this paper: https://ieeexplore.ieee.org/document/8194838.

Phase Three — Generate Node2Vec Features:

In this step, we apply Node2Vec (https://cs.stanford.edu/~jure/pubs/node2vec-kdd16.pdf) to generate node embeddings. The network is created from the adjacency matrix we generated in the previous step. This gives us two essential outputs: a dataset containing the records of the random walks, used for plotting, and a model that can predict a disease’s neighbors.

The code is provided below (make sure you have Gensim 4.0.0 or above installed):

```
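import networkx as nx
from node2vec import Node2Vec

## build the graph from the adjacency matrix (edge weights = co-occurrence counts)
graph = nx.from_pandas_adjacency(network_table)

## generate the random walks, then fit the embedding model
node2vec = Node2Vec(graph, dimensions=20, walk_length=5, num_walks=200, workers=4)
model = node2vec.fit(window=10, min_count=1)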
```

The output of the dataset looks like the following (Figure 3):

Figure 3 — The Output of Node2Vec
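For intuition about what a random-walk record is, the sketch below generates weighted random walks on a tiny made-up graph in plain Python. This is a simplification, not the node2vec package's implementation: it corresponds to the unbiased case where the return and in-out parameters are both 1, so the next node is chosen in proportion to edge weight. Apart from the A09/A04.7 count of 29 quoted above, the graph and its weights are invented.

```python
import random

# Toy weighted co-occurrence graph (mostly made-up weights), as an adjacency dict.
graph = {
    "A09": {"A04.7": 29, "K52.9": 5},
    "A04.7": {"A09": 29},
    "K52.9": {"A09": 5},
}


def random_walk(graph, start, walk_length, rng):
    """Return one weighted random walk containing up to `walk_length` nodes."""
    walk = [start]
    while len(walk) < walk_length:
        neighbors = graph[walk[-1]]
        if not neighbors:  # dead end: stop early
            break
        next_node = rng.choices(
            list(neighbors), weights=list(neighbors.values()), k=1
        )[0]
        walk.append(next_node)
    return walk


rng = random.Random(0)  # fixed seed so the walks are reproducible
walks = [
    random_walk(graph, node, walk_length=5, rng=rng)
    for node in graph
    for _ in range(2)  # two walks per node, for brevity
]

for walk in walks:
    print(" -> ".join(walk))
```

Each walk is then treated like a sentence of node names, which is what the embedding model is trained on.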

The model can be used to find the neighbors of a specific node (a disease) or the values of its embedding dimensions. It can also measure the similarity between nodes. Here are examples of its usage (Figure 4, Figure 5):

Figure 4 — Read the Value of a Specific Node
Figure 5 — Find the Similarity of a Node
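With a fitted Gensim model, lookups of this kind are typically written as `model.wv['A09']` for the vector and `model.wv.most_similar('A09')` for the nearest neighbors, which ranks nodes by cosine similarity. The sketch below reproduces that idea with NumPy on made-up three-dimensional vectors, so the numbers are illustrative and are not taken from the trained model.

```python
import numpy as np

# Made-up 3-dimensional embeddings; the article's vectors have 20 dimensions.
embeddings = {
    "A09": np.array([0.9, 0.1, 0.0]),
    "A04.7": np.array([0.8, 0.2, 0.1]),
    "I10": np.array([0.0, 0.1, 0.9]),
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def most_similar(node, embeddings):
    """Rank all other nodes by cosine similarity to `node`, highest first."""
    scores = [
        (other, cosine_similarity(embeddings[node], vec))
        for other, vec in embeddings.items()
        if other != node
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


print(embeddings["A09"])                # the node's vector
print(most_similar("A09", embeddings))  # its nearest neighbors
```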

To save the embedding to a CSV file:

```
# get node names in the same order as the embedding vectors
# (Gensim 4.x: model.wv.vocab was replaced by index_to_key / key_to_index)
node_names, vectors = model.wv.index_to_key, model.wv.vectors
# init dataframe using embedding vectors and set index as node name
node2vec_output = pd.DataFrame(vectors, index=node_names)
# write the embedding to disk (the file name is a placeholder)
node2vec_output.to_csv('node2vec_output.csv')
```

Notes for the parameters:

· The “graph” has to be a “networkx” graph. Node names must be all integers or all strings.

· As the paper “node2vec: Scalable Feature Learning for Networks” mentioned: “performance tends to saturate once the dimensions of the representations reach around 100.” Even though the default of dimensions is 128, we decided to use 20 instead. It should be large enough, given the number of nodes we have.

· The default for walk_length is 80, and it represents the number of nodes in each walk. num_walks is the number of walks per node, and its default is 10. According to the paper, increasing either parameter tends to improve performance.
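To get a feel for how much training data these settings produce, the short calculation below multiplies them out. It assumes every walk reaches its full length, which holds only if no node is isolated.

```python
# Settings taken from the Node2Vec call and the dataset description above.
num_nodes = 821    # unique icd10 codes in the sample edge list
num_walks = 200    # walks started from each node
walk_length = 5    # nodes per walk

total_walks = num_nodes * num_walks
total_node_visits = total_walks * walk_length

print(total_walks)        # 164200
print(total_node_visits)  # 821000
```

So the short walk length is offset by a large number of walks per node.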

Phase Four — Plot with K-means Clustering in R:

Grover and Leskovec (2016) cluster the learned feature representations using k-means, so we can do the same here: we feed the output from the previous step into the k-means clustering algorithm and plot the results in R.

First, I ran a quick loop to choose “k” based on the Within-cluster Sum of Squares (WSS). Figure 6 is the output:

Figure 6 — WSS Decreases as “k” Increases
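The loop itself was run in R and is not shown here. As a hedged Python sketch of the same idea, the code below runs a basic Lloyd's k-means written in NumPy for k from 1 to 8 and records the WSS for each. The data are made-up blobs standing in for the 20-dimensional embedding table, and `kmeans_wss` is a hypothetical helper, not a function from any library.

```python
import numpy as np


def kmeans_wss(X, k, n_iter=50, seed=0):
    """Run a basic Lloyd's k-means and return the within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    # Initialise centroids with k distinct rows of X.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared distance from every point to every centroid: shape (n, k).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float(dists.min(axis=1).sum())


# Stand-in for the embedding table: three made-up blobs in 20 dimensions.
rng = np.random.default_rng(42)
X = np.vstack(
    [rng.normal(loc=c, scale=0.5, size=(30, 20)) for c in (0.0, 3.0, 6.0)]
)

wss = {k: kmeans_wss(X, k) for k in range(1, 9)}
for k, value in wss.items():
    print(k, round(value, 1))
```

Plotting these values against k gives the kind of curve used for the elbow method.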

As expected, WSS kept decreasing as “k” increased, so a larger “k” always looks better on this measure. The elbow method did not help in this case, since I could not see an obvious elbow in the curve. “k=8” seemed like a reasonable starting point. However, when we plotted the clusters, the output was too cluttered because of the large number of nodes (Figure 7):

Figure 7 — The Clustering Plot When “k=8”

Then, I decided to use “k=4” instead. Figure 8 is the output:

Figure 8 — The Clustering Plot When “k=4”

Since Gephi is an elegant tool to plot network relationships, I also plotted results with Gephi (Figure 9) for comparison:

Figure 9 — Plot in Gephi

It looked like Gephi automatically assigned the nodes to three major clusters, which is close to the parameter (k=4) I set in R.
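The article does not show how the network was loaded into Gephi. One possible route, assumed here, is to turn the adjacency matrix into an edge-list CSV with Source, Target, and Weight columns, which Gephi's spreadsheet importer can read. The matrix below is a made-up stand-in for network_table, and the output file name is a placeholder.

```python
import numpy as np
import pandas as pd

# Toy co-occurrence matrix (made-up counts) standing in for network_table.
codes = ["A04.7", "A09", "I10"]
network_table = pd.DataFrame(
    [[40, 29, 0], [29, 120, 3], [0, 3, 75]], index=codes, columns=codes
)

# Keep each undirected pair once (upper triangle, diagonal excluded).
upper = np.triu(network_table.to_numpy(), k=1)
rows, cols = np.nonzero(upper)

edges = pd.DataFrame(
    {
        "Source": network_table.index[rows],
        "Target": network_table.columns[cols],
        "Weight": upper[rows, cols],
    }
)

# "gephi_edges.csv" is a placeholder file name.
edges.to_csv("gephi_edges.csv", index=False)
print(edges)
```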

For analysis purposes, I will go with K-means clustering this time because the algorithm will label the nodes with group numbers, which makes it easier to analyze.
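Once every node carries a group number, listing the diagnoses per cluster is a simple grouping step. The sketch below shows one way to do it in plain Python. The label assignments are entirely made up for illustration and are not the clustering results from this analysis.

```python
from collections import defaultdict

# Hypothetical k-means output: one group number per icd10 code.
cluster_labels = {
    "A09": 2,
    "A04.7": 2,
    "T40.1X1A": 3,
    "I25.10": 4,
    "Z94.0": 4,
}

# Collect the codes that share a group number.
groups = defaultdict(list)
for code, label in cluster_labels.items():
    groups[label].append(code)

for label in sorted(groups):
    print(f"Group{label}: {len(groups[label])} codes -> {sorted(groups[label])}")
```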

Phase Five — Inferences:

To analyze the results, I ran a loop to subsample the clusters. Then, I got four groups of diseases:

Group1 consists of diagnoses related to all kinds of burn/trauma conditions, respiratory failure resulting from surgery, neurologic problems, and infectious diseases. (40 unique diagnosis strings in total.)

Group2 contains diagnoses related to gastrointestinal problems, neurologic problems, and infectious diseases. (45 unique diagnosis strings in total.)

Group3 contains diagnoses mainly related to toxicology and neurologic problems. It looks like many drug overdose cases are present in the records. (26 unique diagnosis strings in total.)

Group4 contains diagnoses mainly related to cardiovascular problems, transplant conditions, and comorbidities from surgeries. (16 unique diagnosis strings in total.)

Plus, by looking at Figure 8, we can tell that Group2 and Group4 are very close to each other. Group3 and Group4 also overlapped.

Summary:

Using the node2vec package to embed the data and k-means to cluster and plot it is an effective way to reveal insights hidden in data that has network relationships. This model has the potential to help us discover the neighbors (possibly co-existing diseases) of each disease at the point of admission. In the future, it could help save a great deal of time and money for both doctors and high-cost patients.

In this blog, I have presented how to analyze electronic health records using network science, especially, for understanding disease co-occurrence across patient visits. Such an exploratory analysis can be useful to develop preliminary inferences as well as starting points for more in-depth analysis. Not only are the insights gained using our analysis useful for healthcare professionals, but the approach shown can be used by data scientists interested in analyzing large EHR datasets.

I completed the blog under the guidance of Dr. Karthik Srinivasan, Assistant professor — Business Analytics, School of Business, University of Kansas.

For your reference, you can find all the Python/R code here.

Please feel free to connect with me on LinkedIn.
