Stories by FHIR Shot Learning on Medium

XplainMD Part 4: From Graph Reasoning to Natural Language — Integrating GNNs with LLMs and Gradio

FHIR Shot Learning — Wed, 09 Apr 2025 14:28:38 GMT

In Part 3 of this series, the project advanced into the realm of deep learning, training a Relational Graph Convolutional Network (R-GCN)…


XplainMD Part 3: Relational GCN & GNNExplainer: Learning & Explaining Links

FHIR Shot Learning — Wed, 09 Apr 2025 13:48:49 GMT

Introduction


XplainMD Part 2: Finding the Missing Links with Machine Learning

FHIR Shot Learning — Wed, 09 Apr 2025 12:47:09 GMT

In Part 1 of the XplainMD series, we zoomed out to explore the architecture of biomedical knowledge — unpacking the rich topology of PrimeKG through centrality analyses, causal subgraphs, and community detection. By mapping how diseases, drugs, proteins, and phenotypes interconnect in a vast biomedical graph, the foundation has been laid for understanding not just what exists — but what might be missing.

Now, in Part 2, we shift gears from exploration to prediction.

Graphs open the door to a wide range of prediction tasks — from node classification (predicting properties of nodes) to link prediction (inferring missing or potential connections). Since our focus is on understanding hidden relationships in biomedical data, this series dives into link prediction.

So, what is link prediction?
It’s the task of asking: “Given what we know about this graph’s structure, can we infer meaningful relationships that aren’t explicitly present?”

This is where representation learning steps in.

By applying Node2Vec, the rich topology of the biomedical graph is transformed into dense vector embeddings that capture both semantic proximity and structural context. These embeddings serve as the foundation for downstream tasks — in this case, predicting missing or unknown edges between entities like drugs, diseases, and phenotypes.

These embeddings then become the input features for machine learning models such as Logistic Regression and XGBoost, which estimate whether a biologically plausible but currently unobserved connection exists between two entities, such as a disease and a phenotype.

This is where XplainMD begins to evolve from graph understanding into graph reasoning.

Data Pre-processing: Structuring the Graph

Before training any machine learning or deep learning model, the input data must be well-formatted, clean, and consistent. This preprocessing step converts the raw PrimeKG CSV into a form from which a graph can be built for both the machine learning and deep learning tasks that follow.

1. Data Loading

In this project, a filtered subset of PrimeKG was loaded into a pandas DataFrame to focus on the most clinically relevant biomedical relationships. Specifically, only the following relation types were extracted:

selected_relations = [
    "protein_protein",
    "disease_phenotype_positive",
    "bioprocess_protein",
    "disease_protein",
    "drug_effect",
    "pathway_protein",
    "disease_disease",
    "contraindication",
    "drug_protein",
    "indication"
]

What Does Each Row Represent?

Each row in the DataFrame corresponds to a single edge in the biomedical knowledge graph — that is, a meaningful connection between two biomedical entities.

The relevant columns include:

x_name, y_name: The actual names of the two nodes connected by the relation (e.g., “Alzheimer’s disease”, “APP”).
x_type, y_type: The entity types for each node — such as disease, protein, drug, phenotype, etc.
relation: The type of biomedical relationship between the nodes — e.g., disease_protein or drug_effect.
x_source, y_source: These fields do not indicate directionality of the edge — instead, they refer to the original source database (like NCBI or DrugBank) from which the node was extracted.

Direction ≠ Semantics

While the table structure follows a source → target format (x_name to y_name), this does not mean the graph is directed. According to the official PrimeKG paper, the graph is treated as undirected during analysis and modelling. This means that relationships are bidirectional, even though they are stored in a structured row format.

Why This Matters for Graph Construction

Understanding the true semantics of these edges is critical. When building the graph later in PyTorch Geometric (or any GNN library):

Treat the edges as undirected for most graph algorithms and embeddings like Node2Vec.
Ensure the edge type (relation) and node types (x/y_type) are preserved in a mapping — enabling construction of typed heterogeneous graphs.
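
One way to honour both points can be sketched in plain Python: store each edge under an order-independent key while keeping the relation and node types alongside it. The rows and the edge_key helper below are illustrative assumptions, not part of the project code.

```python
# Hypothetical rows in PrimeKG's column layout (values are made up for illustration).
rows = [
    {"relation": "disease_protein", "x_name": "alzheimer disease", "x_type": "disease",
     "y_name": "app", "y_type": "protein"},
    # The same edge stored in the opposite direction.
    {"relation": "disease_protein", "x_name": "app", "x_type": "protein",
     "y_name": "alzheimer disease", "y_type": "disease"},
]

def edge_key(row):
    """Order-independent key: the same pair maps to one key in either direction."""
    return (row["relation"], frozenset([row["x_name"], row["y_name"]]))

typed_edges = {}
for row in rows:
    # Keep node types next to the edge so a typed graph can be rebuilt later.
    node_types = {row["x_name"]: row["x_type"], row["y_name"]: row["y_type"]}
    typed_edges[edge_key(row)] = node_types

print(len(typed_edges))  # both rows collapse into a single undirected edge
```

Using a frozenset for the node pair means no sorting convention has to be remembered elsewhere in the pipeline.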

2. Text Normalisation

Biomedical datasets often contain inconsistent casing, hidden unicode characters, or stray spaces. To ensure uniformity across node names, every node label is lowercased, stripped of whitespace, and normalised using unicodedata.

import unicodedata

def clean_text(text):
    # NFKD-normalise, trim surrounding whitespace, and lowercase.
    return unicodedata.normalize("NFKD", str(text)).strip().lower()

df["x_name"] = df["x_name"].apply(clean_text)
df["y_name"] = df["y_name"].apply(clean_text)
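
To see what this normalisation does in practice, here is a small standalone check on a few made-up strings (the example names are not taken from PrimeKG). It also shows one side effect worth knowing: NFKD keeps accents as separate combining characters, so a cleaned name can be longer than it looks.

```python
import unicodedata

def clean_text(text):
    # Same normalisation as above: NFKD, trim whitespace, lowercase.
    return unicodedata.normalize("NFKD", str(text)).strip().lower()

# The "fi" ligature (U+FB01) is expanded into the two letters "f" and "i".
print(clean_text("\ufb01brosis "))  # fibrosis

# Precomposed and decomposed "o with diaeresis" end up as the same string...
a = clean_text("Sj\u00f6gren syndrome")
b = clean_text("Sjo\u0308gren syndrome")
print(a == b)  # True

# ...but the accent survives as a combining character, so the cleaned
# string is one code point longer than the 16 visible characters.
print(len(a))  # 17
```

As long as every lookup key passes through the same function, this extra code point is harmless; it only matters when cleaned names are compared against strings normalised some other way.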

3. Type Mapping

Node types in PrimeKG can vary in format — e.g., “gene/protein”, “chemical/drug”, or redundant variants like “bioprocess”. These are mapped to canonical categories to simplify modelling and ensure consistency.

node_type_mapping = {
    "gene/protein": "protein",
    "chemical/drug": "drug",
    "drug": "drug",
    "disease": "disease",
    ...
}

4. Extracting Node Names, Types, and Normalized Relations

Before applying any graph machine learning technique, we need to structure the biomedical data in a way that respects its semantic complexity. In PrimeKG, each row represents a biologically meaningful link — such as a drug treating a disease, a gene associated with a phenotype, or a protein interacting with another protein.

But models like R-GCN or Node2Vec don’t just want a list of edges — they need a clear map of what each node is, what role it plays, and how it’s connected.

Step 1: Assign Global Node IDs

The first step is to collect all unique node names across both columns (x_name and y_name) and assign each one a global integer ID. This gives us a consistent reference for each entity throughout the graph.

all_nodes = pd.concat([df["x_name"], df["y_name"]]).dropna().unique()
node_maps = {name: i for i, name in enumerate(sorted(all_nodes))}
print(f"[INFO] Total unique nodes: {len(node_maps):,}")

Analogy: Think of this like assigning a library index number to every book: whether it sits in the “Medicine” section or “Biochemistry,” each one is assigned a unique ID to keep everything organised.
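
A toy version of the same idea, using plain Python lists in place of the DataFrame columns (the names are illustrative only), makes the resulting mapping easy to inspect:

```python
# Toy node names standing in for the x_name / y_name columns (illustrative only).
x_names = ["asthma", "il6", "asthma"]
y_names = ["il6", "wheezing", "cough"]

all_nodes = set(x_names) | set(y_names)
# Sorting before enumerating makes the IDs reproducible across runs.
node_maps = {name: i for i, name in enumerate(sorted(all_nodes))}
# Reverse lookup, handy when turning predicted ID pairs back into names.
id_to_name = {i: name for name, i in node_maps.items()}

print(node_maps)  # {'asthma': 0, 'cough': 1, 'il6': 2, 'wheezing': 3}
```

The reverse dictionary is an addition for convenience; the original pipeline only shows the forward mapping.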

Step 2: Normalise the Relation Map

In a heterogeneous biomedical graph, relationships connect different types of nodes:

A disease–protein interaction is different from a drug–effect link
Some relationships are directional, others symmetrical

To build a flexible yet structured graph, the relation types were normalised by sorting their source and target node types alphabetically. This ensures consistency and avoids duplication (e.g., drug→disease is treated the same as disease→drug if the model doesn't care about directionality).

relation_map = {}

for rel in df["relation"].unique():
    subset = df[df["relation"] == rel]
    if subset.empty:
        continue

    type_pairs = set(zip(subset["x_type"], subset["y_type"]))

    for x_type, y_type in type_pairs:
        if x_type in node_type_mapping.values() and y_type in node_type_mapping.values():
            normalized_pair = tuple(sorted([x_type, y_type]))
            relation_map[rel] = normalized_pair

print(f"[INFO] Total unique normalized relations: {len(relation_map):,}")

Analogy: This is like grouping roads on a map based on which areas they connect, regardless of direction — a road from “Hospital to Lab” is still the same route as “Lab to Hospital.”
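
One detail of the loop above is that relation_map[rel] is overwritten on each iteration, so a relation that spans several type pairs keeps only the last one. If every pair should be retained, a set-valued variant is one option. The sketch below uses toy triples in plain Python and is an alternative illustration, not the project's code.

```python
from collections import defaultdict

# Toy (relation, x_type, y_type) triples; the values are illustrative only.
triples = [
    ("indication", "drug", "disease"),
    ("indication", "disease", "drug"),   # same pair, opposite order
    ("drug_effect", "drug", "phenotype"),
]

relation_pairs = defaultdict(set)
for rel, x_type, y_type in triples:
    # Sorting the two types makes the pair order-independent.
    relation_pairs[rel].add(tuple(sorted([x_type, y_type])))

for rel, pairs in sorted(relation_pairs.items()):
    print(rel, sorted(pairs))
# drug_effect [('drug', 'phenotype')]
# indication [('disease', 'drug')]
```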

Step 3: Build the Node Metadata Table

To keep track of each node’s type and global ID, a unified node_df was created that holds every unique node, its type (e.g., "protein", "disease"), and the global ID that was previously assigned.

node_df = pd.concat([
    df[["x_name", "x_type"]].rename(columns={"x_name": "node_name", "x_type": "node_type"}),
    df[["y_name", "y_type"]].rename(columns={"y_name": "node_name", "y_type": "node_type"})
]).dropna().drop_duplicates().reset_index(drop=True)

node_df["global_id"] = node_df["node_name"].map(node_maps)

Analogy: This is like creating a clean catalog where every book (node) has its genre (type) and index number (ID) — critical for graph construction.

Constructing the Graph with Global Node Mapping

Step 1: Global Node Indexing

Earlier in the pipeline, each unique biomedical entity (e.g., gene, disease, phenotype) was assigned a global integer ID using:

node_maps = {name: i for i, name in enumerate(sorted(all_nodes))}

This ensures that every node — regardless of its type — is mapped to a unique identifier, creating a flat, consistent index space that simplifies downstream processing.

Step 2: Adding Edges to the Graph

With node IDs in hand, the next step is to loop through each relationship type (from the normalised relation_map) and add the corresponding edges:

import networkx as nx

G = nx.Graph()  # undirected graph

for rel in relation_map:
    rel_df = df[df['relation'] == rel]
    
    src_indices = rel_df['x_name'].map(node_maps).fillna(-1).astype(int)
    dst_indices = rel_df['y_name'].map(node_maps).fillna(-1).astype(int)

    valid_edges = [(s, d) for s, d in zip(src_indices, dst_indices) if s != -1 and d != -1]
    G.add_edges_from(valid_edges)

print("\n[INFO] Graph constructed successfully with global node map.")

Here’s what this does:

For each relation, it selects the relevant rows from the dataset.
It converts the source and target node names into global IDs using the node_maps dictionary.
Any missing or invalid mappings are filtered out (using -1 as a sentinel).
All valid edges are added to the graph.

Learning Node Representations with Node2Vec

Node2Vec is an unsupervised algorithm that converts each node into a dense, low-dimensional embedding capturing the semantic and topological relationships between entities.

At its core, Node2Vec learns by simulating random walks across the graph — just like how Word2Vec learns word embeddings from natural language. It treats each node like a “word” and each walk like a “sentence.” By walking through the graph in flexible, biased ways (some walks stay local, others explore far), it captures both structural roles (e.g., hubs, bridges) and semantic proximity (e.g., diseases linked by shared phenotypes or pathways).

Image Generated using ChatGPT-4o

The result? A dense vector space in which nodes with similar roles or connections are embedded close together, even if they aren’t directly connected.

This makes it possible for ML models to detect missing links, suggest biological analogies, and uncover latent similarities — all from the geometry of the graph.

Converting NetworkX Graph to PyTorch Geometric Format

Once the undirected graph G is constructed using NetworkX, the next step is to embed its nodes into a continuous vector space using the Node2Vec algorithm. These embeddings are designed to capture the structural roles and semantic context of nodes based on their local and global neighbourhoods.

To do this effectively in PyTorch Geometric (PyG), the graph needs to be transformed into a format that PyG understands. This is done using:

from torch_geometric.utils import from_networkx

pyg_graph = from_networkx(G)

This line converts the G object (a standard NetworkX graph) into a torch_geometric.data.Data object. The resulting pyg_graph includes PyG-friendly attributes like edge_index, a 2D tensor that defines the graph's connectivity in terms of source and target node indices.

This format allows PyG models like Node2Vec, GCN, or RGCN to efficiently process the graph, optimise over its structure, and learn expressive embeddings.

The edge_index serves as the backbone of all graph-based computations in PyG, enabling operations like random walks, message passing, and convolution to be implemented seamlessly.

Sending to Device (CPU/GPU)

pyg_graph.edge_index = pyg_graph.edge_index.to(device)

To leverage GPU acceleration (if available), the graph’s edge list is moved to the appropriate device.

Initializing Node2Vec

Node2Vec doesn’t just look at who’s connected to whom — it walks the graph like a tourist, exploring local and global neighbourhoods to uncover hidden structural patterns.

It combines two clever ideas:

Random Walks:
For each node, Node2Vec simulates multiple random walks — like sending out a curious explorer to roam the neighbourhood. These walks create sequences of nodes, kind of like sentences in a language.
Skip-Gram Model:
Inspired by Word2Vec, the skip-gram model learns to predict a node’s neighbours (context) from these sequences. It treats nodes like words and walk sequences like sentences, capturing how often and in what order nodes appear together.

from torch_geometric.nn import Node2Vec

node2vec = Node2Vec(
    pyg_graph.edge_index,
    embedding_dim=128,
    walk_length=10,
    context_size=5,
    walks_per_node=20,
    num_negative_samples=1
).to(device)

The configuration used in this project is carefully tuned to balance exploration and efficiency during the Node2Vec training process:

embedding_dim=128: Each biomedical entity—be it a disease, gene, or drug—is represented by a 128-dimensional vector, capturing its structural and semantic context in the graph.
walk_length=10: Each simulated random walk explores 10 steps from a starting node, allowing it to traverse across nearby biological relationships (e.g., a disease → protein → drug → pathway).
context_size=5: Each walk is sliced into sliding windows of 5 consecutive nodes, and only nodes that share a window are treated as context for each other. This is akin to asking: “Which other genes are typically discussed near BRCA1 in biomedical pathways?”
walks_per_node=20: The model simulates 20 random walks per node, giving it enough exposure to both local and global graph structure. For instance, breast cancer might co-occur with immune genes, metabolic pathways, or co-morbid phenotypes in different walks.
num_negative_samples=1: For every positive pair (e.g., Breast Cancer ↔ TP53, which co-occur in a walk), one negative pair is generated by randomly sampling unrelated nodes (e.g., Breast Cancer ↔ Toe curvature). This teaches the model to pull meaningful pairs closer while pushing irrelevant pairs apart.

This setup enables Node2Vec to learn embeddings that reflect biomedical semantics, even though the training is entirely unsupervised. The internal buffers rowptr and col, essential for sampling operations, are also moved to the correct device (GPU or CPU) to ensure efficient execution.
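
To make walk_length and context_size concrete, here is a simplified, library-free sketch of how one walk turns into positive training pairs. It uses uniform random walks on a toy adjacency list and a sliding-window scheme that reflects my understanding of how PyG slices walks; treat both as assumptions rather than a description of PyG's internals.

```python
import random

def random_walk(adj, start, walk_length, rng):
    """Uniform random walk: the start node plus `walk_length` further steps."""
    walk = [start]
    for _ in range(walk_length):
        walk.append(rng.choice(adj[walk[-1]]))
    return walk

def context_windows(walk, context_size):
    """Slide a window of `context_size` consecutive nodes along the walk."""
    return [walk[i:i + context_size] for i in range(len(walk) - context_size + 1)]

# Toy undirected graph as an adjacency list (node IDs are arbitrary).
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
rng = random.Random(0)

walk = random_walk(adj, start=0, walk_length=10, rng=rng)
windows = context_windows(walk, context_size=5)

print(len(walk))     # 11 nodes: the start node plus 10 steps
print(len(windows))  # 7 windows of 5 nodes each

# Within each window, the first node is paired with the other four as positives.
pairs = [(w[0], c) for w in windows for c in w[1:]]
print(len(pairs))    # 28 positive (node, context) pairs from one walk
```

With walks_per_node=20, each node would start 20 such walks per pass, which is where most of the training signal comes from.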

Edge Sampling for Training

train_edges, val_edges = train_test_split(...)

Instead of training on the entire graph at once, the training loop samples batches of edges and trains on mini-batches. A 90–10 split is used for training and validation.

Training Loop with Early Stopping

The model is trained using the Adam optimizer. For each epoch:

A batch of nodes is sampled based on the edge list.
Positive random walks and negative random walks are generated.
The model computes the loss, backpropagates, and updates the weights.
Validation loss is computed and monitored.

if val_loss.item() < best_loss:
    ...
    torch.save(...)

If the validation loss improves, the model is saved. Otherwise, a counter is incremented. Training stops early if no improvement is seen for 200 consecutive epochs.
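
The patience logic can be isolated from the training loop and checked on its own. The helper below is a hypothetical stand-in that takes a list of validation losses instead of running a model, and the example uses a patience of 3 rather than 200 to keep it short.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Improvement: remember it (the real loop would save a checkpoint here).
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Ran out of epochs before the patience limit was hit.
    return len(val_losses) - 1, best_epoch

# Toy loss curve: the best value is at epoch 2, then three epochs without improvement.
losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.69]
print(early_stopping_epoch(losses, patience=3))  # (5, 2)
```

Note that training stops at epoch 5 even though epoch 6 would have improved; a large patience such as 200 makes this kind of premature stop much less likely.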

Visualizing Node2Vec Embeddings with t-SNE

The Node2Vec model was trained on an undirected biomedical graph to learn vector representations for each node based on its local and global connectivity. After training, these embeddings were projected into two dimensions using t-SNE, a non-linear dimensionality reduction technique that preserves local structure and neighbourhood relationships.

Before Node2Vec training

Embedding after Node2Vec training

Visualising the Graph Embeddings with t-SNE

The scatter plot above presents a t-SNE projection of node embeddings from the PrimeKG graph — with each point representing a node (e.g., disease, phenotype, drug, protein, pathway, or biological process). The axes (Component 1 and Component 2) are abstract latent dimensions created by t-SNE and don’t correspond to specific biomedical properties. Instead, they are used to help us visualise structural similarity in a 2D space.

Nodes that appear closer together in this plot were likely embedded with similar structural contexts — meaning they share common neighbors, appear in similar paths, or participate in similar types of relationships within the graph.

This version of the plot is subsampled to improve visibility while maintaining the distributional structure of the full graph.

Key Observations

No single dominant cluster is present, but we observe dense regions of overlap where nodes of different types co-locate — reflecting the highly interconnected nature of biomedical entities in PrimeKG.
Drugs and proteins appear more uniformly scattered, consistent with their broad connectivity across multiple biomedical contexts (e.g., a drug linking to diseases, pathways, and targets).
Phenotypes and diseases still form partial clusters, often overlapping — which makes sense biologically, as phenotypes are often clinical manifestations of diseases, and their embeddings are shaped by similar neighbourhood structures.
Pathways and biological processes are sparsely spread out, possibly due to lower edge density or fewer random walk interactions — indicating they may function more as semantic anchors in the graph than highly connected hubs.

What This Means

Despite the noise introduced by subsampling and the non-deterministic nature of t-SNE, there are clear semantic signals emerging from the structure:

Nodes of similar types often drift toward local neighborhoods, showing that the graph structure preserves contextual semantics.
The fact that different biomedical entities aren’t isolated but rather entangled in shared regions is reflective of real-world biology, where interdependencies are the norm.

These patterns validate that the graph construction and embedding pipeline is working — capturing not just node proximity, but meaningful biomedical associations that reflect the underlying complexity of healthcare knowledge.

Cosine Similarity Between Top Disease Embeddings

After training Node2Vec embeddings on the PrimeKG graph, it becomes possible to quantify how “similar” any two nodes are in the latent space using cosine similarity. The heatmap below visualises pairwise similarities between a curated set of disease nodes, helping assess whether the learned embeddings reflect intuitive medical relationships.

The heatmap above shows the cosine similarity between the learned embeddings of three related disease nodes: hypertension, insulin resistance, and metabolic syndrome. Each cell represents the cosine similarity score between two disease embeddings. As expected:

A score of 1.00 (along the diagonal) reflects perfect self-similarity.
Values closer to 0.00 suggest low or orthogonal similarity.
Higher off-diagonal values indicate that the model sees those diseases as structurally or contextually similar in the graph.

Interpreting the Patterns

The embedding similarity between hypertension and metabolic syndrome is the highest among the pairs (0.18), which may reflect their shared connection to cardiovascular and metabolic pathways.
Insulin resistance has modest similarity with metabolic syndrome (0.14) and hypertension (0.07), indicating weaker but non-random alignment — possibly due to sparse shared phenotypes or indirect links via common co-morbidities.
Despite their biomedical relevance to each other, the similarity values are still low in absolute terms, which highlights the structural sparsity and specificity of disease nodes in PrimeKG.

Why This Is Useful

This kind of similarity analysis provides a semantic lens into the embedding space — giving us clues about how the model interprets disease relationships based on graph structure. It can be valuable for:

Clustering diseases by mechanism or shared phenotypes
Identifying potential co-morbidities based on shared neighbourhoods
Prioritising links for drug repurposing or phenotype prediction
Filtering noise by removing structurally irrelevant candidates in downstream tasks

Why Are Similarity Scores Still So Low?

At first glance, one might expect diseases like insulin resistance and metabolic syndrome to have much higher similarity. But here’s why the scores remain low — and why that’s not necessarily a flaw:

1. Node2Vec is structure-aware, not domain-aware

Node2Vec learns from walk patterns, not domain semantics. Two diseases might be biologically related but embedded in separate neighbourhoods if they don’t share enough graph connectivity.

2. Cosine similarity focuses on direction, not magnitude

Cosine similarity captures directional alignment and ignores vector magnitude. Two well-connected nodes with meaningful overlap can therefore still score low if their embeddings point in different directions, which happens easily in a 128-dimensional space.
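
A tiny worked example, using three-dimensional toy vectors rather than real embeddings, shows the direction-versus-magnitude point directly:

```python
import math

def cosine_similarity(u, v):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

u = [1.0, 2.0, 0.0]
print(round(cosine_similarity(u, [10.0, 20.0, 0.0]), 6))  # 1.0: same direction, 10x the length
print(round(cosine_similarity(u, [0.0, 0.0, 5.0]), 6))    # 0.0: orthogonal
print(round(cosine_similarity(u, [2.0, 1.0, 0.0]), 6))    # 0.8: partial alignment
```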

Link Prediction with Logistic Regression

Once the Node2Vec model has learned low-dimensional vector representations (embeddings) for each node in the graph, the next step is to predict whether two nodes should be connected — even if they currently aren’t. This task is called link prediction.

Think of it like asking:

“Based on their embedding vectors, is there a high chance that disease A and phenotype B are biologically connected?”

Why Embeddings Matter Here

Each node (like asthma or IL6 protein) is now represented by a 128-dimensional vector that encodes its structural and contextual role in the graph. These embeddings serve as features for traditional machine learning models.

Step 1: Extract Positive Edges for a Specific Relation

This begins by collecting real edges for a biomedical relation of interest — for example, "disease_phenotype_positive". These are the known connections that serve as positive training examples.

relation_edges = np.sort(
    df[df["relation"] == relation_name][["x_name", "y_name"]].values.astype("U"),
    axis=1
)
relation_edges = np.unique(relation_edges, axis=0)

The dataset is filtered for the desired relation type.
The node pairs are sorted alphabetically to treat edges as undirected.
The duplicates are removed with np.unique to ensure each positive edge is counted only once.

Step 2: Collect Valid Nodes by Type

The list of valid nodes is extracted for the given source and target types (e.g., disease, phenotype) from the cleaned node metadata:

src_nodes = list(node_df[node_df["node_type"] == src_type]["node_name"])
tgt_nodes = list(node_df[node_df["node_type"] == tgt_type]["node_name"])

This ensures that sampling is being done from the correct subsets when generating negatives.

Step 3: Map Positive Edges to Global Node IDs

All valid node pairs in the positive edge set are mapped to their corresponding global integer IDs (as required for embedding lookup and modelling):

pos_edges = np.array([
    [node_maps[x], node_maps[y]]
    for x, y in relation_edges
    if x in node_maps and y in node_maps
])

Step 4: Generate Negative Samples

Since link prediction is a binary classification task, we also need negative examples: node pairs that are not connected in the graph. The same number of candidate negative edges is generated by randomly pairing a source-type node with a target-type node:

num_samples = len(pos_edges)
neg_edges = np.array([
    [node_maps[random.choice(src_nodes)], node_maps[random.choice(tgt_nodes)]]
    for _ in range(num_samples)
])

Note: These samples are synthetic. Because the sampler does not check candidates against the positive set, a few may coincide with known edges or with real but unlabelled ones. This introduces some label noise, which is common in graph-based negative sampling.
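
If that noise becomes a concern, one common refinement is to reject any candidate that is already a known positive. The sketch below is a hypothetical variant in plain Python with toy IDs; the sample_negatives helper and its max_tries guard are my own additions, not part of the original pipeline.

```python
import random

def sample_negatives(src_ids, tgt_ids, positive_pairs, num_samples,
                     seed=0, max_tries=100_000):
    """Sample distinct (src, tgt) pairs that are not known positives (in either order)."""
    rng = random.Random(seed)
    known = {frozenset(pair) for pair in positive_pairs}
    negatives = set()
    tries = 0
    while len(negatives) < num_samples and tries < max_tries:
        tries += 1
        candidate = (rng.choice(src_ids), rng.choice(tgt_ids))
        # Skip known positives; the set also drops candidates drawn twice.
        if frozenset(candidate) in known:
            continue
        negatives.add(candidate)
    return sorted(negatives)

# Toy IDs: diseases 0-2, phenotypes 10-12, and three known positive edges.
diseases, phenotypes = [0, 1, 2], [10, 11, 12]
positives = [(0, 10), (1, 11), (12, 2)]  # order within a pair does not matter

neg = sample_negatives(diseases, phenotypes, positives, num_samples=3)
print(len(neg))  # 3
```

This only removes edges that are labelled in the graph; real but unlabelled associations can still slip in, so some noise remains.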

Step 5: Train–Test Split

The positive and negative edges are split separately into 80% training and 20% testing:

from sklearn.model_selection import train_test_split

pos_train, pos_test = train_test_split(pos_edges, test_size=0.2, random_state=42)
neg_train, neg_test = train_test_split(neg_edges, test_size=0.2, random_state=42)

Step 6: Compute Edge Features

Each edge (whether positive or negative) is represented by a dot product of its two node embeddings:

def edge_features(edges):
    return (embeddings[edges[:, 0]] * embeddings[edges[:, 1]]).sum(dim=1).view(-1, 1)

The dot product measures vector alignment — a simple proxy for similarity.
Higher values suggest a stronger connection between the two nodes.

Then the feature matrices and labels are constructed:

X_train = torch.cat([edge_features(pos_train), edge_features(neg_train)], dim=0).cpu().numpy()
y_train = np.array([1] * len(pos_train) + [0] * len(neg_train))

X_test = torch.cat([edge_features(pos_test), edge_features(neg_test)], dim=0).cpu().numpy()
y_test = np.array([1] * len(pos_test) + [0] * len(neg_test))

Step 7: Train Logistic Regression

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

class_weight="balanced" helps account for class imbalance.
max_iter=1000 ensures convergence for larger datasets.

Step 8: Score a Specific Node Pair

Lastly, the model can be queried for a specific disease–phenotype pair — like:

"permanent neonatal diabetes mellitus" ↔ "retinopathy"

The dot product is computed, passed through the classifier, and a probability is returned:

pair_feat = (embeddings[u] * embeddings[v]).sum().item()
pair_score = model.predict_proba(np.array([[pair_feat]]))[:, 1][0]
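
With a single input feature, what the classifier computes is just a sigmoid over a scaled and shifted dot product. The sketch below spells that out; the weight and bias values are hypothetical, chosen only for illustration, and are not the fitted parameters from this experiment.

```python
import math

def link_probability(dot_product, weight, bias):
    """Logistic regression with one feature: sigmoid(weight * x + bias)."""
    return 1.0 / (1.0 + math.exp(-(weight * dot_product + bias)))

# Hypothetical parameters, for illustration only.
weight, bias = 0.5, -0.1

for score in (-2.0, 0.0, 0.2, 2.0):
    print(score, round(link_probability(score, weight, bias), 4))
```

Because the output rises monotonically with the dot product, the decision rule reduces to a single threshold (here at -bias / weight = 0.2). Any structure in the embeddings that a single number cannot express is invisible to this model.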

Logistic Regression for Disease–Phenotype Link Prediction

This evaluation tests whether simple logistic regression on Node2Vec embeddings can predict meaningful biomedical associations. Specifically, it targets the disease_phenotype_positive relation—i.e., known links between diseases and observable phenotypes.

The classifier was trained using dot-product-based features between node embeddings for each disease–phenotype pair. The results are broken down into:

Evaluation Metrics

Accuracy: 0.5009
The model is only marginally better than random guessing (which would be ~0.50 in a balanced binary setup).
Precision: 0.5010
Only slightly more than half of the predicted links are correct.
Recall: 0.4677
The model missed a fair number of actual links.
F1 Score: 0.4709
The harmonic mean of precision and recall, reflecting the overall balance.
ROC-AUC: 0.5100
Shows poor separation between true and false links, close to chance level (0.5).

These metrics highlight a limitation: although embeddings are informative, logistic regression alone cannot capture the complexity of biomedical graph structures.
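
For readers who want to see how the threshold-based metrics relate to one another, here is a minimal plain-Python computation from true and predicted labels. The labels are a toy example and have nothing to do with the scores reported above.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels, from confusion counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels: 4 real links and 4 non-links.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

for name, value in binary_metrics(y_true, y_pred).items():
    print(name, round(value, 4))
# accuracy 0.625
# precision 0.6667
# recall 0.5
# f1 0.5714
```

ROC-AUC is left out because it is computed from predicted probabilities rather than hard labels.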

Specific Pair Score

The logistic regression model was also queried for a specific edge:

“permanent neonatal diabetes mellitus” ↔ “retinopathy”

Predicted probability: 0.5027

This score is barely above 0.5, suggesting the model has low confidence in this edge’s existence.

Summary

This experiment demonstrates that logistic regression over simple dot-product embeddings is insufficient for nuanced biomedical link prediction. While this setup works as a baseline, it motivates the use of more powerful models like XGBoost, GNNs, or transformer-based approaches for improved prediction quality.

Will XGBoost be any better?

XGBoost for Link Prediction on Biomedical Graphs

To improve upon the earlier baseline, an XGBoost classifier was trained using concatenated Node2Vec embeddings for disease–phenotype pairs. This setup leverages tree-based learning to better capture nonlinear relationships between node representations in the biomedical graph.

The evaluation focused again on the disease_phenotype_positive relation.
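
The post does not show the XGBoost feature code, so the following is only a sketch of what "concatenated" most likely means, contrasted with the dot-product feature used earlier. The embedding table is random and serves purely to illustrate shapes.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in embedding table: 6 nodes x 128 dimensions (random, for shape illustration only).
embeddings = rng.normal(size=(6, 128))
edges = np.array([[0, 3], [1, 4], [2, 5]])  # toy (source, target) ID pairs

src = embeddings[edges[:, 0]]
dst = embeddings[edges[:, 1]]

dot_features = (src * dst).sum(axis=1, keepdims=True)  # one number per edge
concat_features = np.concatenate([src, dst], axis=1)   # both full vectors per edge

print(dot_features.shape)     # (3, 1)
print(concat_features.shape)  # (3, 256)
```

Concatenation gives the model 256 inputs per edge instead of one, but unlike the dot product it is sensitive to column order, so positive and negative pairs should be arranged the same way (for example, always disease first, phenotype second).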

Performance Metrics

Accuracy: 0.80
About 80% of the predictions were correct.
Precision: 0.80
High precision means most predicted links are actually relevant.
Recall: 0.79
The model successfully retrieved a large portion of the true links.
F1 Score: 0.80
Reflects a strong balance between precision and recall.
ROC-AUC: 0.88
Indicates a decent discrimination between positive and negative link predictions.

These results reflect a clear performance boost over logistic regression, highlighting XGBoost’s ability to capture richer, non-linear patterns from the node embeddings.

Specific Pair Score

The model was queried for the link:

“permanent neonatal diabetes mellitus” ↔ “retinopathy”

Predicted probability: 0.4467

Interestingly, while overall performance is strong, the probability for this specific pair is lower than expected, possibly due to data sparsity or lack of direct co-occurrence in the walk-based embedding generation process.

XGBoost proves to be a powerful link predictor in the biomedical domain when trained on structural node embeddings. It outperforms logistic regression by a large margin and serves as a strong baseline for future comparison with more complex models like Graph Neural Networks or attention-based link predictors.

Conclusion

This blog explored the use of Node2Vec embeddings on the PrimeKG biomedical graph, followed by traditional machine learning models (Logistic Regression and XGBoost) for link prediction between disease and phenotype nodes. While XGBoost outperformed Logistic Regression with significantly better precision and AUC scores, both models struggled to capture the complex semantics of biomedical relationships. The cosine similarity heatmap further revealed that even with high-dimensional embeddings, the latent space remained weakly informative when it came to reflecting true biological proximity.

This outcome highlights an important limitation: traditional ML models operating on static embeddings are not sufficient for relational reasoning in multi-relational graphs like PrimeKG. They treat the problem as a classification task over vector pairs, overlooking the rich contextual interactions and multi-hop dependencies within the graph.

The full code is available on GitHub.

Coming Up Next:

XplainMD Part 3: Relational GCN + GNNExplainer: Learning & Explaining Links

In this blog we explored how shallow models like Node2Vec + XGBoost can uncover patterns in biomedical graphs. Now, it’s time to level up.

In the next part of this series, we dive into Relational Graph Convolutional Networks (R-GCN) — a graph-native neural architecture built to learn directly from multi-relational knowledge graphs like PrimeKG.

Unlike traditional pipelines, R-GCN dynamically updates node representations based on both edge types and neighbourhood structure, capturing the true semantics of biomedical relationships.

But we won’t stop at prediction.

Explainability will take center stage with the introduction of GNNExplainer, a tool that reveals the “why” behind each link prediction by uncovering the subgraph structures and features that drive the model’s decisions.

This next post will show how R-GCN + GNNExplainer work together to produce trustworthy, interpretable insights — a must-have in domains like drug discovery, clinical reasoning, and precision medicine.

Stay tuned: this is where machine learning meets meaning.

References:

Grover, A. and Leskovec, J., 2016, August. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 855–864): https://arxiv.org/abs/1607.00653

XplainMD Part 1: A visual exploration of PrimeKG

FHIR Shot Learning — Wed, 09 Apr 2025 10:23:55 GMT

Island Clusters of Biomedical Relations in PrimeKG

Mapping the PrimeKG Graph: A Visual and Analytical Journey

In the introductory post, a future was envisioned where AI doesn't just make clinical predictions but also explains them. A future where biomedical knowledge is structured, transparent, and interactive, not buried in black-box models or locked away in tabular data.

But for an AI system to reason this way — to connect diseases with phenotypes, drugs with proteins, and uncover the biological logic behind them — it needs a foundation that is both rich in context and grounded in structure.

That foundation is PrimeKG!

Before training any model, it is essential to first understand the data: not just its format, but its form, its relationships, and its interpretability. This post offers a visual deep dive into PrimeKG, a richly curated precision medicine knowledge graph developed by researchers at Harvard. It explores the graph’s composition, its entities and relationships, and why it serves as an ideal substrate for building explainable, graph-based reasoning systems in healthcare.

Graphs, unlike traditional datasets, speak a different language. They capture connections, structures, and hierarchies — and as such, they require unique visualisation techniques. This post also serves as a gentle introduction to the world of graph data science, offering a glimpse into how graphs can enhance transparency and inference in biomedical AI applications.

But first, to truly appreciate the visualisations, it’s important to understand a few foundational concepts in graph theory and how they apply to structured biomedical data.

Understanding Graph Basics Through a Biomedical Example

Shortest Path between Autism Spectrum Disorder and Tamoxifen

The image above represents a miniature subgraph — a small portion of a much larger biomedical knowledge graph like PrimeKG. Even with just three nodes and a few connections, it demonstrates how graphs capture biological relationships in a structured, interpretable way.

Nodes (Entities)

Each circle represents a node — an entity in the biomedical world:

Tamoxifen (green) → a drug
AR (blue) → a protein (Androgen Receptor)
Autism Spectrum Disorder (red) → a disease

In a traditional dataset, these might be separate rows or values in unrelated columns.
But in a graph, they’re connected, and those connections carry meaning.

Edges (Relationships)

The curved lines represent edges, or relationships between entities:

drug_protein → Tamoxifen interacts with the AR protein
disease_protein → AR is associated with Autism Spectrum Disorder

Each edge is labelled, meaning it’s not just a connection, but a specific kind of biological relationship.

This distinction is important: graphs allow the model to know that “Tamoxifen targets AR” is a different kind of interaction than “AR is linked to Autism” — even though they share a common node.

Degree

The degree of a node is the number of edges attached to it. In the subgraph above, AR has a degree of 2.
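As a minimal sketch (not the project's own code), the three-node subgraph above can be written as a list of labelled edges in plain Python. The variable names are illustrative assumptions; only the node and relation labels come from the figure.

```python
from collections import Counter

# Labelled edges of the toy subgraph: (source, relation, target).
edges = [
    ("Tamoxifen", "drug_protein", "AR"),
    ("AR", "disease_protein", "Autism Spectrum Disorder"),
]

# Degree = number of edges touching each node (edges treated as undirected).
degree = Counter()
for source, _relation, target in edges:
    degree[source] += 1
    degree[target] += 1

print(degree["AR"])  # AR touches both edges
```

Keeping the relation label on every edge is what lets a model tell a drug_protein link apart from a disease_protein link, even when both share the AR node.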

Why This Matters

This small subgraph illustrates how knowledge graphs integrate diverse biomedical data into a single, connected structure. With this one visualisation, we can begin to ask:

Could Tamoxifen, typically used in other contexts, have a potential repurposing role in conditions linked to AR?
Is AR acting as a bridge between different diseases and drug mechanisms?
What other entities are connected to this path?

In traditional tabular data, surfacing questions like this would require manually stitching together databases. In a graph, the structure makes these paths visible — and ready for both human interpretation and machine reasoning.

Now that some of the basics are clear, let’s dive in and explore the beauty of graph theory and how it can revolutionise healthcare in the long run.

1. Introduction & Setup

As mentioned earlier, PrimeKG is a massive precision medicine knowledge graph containing diverse entities: drugs, diseases, phenotypes, proteins, and more. Each relation (like disease_protein or drug_effect) forms a valuable edge type. The raw dataset is available on their GitHub page.

Unique nodes and Edges

As shown above, the graph contains 129375 unique nodes and 8100128 unique edges. To keep the project manageable, I selected a subset of relations and built a second dataframe from them; this filtered dataframe is the one used throughout the project.

So the final list looks like this:

selected_relations = [
    "protein_protein",
    "disease_phenotype_positive",
    "bioprocess_protein",
    "disease_protein",
    "drug_effect",
    "pathway_protein",
    "disease_disease",
    "contraindication",
    "drug_protein",
    "indication"
]

Total unique nodes: 68857
Total unique edges: 1803526
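The filtering step can be sketched in plain Python as follows. This is a hedged illustration on hypothetical toy rows, not the project's code: the column names (relation, x_name, y_name) and the exact way unique edges are counted are assumptions.

```python
# Hypothetical toy rows; the column names (relation, x_name, y_name) are assumptions.
rows = [
    {"relation": "drug_protein", "x_name": "Tamoxifen", "y_name": "AR"},
    {"relation": "disease_protein", "x_name": "AR", "y_name": "Autism Spectrum Disorder"},
    {"relation": "anatomy_anatomy", "x_name": "heart", "y_name": "aorta"},
    {"relation": "drug_protein", "x_name": "Tamoxifen", "y_name": "AR"},  # duplicate row
]
selected = {"drug_protein", "disease_protein"}

# Keep only rows whose relation is in the selected set.
kept = [r for r in rows if r["relation"] in selected]

# A set of (source, relation, target) triples removes duplicate edges.
unique_edges = {(r["x_name"], r["relation"], r["y_name"]) for r in kept}
unique_nodes = {name for x, _, y in unique_edges for name in (x, y)}

print(f"Total unique nodes: {len(unique_nodes)}")
print(f"Total unique edges: {len(unique_edges)}")
```

On the real data the same idea would apply to the full edge table, with the ten relations listed above as the selected set.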

2. Basic Graph Exploration

Let’s begin with basic graph exploration by examining which types of nodes are most prevalent in the dataset. Unsurprisingly, the figure below shows that gene/protein nodes dominate, a pattern that’s typical in biomedical knowledge graphs. Genes and proteins serve as central players in biological systems: they are associated with diseases, often as causal factors or biomarkers, and they also interact directly with drugs through mechanisms like binding, inhibition, or activation. Their high connectivity and biological significance make them pivotal in understanding disease mechanisms and therapeutic strategies, which is reflected in their heavy representation within PrimeKG.

Node Types (e.g., disease, protein, drug)

Next, the Edge type distribution is explored to understand how relationships are represented in the precision medicine knowledge graph. As shown in the figure below, the protein_protein edge type overwhelmingly dominates the graph. This is expected — proteins are the functional workhorses of the cell and engage in a wide variety of interactions, ranging from signalling and metabolic processes to forming structural complexes. These interactions are not only abundant but also critical for downstream biological effects, which explains their heavy representation in datasets like PrimeKG. Other prominent relations include disease–phenotype, bioprocess–protein, and disease–protein interactions, all of which are vital for modelling biological mechanisms and clinical conditions.
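Both distributions boil down to simple counting. The sketch below shows one way to do it on a hypothetical toy edge list; the tuple layout is an assumption made for illustration, not the dataset's actual schema.

```python
from collections import Counter

# Hypothetical toy edges: (x_name, x_type, relation, y_name, y_type).
edges = [
    ("UBC", "gene/protein", "protein_protein", "TRAF2", "gene/protein"),
    ("UBC", "gene/protein", "protein_protein", "ETS1", "gene/protein"),
    ("breast cancer", "disease", "disease_protein", "TRAF2", "gene/protein"),
]

# Record each node's type once, then count nodes per type.
node_type = {}
for x, x_type, _rel, y, y_type in edges:
    node_type[x] = x_type
    node_type[y] = y_type
node_type_counts = Counter(node_type.values())

# Count edges per relation.
edge_type_counts = Counter(rel for _, _, rel, _, _ in edges)

print(node_type_counts.most_common())
print(edge_type_counts.most_common())
```

Counting node types over distinct nodes (rather than over edge rows) avoids inflating the totals of highly connected entities.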

To understand which entities are most connected in the graph, the top 15 nodes by degree were computed. As mentioned above, degree here refers to the number of connections (edges) a node has with other nodes. The results, shown in the bar chart below, highlight UBC (the ubiquitin C protein) as the most connected node in the graph. This makes biological sense, as UBC plays a central role in protein degradation and signalling pathways, interacting with a large number of other proteins.

Interestingly, Autosomal Recessive Inheritance and Autosomal Dominant Inheritance also appear among the highest-degree nodes, underlining the prevalence of inheritance patterns across multiple diseases and phenotypes in biomedical data.

Other high-degree nodes include various diseases like breast cancer, hereditary breast ovarian cancer syndrome, and squamous cell carcinoma, as well as proteins like TRAF2, ETS1, and PLCG1 — all known for their involvement in major biological and pathological processes.

These highly connected nodes (or hubs) can play a pivotal role in learning embeddings, as they influence the message passing process (which will be discussed in upcoming posts) more significantly than low-degree nodes.
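Ranking nodes by degree can be sketched as below. The helper name top_k_by_degree and the toy edges are hypothetical; note also that this version counts distinct neighbours, which can differ from a raw edge count in a multi-relational graph.

```python
from collections import defaultdict

def top_k_by_degree(edges, k):
    """Return the k nodes with the most distinct neighbours, highest first."""
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    # Sort by descending degree, breaking ties alphabetically for stable output.
    ranked = sorted(neighbours, key=lambda n: (-len(neighbours[n]), n))
    return [(n, len(neighbours[n])) for n in ranked[:k]]

# Hypothetical toy edges, not real PrimeKG data.
toy_edges = [("UBC", "TRAF2"), ("UBC", "ETS1"), ("UBC", "PLCG1"), ("TRAF2", "ETS1")]
print(top_k_by_degree(toy_edges, k=2))
```

Calling the same function with k=15 on the full edge list would give a ranking comparable to the bar chart.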

Local subgraph around ‘breast cancer’ illustrating multi-relation connections with key genes, proteins, and phenotypes.

The figure above is a biomedical constellation, with breast cancer at its center.

Each line, each node, each colour tells a story:

In the center, Breast Cancer (gold) acts as the anchor point — the node that was queried.
Surrounding it are 25 of its most connected neighbours, forming a mini-universe of proteins, genes, and other diseases that play a role in its biology.
The sky-blue coloured circles are proteins or genes linked to breast cancer — some may be drug targets, some might influence pathways related to tumour growth or suppression.
The salmon-coloured nodes represent other diseases (breast carcinoma, breast neoplasm), possibly sharing phenotypes, risk factors, or even common therapeutic strategies.
The purple edges labelled disease_protein highlight known associations, such as “this protein is known to be associated with this disease.”

The more connected a node, the more central it may be to the disease’s behaviour — or, potentially, to its treatment.
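A local neighbourhood like this one can be extracted with a few lines of code. The sketch below is one possible approach under stated assumptions: the helper ego_subgraph is hypothetical, neighbours are ranked by their own degree, and the gene names in the toy data are illustrative rather than read off the figure.

```python
from collections import defaultdict

def ego_subgraph(edges, centre, max_neighbours):
    """Keep the centre, its highest-degree neighbours, and the edges among them."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # Rank the centre's neighbours by their own degree in the full graph.
    ranked = sorted(adj[centre], key=lambda n: (-len(adj[n]), n))
    keep = {centre, *ranked[:max_neighbours]}
    return [(u, v) for u, v in edges if u in keep and v in keep]

# Hypothetical toy edges for illustration only.
toy_edges = [
    ("breast cancer", "BRCA1"), ("breast cancer", "TP53"),
    ("breast cancer", "breast carcinoma"), ("TP53", "BRCA1"), ("TP53", "MDM2"),
]
sub = ego_subgraph(toy_edges, "breast cancer", max_neighbours=2)
print(sub)
```

With max_neighbours=25 the same function would produce a neighbourhood of the size described above.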

Top 10 neighbours of breast cancer. The yellow node represents breast cancer, while the blue nodes represent the proteins it is associated with.

Why This Matters?

Graphs like this aren’t just visual aids — they’re interactive blueprints of biomedical reasoning.

Instead of parsing flat gene lists or static co-morbidity tables, this structure invites deeper questions:

What proteins serve as bridges between breast cancer and other diseases?
Are there central hubs in this neighbourhood — nodes that consistently appear across multiple disease pathways?
Could any of these connections hint at drug repurposing opportunities or open up new lines of inquiry?

This is the power of graph-based storytelling: not just identifying what is connected, but uncovering how and why those connections could matter.

To explore these patterns further, the next section dives into centrality analysis — a key step toward prioritising influential biomedical entities.

3. Centrality Measures

In graph theory, centrality is a key concept that helps quantify the importance or influence of a node within a network.

Let’s break it down with a simple analogy.

Imagine you’re part of a social network. You have 10 friends, while your friend has just 2.
In this scenario, your degree — your number of direct connections — is 10, compared to their 2.

Now ask: who’s more likely to have greater influence, broader reach, or higher visibility?

The answer is obvious: you.

This is the core idea behind centrality.
Just like a celebrity with millions of followers can spread information faster and wider, nodes with high centrality in a graph can exert greater influence on the entire structure.

In the context of biomedical graphs, this has powerful implications:

If a gene, disease, or protein node has thousands of connections, then any disruption — be it a mutation, a treatment, or an interaction — can create a cascade of effects.
Such nodes are not just important; they are biological bottlenecks, gateways, or vulnerabilities within the system.

That’s why centrality isn’t just a mathematical measure — it’s a strategic lens for identifying critical biomedical entities and mapping crucial biological pathways.

Types of Centrality Used in This Project

To capture different flavours of “importance” in a graph, this project explores four key centrality measures:

Degree Centrality: Who has the most direct connections?
Measures how many edges a node has — useful for identifying immediate hubs.
Betweenness Centrality: Who acts as a bridge between different parts of the graph?
Highlights nodes that often lie on the shortest paths, playing the role of connectors or gatekeepers.
Closeness Centrality: Who can reach others the fastest?
Prioritises nodes that are, on average, closest to all others — indicating efficient spreaders or integrators.
PageRank (a variant of Eigenvector Centrality): Who holds influence based on who they’re connected to?
It’s not just about how many connections a node has, but how important those connections are.

Each of these offers a unique lens to identify key players in the biomedical graph — whether they’re hubs, bridges, gateways, or influencers in the system’s flow of biological information.

Understanding Degree Centrality in PrimeKG

Degree centrality is perhaps the most straightforward way to measure a node’s importance in a graph.
It counts the number of direct connections (or edges) a node has to other nodes, answering a simple question: who has the most connections?

Left Image: Subgraph sampled with 1,000 edges from the full PrimeKG graph (randomly selected for performance-friendly visualisation).
Right Image: Local subgraph centered around the AR (Androgen Receptor) node, showing a limited number of its immediate neighbours.

In the image shown above, on the left, you see the sampled graph coloured by degree centrality. Each dot represents a gene or protein, and the more connected a node is, the more it stands out visually.
This network reveals something interesting: a large number of nodes have relatively high degrees. This is common in biological networks, where certain proteins or genes participate in many interactions due to their multifunctional roles.

Zooming in on the right, we see one such node: AR (Androgen Receptor). The subgraph shows AR connected to several neighbouring proteins like CREB1, YES1, and GTF2F1, highlighting its hub-like role in the network.

So What Does This Mean?

AR has a high degree — it connects to many other nodes.
It is a key interaction point in the network, potentially playing a central role in regulatory or signalling pathways.
Nodes like AR could be biological bottlenecks, making them attractive targets for research or therapeutic intervention.

Real-World Implication:
The Androgen Receptor is a well-established target in prostate cancer therapy. Its high degree in PrimeKG reflects its biological relevance and validates the use of graph analytics in surfacing known drivers of disease.

Did You Know?
In social networks, high-degree users are often “influencers.” In biology, highly connected proteins are called hubs, and are sometimes split into “party hubs” (which interact with most of their partners at the same time) and “date hubs” (which bind different partners at different times or places). Targeting hubs could affect a wide array of processes, making them powerful but risky intervention points.

Disclaimer & Context

The visualisation above represents a sampled subgraph from the full PrimeKG biomedical network, containing 1,000 randomly selected edges. While node colours reflect degree centrality quantiles (top 10%, mid 30%, and remaining), the graph itself is not ranked or filtered by centrality — it includes nodes with varying degrees to offer a representative slice of the overall topology.

In the accompanying bar plot, certain genes and proteins like UBC exhibit extremely high degrees (often exceeding 5,000), confirming their role as global hubs. However, in this visualisation, nodes like UBC may appear with much smaller degrees due to sampling constraints, which help avoid rendering overload.

The subgraph on the right, centered on AR (Androgen Receptor), was extracted by selecting AR as a top-degree node and plotting it alongside a handful of its immediate neighbours. While AR’s full degree in the original graph is much higher, this zoomed-in view offers a focused look at its local interactions, helping to highlight its structural role without the clutter of its full connectivity.
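The sampling and colouring logic described here can be sketched as follows. This is a hedged approximation, not the project's plotting code: the function name sample_and_bin, the fixed seed, and the rank-based binning are assumptions.

```python
import random
from collections import Counter

def sample_and_bin(edges, n_edges, seed=0):
    """Sample edges at random, then bin nodes by degree centrality rank."""
    rng = random.Random(seed)
    sampled = rng.sample(edges, min(n_edges, len(edges)))
    degree = Counter()
    for u, v in sampled:
        degree[u] += 1
        degree[v] += 1
    n = len(degree)
    # Degree centrality: degree divided by the maximum possible degree (n - 1).
    centrality = {node: d / max(n - 1, 1) for node, d in degree.items()}
    ranked = sorted(centrality, key=centrality.get, reverse=True)
    bins = {}
    for rank, node in enumerate(ranked):
        frac = rank / n
        bins[node] = "top 10%" if frac < 0.10 else "mid 30%" if frac < 0.40 else "rest"
    return sampled, centrality, bins

# Toy star graph: node 0 is a hub linked to 20 leaves.
star = [(0, leaf) for leaf in range(1, 21)]
sampled, centrality, bins = sample_and_bin(star, n_edges=1000)
print(centrality[0], bins[0])
```

Because degrees are recomputed on the sampled edges only, a global hub can look far less connected in the sample than it is in the full graph, which is the caveat noted above.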

Degree centrality gives a strong first impression of a node’s involvement, but it doesn’t tell us how information flows through the network.
To dig deeper, the next section explores betweenness and closeness centrality, which shift the focus from direct connections to the paths that enable influence, navigation, and control in complex biological systems.

Betweenness Centrality: Who Connects the Clusters?

Just knowing a lot of people doesn’t always make someone influential. Sometimes, it’s where someone stands — right at the intersection of different groups — that makes them truly powerful.

Imagine a person who isn’t the most popular, but who connects your school friends with your work circle, or your gym buddies with your college network. They become the bridge, the one through whom ideas, opportunities, or even gossip travel.

In graph terms, that’s what betweenness centrality captures — not who knows the most people, but who connects the most communities.

The same holds true in biomedical graphs. Some nodes matter not because they connect to many others, but because they sit between them — bridging otherwise disconnected parts of the network.

That is the essence of betweenness centrality. It measures how often a node lies on the shortest paths between other nodes. In simple terms, it reveals which nodes serve as bridges for the flow of information.

Another Example: Think of it like an airport hub! A place like Doha or Istanbul might not have the most overall flights, but they connect continents — Europe, Asia, and Africa — with strategic efficiency. Similarly, in a biomedical context, a gene or disease might not be the most connected, but if it links two critical biological modules — say, neuro-degeneration and immune signalling — it becomes structurally essential.
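To make the bridge idea concrete, here is a small pure-Python sketch of Brandes' algorithm for unweighted, undirected graphs, run on a hypothetical toy graph of two triangles joined through a single node X. It is an illustration under those assumptions, not the code used to produce the figures; graph libraries such as networkx offer an equivalent betweenness_centrality function.

```python
from collections import deque, defaultdict

def betweenness(adj):
    """Unnormalised betweenness for an undirected, unweighted graph (Brandes)."""
    score = dict.fromkeys(adj, 0.0)
    for s in adj:
        # BFS from s, counting shortest paths (sigma) and recording predecessors.
        order, preds = [], defaultdict(list)
        sigma = dict.fromkeys(adj, 0)
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating path dependencies.
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected pair was counted once from each endpoint.
    return {v: b / 2 for v, b in score.items()}

# Two triangles (A, B, C) and (D, E, F) joined only through X.
toy = {
    "A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "X"],
    "X": ["C", "D"],
    "D": ["X", "E", "F"], "E": ["D", "F"], "F": ["D", "E"],
}
scores = betweenness(toy)
print(max(scores, key=scores.get))
```

X has only two connections, fewer than C or D, yet every path between the two triangles runs through it, so it ends up with the highest betweenness.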

Left: Betweenness centrality visualized on a sampled subgraph of PrimeKG (1,000 edges). Nodes with darker purple color exhibit higher betweenness — acting as critical bridges within the biomedical network.
Right: Local subgraph centered on Schizophrenia, one of the top-ranked nodes by betweenness. Its immediate neighbors include genes, proteins, and a drug (Mosapramine) — illustrating its central role in connecting multiple biological pathways.

A Snapshot of Betweenness in PrimeKG

The figure above visualises this in action. On the left, a sampled subgraph from PrimeKG (1,000 edges) is shown, with node colour indicating betweenness.

Darker purple nodes are the ones with higher betweenness scores — they appear on more shortest paths, suggesting they influence how signals or relationships traverse the graph.
These are the bridge builders — the ones holding the graph together.

On the right, we zoom into one such node: Schizophrenia. Despite being a psychiatric condition, it sits at the center of multiple relationships, acting as a connector between genes, proteins, and a drug (Mosapramine).

Why Does Schizophrenia Matter in This Network?

At first glance, it may seem surprising to see Schizophrenia — a psychiatric disorder — appear as a central hub in a biomedical knowledge graph. But when viewed through the lens of systems biology, it makes sense for it to be in this strategic position.

Schizophrenia as a Cross-System Connector

Schizophrenia is not just a brain disorder; it’s a multi-system condition:

Genes involved in neuronal signalling, such as GSK3B and those encoding NMDA receptor subunits, are implicated in its pathophysiology.
Inflammatory pathways and oxidative stress have increasingly been recognised in the research of schizophrenia.
It shares genetic architecture with other complex diseases like bipolar disorder, Alzheimer’s, and even autoimmune disorders.

This means Schizophrenia “sits” at the intersection of multiple biological modules — neurological, immunological, and pharmacological.

Why Betweenness Centrality Confirms That

High betweenness means Schizophrenia lies on many shortest paths between other nodes — acting like a connector:

It links proteins involved in inflammation with those related to neurodevelopment.
It bridges drug targets to genetic risk factors, potentially exposing new angles for drug repurposing or comorbidity research.

Real-World Implications

Drug development: If a drug affects Schizophrenia-linked pathways, it may also impact other diseases it’s “connected to.”
Biomarker discovery: Understanding its neighbourhood could highlight shared biomarkers with other neurological or systemic disorders.
Precision medicine: It provides a network-based rationale for why some patients with Schizophrenia show immune or metabolic symptoms.

So in essence, Schizophrenia’s presence as a high-betweenness node isn’t random — it reflects a biologically rich and clinically nuanced role in the graph, helping to connect seemingly unrelated processes across the human body.

Note: This is a sampled graph. Full connectivity (e.g., actual node degrees) may be much higher — this visualisation is optimised for clarity, not scale.

Closeness Centrality

Closeness centrality measures how near a node is to all other nodes in the network. It doesn’t focus on how many connections a node has, but rather how quickly it can reach everyone else.

Think of it like this: In a massive biomedical graph of diseases, proteins, and drugs, a node with high closeness centrality isn’t necessarily the most connected — it’s just the most efficiently placed. It can spread information faster because it lies at the “center” of the network in terms of path lengths.

This makes closeness a powerful way to find entities that are highly reachable. These could be genes or diseases that sit at the heart of biological communication — not because they are hubs, but because they’re a few steps away from most others.

In healthcare, such nodes are valuable. They can act as regulators or bottlenecks, making them excellent candidates for targeted treatments or early interventions — since their position allows them to influence the system quickly and broadly.
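A minimal sketch of closeness centrality, using breadth-first search on a hypothetical five-node path, shows how position rather than degree drives the score. The function below uses the common definition (number of reachable nodes divided by the sum of distances to them); whether the project normalises in exactly this way is an assumption.

```python
from collections import deque

def closeness(adj, node):
    """Closeness of `node`: (reachable nodes) / (sum of BFS distances to them)."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0

# Path graph a - b - c - d - e.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
scores = {n: closeness(path, n) for n in path}
print(max(scores, key=scores.get))
```

Nodes b, c, and d all have the same degree, but c sits in the middle of the path and therefore has the shortest average distance to everyone else.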

Left Image: A network-level visualisation of closeness centrality on a sampled subgraph (1,000 edges). Nodes with larger size and deeper blue colour have higher closeness scores, indicating that they are topologically closer to the rest of the network.
Right Image: A zoomed-in subgraph around AADACL2, one of the nodes with the highest closeness centrality. Its close proximity to a diverse set of entities, including proteins, enzymes, and compounds, reveals its strategic position in the graph.

You might notice in the image above on the right side that some nodes — like AADACL2 or CES1P1 — appear disproportionately large. That’s because nodes are sized based on their normalised closeness score. A higher score indicates that the node, on average, has shorter paths to all other nodes, and is therefore more “central” in terms of reachability.

What is AADACL2 and Why Does It Matter?

AADACL2 (Arylacetamide Deacetylase-Like 2) is an enzyme that belongs to the serine hydrolase family, which plays a role in breaking down lipid-like molecules (such as esters and amides). While it’s not as heavily studied as other enzymes, research suggests it’s involved in:

Lipid metabolism, especially in processing bioactive lipid compounds.
Drug metabolism, meaning it may influence how certain drugs are activated or broken down in the body.

Because it’s connected to many different proteins, enzymes, and compounds, its high closeness centrality suggests that AADACL2 might serve as a central biochemical “router” — helping relay metabolic or pharmacological signals efficiently.

In a biomedical context, this makes AADACL2 a potentially strategic molecule for understanding cross-talk between metabolic pathways, or even for exploring new drug targets related to lipid disorders or metabolism.

Although betweenness centrality and closeness centrality have much in common, they capture distinct roles:

While Schizophrenia and AADACL2 both emerge as central nodes in the biomedical graph, their roles differ based on the type of centrality. Schizophrenia, identified via betweenness centrality, acts as a strategic bridge, connecting otherwise distant biological modules like neurodevelopment and inflammation. Its importance lies in being on the shortest paths between other entities. In contrast, AADACL2, ranked high in closeness centrality, isn’t a bridge but a well-placed node near the topological center of the network. It can “reach” many other nodes quickly, making it ideal for rapid information diffusion or systemic influence. Together, these perspectives highlight how different nodes can matter, not just by how many connections they have, but by where they sit in the graph.

Understanding PageRank Centrality: Influence Beyond Just Connections

PageRank goes a step beyond simply counting the number of connections of a node. Instead, it captures the idea of influence by association. Originally developed by Google’s founders to rank web pages, the algorithm doesn’t just ask “How many other pages link to this one?”; it also considers “How important are the pages that link to this one?”

It’s like academic papers: being cited 10 times means something, but if those 10 citations come from Nobel laureates, your work holds more weight. In the same way, PageRank assigns higher scores to nodes that are connected to other high-ranking nodes, making it ideal for uncovering biomedical influencers — genes, diseases, or drugs that quietly shape major biological systems due to the company they keep.

From a biomedical perspective: Imagine KLK3, also known as Prostate-Specific Antigen (PSA). It’s a protein with a well-established role in prostate cancer diagnostics. Suppose KLK3 is directly connected to several other proteins, but most of them are niche players not heavily involved in major biological pathways.

Now compare it to TP53, often dubbed the “guardian of the genome.” TP53 may not have the highest number of connections overall, but it connects to key proteins in DNA repair, cell cycle regulation, apoptosis, and tumour suppression. These proteins, in turn, are connected to critical pathways across cancer, neuro-degeneration, and inflammation.

Even if TP53 had fewer direct edges, PageRank would assign it a higher score because it’s embedded in a highly influential sub-network. KLK3, while important in a specific diagnostic context, doesn’t exert the same level of systemic influence as TP53 across the biomedical landscape.

This is what makes PageRank so insightful — it helps uncover central regulators like TP53 that aren’t just popular, but connected to the popular kids — the true power brokers of the graph.

This makes PageRank particularly valuable for identifying hidden hubs in the network — entities that may not be the most connected or centrally located, but play an outsize role due to their indirect influence.
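The “influence by association” idea can be sketched with a short power-iteration PageRank. Everything here is a hedged illustration: the graph is a hypothetical directed toy example, the node labels T and K only loosely echo the TP53 and KLK3 analogy, and the damping factor of 0.85 is the commonly used default rather than a value taken from the project.

```python
def pagerank(out_links, damping=0.85, iterations=100):
    """Power-iteration PageRank on a directed graph given as node -> list of targets."""
    nodes = list(out_links)
    n = len(nodes)
    rank = dict.fromkeys(nodes, 1.0 / n)
    for _ in range(iterations):
        # Rank held by nodes with no outgoing links is spread evenly over all nodes.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new = dict.fromkeys(nodes, (1 - damping) / n + damping * dangling / n)
        for v in nodes:
            for w in out_links[v]:
                new[w] += damping * rank[v] / len(out_links[v])
        rank = new
    return rank

# Hypothetical toy graph: K receives three links from minor nodes,
# T receives a single link, but from the well-linked hub H.
links = {f"a{i}": ["H"] for i in range(5)}
links.update({f"b{i}": ["K"] for i in range(3)})
links.update({"H": ["T"], "T": [], "K": []})
scores = pagerank(links)
print(scores["T"] > scores["K"])
```

T has fewer incoming links than K, but its one link comes from a node that is itself heavily linked, so it inherits more rank.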

PageRank centrality highlighting “Intellectual Disability” and its influential neighbours in the biomedical graph.

The Visualisation

Left image: A sampled subgraph from PrimeKG coloured by PageRank values. Deeper red nodes indicate higher PageRank scores — meaning these nodes are not only well-connected but also strategically positioned in the network.
Right zoomed-in panel: A subgraph centered around “Intellectual Disability”, one of the top-ranking nodes by PageRank. It is connected to several rare syndromes and phenotypes — and its centrality highlights how it acts as a hub for many other conditions.

Why Intellectual Disability Ranks So High in PageRank

In the network visualisation, Intellectual Disability stands out not because it’s connected to the most nodes — but because of who it’s connected to. PageRank rewards nodes that are plugged into influential neighbourhoods, and this condition fits that profile perfectly.

If you zoom into the subgraph, you’ll notice that Intellectual Disability is linked to several rare syndromes, genetic deletions, and neurological disorders, such as:

Williams Syndrome
Tetra-Amelia Syndrome
Lambert Syndrome
Specific language impairment
Wolf–Hirschhorn syndrome

These aren’t random edges — they represent meaningful biomedical relationships across neurodevelopment, cognition, and rare disease modules.

From a PageRank perspective, Intellectual Disability sits at the crossroads of high-impact sub-networks. It’s not just “popular” — it’s important because it’s embedded among other important nodes.

Why This Matters in Biomedical Research

Conditions like Intellectual Disability often serve as phenotypic end-points of multiple upstream disruptions — genetic, metabolic, developmental. High PageRank suggests that it:

Links many rare diseases that may otherwise be studied in isolation.
Sits close to diagnostic phenotypes and genomic loci frequently associated with neurological development.
May serve as a pivot for comorbidity studies, where understanding one connection could help unravel others.

In essence, PageRank shows that Intellectual Disability is not just a clinical diagnosis — it’s a graph-theoretical hub that could unlock relationships between diverse disorders.

A Note on Similarity Across Centrality Measures

While we visualised different types of centrality — degree, betweenness, closeness, and PageRank — many of the subgraphs looked strikingly similar.

Why?
Because in this particular biomedical graph:

Many nodes have simultaneously high scores across multiple centrality metrics.
High-degree nodes often also act as bridges or hubs in terms of betweenness or PageRank.
Biological networks are naturally dense and modular, where central players influence several aspects of the system.

So while each centrality measure offers a unique lens, in practice, the same influential nodes often appear across them all — justifying their biological relevance even further.

4. Understanding Network Properties: Why They Matter in Biomedicine

When we analyse graphs — especially biomedical knowledge graphs — we’re not just interested in the number of nodes and edges. We care about how these nodes are connected, how information flows, and what the structure reveals about the underlying biology.

This is where network topology comes in: By studying properties like connected components, clustering, and path lengths, we gain insights into whether our biomedical graph resembles real-world biological systems or behaves more like random noise.

These properties help us answer questions like:

Connected Components: Are entities isolated or part of a larger biological cluster?
Clustering Coefficient: Do nodes tend to form tightly-knit neighbourhoods?
Average Shortest Path: How quickly can one entity influence another?

Understanding these traits helps in designing better machine learning models and even spotting gaps or inconsistencies in biomedical knowledge.

Largest Connected Component (LCC)

In a graph, not all nodes are necessarily connected to one another. The Largest Connected Component (LCC) refers to the biggest cluster of nodes where each node is reachable from every other node in that cluster. When analysing biological networks, focusing on the LCC helps us concentrate on the most meaningful and structurally relevant part of the graph: the “mainland” where most of the action happens, rather than the scattered “islands.”
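Finding the LCC amounts to repeated breadth-first searches. The sketch below is a minimal illustration on a hypothetical toy graph; the function name is an assumption.

```python
from collections import deque

def largest_connected_component(adj):
    """Return the largest set of mutually reachable nodes in an undirected graph."""
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        # Grow one component outward from an unvisited starting node.
        component, queue = {start}, deque([start])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in component:
                    component.add(w)
                    queue.append(w)
        seen |= component
        if len(component) > len(best):
            best = component
    return best

# A four-node "mainland" and a two-node "island".
toy = {1: [2, 3], 2: [1], 3: [1, 4], 4: [3], 5: [6], 6: [5]}
lcc = largest_connected_component(toy)
print(sorted(lcc))
```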

Average Clustering Coefficient

The clustering coefficient measures how connected a node’s neighbours are to each other. It answers the question: “If A is connected to B and C, how likely is it that B and C are also connected?”

A high average clustering coefficient indicates a “cliquish” structure — common in social or tightly-knit biological communities (like protein complexes).
A low value suggests that although nodes are connected, their neighbours don’t interact much — like spokes on a wheel.
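The two extremes above can be checked with a short sketch. The function computes, for each node, the fraction of neighbour pairs that are themselves linked, and averages over all nodes (nodes with fewer than two neighbours count as zero, a common convention that is assumed here).

```python
from itertools import combinations

def average_clustering(adj):
    """Mean over nodes of (links among a node's neighbours) / (possible such links)."""
    total = 0.0
    for node, neighbours in adj.items():
        k = len(neighbours)
        if k < 2:
            continue  # fewer than two neighbours contributes 0
        links = sum(1 for a, b in combinations(neighbours, 2) if b in adj[a])
        total += links / (k * (k - 1) / 2)
    return total / len(adj)

# A fully "cliquish" triangle versus a star with spokes that never touch.
triangle = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B"}}
star = {"hub": {"x", "y", "z"}, "x": {"hub"}, "y": {"hub"}, "z": {"hub"}}
print(average_clustering(triangle), average_clustering(star))
```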

Average Shortest Path Length

This metric tells us how many steps (edges) it takes, on average, to travel from one node to any other node in the network.

A lower value means the network is more tightly connected — information (or influence) can travel quickly.
In biological graphs, it can indicate how efficiently signals or interactions propagate through molecular pathways.
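A minimal sketch of this metric runs a breadth-first search from every node and averages the resulting distances. It assumes a connected, unweighted graph (for instance, the LCC), and the toy path is hypothetical.

```python
from collections import deque

def average_shortest_path_length(adj):
    """Mean BFS distance over all ordered node pairs of a connected graph."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

# Path graph a - b - c - d.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
avg = average_shortest_path_length(path)
print(round(avg, 4))
```

Running one search per node is fine for small graphs; on a graph with tens of thousands of nodes this becomes expensive, which is one reason to restrict the computation to a subgraph or a sample.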

Small-World Networks & the Watts–Strogatz Model

In many complex systems — ranging from social circles to biological pathways — the underlying structure of connections doesn’t follow a completely random pattern, nor is it perfectly regular. Instead, these systems often exhibit properties of a small-world network.

A small-world network is characterised by two main features:

High clustering: Nodes tend to form closely-knit groups. In biology, this could resemble protein complexes where a group of proteins interact tightly within a specific cellular function.
Short average path lengths: Despite the clustering, any two nodes can typically be reached via only a few steps — just like the “six degrees of separation” often cited in social networks.

To model and investigate this behaviour, the Watts–Strogatz model is widely used. This model begins with a structured graph — such as a ring lattice, where each node is connected to its immediate neighbours. Then, a small fraction of the edges are randomly rewired, creating shortcuts across the network. These shortcuts preserve clustering while dramatically reducing the average path length, simulating the balance between local cohesion and global reach that defines real-world systems.
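The rewiring idea can be sketched in a few lines. This is a simplified, dependency-free take on the Watts–Strogatz procedure, not the simulation used later in this article; the sizes (`n=200`, `k=6`), the seed, and the helper names are illustrative assumptions.

```python
import random
from collections import deque

def watts_strogatz(n, k, p, seed=0):
    """Ring lattice (k // 2 neighbours per side); each edge is rewired with probability p."""
    rng = random.Random(seed)
    half = k // 2
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for step in range(1, half + 1):
            j = (i + step) % n
            adj[i].add(j)
            adj[j].add(i)
    for step in range(1, half + 1):
        for i in range(n):
            if rng.random() >= p:
                continue
            j = (i + step) % n
            # New endpoints must avoid self-loops and duplicate edges
            candidates = [c for c in range(n) if c != i and c not in adj[i]]
            if j not in adj[i] or not candidates:
                continue
            new_j = rng.choice(candidates)
            adj[i].remove(j)
            adj[j].remove(i)
            adj[i].add(new_j)
            adj[new_j].add(i)
    return adj

def average_clustering(adj):
    """Mean local clustering coefficient over all nodes."""
    total = 0.0
    for nbrs in adj.values():
        nbrs = list(nbrs)
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for a in range(k) for b in range(a + 1, k)
                    if nbrs[b] in adj[nbrs[a]])
        total += 2 * links / (k * (k - 1))
    return total / len(adj)

def mean_path_length(adj):
    """Mean BFS distance over all reachable ordered pairs."""
    total = pairs = 0
    for src in adj:
        dist, queue = {src: 0}, deque([src])
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in dist:
                    dist[nb] = dist[node] + 1
                    queue.append(nb)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

lattice = watts_strogatz(n=200, k=6, p=0.0)
rewired = watts_strogatz(n=200, k=6, p=0.1, seed=42)
for name, graph in [("lattice", lattice), ("rewired", rewired)]:
    print(name, round(average_clustering(graph), 3),
          round(mean_path_length(graph), 2))
```

Comparing the two printed rows shows the effect described above: the rewired graph keeps a good share of the lattice's clustering while its average path length drops sharply.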

Real-World Analogy

Consider a group of researchers in a large scientific community. Most are tightly connected within their own labs or institutions (high clustering). But occasionally, someone collaborates with a peer in a distant university (a shortcut). Even though the system has tight local groups, the occasional long-range collaboration ensures that any two scientists are connected through just a few steps — a classic small-world property.

Relevance in Biomedical Networks

In this analysis, the Watts–Strogatz model was used as a baseline to assess whether real biomedical subgraphs — such as disease-protein or protein-protein networks — exhibit small-world characteristics. By comparing metrics such as:

Clustering Coefficient
Average Path Length

between the real network and its Watts–Strogatz equivalent, one can infer structural efficiency. If the real network shows higher clustering and comparable or shorter path lengths, it suggests that the system is modular, resilient, and optimized for biological communication — hallmarks of small-world organization.

Case 1: Disease–Protein Network

To investigate the small-world properties within the biomedical graph, the disease–protein subgraph was extracted and analysed. Focusing on its largest connected component (LCC), the following statistics were observed:

Nodes: 14673
Edges: 80411
Average Shortest Path Length: 4.4630
Clustering Coefficient: 0.0000

At first glance, the average path length of 4.46 suggests that nodes are relatively well-connected — it doesn’t take many steps to travel between a disease and a protein. However, the clustering coefficient of zero paints a different picture.

This means there’s no evidence of local clustering — the kind of tight-knit groupings seen in small-world systems. In other words, while a protein may be linked to many diseases, those diseases aren’t interconnected with each other. They don’t form meaningful “modules” or communities, as one would expect in a small-world network.

This result reflects the bipartite nature of the disease–protein portion of the biomedical graph: every edge joins a disease to a protein, so no triangles can form and the clustering coefficient is zero by construction. The subgraph therefore lacks the dense local pockets of interaction typically seen in biological subsystems and does not exhibit small-world behaviour, even though it has short path lengths.
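A small sketch makes the zero concrete. When every edge joins a disease to a protein, no three nodes can be mutually connected, so there are no triangles to count. The node names below are placeholders, not PrimeKG entries.

```python
from itertools import combinations

def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def count_triangles(adj):
    """Count unordered node triples that are mutually connected."""
    return sum(
        1
        for a in adj
        for b, c in combinations(sorted(adj[a]), 2)
        if a < b and c in adj[b]  # count each triangle once, from its smallest node
    )

# Hypothetical disease-protein edges only: a bipartite graph
bipartite_edges = [("d1", "p1"), ("d1", "p2"), ("d2", "p2"), ("d2", "p3")]
print(count_triangles(build_adjacency(bipartite_edges)))  # 0

# One disease-disease edge is enough to create a triangle (d1, d2, p2)
with_shortcut = bipartite_edges + [("d1", "d2")]
print(count_triangles(build_adjacency(with_shortcut)))  # 1
```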

Case 2: Protein–Protein Network

Protein–protein interaction (PPI) networks are often celebrated in systems biology for exhibiting small-world properties — a structure that balances local clustering with global efficiency. This means that proteins often form tight interaction modules (e.g., signalling complexes), while still maintaining short paths to other proteins across the network — much like how social networks operate.

To test this behaviour within the dataset, the largest connected component (LCC) of the protein_protein subgraph was analysed:

Nodes: 18,354
Edges: 321,075
Clustering Coefficient: 0.1135
Average Shortest Path Length: 2.98

At first glance, the average path length is impressively low, suggesting efficient connectivity. However, the clustering coefficient — a key hallmark of small-world networks — was surprisingly modest.

To benchmark this, a Watts–Strogatz small-world simulation was performed using the same number of nodes and average degree (n = 18,345, k = 35, p = 0.05). The simulation yielded:

Clustering Coefficient: 0.6241
Average Path Length: 4.29

Despite its short paths, the real PPI graph lacks the high clustering expected of small-world networks. This suggests that while the network is efficient in terms of communication, it does not exhibit the modular structure characteristic of classical small-world topologies.
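As a rough sanity check on the simulated baseline, a widely quoted approximation for the Watts–Strogatz model (usually attributed to Barrat and Weigt) estimates the clustering coefficient as C(p) ≈ C(0)(1 − p)³, where C(0) = 3(k − 2) / (4(k − 1)) is the clustering of the unrewired ring lattice. The sketch below plugs in the k and p reported above; the function name is my own, and the formula is an approximation for large networks.

```python
def expected_ws_clustering(k, p):
    """Approximate clustering of a Watts-Strogatz graph with degree k and rewiring probability p."""
    lattice_clustering = 3 * (k - 2) / (4 * (k - 1))
    return lattice_clustering * (1 - p) ** 3

estimate = expected_ws_clustering(k=35, p=0.05)
print(round(estimate, 4))  # 0.6241
```

The estimate agrees with the simulated value of 0.6241, which suggests the baseline behaves as the model predicts and that the gap to the real network's 0.1135 is not a simulation quirk.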

In simpler terms:

Proteins in this dataset are well-connected globally, but they don’t form tight local communities as densely as expected — making the small-world signature incomplete.

Disclaimer on Subgraph Scope and Small-World Properties

The current analysis was performed on a subgraph of the full PrimeKG dataset — specifically, the largest connected component of the protein_protein network. While this subgraph is sizable, it still represents only a portion of the total protein–protein interactions in biomedical reality.

Had the entire protein–protein network been considered, capturing more protein families, paralogs, and overlapping pathways, the clustering coefficient might have been significantly higher. This is because proteins, especially within the same biological processes or cellular compartments, are more likely to interact and form densely connected communities, a key hallmark of small-world networks.

Thus, the absence of clear small-world properties in this analysis might not reflect a true structural limitation of the data, but rather a sampling artifact due to subgraph boundaries and visualization constraints.

5. Community Detection: Finding Meaningful Modules in PrimeKG

In large-scale biomedical knowledge graphs, the web of connections can feel overwhelming. Yet, hidden within this complexity are communities — clusters of nodes that are more densely connected to each other than to the rest of the network. These tightly knit groups often correspond to biologically meaningful modules: diseases that share similar symptoms, proteins involved in the same pathway, or drugs targeting related mechanisms.

To uncover these underlying structures, the Louvain algorithm was applied — a widely used method for unsupervised community detection. It works by optimizing a metric called modularity, which evaluates how well a graph can be split into distinct communities. A higher modularity score indicates clearer boundaries between groups, helping reveal meaningful biological or therapeutic patterns embedded in the graph’s architecture.
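To make modularity less abstract, here is a minimal sketch of the score itself for an unweighted, undirected graph: for each community, the fraction of edges that fall inside it minus the fraction expected from its total degree. This only evaluates a given partition; it does not implement Louvain's optimisation loop, and the node labels are placeholders.

```python
def modularity(edges, community_of):
    """Q = sum over communities of (internal_edges / m) - (degree_sum / 2m)^2."""
    m = len(edges)
    internal, degree_sum = {}, {}
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree_sum.items())

# Two triangles joined by a single bridge edge (c, d)
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]

two_modules = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_module = {node: 0 for node in "abcdef"}

q_split = modularity(edges, two_modules)
q_single = modularity(edges, one_module)
print(round(q_split, 4), round(q_single, 4))  # 0.3571 0.0
```

Splitting the graph along the bridge scores higher than lumping everything together, which is the kind of improvement Louvain searches for as it merges and moves nodes.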

Interpretation and Significance

The image above shows a subgraph of a biomedical knowledge graph, with nodes colored by communities identified using the Louvain algorithm. Each color represents a distinct cluster of nodes that are more densely connected internally than to the rest of the network — capturing potential biological or pharmacological modules.

This particular subgraph highlights one of the most densely interconnected communities. At its core are phenotypes such as epistaxis, vertigo, and respiratory distress, surrounded by a range of drugs including Milnacipran, Tacrolimus, Dicyclomine, and Azithromycin. These relationships suggest shared therapeutic indications, adverse effect profiles, or treatment contexts.

Peripheral phenotypes such as alopecia, tongue pain, and urinary hesitancy further support the idea that this cluster reflects pharmacovigilance signals — real-world patterns of drug-related outcomes observed in clinical use.

Why This Subgraph Matters

Pharmacological Modules: The dense drug-phenotype interplay may indicate drugs that share mechanisms of action or are commonly co-prescribed.
Polypharmacy Risk Exploration: The structural proximity of these drugs allows for the analysis of potential cumulative side effects or drug-drug interactions.
Drug Repurposing & Mechanistic Insight: Communities like this one surface latent relationships between drugs and phenotypes, offering a basis for repositioning opportunities and mechanistic hypotheses.

This subgraph tells a biological story — capturing the interwoven nature of treatments, symptoms, and molecular processes. Community detection transforms vast biomedical graphs into interpretable, actionable clusters, revealing functional groupings that may be missed in traditional tabular analysis.

By allowing the graph structure to guide discovery — rather than relying solely on predefined categories — we unlock new pathways for understanding disease mechanisms, optimising therapeutic strategies, and navigating the complexity of biomedical knowledge.

6. Structural & Semantic Subgraphs

After analysing global patterns like centrality and community structure, the next step is to zoom in — examining how individual biomedical concepts are embedded within the graph. This section explores both the structural layout and semantic context of nodes using k-hop subgraphs, revealing local patterns that offer biological insight.

Understanding Semantic Neighbourhoods: The Power of K-hop Subgraphs

In vast biomedical graphs, it’s easy to lose sight of local meaning. Concepts like “breast cancer” don’t exist in isolation — they’re surrounded by proteins, pathways, drugs, and phenotypes that help shape their biological relevance. To explore these relationships, k-hop subgraphs are a powerful tool.

What Is a K-hop Subgraph?

A k-hop subgraph extracts the immediate network around a node — everything reachable in k steps:

1-hop shows direct connections (e.g., drugs used to treat a disease).
2-hop brings in neighbours of those neighbours — capturing more subtle but still localised context, such as downstream effects or secondary pathways.

This approach balances focus and scope — providing enough detail for insight without being overwhelmed by the full graph’s scale.
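A k-hop neighbourhood is essentially a breadth-first search that stops after k steps. The sketch below shows the idea on a hand-written toy graph; the edges are illustrative and were not pulled from PrimeKG, and the function name is my own.

```python
from collections import deque

def k_hop_nodes(adj, center, k):
    """Return {node: hop distance} for every node within k hops of center."""
    dist = {center: 0}
    queue = deque([center])
    while queue:
        node = queue.popleft()
        if dist[node] == k:
            continue  # do not expand beyond the k-hop boundary
        for nb in adj.get(node, ()):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist

# Illustrative toy neighbourhood (undirected)
adj = {
    "breast cancer": {"tamoxifen", "BRCA1"},
    "tamoxifen": {"breast cancer", "ESR1"},
    "BRCA1": {"breast cancer", "DNA repair"},
    "ESR1": {"tamoxifen"},
    "DNA repair": {"BRCA1", "ATM"},
    "ATM": {"DNA repair"},
}

one_hop = k_hop_nodes(adj, "breast cancer", 1)
two_hop = k_hop_nodes(adj, "breast cancer", 2)
print(sorted(one_hop))  # ['BRCA1', 'breast cancer', 'tamoxifen']
print(sorted(two_hop))  # ['BRCA1', 'DNA repair', 'ESR1', 'breast cancer', 'tamoxifen']
```

ATM sits three hops away, so it stays outside the 2-hop view. In NetworkX, `nx.ego_graph(G, node, radius=k)` returns the same neighbourhood as a subgraph.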

Visualising Breast Cancer’s Semantic Neighbourhood

The figure illustrates a 2-hop semantic neighborhood around a central biomedical entity, visualized as a subgraph extracted from a larger knowledge graph. Nodes represent various biomedical entities — genes, drugs, diseases, and biological processes — colored by category. Edges capture direct or indirect associations within the 2-hop boundary. While some nodes are densely connected, others appear more isolated, reflecting the diverse nature of contextual relationships.

Interpretation and Significance

The visualisation reveals a 2-hop neighbourhood around a selected biomedical node, capturing its extended context within the knowledge graph. Nodes represent a mix of genes/proteins (light blue), diseases/phenotypes (red), biological processes (gray), and drugs (green or purple), and are connected via known relationships like regulatory effects, therapeutic use, or shared pathways.

Several small clusters are visible, each centered around a key biological or clinical term. For instance:

The red nodes mark diseases or phenotypes such as prostate cancer, ovarian clear cell adenocarcinoma, or attention deficit-hyperactivity disorder.
Green nodes such as Propafenone and Cinchocaine represent drugs, possibly linked via shared treatment targets or adverse effects.
Light blue nodes dominate the graph, representing genes and proteins that may be involved in upstream signalling or downstream response mechanisms.

Why This Matters

Functional Modules: The visualisation helps identify functionally related groupings, like genes that co-regulate disease expression or drugs associated with shared phenotypic responses.
Latent Associations: Some nodes are only loosely connected or isolated — a reflection of weak yet relevant semantic relationships not fully captured in the core graph structure.
Biological Insight from Topology: For example, a gene like DNAJA3 connected to NF-κB signalling and influenza could suggest a pathway-level role that’s worth exploring further in disease contexts or drug targeting.

On Visually Isolated Nodes

Some nodes appear disconnected from others despite being included in the 2-hop neighbourhood. This isn’t an error — it’s a reflection of how visual constraints (e.g., maximum edge count, subgraph boundary limits) affect layout.

These nodes are still part of the semantic neighbourhood and may:

Represent weaker or single-hop connections,
Be part of a peripheral pathway or side-effect profile,
Or act as bridges to other communities in the larger graph.

A Simple Analogy:

Imagine being at a party. Everyone there is somehow connected to the host — but not necessarily to each other. Some guests arrived as a quiet plus-one, standing off in a corner, not mingling much. They’re still part of the guest list — they just don’t have as many visible connections.

These “disconnected” nodes still matter. Their presence tells us the local biomedical neighbourhood is diverse — some entities are deeply intertwined, others act as bridges, and some are passive outliers. All contribute meaningfully to the context.

7. Causal Discovery (Advanced)

So far, the focus has been on structural patterns, centrality, and semantic proximity. But graphs aren’t just about who is connected — they’re also about how they’re connected.

In biomedical knowledge graphs, certain relationships carry a directional implication: a drug targets a protein, a protein modulates a pathway, or a drug is indicated for a disease. These are not causal in the strict statistical sense, but they encode mechanistic or clinical directionality — which makes them semantically causal and incredibly relevant for downstream tasks like treatment discovery or hypothesis generation.

In the final section, edges with causal-like semantics (such as "target", "indication", "enzyme") are extracted and explored to surface potential influence pathways across the biomedical graph. These directional links help illuminate how interventions might propagate and where key levers of action exist within the network.

Bevacizumab-Centric Causal Subgraph

The graph above illustrates a 2-hop causal subgraph centered around Bevacizumab, an anti-angiogenic drug widely used in the treatment of cancers. To focus on meaningful directional relationships, only edges with causal-like semantics were included, such as:

Bevacizumab → Disease (indication)
Bevacizumab → Protein (target)
Protein → Disease (target, indication)
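The filtering step can be sketched as follows, assuming the graph is available as (source, relation, target) triples. The triples, the extra relation names (`ppi`, `side_effect`), and the function name are placeholders for illustration; only outgoing edges are followed, so direction is respected.

```python
# Relation labels treated as causal-like, following the list in this section
CAUSAL_RELATIONS = {"target", "indication", "enzyme"}

def causal_subgraph(triples, center, hops=2, relations=CAUSAL_RELATIONS):
    """Collect directed causal-like edges reachable from center within `hops` steps."""
    out_edges = {}
    for src, rel, dst in triples:
        if rel in relations:
            out_edges.setdefault(src, []).append((rel, dst))
    kept, seen, frontier = [], {center}, {center}
    for _ in range(hops):
        next_frontier = set()
        for node in frontier:
            for rel, dst in out_edges.get(node, []):
                kept.append((node, rel, dst))
                if dst not in seen:
                    seen.add(dst)
                    next_frontier.add(dst)
        frontier = next_frontier
    return kept

# Hypothetical triples for illustration only
triples = [
    ("Bevacizumab", "target", "VEGFA"),
    ("Bevacizumab", "indication", "glioblastoma"),
    ("VEGFA", "indication", "breast cancer"),
    ("VEGFA", "ppi", "KDR"),                       # not causal-like: dropped
    ("Bevacizumab", "side_effect", "hypertension"),  # not causal-like: dropped
]

subgraph = causal_subgraph(triples, "Bevacizumab", hops=2)
for edge in sorted(subgraph):
    print(edge)
```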

Node color and meaning:

Bevacizumab (central drug node)
Diseases it is indicated for (e.g., breast cancer, glioblastoma)
Proteins or genes it targets or that are linked to related diseases (e.g., VEGFA, C1QB, FCGR2A)

Why This Matters:

This graph tells a therapeutic story. Bevacizumab doesn’t operate in isolation — it intersects multiple biological axes:

It targets key immune and angiogenesis-related proteins (like VEGFA).
It is indicated for a wide array of cancers, spanning breast, cervical, ovarian, and glioblastoma.
Some of its downstream targets (e.g., complement proteins C1QA, C1QB) suggest broader immunological involvement, beyond just vascular inhibition.

This view provides a mechanistic snapshot of how a single drug engages with both molecular targets and clinical outcomes, and may surface opportunities for drug repurposing, comorbidity mapping, or understanding off-target effects.

In short, this subgraph is not just about what Bevacizumab treats — it reflects how it acts, where it acts, and why its place in the biomedical network is central.

Wrapping Up: From Graphs to Ground Truths

What began as a tangle of nodes and edges has unraveled into something far more powerful — a structured lens on the complexity of human biology. Through centrality scores, semantic neighbourhoods, causal edges, and community structures, this exploration of PrimeKG revealed how graphs don’t just store knowledge — they shape it.

We saw how proteins form the backbone of biomedical interactions, how diseases like Schizophrenia or Intellectual Disability emerge as influential connectors, and how drugs like Bevacizumab map out therapeutic footprints across molecular and clinical landscapes.

Each visual and metric wasn’t just an abstract representation — it was a story waiting to be discovered. Stories of influence, proximity, causality, and connection. Stories that hint at new hypotheses, repurposing opportunities, or underlying biological mechanisms yet to be fully understood.

As the field of biomedical AI accelerates, it’s becoming clear that networks aren’t just data structures — they’re mirrors of biology’s design. And learning to read them well may hold the key to more personalised, explainable, and effective healthcare.

This blog is just the beginning. In the next part of this series, we’ll move beyond shallow metrics and handcrafted rules — and dive deep into Graph Neural Networks (GNNs), learning embeddings, training models, and generating predictions from these richly connected datasets.

To view the code behind the visualisation, check out this notebook.

Coming up Next:

XplainMD Part 2: Finding the Missing Links with Machine Learning

In the next part of this series, we shift from structural analysis to predictive modelling. Using Node2Vec embeddings, we’ll transform the graph into numerical vectors that capture both local neighbourhoods and global patterns. These embeddings will power machine learning models — from logistic regression to XGBoost — to predict missing links between diseases, proteins, and drugs.

We’ll explore how well these models perform in identifying biologically plausible but unseen connections, assess their accuracy, and compare strengths and limitations — setting the stage for the deep learning methods to follow.

References:

https://www.turing.com/kb/graph-centrality-measures
Watts, D. J., & Strogatz, S. H. (1998). Collective dynamics of ‘small-world’ networks. Nature, 393(6684), 440.
https://chih-ling-hsu.github.io/2020/05/15/watts-strogatz
https://medium.com/data-science/community-detection-algorithms-9bd8951e7dae

Stanford CS224W: ML with Graphs | 2021 | Lecture 2.1 - Traditional Feature-based Methods: Node: https://youtu.be/3IS7UhNMQ3U?si=_FDb2LtxoxI_fYtb

XplainMD: A Graph-Powered Guide to Smarter Healthcare

FHIR Shot Learning — Sat, 29 Mar 2025 17:30:13 GMT

A beautiful synergy between GNNs and LLMs for transparent and trustworthy Clinical Decision Support Systems

Introduction

Imagine the future of healthcare…

A doctor encounters a puzzling symptom combination in a patient and turns to an AI system for insight. The AI returns a prediction: perhaps a rare disease, an overlooked comorbidity, or a promising drug to repurpose.

But then comes the critical question:
“How can the doctor trust this prediction?”

After all, without transparency, even the most confident output might just be a well-worded guess.

And this is exactly where things take an exciting turn.

When the doctor clicks “Why?”, the AI doesn’t just return citations (although those matter too). Instead, it also reveals a subgraph — a glowing, interpretable network of diseases, phenotypes, and drug interactions that explain the prediction.
A missing link emerges — one the doctor hadn’t considered — now made visible through the graph structure.

Doctor viewing patient subgraph generated by ChatGPT-4o

Now imagine another scenario!

The patient's condition is improving. But how can we be sure it's the drug or the treatment?

Is it the drug that’s responsible — or other variables, like genetics, lifestyle, or concurrent treatments?

This is where graphs step in! They don’t just represent data — they reveal relationships, enable reasoning, and allow us to explore causality.

By adding or removing nodes, adjusting edges, or examining an entity’s neighbourhood, clinicians can simulate scenarios, challenge assumptions, and reason through outcomes, not with guesswork but with an explicit pathway (a subgraph).

Graphs don’t just connect information. They connect understanding.

In traditional machine learning, data is often treated as isolated rows in a table: one record per patient, one feature vector at a time. Each prediction is made based solely on the input features for that specific instance, without considering how that instance relates to any other.

But healthcare doesn’t work in isolation. Diseases are linked to phenotypes. Drugs interact with proteins. Genes influence both — along with millions of interconnected biological factors constantly affecting one another. If we want AI to understand these relationships, then how we represent the data matters just as much as the data itself. And to reason over such complex, interconnected information, we need a model that’s designed for it —
Not just any neural network, but one that can learn from structure.

This is where graphs fundamentally change the game.

Graphs let us connect the dots — not just between entities, but between layers of biological meaning. A graph doesn’t just store that “Drug A treats Disease B” — it shows that Drug A interacts with a protein, which participates in a pathway, which relates to a phenotype seen in Disease B.

With that structure in place, one can:

Explore causality, not just correlation
Reason across multiple biological scales
Understand how and why predictions emerge

So when I say “graphs connect understanding,” I mean they offer a richer, more contextual view of complex systems, something traditional models simply can’t achieve when they treat each input as flat, disconnected data.

To learn from such interconnected data, we need neural networks that are structure-aware — models that don’t just look at individual features, but understand the relationships between them.

That’s exactly what Graph Neural Networks (GNNs) are built for!

So what could this mean for the future of medicine?

What if a digital twin of a patient wasn’t just a file of lab reports and sensor data, but an interactive knowledge graph?
A graph where each node represents a part of their biology or history, and modifying that graph could uncover risks, suggest therapies, or even prevent misdiagnoses. What if every missing node or edge was the difference between a delayed diagnosis and an early breakthrough?

Digital Twin Illustration generated by ChatGPT-4o

What if we could actually build something like this?

Not just imagine it — but design a system that can predict biomedical relationships and explain them visually and contextually.
A system that combines structured knowledge with reasoning.
One that doctors — and researchers — could actually trust.

That’s the vision behind XplainMD — a predictive and explainable medical assistant that brings together the structure of graph neural networks (GNNs) and the language fluency of large language models (LLMs).
It’s built on top of Harvard’s PrimeKG, a richly curated biomedical knowledge graph designed for precision medicine.

Connecting the Dots: From Architecture to Execution

This project didn’t begin with a dataset or with a specific model in mind.
It began with a vision — a rough architectural sketch I drew two years ago, mapping out a system I wasn’t yet ready to build… but knew I had to.

A system that could bring structure to biomedical knowledge, reason over it, and explain itself — end to end.

CDSS Diagram envisioned in 2023

Before building XplainMD, I explored a foundational challenge: how can one extract meaningful, high-quality biomedical knowledge from unstructured literature?

In my previous PubMed Data Extraction series, I built a pipeline that did just that. It filtered high-quality articles, performed metadata analysis, and used large language models (LLMs) to extract entities and relationships — ultimately constructing a biomedical knowledge graph from scratch.

But as the project evolved, so did the realisation that this approach had critical limitations.

First, there was no human-in-the-loop validation for the extracted entities. While tools like BERN2 are powerful, biomedical terminology is inherently ambiguous. Consider insulin: biologically a protein, clinically a drug. Without proper context, the same term can be interpreted in multiple ways, leading to noisy or misleading graph structures.

Second, the relationship extraction process was based on individual sentences. The LLM would classify an association (positive/negative) based on a single snippet of text — but in scientific literature, context is everything. A relationship that appears causal in one sentence might be contradicted or clarified elsewhere in the same abstract. This sentence-level view simply wasn’t enough for accurate biomedical relationship extraction.

And most importantly, there was no expert validation. Even with strong models, constructing biomedical graphs without clinician oversight risks encoding false associations — which, in a domain as sensitive as medicine, can be very dangerous.

That’s when I discovered PrimeKG — a precision medicine knowledge graph developed by Harvard, integrating over 20 high-quality biomedical sources. With more than 17,000 diseases and 4 million+ relationships across 10 biological scales, PrimeKG offered something my earlier pipeline couldn’t: clinical relevance, validated structure, and multi-modal depth.

XplainMD: Brought to Life by PrimeKG

PrimeKG, developed by researchers at Harvard, is a richly curated biomedical knowledge graph that connects diseases, drugs, proteins, phenotypes, and much more. But it’s not just comprehensive — it’s clinically meaningful.

What sets PrimeKG apart is its depth and precision. Whether it’s approved treatments or experimental compounds, PrimeKG doesn’t just capture the surface — it maps the underlying structure of biomedical knowledge, making it one of the most complete and actionable disease-centric graphs available today.

Overview of the pipeline of XplainMD

With PrimeKG as the foundation, XplainMD was developed — a system that doesn’t just predict, but explains its own predictions.

For this project, a Relational Graph Convolutional Network (R-GCN) was used to perform link prediction on PrimeKG, surfacing potential drug-disease, disease-phenotype, and drug-protein relationships.

To interpret these predictions, GNNExplainer was used to extract the subgraph-level evidence that contributed to each prediction. These subgraphs were then compared against the ground truth to assess confidence and alignment with validated knowledge.

Next, these insights were passed into a Large Language Model (LLaMA 3.1 8B Instruct), which generated natural language explanations that accompany the visual subgraphs — giving users clear, contextual interpretations instead of black-box outputs.

Finally, the entire pipeline is wrapped in an intuitive Gradio-based chatbot, where clinicians (or curious users!) can ask biomedical questions, receive predictions, explore subgraphs, and most importantly — understand the why behind every answer.

This is explainable AI for real-world healthcare — and it’s just the beginning of exciting times ahead!

This project series offers a gentle introduction to the world of graph data science, and to how we can build more trustworthy, transparent systems using graph neural networks and explainability tools, with an LLM providing natural language explanations of the predictions.

Coming up Next

XplainMD is a four-part journey that explores the complete pipeline from structured biomedical data to explainable AI-driven insights:

Part 1: A Visual Exploration of PrimeKG
A beginner-friendly introduction to graph theory and biomedical knowledge graphs, with a deep dive into the structure of PrimeKG.

Part 2: Finding the Missing Links with Machine Learning
Using Node2Vec and classical ML techniques to uncover hidden relationships in the graph.

Part 3: Relational GCN + GNNExplainer: Learning & Explaining Links
Training a Relational Graph Convolutional Network (R-GCN) to predict drug-disease and disease-phenotype links — and interpreting those predictions with GNNExplainer.

Part 4: From Graph Reasoning to Natural Language — Integrating GNNs with LLMs and Gradio
Integrating an LLM to generate natural language explanations, wrapped in an intuitive chatbot interface using Gradio.

If you’re passionate about trustworthy AI, clinical decision support, or graph-powered reasoning, this series was built for you. Check out the full project on GitHub.

Because in medicine, relationships matter. But understanding those relationships?
That has the power to save millions of lives!

References:

Chandak, P., Huang, K. & Zitnik, M. Building a knowledge graph to enable precision medicine. Sci Data 10, 67 (2023). https://doi.org/10.1038/s41597-023-01960-3
Ying, Z., Bourgeois, D., You, J., Zitnik, M. and Leskovec, J., 2019. GNNExplainer: Generating explanations for graph neural networks. Advances in neural information processing systems, 32. https://doi.org/10.48550/arXiv.1903.03894
Schlichtkrull, M., Kipf, T.N., Bloem, P., Van Den Berg, R., Titov, I. and Welling, M., 2018. Modeling relational data with graph convolutional networks. In The semantic web: 15th international conference, ESWC 2018, Heraklion, Crete, Greece, June 3–7, 2018, proceedings 15 (pp. 593–607). Springer International Publishing. https://doi.org/10.48550/arXiv.1703.06103

PubMed Data Part 4: Building Knowledge Graphs

FHIR Shot Learning — Tue, 07 Jan 2025 07:17:46 GMT

After scoring and clustering papers in Part 3, the final phase of this project focuses on constructing Knowledge Graphs (KGs) — powerful tools for structuring relationships and making complex data more interpretable. In this stage, advanced techniques such as Regex tokenization, BERN2 for Named Entity Recognition (NER), and Llama 3.1 for relationship extraction are employed to build insightful and meaningful knowledge graphs.

Knowledge Graphs represent a transformative approach for organizing biomedical research. They not only provide interpretability but also address critical limitations of transformer models, such as hallucinations and a lack of transparency. By combining transformer-based entity and relationship extraction with the robust foundation of KGs, this approach ensures reliable, evidence-backed AI recommendations.

The process will be broken down step-by-step, along with the corresponding implementation code, to illustrate how these knowledge graphs are constructed.

Step 1 : Extracting Sentences with Regular Expressions

To prepare the text for entity recognition and relationship extraction, the sentences from research papers are tokenized using a fast and efficient method. Tokenization is critical because processing individual sentences improves the accuracy of entity extraction. Although many tokenizers are available, such as the spaCy sentencizer or transformer-based tools, this project uses a regex-based function to extract the sentences.

Here is the code used for sentence extraction:

import re

def fast_extract_sentences(text):
    # Split after sentence-ending punctuation followed by whitespace;
    # the matched punctuation itself is dropped from each sentence
    sentences = re.split(r'(?:[.!?]\s+)', text)
    # Remove any empty strings or extra whitespace
    return [sentence.strip() for sentence in sentences if sentence.strip()]

# Function to extract sentences from a .txt file
def extract_sentences_from_text_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()
    # Use fast method to split text into sentences
    return fast_extract_sentences(text)
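A quick usage sketch shows what the regex splitter does and where it falls short. The sample sentences are made up; the function is repeated here so the snippet runs on its own.

```python
import re

def fast_extract_sentences(text):
    # Same splitter as above: break after ., ! or ? when followed by whitespace
    sentences = re.split(r'(?:[.!?]\s+)', text)
    return [sentence.strip() for sentence in sentences if sentence.strip()]

print(fast_extract_sentences("BRCA1 is linked to breast cancer. Is it druggable? Yes!"))
# ['BRCA1 is linked to breast cancer', 'Is it druggable', 'Yes!']

print(fast_extract_sentences("Drugs (e.g. aspirin) help."))
# ['Drugs (e.g', 'aspirin) help.']
```

Two things stand out: the punctuation that triggers a split is removed, and abbreviations such as "e.g." cause a false split. That is the trade-off for speed; a trained sentence segmenter would handle such cases at a higher cost.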

Step 2 : Named Entity Recognition using BERN2

This step focuses on extracting biomedical entities, a crucial task in biomedical natural language processing (NLP). Named Entity Recognition (NER) and Named Entity Normalization (NEN) play a pivotal role in automatically identifying and standardizing entities like diseases, drugs, and genes from the ever-expanding biomedical literature. While numerous tools exist for extracting general English entities, options for biomedical-specific entities are limited. Among these, BERN2 (Biomedical Entity Recognition and Normalization) stands out as one of the most advanced tools available.

BERN2 builds on previous neural network-based NER tools by integrating a multi-task NER model with neural network-based NEN models. This combination allows BERN2 to deliver significantly faster and more accurate entity recognition and normalization, making it an indispensable resource for biomedical NLP tasks. Its authors provide a RESTful API for accessing the model (link: http://bern2.korea.ac.kr/documentation). The implementation of this is shown below:

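import hashlib

import requests

# Query BERN2 for named entity recognition
def query_raw(text, url="http://bern2.korea.ac.kr/plain"):
    try:
        return requests.post(url, json={'text': text}, timeout=60).json()
    except (requests.RequestException, ValueError) as err:
        # Network failure, timeout, or a response body that is not valid JSON
        print(f'BERN2 request failed: {err}')
        return None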

# Extract entities from the BERN2 response
def extract_entities(entities):
    if not entities.get('annotations'):
        return {'text': entities['text'], 'text_sha256': hashlib.sha256(entities['text'].encode('utf-8')).hexdigest()}

    e = []
    for entity in entities['annotations']:
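        # Use `i` rather than `id` to avoid shadowing the built-in id()
        other_ids = [i for i in entity['id'] if not i.startswith("BERN")]
        entity_type = entity['obj']
        entity_name = entities['text'][entity['span']['begin']:entity['span']['end']]
        # Prefer a BERN-specific ID; fall back to the mention text if there is none
        entity_id = next((i for i in entity['id'] if i.startswith("BERN")), entity_name)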
        e.append({
            'entity_id': entity_id,
            'other_ids': other_ids,
            'entity_type': entity_type,
            'entity': entity_name
        })

    return {'entities': e, 'text': entities['text'], 'text_sha256': hashlib.sha256(entities['text'].encode('utf-8')).hexdigest()}

The query_raw function sends a POST request to the BERN2 RESTful API with the input text in JSON format and retrieves the API's response.

The extract_entities function processes the raw response from BERN2 to extract and structure named entities along with their metadata.

Explanation of this function:

1. Input Arguments:

entities: The JSON response from the BERN2 API.

2. Handling Missing Annotations:

If the API response does not contain an annotations key, or the annotations list is empty, the function:
Returns the original text.
Computes a SHA-256 hash of the text to uniquely identify it.
This ensures robustness in cases where no entities are identified.

3. Extracting Entity Details:

For each entity in the annotations field:
entity_name: Extracted text span of the entity using begin and end indices.
entity_type: The type of entity (e.g., "gene", "disease", or "species").
entity_id: A unique identifier for the entity. If a BERN-specific ID exists, it is used; otherwise, the entity name is used as the fallback.
other_ids: Any additional IDs (e.g., MeSH or NCBI Gene IDs) associated with the entity.

4. Return Value:

A dictionary containing:

entities: A list of extracted entities with their details.
text: The original text.
text_sha256: A hash of the text for unique identification.
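
To make the walkthrough above concrete, here is a small offline sketch that applies the same logic to a hand-written mock response. The mock contains only the fields that extract_entities reads (text, annotations, id, obj, span); it is not a real BERN2 reply, and the ID "BERN:0000001" is invented purely to show the BERN-ID branch.

```python
import hashlib

text = "pd-l1 is expressed in triple-negative breast cancer"

def span_of(mention):
    # Compute character offsets instead of hard-coding them
    begin = text.index(mention)
    return {"begin": begin, "end": begin + len(mention)}

# Hand-written mock of a BERN2-style response (hypothetical BERN ID)
mock_response = {
    "text": text,
    "annotations": [
        {"id": ["NCBIGene:29126", "BERN:0000001"], "obj": "gene",
         "span": span_of("pd-l1")},
        {"id": ["mesh:D064726"], "obj": "disease",
         "span": span_of("triple-negative breast cancer")},
    ],
}

rows = []
for ann in mock_response["annotations"]:
    # Slice the mention out of the text using the begin/end offsets
    name = mock_response["text"][ann["span"]["begin"]:ann["span"]["end"]]
    bern_ids = [i for i in ann["id"] if i.startswith("BERN")]
    rows.append({
        "entity": name,
        "entity_type": ann["obj"],
        # Fall back to the mention text when no BERN-specific ID is present
        "entity_id": bern_ids[0] if bern_ids else name,
        "other_ids": [i for i in ann["id"] if not i.startswith("BERN")],
    })

text_sha256 = hashlib.sha256(text.encode("utf-8")).hexdigest()
for row in rows:
    print(row)
```

The first entity keeps its (made-up) BERN ID, while the second falls back to the mention text, which mirrors the two branches described above.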

An excerpt of the output is shown below:

 "text": "Abstract 5: immunotherapy targeting programmed cell death-1 (pd-1) and pd-l1 immune checkpoints has reshaped treatment paradigms across several cancers, including breast cancer",
        "text_sha256": "64b0bdbd0e6a7eca8779b6f983f077f494e392c1706fda426efbc60beffad3a0"
    },
    {
        "entities": [
            {
                "entity_id": "pd-1",
                "other_ids": [
                    "NCBIGene:5133"
                ],
                "entity_type": "gene",
                "entity": "pd-1"
            },
            {
                "entity_id": "pd-l1",
                "other_ids": [
                    "NCBIGene:29126"
                ],
                "entity_type": "gene",
                "entity": "pd-l1"
            },
            {
                "entity_id": "triple-negative breast cancer",
                "other_ids": [
                    "mesh:D064726"
                ],
                "entity_type": "disease",
                "entity": "triple-negative breast cancer"
            },
            {
                "entity_id": "patients",
                "other_ids": [
                    "NCBITaxon:9606"
                ],
                "entity_type": "species",
                "entity": "patients"
            }
        ]

Explanation of the Output:

Each entity represents a meaningful biomedical concept extracted from the text. Let’s break down the entities:

1. Entity 1: PD-1

entity_id: "pd-1" (Programmed Cell Death Protein 1).
other_ids: Includes the identifier NCBIGene:5133, referencing the PD-1 gene in the NCBI database.
entity_type: "gene", indicating that PD-1 is a gene.
entity: "pd-1", the exact text from the input referring to this entity.

2. Entity 2: PD-L1

entity_id: "pd-l1" (Programmed Death Ligand 1).
other_ids: Includes NCBIGene:29126, referencing the PD-L1 gene in the NCBI database.
entity_type: "gene", indicating it is also a gene.
entity: "pd-l1", the mention of this immune checkpoint in the text.

3. Entity 3: Triple-Negative Breast Cancer

entity_id: "triple-negative breast cancer".
other_ids: Includes MeSH:D064726, referencing this specific type of breast cancer in the Medical Subject Headings (MeSH) database.
entity_type: "disease", indicating this entity is a disease.
entity: "triple-negative breast cancer", as mentioned in the input text.

4. Entity 4: Patients

entity_id: "patients".
other_ids: Includes NCBITaxon:9606, referencing the human species in the NCBI Taxonomy database.
entity_type: "species", denoting this entity refers to humans.
entity: "patients", as mentioned in the text.

PD-L1 is a protein encoded by the CD274 gene, so labelling the mention as a “gene” is not strictly accurate. NER tools like BERN2 commonly map protein mentions to the genes that encode them, because gene databases supply the normalized identifiers and the two are closely linked in biological texts. The label does not invalidate the output, but it highlights the importance of domain knowledge for interpreting results accurately.

Step 3: Relationship Extraction

Llama 3.1 (8B-Instruct), a large language model tuned to follow instructions, was used for this step because many biomedical transformer models do not support relationship extraction out of the box. For the sake of simplicity, the relationships between biomedical entities were classified into four major categories (the prompt used later also lists Neutral as a fifth option):

1. Positive Association: Indicates a direct or favorable link between two entities.

Example: “Drug A is effective in treating Disease B.”

2. Negative Association: Represents an adverse or unfavorable connection between entities.

Example: “Treatment C exacerbates Condition D.”

3. Positive Correlation: Highlights a statistically significant, positive relationship where one entity increases or improves with another.

Example: “Higher dosages of Drug E correlate with improved outcomes for Disease F.”

4. Negative Correlation: Denotes a statistically significant, negative relationship where one entity decreases or worsens as another increases.

Example: “Long-term use of Drug G is negatively correlated with Patient H’s recovery time.”
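
Because a language model may return these labels with inconsistent casing or spacing, it can help to map whatever comes back onto the four categories listed above. The helper below is a minimal sketch under that assumption; normalize_label and ALLOWED_LABELS are names invented for this example and are not part of the original pipeline.

```python
ALLOWED_LABELS = (
    "Positive association",
    "Negative association",
    "Positive correlation",
    "Negative correlation",
)

def normalize_label(raw_label):
    """Map a free-text label to one of the four categories, or None if it does not match."""
    # Treat underscores as spaces, collapse repeated whitespace, ignore case
    cleaned = " ".join(raw_label.replace("_", " ").split()).lower()
    for label in ALLOWED_LABELS:
        if cleaned == label.lower():
            return label
    return None

print(normalize_label("  positive_Association "))  # Positive association
print(normalize_label("causes"))                   # None
```

Rows whose label comes back as None can then be dropped or reviewed by hand before they reach the graph.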

Step 3: Loading the Model through Hugging Face

The model was loaded using Hugging Face’s Transformers library with 4-bit quantization to reduce memory usage and make inference practical on a single modern GPU. The tokenizer converts input prompts into the token IDs expected by the Llama architecture, while quantization stores the weights at lower precision, trading a small amount of accuracy for a much smaller memory footprint.

import os

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Assumption: the Hugging Face access token is stored in the HF_TOKEN environment variable
huggingface_token = os.environ.get("HF_TOKEN")

model_id = "meta-llama/Llama-3.1-8B-Instruct"
quant_config = BitsAndBytesConfig(
    load_in_4bit=True
)

# Recent transformers releases take `token`; older ones used `use_auth_token`
tokenizer = AutoTokenizer.from_pretrained(model_id, token=huggingface_token)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    token=huggingface_token,
    device_map="auto",
    quantization_config=quant_config
)

Step 4: Extracting Relationships using Llama 3.1

Next, an extract_relationships_with_llama function uses Llama 3.1 to identify and classify relationships between the entities in a given sentence. It relies on a structured prompt, leveraging the LLM's capacity for contextual understanding and inference.

# Function to query Llama 3.1 for relationship extraction
def extract_relationships_with_llama(sentence, entities):
    # Create the prompt for the LLM
    prompt = (
        f"The following text contains entities:\n\n{sentence}\n\n"
        f"Entities:\n{entities}\n\n"
        "Identify relationships between these entities and classify them into one of the following categories: "
        "Positive association, Negative association, Positive correlation, Negative correlation, Neutral.\n\n"
        "Output the relationships in a structured format: Each row should contain [Source Entity, Target Entity, Relationship].\n"
        "Relationships:"
    )

    # Generate output
    with torch.inference_mode():
        # Tokenize and move the inputs to the same device as the model
        inputs = tokenizer(prompt, return_tensors="pt", truncation=True).to(model.device)

        # Record how many tokens are in the prompt
        start_index = inputs["input_ids"].shape[-1]

        # Generate
        outputs = model.generate(**inputs, max_new_tokens=1500)
        new_tokens = outputs[0][start_index:]

        # Decode new tokens only (exclude prompt)
        result_text = tokenizer.decode(new_tokens, skip_special_tokens=True)

    return result_text

Why Use Llama 3.1 for Relationship Extraction?

Contextual Understanding: Llama 3.1 excels at capturing nuanced relationships by understanding the context of biomedical sentences.
Flexible Classification: The model can classify relationships into specific categories, making the output more actionable.
Structured Outputs: A well-designed prompt steers the model toward a machine-readable format, although the output is not guaranteed to follow it and should still be validated when parsed.
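
The pieces so far (sentence splitting, entity recognition, relationship extraction) still need to be tied together per sentence. The sketch below shows one possible driver loop. It is an assumption about how the steps connect, not the author's code: build_raw_output and the two stub functions are invented for this example, and only the "relationships" key is taken from the parsing code used later in this post.

```python
def build_raw_output(sentences, find_entities, find_relationships):
    """Run NER and relationship extraction per sentence and collect the results."""
    raw_output = []
    for sentence in sentences:
        names = [e["entity"] for e in find_entities(sentence)]
        if len(names) < 2:
            continue  # a relationship needs at least two entities
        raw_output.append({
            "sentence": sentence,
            "entities": names,
            "relationships": find_relationships(sentence, names),
        })
    return raw_output

# Stand-ins so the sketch runs offline; the real pipeline would call
# BERN2 and Llama 3.1 here instead.
def fake_ner(sentence):
    return [{"entity": w} for w in ("pd-1", "pd-l1") if w in sentence]

def fake_llm(sentence, names):
    return "| pd-1 | pd-l1 | Positive association |"

sentences = ["pd-1 binds pd-l1 on tumour cells", "the cohort included 40 patients"]
raw_output = build_raw_output(sentences, fake_ner, fake_llm)
print(raw_output)
```

Passing the two steps in as functions keeps the loop testable without a GPU or network access. In the real pipeline, find_entities would wrap query_raw and extract_entities (handling a None response), and find_relationships would be extract_relationships_with_llama.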

Step 5: Structuring the Data to create Knowledge Graphs

To build a Knowledge Graph (KG), it’s essential to represent data in a structured and relational format. The source entity, target entity, and relationship form the core components of a KG, enabling it to model complex interactions between entities effectively.

Core Components of a Knowledge Graph:

1. Source Entity: Represents the starting point of a relationship.

Example: Stanford University in the relationship “Stanford University → Conducted_By → Diabetes.”

2. Target Entity: Represents the endpoint or recipient of a relationship.

Example: Diabetes in the same relationship.

3. Relationship: Defines the type or nature of the interaction between the source and target entities.

Example: Conducted_By, indicating that Stanford University conducted research on diabetes.

import pandas as pd

# Function to extract only the relevant table
def extract_relationships_table(raw_output_list):
    relationships = []

    for entry in raw_output_list:  # Iterate through each entry in the list
        # Ensure entry is a dictionary and contains the "relationships" key
        if isinstance(entry, dict) and "relationships" in entry:
            relationships_text = entry["relationships"]

            # Split the text into rows, keeping only table rows (those starting with "|")
            rows = [row.strip() for row in relationships_text.split("\n") if row.strip().startswith("|")]

            # Parse rows into a structured format
            for row in rows:
                columns = [col.strip() for col in row.split("|")[1:-1]]
                if len(columns) == 3:
                    relationships.append({
                        "Source Entity": columns[0],
                        "Target Entity": columns[1],
                        "Relationship": columns[2]
                    })

    # Convert to DataFrame
    return pd.DataFrame(relationships)

# Extract the relationships table
relationships_df = extract_relationships_table(raw_output)

# Display the cleaned DataFrame
print(relationships_df)
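
When the model formats its answer as a Markdown table, the header row and the dashed separator row also start with "|" and have three columns, so they can end up in the DataFrame as if they were relationships. The sketch below shows one way to filter them out. It is a standalone illustration with a hand-written sample; parse_markdown_rows is a name chosen for this example.

```python
import re

def parse_markdown_rows(table_text):
    """Return (source, target, relationship) tuples from a Markdown-style table."""
    triples = []
    for line in table_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # prose around the table
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) != 3:
            continue
        if all(re.fullmatch(r":?-+:?", cell) for cell in cells):
            continue  # separator row such as |---|---|---|
        if [c.lower() for c in cells] == ["source entity", "target entity", "relationship"]:
            continue  # header row
        triples.append(tuple(cells))
    return triples

# Hand-written sample of what an LLM reply might look like
sample = """Here are the relationships:
| Source Entity | Target Entity | Relationship |
|---|---|---|
| pd-1 | pd-l1 | Positive association |
  | pd-l1 | triple-negative breast cancer | Positive correlation |
"""
print(parse_markdown_rows(sample))
```

The same two checks could be added inside extract_relationships_table before a row is appended.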

Step 6: Knowledge Graph Construction

The code below shows how to construct and visualize a Knowledge Graph (KG) using NetworkX, where entities are represented as nodes and their relationships as directed edges. A subset of the entity-relationship dataset is used to create the graph, with the relationship type displayed as an edge label. This visualization provides an intuitive way to explore connections between entities, such as drugs and diseases, and uncover patterns in biomedical data. NetworkX is suitable for small-scale visualization; for larger applications, Neo4j can be used to build and query the Knowledge Graph. Neo4j is a graph database optimized for handling complex relationships, and it lets users store, query, and traverse large graphs efficiently using Cypher queries. This makes it a good fit for dynamic, real-world Knowledge Graph applications like healthcare recommender systems or drug discovery pipelines.

import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd

# Load the extracted relationships (replace 'entity_relationships.csv' with your actual file path)
df = pd.read_csv("entity_relationships.csv")

# Select the first 10 rows of the DataFrame for the knowledge graph
plot_data = df.head(10)

# Create a directed graph
G = nx.DiGraph()

# Add edges with relationships from the DataFrame
for index, row in plot_data.iterrows():
    source = row['Source Entity']
    target = row['Target Entity']
    relationship = row['Relationship']
    G.add_edge(source, target, label=relationship)

# Draw the graph
plt.figure(figsize=(12, 8))
pos = nx.spring_layout(G)

# Draw nodes and edges
nx.draw(G, pos, with_labels=True, node_color='lightblue', node_size=3000, font_size=10, font_weight='bold', arrowsize=20)

# Draw edge labels
edge_labels = nx.get_edge_attributes(G, 'label')
nx.draw_networkx_edge_labels(G, pos, edge_labels=edge_labels, font_color='red', font_size=8)

plt.title("Knowledge Graph for First 10 Relationships")
plt.show()

Conclusion and Key Takeaways

This final part of the blog series demonstrated the process of building and visualizing a Knowledge Graph (KG) to represent and explore relationships between biomedical entities. Starting from tokenized text and identified entities, relationships were extracted using Llama 3.1, categorized into meaningful classifications, and structured into a graph format. The graph was then visualized using NetworkX to provide a clear and interpretable representation of entity connections, laying the foundation for real-world applications.

Key Takeaways:

1. Power of Knowledge Graphs:

KGs offer an interpretable and explainable way to structure and visualize relationships between entities. This is particularly useful in domains like biomedical research, where understanding the connections between diseases, treatments, and organizations is critical.

2. Leveraging AI for Efficiency:

The combination of tools like BERN2 for Named Entity Recognition and Llama 3.1 for relationship extraction shows how advanced AI models can transform unstructured biomedical text into actionable insights.

3. Scalability and Flexibility:

While NetworkX is effective for small-scale visualization, tools like Neo4j are better suited for building scalable Knowledge Graphs that can handle dynamic data and support complex queries.

4. Real-World Applications:

The Knowledge Graph pipeline demonstrated here can be extended to power healthcare chatbots, recommender systems, and decision-support tools, enhancing explainability and transparency in AI-driven healthcare.

5. Addressing Challenges:

By integrating structured KGs with transformer models, this approach helps mitigate common limitations of language models, such as hallucinations, and provides a more reliable backend for applications requiring evidence-based insights.

Reflections and Future Directions

This blog series has walked through the entire pipeline — from data extraction and scoring to entity recognition, relationship extraction, and Knowledge Graph construction. While these techniques offer valuable insights and practical applications, it’s important to acknowledge their limitations. The scoring methodology, while effective, is constrained by the dataset size and the absence of features like citation counts and study type data. Similarly, open-source models like Llama 3.1, though powerful, have their imperfections, and larger, more sophisticated LLMs could yield even better outputs for entity and relationship extraction.

Despite these challenges, this project showcases how advanced AI systems can be designed to be not only intelligent but also transparent, interpretable, and impactful. By addressing these limitations in future iterations, I hope the approach can inspire more robust and innovative applications in biomedical research and beyond.

PubMed Data Part 3: Mathematical Modelling

FHIR Shot Learning — Mon, 06 Jan 2025 12:34:46 GMT

In Part 2, we explored the dataset through extensive data analysis and visualisation. The focus was on understanding the structure of the data, identifying correlations, and uncovering trends. A mix of univariate, bivariate, and multivariate analyses highlighted the main challenges, which stemmed from limited access to high-impact-factor journals and the implications of open-access publishing models. This groundwork laid a solid foundation for the next phase of the project.

Mathematical Modeling and Scoring Research Papers

In this section, the focus is to develop a mathematical model to assign scores to research papers based on numerical features, such as the 5-Year Impact Factor of journals and the Research Score of contributing universities. To achieve this, clustering algorithms were employed to group the papers into distinct categories, enabling pattern identification and scoring within these clusters.

While a supervised learning approach would be ideal for such a task if the dataset were labeled, the absence of labels in this case necessitated the use of unsupervised learning techniques. Clustering allowed us to structure the dataset meaningfully and derive insights without requiring pre-defined outcomes, making it a powerful alternative for this challenge.

Several clustering algorithms, including K-Means, DBSCAN, and Gaussian Mixture Models, were explored. Among these, the Gaussian Mixture Model (GMM) proved to be the best fit, as it effectively captured the underlying structure of the dataset and produced well-defined clusters, making it ideal for scoring research papers.

Why Gaussian Mixture Model (GMM)?

When faced with complex data, where clusters might overlap or take on irregular shapes, Gaussian Mixture Models (GMM) can be particularly powerful. At a high level, a GMM views the dataset as arising from a combination — or mixture — of different Gaussian (normal) distributions. Here’s how that works in practice, without diving into the underlying equations:

Multiple “Centers of Gravity”
Unlike simple clustering methods that assume all data points in a group are positioned around a single center, GMM envisions multiple regions of density. Think of each region as having its own “center of gravity,” but one that can stretch or skew in various directions. This allows GMM to capture clusters that aren’t strictly spherical.
Probability of Belonging
A core idea in GMM is that every data point receives a probability of belonging to each cluster, rather than being assigned to one cluster in a hard, all-or-nothing way. For instance, a research paper could be 80% likely to belong to Cluster A and 20% likely to belong to Cluster B, reflecting real-world uncertainties in which category it fits best.
Flexible Cluster Shapes
One of the limitations of methods like K-Means is the assumption that clusters are roughly circular (or spherical in higher dimensions). GMM sidesteps this by allowing clusters to take on elliptical or elongated shapes, accommodating the variety found in real data — like research metrics that can vary widely from paper to paper.
Adaptability
Because each Gaussian can have different parameters, GMM is adept at modeling data where variance and covariance differ across clusters. In other words, if one group of data points forms a tight ball and another forms a broader, more spread-out region, GMM can still handle both. This adaptability helps it detect underlying structures that a single fixed shape would overlook.
Intuitive Interpretation
From a conceptual standpoint, once GMM finds these mixtures, each “Gaussian” can be viewed as a different “theme” or “category” in your dataset. You can then look at how each paper (or data point) distributes its probabilities across these themes, giving you a nuanced understanding of how a paper aligns with each cluster’s characteristics.

By leveraging these features, GMM captures the gray areas between categories better than many other clustering methods. This nuance proves especially helpful in fields like research evaluation, where a paper may share similarities with multiple groups and shouldn’t be forced into a single label.
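
The idea of a probability of belonging can be illustrated without any library. In the sketch below, the two one-dimensional Gaussian components and their parameters are made up for illustration (they are not fitted to the project's data), and responsibilities is a name chosen for this example. Each component's share of a point is its weighted density divided by the sum over all components.

```python
import math

def gaussian_pdf(x, mean, std):
    # Density of a normal distribution with the given mean and standard deviation
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def responsibilities(x, components):
    """components: list of (weight, mean, std); returns membership probabilities for x."""
    weighted = [w * gaussian_pdf(x, m, s) for w, m, s in components]
    total = sum(weighted)
    return [value / total for value in weighted]

# Two made-up components: a tight one around 3 and a wide one around 20
components = [(0.6, 3.0, 1.5), (0.4, 20.0, 8.0)]
for x in (3.0, 8.0, 20.0):
    probs = responsibilities(x, components)
    print(x, [round(p, 3) for p in probs])
```

Points near either center are assigned almost entirely to that component, while points in between receive a split. With scikit-learn, the fitted model's predict_proba method returns these membership probabilities.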

Step 1: Normalizing the dataset and performing GMM clustering

For clustering, two key metrics were extracted: the 5-Year Impact Factor and the Research Score.

1. Prepare and Scale the Data
StandardScaler was used to standardize both features (zero mean, unit variance) so that neither dominates the clustering because of its scale.
2. Gaussian Mixture Model (GMM) Clustering
Next, GaussianMixture is instantiated with a chosen number of clusters (in this case, 3). After fitting the model to the scaled data, the cluster assignments, called gmm_labels, are stored in a new column of the DataFrame.
3. Profiling the Clusters
The original data is then grouped by the newly assigned cluster labels, and summary statistics (here, the mean) of Impact Factor and Research Score are computed. This gives an at-a-glance profile of each cluster’s average values.
4. Evaluating Performance
To see how well GMM separated the data, the silhouette score was calculated. It measures how similar each data point is to its own cluster (cohesion) compared with other clusters (separation), and higher scores generally indicate better-defined clusters. The score ranges from -1 to 1, where:

1 indicates well-defined clusters, with data points close to their own cluster and far from others.
0 suggests overlapping clusters, where data points are equally close to multiple clusters.
-1 indicates poorly defined clusters, with data points closer to other clusters than their own.

5. Visualizing the Results
Finally, a scatter plot was drawn with Research Score on the x-axis and Impact Factor on the y-axis. This provides an intuitive, at-a-glance view of how GMM grouped the research papers.

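import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Define numerical features
numerical_features = ['Impact_Factor_5Years', 'Research_Score']  # Replace with your actual column names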

data_subset = df[numerical_features].copy()
# data_subset = data_subset.fillna(data_subset.mean())  # Fill missing values with mean (or other strategy)

# Standardize the data
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data_subset)

# Step 2: Gaussian Mixture Model (GMM) Clustering
gmm = GaussianMixture(n_components=3, random_state=42)
gmm_labels = gmm.fit_predict(data_scaled)
df['GMM_Cluster'] = gmm_labels

# Step 3: Profile the Clusters
def profile_clusters(data, cluster_col, features):
    cluster_profiles = data.groupby(cluster_col)[features].mean()
    print(f"Cluster Profiles for {cluster_col}:\n")
    print(cluster_profiles)
    return cluster_profiles

# Profile GMM Clusters
print("Profiling GMM Clusters")
gmm_profiles = profile_clusters(df, 'GMM_Cluster', numerical_features)

# Evaluate GMM Clustering
silhouette_gmm = silhouette_score(data_scaled, gmm_labels)
print(f"Silhouette Score for GMM Clustering: {silhouette_gmm}")

# Visualize GMM Clusters with actual features
plt.figure(figsize=(8, 6))
sns.scatterplot(
    x=df['Research_Score'], 
    y=df['Impact_Factor_5Years'], 
    hue=gmm_labels, 
    palette='coolwarm', 
    alpha=0.7
)
plt.title('Gaussian Mixture Model (GMM) Clustering')
plt.xlabel('Research Score')
plt.ylabel('Impact Factor (5 Years)')
plt.legend(title='Cluster')
plt.grid(True)
plt.show()

The figure below showcases the clustering results of the Gaussian Mixture Model (GMM) based on two features: the 5-Year Impact Factor and the Research Score.

Three clusters were identified:

Cluster 0 (blue): Represents journals with low impact factors (mean ~3.36) and universities with lower research scores (mean ~28.81).
Cluster 1 (gray): Includes journals with slightly higher impact factors (mean ~5.52) and universities with high research scores (mean ~63.38).
Cluster 2 (red): Comprises journals with significantly higher impact factors (mean ~21.83) and universities with similarly high research scores (mean ~63.15).

The silhouette score of 0.387 suggests moderate separation between clusters. While the clustering captures some meaningful groupings, there is room for improvement, as seen in the overlap between Cluster 0 and Cluster 1. The scatterplot visually demonstrates the spread of clusters, highlighting how Cluster 2 stands out due to its higher impact factors.
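
To make the silhouette score less abstract, the sketch below computes it by hand for four made-up one-dimensional points. For each point, a is the mean distance to the other members of its own cluster, b is the mean distance to the nearest other cluster, and the silhouette value is (b - a) / max(a, b). This is a toy illustration, not the scikit-learn implementation used above.

```python
def silhouette_values(points, labels):
    """Per-point silhouette values for 1-D points, using absolute difference as distance."""
    values = []
    for i, (p, label) in enumerate(zip(points, labels)):
        same = [abs(p - q) for j, (q, lab) in enumerate(zip(points, labels))
                if lab == label and j != i]
        if not same:
            values.append(0.0)  # convention for single-point clusters
            continue
        a = sum(same) / len(same)
        # Mean distance to each other cluster; keep the nearest one
        b = min(
            sum(abs(p - q) for q, lab in zip(points, labels) if lab == other)
            / sum(1 for lab in labels if lab == other)
            for other in set(labels) if other != label
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom else 0.0)
    return values

points = [1.0, 2.0, 8.0, 9.0]
good = silhouette_values(points, [0, 0, 1, 1])  # clusters match the two groups
bad = silhouette_values(points, [0, 1, 0, 1])   # clusters cut across the groups
print(sum(good) / len(good))
print(sum(bad) / len(bad))
```

The sensible grouping yields a mean close to 1, while the scrambled grouping yields a negative mean, matching the interpretation of the score's range given earlier.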

Step 2: Deriving Feature Weights

Following clustering using the Gaussian Mixture Model (GMM), the contribution of each feature (Impact Factor and Research Score) to the formation of distinct clusters was analyzed. By comparing how much each feature’s mean varies across clusters with the feature’s total variance, it was observed that the Impact Factor played a more significant role (≈0.78) in separating the clusters than the Research Score (≈0.22). These values represent the relative importance, or weights, of the features in the clustering process.

Creating the Scoring Equation

Using the derived feature weights, a linear combination of the two features was constructed to develop a scoring model. This approach directly leverages the insights obtained from the GMM clustering, emphasising the more influential feature (Impact Factor) while still accounting for the contribution of the Research Score. The resulting model provides a structured and data-driven way to score research papers based on their most impactful attributes.

The code below calculates the relative importance (weights) of numerical features in forming clusters based on variance analysis:

Total Variance: The overall variance for each feature is computed.
Between-Cluster Variance: Variance of feature means across clusters is calculated, representing the feature’s role in differentiating clusters.
Within-Cluster Variance: Derived as the difference between total variance and between-cluster variance.
Importance Ratios: The ratio of between-cluster variance to total variance is computed for each feature, indicating its contribution to cluster formation.
Normalization: Ratios are normalized to sum to 1, resulting in feature weights.

These weights are then used to create a scoring equation that places greater emphasis on the feature that plays the larger role in separating the clusters (here, the Impact Factor).
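
Written out, the steps above amount to the following. The notation is introduced here for illustration: x_j is feature j, \bar{x}_{k,j} is the mean of feature j in cluster k, and K is the number of clusters (3 in this post). In the code that follows, the between-cluster variance is the plain sample variance of the K cluster means, not weighted by cluster size.

```latex
r_j = \frac{\operatorname{Var}\left(\bar{x}_{1,j}, \dots, \bar{x}_{K,j}\right)}{\operatorname{Var}(x_j)},
\qquad
w_j = \frac{r_j}{\sum_{m} r_m},
\qquad
\text{Score} = \sum_{j} w_j \, x_j
```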

# Calculate total variance for each feature
total_variance = df[numerical_features].var()

# Calculate between-cluster variance: the (unweighted) variance of the
# per-cluster feature means stored in gmm_profiles
between_cluster_variance = gmm_profiles.var()

# Calculate within-cluster variance (not used in the weights below)
within_cluster_variance = total_variance - between_cluster_variance

# Calculate importance ratios
importance_ratios = between_cluster_variance / total_variance

# Normalize importance ratios to sum to 1
weights = importance_ratios / importance_ratios.sum()

print("Feature Weights Based on Cluster-Centric Variance Ratios:")
print(weights)

# Develop the scoring equation
impact_factor_weight = weights['Impact_Factor_5Years']
research_score_weight = weights['Research_Score']
#rank_median_weight = weights['Rank_Median']

print("\nScoring Equation:")
print(f"Score = {impact_factor_weight:.4f} * Impact_Factor_5Years + "
      f"{research_score_weight:.4f} * Research_Score")

Conclusion

The derived feature weights were used to generate scores for each article, enabling the categorization of articles into three tiers: High, Medium, and Low. This scoring approach provides a structured way to rank articles based on their impact factor and research score. Additionally, the top 10 articles, based on their scores, were identified, and their abstracts were saved in a .txt file for further analysis.
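
The scoring and tiering step can be sketched as follows. The weights are the approximate values reported above. The post does not say how the High, Medium, and Low tiers were cut, so the equal split by rank is an assumption made for this example, and the three articles are invented.

```python
# Approximate weights reported above; the three-way split by rank is an assumption
WEIGHTS = {"Impact_Factor_5Years": 0.78, "Research_Score": 0.22}

def score_article(article):
    # Weighted sum of the raw feature values, as in the printed scoring equation
    return sum(weight * article[feature] for feature, weight in WEIGHTS.items())

def assign_tiers(articles):
    """Rank by score and label the top, middle and bottom thirds High, Medium, Low."""
    ranked = sorted(articles, key=score_article, reverse=True)
    n = len(ranked)
    tiers = []
    for position, article in enumerate(ranked):
        if position < n / 3:
            tier = "High"
        elif position < 2 * n / 3:
            tier = "Medium"
        else:
            tier = "Low"
        tiers.append({**article, "Score": round(score_article(article), 2), "Tier": tier})
    return tiers

# Made-up articles, for illustration only
articles = [
    {"title": "B", "Impact_Factor_5Years": 5.5, "Research_Score": 63.4},
    {"title": "A", "Impact_Factor_5Years": 21.8, "Research_Score": 63.2},
    {"title": "C", "Impact_Factor_5Years": 3.4, "Research_Score": 28.8},
]
tiered = assign_tiers(articles)
for row in tiered:
    print(row["title"], row["Score"], row["Tier"])

top_10 = tiered[:10]  # the list is already sorted by score, highest first
```

With a pandas DataFrame, the same idea can be expressed as a weighted column sum followed by pd.qcut on the score column.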

Scoring Equation: Future Improvements

While the current scoring equation effectively utilizes Impact Factor and Research Score to rank articles, it has room for improvement with the inclusion of additional features and a larger dataset.

1. Incorporating Study Type Data: If data on study types were available, such as randomized controlled trials or meta-analyses, it could serve as an additional feature to refine the scoring. Certain study types often hold more credibility and relevance in academic and clinical research, making them valuable for ranking.

2. Adding Citation Count: Citation count is a strong indicator of an article’s influence and relevance within the research community. Including this metric could provide a more comprehensive assessment of an article’s impact.

3. Increasing the Sample Size: A larger dataset would enable more robust clustering and reduce the noise caused by outliers, improving the accuracy of the derived weights and the scoring model. With a broader representation of articles, the model could better capture patterns and trends across a more diverse range of journals and research outputs.

By incorporating these additional features and expanding the dataset, the scoring equation could become more precise and reflective of the true impact and quality of research articles.

Next Steps

In the upcoming blog, the focus will shift to utilizing Large Language Models (LLMs) for advanced tasks such as extracting entities from the article abstracts and identifying accurate relationships between them. This step will further enhance the ability to build meaningful knowledge graphs and derive actionable insights from the dataset.

PubMed Data Part 2 : Data Visualisation

FHIR Shot Learning — Mon, 06 Jan 2025 11:23:59 GMT

“No data is clean, but most is useful.” ~ Dean Abbott, Co-founder and Chief Data Scientist at SmarterHQ

A thorough understanding of the data is essential before developing any model. Examining it from multiple angles yields greater clarity, revealing patterns and trends that might otherwise remain hidden. To achieve this, one looks at each variable individually, then at pairs of variables, and finally at the dataset as a whole. This section does so through data analysis and visualisation. By examining different graphs, we’ll try to unlock the secrets of our data and understand its inherent limitations.

Step 1: Data Cleaning

Before visualizing, rows with missing information in key fields are removed: NaN values in numeric columns and placeholder strings such as "unknown" in text columns. These fields play a vital role in the next phase of the project, which involves scoring the articles.

columns = ["Standardized_University", "University", "Rank", "Research_Score"]

# Drop rows with "unknown" or "nan" in object columns and NaN in numeric columns
for col in columns:
    if df[col].dtype == 'object':  # If the column is a string/object type
        df[col] = df[col].astype(str).str.strip().str.lower()
        df = df[~df[col].isin(["unknown", "nan"])]
    else:  # If the column is numeric
        df = df[df[col].notna()]  # Keep rows where the value is not NaN

# Reset index
df = df.reset_index(drop=True)

print(df)

After dropping the incomplete rows, some entries in the University Rank column were still presented as a range (for example, 501–600). To standardize these values, the midpoint of each range was calculated and assigned to the respective university.

import re

def get_median_rank(rank_str: str):
    """
    If rank_str is a range like '501–600' or '501-600',
    return the midpoint. If it's a single number (e.g. '10'),
    return that number as float.
    """
    # Normalize dash variations: 501–600 -> 501-600
    # (str() guards against ranks that were parsed as numbers)
    rank_str = str(rank_str).replace('–', '-').strip()
    
    # Find all digits (e.g., '501', '600')
    numbers = re.findall(r'\d+', rank_str)
    if not numbers:
        return None  # or np.nan, depending on preference
    
    if len(numbers) == 2:
        lower, upper = map(int, numbers)
        return (lower + upper) / 2.0
    else:
        # If there's just one number
        return float(numbers[0])

For simplicity, only articles written in English were kept. The snippet below lists the languages present and drops the rows tagged 'por' (Portuguese).

# Check unique values in the "Language" column
unique_languages = df['Language'].unique()
print("Unique languages:", unique_languages)

df = df[~df['Language'].str.strip().str.lower().eq('por')]

Step 2 : Statistical Summary

To generate a statistical summary, a basic_statistics function is created. It calculates descriptive statistics for both numeric and categorical columns in a DataFrame. It also computes the number of missing values and unique values for each column, returning two structured outputs: one for numeric columns and another for categorical columns.

def basic_statistics(df):
    """
    Calculate basic statistics for numeric and categorical columns.
    """
    numeric_stats = df.describe(include=[float, int]).T  # Numeric columns
    categorical_stats = df.describe(include=[object]).T  # Categorical columns

   
    # Add per-column counts; pandas aligns them to each table's rows by column name
    numeric_stats['missing_values'] = df.isnull().sum()
    numeric_stats['unique_values'] = df.nunique()

    categorical_stats['missing_values'] = df.isnull().sum()
    categorical_stats['unique_values'] = df.nunique()

    return numeric_stats, categorical_stats


numeric_stats, categorical_stats = basic_statistics(df)

print("Numeric Column Statistics:\n", numeric_stats)
print("\nCategorical Column Statistics:\n", categorical_stats)

This is the output of the summary:

After removing rows with missing values in the columns listed above, the dataset shrank from roughly 1,000 articles to approximately 500.

Step 3 : Univariate Analysis

Next, we look at the frequency distribution of the data.

As shown in the image below, the frequency distributions of key numeric attributes provide some interesting insights:

Impact Factor for 5 Years: The distribution is heavily skewed towards lower values. This skewness is primarily because PubMed is a freely accessible medical repository, and high-impact-factor journals are often published by expensive, subscription-based publishers, limiting their availability in open databases.
University Rank: The dataset primarily includes articles from top-ranked universities (those ranked within the top 500). While this ensures access to research from prestigious institutions, it also suggests that these articles were more likely published in lower-impact-factor journals, making them freely accessible on PubMed.
Research Score of Universities: Logically, one would expect the research score to have an inverse relationship with university rank. However, the figure below does not seem to reflect this expected trend.

Next, we look at the top 20 journals by frequency.

As illustrated in the figure below, the majority of journals in this dataset — such as Scientific Reports, PLOS ONE, and Frontiers in Immunology — embrace an open-access publishing model. This highlights a significant advantage: the research is widely accessible to the public without subscription barriers.

However, the dataset has a notable limitation: it excludes top-tier, high-impact journals such as NEJM and many titles from subscription publishers like Elsevier, which remain behind paywalls. These prestigious publications, often subscription-based, represent critical sources of cutting-edge research that are absent here. To build a truly comprehensive and world-class knowledge graph, access to these articles is indispensable.

It’s time for leading journals to reconsider their publishing models, embracing greater accessibility. Doing so would not only foster inclusivity but also catalyze global scientific collaboration, advancing research for the benefit of all.

The figure below showcases the frequency of articles published by various universities. It’s important to note that this does not suggest that top universities publish fewer articles. Instead, it reflects a tendency for these institutions to prioritize publishing in high-impact-factor journals. Unfortunately, such journals are often gated behind subscription barriers, making their research less accessible to the broader public.

Step 4: Bivariate Analysis

Now, let’s analyze the trends between two features to verify the accuracy of the data collation.

First, we’ll examine the top ten journals based on their impact factor. As shown below, a few articles from top-tier journals are indeed available on PubMed, although, as shown above, the frequency of such articles may be low.

The chart highlights the top 15 universities based on the impact factor of their respective publications. Cornell University leads with a significantly higher impact factor of 50.5, followed by Stockholm University at 36.1.

Step 5 : Correlation Matrix

The heatmap above illustrates the correlations between the numerical columns in the dataset. While one might intuitively anticipate a correlation between impact factor and research score — both being indicators of research quality and influence — no significant relationship was observed, likely due to the limited sample size. The most notable correlation is between research score and university rank, which shows a strong negative correlation (-0.81), aligning with expectations that higher-ranked universities tend to have better research scores.

Step 6: Multivariate Analysis

Finally, to look at the features as a whole, multivariate analysis was performed.

Relationship Between Impact Factor and Research Score

The scatter plot illustrates the relationship between the impact factor of journals and the research score of the universities contributing to the articles. Key observations include:

1. Weak Correlation (R² = 0.03):

The R² value of 0.03 indicates a very weak correlation between the impact factor of journals and the research score of universities.
This suggests that a university’s research score does not strongly influence the impact factor of the journals where its research is published. This could be due to low sample size.

2. Wide Dispersion:

Data points are widely scattered, reflecting the diversity in publication venues chosen by universities regardless of their research scores.
Universities with high research scores are seen publishing in both high and low-impact journals, which could be due to the open-access nature of many journals in the dataset.

3. Slight Upward Trend:

The fitted regression line shows a marginal positive slope, implying a slight tendency for universities with higher research scores to publish in journals with higher impact factors, though this trend is not statistically significant, possibly because of the small sample size.

4. Implications:

This result also aligns with earlier findings that many high-impact-factor journals are less accessible and may not prominently feature in open datasets like PubMed.

This insight underscores the importance of expanding access to high-impact journals for a more balanced representation in academic research repositories.
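
To make the R² figure more concrete, here is a minimal standard-library sketch of how a Pearson correlation and the corresponding R² can be computed for a one-predictor linear fit. The function name pearson_r and the toy values are assumptions for illustration; they are not the dataset's values or the code used to draw the plot.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy values, not taken from the dataset
research_scores = [20.0, 35.0, 50.0, 65.0, 80.0]
impact_factors = [3.0, 2.0, 6.0, 4.0, 5.0]

r = pearson_r(research_scores, impact_factors)
r_squared = r ** 2  # for a one-predictor linear fit, R^2 equals r^2

# The |r| implied by the R^2 of 0.03 reported above
implied_r = math.sqrt(0.03)
print(f"r = {r:.3f}, R^2 = {r_squared:.3f}, |r| for R^2 = 0.03: {implied_r:.3f}")
```

Read this way, an R² of 0.03 corresponds to a correlation of roughly 0.17 in absolute value, which is why the relationship is described as very weak.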

Some distributions are not explicitly displayed here but can be accessed through the following GitHub link: Pubmed EDA Part 2. Certain data, such as the study type, has been excluded from this blog as a significant portion of the extracted information was labeled as “Unknown,” making its inclusion less meaningful or insightful in this context.

Key Takeaways

Challenges with Accessibility:

The dataset primarily consists of open-access articles from journals like Scientific Reports and PLOS ONE. However, the absence of high-impact-factor journals, such as NEJM and many titles from subscription publishers like Elsevier, highlights a significant limitation in building a truly comprehensive knowledge graph.

Correlations and Trends:

University rank and research score exhibit a strong negative correlation, as expected.
Contrary to intuition, research scores showed no significant relationship with the impact factor of journals, likely due to dataset constraints such as small sample size and the nature of open-access journals.
A weak positive trend exists between research scores and the impact factors of journals, but the diversity of publication venues dilutes this relationship.

The Need for Open Research: The findings underscore the need for greater accessibility to high-impact journals, which would enable a more balanced representation of academic research and facilitate global scientific collaboration.

Next in the Series:

In the next part of this series, we will dive into PubMed Data Part 3: Mathematical Modelling, where the focus will shift from exploratory analysis to creating an equation to sort out the articles. Using the cleaned and refined dataset, we will develop models to score the articles and categorize them into different tiers.

Stay tuned as we take the first steps toward building a robust and impactful biomedical knowledge graph!

PubMed Data Part 1: Web Scraping

FHIR Shot Learning — Mon, 06 Jan 2025 06:56:30 GMT

In the previous blog, the overarching vision of transforming PubMed’s vast repository of research into actionable knowledge was introduced, along with the challenges of navigating unstructured data and the need for intelligent systems to assist in identifying high-quality studies. This blog builds on that foundation, focusing on the first crucial step: preparing and enriching the data to enable advanced analysis and visualization.

Laying the Groundwork: Setting the Foundation for Data Analysis

This part of the project establishes the foundation by focusing on three key objectives:

Extracting relevant articles based on predefined criteria.
Enriching the data with journal impact factors and university rankings.
Structuring the data for advanced analysis and visualization.

Step 1: Retrieving Data — The Search Begins

The first step in the pipeline involves retrieving relevant articles from PubMed. For this, Biopython was utilized, a Python library designed for computational biology and bioinformatics. Biopython provides programmatic access to online biological databases, such as NCBI, via the Entrez API, making it an efficient tool for fetching large datasets.

To keep things simple, the query was limited to five major diseases:

Diabetes
Cardiovascular Disease
Cancer
Alzheimer’s
Dementia

The search was further refined to include only articles published in the last five years to ensure recency and relevancy. This approach ensures that the data remains up-to-date and reflective of the latest advancements in biomedical research. The function returns a comprehensive list of PubMed IDs corresponding to these criteria, forming the foundation for the next steps in the pipeline.

from datetime import datetime

from Bio import Entrez

def search_pubmed(email):
    """
    Args:
    - email (str): Email address for Entrez login.

    Returns:
    - list : List of PubMed IDs.
    """

    # Define email for entrez login
    Entrez.email = email

    # Setup Date range for past 5 years
    current_year = datetime.now().year
    date_range = f"{current_year - 5}[PDAT] : {current_year}[PDAT]"

    # Create top 5 list of diseases
    diseases = ["Diabetes", "Cardiovascular disease", "Cancer", "Alzheimer's", "Dementia"]

    # Initialize list to collect all PubMed IDs
    pubmed_ids = []

    for disease in diseases:
        query = f"{disease} AND {date_range}"
        handle = Entrez.esearch(db='pubmed', term=query, retmax=1000)
        record = Entrez.read(handle)
        handle.close()

        # Append the list of IDs for the current disease to the master list
        pubmed_ids.extend(record['IdList'])

    # Return the collected list of PubMed IDs after the loop
    return pubmed_ids
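
Because the five queries are run independently, the same article can in principle be returned more than once (for example, a paper matching both "Alzheimer's" and "Dementia"). A minimal sketch of removing such repeats before fetching is shown below; the helper name deduplicate_ids and the sample IDs are hypothetical and not part of the original pipeline.

```python
def deduplicate_ids(pubmed_ids):
    """Drop repeated PubMed IDs while keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(pubmed_ids))

# Hypothetical IDs: "111" and "222" are each returned by two different queries
ids = ["111", "222", "111", "333", "222"]
unique_ids = deduplicate_ids(ids)
print(unique_ids)
```

Applying a step like this to the output of search_pubmed would avoid fetching and scoring the same article twice.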

Step 2: Fetching Article Metadata

The fetch_articles function is used to retrieve articles, accepting a list of pubmed_ids as input. Utilizing the Entrez efetch functionality, the data is processed in manageable chunks to prevent overloading the API. An email address is required to use the API, as Entrez may contact users in case of server issues caused by their requests.

Recognizing the inevitability of network interruptions or incomplete reads, the function is designed to retry data fetching up to three times, ensuring reliability and minimizing data loss.

An attempt was made to include citation counts, but it wasn’t very successful, so this part has been left out of the proof of concept (the related lines remain commented out in the code). Incorporating this metric in the future could significantly enhance the analysis.

from http.client import IncompleteRead

from Bio import Entrez

def fetch_articles(email, ids_list, retries=3):
    """
    Fetch details for a list of PubMed IDs.

    Args:
    - email (str): Email address for Entrez login.
    - ids_list (list): List of PubMed IDs.

    Returns:
    - list: List of dictionaries with article details.
    """
    ids = ','.join(ids_list)
    Entrez.email = email
    attempt = 0
    while attempt < retries:
        try:
            # Fetch article details
            handle = Entrez.efetch(db='pubmed', retmode='xml', id=ids)
            results = Entrez.read(handle)
            handle.close()
            
            # Add citation counts
            for paper in results['PubmedArticle']:
                pmid = paper['MedlineCitation']['PMID']
                #citation_count = get_citation_count(pmid, email)
                #paper['CitationCount'] = citation_count
            
            return results
        except IncompleteRead as e:
            print(f"Incomplete read error encountered. Attempt {attempt + 1} of {retries}. Retrying...")
            attempt += 1
            if attempt == retries:
                print("Maximum retries reached. Raising last exception.")
                raise
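
The retry idea can also be factored into a small reusable helper. The sketch below is one possible variant, not the function used above: the name fetch_with_retries, the exponential delay between attempts, and the injectable sleep argument are all assumptions added for illustration.

```python
import time
from http.client import IncompleteRead

def fetch_with_retries(fetch, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds or `retries` attempts have failed."""
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except IncompleteRead:
            if attempt == retries:
                raise  # out of attempts: surface the last error
            # wait 1x, 2x, 4x ... the base delay before the next attempt
            sleep(base_delay * 2 ** (attempt - 1))

# Usage with a stand-in for the real request: fails twice, then succeeds
calls = {"count": 0}

def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise IncompleteRead(b"")
    return {"PubmedArticle": []}

delays = []  # record the requested delays instead of actually sleeping
result = fetch_with_retries(flaky_fetch, sleep=delays.append)
print(result, delays)
```

Passing the sleep function as a parameter keeps the example fast to run; in real use the default time.sleep would pause between attempts.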

Step 3 : Parsing the details of the article

Once we have the raw metadata, the next step is to extract specific details that are crucial for our analysis. Key attributes include:

Title and Abstract: For textual analysis and understanding the focus of the study.
Journal Name: To identify where the study was published and, later, to look up the journal's impact factor.
Authors and Affiliations: To identify the Authors and the institutions they are affiliated with.
Publication Date: To analyze trends over time.

This process ensures the data is clean, structured, and ready for enrichment.

def extract_article_details(paper):
    """
    Extract specific details from a PubMed article.

    Args:
    - paper (dict): Dictionary of article details.

    Returns:
    - tuple: (title, abstract, journal, language, year, month, authors, affiliations).
    """

    title = paper.get('MedlineCitation', {}).get('Article', {}).get('ArticleTitle', 'No Title').lower()
    abstract_data = paper.get('MedlineCitation', {}).get('Article', {}).get('Abstract', {}).get('AbstractText', ['No Abstract'])
    abstract = abstract_data[0].lower() if isinstance(abstract_data, list) else abstract_data.lower()
    journal = paper.get('MedlineCitation', {}).get('Article', {}).get('Journal', {}).get('Title', 'No Journal').lower()
    language = paper.get('MedlineCitation', {}).get('Article', {}).get('Language', ['No Language'])[0]
    pubdate = paper.get('MedlineCitation', {}).get('Article', {}).get('Journal', {}).get('JournalIssue', {}).get('PubDate', {})
    year = pubdate.get('Year', 'No Data')
    month = pubdate.get('Month', 'No Data')
    authors_data = paper.get('MedlineCitation', {}).get('Article', {}).get('AuthorList', [])
    authors_list = []
    affiliations_list = []

    for author in authors_data:
        # Initialize variables for each author
        author_name = None
        affiliation = 'No Affiliation'

        # Check for author name and concatenate if present
        if 'LastName' in author and 'ForeName' in author:
            author_name = f"{author['LastName']} {author['ForeName']}"
            authors_list.append(author_name)

            # Check if 'AffiliationInfo' exists and is not an empty list
            affiliation_info = author.get('AffiliationInfo')
            if affiliation_info and isinstance(affiliation_info, list) and affiliation_info[0]:
                affiliation = affiliation_info[0].get('Affiliation', 'No Affiliation').lower()

        # Append affiliation to the list
        affiliations_list.append(affiliation)

    # Get Citation Count
    #citation_count = paper.get('CitationCount', 'No Citation Count')

    # Join the authors and affiliations into strings
    authors = ', '.join(authors_list)
    affiliations = ', '.join(affiliations_list)

    # Return the extracted information
    return title, abstract, journal, language, year, month, authors, affiliations
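
The repeated chains of .get(..., {}) calls can be condensed with a small helper that walks nested dictionaries. This is only a sketch: the helper name dig and the trimmed sample record are hypothetical, and it assumes the parsed records behave like ordinary nested dictionaries.

```python
def dig(mapping, *keys, default=None):
    """Walk nested dictionaries, returning `default` if any key is missing."""
    current = mapping
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current

# Hypothetical, heavily trimmed record in the same nested shape
paper = {
    "MedlineCitation": {
        "Article": {
            "ArticleTitle": "A Sample Title",
            "Journal": {"Title": "Some Journal"},
        }
    }
}

title = dig(paper, "MedlineCitation", "Article", "ArticleTitle", default="No Title")
year = dig(paper, "MedlineCitation", "Article", "Journal", "JournalIssue",
           "PubDate", "Year", default="No Data")
print(title, year)
```

Here the title is found, while the missing publication year falls back to the default instead of raising an error.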

Step 4: Creating a Dataframe

The create_dataframe function brings us one step closer to organising the entire data in a tabular format. In this function, we call the above functions in a streamlined pipeline.

Using the fetch_articles function, it retrieves up to 1,000 articles per request (the default chunk_size). The extract_article_details function is then applied to extract key features from each article, such as the title, abstract, authors, and affiliations. Once all relevant information has been processed, it is compiled into a structured DataFrame, consolidating the extracted metadata into an easily analyzable format.

import pandas as pd

def create_dataframe(email, ids_list, chunk_size=1000):
    """
    Create a DataFrame containing details of PubMed articles.

    This function fetches articles from PubMed in chunks and extracts relevant details
    such as title, abstract, journal, etc., to populate a DataFrame.

    Args:
    - email (str): Email address for Entrez login.
    - ids_list (list of str): List of PubMed IDs to fetch.
    - chunk_size (int, optional): The number of articles to fetch in each request. Default is 1000.

    Returns:
    - pandas.DataFrame: A DataFrame where each row represents an article and columns
      contain title, abstract, journal, language, year, month, authors, and affiliations.
    """
    pubmed_df = {
        'Title': [], 'Abstract': [], 'Journal': [], 'Language': [], 'Year': [], 'Month': [],
         'Authors': [], 'Affiliations': []
    }

    for chunk_i in range(0, len(ids_list), chunk_size):
        chunk = ids_list[chunk_i:chunk_i + chunk_size]
        papers = fetch_articles(email, chunk)

        if papers is None or 'PubmedArticle' not in papers:
            print(f"Warning: No data returned for chunk starting at index {chunk_i}")
            continue

        for paper in papers["PubmedArticle"]:
            # Extract article details from the paper
            title, abstract, journal, language, year, month, authors, affiliations = extract_article_details(paper)

            # Append the details to the respective lists in the dictionary
            pubmed_df['Title'].append(title)
            pubmed_df['Abstract'].append(abstract)
            pubmed_df['Journal'].append(journal)
            pubmed_df['Language'].append(language)
            pubmed_df['Year'].append(year)
            pubmed_df['Month'].append(month)
            pubmed_df['Authors'].append(authors)
            pubmed_df['Affiliations'].append(affiliations)

    # Convert the dictionary to a pandas DataFrame
    pubmed_df = pd.DataFrame(pubmed_df)

    return pubmed_df
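
The chunking used in the loop above can be isolated into a small generator, which makes the batch boundaries easy to check. The name iter_chunks and the generated sample IDs are hypothetical and only illustrate the slicing logic.

```python
def iter_chunks(items, chunk_size):
    """Yield consecutive slices of `items`, each at most `chunk_size` long."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]

# 2,500 made-up IDs split into batches of at most 1,000
ids = [str(n) for n in range(2500)]
chunk_lengths = [len(chunk) for chunk in iter_chunks(ids, 1000)]
print(chunk_lengths)
```

With 2,500 IDs and a chunk size of 1,000, this produces batches of 1,000, 1,000, and 500 items.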

Step 5: Merging the Impact Factors

The next step in the pipeline was to determine the Impact Factor of the journals associated with the retrieved articles. The Impact Factor is a crucial metric that measures the average number of citations received by a journal's recent articles (conventionally, those published in the preceding two years), with higher values signifying greater influence in the scientific community. Since PubMed does not directly provide this information, alternative methods were explored, including using APIs. However, many existing Python libraries for Impact Factor data retrieval are no longer functional. For this project, a CSV file containing Impact Factor data was sourced from the Journal Citation Reports website.

The merge_impact_factors function plays a key role in this step by merging the extracted PubMed data with the Impact Factor dataset. It matches the two DataFrames based on journal names or unique identifiers like ISSN/EISSN, ensuring a reliable integration of the Impact Factor into the pipeline. This step enriches the dataset, making it more robust for analysis and scoring methodologies.

def merge_impact_factors(pubmed_df, impact_factor_csv_path, journal_col='Journal'):
    """
    Merge impact factors into the PubMed articles DataFrame, retain articles with impact factors,
    and drop columns that only contain NaN values.

    Args:
    - pubmed_df (DataFrame): DataFrame containing PubMed articles.
    - impact_factor_csv_path (str): Path to the CSV file with impact factors.
    - journal_col (str): Column name for journal titles in the PubMed DataFrame.

    Returns:
    - DataFrame: The merged DataFrame with impact factors and without NaN-only columns.
    """

    # Load the impact factor CSV file
    impact_factors_df = pd.read_csv(impact_factor_csv_path)

    # Format the journal titles consistently (strip whitespaces and convert to lowercase)
    pubmed_df[journal_col] = pubmed_df[journal_col].str.strip().str.lower()
    impact_factors_df['Name'] = impact_factors_df['Name'].str.strip().str.lower()
    impact_factors_df['Abbr Name'] = impact_factors_df['Abbr Name'].str.strip().str.lower()

    # First attempt: match on the full journal name
    merged_df = pubmed_df.merge(
        impact_factors_df,
        how='left',
        left_on=journal_col,
        right_on='Name'
    )

    # Fall back to other identifiers only if the previous attempt matched nothing.
    # Separate `if` statements are used so that each fallback can actually run.
    if merged_df['JIF'].isna().all():
        merged_df = pubmed_df.merge(
            impact_factors_df,
            how='left',
            left_on=journal_col,
            right_on='Abbr Name'
        )
    if merged_df['JIF'].isna().all() and 'ISSN' in pubmed_df.columns:
        merged_df = pubmed_df.merge(
            impact_factors_df,
            how='left',
            left_on='ISSN',
            right_on='ISSN'
        )
    if merged_df['JIF'].isna().all() and 'EISSN' in pubmed_df.columns:
        merged_df = pubmed_df.merge(
            impact_factors_df,
            how='left',
            left_on='EISSN',
            right_on='EISSN'
        )

    # Rename relevant columns for clarity
    merged_df.rename(columns={
        'JIF': 'Impact_Factor',
        'JIF5Years': 'Impact_Factor_5Years',
        'Category': 'Journal_Category'
    }, inplace=True)

    # Retain only articles with available impact factors
    merged_df = merged_df.dropna(subset=['Impact_Factor'])

    # Drop columns that only contain NaN values
    merged_df = merged_df.dropna(axis=1, how='all')
    return merged_df
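
The fallbacks in merge_impact_factors are decided for the whole DataFrame at once, since they check whether every row is unmatched. A per-journal fallback is a possible variant; the plain-Python sketch below illustrates the idea with dictionaries rather than pandas. The function name lookup_impact_factor, the journal names, and the numbers are all hypothetical and are not real impact factors.

```python
def lookup_impact_factor(journal, by_name, by_abbr):
    """Return the impact factor for one journal, trying each key in turn."""
    key = journal.strip().lower()
    for table in (by_name, by_abbr):
        if key in table:
            return table[key]
    return None  # no match under any key

# Hypothetical values for illustration only, not real impact factors
by_name = {"journal of examples": 4.2}
by_abbr = {"j. ex.": 4.2, "ann. samples": 7.5}

journals = ["Journal of Examples", "Ann. Samples ", "Unlisted Journal"]
matched = [lookup_impact_factor(j, by_name, by_abbr) for j in journals]
print(matched)
```

With this approach, a journal listed only under its abbreviated name can still be matched even when other journals matched on their full name.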

Step 6: Entity Recognition for extracting Universities and Study Type using GLiNER

From the extracted data, two major challenges were identified:

Study Type Identification: The metadata does not explicitly specify the study type. This information must be inferred from the article text (the abstract, in the code below), adding complexity to the data processing pipeline.
Affiliation Cleanup: The affiliations column contains excessively verbose text. Extracting and isolating university names from this unstructured data requires additional processing.

To address these issues, GLiNER, a Named Entity Recognition (NER) transformer, was employed. GLiNER is built on a BERT-like transformer architecture and offers a robust solution for entity extraction. Unlike many traditional NER tools, which are restricted to predefined entity categories, GLiNER accepts custom entity labels at prediction time and is lightweight, making it an excellent choice for processing large datasets and extracting custom entities like universities and study types efficiently.

The extract_universities_gliner function extracts university names from an affiliation string by using the GLiNER transformer to predict entities labeled as "Organization". If no universities are found, it returns "Unknown". Similarly, the extract_study_type_gliner function identifies study types from the abstract text by extracting entities labeled as "Study Type". Both functions apply GLiNER's entity prediction capability to handle unstructured text efficiently and populate the respective columns in the DataFrame.

from gliner import GLiNER
import pandas as pd

# Load the GLiNER model
model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1")

# Labels for entity prediction
labels_universities = ["Organization"]
labels_study_types = ["Study Type"]

# Function to extract universities from affiliations using GLiNER
def extract_universities_gliner(affiliation):
    """
    Extract universities from the affiliation string using GLiNER.

    Args:
    - affiliation (str): The affiliation string.

    Returns:
    - str: Extracted university names.
    """
    if not isinstance(affiliation, str) or affiliation.strip() == "":
        return "Unknown"

    # Perform entity prediction using GLiNER
    entities = model.predict_entities(affiliation, labels_universities, threshold=0.5)

    # Extract universities from the identified entities
    universities = [entity["text"] for entity in entities if entity["label"] == "Organization"]

    # Return universities as a comma-separated string or 'Unknown' if none found
    return ", ".join(universities) if universities else "Unknown"

# Function to extract study types from abstract using GLiNER
def extract_study_type_gliner(abstract):
    """
    Extract study types from the abstract text using GLiNER.

    Args:
    - abstract (str): Abstract of the study.

    Returns:
    - str: The type of study.
    """
    if not isinstance(abstract, str) or abstract.strip() == "":
        return "Unknown"

    # Perform entity prediction using GLiNER
    entities = model.predict_entities(abstract, labels_study_types, threshold=0.5)

    # Extract study type from the identified entities
    study_types = [entity["text"] for entity in entities if entity["label"] == "Study Type"]

    # Return the first matched study type or 'Unknown' if none found
    return study_types[0] if study_types else "Unknown"

# Apply the GLiNER extraction functions to the DataFrame
final_df['Universities'] = final_df['Affiliations'].apply(extract_universities_gliner)
final_df['Study_Type_Extracted'] = final_df['Abstract'].apply(extract_study_type_gliner)
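
Since one affiliation string often mentions the same institution several times, the predicted entities may benefit from light post-processing. The sketch below filters by score and removes case-insensitive duplicates. It runs on a hand-written list rather than a real model response, and it assumes each entity dictionary carries "text", "label", and "score" keys; the helper name unique_entity_texts is hypothetical.

```python
def unique_entity_texts(entities, label, min_score=0.5):
    """Collect distinct entity texts for one label, keeping first-seen order."""
    seen = set()
    texts = []
    for entity in entities:
        if entity["label"] != label or entity.get("score", 0.0) < min_score:
            continue
        key = entity["text"].strip().lower()
        if key and key not in seen:
            seen.add(key)
            texts.append(entity["text"].strip())
    return texts

# Hand-written stand-in for a model response (no model is loaded here)
entities = [
    {"text": "Stanford University", "label": "Organization", "score": 0.93},
    {"text": "stanford university", "label": "Organization", "score": 0.81},
    {"text": "Department of Medicine", "label": "Organization", "score": 0.42},
]
universities = unique_entity_texts(entities, "Organization")
print(universities)
```

In this toy input the repeated name is kept once and the low-scoring entity is dropped.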

Step 7: Standardisation of the Universities

Inconsistent naming conventions in affiliations can pose significant challenges for analysis. University names in the extracted data often vary due to differences in formatting, case sensitivity, or the inclusion of extra descriptive text. For instance:

“Stanford University, Department of Medicine” → “stanford university”
“University of California, Los Angeles (UCLA)” → “university of california”

To address these variations, standardization ensures that all names are reduced to a consistent and comparable format. The standardize_university_names function streamlines this process by:

Utilizing Regular Expressions (Regex): Extracting the main university or institutional name from complex strings.
Handling Ambiguities: Assigning “Unknown” to cases where a clear match cannot be identified.

This step improves data consistency, enabling accurate analysis and integration with external datasets.

import re
def standardize_university_names(universities_column):
    standardized_names = []
    for university in universities_column:
        if university.lower() == 'unknown':
            standardized_names.append('Unknown')
            continue

        # Extract main university name using regex
        match = re.search(r'([a-zA-Z]+\s*(university|institute|college|academy|school))', university, re.IGNORECASE)
        if match:
            standardized_names.append(match.group(0).strip().lower())
        else:
            standardized_names.append('Unknown')

    return standardized_names

final_df['Standardized_University'] = standardize_university_names(final_df['Universities'])
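
One caveat: the pattern in standardize_university_names expects a word directly before the keyword, so names of the form "University of X" do not appear to match it and would fall through to "Unknown". The sketch below shows one possible variant that accepts both forms. The names KEYWORD, PATTERN, and standardize_name are hypothetical, and the pattern is an illustration with its own limits (for instance, it can over-capture when words after "of" are not separated by punctuation).

```python
import re

KEYWORD = r"(?:university|institute|college|academy|school)"
PATTERN = re.compile(
    # optional leading words, the keyword, then an optional "of <words>" tail
    rf"\b(?:[a-z]+\s+)*?{KEYWORD}\b(?:\s+of\s+[a-z]+(?:\s+[a-z]+)*)?",
    re.IGNORECASE,
)

def standardize_name(raw):
    """Return the lowercased institution name, or 'Unknown' if none is found."""
    match = PATTERN.search(raw)
    return match.group(0).lower() if match else "Unknown"

examples = [
    "Stanford University, Department of Medicine",
    "University of California, Los Angeles (UCLA)",
    "Mayo Clinic, Rochester",
]
standardized = [standardize_name(text) for text in examples]
print(standardized)
```

Under this variant the two example affiliations from the text above reduce to "stanford university" and "university of california", while a string without any keyword is labeled "Unknown".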

Step 8 : Fetching University rankings and merging it in the dataframe

To enhance the paper sorting methodology, it was essential to incorporate university rankings and corresponding research scores into the dataset. This was achieved using the extract_and_merge_university_ranking function, which fetches global university rankings and research scores from a specified API and seamlessly integrates this information into an existing DataFrame.

The function works by sending a GET request to the API, parsing the JSON response to extract relevant data (university names, rankings, and research scores), and structuring this information into a new DataFrame. To ensure compatibility with the existing dataset, university names are standardized by converting them to lowercase. The enriched ranking data is then merged with the original DataFrame using the standardized university names as a key.

Additionally, the function handles API request errors gracefully, cleans and formats ranking values, and ensures that the final DataFrame is comprehensive and ready for further analysis. The result is an updated dataset with new columns for university rankings and research scores, enabling a more robust and data-driven approach to ranking papers.

This step completes the data extraction pipeline, providing a comprehensive dataset ready for downstream analysis.

 
import pandas as pd
import requests

def extract_and_merge_university_ranking(final_df, api_url):
    """
    Extracts university rankings from a given API and merges them with the existing DataFrame.

    Args:
        final_df (DataFrame): Existing DataFrame with a column named 'Standardized_University'.
        api_url (str): URL to the API that provides university rankings.

    Returns:
        DataFrame: Updated DataFrame containing 'Rank' and 'Research_Score' columns.
    """
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
    }

    try:
        # Sending GET request to the URL
        response = requests.get(api_url, headers=headers)
        response.raise_for_status()  # Raise an error for bad responses
        data = response.json()

        # Extracting relevant data from the API response
        university_names = []
        ranks = []
        research_scores = []

        for university in data.get('data', []):
            uni_name = university.get('name')
            rank = university.get('rank')
            research_score = university.get('scores_research')

            university_names.append(uni_name.lower())  # Convert to lowercase for standardization
            ranks.append(rank)
            research_scores.append(research_score)

        # Creating DataFrame from extracted data
        ranking_df = pd.DataFrame({
            'University': university_names,
            'Rank': ranks,
            'Research_Score': research_scores
        })

        # Cleaning up rank values: strip symbols like '=' and store the result as strings
        ranking_df['Rank'] = ranking_df['Rank'].replace('=', '', regex=True).astype(str)

        # Standardizing 'Standardized_University' column in final_df to lowercase for matching
        final_df['Standardized_University'] = final_df['Standardized_University'].str.lower()

        # Merging rankings with the original DataFrame
        final_df = final_df.merge(ranking_df, left_on='Standardized_University', right_on='University', how='left', suffixes=('', '_Ranking'))
        return final_df

    except requests.exceptions.RequestException as e:
        print(f"An error occurred while fetching university rankings: {e}")
        return final_df

api_url = "https://www.timeshighereducation.com/sites/default/files/the_data_rankings/world_university_rankings_2024_0__91239a4509dc50911f1949984e3fb8c5.json"


# Call the function and store the enriched DataFrame in pubmed_final
pubmed_final = extract_and_merge_university_ranking(final_df, api_url)
print(pubmed_final.head())
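
The merge above matches on lowercased names only, so differences in whitespace, punctuation, or accents can leave 'Rank' and 'Research_Score' empty for some rows. A minimal sketch of a stricter normalization step follows; the helper name normalize_university_name is hypothetical and not part of the original code:

```python
import re
import unicodedata


def normalize_university_name(name):
    """Return a simplified form of a university name for matching (hypothetical helper)."""
    if not name:
        return ""
    # Decompose accented characters and drop the combining marks (e.g. "ü" -> "u")
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase, replace punctuation with spaces, then collapse repeated whitespace
    lowered = without_accents.casefold()
    no_punct = re.sub(r"[^\w\s]", " ", lowered)
    return re.sub(r"\s+", " ", no_punct).strip()


print(normalize_university_name("  University of Oxford "))
print(normalize_university_name("ETH Zürich"))
```

Applying the same helper to both 'Standardized_University' and the ranking names before merging could raise the match rate; whether it helps depends on how the names actually differ in the data.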

The Result: A Comprehensive Dataset

With all necessary fields consolidated into a single dataset, the foundation for advanced analysis is now complete. This enriched and structured data is primed for deeper exploration and meaningful insights.

What’s Next?

In the next blog, we’ll focus on data visualization to uncover patterns, trends, and feature correlations within the dataset. These insights will play a crucial role in developing a robust scoring methodology for ranking articles and will set the stage for constructing sophisticated Knowledge Graphs in subsequent steps.

Harnessing PubMed: A deep dive in medical knowledge extraction powered by LLMs

FHIR Shot Learning — Wed, 01 Jan 2025 11:20:43 GMT

PubMed is a free, publicly accessible database of biomedical and life-sciences literature. It contains more than 37 million citations and abstracts.

But it is far more than a repository of medical research; it is a vast trove of insights waiting to be unlocked. With its API granting access to millions of research papers, it offers unparalleled opportunities to transform raw data into actionable knowledge.

Source: hjrc33

Scenario 1: A Traditional Approach to Finding Solutions

Imagine a scenario where a patient suffers from a disease and is allergic to conventional treatment. The doctor, eager to help, decides to find an alternative by manually searching online. They eventually discover a peer-reviewed article in a respected journal and decide to try the suggested treatment. However, the study in question is a single-blind trial with limited supporting data. Tragically, the treatment fails, leaving the patient to suffer further.

While the doctor’s decision to consult recent research is commendable, it’s not without risks. The quality of the article, the credibility of the journal, the type of study, and even the author’s affiliations all play critical roles in determining whether the research is trustworthy. Unfortunately, in the current landscape, many rely solely on a journal’s Impact Factor as a metric for quality. But is that enough?

Consider this: a double-blind study is generally more reliable than a single-blind one because it reduces the risk of researcher bias. Similarly, an article authored by researchers affiliated with highly ranked institutions, or one with a substantial citation count, might be more credible. So why haven’t we developed a comprehensive scoring methodology to evaluate papers holistically? Why are we fixated solely on the Impact Factor of journals when more nuanced metrics could significantly enhance the credibility of research used for critical decisions?

Scenario 2: Relying on AI Without Caution

Now imagine another scenario. With the rising popularity of ChatGPT, the doctor decides to forgo traditional search methods and instead queries the AI for a solution. ChatGPT, celebrated for passing medical exams and widely used for medical assistance, confidently suggests a treatment. However, the treatment only worsens the patient’s existing symptoms. Further investigation reveals that ChatGPT was hallucinating (producing fabricated or inaccurate information), potentially jeopardizing the patient’s health or even their life.

This is not to diminish ChatGPT’s capabilities; it is undeniably a groundbreaking tool for tasks such as summarization, classification, and contextual language understanding. However, when it comes to factual accuracy in critical fields like healthcare, relying solely on a language model is fraught with risks. Large Language Models (LLMs) like ChatGPT are prone to hallucinations, where they generate plausible but false or unverified information. In a domain where even a single error can have life-threatening consequences, this is a risk we simply cannot afford.

A Smarter Solution: Combining LLMs and Knowledge Graphs

Rather than relying solely on LLMs, we can leverage their strengths to construct Knowledge Graphs (KGs) — a structured, traceable, and explainable framework. Knowledge Graphs offer several advantages:

They are transparent, allowing users to trace recommendations back to the original source.
They are credible, incorporating only validated data from reliable research.
They are explainable, clearly showing the connections between entities like diseases, treatments, and study types.

By integrating LLMs as tools for extracting information and combining them with Knowledge Graphs for reasoning and explainability, we can create systems that are both powerful and trustworthy. This hybrid approach ensures that doctors have access to reliable, evidence-based insights, minimizing the risks of misinformation and ultimately improving patient outcomes.
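
To make the traceability idea concrete, here is a minimal sketch of a graph whose every fact carries the identifier of the article it came from. The class names, relation names, and records are placeholders invented for illustration; they are not the project's actual schema or data, and a real pipeline would likely use a dedicated graph library or database.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    subject: str
    relation: str
    obj: str
    source: str  # identifier of the article the fact was extracted from


class KnowledgeGraph:
    def __init__(self):
        self._by_subject = defaultdict(list)

    def add(self, triple):
        self._by_subject[triple.subject].append(triple)

    def query(self, subject, relation):
        """Return every triple for a subject and relation, with its source attached."""
        return [t for t in self._by_subject.get(subject, []) if t.relation == relation]


# Placeholder facts for illustration only; not taken from any real article
kg = KnowledgeGraph()
kg.add(Triple("disease_x", "treated_by", "drug_a", source="article_001"))
kg.add(Triple("disease_x", "treated_by", "drug_b", source="article_002"))
kg.add(Triple("drug_a", "studied_in", "double_blind_trial", source="article_001"))

for t in kg.query("disease_x", "treated_by"):
    print(f"{t.obj} (source: {t.source})")
```

Because each answer is returned together with its source, a reader can follow a recommendation back to the article that supports it, which a free-text LLM response does not offer by default.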

The future of healthcare lies not in replacing traditional methods or blindly trusting AI, but in combining human expertise, AI innovation, and structured, explainable data.

This project aims to explore how advanced tools and methodologies can be combined to create a robust pipeline that not only processes vast amounts of data but also ensures its usability and credibility. By leveraging the strengths of both LLMs and Knowledge Graphs, we take a step closer to building intelligent systems that are explainable, evidence-based, and capable of supporting decision-making in high-stakes environments like medicine.

Simplifying Complexity: The Project’s Two Core Parts

While the project is divided into four parts for clarity, it fundamentally revolves around two primary components:

Part 1: Distilling the Data

This section establishes the foundation for the project:

Data Scraping: Collecting PubMed data and supplementing it with journal impact factors and university research scores.
Data Visualization: Conducting exploratory data analysis (EDA) to understand data distributions and correlations between features.
Mathematical Modeling: Developing a scoring methodology to rank papers based on feature importance.
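
As a rough illustration of what ranking papers by a score can look like, the sketch below scales each feature to the range 0 to 1 and combines them with a weighted sum. This is a stand-in technique, not the methodology developed later in the series; the feature names, weights, and records are placeholder assumptions.

```python
def min_max_scale(values):
    """Scale a list of numbers to the range [0, 1]; constant lists map to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def score_papers(papers, weights):
    """Return (title, score) pairs sorted from highest to lowest score."""
    scaled = {
        feature: min_max_scale([p[feature] for p in papers])
        for feature in weights
    }
    scored = []
    for i, paper in enumerate(papers):
        score = sum(weights[f] * scaled[f][i] for f in weights)
        scored.append((paper["title"], score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Placeholder records and weights, chosen only to illustrate the mechanics
papers = [
    {"title": "paper_a", "impact_factor": 2.0, "research_score": 40.0, "citations": 10},
    {"title": "paper_b", "impact_factor": 8.0, "research_score": 90.0, "citations": 110},
    {"title": "paper_c", "impact_factor": 5.0, "research_score": 65.0, "citations": 60},
]
weights = {"impact_factor": 0.4, "research_score": 0.3, "citations": 0.3}

for title, score in score_papers(papers, weights):
    print(title, round(score, 2))
```

Scaling first keeps a feature with a large numeric range, such as citation counts, from dominating the others purely because of its units.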

Part 2: Building Knowledge Graphs

The focus here is on leveraging LLMs to construct sophisticated knowledge graphs:

Named Entity Recognition (NER): Employing biomedical transformers to identify critical entities such as institutions, researchers, and study attributes.
Relationship Modeling: Using Llama 3.1 to establish connections between these entities, enabling the creation of meaningful relationships.
Knowledge Graphs: Constructing and querying knowledge graphs to visualize data and derive actionable insights.

Why This Matters

This proof-of-concept project demonstrates how artificial intelligence (AI) can transform unstructured data into structured, explainable knowledge. By automating the pipeline with AI models, the project highlights a range of impactful applications:

Healthbots: Automating patient inquiries with AI-driven chatbots powered by high-quality research.
Recommender Systems: Guiding researchers and clinicians to relevant studies based on robust scoring and relationships.
Explainable AI: Enhancing trust by providing clear, evidence-backed recommendations.

What’s Next: A Four-Part Series

This blog marks the beginning of a series that delves deeper into the methods and findings of the project. Here’s what you can expect:

PubMed Data Part 1: Web Scraping: Exploring how PubMed data, journal impact factors, and university research scores were collated from different sources and integrated into a single dataframe.
PubMed Data Part 2: Data Visualisation: Uncovering patterns and correlations through exploratory data analysis (EDA).
PubMed Data Part 3: Mathematical Modelling: Developing a mathematical equation for ranking the articles using an unsupervised learning method.
Part 4: Building Knowledge Graphs: Leveraging transformers and LLMs for advanced NER, identifying relationships between entities, and constructing Knowledge Graphs from them.

This four-part series aims to demonstrate how a combination of data science, machine learning, and advanced AI models can streamline complex biomedical research workflows. By starting with data collection and progressing through visualization, modeling, and Knowledge Graph construction, each part builds on the previous one to showcase a holistic approach to transforming unstructured data into actionable insights. Whether you’re a data scientist, researcher, or healthcare professional, this series offers a comprehensive guide to leveraging AI for impactful applications.

If you are interested in the full project, it is available on GitHub.

Happy Reading!