<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by FHIR Shot Learning on Medium]]></title>
        <description><![CDATA[Stories by FHIR Shot Learning on Medium]]></description>
        <link>https://medium.com/@fhirshotlearning?source=rss-7e548aa5925b------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*kut084zQwp5kQhG5QX6mpQ.jpeg</url>
            <title>Stories by FHIR Shot Learning on Medium</title>
            <link>https://medium.com/@fhirshotlearning?source=rss-7e548aa5925b------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sat, 30 May 2026 02:21:25 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@fhirshotlearning/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[XplainMD Part 4: From Graph Reasoning to Natural Language — Integrating GNNs with LLMs and Gradio]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-image"><a href="https://medium.com/@fhirshotlearning/xplainmd-part-4-from-graph-reasoning-to-natural-language-integrating-gnns-with-llms-and-gradio-afa5c636e956?source=rss-7e548aa5925b------2"><img src="https://cdn-images-1.medium.com/max/2502/1*QTQi_MUjJ8OdsUKUJZcMhA.png" width="2502"></a></p><p class="medium-feed-snippet">In Part 3 of this series, the project advanced into the realm of deep learning, training a Relational Graph Convolutional Network (R-GCN)&#x2026;</p><p class="medium-feed-link"><a href="https://medium.com/@fhirshotlearning/xplainmd-part-4-from-graph-reasoning-to-natural-language-integrating-gnns-with-llms-and-gradio-afa5c636e956?source=rss-7e548aa5925b------2">Continue reading on Medium »</a></p></div>]]></description>
            <link>https://medium.com/@fhirshotlearning/xplainmd-part-4-from-graph-reasoning-to-natural-language-integrating-gnns-with-llms-and-gradio-afa5c636e956?source=rss-7e548aa5925b------2</link>
            <guid isPermaLink="false">https://medium.com/p/afa5c636e956</guid>
            <category><![CDATA[gnn]]></category>
            <category><![CDATA[gnnexplainer]]></category>
            <category><![CDATA[graph-data-science]]></category>
            <category><![CDATA[llm-applications]]></category>
            <dc:creator><![CDATA[FHIR Shot Learning]]></dc:creator>
            <pubDate>Wed, 09 Apr 2025 14:28:38 GMT</pubDate>
            <atom:updated>2025-04-09T14:35:24.478Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[XplainMD Part 3: Relational GCN & GNNExplainer: Learning & Explaining Links]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-image"><a href="https://medium.com/@fhirshotlearning/xplainmd-part-3-relational-gcn-gnnexplainer-learning-explaining-links-6a4a290819fc?source=rss-7e548aa5925b------2"><img src="https://cdn-images-1.medium.com/max/2000/1*vbQ1-bf1NjKF2jcRz-Jbxg.png" width="2000"></a></p><p class="medium-feed-snippet">Introduction</p><p class="medium-feed-link"><a href="https://medium.com/@fhirshotlearning/xplainmd-part-3-relational-gcn-gnnexplainer-learning-explaining-links-6a4a290819fc?source=rss-7e548aa5925b------2">Continue reading on Medium »</a></p></div>]]></description>
            <link>https://medium.com/@fhirshotlearning/xplainmd-part-3-relational-gcn-gnnexplainer-learning-explaining-links-6a4a290819fc?source=rss-7e548aa5925b------2</link>
            <guid isPermaLink="false">https://medium.com/p/6a4a290819fc</guid>
            <category><![CDATA[gnn]]></category>
            <category><![CDATA[graph-data-science]]></category>
            <category><![CDATA[gnnexplainer]]></category>
            <dc:creator><![CDATA[FHIR Shot Learning]]></dc:creator>
            <pubDate>Wed, 09 Apr 2025 13:48:49 GMT</pubDate>
            <atom:updated>2025-04-13T14:40:09.066Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[XplainMD Part 2: Finding the Missing Links with Machine Learning]]></title>
            <link>https://medium.com/@fhirshotlearning/xplainmd-part-2-finding-the-missing-links-with-machine-learning-918c03f613d4?source=rss-7e548aa5925b------2</link>
            <guid isPermaLink="false">https://medium.com/p/918c03f613d4</guid>
            <category><![CDATA[graph-data]]></category>
            <category><![CDATA[graphml]]></category>
            <category><![CDATA[node2vec]]></category>
            <category><![CDATA[graph-data-science]]></category>
            <dc:creator><![CDATA[FHIR Shot Learning]]></dc:creator>
            <pubDate>Wed, 09 Apr 2025 12:47:09 GMT</pubDate>
            <atom:updated>2025-04-12T13:12:06.893Z</atom:updated>
            <content:encoded><![CDATA[<p>In <a href="https://medium.com/@fhirshotlearning/xplainmd-part-1-a-visual-exploration-of-primekg-cf032cbb864a?source=friends_link&amp;sk=b87f78a312f515080e5f8598804c1d29"><strong>Part 1 of the XplainMD</strong></a><strong> series</strong>, we zoomed out to explore the <em>architecture</em> of biomedical knowledge — unpacking the rich topology of PrimeKG through centrality analyses, causal subgraphs, and community detection. By mapping how diseases, drugs, proteins, and phenotypes interconnect in a vast biomedical graph, the foundation has been laid for understanding not just what exists — but what might be <em>missing</em>.</p><p>Now, in <strong>Part 2</strong>, the gears will shift from exploration to prediction.</p><p>Graphs open the door to a wide range of prediction tasks — from <strong>node classification</strong> (predicting properties of nodes) to <strong>link prediction</strong> (inferring missing or potential connections). Since our focus is on understanding <strong>hidden relationships</strong> in biomedical data, this series dives into <strong>link prediction</strong>.</p><p><strong>So, what is link prediction?</strong><br> It’s the task of asking: <em>“Given what we know about this graph’s structure, can we infer meaningful relationships that aren’t explicitly present?”</em></p><p>This is where <strong>representation learning</strong> steps in.</p><p>By applying <strong>Node2Vec</strong>, the rich topology of the biomedical graph is transformed into dense vector embeddings that capture both <strong>semantic proximity</strong> and <strong>structural context</strong>. 
These embeddings serve as the foundation for downstream tasks — in this case, predicting missing or unknown edges between entities like drugs, diseases, and phenotypes.</p><p>Concretely, the embeddings become the input features for machine learning models like <strong>Logistic Regression</strong> and <strong>XGBoost</strong>, which perform <strong>link prediction</strong>: estimating whether a biologically plausible, yet currently unobserved, connection exists between entities such as a disease and a phenotype.</p><p>This is where XplainMD begins to evolve from graph understanding into <em>graph reasoning</em>.</p><h3>Data Pre-processing: Structuring the Graph</h3><p>Before training any machine learning or deep learning model, the input data must be well-formatted, clean, and consistent. This preprocessing step converts the raw PrimeKG CSV into a form from which a graph structure can be built for both machine learning and deep learning tasks.</p><h3>1. Data Loading</h3><p>In this project, a <strong>filtered subset of PrimeKG</strong> was loaded into a pandas DataFrame to focus on the most clinically relevant biomedical relationships. 
Specifically, only the following relation types were extracted:</p><pre>selected_relations = [<br>    &quot;protein_protein&quot;,<br>    &quot;disease_phenotype_positive&quot;,<br>    &quot;bioprocess_protein&quot;,<br>    &quot;disease_protein&quot;,<br>    &quot;drug_effect&quot;,<br>    &quot;pathway_protein&quot;,<br>    &quot;disease_disease&quot;,<br>    &quot;contraindication&quot;,<br>    &quot;drug_protein&quot;,<br>    &quot;indication&quot;<br>]</pre><h4>What Does Each Row Represent?</h4><p>Each row in the DataFrame corresponds to <strong>a single edge</strong> in the biomedical knowledge graph — that is, a meaningful connection between two biomedical entities.</p><p>The relevant columns include:</p><ul><li><strong>x_name, </strong><strong>y_name</strong>: The actual names of the two nodes connected by the relation (e.g., “Alzheimer’s disease”, “APP”).</li><li><strong>x_type, </strong><strong>y_type</strong>: The entity types for each node — such as disease, protein, drug, phenotype, etc.</li><li><strong>relation</strong>: The type of biomedical relationship between the nodes — e.g., disease_protein or drug_effect.</li><li><strong>x_source, </strong><strong>y_source</strong>: These fields <strong>do not indicate directionality</strong> of the edge — instead, they refer to the <strong>original source database</strong> (like NCBI or DrugBank) from which the node was extracted.</li></ul><h4>Direction ≠ Semantics</h4><p>While the table structure follows a source → target format (x_name to y_name), this <strong>does not mean the graph is directed</strong>. According to the official <a href="https://www.nature.com/articles/s41597-023-01960-3">PrimeKG paper</a>, the graph is treated as <strong>undirected</strong> during analysis and modelling. 
This means that relationships are <strong>bidirectional</strong>, even though they are stored in a structured row format.</p><h4>Why This Matters for Graph Construction</h4><p>Understanding the true semantics of these edges is critical. When building the graph later in PyTorch Geometric (or any GNN library):</p><ul><li><strong>Treat the edges as undirected</strong> for most graph algorithms and embeddings like Node2Vec.</li><li>Ensure the <strong>edge type (relation)</strong> and <strong>node types (x/y_type)</strong> are preserved in a mapping — enabling construction of typed heterogeneous graphs.</li></ul><h3>2. Text Normalisation</h3><p>Biomedical datasets often contain inconsistent casing, hidden unicode characters, or stray spaces. To ensure uniformity across node names, every node label is lowercased, stripped of whitespace, and normalised using unicodedata.</p><pre>import unicodedata<br><br>def clean_text(text):<br>    # NFKD-normalise, trim surrounding whitespace, and lowercase<br>    return unicodedata.normalize(&quot;NFKD&quot;, str(text)).strip().lower()<br><br>df[&quot;x_name&quot;] = df[&quot;x_name&quot;].apply(clean_text)<br>df[&quot;y_name&quot;] = df[&quot;y_name&quot;].apply(clean_text)</pre><h3>3. Type Mapping</h3><p>Node types in PrimeKG can vary in format — e.g., “gene/protein”, “chemical/drug”, or redundant variants like “bioprocess”. These are mapped to canonical categories to simplify modelling and ensure consistency.</p><pre>node_type_mapping = {<br>    &quot;gene/protein&quot;: &quot;protein&quot;,<br>    &quot;chemical/drug&quot;: &quot;drug&quot;,<br>    &quot;drug&quot;: &quot;drug&quot;,<br>    &quot;disease&quot;: &quot;disease&quot;,<br>    ...<br>}</pre><h3>4. Extracting Node Names, Types, and Normalized Relations</h3><p>Before applying any graph machine learning technique, we need to <strong>structure the biomedical data</strong> in a way that respects its semantic complexity. 
In <strong>PrimeKG</strong>, each row represents a biologically meaningful link — such as a <strong>drug treating a disease</strong>, a <strong>gene associated with a phenotype</strong>, or a <strong>protein interacting with another protein</strong>.</p><p>But models like <strong>R-GCN</strong> or <strong>Node2Vec</strong> don’t just want a list of edges — they need a clear map of <strong>what each node is</strong>, <strong>what role it plays</strong>, and <strong>how it’s connected</strong>.</p><h3>Step 1: Assign Global Node IDs</h3><p>The first step is to collect all unique node names across both columns (x_name and y_name) and assign each one a <strong>global integer ID</strong>. This gives us a consistent reference for each entity throughout the graph.</p><pre>all_nodes = pd.concat([df[&quot;x_name&quot;], df[&quot;y_name&quot;]]).dropna().unique()<br>node_maps = {name: i for i, name in enumerate(sorted(all_nodes))}<br>print(f&quot;[INFO] Total unique nodes: {len(node_maps):,}&quot;)</pre><p><strong>Analogy</strong>: Think of this like assigning a library index number to every book — whether it’s in the “Medicine” section or “Biochemistry,” a unique ID is assigned to keep everything organised.</p><h3>Step 2: Normalise the Relation Map</h3><p>In a heterogeneous biomedical graph, relationships connect different types of nodes:</p><ul><li>A <strong>disease–protein</strong> interaction is different from a <strong>drug–effect</strong> link</li><li>Some relationships are <strong>directional</strong>, others <strong>symmetrical</strong></li></ul><p>To build a flexible yet structured graph, the relation types were normalised by sorting their source and target node types alphabetically. 
This ensures consistency and avoids duplication (e.g., drug→disease is treated the same as disease→drug if the model doesn&#39;t care about directionality).</p><pre>relation_map = {}<br><br>for rel in df[&quot;relation&quot;].unique():<br>    subset = df[df[&quot;relation&quot;] == rel]<br>    if subset.empty:<br>        continue<br><br>    type_pairs = set(zip(subset[&quot;x_type&quot;], subset[&quot;y_type&quot;]))<br><br>    for x_type, y_type in type_pairs:<br>        if x_type in node_type_mapping.values() and y_type in node_type_mapping.values():<br>            normalized_pair = tuple(sorted([x_type, y_type]))<br>            relation_map[rel] = normalized_pair<br><br>print(f&quot;[INFO] Total unique normalized relations: {len(relation_map):,}&quot;)</pre><p><strong>Analogy</strong>: This is like grouping roads on a map based on which areas they connect, regardless of direction — a road from “Hospital to Lab” is still the same route as “Lab to Hospital.”</p><h3>Step 3: Build the Node Metadata Table</h3><p>To keep track of each node’s <strong>type</strong> and <strong>global ID</strong>, a unified node_df was created that holds every unique node, its type (e.g., &quot;gene&quot;, &quot;disease&quot;), and the global ID that was previously assigned.</p><pre>node_df = pd.concat([<br>    df[[&quot;x_name&quot;, &quot;x_type&quot;]].rename(columns={&quot;x_name&quot;: &quot;node_name&quot;, &quot;x_type&quot;: &quot;node_type&quot;}),<br>    df[[&quot;y_name&quot;, &quot;y_type&quot;]].rename(columns={&quot;y_name&quot;: &quot;node_name&quot;, &quot;y_type&quot;: &quot;node_type&quot;})<br>]).dropna().drop_duplicates().reset_index(drop=True)<br><br>node_df[&quot;global_id&quot;] = node_df[&quot;node_name&quot;].map(node_maps)</pre><p><strong>Analogy</strong>: This is like creating a clean catalog where every book (node) has its <strong>genre (type)</strong> and <strong>index number (ID)</strong> — critical for graph construction.</p><h3>Constructing the Graph with 
Global Node Mapping</h3><h3>Step 1: Global Node Indexing</h3><p>Earlier in the pipeline, each unique biomedical entity (e.g., gene, disease, phenotype) was assigned a <strong>global integer ID</strong> using:</p><pre>node_maps = {name: i for i, name in enumerate(sorted(all_nodes))}</pre><p>This ensures that every node — regardless of its type — is mapped to a <strong>unique identifier</strong>, creating a <strong>flat, consistent index space</strong> that simplifies downstream processing.</p><h3>Step 2: Adding Edges to the Graph</h3><p>With node IDs in hand, the pipeline loops through each relationship type (from the normalised relation_map) and adds the corresponding edges:</p><pre>import networkx as nx<br><br>G = nx.Graph()<br><br>for rel in relation_map:<br>    rel_df = df[df[&#39;relation&#39;] == rel]<br>    <br>    src_indices = rel_df[&#39;x_name&#39;].map(node_maps).fillna(-1).astype(int)<br>    dst_indices = rel_df[&#39;y_name&#39;].map(node_maps).fillna(-1).astype(int)<br><br>    valid_edges = [(s, d) for s, d in zip(src_indices, dst_indices) if s != -1 and d != -1]<br>    G.add_edges_from(valid_edges)<br><br>print(&quot;\n[INFO] Graph constructed successfully with global node map.&quot;)</pre><p>Here’s what this does:</p><ul><li>For each relation, it selects the relevant rows from the dataset.</li><li>It converts the source and target node names into global IDs using the node_maps dictionary.</li><li>Any missing or invalid mappings are filtered out (using -1 as a sentinel).</li><li>All valid edges are added to the graph.</li></ul><h3>Learning Node Representations with Node2Vec</h3><p>Using <strong>Node2Vec</strong>, a model is trained to convert nodes into dense, continuous embeddings — capturing semantic and topological relationships between the entities. 
It is an <strong>unsupervised learning algorithm</strong> that learns low-dimensional embeddings for nodes by simulating random walks on the graph.</p><p>At its core, <strong>Node2Vec</strong> learns by simulating <em>random walks</em> across the graph — just like how Word2Vec learns word embeddings from natural language. It treats each node like a “word” and each walk like a “sentence.” By walking through the graph in flexible, biased ways (some walks stay local, others explore far), it captures both <strong>structural roles</strong> (e.g., hubs, bridges) and <strong>semantic proximity</strong> (e.g., diseases linked by shared phenotypes or pathways).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*w49oZQawbXtJNqcXrdwYZQ.png" /><figcaption>Image Generated using ChatGPT-4o</figcaption></figure><p>The result? A dense embedding space where <strong>nodes with similar roles or connections are embedded close together</strong> — even if they aren’t directly connected.</p><p>This makes it possible for ML models to detect missing links, suggest biological analogies, and uncover latent similarities — all from the <em>geometry</em> of the graph.</p><h4>Converting NetworkX Graph to PyTorch Geometric Format</h4><p>Once the undirected graph G is constructed using NetworkX, the next step is to embed its nodes into a continuous vector space using the <strong>Node2Vec</strong> algorithm. These embeddings are designed to capture the <strong>structural roles</strong> and <strong>semantic context</strong> of nodes based on their local and global neighbourhoods.</p><p>To do this effectively in PyTorch Geometric (PyG), the graph needs to be transformed into a format that PyG understands. This is done using:</p><pre>from torch_geometric.utils import from_networkx<br>pyg_graph = from_networkx(G)</pre><p>The from_networkx call converts the G object (a standard NetworkX graph) into a torch_geometric.data.Data object. 
The resulting pyg_graph includes PyG-friendly attributes like edge_index, a 2D tensor that defines the graph&#39;s connectivity in terms of source and target node indices.</p><blockquote><em>This format allows PyG models like </em><em>Node2Vec, </em><em>GCN, or </em><em>RGCN to efficiently process the graph, optimise over its structure, and learn expressive embeddings.</em></blockquote><p>The edge_index serves as the backbone of all graph-based computations in PyG, enabling operations like random walks, message passing, and convolution to be implemented seamlessly.</p><h4>Sending to Device (CPU/GPU)</h4><pre>pyg_graph.edge_index = pyg_graph.edge_index.to(device)</pre><p>To leverage GPU acceleration (if available), the graph’s edge list is moved to the appropriate device.</p><h4>Initializing Node2Vec</h4><p>Node2Vec doesn’t just look at who’s connected to whom — it <strong>walks the graph like a tourist</strong>, exploring local and global neighbourhoods to uncover hidden structural patterns.</p><p>It combines two clever ideas:</p><ul><li><strong>Random Walks</strong>:<br> For each node, Node2Vec simulates multiple random walks — like sending out a curious explorer to roam the neighbourhood. These walks create sequences of nodes, kind of like sentences in a language.</li><li><strong>Skip-Gram Model</strong>:<br> Inspired by <strong>Word2Vec</strong>, the skip-gram model learns to predict a node’s neighbours (context) from these sequences. 
It treats nodes like words and walk sequences like sentences, capturing <strong>how often and in what order</strong> nodes appear together.</li></ul><pre>from torch_geometric.nn import Node2Vec<br><br>node2vec = Node2Vec(<br>    pyg_graph.edge_index,<br>    embedding_dim=128,<br>    walk_length=10,<br>    context_size=5,<br>    walks_per_node=20,<br>    num_negative_samples=1<br>).to(device)</pre><p>The configuration used in this project was chosen to balance <strong>exploration</strong> and <strong>efficiency</strong> during the Node2Vec training process:</p><ul><li>embedding_dim=128: Each biomedical entity—be it a disease, gene, or drug—is represented by a <strong>128-dimensional vector</strong>, capturing its structural and semantic context in the graph.</li><li>walk_length=10: Each simulated random walk explores <strong>10 steps</strong> from a starting node, allowing it to traverse across nearby biological relationships (e.g., a disease → protein → drug → pathway).</li><li>context_size=5: Each walk is scanned with a sliding window of <strong>5 nodes</strong>, and only the nodes that share a window with a given node are treated as its <em>context</em>. This is akin to saying: <em>“Which other genes are typically discussed near BRCA1 in biomedical pathways?”</em></li><li>walks_per_node=20: The model simulates <strong>20 random walks per node</strong>, giving it enough exposure to both local and global graph structure. For instance, breast cancer might co-occur with immune genes, metabolic pathways, or co-morbid phenotypes in different walks.</li><li>num_negative_samples=1: For every <em>positive pair</em> (e.g., Breast Cancer ↔ TP53, which co-occur in a walk), one <em>negative pair</em> is generated by randomly sampling unrelated nodes (e.g., Breast Cancer ↔ Toe curvature). 
This teaches the model to <strong>pull meaningful pairs closer</strong> while <strong>pushing irrelevant pairs apart</strong>.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/740/1*AYYBG1GkJtRVLUQbSDbqBA.png" /></figure><p>This setup enables Node2Vec to learn embeddings that reflect <strong>biomedical semantics</strong>, even though the training is entirely <strong>unsupervised</strong>. The internal buffers rowptr and col, essential for sampling operations, are also moved to the correct device (GPU or CPU) to ensure efficient execution.</p><h4>Edge Sampling for Training</h4><pre>train_edges, val_edges = train_test_split(...)</pre><p>Instead of processing the entire graph at once, the training loop samples mini-batches of edges. A 90–10 split is used for training and validation.</p><h4>Training Loop with Early Stopping</h4><p>The model is trained using the Adam optimizer. For each epoch:</p><ol><li>A batch of nodes is sampled based on the edge list.</li><li><strong>Positive random walks</strong> and <strong>negative random walks</strong> are generated.</li><li>The model computes the loss, backpropagates it, and updates the weights.</li><li>Validation loss is computed and monitored.</li></ol><pre>if val_loss.item() &lt; best_loss:<br>    ...<br>    torch.save(...)</pre><p>If the validation loss improves, the model is saved. Otherwise, a counter is incremented. Training stops early if no improvement is seen for 200 consecutive epochs.</p><h3>Visualizing Node2Vec Embeddings with t-SNE</h3><p>The Node2Vec model was trained on an undirected biomedical graph to learn vector representations for each node based on its local and global connectivity. 
After training, these embeddings were projected into two dimensions using <strong>t-SNE</strong>, a non-linear dimensionality reduction technique that preserves local structure and neighbourhood relationships.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*l7yXln5hnZpsfApw5h_iug.png" /><figcaption>Before Node2Vec training</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*n5QTXYR3LyP8EH52x8Ys5A.png" /><figcaption>Embedding after Node2Vec training</figcaption></figure><h3>Visualising the Graph Embeddings with t-SNE</h3><p>The scatter plot above presents a <strong>t-SNE projection</strong> of node embeddings from the PrimeKG graph — with each point representing a node (e.g., disease, phenotype, drug, protein, pathway, or biological process). The axes (<strong>Component 1</strong> and <strong>Component 2</strong>) are <strong>abstract latent dimensions</strong> created by t-SNE and don’t correspond to specific biomedical properties. 
Instead, they are used to help us <strong>visualise structural similarity</strong> in a 2D space.</p><p>Nodes that appear closer together in this plot were likely embedded with <strong>similar structural contexts</strong> — meaning they share common neighbors, appear in similar paths, or participate in similar types of relationships within the graph.</p><p>This version of the plot is <strong>subsampled</strong> to improve visibility while maintaining the distributional structure of the full graph.</p><h3>Key Observations</h3><ul><li><strong>No single dominant cluster</strong> is present, but we observe <strong>dense regions of overlap</strong> where nodes of different types co-locate — reflecting the highly <strong>interconnected nature</strong> of biomedical entities in PrimeKG.</li><li><strong>Drugs and proteins</strong> appear more uniformly scattered, consistent with their <strong>broad connectivity</strong> across multiple biomedical contexts (e.g., a drug linking to diseases, pathways, and targets).</li><li><strong>Phenotypes and diseases</strong> still form <strong>partial clusters</strong>, often overlapping — which makes sense biologically, as phenotypes are often <strong>clinical manifestations of diseases</strong>, and their embeddings are shaped by similar neighbourhood structures.</li><li><strong>Pathways and biological processes</strong> are sparsely spread out, possibly due to <strong>lower edge density</strong> or fewer random walk interactions — indicating they may function more as <strong>semantic anchors</strong> in the graph than highly connected hubs.</li></ul><h3>What This Means</h3><p>Despite the noise introduced by subsampling and the non-deterministic nature of t-SNE, there are clear <strong>semantic signals</strong> emerging from the structure:</p><ul><li>Nodes of similar types often <strong>drift toward local neighborhoods</strong>, showing that the <strong>graph structure preserves contextual semantics.</strong></li><li>The fact that 
different biomedical entities aren’t isolated but rather <strong>entangled in shared regions</strong> is reflective of real-world biology, where <strong>interdependencies are the norm.</strong></li></ul><p>These patterns validate that the <strong>graph construction and embedding pipeline</strong> is working — capturing not just node proximity, but meaningful <strong>biomedical associations</strong> that reflect the underlying complexity of healthcare knowledge.</p><h3>Cosine Similarity Between Top Disease Embeddings</h3><p>After training Node2Vec embeddings on the PrimeKG graph, it becomes possible to quantify how “similar” any two nodes are in the latent space using <strong>cosine similarity</strong>. The heatmap below visualises pairwise similarities between a curated set of <strong>disease nodes</strong>, helping assess whether the learned embeddings reflect intuitive medical relationships.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1006/1*H4t897YW1E5PtbdbNjZPcA.png" /></figure><p>The heatmap above shows the <strong>cosine similarity</strong> between the learned embeddings of three related disease nodes: <strong>hypertension</strong>, <strong>insulin resistance</strong>, and <strong>metabolic syndrome</strong>. Each cell represents the cosine similarity score between two disease embeddings. 
As expected:</p><ul><li>A score of <strong>1.00</strong> (along the diagonal) reflects perfect self-similarity.</li><li>Values closer to <strong>0.00</strong> suggest low or orthogonal similarity.</li><li><strong>Higher off-diagonal values</strong> indicate that the model sees those diseases as structurally or contextually similar in the graph.</li></ul><h3>Interpreting the Patterns</h3><ul><li>The embedding similarity between <strong>hypertension and metabolic syndrome</strong> is the highest among the pairs (<strong>0.18</strong>), which may reflect their shared connection to cardiovascular and metabolic pathways.</li><li><strong>Insulin resistance</strong> has modest similarity with <strong>metabolic syndrome (0.14)</strong> and <strong>hypertension (0.07)</strong>, indicating weaker but non-random alignment — possibly due to sparse shared phenotypes or indirect links via common co-morbidities.</li><li>Despite their biomedical relevance to each other, the similarity values are still <strong>low in absolute terms</strong>, which highlights the structural sparsity and specificity of disease nodes in PrimeKG.</li></ul><h3>Why This Is Useful</h3><p>This kind of similarity analysis provides a <strong>semantic lens into the embedding space</strong> — giving us clues about how the model interprets disease relationships based on graph structure. It can be valuable for:</p><ul><li><strong>Clustering diseases</strong> by mechanism or shared phenotypes</li><li><strong>Identifying potential co-morbidities</strong> based on shared neighbourhoods</li><li><strong>Prioritising links</strong> for drug repurposing or phenotype prediction</li><li><strong>Filtering noise</strong> by removing structurally irrelevant candidates in downstream tasks</li></ul><h3>Why Are Similarity Scores Still So Low?</h3><p>At first glance, one might expect diseases like <strong>insulin resistance</strong> and <strong>metabolic syndrome</strong> to have much higher similarity. 
But here’s why the scores remain low — and why that’s not necessarily a flaw:</p><h4>1. Node2Vec is structure-aware, not domain-aware</h4><p>Node2Vec learns from <strong>walk patterns</strong>, not domain semantics. Two diseases might be biologically related but embedded in separate neighbourhoods if they don’t share enough <strong>graph connectivity</strong>.</p><h4>2. Cosine similarity focuses on direction, not magnitude</h4><p>Cosine similarity captures <strong>directional alignment</strong> and ignores vector magnitude. In a 128-dimensional space, two vectors score high only when they point the same way across many dimensions, so even nodes with meaningful overlap can show low similarity when their walk contexts only partly coincide.</p><h3>Link Prediction with Logistic Regression</h3><p>Once the <strong>Node2Vec model</strong> has learned low-dimensional vector representations (embeddings) for each node in the graph, the next step is to <strong>predict whether two nodes should be connected</strong> — even if they currently aren’t. This task is called <strong>link prediction</strong>.</p><p>Think of it like asking:</p><blockquote><em>“Based on their embedding vectors, is there a high chance that </em>disease A<em> and </em>phenotype B<em> are biologically connected?”</em></blockquote><h3>Why Embeddings Matter Here</h3><p>Each node (like <em>asthma</em> or <em>IL6 protein</em>) is now represented by a <strong>128-dimensional vector</strong> that encodes its structural and contextual role in the graph. These embeddings serve as <strong>features</strong> for traditional machine learning models.</p><h3>Step 1: Extract Positive Edges for a Specific Relation</h3><p>The process begins by collecting <strong>real edges</strong> for a biomedical relation of interest — for example, &quot;disease_phenotype_positive&quot;. 
These are the known connections that serve as <strong>positive training examples</strong>.</p><pre>relation_edges = np.sort(<br>    df[df[&quot;relation&quot;] == relation_name][[&quot;x_name&quot;, &quot;y_name&quot;]].values.astype(&quot;U&quot;),<br>    axis=1<br>)<br>relation_edges = np.unique(relation_edges, axis=0)</pre><ul><li>The dataset is filtered for the desired relation type.</li><li>The node pairs are sorted <strong>alphabetically</strong> to treat edges as <strong>undirected</strong>.</li><li>Duplicates are removed with np.unique to ensure each positive edge is counted only once.</li></ul><h3>Step 2: Collect Valid Nodes by Type</h3><p>The list of valid nodes is extracted for the given source and target types (e.g., disease, phenotype) from the cleaned node metadata:</p><pre>src_nodes = list(node_df[node_df[&quot;node_type&quot;] == src_type][&quot;node_name&quot;])<br>tgt_nodes = list(node_df[node_df[&quot;node_type&quot;] == tgt_type][&quot;node_name&quot;])</pre><p>This ensures that negatives are sampled from the correct subsets of nodes.</p><h3>Step 3: Map Positive Edges to Global Node IDs</h3><p>All valid node pairs in the positive edge set are mapped to their corresponding <strong>global integer IDs</strong> (as required for embedding lookup and modelling):</p><pre>pos_edges = np.array([<br>    [node_maps[x], node_maps[y]]<br>    for x, y in relation_edges<br>    if x in node_maps and y in node_maps<br>])</pre><h3>Step 4: Generate Negative Samples</h3><p>Since link prediction is a <strong>binary classification task</strong>, we also need <strong>negative examples</strong> — node pairs that are <strong>not connected</strong> in the graph. 
For this reason, the same number of fake edges is generated by randomly pairing source and target nodes:</p><pre>num_samples = len(pos_edges)<br>neg_edges = np.array([<br>    [node_maps[random.choice(src_nodes)], node_maps[random.choice(tgt_nodes)]]<br>    for _ in range(num_samples)<br>])</pre><p>Note: The code above does not check sampled pairs against the positive set, so these synthetic negatives may occasionally include known edges as well as real but <strong>unlabelled edges</strong>. This introduces some noise, but it is common in graph-based negative sampling.</p><h3>Step 5: Train–Test Split</h3><p>The <strong>positive</strong> and <strong>negative</strong> edges are split separately into 80% training and 20% testing:</p><pre>pos_train, pos_test = train_test_split(pos_edges, test_size=0.2, random_state=42)<br>neg_train, neg_test = train_test_split(neg_edges, test_size=0.2, random_state=42)</pre><h3>Step 6: Compute Edge Features</h3><p>Each edge (whether positive or negative) is represented by the <strong>dot product</strong> of its two node embeddings:</p><pre>def edge_features(edges):<br>    return (embeddings[edges[:, 0]] * embeddings[edges[:, 1]]).sum(dim=1).view(-1, 1)<br></pre><ul><li>The dot product measures <strong>vector alignment</strong> — a simple proxy for similarity.</li><li>Higher values suggest a stronger connection between the two nodes.</li></ul><p>Then the feature matrices and labels are constructed:</p><pre>X_train = torch.cat([edge_features(pos_train), edge_features(neg_train)], dim=0).cpu().numpy()<br>y_train = np.array([1] * len(pos_train) + [0] * len(neg_train))<br><br>X_test = torch.cat([edge_features(pos_test), edge_features(neg_test)], dim=0).cpu().numpy()<br>y_test = np.array([1] * len(pos_test) + [0] * len(neg_test))</pre><h3>Step 7: Train Logistic Regression</h3><pre>model = LogisticRegression(class_weight=&quot;balanced&quot;, max_iter=1000)<br>model.fit(X_train, y_train)</pre><ul><li>class_weight=&quot;balanced&quot; guards against class imbalance (the two classes here are already equal in size by construction).</li><li>max_iter=1000 gives the solver more iterations to converge on 
larger datasets.</li></ul><h3>Step 8: Score a Specific Node Pair</h3><p>Lastly, the model can be queried for a <strong>specific disease–phenotype</strong> pair — like:</p><pre>&quot;permanent neonatal diabetes mellitus&quot; ↔ &quot;retinopathy&quot;<br></pre><p>The dot product is computed, passed through the classifier, and a probability is returned:</p><pre>pair_feat = (embeddings[u] * embeddings[v]).sum().item()<br>pair_score = model.predict_proba(np.array([[pair_feat]]))[:, 1][0]</pre><h4>Logistic Regression for Disease–Phenotype Link Prediction</h4><p>This evaluation tests whether simple logistic regression on Node2Vec embeddings can predict meaningful biomedical associations. Specifically, it targets the <strong>disease_phenotype_positive</strong> relation—i.e., known links between diseases and observable phenotypes.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/806/1*u47MkUk8g1Q5ONLQh_2EIw.png" /></figure><p>The classifier was trained using <strong>dot-product-based features</strong> between node embeddings for each disease–phenotype pair. The results are broken down into:</p><h4>Evaluation Metrics</h4><ul><li><strong>Accuracy: 0.5009</strong><br> The model is only marginally better than random guessing (which would be ~0.50 in a balanced binary setup).</li><li><strong>Precision: 0.5010</strong><br> Only about half of the predicted links are correct.</li><li><strong>Recall: 0.4677</strong><br> The model misses more than half of the actual links.</li><li><strong>F1 Score: 0.4709</strong><br> A harmonic mean of precision and recall, reflecting their overall balance.</li><li><strong>ROC-AUC: 0.5100</strong><br> Shows poor separation between true and false links, close to chance level (0.5).</li></ul><p>These metrics highlight a limitation: although embeddings are informative, <strong>logistic regression alone cannot capture the complexity</strong> of biomedical graph structures.</p><h4>Specific Pair Score</h4><p>The logistic regression model was also queried for a specific edge:</p><blockquote><strong><em>“permanent neonatal diabetes mellitus” ↔ “retinopathy”</em></strong></blockquote><ul><li><strong>Predicted probability</strong>: 0.5027</li></ul><p>This score is barely above 0.5, suggesting the model has <strong>low confidence</strong> in this edge’s existence.</p><h4>Summary</h4><p>This experiment demonstrates that <strong>logistic regression over simple dot-product embeddings is insufficient</strong> for nuanced biomedical link prediction. While this setup works as a baseline, it motivates the use of more powerful models like XGBoost, GNNs, or transformer-based approaches for improved prediction quality.</p><h4>Will XGBoost be any better?</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*VbT7zjhDm9NgkHWp8SA0lA.png" /></figure><h3>XGBoost for Link Prediction on Biomedical Graphs</h3><p>To improve upon the earlier baseline, an <strong>XGBoost classifier</strong> was trained using concatenated Node2Vec embeddings for disease–phenotype pairs. 
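</p><p>For illustration, the difference between the single dot-product feature used earlier and concatenated pair features can be sketched as follows. This is a minimal sketch, not the notebook’s actual code: the random <em>embeddings</em> array stands in for the learned Node2Vec vectors, and the helper names <em>dot_features</em> and <em>concat_features</em>, the node IDs, and the source-then-target ordering are all assumptions made here.</p><pre>import numpy as np

# Stand-in for learned Node2Vec vectors: 5 nodes, 128 dimensions (random, illustrative only).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5, 128))

# Candidate pairs as rows of [source_id, target_id] (hypothetical IDs).
edges = np.array([[0, 1], [2, 3], [4, 0]])

def dot_features(edges, embeddings):
    # One scalar per pair: the dot product of the two node vectors.
    return (embeddings[edges[:, 0]] * embeddings[edges[:, 1]]).sum(axis=1, keepdims=True)

def concat_features(edges, embeddings):
    # One row per pair: the source vector followed by the target vector.
    return np.hstack([embeddings[edges[:, 0]], embeddings[edges[:, 1]]])

X_dot = dot_features(edges, embeddings)
X_concat = concat_features(edges, embeddings)
print(X_dot.shape, X_concat.shape)
</pre><p>With 128-dimensional embeddings, concatenation hands the classifier 256 values per pair instead of one, which gives a tree-based model far more to split on.</p><p>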
This setup leverages tree-based learning to better capture <strong>nonlinear relationships</strong> between node representations in the biomedical graph.</p><p>The evaluation focused again on the <strong>disease_phenotype_positive</strong> relation.</p><h3>Performance Metrics</h3><ul><li><strong>Accuracy: 0.80</strong><br> About 80% of the predictions were correct.</li><li><strong>Precision: 0.80</strong><br> High precision means most predicted links are actually relevant.</li><li><strong>Recall: 0.79</strong><br> The model successfully retrieved a large portion of the true links.</li><li><strong>F1 Score: 0.80</strong><br> Reflects a strong balance between precision and recall.</li><li><strong>ROC-AUC: 0.88</strong><br> Indicates good discrimination between positive and negative link predictions.</li></ul><p>These results reflect a <strong>clear performance boost over logistic regression</strong>, highlighting <strong>XGBoost’s ability</strong> to capture richer, non-linear patterns from the node embeddings.</p><h4>Specific Pair Score</h4><p>The model was queried for the link:</p><blockquote><strong><em>“permanent neonatal diabetes mellitus” ↔ “retinopathy”</em></strong></blockquote><ul><li><strong>Predicted probability</strong>: 0.4467</li></ul><p>Interestingly, while overall performance is strong, the probability for this specific pair is <strong>lower than expected</strong>, possibly due to <strong>data sparsity or lack of direct co-occurrence</strong> in the walk-based embedding generation process.</p><p>XGBoost proves to be a <strong>powerful link predictor</strong> in the biomedical domain when trained on structural node embeddings. 
It outperforms logistic regression by a large margin and serves as a <strong>strong baseline</strong> for future comparison with more complex models like <strong>Graph Neural Networks</strong> or <strong>attention-based link predictors</strong>.</p><h3>Conclusion</h3><p>This blog explored the use of <strong>Node2Vec embeddings</strong> on the PrimeKG biomedical graph, followed by <strong>traditional machine learning models</strong> (Logistic Regression and XGBoost) for <strong>link prediction</strong> between disease and phenotype nodes. While XGBoost outperformed Logistic Regression with significantly better precision and AUC scores, both models struggled to <strong>capture the complex semantics</strong> of biomedical relationships. The cosine similarity heatmap further revealed that even with high-dimensional embeddings, the latent space remained <strong>weakly informative</strong> when it came to reflecting true biological proximity.</p><p>This outcome highlights an important limitation: <strong>traditional ML models operating on static embeddings are not sufficient</strong> for relational reasoning in multi-relational graphs like PrimeKG. They treat the problem as a classification task over vector pairs, overlooking the rich contextual interactions and multi-hop dependencies within the graph.</p><p>The full code is available on <a href="https://github.com/amulya-prasad/XplainMD/blob/master/Notebooks/XplainMD_Training_with_ML_models_Part2.ipynb">Github</a></p><h3>Coming Up Next:</h3><p>XplainMD Part 3: <a href="https://medium.com/@fhirshotlearning/xplainmd-part-3-relational-gcn-gnnexplainer-learning-explaining-links-6a4a290819fc?source=friends_link&amp;sk=daed06bf3a79107abb518c6fb2590002">Relational GCN + GNNExplainer: Learning &amp; Explaining Links</a></p><p>In this blog we explored how shallow models like <strong>Node2Vec + XGBoost</strong> can uncover patterns in biomedical graphs. 
Now, it’s time to level up.</p><p>In the next part of this series, we dive into <strong>Relational Graph Convolutional Networks (R-GCN)</strong> — a graph-native neural architecture built to learn directly from <strong>multi-relational knowledge graphs</strong> like PrimeKG.</p><p>Unlike traditional pipelines, <strong>R-GCN dynamically updates node representations based on both edge types and neighbourhood structure</strong>, capturing more of the semantics of biomedical relationships.</p><p>But we won’t stop at prediction.</p><p><strong>Explainability</strong> will take center stage with the introduction of <strong>GNNExplainer</strong>, a tool that reveals the “why” behind each link prediction by uncovering the <strong>subgraph structures and features</strong> that drive the model’s decisions.</p><p>This next post will show how <strong>R-GCN + GNNExplainer</strong> work together to produce <strong>trustworthy, interpretable insights</strong> — a must-have in domains like <strong>drug discovery</strong>, <strong>clinical reasoning</strong>, and <strong>precision medicine</strong>.</p><p>Stay tuned: this is where machine learning meets meaning.</p><h3>References:</h3><ol><li>Grover, A. and Leskovec, J., 2016, August. node2vec: Scalable feature learning for networks. In <em>Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining</em> (pp. 855–864): <a href="https://arxiv.org/abs/1607.00653">https://arxiv.org/abs/1607.00653</a></li></ol>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[XplainMD Part 1: A visual exploration of PrimeKG]]></title>
            <link>https://medium.com/@fhirshotlearning/xplainmd-part-1-a-visual-exploration-of-primekg-cf032cbb864a?source=rss-7e548aa5925b------2</link>
            <guid isPermaLink="false">https://medium.com/p/cf032cbb864a</guid>
            <category><![CDATA[graph-theory]]></category>
            <category><![CDATA[graph-data-science]]></category>
            <category><![CDATA[knowledge-graph]]></category>
            <dc:creator><![CDATA[FHIR Shot Learning]]></dc:creator>
            <pubDate>Wed, 09 Apr 2025 10:23:55 GMT</pubDate>
            <atom:updated>2025-04-13T06:08:35.345Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*G5-8gekXfsMGEALlMt0miA.png" /><figcaption>Island Clusters of Biomedical Relations in PrimeKG</figcaption></figure><h3>Mapping the PrimeKG Graph: A Visual and Analytical Journey</h3><p>In the <a href="https://medium.com/@fhirshotlearning/xplainmd-a-graph-powered-guide-to-smarter-healthcare-fd5fe22504de?source=friends_link&amp;sk=d2b2f55b5f2e8a57d1831b4f646c6020">introductory post</a>, a future was envisioned where AI doesn&#39;t just make clinical predictions but also <strong>explains</strong> them. A future where biomedical knowledge is <strong>structured</strong>, <strong>transparent</strong>, and <strong>interactive</strong> — not buried in black-box models or locked away in tabular data.</p><p>But for an AI system to reason this way — to connect diseases with phenotypes, drugs with proteins, and uncover the biological logic behind them — it needs a foundation that is both <strong>rich in context</strong> and <strong>grounded in structure</strong>.</p><p>That foundation is <strong>PrimeKG!</strong></p><p>Before training any model, it is essential to first <strong>understand the data</strong> — not just its format, but its <strong>form</strong>, its <strong>relationships</strong>, and its <strong>interpretability</strong>. This post offers a <strong>visual deep dive into PrimeKG</strong>, a richly curated precision medicine knowledge graph developed by researchers at Harvard. It explores the graph’s composition, its entities and relationships, and why it serves as an ideal substrate for building explainable, graph-based reasoning systems in healthcare.</p><p>Graphs, unlike traditional datasets, <strong>speak a different language</strong>. They capture <strong>connections</strong>, <strong>structures</strong>, and <strong>hierarchies</strong> — and as such, they require <strong>unique visualisation techniques</strong>. 
This post also serves as a gentle introduction to the world of graph data science, offering a glimpse into how graphs can enhance transparency and inference in biomedical AI applications.</p><p>But first, to truly appreciate the visualisations, it’s important to understand a few foundational concepts in graph theory and how they apply to structured biomedical data.</p><h3>Understanding Graph Basics Through a Biomedical Example</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/574/1*BSDjsmXWkRnRI2KXeD4PnQ.png" /><figcaption>Shortest Path between Autism Spectrum Disorder and Tamoxifen</figcaption></figure><p>The image above represents a <strong>miniature subgraph</strong> — a small portion of a much larger biomedical knowledge graph like <strong>PrimeKG</strong>. Even with just three nodes and two connections, it demonstrates how graphs capture <strong>biological relationships</strong> in a structured, interpretable way.</p><h4>Nodes (Entities)</h4><p>Each <strong>circle</strong> represents a <strong>node</strong> — an entity in the biomedical world:</p><ul><li><strong>Tamoxifen</strong> (green) → a drug</li><li><strong>AR</strong> (blue) → a protein (Androgen Receptor)</li><li><strong>Autism Spectrum Disorder</strong> (red) → a disease</li></ul><p>In a traditional dataset, these might be separate rows or values in unrelated columns.<br> But in a graph, they’re <strong>connected</strong> — and those connections carry meaning.</p><h4>Edges (Relationships)</h4><p>The <strong>curved lines</strong> represent <strong>edges</strong>, or <strong>relationships</strong> between entities:</p><ul><li><strong>drug_protein</strong> → Tamoxifen interacts with the AR protein</li><li><strong>disease_protein</strong> → AR is associated with Autism Spectrum Disorder</li></ul><p>Each edge is labelled, meaning it’s not just a connection, but a <strong>specific kind of biological relationship</strong>.</p><p>This distinction is important: graphs allow the model to know 
that <em>“Tamoxifen targets AR”</em> is a different kind of interaction than <em>“AR is linked to Autism”</em> — even though they share a common node.</p><h4>Degree</h4><p>The degree of a node is the number of edges attached to it. In the subgraph, AR has a degree of 2.</p><h3>Why This Matters</h3><p>This small subgraph illustrates how <strong>knowledge graphs integrate diverse biomedical data</strong> into a single, connected structure. With this one visualisation, we can begin to ask:</p><ul><li>Could <strong>Tamoxifen</strong>, typically used in other contexts, have a <strong>potential repurposing role</strong> in conditions linked to AR?</li><li>Is <strong>AR</strong> acting as a <strong>bridge</strong> between different diseases and drug mechanisms?</li><li>What other entities are connected to this path?</li></ul><p>In traditional tabular data, surfacing questions like these would require manually stitching together databases. In a graph, the structure <strong>makes these paths visible</strong> — and ready for both human interpretation and machine reasoning.</p><p>Now that some of the basics are clear, let’s dive in and explore the beauty of graph theory and how it can revolutionise healthcare in the long run.</p><h3>1. Introduction &amp; Setup</h3><p>As mentioned earlier, PrimeKG is a massive precision medicine knowledge graph containing diverse entities: drugs, diseases, phenotypes, proteins, and more. Each relation (like disease_protein or drug_effect) forms a valuable edge type. The raw dataset is available on the project’s <a href="https://github.com/mims-harvard/PrimeKG">GitHub</a> page.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/436/1*BSnvCHk5CRqGvSGAYNYoKQ.png" /><figcaption>Unique nodes and Edges</figcaption></figure><p>As the figure shows, it contains a total of 129,375 unique nodes and 8,100,128 unique edges. 
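</p><p>As a rough illustration of how such totals could be computed, here is a minimal sketch on a toy edge list. It is not the notebook’s actual code: the rows are made up, nodes are identified by name only, and treating edges as undirected is an assumption made here. Distinct names on either end of an edge give the node count, and order-normalised pairs give the edge count:</p><pre># Toy edge list as (relation, x_name, y_name) tuples; the rows are illustrative only.
rows = [
    ('drug_protein', 'Tamoxifen', 'AR'),
    ('drug_protein', 'AR', 'Tamoxifen'),  # same edge stored in the reverse direction
    ('disease_protein', 'AR', 'autism spectrum disorder'),
]

# Unique nodes: every distinct name seen on either end of an edge.
unique_nodes = {name for _, x, y in rows for name in (x, y)}

# Unique edges: sort each pair so that (a, b) and (b, a) count once per relation.
unique_edges = {(rel, *sorted((x, y))) for rel, x, y in rows}

print('Total unique nodes:', len(unique_nodes))
print('Total unique edges:', len(unique_edges))
</pre><p>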
To keep the project manageable, I selected a subset of relations and built a smaller dataframe from them; this filtered dataframe is used throughout the rest of the project.</p><p>The final list looks like this:</p><pre>selected_relations = [<br>    &quot;protein_protein&quot;,<br>    &quot;disease_phenotype_positive&quot;,<br>    &quot;bioprocess_protein&quot;,<br>    &quot;disease_protein&quot;,<br>    &quot;drug_effect&quot;,<br>    &quot;pathway_protein&quot;,<br>    &quot;disease_disease&quot;,<br>    &quot;contraindication&quot;,<br>    &quot;drug_protein&quot;,<br>    &quot;indication&quot;<br>]</pre><pre>Total unique nodes: 68857<br>Total unique edges: 1803526</pre><h3>2. Basic Graph Exploration</h3><p>Let’s begin with basic graph exploration by examining which types of nodes are most prevalent in the dataset. Unsurprisingly, the figure below shows <strong>gene/protein</strong> nodes dominate — a pattern that’s typical in biomedical knowledge graphs. Genes and proteins serve as central players in biological systems:<br> they are <strong>associated with diseases</strong>, often as causal factors or biomarkers, and they also <strong>interact directly with drugs</strong> through mechanisms like binding, inhibition, or activation. Their high connectivity and biological significance make them pivotal in understanding disease mechanisms and therapeutic strategies, which is reflected in their heavy representation within PrimeKG.</p><p><strong>Node Types</strong> (e.g., disease, protein, drug)</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/844/1*uxJHhWXomSQmAnRCthNx5Q.png" /></figure><p>Next, the <strong>edge type distribution</strong> is explored to understand how relationships are represented in the precision medicine knowledge graph. As shown in the figure below, the <strong>protein_protein</strong> edge type overwhelmingly dominates the graph. 
This is expected — proteins are the functional workhorses of the cell and engage in a wide variety of interactions, ranging from signalling and metabolic processes to forming structural complexes. These interactions are not only abundant but also critical for downstream biological effects, which explains their heavy representation in datasets like PrimeKG. Other prominent relations include <em>disease–phenotype</em>, <em>bioprocess–protein</em>, and <em>disease–protein</em> interactions, all of which are vital for modelling biological mechanisms and clinical conditions.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/858/1*v9CK-Q1r9fBH4tue4Na8WQ.png" /></figure><p>To understand which entities are most connected in the graph, the <strong>Top 15 nodes by degree</strong> were computed. As mentioned above, degree here refers to the number of connections (edges) a node has with other nodes. The results, shown in the bar chart below, highlight <strong>UBC</strong> (a ubiquitin C protein) as the most connected node in the graph. 
This makes biological sense, as UBC plays a central role in protein degradation and signalling pathways, interacting with a large number of other proteins.</p><p>Interestingly, <strong>Autosomal Recessive Inheritance</strong> and <strong>Autosomal Dominant Inheritance</strong> also appear among the highest-degree nodes, underlining the prevalence of inheritance patterns across multiple diseases and phenotypes in biomedical data.</p><p>Other high-degree nodes include various diseases like <strong>breast cancer</strong>, <strong>hereditary breast ovarian cancer syndrome</strong>, and <strong>squamous cell carcinoma</strong>, as well as proteins like <strong>TRAF2</strong>, <strong>ETS1</strong>, and <strong>PLCG1</strong> — all known for their involvement in major biological and pathological processes.</p><p>These highly connected nodes (or hubs) can play a pivotal role in learning embeddings, as they influence the message-passing process (which will be discussed in upcoming posts) more significantly than low-degree nodes.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-4pKKMHxQQ3ko_AmlOkzAw.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*3Tag7DNJKlqGome1Bcpvjw.png" /><figcaption>Local subgraph around ‘breast cancer’ illustrating multi-relation connections with key genes, proteins, and phenotypes.</figcaption></figure><p>The figure above is a <strong>biomedical constellation</strong>, with <strong>breast cancer at its center</strong>.</p><p>Each line, each node, each colour tells a story:</p><ul><li>In the center, <strong>Breast Cancer</strong> (gold) acts as the anchor point — the node that was queried.<br> Surrounding it are <strong>25 of its most connected neighbours</strong>, forming a mini-universe of <strong>proteins</strong>, <strong>genes</strong>, and other diseases that play a role in its biology.</li><li>The <strong>sky-blue coloured circles</strong> are <strong>proteins or genes</strong> 
linked to breast cancer — some may be drug targets, some might influence pathways related to tumour growth or suppression.</li><li>The <strong>salmon coloured nodes</strong> represent other <strong>diseases (breast carcinoma, breast neoplasm)</strong>, possibly sharing phenotypes, risk factors, or even common therapeutic strategies.</li><li>The <strong>purple coloured edges</strong> labelled disease_protein highlight known associations — like “this protein is known to interact with this disease.”</li></ul><p>The <strong>more connected a node</strong>, the more central it may be to the disease’s behaviour — or, potentially, to its treatment.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*DRklY9hKHZMRQZsUPPZLRA.png" /><figcaption>Top 10 neighbours of breast cancer. The yellow node represents breast cancer while the blue nodes represent the proteins it is associated with.</figcaption></figure><h4>Why This Matters?</h4><p>Graphs like this aren’t just visual aids — they’re interactive blueprints of biomedical reasoning.</p><p>Instead of parsing flat gene lists or static co-morbidity tables, this structure invites deeper questions:</p><ul><li>What proteins serve as bridges between breast cancer and other diseases?</li><li>Are there central hubs in this neighbourhood — nodes that consistently appear across multiple disease pathways?</li><li>Could any of these connections hint at drug repurposing opportunities or open up new lines of inquiry?</li></ul><p>This is the power of graph-based storytelling: not just identifying <em>what</em> is connected, but uncovering <em>how</em> and <em>why</em> those connections could matter.</p><p>To explore these patterns further, the next section dives into <strong>centrality analysis</strong> — a key step toward prioritising influential biomedical entities.</p><h3>3. 
Centrality Measures</h3><p>In graph theory, <strong>centrality</strong> is a key concept that helps quantify the <strong>importance</strong> or <strong>influence</strong> of a node within a network.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*AuICg6hwhLHAZFjXrqj3NQ.png" /></figure><p>Let’s break it down with a simple analogy.</p><p>Imagine you’re part of a social network. You have 10 friends, while your friend has just 2.<br> In this scenario, your <em>degree</em> — your number of direct connections — is 10, compared to their 2.</p><p>Now ask: who’s more likely to have greater influence, broader reach, or higher visibility?</p><p>The answer is obvious: <strong>you</strong>.</p><p>This is the core idea behind <strong>centrality</strong>.<br> Just like a celebrity with millions of followers can spread information faster and wider, nodes with high centrality in a graph can exert greater influence on the entire structure.</p><p>In the context of biomedical graphs, this has powerful implications:</p><ul><li>If a <strong>gene</strong>, <strong>disease</strong>, or <strong>protein</strong> node has thousands of connections, then any disruption — be it a mutation, a treatment, or an interaction — can create a cascade of effects.</li><li>Such nodes are not just important; they are <strong>biological bottlenecks</strong>, <strong>gateways</strong>, or <strong>vulnerabilities</strong> within the system.</li></ul><p>That’s why centrality isn’t just a mathematical measure — it’s a strategic lens for identifying <strong>critical biomedical entities</strong> and mapping <strong>crucial biological pathways</strong>.</p><h3>Types of Centrality Used in This Project</h3><p>To capture different <em>flavours</em> of “importance” in a graph, this project explores four key centrality measures:</p><ul><li><strong>Degree Centrality</strong>: Who has the most direct connections?<br> Measures how many edges a node has — useful for identifying immediate 
hubs.</li><li><strong>Betweenness Centrality</strong>: Who acts as a bridge between different parts of the graph?<br> Highlights nodes that often lie on the shortest paths, playing the role of connectors or gatekeepers.</li><li><strong>Closeness Centrality</strong>: Who can reach others the fastest?<br> Prioritises nodes that are, on average, closest to all others — indicating efficient spreaders or integrators.</li><li><strong>PageRank (a variant of Eigenvector Centrality)</strong>: Who holds influence based on <em>who</em> they’re connected to?<br> It’s not just about how many connections a node has, but how important those connections are.</li></ul><p>Each of these offers a unique lens to identify key players in the biomedical graph — whether they’re <strong>hubs</strong>, <strong>bridges</strong>, <strong>gateways</strong>, or <strong>influencers</strong> in the system’s flow of biological information.</p><h3>Understanding Degree Centrality in PrimeKG</h3><p><strong>Degree centrality</strong> is perhaps the most straightforward way to measure a node’s importance in a graph.<br> It counts the number of <strong>direct connections</strong> (or edges) a node has to other nodes — telling us, quite simply, <em>who has the most connections?</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*f647Zixq1K6yIGrc_s_KJw.png" /><figcaption><strong>Left Image:</strong> Subgraph sampled with 1,000 edges from the full PrimeKG graph (randomly selected for performance-friendly visualisation).<br> <strong>Right Image:</strong> Local subgraph centered around the <strong>AR (Androgen Receptor)</strong> node, showing a limited number of its immediate neighbours.</figcaption></figure><p>In the image shown above, on the left, you see the sampled graph coloured by degree centrality. 
Each dot represents a gene or protein, and the more connected a node is, the more it stands out visually.<br> This view reveals something interesting: <strong>a large number of nodes have relatively high degrees</strong>. This is common in biological networks, where certain proteins or genes participate in <strong>many interactions</strong> due to their multifunctional roles.</p><p>Zooming in on the right, we see one such node: <strong>AR</strong> (<em>Androgen Receptor</em>). The subgraph shows AR connected to several neighbouring proteins like <strong>CREB1</strong>, <strong>YES1</strong>, and <strong>GTF2F1</strong>, highlighting its <strong>hub-like role</strong> in the network.</p><h3>So What Does This Mean?</h3><ul><li><strong>AR has a high degree — it connects to many other nodes.</strong></li><li>It is a key interaction point in the network, potentially playing a central role in regulatory or signalling pathways.</li><li>Nodes like AR could be biological bottlenecks, making them attractive targets for research or therapeutic intervention.</li></ul><blockquote><strong><em>Real-World Implication:</em></strong><em><br> The </em><strong><em>Androgen Receptor</em></strong><em> is a well-established target in </em><strong><em>prostate cancer therapy</em></strong><em>. 
Its high degree in PrimeKG reflects its biological relevance and validates the use of graph analytics in surfacing known drivers of disease.</em></blockquote><blockquote><strong><em>Did You Know?</em></strong><em><br> In social networks, high-degree users are often “influencers.” In biology, such nodes are called </em><strong><em>party hubs</em></strong><em> — they interact with many partners but may do so in a non-specific manner. Targeting them could affect a wide array of processes, making them powerful but risky intervention points.</em></blockquote><h4><em>Disclaimer &amp; Context</em></h4><p><em>The visualisation above represents a </em><strong><em>sampled subgraph</em></strong><em> from the full PrimeKG biomedical network, containing 1,000 randomly selected edges. While node colours reflect degree centrality quantiles (top 10%, mid 30%, and remaining), the graph itself is not ranked or filtered by centrality — it includes nodes with varying degrees to offer a representative slice of the overall topology.</em></p><p><em>In the accompanying bar plot, certain genes and proteins like </em><strong><em>UBC</em></strong><em> exhibit extremely high degrees (often exceeding 5,000), confirming their role as global hubs. However, in this visualisation, nodes like UBC may appear with much smaller degrees due to sampling constraints, which help avoid rendering overload.</em></p><p><em>The subgraph on the right, centered on </em><strong><em>AR (Androgen Receptor)</em></strong><em>, was extracted by selecting AR as a top-degree node and plotting it alongside a handful of its immediate neighbours. 
While AR’s full degree in the original graph is much higher, this </em><strong><em>zoomed-in view offers a focused look at its local interactions</em></strong><em>, helping to highlight its structural role without the clutter of its full connectivity.</em></p><p><strong>Degree centrality gives a strong first impression of a node’s involvement</strong>, but it doesn’t tell us how information flows <em>through</em> the network.<br> To dig deeper, the next section explores <strong>betweenness</strong> and <strong>closeness centrality</strong>, which shift the focus from direct connections to the <em>paths</em> that enable influence, navigation, and control in complex biological systems.</p><h3>Betweenness Centrality: Who Connects the Clusters?</h3><p>Just knowing a lot of people doesn’t always make someone influential. Sometimes, it’s <strong>where someone stands</strong> — right at the intersection of different groups — that makes them truly powerful.</p><p>Imagine a person who isn’t the most popular, but who connects your school friends with your work circle, or your gym buddies with your college network. <strong>They become the bridge</strong>, the one through whom ideas, opportunities, or even gossip travel.</p><p>In graph terms, that’s what <strong>betweenness centrality</strong> captures — not who knows the most people, but <strong>who connects the most communities</strong>.</p><p>The same holds true in biomedical graphs. Some nodes matter <strong>not because they connect to many others</strong>, but because they sit <strong>between</strong> them — bridging otherwise disconnected parts of the network.</p><p>That is the essence of <strong>betweenness centrality.</strong> It measures how often a node lies on the shortest paths between other nodes. In simple terms, it reveals <strong>which nodes serve as bridges</strong> for flow of information.</p><p>Another Example: Think of it like an airport hub! 
A place like <strong>Doha</strong> or <strong>Istanbul</strong> might not have the most overall flights, but they connect continents — <strong>Europe, Asia, and Africa</strong> — with strategic efficiency. Similarly, in a biomedical context, a <strong>gene or disease</strong> might not be the most connected, but if it <strong>links two critical biological modules</strong> — say, neuro-degeneration and immune signalling — it becomes structurally essential.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*3O21Wtf1SbcsGnrk3xTsag.png" /><figcaption><strong>Left:</strong> Betweenness centrality visualized on a sampled subgraph of PrimeKG (1,000 edges). Nodes with darker purple color exhibit higher betweenness — acting as critical bridges within the biomedical network.<br> <strong>Right:</strong> Local subgraph centered on <strong>Schizophrenia</strong>, one of the top-ranked nodes by betweenness. Its immediate neighbors include genes, proteins, and a drug (<strong>Mosapramine</strong>) — illustrating its central role in connecting multiple biological pathways.</figcaption></figure><h3>A Snapshot of Betweenness in PrimeKG</h3><p>The figure above visualises this in action. On the <strong>left</strong>, a sampled subgraph from PrimeKG (1,000 edges) is shown, with node colour indicating betweenness.</p><ul><li><strong>Darker purple nodes</strong> are the ones with higher betweenness scores — they appear on more shortest paths, suggesting they influence how signals or relationships traverse the graph.</li><li>These are the <strong>bridge builders</strong> — the ones holding the graph together.</li></ul><p>On the <strong>right</strong>, we zoom into one such node: <strong>Schizophrenia</strong>. 
Despite being a psychiatric condition, it sits at the center of multiple relationships, acting as a connector between <strong>genes, proteins, and a drug (Mosapramine)</strong>.</p><h3>Why Does Schizophrenia Matter in This Network?</h3><p>At first glance, it may seem surprising to see <strong>Schizophrenia</strong> — a psychiatric disorder — appear as a central hub in a <strong>biomedical knowledge graph</strong>. But when viewed through the lens of systems biology, it makes sense for it to be in this <strong>strategic position.</strong></p><h4>Schizophrenia as a Cross-System Connector</h4><p>Schizophrenia is not just a brain disorder; it’s a <strong>multi-system condition</strong>:</p><ul><li><strong>Neurotransmitter genes</strong> like <em>GSK3B</em> and <em>NMDA receptors</em> are involved in its pathophysiology.</li><li><strong>Inflammatory pathways</strong> and <strong>oxidative stress</strong> have increasingly been recognised in the research of schizophrenia.</li><li>It shares <strong>genetic architecture</strong> with other complex diseases like <strong>bipolar disorder</strong>, <strong>Alzheimer’s</strong>, and even <strong>autoimmune disorders</strong>.</li></ul><p>This means <strong>Schizophrenia “sits” at the intersection of multiple biological modules</strong> — neurological, immunological, and pharmacological.</p><h3>Why Betweenness Centrality Confirms That</h3><p>High betweenness means Schizophrenia <strong>lies on many shortest paths between other nodes</strong> — acting like a <strong>connector</strong>:</p><ul><li>It links <strong>proteins involved in inflammation</strong> with those related to <strong>neurodevelopment</strong>.</li><li>It bridges <strong>drug targets</strong> to <strong>genetic risk factors</strong>, potentially exposing new angles for <strong>drug repurposing</strong> or <strong>comorbidity research</strong>.</li></ul><h3>Real-World Implications</h3><ul><li><strong>Drug development</strong>: If a drug affects 
Schizophrenia-linked pathways, it may also impact other diseases it’s “connected to.”</li><li><strong>Biomarker discovery</strong>: Understanding its neighbourhood could highlight <strong>shared biomarkers</strong> with other neurological or systemic disorders.</li><li><strong>Precision medicine</strong>: It provides a <strong>network-based rationale</strong> for why some patients with Schizophrenia show immune or metabolic symptoms.</li></ul><blockquote><em>So in essence, Schizophrenia’s presence as a high-betweenness node isn’t random — it reflects a </em><strong><em>biologically rich and clinically nuanced</em></strong><em> role in the graph, helping to connect seemingly unrelated processes across the human body.</em></blockquote><blockquote><strong>Note</strong>: This is a sampled graph. Full connectivity (e.g., actual node degrees) may be much higher — this visualisation is optimised for clarity, not scale.</blockquote><h3>Closeness Centrality</h3><p><strong>Closeness centrality</strong> measures how near a node is to <em>all</em> other nodes in the network. It doesn’t focus on how many connections a node has, but rather how quickly it can reach everyone else.</p><p>Think of it like this: In a massive biomedical graph of diseases, proteins, and drugs, a node with high closeness centrality isn’t necessarily the most connected — it’s just the most <em>efficiently placed</em>. It can spread information faster because it lies at the “center” of the network in terms of path lengths.</p><p>This makes closeness a powerful way to find entities that are <em>highly reachable</em>. These could be genes or diseases that sit at the heart of biological communication — not because they are hubs, but because they’re a few steps away from most others.</p><p>In healthcare, such nodes are valuable. 
They can act as regulators or bottlenecks, making them excellent candidates for targeted treatments or early interventions — since their position allows them to influence the system quickly and broadly.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*1G7fgMHLb1SQaO0JdW7jzA.png" /><figcaption><strong>Left Image :</strong> A network-level visualisation of closeness centrality on a sampled subgraph (1,000 edges). Nodes with larger size and deeper blue colour have higher closeness scores, indicating that they are topologically closer to the rest of the network. <strong>Right Image:</strong> A zoomed-in subgraph around <strong>AADACL2</strong>, one of the nodes with the highest closeness centrality. Its close proximity to a diverse set of entities — including proteins, enzymes, and compounds — reveals its strategic position in the graph.</figcaption></figure><p>You might notice in the image above on the right side that some nodes — like <strong>AADACL2</strong> or <strong>CES1P1</strong> — appear disproportionately large. That’s because nodes are sized based on their normalised closeness score. A higher score indicates that the node, on average, has <strong>shorter paths to all other nodes</strong>, and is therefore more “central” in terms of reachability.</p><h3>What is AADACL2 and Why Does It Matter?</h3><p><strong>AADACL2</strong> (Arylacetamide Deacetylase-Like 2) is an enzyme that belongs to the <strong>serine hydrolase family</strong>, which plays a role in <strong>breaking down lipid-like molecules</strong> (such as esters and amides). 
While it’s not as heavily studied as other enzymes, research suggests it’s involved in:</p><ul><li><strong>Lipid metabolism</strong>, especially in processing bioactive lipid compounds.</li><li><strong>Drug metabolism</strong>, meaning it may influence how certain drugs are activated or broken down in the body.</li></ul><p>Because it’s connected to many different <strong>proteins, enzymes, and compounds</strong>, its high <strong>closeness centrality</strong> suggests that <strong>AADACL2 might serve as a central biochemical “router”</strong> — helping relay metabolic or pharmacological signals efficiently.</p><p>In a biomedical context, this makes <strong>AADACL2</strong> a potentially <strong>strategic molecule</strong> for understanding cross-talk between metabolic pathways, or even for exploring new drug targets related to lipid disorders or metabolism.</p><p>Although betweenness centrality and closeness centrality have much in common, they capture distinct roles:</p><p>While <em>Schizophrenia</em> and <em>AADACL2</em> both emerge as central nodes in the biomedical graph, their roles differ based on the type of centrality. <strong>Schizophrenia</strong>, identified via <strong>betweenness centrality</strong>, acts as a strategic bridge — connecting otherwise distant biological modules like neurodevelopment and inflammation. Its importance lies in being on the shortest paths between other entities. In contrast, <strong>AADACL2</strong>, ranked high in <strong>closeness centrality</strong>, isn’t a bridge but a <strong>core node</strong> — positioned at the topological center of the network. It can “reach” many other nodes quickly, making it ideal for rapid information diffusion or systemic influence. 
Together, these perspectives highlight how different nodes can matter — not just by how many connections they have, but by where they sit in the graph.</p><h3>Understanding PageRank Centrality: Influence Beyond Just Connections</h3><p>PageRank goes a step beyond simply counting the number of connections of a node. Instead, it captures the idea of <strong>influence by association</strong>. Originally developed by Google to rank websites, the algorithm doesn’t just ask <em>“How many other pages link to this one?”</em> — it also considers <em>“How important are the pages that link to this one?”</em></p><p>It’s like academic papers: being cited 10 times means something, but if those 10 citations come from Nobel laureates, your work holds more weight. In the same way, <strong>PageRank assigns higher scores to nodes that are connected to other high-ranking nodes</strong>, making it ideal for uncovering <strong>biomedical influencers</strong> — genes, diseases, or drugs that quietly shape major biological systems due to the company they keep.</p><p>From a biomedical perspective: Imagine <strong>KLK3</strong>, also known as <strong>Prostate-Specific Antigen (PSA)</strong>. It’s a protein with a well-established role in prostate cancer diagnostics. Suppose KLK3 is directly connected to several other proteins, but most of them are niche players not heavily involved in major biological pathways.</p><p>Now compare it to <strong>TP53</strong>, often dubbed as the “guardian of the genome.” TP53 may not have the highest number of connections overall, but it connects to key proteins in <strong>DNA repair</strong>, <strong>cell cycle regulation</strong>, <strong>apoptosis</strong>, and <strong>tumour suppression</strong>. 
These proteins, in turn, are connected to critical pathways across <strong>cancer</strong>, <strong>neuro-degeneration</strong>, and <strong>inflammation</strong>.</p><p>Even if TP53 had fewer direct edges, <strong>PageRank would assign it a higher score</strong> because it’s <strong>embedded in a highly influential sub-network</strong>. KLK3, while important in a specific diagnostic context, doesn’t exert the same level of systemic influence as TP53 across the biomedical landscape.</p><p>This is what makes PageRank so insightful — <strong>it helps uncover central regulators like TP53</strong> that aren’t just popular, but <strong>connected to the popular kids</strong> — the true power brokers of the graph.</p><p>This makes PageRank particularly valuable for identifying <strong>hidden hubs</strong> in the network — entities that may not be the most connected or centrally located, but play an outsize role due to their <strong>indirect influence</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*B4myjqJsCk5hfl12uR47HA.png" /><figcaption>PageRank centrality highlighting “Intellectual Disability” and its influential neighbours in the biomedical graph.</figcaption></figure><h3>The Visualisation</h3><ul><li><strong>Left image</strong>: A sampled subgraph from PrimeKG coloured by PageRank values. Deeper red nodes indicate higher PageRank scores — meaning these nodes are not only well-connected but also <em>strategically positioned</em> in the network.</li><li><strong>Right zoomed-in panel</strong>: A subgraph centered around <strong>“Intellectual Disability”</strong>, one of the top-ranking nodes by PageRank. 
It is connected to several rare syndromes and phenotypes — and its centrality highlights how it acts as a hub for many other conditions.</li></ul><h3>Why Intellectual Disability Ranks So High in PageRank</h3><p>In the network visualisation, <em>Intellectual Disability</em> stands out not because it’s connected to the most nodes — but because of <strong>who it’s connected to</strong>. PageRank rewards nodes that are plugged into <strong>influential neighbourhoods</strong>, and this condition fits that profile perfectly.</p><p>If you zoom into the subgraph, you’ll notice that <em>Intellectual Disability</em> is linked to several <strong>rare syndromes</strong>, <strong>genetic deletions</strong>, and <strong>neurological disorders</strong>, such as:</p><ul><li><em>Williams Syndrome</em></li><li><em>Tetra-Amelia Syndrome</em></li><li><em>Lambert Syndrome</em></li><li><em>Specific language impairment</em></li><li><em>Wolf–Hirschhorn syndrome</em></li></ul><p>These aren’t random edges — they represent <strong>meaningful biomedical relationships</strong> across <strong>neurodevelopment, cognition, and rare disease modules</strong>.</p><p>From a PageRank perspective, <em>Intellectual Disability</em> sits at the <strong>crossroads of high-impact sub-networks</strong>. It’s not just “popular” — it’s <strong>important because it’s embedded among other important nodes</strong>.</p><h3>Why This Matters in Biomedical Research</h3><p>Conditions like <em>Intellectual Disability</em> often serve as <strong>phenotypic end-points</strong> of multiple upstream disruptions — genetic, metabolic, developmental. 
High PageRank suggests that it:</p><ul><li><strong>Links many rare diseases</strong> that may otherwise be studied in isolation.</li><li><strong>Sits close to diagnostic phenotypes</strong> and <strong>genomic loci</strong> frequently associated with neurological development.</li><li><strong>May serve as a pivot for comorbidity studies</strong>, where understanding one connection could help unravel others.</li></ul><p>In essence, PageRank shows that <em>Intellectual Disability</em> is not just a clinical diagnosis — it’s a <strong>graph-theoretical hub</strong> that could unlock relationships between diverse disorders.</p><h4>A Note on Similarity Across Centrality Measures</h4><p>While we visualised different types of centrality — <strong>degree</strong>, <strong>betweenness</strong>, <strong>closeness</strong>, and <strong>PageRank</strong> — many of the subgraphs looked strikingly <strong>similar</strong>.</p><p>Why?<br> Because in this particular biomedical graph:</p><ul><li>Many nodes have <strong>simultaneously high scores across multiple centrality metrics</strong>.</li><li>High-degree nodes often <strong>also act as bridges or hubs</strong> in terms of betweenness or PageRank.</li><li>Biological networks are naturally <strong>dense and modular</strong>, where central players influence several aspects of the system.</li></ul><p>So while each centrality measure offers a unique lens, in practice, <strong>the same influential nodes often appear across them all</strong> — justifying their biological relevance even further.</p><h3>4. Understanding Network Properties: Why They Matter in Biomedicine</h3><p>When we analyse graphs — especially biomedical knowledge graphs — we’re not just interested in the number of nodes and edges. 
We care about <strong>how these nodes are connected</strong>, how <strong>information flows</strong>, and what the <strong>structure reveals</strong> about the underlying biology.</p><p>This is where <strong>network topology</strong> comes in. By studying properties like <strong>connected components</strong>, <strong>clustering</strong>, and <strong>path lengths</strong>, we gain insights into whether our biomedical graph resembles real-world biological systems or behaves more like random noise.</p><p>These properties help us answer questions like:</p><ul><li><strong>Connected Components</strong>: Are entities isolated or part of a larger biological cluster?</li><li><strong>Clustering Coefficient</strong>: Do nodes tend to form tightly-knit neighbourhoods?</li><li><strong>Average Shortest Path</strong>: How quickly can one entity influence another?</li></ul><p>Understanding these traits helps in <strong>designing better machine learning models</strong> and even spotting gaps or inconsistencies in biomedical knowledge.</p><h3>Largest Connected Component (LCC)</h3><p>In a graph, not every node is necessarily reachable from every other node. The <strong>Largest Connected Component (LCC)</strong> refers to the biggest cluster of nodes where each node is reachable from every other node in that cluster. When analysing biological networks, focusing on the LCC helps us concentrate on the most meaningful and structurally relevant part of the graph — the “mainland” where most of the action happens, rather than the scattered “islands.”</p><h3>Average Clustering Coefficient</h3><p>The <strong>clustering coefficient</strong> measures how connected a node’s neighbours are to each other. 
It answers the question: <em>“If A is connected to B and C, how likely is it that B and C are also connected?”</em></p><ul><li>A high average clustering coefficient indicates a “cliquish” structure — common in social or tightly-knit biological communities (like protein complexes).</li><li>A low value suggests that although nodes are connected, their neighbours don’t interact much — like spokes on a wheel.</li></ul><h3>Average Shortest Path Length</h3><p>This metric tells us how many steps (edges) it takes, on average, to travel from one node to any other node in the network.</p><ul><li>A lower value means the network is more tightly connected — information (or influence) can travel quickly.</li><li>In biological graphs, it can indicate how efficiently signals or interactions propagate through molecular pathways.</li></ul><h3>Small-World Networks &amp; the Watts–Strogatz Model</h3><p>In many complex systems — ranging from <strong>social circles</strong> to <strong>biological pathways</strong> — the underlying structure of connections doesn’t follow a completely random pattern, nor is it perfectly regular. Instead, these systems often exhibit properties of a <strong>small-world network</strong>.</p><p>A small-world network is characterised by two main features:</p><ol><li><strong>High clustering</strong>: Nodes tend to form closely-knit groups. In biology, this could resemble protein complexes where a group of proteins interact tightly within a specific cellular function.</li><li><strong>Short average path lengths</strong>: Despite the clustering, any two nodes can typically be reached via only a few steps — just like the “six degrees of separation” often cited in social networks.</li></ol><p>To model and investigate this behaviour, the <strong>Watts–Strogatz model</strong> is widely used. This model begins with a structured graph — such as a ring lattice, where each node is connected to its immediate neighbours. 
Then, a small fraction of the edges are randomly rewired, creating <strong>shortcuts</strong> across the network. These shortcuts <strong>preserve clustering</strong> while dramatically reducing the <strong>average path length</strong>, simulating the balance between <strong>local cohesion</strong> and <strong>global reach</strong> that defines real-world systems.</p><h3>Real-World Analogy</h3><p>Consider a group of researchers in a large scientific community. Most are tightly connected within their own labs or institutions (high clustering). But occasionally, someone collaborates with a peer in a distant university (a shortcut). Even though the system has tight local groups, the occasional long-range collaboration ensures that <strong>any two scientists are connected through just a few steps</strong> — a classic small-world property.</p><h3>Relevance in Biomedical Networks</h3><p>In this analysis, the Watts–Strogatz model was used as a baseline to assess whether real biomedical subgraphs — such as <strong>disease-protein</strong> or <strong>protein-protein</strong> networks — exhibit small-world characteristics. By comparing metrics such as:</p><ul><li><strong>Clustering Coefficient</strong></li><li><strong>Average Path Length</strong></li></ul><p>between the real network and its Watts–Strogatz equivalent, one can infer structural efficiency. If the real network shows <strong>higher clustering</strong> and <strong>comparable or shorter path lengths</strong>, it suggests that the system is <strong>modular</strong>, <strong>resilient</strong>, and optimized for <strong>biological communication</strong> — hallmarks of small-world organization.</p><h3>Case 1: Disease–Protein Network</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*FCQBjErRdXGBsopUvOjd0w.png" /></figure><p>To investigate the small-world properties within the biomedical graph, the <strong>disease–protein</strong> subgraph was extracted and analysed. 
Focusing on its <strong>largest connected component (LCC)</strong>, the following statistics were observed:</p><ul><li><strong>Nodes</strong>: 14,673</li><li><strong>Edges</strong>: 80,411</li><li><strong>Average Shortest Path Length</strong>: 4.4630</li><li><strong>Clustering Coefficient</strong>: 0.0000</li></ul><p>At first glance, the <strong>average path length of 4.46</strong> suggests that nodes are relatively well-connected — it doesn’t take many steps to travel between a disease and a protein. However, the <strong>clustering coefficient of zero</strong> paints a different picture.</p><p>This means there’s <strong>no evidence of local clustering</strong> — the kind of tight-knit groupings seen in small-world systems. In other words, while a protein may be linked to many diseases, <strong>those diseases aren’t interconnected</strong> with each other. They don’t form meaningful “modules” or communities, as one would expect in a small-world network. Because every edge in this subgraph joins a disease to a protein, no triangle can form, which is why the coefficient is exactly zero.</p><p>This result suggests that the disease–protein portion of the biomedical graph is more <strong>bipartite and dispersed</strong> than modular. It lacks the <strong>dense local pockets</strong> of interaction typically seen in biological subsystems — indicating that this specific subgraph <strong>does not exhibit small-world behaviour</strong>, even though it has short path lengths.</p><h3>Case 2: Protein–Protein Network</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ePdIPhwI-wOE9xxYPenQfQ.png" /></figure><p><strong>Protein–protein interaction (PPI) networks</strong> are often celebrated in systems biology for exhibiting <strong>small-world properties</strong> — a structure that balances <strong>local clustering</strong> with <strong>global efficiency</strong>. 
This means that proteins often form tight interaction modules (e.g., signalling complexes), while still maintaining short paths to other proteins across the network — much like how social networks operate.</p><p>To test this behaviour within the dataset, the <strong>largest connected component (LCC)</strong> of the protein_protein subgraph was analysed:</p><ul><li><strong>Nodes</strong>: 18,354</li><li><strong>Edges</strong>: 321,075</li><li><strong>Clustering Coefficient</strong>: 0.1135</li><li><strong>Average Shortest Path Length</strong>: 2.98</li></ul><p>At first glance, the <strong>average path length is impressively low</strong>, suggesting efficient connectivity. However, the <strong>clustering coefficient</strong> — a key hallmark of small-world networks — was surprisingly modest.</p><p>To benchmark this, a <strong>Watts–Strogatz small-world simulation</strong> was performed using the same number of nodes and average degree (n = 18,345, k = 35, p = 0.05). The simulation yielded:</p><ul><li><strong>Clustering Coefficient</strong>: 0.6241</li><li><strong>Average Path Length</strong>: 4.29</li></ul><p>Despite its short paths, the real PPI graph lacks the <strong>high clustering</strong> expected of small-world networks. 
This suggests that while the network is <strong>efficient in terms of communication</strong>, it does <strong>not exhibit the modular structure</strong> characteristic of classical small-world topologies.</p><p>In simpler terms:</p><blockquote><em>Proteins in this dataset are well-connected globally, but they don’t form tight local communities as densely as expected — making the small-world signature </em><strong><em>incomplete</em></strong><em>.</em></blockquote><h4>Disclaimer on Subgraph Scope and Small-World Properties</h4><p><em>The current analysis was performed on a </em><strong><em>subgraph</em></strong><em> of the full PrimeKG dataset — specifically, the </em><strong><em>largest connected component</em></strong><em> of the </em><em>protein_protein network. While this subgraph is sizable, it still represents only a </em><strong><em>portion of the total protein–protein interactions</em></strong><em> in biomedical reality.</em></p><p><em>Had the </em><strong><em>entire protein–protein network</em></strong><em> been considered — capturing more protein families, paralogs, and overlapping pathways — the </em><strong><em>clustering coefficient may have been significantly higher</em></strong><em>. This is because proteins, especially within the same biological processes or cellular compartments, are </em><strong><em>more likely to interact and form densely connected communities</em></strong><em>, a key hallmark of </em><strong><em>small-world networks</em></strong><em>.</em></p><p><em>Thus, the absence of clear small-world properties in this analysis might not reflect a true structural limitation of the data, but rather a </em><strong><em>sampling artifact</em></strong><em> due to subgraph boundaries and visualization constraints.</em></p><h3>5. Community Detection: Finding Meaningful Modules in PrimeKG</h3><p>In large-scale biomedical knowledge graphs, the web of connections can feel overwhelming. 
Yet, hidden within this complexity are <strong>communities</strong> — clusters of nodes that are <strong>more densely connected to each other than to the rest of the network</strong>. These tightly knit groups often correspond to <strong>biologically meaningful modules</strong>: diseases that share similar symptoms, proteins involved in the same pathway, or drugs targeting related mechanisms.</p><p>To uncover these underlying structures, the <strong>Louvain algorithm</strong> was applied — a widely used method for <strong>unsupervised community detection</strong>. It works by <strong>optimizing a metric called modularity</strong>, which evaluates how well a graph can be split into distinct communities. A <strong>higher modularity score</strong> indicates clearer boundaries between groups, helping reveal meaningful biological or therapeutic patterns embedded in the graph’s architecture.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/564/1*B_xjjp3_RoNYakum_0eKHA.png" /></figure><h3>Interpretation and Significance</h3><p>The image above shows a subgraph of a biomedical knowledge graph, with nodes colored by communities identified using the Louvain algorithm. Each color represents a distinct cluster of nodes that are more densely connected internally than to the rest of the network — capturing potential biological or pharmacological modules.</p><p>This particular subgraph highlights one of the most densely interconnected communities. At its core are phenotypes such as <strong>epistaxis</strong>, <strong>vertigo</strong>, and <strong>respiratory distress</strong>, surrounded by a range of drugs including <strong>Milnacipran</strong>, <strong>Tacrolimus</strong>, <strong>Dicyclomine</strong>, and <strong>Azithromycin</strong>. 
These relationships suggest shared therapeutic indications, adverse effect profiles, or treatment contexts.</p><p>Peripheral phenotypes such as <strong>alopecia</strong>, <strong>tongue pain</strong>, and <strong>urinary hesitancy</strong> further support the idea that this cluster reflects <strong>pharmacovigilance signals</strong> — real-world patterns of drug-related outcomes observed in clinical use.</p><h3>Why This Subgraph Matters</h3><ul><li><strong>Pharmacological Modules</strong>: The dense drug-phenotype interplay may indicate drugs that share mechanisms of action or are commonly co-prescribed.</li><li><strong>Polypharmacy Risk Exploration</strong>: The structural proximity of these drugs allows for the analysis of potential cumulative side effects or drug-drug interactions.</li><li><strong>Drug Repurposing &amp; Mechanistic Insight</strong>: Communities like this one surface latent relationships between drugs and phenotypes, offering a basis for repositioning opportunities and mechanistic hypotheses.</li></ul><p>This subgraph tells a <strong>biological story</strong> — capturing the interwoven nature of treatments, symptoms, and molecular processes. Community detection transforms vast biomedical graphs into interpretable, actionable clusters, revealing functional groupings that may be missed in traditional tabular analysis.</p><p>By allowing the <strong>graph structure</strong> to guide discovery — rather than relying solely on predefined categories — we unlock new pathways for understanding disease mechanisms, optimising therapeutic strategies, and navigating the complexity of biomedical knowledge.</p><h3>6. Structural &amp; Semantic Subgraphs</h3><p>After analysing global patterns like centrality and community structure, the next step is to zoom in — examining how individual biomedical concepts are embedded within the graph. 
This section explores both the <strong>structural layout</strong> and <strong>semantic context</strong> of nodes using k-hop subgraphs, revealing local patterns that offer biological insight.</p><h4>Understanding Semantic Neighbourhoods: The Power of K-hop Subgraphs</h4><p>In vast biomedical graphs, it’s easy to lose sight of <strong>local meaning</strong>. Concepts like “breast cancer” don’t exist in isolation — they’re surrounded by proteins, pathways, drugs, and phenotypes that help shape their biological relevance. To explore these relationships, <strong>k-hop subgraphs</strong> are a powerful tool.</p><h4>What Is a K-hop Subgraph?</h4><p>A <strong>k-hop subgraph</strong> extracts the immediate network around a node — everything reachable in <em>k</em> steps:</p><ul><li><strong>1-hop</strong> shows direct connections (e.g., drugs used to treat a disease).</li><li><strong>2-hop</strong> brings in neighbours of those neighbours — capturing more subtle but still localised context, such as downstream effects or secondary pathways.</li></ul><p>This approach balances <strong>focus and scope</strong> — providing enough detail for insight without being overwhelmed by the full graph’s scale.</p><h3>Visualising Breast Cancer’s Semantic Neighbourhood</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*fWkQ4UjKka7sSaI1VG8Mcw.png" /><figcaption>The figure illustrates a 2-hop semantic neighborhood around a central biomedical entity, visualized as a subgraph extracted from a larger knowledge graph. Nodes represent various biomedical entities — genes, drugs, diseases, and biological processes — colored by category. Edges capture direct or indirect associations within the 2-hop boundary. 
While some nodes are densely connected, others appear more isolated, reflecting the diverse nature of contextual relationships.</figcaption></figure><h3>Interpretation and Significance</h3><p>The visualisation reveals a <strong>2-hop neighbourhood</strong> around a selected biomedical node, capturing its extended context within the knowledge graph. Nodes represent a mix of <strong>genes/proteins</strong> (light blue), <strong>diseases/phenotypes</strong> (red), <strong>biological processes</strong> (gray), and <strong>drugs</strong> (green or purple), and are connected via known relationships like regulatory effects, therapeutic use, or shared pathways.</p><p>Several <strong>small clusters</strong> are visible, each centered around a key biological or clinical term. For instance:</p><ul><li>The <strong>red nodes</strong> mark diseases or phenotypes such as <em>prostate cancer</em>, <em>ovarian clear cell adenocarcinoma</em>, or <em>attention deficit-hyperactivity disorder</em>.</li><li><strong>Green nodes</strong> such as <em>Propafenone</em> and <em>Cinchocaine</em> represent drugs, possibly linked via shared treatment targets or adverse effects.</li><li><strong>Light blue nodes</strong> dominate the graph, representing genes and proteins that may be involved in upstream signalling or downstream response mechanisms.</li></ul><h3>Why This Matters</h3><ul><li><strong>Functional Modules</strong>: The visualisation helps identify functionally related groupings, like genes that co-regulate disease expression or drugs associated with shared phenotypic responses.</li><li><strong>Latent Associations</strong>: Some nodes are only loosely connected or isolated — a reflection of weak yet relevant semantic relationships not fully captured in the core graph structure.</li><li><strong>Biological Insight from Topology</strong>: For example, a gene like <strong>DNAJA3</strong> connected to <em>NF-κB signalling</em> and <em>influenza</em> could suggest a pathway-level role that’s 
worth exploring further in disease contexts or drug targeting.</li></ul><h3>On Visually Isolated Nodes</h3><p>Some nodes appear <strong>disconnected</strong> from others despite being included in the 2-hop neighbourhood. This isn’t an error — it’s a reflection of how <strong>visual constraints</strong> (e.g., maximum edge count, subgraph boundary limits) affect layout.</p><p>These nodes are still part of the semantic neighbourhood and may:</p><ul><li>Represent <strong>weaker or single-hop connections</strong>,</li><li>Be <strong>part of a peripheral pathway or side-effect profile</strong>,</li><li>Or act as <strong>bridges</strong> to other communities in the larger graph.</li></ul><h4>A Simple Analogy:</h4><p>Imagine being at a party. Everyone there is somehow connected to the host — but not necessarily to each other. Some guests arrived as a quiet plus-one, standing off in a corner, not mingling much. They’re still part of the guest list — they just don’t have as many visible connections.</p><p>These “disconnected” nodes still matter. Their presence tells us the local biomedical neighbourhood is <strong>diverse</strong> — some entities are deeply intertwined, others act as bridges, and some are passive outliers. All contribute meaningfully to the context.</p><h3>7. Causal Discovery (Advanced)</h3><p>So far, the focus has been on structural patterns, centrality, and semantic proximity. But graphs aren’t just about <em>who</em> is connected — they’re also about <em>how</em> they’re connected.</p><p>In biomedical knowledge graphs, certain relationships carry a directional implication: a drug <strong>targets</strong> a protein, a protein <strong>modulates</strong> a pathway, or a drug is indicated <strong>for</strong> a disease. 
These are not causal in the strict statistical sense, but they <strong>encode mechanistic or clinical directionality</strong> — which makes them <strong>semantically causal</strong> and incredibly relevant for downstream tasks like treatment discovery or hypothesis generation.</p><p>In the final section, edges with causal-like semantics (such as &quot;target&quot;, &quot;indication&quot;, &quot;enzyme&quot;) are extracted and explored to surface potential <strong>influence pathways</strong> across the biomedical graph. These directional links help illuminate how interventions might propagate and where key levers of action exist within the network.</p><h3>Bevacizumab-Centric Causal Subgraph</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*hlawfWrSb0yIXJlfl3ZOxg.png" /></figure><p>The graph above illustrates a <strong>2-hop causal subgraph</strong> centered around <strong>Bevacizumab</strong>, an anti-angiogenic drug widely used in the treatment of cancers. To focus on meaningful directional relationships, only edges with <strong>causal-like semantics</strong> were included, such as:</p><ul><li><strong>Bevacizumab → Disease</strong> (indication)</li><li><strong>Bevacizumab → Protein</strong> (target)</li><li><strong>Protein → Disease</strong> (target, indication)</li></ul><p><strong>Node color and meaning:</strong></p><ul><li><strong>Bevacizumab</strong> (central drug node)</li><li><strong>Diseases</strong> it is indicated for (e.g., breast cancer, glioblastoma)</li><li><strong>Proteins or genes</strong> it targets or that are linked to related diseases (e.g., VEGFA, C1QB, FCGR2A)</li></ul><p><strong>Why This Matters:</strong></p><p>This graph tells a therapeutic story. 
Bevacizumab doesn’t operate in isolation — it <strong>intersects multiple biological axes</strong>:</p><ul><li>It targets key <strong>immune and angiogenesis-related proteins</strong> (like VEGFA).</li><li>It is <strong>indicated for a wide array of cancers</strong>, spanning breast, cervical, ovarian, and glioblastoma.</li><li>Some of its downstream targets (e.g., complement proteins C1QA, C1QB) suggest broader immunological involvement, beyond just vascular inhibition.</li></ul><p>This view provides a <strong>mechanistic snapshot</strong> of how a single drug engages with both <strong>molecular targets</strong> and <strong>clinical outcomes</strong>, and may surface opportunities for <strong>drug repurposing</strong>, <strong>comorbidity mapping</strong>, or understanding <strong>off-target effects</strong>.</p><p>In short, this subgraph is not just about what Bevacizumab treats — it reflects <strong>how</strong> it acts, <strong>where</strong> it acts, and <strong>why</strong> its place in the biomedical network is central.</p><h3>Wrapping Up: From Graphs to Ground Truths</h3><p>What began as a tangle of nodes and edges has unraveled into something far more powerful — a structured lens on the complexity of human biology. Through centrality scores, semantic neighbourhoods, causal edges, and community structures, this exploration of <strong>PrimeKG</strong> revealed how graphs don’t just <em>store</em> knowledge — they <em>shape</em> it.</p><p>We saw how <strong>proteins form the backbone</strong> of biomedical interactions, how diseases like <strong>Schizophrenia or Intellectual Disability</strong> emerge as influential connectors, and how drugs like <strong>Bevacizumab</strong> map out therapeutic footprints across molecular and clinical landscapes.</p><p>Each visual and metric wasn’t just an abstract representation — it was a <strong>story waiting to be discovered</strong>. Stories of influence, proximity, causality, and connection. 
Stories that hint at <strong>new hypotheses</strong>, <strong>repurposing opportunities</strong>, or <strong>underlying biological mechanisms</strong> yet to be fully understood.</p><p>As the field of biomedical AI accelerates, it’s becoming clear that <strong>networks aren’t just data structures — they’re mirrors of biology’s design</strong>. And learning to read them well may hold the key to more personalised, explainable, and effective healthcare.</p><p>This post is just the beginning. In the rest of this series, we’ll move beyond shallow metrics and handcrafted rules and work towards <strong>Graph Neural Networks (GNNs)</strong>: learning embeddings, training models, and generating predictions from these richly connected datasets.</p><p>To view the code behind the visualisations, check out this <a href="https://nbviewer.org/github/amulya-prasad/XplainMD/blob/master/Notebooks/PrimeKG_Data_Visualization_Part1.ipynb">notebook</a>.</p><h3>Coming up Next:</h3><p><a href="https://medium.com/@fhirshotlearning/xplainmd-part-2-finding-the-missing-links-with-machine-learning-918c03f613d4?source=friends_link&amp;sk=3e0205292ae1ffecd7189fdf97f899a8"><strong><em>XplainMD Part 2: Finding the Missing Links with Machine Learning</em></strong></a></p><p>In the next part of this series, we shift from structural analysis to predictive modelling. Using <strong>Node2Vec embeddings</strong>, we’ll transform the graph into numerical vectors that capture both local neighbourhoods and global patterns. 
These embeddings will power <strong>machine learning models</strong> — from logistic regression to XGBoost — to predict missing links between diseases, proteins, and drugs.</p><p>We’ll explore how well these models perform in identifying <strong>biologically plausible but unseen connections</strong>, assess their accuracy, and compare strengths and limitations — setting the stage for the deep learning methods to follow.</p><h4>References:</h4><ol><li><a href="https://www.turing.com/kb/graph-centrality-measures">https://www.turing.com/kb/graph-centrality-measures</a></li><li>Watts, D. J., &amp; Strogatz, S. H. (1998). Collective dynamics of ‘small-world’ networks. <em>Nature</em>, 393(6684), 440.</li><li><a href="https://chih-ling-hsu.github.io/2020/05/15/watts-strogatz">https://chih-ling-hsu.github.io/2020/05/15/watts-strogatz</a></li><li><a href="https://medium.com/data-science/community-detection-algorithms-9bd8951e7dae">https://medium.com/data-science/community-detection-algorithms-9bd8951e7dae</a></li><li>Stanford CS224W: ML with Graphs | 2021 | Lecture 2.1 - Traditional Feature-based Methods: Node. <a href="https://youtu.be/3IS7UhNMQ3U?si=_FDb2LtxoxI_fYtb">https://youtu.be/3IS7UhNMQ3U?si=_FDb2LtxoxI_fYtb</a></li></ol>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[XplainMD: A Graph-Powered Guide to Smarter Healthcare]]></title>
            <link>https://medium.com/@fhirshotlearning/xplainmd-a-graph-powered-guide-to-smarter-healthcare-fd5fe22504de?source=rss-7e548aa5925b------2</link>
            <guid isPermaLink="false">https://medium.com/p/fd5fe22504de</guid>
            <category><![CDATA[chatbots]]></category>
            <category><![CDATA[graph-data-science]]></category>
            <category><![CDATA[graph-neural-networks]]></category>
            <category><![CDATA[gnnexplainer]]></category>
            <category><![CDATA[explainable-ai]]></category>
            <dc:creator><![CDATA[FHIR Shot Learning]]></dc:creator>
            <pubDate>Sat, 29 Mar 2025 17:30:13 GMT</pubDate>
            <atom:updated>2025-04-09T14:32:41.850Z</atom:updated>
            <content:encoded><![CDATA[<p>A beautiful synergy between GNNs and LLMs for transparent and trustworthy Clinical Decision Support Systems</p><h3>Introduction</h3><p><strong>Imagine the future of healthcare…</strong></p><p>A doctor encounters a puzzling symptom combination in a patient and turns to an AI system for insight. The AI returns a prediction — perhaps a rare disease, an overlooked comorbidity, or a promising drug to repurpose.</p><p>But then comes the critical question:<br> <strong>“How can the doctor trust this prediction?”</strong></p><p>After all, without transparency, even the most confident output might just be a well-worded guess.</p><p>And this is exactly where things take an exciting turn.</p><p>When the doctor clicks <strong>“Why?”</strong>, the AI doesn’t just return citations (although those matter too). It also reveals a <strong>subgraph</strong> — a glowing, interpretable network of diseases, phenotypes, and drug interactions that explains the prediction.<br> A <strong>missing link emerges</strong> — one the doctor hadn’t considered — now made visible through the graph structure.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Xik-b-JZEmqxrQnmlbfjow.png" /><figcaption>Doctor viewing patient subgraph generated by ChatGPT-4o</figcaption></figure><h3>Now imagine another scenario!</h3><p>The patient&#39;s condition is improving. 
But how can we be sure what is driving the improvement?</p><p>Is it the drug that’s responsible — or other variables, like <strong>genetics, lifestyle, or concurrent treatments</strong>?</p><p>This is where <strong>graphs step in!</strong> They don’t just represent data — they reveal <strong>relationships</strong>, enable <strong>reasoning</strong>, and allow us to explore <strong>causality</strong>.</p><p>By adding or removing nodes, adjusting edges, or examining an entity’s neighbourhood, clinicians can simulate scenarios, challenge assumptions, and reason through outcomes, not with guesswork but along an explicit pathway (a subgraph).</p><blockquote><strong>Graphs don’t just connect information. They connect understanding.</strong></blockquote><p>In traditional machine learning, data is often treated as <strong>isolated rows</strong> in a table — one record per patient, one feature vector at a time. Each prediction is made based solely on the input features for that specific instance, without considering how it relates to other instances.</p><p>But healthcare doesn’t work in isolation. Diseases are linked to phenotypes. Drugs interact with proteins. Genes influence both — along with <strong>millions of interconnected biological factors</strong> constantly affecting one another. If we want AI to understand these relationships, then <strong>how we represent the data matters</strong> just as much as the data itself. And to reason over such complex, interconnected information, we need a model that’s designed for it:<br> not just any neural network, but one that can <strong>learn from structure</strong>.</p><p>This is where <strong>graphs fundamentally change the game</strong>.</p><p>Graphs let us <strong>connect the dots</strong> — not just between entities, but between <strong>layers of biological meaning</strong>. 
A graph doesn’t just store that <em>“Drug A treats Disease B”</em> — it shows that Drug A interacts with a protein, which participates in a pathway, which relates to a phenotype seen in Disease B.</p><p>With that structure in place, one can:</p><ul><li>Explore <strong>causality</strong>, not just correlation</li><li>Reason across <strong>multiple biological scales</strong></li><li>Understand <strong>how and why</strong> predictions emerge</li></ul><p>So when I say <strong>“graphs connect understanding,”</strong> I mean they offer a <strong>richer, more contextual view</strong> of complex systems — something traditional models simply can’t achieve when they treat each input as a <strong>flat, disconnected data point.</strong></p><p>To <em>learn</em> from such interconnected data, we need <strong>neural networks that are structure-aware</strong> — models that don’t just look at individual features, but understand the <strong>relationships between them</strong>.</p><p>That’s exactly what <strong>Graph Neural Networks (GNNs)</strong> are built for!</p><h4><em>So what could this mean for the future of medicine?</em></h4><p>What if a <strong>digital twin</strong> of a patient wasn’t just a file of lab reports and sensor data, but an <strong>interactive knowledge graph</strong>?<br> A graph where each node represents a part of their biology or history — and modifying that graph could uncover risks, suggest therapies, or even prevent misdiagnoses. 
What if <strong>every missing node or edge </strong><br> was the difference between a delayed diagnosis and an early breakthrough?</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*zksYU7LGl1uOF2-mNax9Fg.png" /><figcaption>Digital Twin Illustration generated by ChatGPT-4o</figcaption></figure><p>What if we could actually build something like this?</p><p>Not just imagine it — but design a system that can predict biomedical relationships <em>and</em> explain them visually and contextually.<br> A system that combines structured knowledge with reasoning.<br> One that doctors — and researchers — could actually trust.</p><p><strong>That’s the vision behind XplainMD — </strong>a predictive and explainable medical assistant that brings together the structure of <strong>graph neural networks (GNNs)</strong> and the language fluency of <strong>large language models (LLMs)</strong>.<br>It’s built on top of <strong>Harvard’s PrimeKG</strong>, a richly curated biomedical knowledge graph designed for precision medicine.</p><h3>Connecting the Dots: From Architecture to Execution</h3><p>This project didn’t begin with a dataset nor with a specific model in mind.<br> It began with a <strong>vision</strong> — a rough architectural sketch I drew two years ago, mapping out a system I wasn’t yet ready to build… but knew I had to.</p><p>A system that could bring structure to biomedical knowledge, reason over it, and explain itself — end to end.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/864/1*OvHYsJsLHlrLM1DhYBhqrg.png" /><figcaption>CDSS Diagram envisioned in 2023</figcaption></figure><p>Before building <strong>XplainMD</strong>, I explored a foundational challenge: how can one extract meaningful, high-quality biomedical knowledge from unstructured literature?</p><p>In my previous <a href="https://medium.com/@fhirshotlearning/harnessing-pubmed-a-deep-dive-in-medical-knowledge-extraction-powered-by-llms-4e895b4f0839"><strong>PubMed Data 
Extraction</strong></a> series, I built a pipeline that did just that. It filtered <strong>high-quality articles</strong>, performed <strong>metadata analysis</strong>, and used <strong>large language models (LLMs)</strong> to extract entities and relationships — ultimately constructing a biomedical <strong>knowledge graph</strong> from scratch.</p><p>But as the project evolved, it became clear that this approach had critical limitations.</p><p>First, there was <strong>no human-in-the-loop validation</strong> for the extracted entities. While tools like <strong>BERN2</strong> are powerful, biomedical terminology is inherently ambiguous. Consider <em>insulin</em> — biologically a protein, clinically a drug. Without proper context, the same term can be interpreted in multiple ways, leading to noisy or misleading graph structures.</p><p>Second, the <strong>relationship extraction</strong> process was based on individual sentences. The LLM would classify an association (positive/negative) based on a single snippet of text — but in scientific literature, context is everything. A relationship that appears causal in one sentence might be contradicted or clarified elsewhere in the same abstract. This sentence-level view simply wasn’t enough for accurate biomedical relationship extraction.</p><p>And most importantly, there was <strong>no expert validation</strong>. Even with strong models, constructing biomedical graphs without clinician oversight risks encoding <strong>false associations</strong> — which, in a domain as sensitive as medicine, can be very dangerous.</p><p>That’s when I discovered <strong>PrimeKG</strong> — a <strong>precision medicine knowledge graph</strong> developed by Harvard, integrating 20 high-quality biomedical sources. 
With more than 17,000 diseases and 4 million+ relationships across 10 biological scales, PrimeKG offered something my earlier pipeline couldn’t: <strong>clinical relevance, validated structure, and multi-modal depth</strong>.</p><h3>XplainMD: Brought to Life by PrimeKG</h3><p><strong>PrimeKG</strong>, developed by researchers at Harvard, is a <strong>richly curated biomedical knowledge graph</strong> that connects diseases, drugs, proteins, phenotypes, and much more. But it’s not just comprehensive — it’s <strong>clinically meaningful</strong>.</p><p>What sets PrimeKG apart is its <strong>depth and precision</strong>. Whether it’s approved treatments or experimental compounds, PrimeKG doesn’t just capture the surface — it <strong>maps the underlying structure of biomedical knowledge</strong>, making it one of the most complete and actionable disease-centric graphs available today.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*QN3ZuW8LQw0HmJp8W7Epww.png" /><figcaption>Overview of the XplainMD pipeline</figcaption></figure><p>With PrimeKG as the foundation, <strong>XplainMD</strong> was developed — a system that doesn’t just predict, but <strong>explains</strong> its own predictions.</p><p>For this project, a <strong>Relational Graph Convolutional Network (R-GCN)</strong> was used to perform <strong>link prediction</strong> on PrimeKG, surfacing potential <strong>drug-disease</strong>, <strong>disease-phenotype</strong>, and <strong>drug-protein</strong> relationships.</p><p>To interpret these predictions, <strong>GNNExplainer</strong> was used to extract the <strong>subgraph-level evidence</strong> that contributed to each prediction. 
These subgraphs were then compared against the ground truth to assess confidence and alignment with validated knowledge.</p><p>Next, these insights were passed into a <strong>Large Language Model (LLaMA 3.1 8B Instruct)</strong>, which generated <strong>natural language explanations</strong> that accompany the visual subgraphs — giving users clear, contextual interpretations instead of black-box outputs.</p><p>Finally, the entire pipeline is wrapped in an intuitive <strong>Gradio-based chatbot</strong>, where clinicians (or curious users!) can ask biomedical questions, receive predictions, explore subgraphs, and, most importantly, understand the <em>why</em> behind every answer.</p><p>This is <strong>explainable AI for real-world healthcare</strong> — and it’s just the beginning of exciting times ahead!</p><p>This project series offers a <strong>gentle introduction to the world of graph data science</strong> and shows how we can build more <strong>trustworthy, transparent systems</strong> using graph neural networks and explainability tools, with an LLM turning the predictions into natural language explanations.</p><h3><strong>Coming up Next</strong></h3><p><strong>XplainMD</strong> is a four-part journey that explores the complete pipeline from structured biomedical data to explainable AI-driven insights:</p><p><strong><em>Part 1:</em></strong><em> </em><a href="https://medium.com/@fhirshotlearning/xplainmd-part-1-a-visual-exploration-of-primekg-cf032cbb864a?source=friends_link&amp;sk=b87f78a312f515080e5f8598804c1d29"><em>A Visual Exploration of PrimeKG</em></a><em><br>A beginner-friendly introduction to graph theory and biomedical knowledge graphs, with a deep dive into the structure of PrimeKG.</em></p><p><strong><em>Part 2:</em></strong><em> </em><a href="https://medium.com/@fhirshotlearning/xplainmd-part-2-finding-the-missing-links-with-machine-learning-918c03f613d4?source=friends_link&amp;sk=3e0205292ae1ffecd7189fdf97f899a8"><em>Finding the Missing Links with 
Machine Learning</em></a><em><br>Using Node2Vec and classical ML techniques to uncover hidden relationships in the graph.</em></p><p><strong><em>Part 3:</em></strong><em> </em><a href="https://medium.com/@fhirshotlearning/xplainmd-part-3-relational-gcn-gnnexplainer-learning-explaining-links-6a4a290819fc?source=friends_link&amp;sk=daed06bf3a79107abb518c6fb2590002"><em>Relational GCN + GNNExplainer: Learning &amp; Explaining Links</em></a><em><br>Training a Relational Graph Convolutional Network (R-GCN) to predict drug-disease and disease-phenotype links — and interpreting those predictions with GNNExplainer.</em></p><p><strong><em>Part 4:</em></strong><em> </em><a href="https://medium.com/@fhirshotlearning/xplainmd-part-4-from-graph-reasoning-to-natural-language-integrating-gnns-with-llms-and-gradio-afa5c636e956?source=friends_link&amp;sk=ce7739d24a6d190f81093f8f40ed840d"><em>From Graph Reasoning to Natural Language — Integrating GNNs with LLMs and Gradio</em></a><em><br>Integrating an LLM to generate natural language explanations, wrapped in an intuitive chatbot interface using Gradio.</em></p><p>If you’re passionate about <strong>trustworthy AI</strong>, <strong>clinical decision support</strong>, or <strong>graph-powered reasoning</strong>, this series was built for you. Check out the full project on <a href="https://github.com/amulya-prasad/XplainMD">GitHub</a>.</p><p><em>Because in </em><strong><em>medicine</em></strong><em>, relationships </em><strong><em>matter</em></strong><em> — but understanding those relationships?<br>That has the power to save millions of lives!</em></p><h3>References:</h3><ol><li>Chandak, P., Huang, K. &amp; Zitnik, M. Building a knowledge graph to enable precision medicine. <em>Sci Data</em> <strong>10</strong>, 67 (2023). <a href="https://doi.org/10.1038/s41597-023-01960-3">https://doi.org/10.1038/s41597-023-01960-3</a></li><li>Ying, Z., Bourgeois, D., You, J., Zitnik, M. and Leskovec, J., 2019. 
GNNExplainer: Generating explanations for graph neural networks. <em>Advances in neural information processing systems</em>, <em>32</em>. <a href="https://doi.org/10.48550/arXiv.1903.03894">https://doi.org/10.48550/arXiv.1903.03894</a></li><li>Schlichtkrull, M., Kipf, T.N., Bloem, P., Van Den Berg, R., Titov, I. and Welling, M., 2018. Modeling relational data with graph convolutional networks. In <em>The semantic web: 15th international conference, ESWC 2018, Heraklion, Crete, Greece, June 3–7, 2018, proceedings 15</em> (pp. 593–607). Springer International Publishing. <a href="https://doi.org/10.48550/arXiv.1703.06103">https://doi.org/10.48550/arXiv.1703.06103</a></li></ol>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[PubMed Data Part 4: Building Knowledge Graphs]]></title>
            <link>https://medium.com/@fhirshotlearning/pubmed-data-part-4-building-knowledge-graphs-b1cd0cb382b6?source=rss-7e548aa5925b------2</link>
            <guid isPermaLink="false">https://medium.com/p/b1cd0cb382b6</guid>
            <dc:creator><![CDATA[FHIR Shot Learning]]></dc:creator>
            <pubDate>Tue, 07 Jan 2025 07:17:46 GMT</pubDate>
            <atom:updated>2025-01-07T07:17:46.520Z</atom:updated>
            <content:encoded><![CDATA[<p>After scoring and clustering papers in <a href="https://medium.com/@fhirshotlearning/pubmed-data-part-3-mathematical-modelling-e3c698a1e5ed">Part 3</a>, the final phase of this project focuses on constructing <strong>Knowledge Graphs (KGs)</strong> — powerful tools for structuring relationships and making complex data more interpretable. In this stage, advanced techniques such as <strong>Regex tokenization</strong>, <strong>BERN2 for Named Entity Recognition (NER)</strong>, and <strong>Llama 3.1 for relationship extraction</strong> are employed to build insightful and meaningful knowledge graphs.</p><p><strong>Knowledge Graphs</strong> represent a transformative approach for organizing biomedical research. They not only provide interpretability but also address critical limitations of transformer models, such as hallucinations and a lack of transparency. By combining transformer-based entity and relationship extraction with the robust foundation of KGs, this approach ensures reliable, evidence-backed AI recommendations.</p><p>The process will be broken down step-by-step, along with the corresponding implementation code, to illustrate how these knowledge graphs are constructed.</p><h3><strong>Step 1 : Extracting Sentences with Regular Expressions</strong></h3><p>To prepare the text for entity recognition and relationship extraction, the sentences from research papers are <strong>tokenized</strong> using a fast and efficient method. Tokenization is critical because processing individual sentences improves the accuracy of entity extraction. 
Although many tokenizers are available, such as spaCy’s sentencizer or transformer-based tools, this project uses a regex-based function to extract the sentences.</p><p>Here is the code used for <strong>sentence extraction</strong>:</p><pre>import re<br><br>def fast_extract_sentences(text):<br>    # Split on sentence-ending punctuation followed by whitespace<br>    # (the punctuation itself is dropped from each split sentence)<br>    sentences = re.split(r&#39;(?:[.!?]\s+)&#39;, text)<br>    # Remove any empty strings or extra whitespace<br>    return [sentence.strip() for sentence in sentences if sentence.strip()]<br><br># Function to extract sentences from a .txt file<br>def extract_sentences_from_text_file(file_path):<br>    with open(file_path, &#39;r&#39;) as file:<br>        text = file.read()<br>    # Use fast method to split text into sentences<br>    sentences = fast_extract_sentences(text)<br>    return sentences</pre><h3>Step 2 : Named Entity Recognition using BERN2</h3><p>This step focuses on extracting biomedical entities, a crucial task in biomedical natural language processing (NLP). Named Entity Recognition (NER) and Named Entity Normalization (NEN) play a pivotal role in automatically identifying and standardizing entities like diseases, drugs, and genes from the ever-expanding biomedical literature. While numerous tools exist for extracting general English entities, options for biomedical-specific entities are limited. Among these, <strong>BERN2 (Biomedical Entity Recognition and Normalization)</strong> stands out as one of the most advanced tools available.</p><p><strong>BERN2</strong> builds on previous neural network-based NER tools by integrating a multi-task NER model with neural network-based NEN models. This combination allows BERN2 to deliver significantly faster and more accurate entity recognition and normalization, making it a valuable resource for biomedical NLP tasks. The BERN2 authors provide a RESTful API for accessing the model (documentation: <a href="http://bern2.korea.ac.kr./documentation">http://bern2.korea.ac.kr./documentation</a>). 
The implementation of this is shown below:</p><pre>import hashlib<br>import requests<br><br># Query BERN2 for named entity recognition<br>def query_raw(text, url=&quot;http://bern2.korea.ac.kr/plain&quot;):<br>    try:<br>        return requests.post(url, json={&#39;text&#39;: text}).json()<br>    except (requests.RequestException, ValueError):<br>        # Network failure, or a response body that is not valid JSON<br>        print(&#39;Invalid sentence&#39;)<br>        return None<br><br># Extract entities from the BERN2 response<br>def extract_entities(entities):<br>    if not entities.get(&#39;annotations&#39;):<br>        return {&#39;text&#39;: entities[&#39;text&#39;], &#39;text_sha256&#39;: hashlib.sha256(entities[&#39;text&#39;].encode(&#39;utf-8&#39;)).hexdigest()}<br><br>    e = []<br>    for entity in entities[&#39;annotations&#39;]:<br>        other_ids = [id for id in entity[&#39;id&#39;] if not id.startswith(&quot;BERN&quot;)]<br>        entity_type = entity[&#39;obj&#39;]<br>        entity_name = entities[&#39;text&#39;][entity[&#39;span&#39;][&#39;begin&#39;]:entity[&#39;span&#39;][&#39;end&#39;]]<br>        entity_id = next((id for id in entity[&#39;id&#39;] if id.startswith(&quot;BERN&quot;)), entity_name)<br>        e.append({<br>            &#39;entity_id&#39;: entity_id,<br>            &#39;other_ids&#39;: other_ids,<br>            &#39;entity_type&#39;: entity_type,<br>            &#39;entity&#39;: entity_name<br>        })<br><br>    return {&#39;entities&#39;: e, &#39;text&#39;: entities[&#39;text&#39;], &#39;text_sha256&#39;: hashlib.sha256(entities[&#39;text&#39;].encode(&#39;utf-8&#39;)).hexdigest()}<br></pre><p>The <strong>query_raw</strong> function sends a POST request to the BERN2 RESTful API with the input text in JSON format and returns the parsed response, or None if the request fails or the response is not valid JSON.</p><p>The <strong>extract_entities</strong> function processes the raw response from BERN2 to extract and structure named entities along with their metadata.</p><p><strong>Explanation of this function:</strong></p><ol><li><strong>Input Arguments</strong>:</li></ol><ul><li>entities: The JSON response from the BERN2 
API.</li></ul><p><strong>2. Handling Missing Annotations</strong>:</p><ul><li>If the API response has no annotations (the key is missing or the list is empty), the function:</li><li>Returns the original text.</li><li>Computes a <strong>SHA-256 hash</strong> of the text to uniquely identify it.</li><li>This ensures robustness in cases where no entities are identified.</li></ul><p><strong>3. Extracting Entity Details</strong>:</p><ul><li>For each entity in the annotations field:</li><li><strong>entity_name</strong>: Extracted text span of the entity using begin and end indices.</li><li><strong>entity_type</strong>: The type of entity (e.g., &quot;gene&quot;, &quot;disease&quot;, or &quot;species&quot;).</li><li><strong>entity_id</strong>: A unique identifier for the entity. If a BERN-specific ID exists, it is used; otherwise, the entity name is used as the fallback.</li><li><strong>other_ids</strong>: Any additional IDs (e.g., MeSH or NCBI Gene IDs) associated with the entity.</li></ul><p><strong>4. Return Value</strong>:</p><p>A dictionary containing:</p><ul><li><strong>entities</strong>: A list of extracted entities with their details.</li><li><strong>text</strong>: The original text.</li><li><strong>text_sha256</strong>: A hash of the text for unique identification.</li></ul><p>An excerpt of the output:</p><pre> &quot;text&quot;: &quot;Abstract 5: immunotherapy targeting programmed cell death-1 (pd-1) and pd-l1 immune checkpoints has reshaped treatment paradigms across several cancers, including breast cancer&quot;,<br>        &quot;text_sha256&quot;: &quot;64b0bdbd0e6a7eca8779b6f983f077f494e392c1706fda426efbc60beffad3a0&quot;<br>    },<br>    {<br>        &quot;entities&quot;: [<br>            {<br>                &quot;entity_id&quot;: &quot;pd-1&quot;,<br>                &quot;other_ids&quot;: [<br>                    &quot;NCBIGene:5133&quot;<br>                ],<br>                &quot;entity_type&quot;: &quot;gene&quot;,<br>                &quot;entity&quot;: 
&quot;pd-1&quot;<br>            },<br>            {<br>                &quot;entity_id&quot;: &quot;pd-l1&quot;,<br>                &quot;other_ids&quot;: [<br>                    &quot;NCBIGene:29126&quot;<br>                ],<br>                &quot;entity_type&quot;: &quot;gene&quot;,<br>                &quot;entity&quot;: &quot;pd-l1&quot;<br>            },<br>            {<br>                &quot;entity_id&quot;: &quot;triple-negative breast cancer&quot;,<br>                &quot;other_ids&quot;: [<br>                    &quot;mesh:D064726&quot;<br>                ],<br>                &quot;entity_type&quot;: &quot;disease&quot;,<br>                &quot;entity&quot;: &quot;triple-negative breast cancer&quot;<br>            },<br>            {<br>                &quot;entity_id&quot;: &quot;patients&quot;,<br>                &quot;other_ids&quot;: [<br>                    &quot;NCBITaxon:9606&quot;<br>                ],<br>                &quot;entity_type&quot;: &quot;species&quot;,<br>                &quot;entity&quot;: &quot;patients&quot;<br>            }<br>        ]</pre><h4>Explanation of the Output:</h4><p>Each entity represents a meaningful biomedical concept extracted from the text. Let’s break down the entities:</p><ol><li><strong>Entity 1: PD-1</strong></li></ol><ul><li><strong>entity_id</strong>: &quot;pd-1&quot; (Programmed Cell Death Protein 1).</li><li><strong>other_ids</strong>: Includes the identifier <strong>NCBIGene:5133</strong>, referencing the PD-1 gene in the NCBI database.</li><li><strong>entity_type</strong>: &quot;gene&quot;, indicating that PD-1 is a gene.</li><li><strong>entity</strong>: &quot;pd-1&quot;, the exact text from the input referring to this entity.</li></ul><p><strong>2. 
Entity 2: PD-L1</strong></p><ul><li><strong>entity_id</strong>: &quot;pd-l1&quot; (Programmed Death Ligand 1).</li><li><strong>other_ids</strong>: Includes <strong>NCBIGene:29126</strong>, referencing the PD-L1 gene in the NCBI database.</li><li><strong>entity_type</strong>: &quot;gene&quot;, indicating it is also a gene.</li><li><strong>entity</strong>: &quot;pd-l1&quot;, the mention of this immune checkpoint in the text.</li></ul><p><strong>3. Entity 3: Triple-Negative Breast Cancer</strong></p><ul><li><strong>entity_id</strong>: &quot;triple-negative breast cancer&quot;.</li><li><strong>other_ids</strong>: Includes <strong>MeSH:D064726</strong>, referencing this specific type of breast cancer in the Medical Subject Headings (MeSH) database.</li><li><strong>entity_type</strong>: &quot;disease&quot;, indicating this entity is a disease.</li><li><strong>entity</strong>: &quot;triple-negative breast cancer&quot;, as mentioned in the input text.</li></ul><p><strong>4. Entity 4: Patients</strong></p><ul><li><strong>entity_id</strong>: &quot;patients&quot;.</li><li><strong>other_ids</strong>: Includes <strong>NCBITaxon:9606</strong>, referencing the human species in the NCBI Taxonomy database.</li><li><strong>entity_type</strong>: &quot;species&quot;, denoting this entity refers to humans.</li><li><strong>entity</strong>: &quot;patients&quot;, as mentioned in the text.</li></ul><p>PD-L1 is a <strong>protein</strong> encoded by the <strong>CD274 gene</strong>. While BERN2 classified it as a “gene” based on database associations, it is technically a protein. This is a <strong>misclassification</strong>. NER tools like <strong>BERN2</strong> often associate proteins with their encoding genes for simplicity and because they are closely linked in biological texts. 
This classification does not invalidate the output but highlights the importance of domain knowledge to interpret results accurately.</p><h3>Step 3: Relationship Extraction</h3><p><strong>Llama 3.1 (8B-Instruct)</strong>, a large language model tuned to follow instructions, was employed for this step, as many biomedical transformer models are not designed for relationship extraction. For the sake of simplicity, the relationships between biomedical entities were classified into four major categories (the prompt shown later additionally offers a Neutral label):</p><ol><li><strong>Positive Association</strong>: Indicates a direct or favorable link between two entities.</li></ol><ul><li>Example: <em>“Drug A is effective in treating Disease B.”</em></li></ul><p><strong>2. Negative Association</strong>: Represents an adverse or unfavorable connection between entities.</p><ul><li>Example: <em>“Treatment C exacerbates Condition D.”</em></li></ul><p><strong>3. Positive Correlation</strong>: Highlights a statistically significant, positive relationship where one entity increases or improves with another.</p><ul><li>Example: <em>“Higher dosages of Drug E correlate with improved outcomes for Disease F.”</em></li></ul><p><strong>4. Negative Correlation</strong>: Denotes a statistically significant, negative relationship where one entity decreases or worsens as another increases.</p><ul><li>Example: <em>“Long-term use of Drug G is negatively correlated with Patient H’s recovery time.”</em></li></ul><h4>Loading the Model through Hugging Face</h4><p>The model was loaded using Hugging Face’s Transformers library with <strong>4-bit quantization</strong> to optimize memory usage and enable efficient inference on modern GPUs. 
The tokenizer converts input prompts into the token IDs the Llama architecture expects, while the quantized model stores its weights at reduced precision, which cuts memory use substantially, typically at a small cost in accuracy.</p><pre>from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig<br>import torch<br><br>model_id = &quot;meta-llama/Llama-3.1-8B-Instruct&quot;<br>quant_config = BitsAndBytesConfig(<br>    load_in_4bit=True<br>)<br><br># huggingface_token must hold your Hugging Face access token (the Llama weights are gated)<br>tokenizer = AutoTokenizer.from_pretrained(model_id, token=huggingface_token)<br>model = AutoModelForCausalLM.from_pretrained(<br>    model_id,<br>    token=huggingface_token,<br>    device_map=&quot;auto&quot;,<br>    quantization_config=quant_config<br>)</pre>
<h3>Step 4: Extracting Relationships using Llama 3.1</h3><p>Next, an extract_relationships_with_llama function uses <strong>Llama 3.1</strong> to identify and classify relationships between entities within a given sentence. This is achieved through a structured prompt-based approach, leveraging the capabilities of the LLM for contextual understanding and inference.</p><pre># Function to query Llama 3.1 for relationship extraction<br>def extract_relationships_with_llama(sentence, entities):<br>    # Create the prompt for the LLM<br>    prompt = (<br>        f&quot;The following text contains entities:\n\n{sentence}\n\n&quot;<br>        f&quot;Entities:\n{entities}\n\n&quot;<br>        &quot;Identify relationships between these entities and classify them into one of the following categories: &quot;<br>        &quot;Positive association, Negative association, Positive correlation, Negative correlation, Neutral.\n\n&quot;<br>        &quot;Output the relationships in a structured format: Each row should contain [Source Entity, Target Entity, Relationship].\n&quot;<br>        &quot;Relationships:&quot;<br>    )<br><br>    # Generate output<br>    with torch.inference_mode():<br>        # Tokenize and move to the device the model was placed on<br>        inputs = tokenizer(prompt, return_tensors=&quot;pt&quot;, truncation=True).to(model.device)<br><br>        # Record how many tokens are in the prompt<br>        start_index = inputs[&quot;input_ids&quot;].shape[-1]<br><br>        # Generate<br>        outputs = model.generate(**inputs, max_new_tokens=1500)<br>        new_tokens = outputs[0][start_index:]<br><br>        # Decode new tokens only (exclude prompt)<br>        result_text = tokenizer.decode(new_tokens, skip_special_tokens=True)<br><br>    return result_text</pre>
<p>Why Use Llama 3.1 for Relationship Extraction?</p><ul><li><strong>Contextual Understanding</strong>: Llama 3.1 can capture nuanced relationships by taking the context of biomedical sentences into account.</li><li><strong>Flexible Classification</strong>: The model can classify relationships into specific categories, making the output more actionable.</li><li><strong>Structured Outputs</strong>: By guiding the model with a well-designed prompt, the function encourages relationships to be returned in a machine-readable format.</li></ul><h3>Step 5: Structuring the Data to create Knowledge Graphs</h3><p>To build a <strong>Knowledge Graph (KG)</strong>, it’s essential to represent data in a structured and relational format. The <strong>source entity</strong>, <strong>target entity</strong>, and <strong>relationship</strong> form the core components of a KG, enabling it to model complex interactions between entities effectively.</p><h4>Core Components of a Knowledge Graph:</h4><ol><li><strong>Source Entity</strong>: Represents the starting point of a relationship.</li></ol><ul><li>Example: <em>Stanford University</em> in the relationship “Stanford University → Conducts_Research_On → Diabetes.”</li></ul><p><strong>2. Target Entity</strong>: Represents the endpoint or recipient of a relationship.</p><ul><li>Example: <em>Diabetes</em> in the same relationship.</li></ul><p><strong>3. Relationship</strong>: Defines the type or nature of the interaction between the source and target entities.</p><ul><li>Example: <em>Conducts_Research_On</em>, indicating that Stanford University conducted research on diabetes.</li></ul><p>The raw LLM output is then parsed into a table holding these three components:</p>
<pre>import pandas as pd<br><br># Function to extract only the relevant table<br>def extract_relationships_table(raw_output_list):<br>    relationships = []<br><br>    for entry in raw_output_list:  # Iterate through each entry in the list<br>        # Ensure entry is a dictionary and contains the &quot;relationships&quot; key<br>        if isinstance(entry, dict) and &quot;relationships&quot; in entry:<br>            relationships_text = entry[&quot;relationships&quot;]<br><br>            # Keep only the pipe-delimited table rows<br>            rows = [row.strip() for row in relationships_text.split(&quot;\n&quot;) if row.strip().startswith(&quot;|&quot;)]<br><br>            # Parse rows into a structured format<br>            for row in rows:<br>                columns = [col.strip() for col in row.split(&quot;|&quot;)[1:-1]]<br>                # Skip malformed rows, the header row, and the |---|---|---| separator row<br>                if len(columns) != 3 or columns[0] == &quot;Source Entity&quot; or not columns[0].strip(&quot;-: &quot;):<br>                    continue<br>                relationships.append({<br>                    &quot;Source Entity&quot;: columns[0],<br>                    &quot;Target Entity&quot;: columns[1],<br>                    &quot;Relationship&quot;: columns[2]<br>                })<br><br>    # Convert to DataFrame<br>    return pd.DataFrame(relationships)<br><br># raw_output: list of dicts, each holding the LLM text under a &quot;relationships&quot; key<br>relationships_df = extract_relationships_table(raw_output)<br><br># Display the cleaned DataFrame<br>print(relationships_df)</pre><h3>Step 6: Knowledge Graph Construction</h3><p>The code below shows how to construct and visualize a <strong>Knowledge Graph (KG)</strong> using NetworkX, where entities are represented as <strong>nodes</strong> and their relationships as <strong>directed edges</strong>. 
A subset of the dataset containing entity relationships is used to create the graph, with attributes like relationship type displayed as edge labels. This visualization provides an intuitive way to explore connections between entities, such as drugs and diseases, and uncover patterns in biomedical data. While NetworkX is suitable for small-scale visualization, for larger, scalable applications, <strong>Neo4j</strong> can be used to build and query the Knowledge Graph. Neo4j is a graph database optimized for handling complex relationships and allows users to store, query, and traverse large graphs efficiently using Cypher queries. This makes it an ideal choice for dynamic, real-world Knowledge Graph applications like healthcare recommender systems or drug discovery pipelines.</p>
<pre>import pandas as pd<br>import networkx as nx<br>import matplotlib.pyplot as plt<br><br># Load the extracted relationships (replace &#39;entity_relationships.csv&#39; with your actual file path)<br>df = pd.read_csv(&quot;entity_relationships.csv&quot;)<br><br># Select the first 10 rows of the DataFrame for the knowledge graph<br>plot_data = df.head(10)<br><br># Create a directed graph<br>G = nx.DiGraph()<br><br># Add edges with relationships from the DataFrame<br>for _, row in plot_data.iterrows():<br>    source = row[&#39;Source Entity&#39;]<br>    target = row[&#39;Target Entity&#39;]<br>    relationship = row[&#39;Relationship&#39;]<br>    G.add_edge(source, target, label=relationship)<br><br># Draw the graph<br>plt.figure(figsize=(12, 8))<br>pos = nx.spring_layout(G)<br><br># Draw nodes and edges<br>nx.draw(G, pos, with_labels=True, node_color=&#39;lightblue&#39;, node_size=3000, font_size=10, font_weight=&#39;bold&#39;, arrowsize=20)<br><br># Draw edge labels<br>edge_labels = nx.get_edge_attributes(G, &#39;label&#39;)<br>nx.draw_networkx_edge_labels(G, pos, edge_labels=edge_labels, font_color=&#39;red&#39;, font_size=8)<br><br>plt.title(&quot;Knowledge Graph for the First 10 Relationships&quot;)<br>plt.show()</pre>
<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ciiFi50qd_YxbgzruiOX7w.png" /></figure><h3>Conclusion and Key Takeaways</h3><p>This final part of the blog series demonstrated the process of building and visualizing a <strong>Knowledge Graph (KG)</strong> to represent and explore relationships between biomedical entities. Starting from tokenized text and identified entities, relationships were extracted using <a href="https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct"><strong>Llama 3.1</strong></a>, categorized into meaningful classifications, and structured into a graph format. The graph was then visualized using NetworkX to provide a clear and interpretable representation of entity connections, laying the foundation for real-world applications.</p><h4>Key Takeaways:</h4><ol><li><strong>Power of Knowledge Graphs</strong>:</li></ol><ul><li>KGs offer an interpretable and explainable way to structure and visualize relationships between entities. This is particularly useful in domains like biomedical research, where understanding the connections between diseases, treatments, and organizations is critical.</li></ul><p><strong>2. Leveraging AI for Efficiency</strong>:</p><ul><li>The combination of tools like <strong>BERN2</strong> for Named Entity Recognition and <strong>Llama 3.1</strong> for relationship extraction shows how advanced AI models can transform unstructured biomedical text into actionable insights.</li></ul><p><strong>3. 
Scalability and Flexibility</strong>:</p><ul><li>While NetworkX is effective for small-scale visualization, tools like <strong>Neo4j</strong> are better suited for building scalable Knowledge Graphs that can handle dynamic data and support complex queries.</li></ul><p><strong>4. Real-World Applications</strong>:</p><ul><li>The Knowledge Graph pipeline demonstrated here can be extended to power <strong>healthcare chatbots</strong>, <strong>recommender systems</strong>, and <strong>decision-support tools</strong>, enhancing explainability and transparency in AI-driven healthcare.</li></ul><p><strong>5. Addressing Challenges</strong>:</p><ul><li>By grounding transformer models in structured KGs, this approach helps mitigate common limitations of language models, such as hallucinations, and provides a more reliable backend for applications requiring evidence-based insights.</li></ul><h3>Reflections and Future Directions</h3><p>This blog series has walked through the entire pipeline, from data extraction and scoring to entity recognition, relationship extraction, and Knowledge Graph construction. While these techniques offer valuable insights and practical applications, it’s important to acknowledge their limitations. The scoring methodology, while effective, is constrained by the dataset size and the absence of features like citation counts and study type data. Similarly, open-source models like <strong>Llama 3.1</strong>, though powerful, have their imperfections, and larger, more sophisticated LLMs could yield even better outputs for entity and relationship extraction.</p><p>Despite these challenges, this project showcases how advanced AI systems can be designed to be not only intelligent but also interpretable and impactful. 
By addressing these limitations in future iterations, the approach can support more robust applications in biomedical research and beyond.</p><p>Through these techniques, I aim to foster the development of advanced AI systems that are not only intelligent but also transparent, interpretable, and impactful.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[PubMed Data Part 3: Mathematical Modelling]]></title>
            <link>https://medium.com/@fhirshotlearning/pubmed-data-part-3-mathematical-modelling-e3c698a1e5ed?source=rss-7e548aa5925b------2</link>
            <guid isPermaLink="false">https://medium.com/p/e3c698a1e5ed</guid>
            <dc:creator><![CDATA[FHIR Shot Learning]]></dc:creator>
            <pubDate>Mon, 06 Jan 2025 12:34:46 GMT</pubDate>
            <atom:updated>2025-01-09T16:13:43.685Z</atom:updated>
            <content:encoded><![CDATA[<p>In <a href="https://medium.com/@fhirshotlearning/pubmed-data-part-2-data-visualisation-1b403a800875">Part 2</a>, we explored the dataset through extensive data analysis and visualisation. The focus was on understanding the structure of the data, identifying correlations, and uncovering trends. A mix of univariate, bivariate, and multivariate analyses highlighted the main challenges: limited access to high-impact-factor journals and the implications of open-access publishing models. This groundwork laid a solid foundation for the next phase of the project.</p><h3>Mathematical Modeling and Scoring Research Papers</h3><p>In this section, the focus is on developing a mathematical model to assign scores to research papers based on numerical features, such as the 5-Year Impact Factor of journals and the Research Score of contributing universities. To achieve this, clustering algorithms were employed to group the papers into distinct categories, enabling pattern identification and scoring within these clusters.</p><p>While a supervised learning approach would be ideal for such a task if the dataset were labeled, the absence of labels in this case necessitated the use of unsupervised learning techniques. Clustering allowed us to structure the dataset meaningfully and derive insights without requiring pre-defined outcomes, making it a powerful alternative for this challenge.</p><p>Several clustering algorithms, including <strong>K-Means</strong>, <strong>DBSCAN</strong>, and <strong>Gaussian Mixture Models</strong>, were explored. 
Among these, the <strong>Gaussian Mixture Model (GMM)</strong> proved to be the best fit, as it effectively captured the underlying structure of the dataset and produced well-defined clusters, making it ideal for scoring research papers.</p><h3>Why Gaussian Mixture Model (GMM)?</h3><p>When faced with complex data, where clusters might overlap or take on irregular shapes, <strong>Gaussian Mixture Models (GMM)</strong> can be particularly powerful. At a high level, a GMM views the dataset as arising from a combination — or <em>mixture</em> — of different Gaussian (normal) distributions. Here’s how that works in practice, <strong>without</strong> diving into the underlying equations:</p><ol><li><strong>Multiple “Centers of Gravity”</strong><br>Unlike simple clustering methods that assume all data points in a group are positioned around a single center, GMM envisions multiple regions of density. Think of each region as having its own “center of gravity,” but one that can stretch or skew in various directions. This allows GMM to capture clusters that aren’t strictly spherical.</li><li><strong>Probability of Belonging</strong><br>A core idea in GMM is that every data point receives a <em>probability</em> of belonging to each cluster, rather than being assigned to one cluster in a hard, all-or-nothing way. For instance, a research paper could be 80% likely to belong to Cluster A and 20% likely to belong to Cluster B, reflecting real-world uncertainties in which category it fits best.</li><li><strong>Flexible Cluster Shapes</strong><br>One of the limitations of methods like K-Means is the assumption that clusters are roughly circular (or spherical in higher dimensions). 
GMM sidesteps this by allowing clusters to take on elliptical or elongated shapes, accommodating the variety found in real data, such as research metrics that can vary widely from paper to paper.</li><li><strong>Adaptability</strong><br>Because each Gaussian can have different parameters, GMM is adept at modeling data where variance and covariance differ across clusters. In other words, if one group of data points forms a tight ball and another forms a broader, more spread-out region, GMM can still handle both. This adaptability helps it detect underlying structures that a single fixed shape would overlook.</li><li><strong>Intuitive Interpretation</strong><br>From a <em>conceptual standpoint</em>, once GMM finds these mixtures, each “Gaussian” can be viewed as a different “theme” or “category” in your dataset. You can then look at how each paper (or data point) distributes its probabilities across these themes, giving you a nuanced understanding of how a paper aligns with each cluster’s characteristics.</li></ol><p>By leveraging these features, GMM captures the <em>gray areas</em> between categories better than many other clustering methods. This nuance proves especially helpful in fields like research evaluation, where a paper may share similarities with multiple groups and shouldn’t be forced into a single label.</p><h3><strong>Step 1: Normalizing the dataset and performing GMM clustering</strong></h3><p>For clustering, two key metrics were used: <strong>5-Year Impact Factor</strong> and <strong>Research Score</strong>.</p><ol><li><strong>Prepare and Scale the Data</strong><br>scikit-learn’s StandardScaler was used to standardize both features (zero mean, unit variance) so that neither dominates the clustering because of its scale.</li><li><strong>Gaussian Mixture Model (GMM) Clustering</strong><br>Next, <strong>GaussianMixture</strong> is instantiated with a chosen number of clusters (in this case, 3). 
After fitting the model to the scaled data, the cluster assignments (gmm_labels) are stored in a new column of the DataFrame.</li><li><strong>Profiling the Clusters</strong><br>The original data is then grouped by these newly assigned cluster labels, and summary statistics (such as the mean) of <strong>Impact Factor</strong> and <strong>Research Score</strong> are computed. This gives an <strong>at-a-glance profile</strong> of each cluster’s average values.</li><li><strong>Evaluating Performance</strong><br>To see how well GMM separated the data, the <strong>silhouette score</strong> was calculated. This metric evaluates clustering quality by measuring how similar each data point is to its own cluster (cohesion) compared to other clusters (separation). The score ranges from <strong>-1 to 1</strong>, where:</li></ol><ul><li><strong>1</strong> indicates well-defined clusters, with data points close to their own cluster and far from others.</li><li><strong>0</strong> suggests overlapping clusters, where data points are equally close to multiple clusters.</li><li><strong>-1</strong> indicates poorly defined clusters, with data points closer to other clusters than their own.</li></ul><p>Higher scores therefore reflect better-defined and more distinct clusters.</p><p><strong>5. Visualizing the Results</strong><br>Finally, a scatter plot was drawn with <strong>Research Score</strong> on the x-axis and <strong>Impact Factor</strong> on the y-axis. 
This provides an intuitive, at-a-glance view of how GMM grouped our research papers.</p><pre>import matplotlib.pyplot as plt<br>import seaborn as sns<br>from sklearn.metrics import silhouette_score<br>from sklearn.mixture import GaussianMixture<br>from sklearn.preprocessing import StandardScaler<br><br># df is the cleaned DataFrame prepared in Part 2<br># Define numerical features<br>numerical_features = [&#39;Impact_Factor_5Years&#39;, &#39;Research_Score&#39;]  # Replace with your actual column names<br><br>data_subset = df[numerical_features].copy()<br># data_subset = data_subset.fillna(data_subset.mean())  # Fill missing values with mean (or other strategy)<br><br># Step 1: Standardize the data<br>scaler = StandardScaler()<br>data_scaled = scaler.fit_transform(data_subset)<br><br># Step 2: Gaussian Mixture Model (GMM) Clustering<br>gmm = GaussianMixture(n_components=3, random_state=42)<br>gmm_labels = gmm.fit_predict(data_scaled)<br>df[&#39;GMM_Cluster&#39;] = gmm_labels<br><br># Step 3: Profile the Clusters<br>def profile_clusters(data, cluster_col, features):<br>    cluster_profiles = data.groupby(cluster_col)[features].mean()<br>    print(f&quot;Cluster Profiles for {cluster_col}:\n&quot;)<br>    print(cluster_profiles)<br>    return cluster_profiles<br><br># Profile GMM Clusters<br>print(&quot;Profiling GMM Clusters&quot;)<br>gmm_profiles = profile_clusters(df, &#39;GMM_Cluster&#39;, numerical_features)<br><br># Step 4: Evaluate GMM Clustering<br>silhouette_gmm = silhouette_score(data_scaled, gmm_labels)<br>print(f&quot;Silhouette Score for GMM Clustering: {silhouette_gmm}&quot;)<br><br># Step 5: Visualize GMM Clusters with actual (unscaled) features<br>plt.figure(figsize=(8, 6))<br>sns.scatterplot(<br>    x=df[&#39;Research_Score&#39;],<br>    y=df[&#39;Impact_Factor_5Years&#39;],<br>    hue=gmm_labels,<br>    palette=&#39;coolwarm&#39;,<br>    alpha=0.7<br>)<br>plt.title(&#39;Gaussian Mixture Model (GMM) Clustering&#39;)<br>plt.xlabel(&#39;Research Score&#39;)<br>plt.ylabel(&#39;Impact Factor (5 Years)&#39;)<br>plt.legend(title=&#39;Cluster&#39;)<br>plt.grid(True)<br>plt.show()</pre><p>The figure below showcases the clustering results of the Gaussian Mixture Model (GMM) based on two features: the <strong>5-Year Impact Factor</strong> and the 
<strong>Research Score</strong>.</p><p>Three clusters were identified:</p><ul><li><strong>Cluster 0 (blue)</strong>: Represents journals with low impact factors (mean ~3.36) and universities with lower research scores (mean ~28.81).</li><li><strong>Cluster 1 (gray)</strong>: Includes journals with slightly higher impact factors (mean ~5.52) and universities with high research scores (mean ~63.38).</li><li><strong>Cluster 2 (red)</strong>: Comprises journals with significantly higher impact factors (mean ~21.83) and universities with similarly high research scores (mean ~63.15).</li></ul><p>The silhouette score of <strong>0.387</strong> suggests moderate separation between clusters. While the clustering captures some meaningful groupings, there is room for improvement, as seen in the overlap between Cluster 0 and Cluster 1. The scatterplot visually demonstrates the spread of clusters, highlighting how Cluster 2 stands out due to its higher impact factors.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1006/1*q4UySD_jlP8bw11rY3UkKA.png" /></figure><h3><strong>Step 2: Deriving Feature Weights</strong></h3><p>Following clustering using the Gaussian Mixture Model (GMM), the contribution of each feature (Impact Factor and Research Score) to the formation of distinct clusters was analyzed. By comparing the variance of the cluster means with each feature’s total variance, it was observed that the Impact Factor played a more significant role (≈0.78) in separating the clusters compared to the Research Score (≈0.22). These values represent the relative importance or weights of the features in the clustering process.</p><p><strong>Creating the Scoring Equation</strong></p><p>Using the derived feature weights, a linear combination of the two features was constructed to develop a scoring model. This approach directly leverages the insights obtained from the GMM clustering, emphasising the more influential feature (Impact Factor) while still accounting for the contribution of the Research Score. 
The resulting model provides a structured and data-driven way to score research papers based on their most impactful attributes.</p><p>The code below calculates the relative importance (weights) of numerical features in forming clusters based on variance analysis:</p><ol><li><strong>Total Variance</strong>: The overall variance for each feature is computed.</li><li><strong>Between-Cluster Variance</strong>: The variance of the per-cluster feature means is calculated (unweighted by cluster size), representing the feature’s role in differentiating clusters.</li><li><strong>Within-Cluster Variance</strong>: Derived as the difference between total variance and between-cluster variance.</li><li><strong>Importance Ratios</strong>: The ratio of between-cluster variance to total variance is computed for each feature, indicating its contribution to cluster formation.</li><li><strong>Normalization</strong>: Ratios are normalized to sum to 1, resulting in feature weights.</li></ol><p>These weights are then used to create a scoring equation, assigning greater emphasis to the feature that plays the larger role in separating the clusters, which here is the <strong>Impact Factor</strong>.</p>
<pre># Calculate total variance for each feature<br>total_variance = df[numerical_features].var()<br><br># Calculate between-cluster variance (variance of the cluster means)<br>between_cluster_variance = gmm_profiles.var()<br><br># Calculate within-cluster variance (reported for reference; not used in the weights)<br>within_cluster_variance = total_variance - between_cluster_variance<br><br># Calculate importance ratios<br>importance_ratios = between_cluster_variance / total_variance<br><br># Normalize importance ratios to sum to 1<br>weights = importance_ratios / importance_ratios.sum()<br><br>print(&quot;Feature Weights Based on Cluster-Centric Variance Ratios:&quot;)<br>print(weights)<br><br># Develop the scoring equation<br>impact_factor_weight = weights[&#39;Impact_Factor_5Years&#39;]<br>research_score_weight = weights[&#39;Research_Score&#39;]<br>#rank_median_weight = weights[&#39;Rank_Median&#39;]<br><br>print(&quot;\nScoring Equation:&quot;)<br>print(f&quot;Score = {impact_factor_weight:.4f} * Impact_Factor_5Years + &quot;<br>      f&quot;{research_score_weight:.4f} * Research_Score&quot;)</pre>
<figure><img alt="" src="https://cdn-images-1.medium.com/max/692/1*VZV3WsgBUyFWfFnxmPexsA.png" /></figure><h3>Conclusion</h3><p>The derived feature weights were used to generate scores for each article, enabling the categorization of articles into three tiers: <strong>High</strong>, <strong>Medium</strong>, and <strong>Low</strong>. This scoring approach provides a structured way to rank articles based on their impact factor and research score. Additionally, the top 10 articles, based on their scores, were identified, and their abstracts were saved in a .txt file for further analysis.</p><h3>Scoring Equation: Future Improvement</h3><p>While the current scoring equation uses <strong>Impact Factor</strong> and <strong>Research Score</strong> to rank articles, it could be improved by including additional features and using a larger dataset.</p><ol><li><strong>Incorporating Study Type Data</strong>: If data on study types were available, such as randomized controlled trials or meta-analyses, it could serve as an additional feature to refine the scoring. Certain study types often hold more credibility and relevance in academic and clinical research, making them valuable for ranking.</li></ol><p><strong>2. Adding Citation Count</strong>: Citation count is a strong indicator of an article’s influence and relevance within the research community. Including this metric could provide a more comprehensive assessment of an article’s impact.</p><p><strong>3. Increasing the Sample Size</strong>: A larger dataset would enable more robust clustering and reduce the noise caused by outliers, improving the accuracy of the derived weights and the scoring model. 
With a broader representation of articles, the model could better capture patterns and trends across a more diverse range of journals and research outputs.</p><p>By incorporating these additional features and expanding the dataset, the scoring equation could become more precise and reflective of the true impact and quality of research articles.</p><h3>Next Steps</h3><p>In the <a href="https://medium.com/@fhirshotlearning/pubmed-data-part-4-building-knowledge-graphs-b1cd0cb382b6">upcoming blog</a>, the focus will shift to utilizing <strong>Large Language Models (LLMs)</strong> for advanced tasks such as extracting entities from the article abstracts and identifying accurate relationships between them. This step will further enhance the ability to build meaningful knowledge graphs and derive actionable insights from the dataset.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[PubMed Data Part 2 : Data Visualisation]]></title>
            <link>https://medium.com/@fhirshotlearning/pubmed-data-part-2-data-visualisation-1b403a800875?source=rss-7e548aa5925b------2</link>
            <guid isPermaLink="false">https://medium.com/p/1b403a800875</guid>
            <dc:creator><![CDATA[FHIR Shot Learning]]></dc:creator>
            <pubDate>Mon, 06 Jan 2025 11:23:59 GMT</pubDate>
            <atom:updated>2025-01-09T16:11:49.825Z</atom:updated>
            <content:encoded><![CDATA[<h3>PubMed Data Part 2 : Data Visualisation</h3><blockquote>“No data is clean, but most is useful.” ~ Dean Abbott, Co-founder and Chief Data Scientist at SmarterHQ</blockquote><p>A thorough understanding of the data is essential before developing any model. Examining it from multiple angles yields greater clarity, revealing patterns and trends that might otherwise remain hidden. To achieve this, one must look at each feature individually, then in pairs, and finally as a whole. This is done through data analysis and visualisation, which this section explores. By working through different graphs, we’ll try to uncover the structure of the data and understand its inherent limitations.</p><h3>Step 1: Data Cleaning</h3><p>Before visualizing, it’s essential to remove incomplete records: rows with NaN values in the numerical columns, or with missing (“unknown”) entries in the key text columns, are dropped. These features play a vital role in the next phase of the project, which involves scoring the articles.</p><pre>columns = [&quot;Standardized_University&quot;, &quot;University&quot;, &quot;Rank&quot;, &quot;Research_Score&quot;]<br><br># Drop rows with &quot;unknown&quot; or &quot;nan&quot; in object columns and NaN in numeric columns<br>for col in columns:<br>    if df[col].dtype == &#39;object&#39;:  # If the column is a string/object type<br>        df[col] = df[col].astype(str).str.strip().str.lower()<br>        df = df[~df[col].isin([&quot;unknown&quot;, &quot;nan&quot;])]<br>    else:  # If the column is numeric<br>        df = df[df[col].notna()]  # Keep rows where the value is not NaN<br><br># Reset index<br>df = df.reset_index(drop=True)<br><br>print(df)</pre><p>After dropping the incomplete rows, some entries in the University Rank column were still presented as a range. 
To standardize these values, the midpoint of each range (the median of its two endpoints) was calculated and assigned to the respective universities.</p><pre>import re<br><br>def get_median_rank(rank_str: str):<br>    &quot;&quot;&quot;<br>    If rank_str is a range like &#39;501–600&#39; or &#39;501-600&#39;,<br>    return the midpoint. If it&#39;s a single number (e.g. &#39;10&#39;),<br>    return that number as float.<br>    &quot;&quot;&quot;<br>    # Normalize dash variations: 501–600 -&gt; 501-600<br>    rank_str = rank_str.replace(&#39;–&#39;, &#39;-&#39;).strip()<br>    <br>    # Find all runs of digits (e.g., &#39;501&#39;, &#39;600&#39;)<br>    numbers = re.findall(r&#39;\d+&#39;, rank_str)<br>    if not numbers:<br>        return None  # or np.nan, depending on preference<br>    <br>    if len(numbers) == 2:<br>        lower, upper = map(int, numbers)<br>        return (lower + upper) / 2.0<br>    else:<br>        # Otherwise use the first number found<br>        return float(numbers[0])</pre><p>For the sake of simplicity, only articles written in English are kept. The snippet below lists the language codes present in the dataset and drops the Portuguese (“por”) entries.</p><pre># Check unique values in the &quot;Language&quot; column<br>unique_languages = df[&#39;Language&#39;].unique()<br>print(&quot;Unique languages:&quot;, unique_languages)<br><br># Drop the Portuguese (&#39;por&#39;) articles<br>df = df[~df[&#39;Language&#39;].str.strip().str.lower().eq(&#39;por&#39;)]</pre><h3>Step 2: Statistical Summary</h3><p>To generate a statistical summary, the <strong>basic_statistics</strong> function is created. It calculates descriptive statistics for both numeric and categorical columns in a DataFrame. 
It also computes the <strong>number of missing values</strong> and <strong>unique values</strong> for each column, returning two structured outputs: one for <strong>numeric columns</strong> and another for <strong>categorical columns</strong>.</p><pre>def basic_statistics(df):<br>    &quot;&quot;&quot;<br>    Calculate basic statistics for numeric and categorical columns.<br>    &quot;&quot;&quot;<br>    numeric_stats = df.describe(include=[float, int]).T  # Numeric columns<br>    categorical_stats = df.describe(include=[object]).T  # Categorical columns<br><br>    # Add per-column missing and unique counts (pandas aligns them on column name)<br>    numeric_stats[&#39;missing_values&#39;] = df.isnull().sum()<br>    numeric_stats[&#39;unique_values&#39;] = df.nunique()<br><br>    categorical_stats[&#39;missing_values&#39;] = df.isnull().sum()<br>    categorical_stats[&#39;unique_values&#39;] = df.nunique()<br><br>    return numeric_stats, categorical_stats<br><br><br>numeric_stats, categorical_stats = basic_statistics(df)<br><br>print(&quot;Numeric Column Statistics:\n&quot;, numeric_stats)<br>print(&quot;\nCategorical Column Statistics:\n&quot;, categorical_stats)</pre><p>This is the output of the summary:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Yvxl-4w5BKD1T5FaPn6RpQ.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/998/1*cyyAMiAlAsGQ4CvjHxg13g.png" /></figure><p>After removing rows with missing values in the columns listed above, the dataset shrinks from about 1,000 articles to approximately 500.</p><h3>Step 3: Univariate Analysis</h3><p>Next, we look at the frequency distribution of the data.</p><p>As shown in the image below, the frequency distributions of the key numeric attributes provide some interesting insights:</p><ul><li><strong>Impact Factor for 5 Years:</strong> The distribution is heavily skewed towards lower values. 
This skewness is primarily because PubMed is a freely accessible medical repository, and high-impact-factor journals are often published by expensive, subscription-based publishers, limiting their availability in open databases.</li><li><strong>University Rank:</strong> The dataset primarily includes articles from top-ranked universities (ranked within the top 500). While this ensures access to research from prestigious institutions, it also suggests that these articles were more likely published in lower-impact-factor journals, making them freely accessible on PubMed.</li><li><strong>Research Score of Universities:</strong> Logically, one would expect the research score to have an inverse relationship with university rank. However, the figure below does not seem to reflect this expected trend.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1000/1*QdGC8wmDqZ0GauiOS6HJXQ.png" /></figure><p>Next, we look at the top 20 journals by frequency.</p><p>As illustrated in the figure below, the majority of journals in this dataset, such as <em>Scientific Reports</em>, <em>PLOS ONE</em>, and <em>Frontiers in Immunology</em>, embrace an open-access publishing model. This highlights a significant advantage: the research is widely accessible to the public without subscription barriers.</p><p>However, the dataset has a notable limitation: it largely excludes top-tier, high-impact journals such as <em>NEJM</em> and many Elsevier titles, which remain behind paywalls. These prestigious publications, often subscription-based, represent critical sources of cutting-edge research that are absent here. To build a truly comprehensive and world-class knowledge graph, access to these articles is indispensable.</p><p>It’s time for leading journals to reconsider their publishing models, embracing greater accessibility. 
Doing so would not only foster inclusivity but also catalyze global scientific collaboration, advancing research for the benefit of all.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1000/1*TmG2Xmxag_CpeimCXeypMA.png" /></figure><p>The figure below showcases the frequency of articles published by various universities. It’s important to note that a low count here does not suggest that top universities publish fewer articles. Instead, it reflects a tendency for these institutions to prioritize publishing in high-impact-factor journals. Unfortunately, such journals are often gated behind subscription barriers, making their research less accessible to the broader public.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1000/1*NpLzMzdPlWGqjYscleKD-g.png" /></figure><h3>Step 4: Bivariate Analysis</h3><p>Now, let’s analyze the trends between two features to verify the accuracy of the data collation.</p><p>First, we’ll examine the top ten journals based on their impact factor. As shown below, a few articles from top-tier journals are indeed available on PubMed, although, as shown above, the frequency of such articles may be low.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1000/1*5ukkAj_HQa_3rWZD6eckRw.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1000/1*CPunjBZq-LOO9-_iUMzhUg.png" /></figure><p>The chart highlights the top 15 universities based on the impact factor of their respective publications. Cornell University leads with a significantly higher impact factor of 50.5, followed by Stockholm University at 36.1.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1000/1*WUZF-mvwh1X067cU6dx6_g.png" /></figure><h3>Step 5: Correlation Matrix</h3><p>The heatmap above illustrates the correlations between the numerical columns in the dataset. 
While one might intuitively anticipate a correlation between impact factor and research score (both being indicators of research quality and influence), no significant relationship was observed, likely due to the limited sample size. The most notable relationship is between research score and university rank, which show a strong negative correlation (-0.81). Because a numerically lower rank is better, this aligns with the expectation that higher-ranked universities tend to have better research scores.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/800/1*JEcOc2OYQ-89MG1y8Ew_kg.png" /></figure><h3>Step 6: Multivariate Analysis</h3><p>Finally, to examine the features as a whole, a multivariate analysis was performed.</p><h3>Relationship Between Impact Factor and Research Score</h3><p>The scatter plot illustrates the relationship between the <strong>impact factor</strong> of journals and the <strong>research score</strong> of the universities contributing to the articles. Key observations include:</p><ol><li><strong>Weak Correlation (R² = 0.03)</strong>:</li></ol><ul><li>The R² value of 0.03 indicates a very weak linear relationship: the research score of universities explains only about 3% of the variance in the impact factor of journals.</li><li>This suggests that a university’s research score does not strongly influence the impact factor of the journals where its research is published. This could be due to the small sample size.</li></ul><p><strong>2. Wide Dispersion</strong>:</p><ul><li>Data points are widely scattered, reflecting the diversity in publication venues chosen by universities regardless of their research scores.</li><li>Universities with high research scores are seen publishing in both high- and low-impact journals, which could be due to the open-access nature of many journals in the dataset.</li></ul><p><strong>3. 
Slight Upward Trend</strong>:</p><ul><li>The fitted regression line shows a marginal positive slope, implying a slight tendency for universities with higher research scores to publish in journals with higher impact factors, though this trend is not statistically significant, possibly because of the small sample size.</li></ul><p><strong>4. Implications</strong>:</p><ul><li>This result also aligns with earlier findings that many high-impact-factor journals are less accessible and may not prominently feature in open datasets like PubMed.</li></ul><p>This insight underscores the importance of expanding access to high-impact journals for a more balanced representation in academic research repositories.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*oHhNkKa1yySl9XWWA8X5hQ.png" /></figure><p>Some distributions are not shown here but can be viewed in the notebook on GitHub: <a href="https://nbviewer.org/github/amulya-incorrigible/Biomedical-Text-Extraction/blob/main/pubmed/Notebooks/Pubmed_EDA_part2.ipynb">Pubmed EDA Part 2</a>. Certain data, such as the study type, has been excluded from this blog because a significant portion of the extracted values were labeled “Unknown,” making their inclusion less meaningful in this context.</p><h3>Key Takeaways</h3><p><strong>Challenges with Accessibility</strong>:</p><ul><li>The dataset primarily consists of open-access articles from journals like <em>Scientific Reports</em> and <em>PLOS ONE</em>. 
However, the absence of high-impact-factor journals, such as <em>NEJM</em> and many Elsevier titles, highlights a significant limitation in building a truly comprehensive knowledge graph.</li></ul><p><strong>Correlations and Trends</strong>:</p><ul><li>University rank and research score exhibit a strong negative correlation, as expected.</li><li>Contrary to intuition, research scores showed no significant relationship with the impact factor of journals, likely due to dataset constraints such as small sample size and the nature of open-access journals.</li><li>A weak positive trend exists between research scores and the impact factors of journals, but the diversity of publication venues dilutes this relationship.</li></ul><p><strong>The Need for Open Research</strong>: The findings underscore the need for greater accessibility to high-impact journals, which would enable a more balanced representation of academic research and facilitate global scientific collaboration.</p><h3>Next in the Series:</h3><p>In the next part of this series, we will dive into <a href="https://medium.com/p/e3c698a1e5ed">PubMed Data Part 3: <strong>Mathematical Modelling</strong></a>, where the focus will shift from exploratory analysis to creating an equation to rank the articles. Using the cleaned and refined dataset, we will develop models to score the articles and categorize them into different tiers.</p><p>Stay tuned as we take the first steps toward building a robust and impactful biomedical knowledge graph!</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[PubMed Data Part 1: Web Scraping]]></title>
            <link>https://medium.com/@fhirshotlearning/pubmed-data-part-1-web-scraping-c307bca93008?source=rss-7e548aa5925b------2</link>
            <guid isPermaLink="false">https://medium.com/p/c307bca93008</guid>
            <dc:creator><![CDATA[FHIR Shot Learning]]></dc:creator>
            <pubDate>Mon, 06 Jan 2025 06:56:30 GMT</pubDate>
            <atom:updated>2025-01-09T16:09:56.615Z</atom:updated>
            <content:encoded><![CDATA[<p>The <a href="https://medium.com/@fhirshotlearning/harnessing-pubmed-a-deep-dive-in-medical-knowledge-extraction-powered-by-llms-4e895b4f0839">previous blog</a> introduced the overarching vision of transforming PubMed’s vast repository of research into actionable knowledge, and discussed the challenges of navigating unstructured data and the need for intelligent systems to assist in identifying high-quality studies. This blog builds on that foundation, focusing on the first crucial step: preparing and enriching the data to enable advanced analysis and visualization.</p><h3>Laying the Groundwork: Setting the Foundation for Data Analysis</h3><p>This part of the project establishes the foundation by focusing on three key objectives:</p><ul><li><strong>Extracting relevant articles</strong> based on predefined criteria.</li><li><strong>Enriching the data</strong> with journal impact factors and university rankings.</li><li><strong>Structuring the data</strong> for advanced analysis and visualization.</li></ul><h3>Step 1: Retrieving Data (The Search Begins)</h3><p>The first step in the pipeline involves retrieving relevant articles from PubMed. For this, <strong>Biopython</strong>, a Python library designed for computational biology and bioinformatics, was used. Biopython provides programmatic access to online biological databases, such as those hosted by NCBI, via the <strong>Entrez API</strong>, making it an efficient tool for fetching large datasets.</p><p>To keep things simple, the query was limited to five major diseases:</p><ul><li><strong>Diabetes</strong></li><li><strong>Cardiovascular Disease</strong></li><li><strong>Cancer</strong></li><li><strong>Alzheimer’s</strong></li><li><strong>Dementia</strong></li></ul><p>The search was further refined to include only articles published in the <strong>last five years</strong> to ensure recency and relevance. 
This approach ensures that the data remains up-to-date and reflective of the latest advancements in biomedical research. The function returns a list of <strong>PubMed IDs</strong> matching these criteria (up to 1,000 per disease, as capped by retmax), forming the foundation for the next steps in the pipeline.</p><pre>from datetime import datetime<br>from Bio import Entrez<br><br>def search_pubmed(email):<br>    &quot;&quot;&quot;<br>    Args:<br>    - email (str): Email address for Entrez login.<br><br>    Returns:<br>    - list : List of PubMed IDs.<br>    &quot;&quot;&quot;<br><br>    # Define email for entrez login<br>    Entrez.email = email<br><br>    # Setup Date range for past 5 years<br>    current_year = datetime.now().year<br>    date_range = f&quot;{current_year - 5}[PDAT] : {current_year}[PDAT]&quot;<br><br>    # Create top 5 list of diseases<br>    diseases = [&quot;Diabetes&quot;, &quot;Cardiovascular disease&quot;, &quot;Cancer&quot;, &quot;Alzheimer&#39;s&quot;, &quot;Dementia&quot;]<br><br>    # Initialize list to collect all PubMed IDs<br>    pubmed_ids = []<br><br>    for disease in diseases:<br>        query = f&quot;{disease} AND {date_range}&quot;<br>        handle = Entrez.esearch(db=&#39;pubmed&#39;, term=query, retmax=1000)<br>        record = Entrez.read(handle)<br>        handle.close()<br><br>        # Append the list of IDs for the current disease to the master list<br>        pubmed_ids.extend(record[&#39;IdList&#39;])<br><br>    # Return the collected list of PubMed IDs after the loop<br>    return pubmed_ids</pre><h3>Step 2: Fetching Article Metadata</h3><p>The fetch_articles function is used to retrieve articles, accepting a list of pubmed_ids as input. Using the <strong>Entrez efetch</strong> functionality, the data is fetched in manageable chunks to avoid overloading the API. 
An email address is required to use the API, as Entrez may contact users in case of server issues caused by their requests.</p><p>Recognizing the inevitability of network interruptions or incomplete reads, the function is designed to retry data fetching up to three times, ensuring reliability and minimizing data loss.</p><p>There was an attempt to include citation counts, but it was not very successful, so for this proof of concept the step has been left out (the related lines are commented out in the code). However, incorporating this metric in the future could significantly enhance the analysis.</p><pre>from http.client import IncompleteRead<br>from Bio import Entrez<br><br>def fetch_articles(email, ids_list, retries=3):<br>    &quot;&quot;&quot;<br>    Fetch details for a list of PubMed IDs.<br><br>    Args:<br>    - email (str): Email address for Entrez login.<br>    - ids_list (list): List of PubMed IDs.<br>    - retries (int): Maximum number of attempts on incomplete reads.<br><br>    Returns:<br>    - dict: Parsed Entrez result; articles are under the &#39;PubmedArticle&#39; key.<br>    &quot;&quot;&quot;<br>    ids = &#39;,&#39;.join(ids_list)<br>    Entrez.email = email<br>    attempt = 0<br>    while attempt &lt; retries:<br>        try:<br>            # Fetch article details<br>            handle = Entrez.efetch(db=&#39;pubmed&#39;, retmode=&#39;xml&#39;, id=ids)<br>            results = Entrez.read(handle)<br>            handle.close()<br>            <br>            # Add citation counts (disabled for the proof of concept)<br>            for paper in results[&#39;PubmedArticle&#39;]:<br>                pmid = paper[&#39;MedlineCitation&#39;][&#39;PMID&#39;]<br>                #citation_count = get_citation_count(pmid, email)<br>                #paper[&#39;CitationCount&#39;] = citation_count<br>            <br>            return results<br>        except IncompleteRead:<br>            print(f&quot;Incomplete read error encountered. Attempt {attempt + 1} of {retries}. Retrying...&quot;)<br>            attempt += 1<br>            if attempt == retries:<br>                print(&quot;Maximum retries reached. 
Raising last exception.&quot;)<br>                raise</pre><h3>Step 3: Parsing the details of the article</h3><p>Once we have the raw metadata, the next step is to extract specific details that are crucial for our analysis. Key attributes include:</p><ul><li><strong>Title and Abstract</strong>: For textual analysis and understanding the focus of the study.</li><li><strong>Journal Name</strong>: To identify where the study was published.</li><li><strong>Authors and Affiliations</strong>: To identify the authors and the institutions they are affiliated with.</li><li><strong>Publication Date</strong>: To analyze trends over time.</li></ul><p>This process ensures the data is clean, structured, and ready for enrichment.</p><pre>def extract_article_details(paper):<br>    &quot;&quot;&quot;<br>    Extract specific details from a PubMed article.<br><br>    Args:<br>    - paper (dict): Dictionary of article details.<br><br>    Returns:<br>    - tuple: (title, abstract, journal, language, year, month, authors, affiliations).<br>    &quot;&quot;&quot;<br><br>    title = paper.get(&#39;MedlineCitation&#39;, {}).get(&#39;Article&#39;, {}).get(&#39;ArticleTitle&#39;, &#39;No Title&#39;).lower()<br>    abstract_data = paper.get(&#39;MedlineCitation&#39;, {}).get(&#39;Article&#39;, {}).get(&#39;Abstract&#39;, {}).get(&#39;AbstractText&#39;, [&#39;No Abstract&#39;])<br>    # Note: only the first AbstractText section is kept<br>    abstract = abstract_data[0].lower() if isinstance(abstract_data, list) else abstract_data.lower()<br>    journal = paper.get(&#39;MedlineCitation&#39;, {}).get(&#39;Article&#39;, {}).get(&#39;Journal&#39;, {}).get(&#39;Title&#39;, &#39;No Journal&#39;).lower()<br>    language = paper.get(&#39;MedlineCitation&#39;, {}).get(&#39;Article&#39;, {}).get(&#39;Language&#39;, [&#39;No Language&#39;])[0]<br>    pubdate = paper.get(&#39;MedlineCitation&#39;, {}).get(&#39;Article&#39;, {}).get(&#39;Journal&#39;, {}).get(&#39;JournalIssue&#39;, {}).get(&#39;PubDate&#39;, {})<br>    year = pubdate.get(&#39;Year&#39;, &#39;No 
Data&#39;)<br>    month = pubdate.get(&#39;Month&#39;, &#39;No Data&#39;)<br>    authors_data = paper.get(&#39;MedlineCitation&#39;, {}).get(&#39;Article&#39;, {}).get(&#39;AuthorList&#39;, [])<br>    authors_list = []<br>    affiliations_list = []<br><br>    for author in authors_data:<br>        # Initialize variables for each author<br>        author_name = None<br>        affiliation = &#39;No Affiliation&#39;<br><br>        # Check for author name and concatenate if present<br>        if &#39;LastName&#39; in author and &#39;ForeName&#39; in author:<br>            author_name = f&quot;{author[&#39;LastName&#39;]} {author[&#39;ForeName&#39;]}&quot;<br>            authors_list.append(author_name)<br><br>            # Check if &#39;AffiliationInfo&#39; exists and is not an empty list<br>            affiliation_info = author.get(&#39;AffiliationInfo&#39;)<br>            if affiliation_info and isinstance(affiliation_info, list) and affiliation_info[0]:<br>                affiliation = affiliation_info[0].get(&#39;Affiliation&#39;, &#39;No Affiliation&#39;).lower()<br><br>        # Append affiliation to the list<br>        affiliations_list.append(affiliation)<br><br>    # Get Citation Count<br>    #citation_count = paper.get(&#39;CitationCount&#39;, &#39;No Citation Count&#39;)<br><br>    # Join the authors and affiliations into strings<br>    authors = &#39;, &#39;.join(authors_list)<br>    affiliations = &#39;, &#39;.join(affiliations_list)<br><br>    # Return the extracted information<br>    return title, abstract, journal, language, year, month, authors, affiliations</pre><h3>Step 4: Creating a Dataframe</h3><p>The create_dataframe function brings us one step closer to organising the entire data in a tabular format. In this function, we call the above functions in a streamlined pipeline.</p><p>Using the fetch_articles function, it retrieves up to 1,000 articles in a single execution. 
The extract_article_details function is then applied to extract key features from each article, such as the title, abstract, authors, and affiliations. Once all relevant information has been processed, it is compiled into a structured <strong>DataFrame</strong>, consolidating the extracted metadata into an easily analyzable format.</p><pre>import pandas as pd<br><br>def create_dataframe(email, ids_list, chunk_size=1000):<br>    &quot;&quot;&quot;<br>    Create a DataFrame containing details of PubMed articles.<br><br>    This function fetches articles from PubMed in chunks and extracts relevant details<br>    such as title, abstract, journal, etc., to populate a DataFrame.<br><br>    Args:<br>    - email (str): Email address for Entrez login.<br>    - ids_list (list of str): List of PubMed IDs to fetch.<br>    - chunk_size (int, optional): The number of articles to fetch in each request. Default is 1000.<br><br>    Returns:<br>    - pandas.DataFrame: A DataFrame where each row represents an article and columns<br>      contain title, abstract, journal, language, year, month, authors,<br>      and affiliations.<br>    &quot;&quot;&quot;<br>    pubmed_df = {<br>        &#39;Title&#39;: [], &#39;Abstract&#39;: [], &#39;Journal&#39;: [], &#39;Language&#39;: [], &#39;Year&#39;: [], &#39;Month&#39;: [],<br>        &#39;Authors&#39;: [], &#39;Affiliations&#39;: []<br>    }<br><br>    for chunk_i in range(0, len(ids_list), chunk_size):<br>        chunk = ids_list[chunk_i:chunk_i + chunk_size]<br>        papers = fetch_articles(email, chunk)<br><br>        if papers is None or &#39;PubmedArticle&#39; not in papers:<br>            print(f&quot;Warning: No data returned for chunk starting at index {chunk_i}&quot;)<br>            continue<br><br>        for paper in papers[&quot;PubmedArticle&quot;]:<br>            # Extract article details from the paper<br>            title, abstract, journal, language, year, month, authors, 
affiliations = extract_article_details(paper)<br><br>            # Append the details to the respective lists in the dictionary<br>            pubmed_df[&#39;Title&#39;].append(title)<br>            pubmed_df[&#39;Abstract&#39;].append(abstract)<br>            pubmed_df[&#39;Journal&#39;].append(journal)<br>            pubmed_df[&#39;Language&#39;].append(language)<br>            pubmed_df[&#39;Year&#39;].append(year)<br>            pubmed_df[&#39;Month&#39;].append(month)<br>            pubmed_df[&#39;Authors&#39;].append(authors)<br>            pubmed_df[&#39;Affiliations&#39;].append(affiliations)<br><br>    # Convert the dictionary to a pandas DataFrame<br>    pubmed_df = pd.DataFrame(pubmed_df)<br><br>    return pubmed_df</pre><h3>Step 5: Merging the Impact Factors</h3><p>The next step in the pipeline was to determine the <strong>Impact Factor</strong> of the journals associated with the retrieved articles. The <strong>Impact Factor</strong> is a crucial metric that measures the average number of citations received per article recently published in a journal, with higher values signifying greater influence in the scientific community. Since PubMed does not directly provide this information, alternative methods were explored, including using APIs. However, many existing Python libraries for Impact Factor data retrieval are no longer functional. For this project, a <strong>CSV file</strong> containing Impact Factor data was sourced from the <strong>Journal Citation Reports</strong> website.</p><p>The merge_impact_factors function plays a key role in this step by merging the extracted PubMed data with the Impact Factor dataset. It matches the two DataFrames based on journal names or unique identifiers like <strong>ISSN/EISSN</strong>, ensuring a reliable integration of the Impact Factor into the pipeline. 
This step enriches the dataset, making it more robust for analysis and scoring methodologies.</p><pre>def merge_impact_factors(pubmed_df, impact_factor_csv_path, journal_col=&#39;Journal&#39;):<br>    &quot;&quot;&quot;<br>    Merge impact factors into the PubMed articles DataFrame, retain articles with impact factors,<br>    and drop columns that only contain NaN values.<br><br>    Args:<br>    - pubmed_df (DataFrame): DataFrame containing PubMed articles.<br>    - impact_factor_csv_path (str): Path to the CSV file with impact factors.<br>    - journal_col (str): Column name for journal titles in the PubMed DataFrame.<br><br>    Returns:<br>    - DataFrame: The merged DataFrame with impact factors and without NaN-only columns.<br>    &quot;&quot;&quot;<br><br>    # Load the impact factor CSV file<br>    impact_factors_df = pd.read_csv(impact_factor_csv_path)<br><br>    # Format the journal titles consistently (strip whitespaces and convert to lowercase)<br>    pubmed_df[journal_col] = pubmed_df[journal_col].str.strip().str.lower()<br>    impact_factors_df[&#39;Name&#39;] = impact_factors_df[&#39;Name&#39;].str.strip().str.lower()<br>    impact_factors_df[&#39;Abbr Name&#39;] = impact_factors_df[&#39;Abbr Name&#39;].str.strip().str.lower()<br><br>    # First attempt: merge on the full journal name<br>    merged_df = pubmed_df.merge(<br>        impact_factors_df,<br>        how=&#39;left&#39;,<br>        left_on=journal_col,<br>        right_on=&#39;Name&#39;<br>    )<br><br>    # Fall back to other keys, one at a time, only while nothing has matched.<br>    # Separate if statements are used so that each fallback can actually run.<br>    if merged_df[&#39;JIF&#39;].isna().all():<br>        merged_df = pubmed_df.merge(<br>            impact_factors_df,<br>            how=&#39;left&#39;,<br>            left_on=journal_col,<br>            right_on=&#39;Abbr Name&#39;<br>        )<br>    if merged_df[&#39;JIF&#39;].isna().all() and &#39;ISSN&#39; in pubmed_df.columns:<br>        merged_df = pubmed_df.merge(<br>       
     impact_factors_df,<br>            how=&#39;left&#39;,<br>            left_on=&#39;ISSN&#39;,<br>            right_on=&#39;ISSN&#39;<br>        )<br>    if merged_df[&#39;JIF&#39;].isna().all() and &#39;EISSN&#39; in pubmed_df.columns:<br>        merged_df = pubmed_df.merge(<br>            impact_factors_df,<br>            how=&#39;left&#39;,<br>            left_on=&#39;EISSN&#39;,<br>            right_on=&#39;EISSN&#39;<br>        )<br><br>    # Rename relevant columns for clarity<br>    merged_df.rename(columns={<br>        &#39;JIF&#39;: &#39;Impact_Factor&#39;,<br>        &#39;JIF5Years&#39;: &#39;Impact_Factor_5Years&#39;,<br>        &#39;Category&#39;: &#39;Journal_Category&#39;<br>    }, inplace=True)<br><br>    # Retain only articles with available impact factors<br>    merged_df = merged_df.dropna(subset=[&#39;Impact_Factor&#39;])<br><br>    # Drop columns that only contain NaN values<br>    merged_df = merged_df.dropna(axis=1, how=&#39;all&#39;)<br>    return merged_df</pre><h3>Step 6: Entity Recognition for extracting Universities and Study Type using GLINER</h3><p>From the extracted data, two major challenges were identified:</p><ol><li><strong>Study Type Identification</strong>: The metadata does not explicitly specify the study type. This information must be inferred from the article text (the abstract, in this pipeline), adding complexity to the data processing pipeline.</li><li><strong>Affiliation Cleanup</strong>: The affiliations column contains excessively verbose text. Extracting and isolating university names from this unstructured data requires additional processing.</li></ol><p>To address these issues, <strong>GLINER</strong>, a Named Entity Recognition (NER) transformer, was employed. GLINER is built on a BERT-like transformer architecture and offers a robust solution for entity extraction. 
Unlike many traditional NER tools, which are restricted to predefined entity categories, GLiNER accepts arbitrary entity labels at inference time and is lightweight, which makes it a good fit for processing large datasets and extracting custom entities such as universities and study types.</p><p>The <strong>extract_universities_gliner</strong> function extracts university names from an affiliation string by asking <strong>GLiNER</strong> for entities labeled &quot;Organization&quot; and returns them as a comma-separated string, or &quot;Unknown&quot; if none are found. Similarly, the <strong>extract_study_type_gliner</strong> function extracts entities labeled &quot;Study Type&quot; from the abstract text and returns the first match, or &quot;Unknown&quot;. Both functions use a confidence threshold of 0.5 and populate the respective columns in the DataFrame.</p><pre>from gliner import GLiNER<br>import pandas as pd<br><br># Load the GLiNER model<br>model = GLiNER.from_pretrained(&quot;urchade/gliner_medium-v2.1&quot;)<br><br># Labels for entity prediction<br>labels_universities = [&quot;Organization&quot;]<br>labels_study_types = [&quot;Study Type&quot;]<br><br># Function to extract universities from affiliations using GLiNER<br>def extract_universities_gliner(affiliation):<br>    &quot;&quot;&quot;<br>    Extract universities from the affiliation string using GLiNER.<br><br>    Args:<br>    - affiliation (str): The affiliation string.<br><br>    Returns:<br>    - str: Extracted university names.<br>    &quot;&quot;&quot;<br>    if not isinstance(affiliation, str) or affiliation.strip() == &quot;&quot;:<br>        return &quot;Unknown&quot;<br><br>    # Perform entity prediction using GLiNER<br>    entities = model.predict_entities(affiliation, labels_universities, threshold=0.5)<br><br>    # Extract universities from the identified entities<br>    universities = [entity[&quot;text&quot;] for entity in entities if entity[&quot;label&quot;] == 
&quot;Organization&quot;]<br><br>    # Return universities as a comma-separated string or &#39;Unknown&#39; if none found<br>    return &quot;, &quot;.join(universities) if universities else &quot;Unknown&quot;<br><br># Function to extract study types from abstract using GLiNER<br>def extract_study_type_gliner(abstract):<br>    &quot;&quot;&quot;<br>    Extract study types from the abstract text using GLiNER.<br><br>    Args:<br>    - abstract (str): Abstract of the study.<br><br>    Returns:<br>    - str: The type of study.<br>    &quot;&quot;&quot;<br>    if not isinstance(abstract, str) or abstract.strip() == &quot;&quot;:<br>        return &quot;Unknown&quot;<br><br>    # Perform entity prediction using GLiNER<br>    entities = model.predict_entities(abstract, labels_study_types, threshold=0.5)<br><br>    # Extract study type from the identified entities<br>    study_types = [entity[&quot;text&quot;] for entity in entities if entity[&quot;label&quot;] == &quot;Study Type&quot;]<br><br>    # Return the first matched study type or &#39;Unknown&#39; if none found<br>    return study_types[0] if study_types else &quot;Unknown&quot;<br><br># Apply the GLiNER extraction functions to the DataFrame<br>final_df[&#39;Universities&#39;] = final_df[&#39;Affiliations&#39;].apply(extract_universities_gliner)<br>final_df[&#39;Study_Type_Extracted&#39;] = final_df[&#39;Abstract&#39;].apply(extract_study_type_gliner)</pre><h3>Step 7: Standardisation of the Universities</h3><p>Inconsistent naming conventions in affiliations can pose significant challenges for analysis. University names in the extracted data often vary due to differences in formatting, case sensitivity, or the inclusion of extra descriptive text. 
For instance:</p><ul><li>“Stanford University, Department of Medicine” → “stanford university”</li><li>“University of California, Los Angeles (UCLA)” → “university of california”</li></ul><p>Standardization reduces all names to a consistent, comparable format. The standardize_university_names function does this by:</p><ul><li><strong>Utilizing Regular Expressions (Regex)</strong>: Extracting the main university or institutional name from complex strings and lowercasing it.</li><li><strong>Handling Ambiguities</strong>: Assigning “Unknown” to cases where a clear match cannot be identified.</li></ul><p>This step improves data consistency, enabling accurate analysis and integration with external datasets.</p><pre>import re<br><br># Keywords that mark an institution name<br>_KEYWORDS = r&#39;(?:university|institute|college|academy|school)&#39;<br><br># First alternative: &quot;[Word] University of X ...&quot; (e.g. &quot;University of California&quot;).<br># Second alternative: &quot;Word University&quot; (e.g. &quot;Stanford University&quot;).<br># The first alternative is needed because a name that starts with the keyword<br># has no preceding word for the second alternative to match.<br>_UNIVERSITY_PATTERN = re.compile(<br>    rf&#39;(?:[a-zA-Z]+\s+)?{_KEYWORDS}\s+of\s+[a-zA-Z]+(?:\s+[a-zA-Z]+)*|[a-zA-Z]+\s*{_KEYWORDS}&#39;,<br>    re.IGNORECASE<br>)<br><br>def standardize_university_names(universities_column):<br>    standardized_names = []<br>    for university in universities_column:<br>        if university.lower() == &#39;unknown&#39;:<br>            standardized_names.append(&#39;Unknown&#39;)<br>            continue<br><br>        # Extract main university name using regex<br>        match = _UNIVERSITY_PATTERN.search(university)<br>        if match:<br>            standardized_names.append(match.group(0).strip().lower())<br>        else:<br>            standardized_names.append(&#39;Unknown&#39;)<br><br>    return standardized_names<br><br>final_df[&#39;Standardized_University&#39;] = standardize_university_names(final_df[&#39;Universities&#39;])</pre><h3>Step 8: Fetching University Rankings and Merging Them into the DataFrame</h3><p>To enhance the paper sorting methodology, it was essential to incorporate university rankings and corresponding research scores into the dataset. 
This was achieved using the extract_and_merge_university_ranking function, which fetches global university rankings and research scores from a specified URL (here, a Times Higher Education JSON file) and merges this information into an existing DataFrame.</p><p>The function works by sending a GET request to the URL, parsing the JSON response to extract relevant data (university names, rankings, and research scores), and structuring this information into a new DataFrame. To ensure compatibility with the existing dataset, university names are standardized by converting them to lowercase. The ranking data is then left-merged with the original DataFrame using the standardized university names as a key, so articles without a matching university are kept with empty ranking fields.</p><p>Additionally, the function handles request errors by printing the error and returning the DataFrame unchanged, and it strips the &quot;=&quot; symbol from rank values. The result is an updated dataset with new columns for university rankings and research scores, enabling a more data-driven approach to ranking papers.</p><p>This step completes the data extraction pipeline, providing a comprehensive dataset ready for downstream analysis.</p><pre>import requests<br><br>def extract_and_merge_university_ranking(final_df, api_url):<br>    &quot;&quot;&quot;<br>    Extracts university rankings from a given API and merges them with the existing DataFrame.<br><br>    Args:<br>        final_df (DataFrame): Existing DataFrame with a column named &#39;Standardized_University&#39;.<br>        api_url (str): URL to the API that provides university rankings.<br><br>    Returns:<br>        DataFrame: Updated DataFrame containing &#39;Rank&#39; and &#39;Research_Score&#39; columns.<br>    &quot;&quot;&quot;<br>    headers = {<br>        &quot;User-Agent&quot;: &quot;Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3&quot;<br>    }<br><br>    try:<br>        # Sending GET request to the URL<br>        response = 
requests.get(api_url, headers=headers, timeout=30)  # Timeout so a stalled request cannot hang the pipeline<br>        response.raise_for_status()  # Raise an error for bad responses<br>        data = response.json()<br><br>        # Extracting relevant data from the API response<br>        university_names = []<br>        ranks = []<br>        research_scores = []<br><br>        for university in data.get(&#39;data&#39;, []):<br>            uni_name = university.get(&#39;name&#39;)<br>            rank = university.get(&#39;rank&#39;)<br>            research_score = university.get(&#39;scores_research&#39;)<br><br>            university_names.append((uni_name or &#39;&#39;).lower())  # Lowercase for matching; tolerate a missing name<br>            ranks.append(rank)<br>            research_scores.append(research_score)<br><br>        # Creating DataFrame from extracted data<br>        ranking_df = pd.DataFrame({<br>            &#39;University&#39;: university_names,<br>            &#39;Rank&#39;: ranks,<br>            &#39;Research_Score&#39;: research_scores<br>        })<br><br>        # Cleaning up rank values to remove symbols like &#39;=&#39;; ranks stay as strings because they may not be plain integers<br>        ranking_df[&#39;Rank&#39;] = ranking_df[&#39;Rank&#39;].astype(str).str.replace(&#39;=&#39;, &#39;&#39;, regex=False)<br><br>        # Standardizing &#39;Standardized_University&#39; column in final_df to lowercase for matching<br>        final_df[&#39;Standardized_University&#39;] = final_df[&#39;Standardized_University&#39;].str.lower()<br><br>        # Merging rankings with the original DataFrame<br>        final_df = final_df.merge(ranking_df, left_on=&#39;Standardized_University&#39;, right_on=&#39;University&#39;, how=&#39;left&#39;, suffixes=(&#39;&#39;, &#39;_Ranking&#39;))<br>        return final_df<br><br>    except requests.exceptions.RequestException as e:<br>        print(f&quot;An error occurred while fetching university rankings: {e}&quot;)<br>        return final_df<br><br>api_url = 
&quot;https://www.timeshighereducation.com/sites/default/files/the_data_rankings/world_university_rankings_2024_0__91239a4509dc50911f1949984e3fb8c5.json&quot;<br><br><br># Call the function and store the merged result in pubmed_final<br>pubmed_final = extract_and_merge_university_ranking(final_df, api_url)<br>print(pubmed_final.head())</pre><h3>The Result: A Comprehensive Dataset</h3><p>With all necessary fields consolidated into a single dataset, the foundation for advanced analysis is now complete. This enriched and structured data is ready for deeper exploration.</p><h3>What’s Next?</h3><p>In the next post, we’ll focus on <a href="https://medium.com/@fhirshotlearning/pubmed-data-part-2-data-visualisation-1b403a800875"><strong>data visualization</strong></a> to uncover patterns, trends, and feature correlations within the dataset. These insights will play a crucial role in developing a robust scoring methodology for ranking articles and will set the stage for constructing Knowledge Graphs in subsequent steps.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Harnessing PubMed: A deep dive in medical knowledge extraction powered by LLMs]]></title>
            <link>https://medium.com/@fhirshotlearning/harnessing-pubmed-a-deep-dive-in-medical-knowledge-extraction-powered-by-llms-4e895b4f0839?source=rss-7e548aa5925b------2</link>
            <guid isPermaLink="false">https://medium.com/p/4e895b4f0839</guid>
            <category><![CDATA[large-language-models]]></category>
            <category><![CDATA[proof-of-concept]]></category>
            <category><![CDATA[pubmed]]></category>
            <dc:creator><![CDATA[FHIR Shot Learning]]></dc:creator>
            <pubDate>Wed, 01 Jan 2025 11:20:43 GMT</pubDate>
            <atom:updated>2025-01-07T10:38:00.098Z</atom:updated>
            <content:encoded><![CDATA[<p>PubMed is a free, publicly accessible database of biomedical and life-sciences literature. It contains more than 37 million citations and abstracts.</p><p>But it is far more than a repository of medical research; it is a vast trove of insights waiting to be unlocked. With its API granting access to millions of citations and abstracts, it offers a real opportunity to transform raw data into actionable knowledge.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ipmbJpY6n1oI5SXgiltmug.jpeg" /><figcaption>Source: <a href="https://unsplash.com/@hjrc33">hjrc33</a></figcaption></figure><h3>Scenario 1: A Traditional Approach to Finding Solutions</h3><p>Imagine a scenario where a patient suffers from a disease and is allergic to conventional treatment. The doctor, eager to help, decides to find an alternative by manually searching online. They eventually discover a peer-reviewed article in a respected journal and decide to try the suggested treatment. However, the study in question is a single-blind trial with limited supporting data. Tragically, the treatment fails, leaving the patient to suffer further.</p><p>While the doctor’s decision to consult recent research is commendable, it’s not without risks. The <strong>quality of the article</strong>, the <strong>credibility of the journal</strong>, the <strong>type of study</strong>, and even the <strong>authors’ affiliations</strong> all play critical roles in determining whether the research is trustworthy. Unfortunately, in the current landscape, many rely solely on a journal’s <strong>Impact Factor</strong> as a metric for quality. But is that enough?</p><p>Consider this: a <strong>double-blind study</strong> is generally more reliable than a single-blind one because it also guards against researcher bias. 
Similarly, an article authored by researchers affiliated with highly ranked institutions or one with substantial citation counts might be more credible. So why haven’t we developed a comprehensive scoring methodology to evaluate papers holistically? Why are we fixated solely on the Impact Factor of journals when more nuanced metrics could significantly enhance the credibility of research used for critical decisions?</p><h3>Scenario 2: Relying on AI Without Caution</h3><p>Now imagine another scenario: with the rising popularity of ChatGPT, the doctor decides to forgo traditional search methods and instead queries the AI for a solution. ChatGPT, celebrated for passing medical exams and its widespread use in medical assistance, confidently suggests a treatment. However, the treatment only worsens the patient’s existing symptoms. Further investigation reveals that ChatGPT was hallucinating, that is, producing fabricated or inaccurate information, and thereby potentially jeopardizing the patient’s health or even their life.</p><p>This is not to diminish ChatGPT’s capabilities; it is undeniably a groundbreaking tool for tasks such as summarization, classification, and contextual language understanding. However, when it comes to factual accuracy in critical fields like healthcare, relying solely on a language model is fraught with risks. Large Language Models (LLMs) like ChatGPT are prone to hallucinations, where they generate plausible but false or unverified information. In a domain where even a single error can have life-threatening consequences, this is a risk we simply cannot afford.</p><h3>A Smarter Solution: Combining LLMs and Knowledge Graphs</h3><p>Rather than relying solely on LLMs, we can leverage their strengths to construct <strong>Knowledge Graphs (KGs)</strong>, a structured, traceable, and explainable framework. 
Knowledge Graphs offer several advantages:</p><ul><li>They are <strong>transparent</strong>, allowing users to trace recommendations back to the original source.</li><li>They are <strong>credible</strong>, incorporating only validated data from reliable research.</li><li>They are <strong>explainable</strong>, clearly showing the connections between entities like diseases, treatments, and study types.</li></ul><p>By integrating LLMs as tools for extracting information and combining them with Knowledge Graphs for reasoning and explainability, we can create systems that are both powerful and trustworthy. This hybrid approach ensures that doctors have access to reliable, evidence-based insights, minimizing the risks of misinformation and ultimately improving patient outcomes.</p><p>The future of healthcare lies not in replacing traditional methods or blindly trusting AI but in combining the best of both worlds — human expertise, AI innovation, and structured, explainable data.</p><p>This project aims to explore how advanced tools and methodologies can be combined to create a robust pipeline that not only processes vast amounts of data but also ensures its usability and credibility. 
By leveraging the strengths of both LLMs and Knowledge Graphs, we take a step closer to building intelligent systems that are explainable, evidence-based, and capable of supporting decision-making in high-stakes environments like medicine.</p><h3>Simplifying Complexity: The Project’s Two Core Parts</h3><p>While the project is divided into four parts for clarity, it fundamentally revolves around two primary components:</p><h4>Part 1: Distilling the Data</h4><p>This section establishes the foundation for the project:</p><ul><li><strong>Data Scraping</strong>: Collecting PubMed data and supplementing it with journal impact factors and university research scores.</li><li><strong>Data Visualization</strong>: Conducting exploratory data analysis (EDA) to understand data distributions and correlations between features.</li><li><strong>Mathematical Modeling</strong>: Developing a scoring methodology to rank papers based on feature importance.</li></ul><h4>Part 2: Building Knowledge Graphs</h4><p>The focus here is on leveraging LLMs to construct sophisticated knowledge graphs:</p><ul><li><strong>Named Entity Recognition (NER)</strong>: Employing biomedical transformers to identify critical entities such as institutions, researchers, and study attributes.</li><li><strong>Relationship Modeling</strong>: Using Llama 3.1 to establish connections between these entities, enabling the creation of meaningful relationships.</li><li><strong>Knowledge Graphs</strong>: Constructing and querying knowledge graphs to visualize data and derive actionable insights.</li></ul><h3>Why This Matters</h3><p>This proof-of-concept project demonstrates how artificial intelligence (AI) can transform unstructured data into structured, explainable knowledge. 
By automating the pipeline with AI models, the project points to a range of applications:</p><ul><li><strong>Healthbots</strong>: Automating patient inquiries with AI-driven chatbots powered by high-quality research.</li><li><strong>Recommender Systems</strong>: Guiding researchers and clinicians to relevant studies based on robust scoring and relationships.</li><li><strong>Explainable AI</strong>: Enhancing trust by providing clear, evidence-backed recommendations.</li></ul><h3>What’s Next: A Four-Part Series</h3><p>This blog marks the beginning of a series that delves deeper into the methods and findings of the project. Here’s what you can expect:</p><ul><li><a href="https://medium.com/@fhirshotlearning/pubmed-data-part-1-web-scraping-c307bca93008"><strong>PubMed Data Part 1: Web Scraping</strong></a>: Exploring how PubMed data, journal impact factors, and university research scores were collated from different sources and integrated into a single dataframe.</li><li><a href="https://medium.com/@fhirshotlearning/pubmed-data-part-2-data-visualisation-1b403a800875"><strong>PubMed Data Part 2: Data Visualisation</strong></a>: Uncovering patterns and correlations through exploratory data analysis (EDA).</li><li><a href="https://medium.com/@fhirshotlearning/pubmed-data-part-3-mathematical-modelling-e3c698a1e5ed"><strong>PubMed Data Part 3: Mathematical Modelling</strong></a>: Developing a scoring equation for ranking the articles using an unsupervised learning method.</li><li><a href="https://medium.com/@fhirshotlearning/b1cd0cb382b6"><strong>Part 4: Building Knowledge Graphs</strong></a>: Leveraging transformers and LLMs for NER and relationship extraction, then constructing Knowledge Graphs from the results.</li></ul><p>This four-part series aims to demonstrate how a combination of data science, machine learning, and advanced AI models can streamline complex biomedical research workflows. 
By starting with data collection and progressing through visualization, modeling, and Knowledge Graph construction, each part builds on the previous one to show an end-to-end approach to transforming unstructured data into actionable insights. Whether you’re a data scientist, researcher, or healthcare professional, this series offers a practical guide to applying AI to biomedical literature.</p><p>If you are interested in the full project, it is available on <a href="https://github.com/amulya-prasad/Biomedical-Text-Extraction">GitHub</a>.</p><p>Happy Reading!</p>]]></content:encoded>
        </item>
    </channel>
</rss>