Clustering diseases based on drug class frequency

David Chun
INST414: Data Science Techniques
Dec 9, 2023

In healthcare data analysis, extracting valuable insights often means sorting through vast amounts of data. Clustering is one method for doing this, grouping data points to find hidden patterns. Our goal was to discover relationships between drug classes and the diseases they treat.

Challenges in Data Acquisition and Preparation

The most challenging aspect of this analysis was constructing a meaningful dataset. Listing each medication’s specific use cases produced a high-dimensional, sparse dataset spanning a wide array of diseases and conditions. To tackle this, I first mapped conditions to categories using a dictionary with categories as keys and associated keywords as values, but this proved to be a rough start: the categorizations weren’t entirely accurate.
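As a rough illustration of the keyword-dictionary approach and why it can misfire, here is a minimal sketch. The categories, keywords, and the `categorize` helper are hypothetical and are not taken from the project repository.

```python
# Hypothetical categories and keywords, for illustration only.
CATEGORY_KEYWORDS = {
    "Infectious": ["infection", "bacterial", "viral"],
    "Cardiovascular": ["hypertension", "heart", "angina"],
    "Mental health": ["depression", "anxiety"],
}


def categorize(condition: str) -> str:
    """Return the first category whose keyword appears in the condition text."""
    text = condition.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return "Uncategorized"


print(categorize("Bacterial skin infection"))  # matches the keyword "bacterial"
print(categorize("Heartburn"))                 # matches "heart", a false positive
```

Substring matching is simple, but a digestive complaint like "Heartburn" lands in the cardiovascular bucket, which is the kind of inaccuracy that makes this approach a rough start.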

To improve the categorization, I looked into various disease classification models. Eventually, I turned to GPT-4, which offered ease of use and relatively high accuracy. Its output was later validated by a medical professional (my father, a doctor). This resulted in a coherent dataset ready for cluster analysis.

From Binary Matrix to Cluster Construction

The first step was to construct a binary matrix, with drug classes as columns and conditions as rows. My first attempt at clustering used agglomerative clustering with cosine similarity. However, without dimensionality reduction, the resulting clusters were unbalanced: a large primary cluster overshadowed several smaller ones.
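To make the matrix layout concrete, here is a small sketch that builds a condition-by-drug-class binary matrix and compares two rows with cosine similarity. The condition and drug class names are made up for illustration, and the plain-Python `cosine_similarity` helper stands in for whatever library routine the project actually used.

```python
import math

# Hypothetical (condition, drug class) pairs, for illustration only.
pairs = [
    ("Hypertension", "Beta blockers"),
    ("Hypertension", "ACE inhibitors"),
    ("Angina", "Beta blockers"),
    ("Influenza", "Antivirals"),
]

conditions = sorted({condition for condition, _ in pairs})
drug_classes = sorted({drug for _, drug in pairs})
column = {drug: j for j, drug in enumerate(drug_classes)}

# One row per condition, one column per drug class; 1 means "treated by".
matrix = {condition: [0] * len(drug_classes) for condition in conditions}
for condition, drug in pairs:
    matrix[condition][column[drug]] = 1


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


print(drug_classes)
print(matrix["Hypertension"])
print(cosine_similarity(matrix["Hypertension"], matrix["Angina"]))
```

Conditions that share drug classes score close to 1, while conditions with no drug class in common score 0. With many sparse rows, most pairs share little or nothing, which is one plausible reason the unreduced clusters came out so unbalanced.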

Dimensionality reduction turned out to be crucial for this analysis, and I opted for Singular Value Decomposition (SVD). After calculating the optimal number of components through cumulative explained variance, I applied SVD, which reshaped the binary matrix into a form amenable to clustering.
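The component-selection step can be sketched with NumPy as follows. This is an assumption-laden illustration, not the project's code: the 0.90 threshold and the `reduce_with_svd` helper are hypothetical, and the share of squared singular values is used as a stand-in for explained variance (scikit-learn's `TruncatedSVD` reports its own `explained_variance_ratio_`, which can differ on uncentered data).

```python
import numpy as np


def reduce_with_svd(X, threshold=0.90):
    """Keep the fewest SVD components whose cumulative share of squared
    singular values reaches `threshold` (a hypothetical cutoff)."""
    X = np.asarray(X, dtype=float)
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    cumulative = np.cumsum(s**2) / np.sum(s**2)
    # First index where the cumulative share reaches the threshold, plus one.
    k = min(int(np.searchsorted(cumulative, threshold)) + 1, len(s))
    # Rows of X projected onto the top-k components.
    return U[:, :k] * s[:k], k, cumulative


# Tiny made-up binary matrix: 4 conditions (rows) by 3 drug classes (columns).
X = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
])
reduced, k, cumulative = reduce_with_svd(X)
print(k, reduced.shape)
print(cumulative)
```

The reduced matrix has one row per condition and `k` dense columns, which is the form passed on to clustering. The sketch does not handle an all-zero matrix, where the share of squared singular values is undefined.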

Determining the Optimal Cluster Count

Choosing the right number of clusters without prior guidance is difficult, so I generated a dendrogram from the similarity matrix to visualize the hierarchy and pick a cluster count. The dendrogram suggested that nine clusters would best represent the underlying patterns.

With the optimal number of clusters in hand, I proceeded with agglomerative clustering.
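Once a cluster count has been read off the dendrogram, the hierarchy is cut at that count. The sketch below shows the idea with a small hand-rolled implementation in NumPy. The average-linkage choice and the `agglomerative_labels` helper are assumptions made for illustration; the project itself used scikit-learn, and the post does not state which linkage was chosen.

```python
import numpy as np


def cosine_distance_matrix(X):
    """Pairwise cosine distance (1 - cosine similarity) between rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    unit = X / norms
    return 1.0 - unit @ unit.T


def agglomerative_labels(X, n_clusters):
    """Average-linkage agglomerative clustering, stopped at n_clusters."""
    D = cosine_distance_matrix(np.asarray(X, dtype=float))
    clusters = [[i] for i in range(len(D))]  # start with one cluster per row
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average distance between all members of the two clusters.
                d = D[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    labels = np.empty(len(D), dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels


# Made-up reduced vectors: two rows point one way, two point another.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = agglomerative_labels(X, n_clusters=2)
print(labels)
```

This brute-force version is only practical for small inputs, but it makes the merge rule explicit: at each step the two clusters with the smallest average cosine distance are joined until the requested count remains.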

Unraveling the Clusters

The resulting clusters painted a clear picture of drug-disease relationships. For each cluster, I examined the top three disease types and their conditional probabilities.

The visualization of these clusters shows the conditional probability of each disease type within each cluster.
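The per-cluster summary can be sketched as below, reading "conditional probability" as the share of a cluster's conditions that belong to each disease type, that is, P(disease type | cluster). The `top_disease_types` helper, the labels, and the disease types are hypothetical examples, not results from the analysis.

```python
from collections import Counter, defaultdict


def top_disease_types(labels, disease_types, top_n=3):
    """For each cluster, return the top_n disease types with
    P(disease type | cluster) estimated from counts."""
    by_cluster = defaultdict(Counter)
    for label, disease_type in zip(labels, disease_types):
        by_cluster[label][disease_type] += 1

    summary = {}
    for label, counts in by_cluster.items():
        total = sum(counts.values())
        summary[label] = [
            (disease_type, count / total)
            for disease_type, count in counts.most_common(top_n)
        ]
    return summary


# Made-up cluster labels and disease types for five conditions.
labels = [0, 0, 0, 1, 1]
disease_types = ["Viral", "Viral", "Genetic", "Cardiovascular", "Cardiovascular"]

for cluster, top in top_disease_types(labels, disease_types).items():
    print(cluster, top)
```

Each cluster's probabilities sum to at most 1 because only the top three types are kept. A cluster dominated by one or two types is easier to interpret than one where the share is spread thinly.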

The cluster that made the most intuitive sense to me was cluster 5. Viral diseases tend to be very infectious, while genetic and congenital conditions can affect the immune system. Genetic conditions probably make individuals more susceptible to infectious and viral diseases, leading to comorbidities.

Tools and Techniques

The analysis relied on scikit-learn (sklearn), which provided the clustering algorithms and the utilities for dimensionality reduction. Data cleaning was an iterative process in which anomalies were ironed out through repeated validation and refinement.

Conclusion and Reflection

The application of clustering to our healthcare dataset has practical implications for cost management in healthcare settings. By identifying clusters that indicate which drug classes can treat multiple disease types effectively, healthcare providers and insurers can make more informed decisions about which medications to include in their formularies. This could lead to the adoption of drugs with broader applications, simplifying treatment plans and potentially reducing the costs associated with managing multiple medications for patients.

A streamlined formulary, focused on versatile and effective medications, can enhance the quality of treatment while maintaining affordability. While the analysis provides a valuable overview, it’s important to note that the effectiveness of drugs can vary widely among individual patients. Future efforts should aim to incorporate a wider array of data, including patient demographics and long-term outcomes, to build upon these preliminary findings.

https://github.com/dvc0310/drug_class_clustering
