Analysis of similarities between drugs

David Chun
INST414: Data Science Techniques
Dec 9, 2023

Introduction

In the world of pharmacology, understanding drug interactions is not just a matter of scientific curiosity but a necessity for ensuring patient safety and effective treatment plans. My recent analysis sheds light on the often-misunderstood relationships between different drug classes, offering a new perspective on how drugs interact based on their uses.

Data Collection:

The project began with an ambitious goal: to compile a comprehensive dataset of over 8,000 drugs. The data, scraped from drugs.com, included critical information like drug uses, classes, interactions, and generic names. Initially, I used an old web scraper that was made with BeautifulSoup for this task. But I soon realized its limitations in handling large-scale data and error management. To overcome these hurdles, I switched to Scrapy, a more robust framework designed for heavy-duty web scraping. Its superior error handling and efficiency, especially when paired with proxy servers, made it an ideal choice for my needs.
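
As a rough illustration of the error handling and throttling that motivated the switch, a Scrapy project's settings.py might look something like the sketch below. The setting names are standard Scrapy settings, but every value is an assumption chosen for illustration, not the configuration used in this project.

```python
# settings.py (illustrative values, not the project's actual configuration)

# Retry failed requests instead of silently losing pages.
RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]

# Keep the request rate modest so the crawl is less likely to be blocked.
DOWNLOAD_DELAY = 1.0
CONCURRENT_REQUESTS_PER_DOMAIN = 4
AUTOTHROTTLE_ENABLED = True

# Cache responses so re-running the crawl does not re-download
# pages that were already fetched.
HTTPCACHE_ENABLED = True
```

Proxies are typically set per request through the request's meta["proxy"] key rather than in these settings; how they were wired up here is not covered in this post.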

Methodology: Navigating Through Complexity

The core of my analysis revolved around two metrics, computed for each combination of drug classes: the average cosine similarity of their uses and the total edge weight of their interactions. Together, these metrics test whether drugs with similar uses are more likely to interact.
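
To make the first metric concrete, here is a minimal sketch of an average cosine similarity between two drug classes. It assumes each drug's uses are a plain string turned into word counts; the function names and toy data are hypothetical, and the actual vectorization in the project may differ (for example, TF-IDF instead of raw counts).

```python
import math
from collections import Counter
from itertools import product


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty uses list is treated as sharing nothing
    return dot / (norm_a * norm_b)


def average_class_similarity(uses_a, uses_b):
    """Mean cosine similarity over every (drug in A, drug in B) pair."""
    vectors_a = [Counter(u.lower().split()) for u in uses_a]
    vectors_b = [Counter(u.lower().split()) for u in uses_b]
    scores = [cosine_similarity(x, y) for x, y in product(vectors_a, vectors_b)]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy data: one "uses" string per drug in each class.
stimulants = ["adhd narcolepsy", "adhd obesity"]
anorexiants = ["obesity weight loss"]
print(round(average_class_similarity(stimulants, anorexiants), 3))
```

Averaging over all cross-class pairs keeps the score comparable between large and small classes, which is one reason to prefer it over a simple sum.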

I categorized drug interactions into three levels (major, moderate, and minor) and assigned a weight to each. This simple categorization was the basis of my network graph construction, allowing me to visualize and analyze the interaction patterns.
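
The weighting and the total edge weight per class combination can be sketched roughly as follows. The weights 3, 2, and 1 are an assumption, since the exact values are not stated here, and the drug names, record layout, and function name are hypothetical.

```python
from collections import defaultdict

# Assumed weights; the actual values used in the project are not stated.
SEVERITY_WEIGHT = {"major": 3, "moderate": 2, "minor": 1}


def total_edge_weights(interactions, drug_class):
    """Sum interaction weights for every unordered pair of drug classes."""
    totals = defaultdict(int)
    for drug_a, drug_b, severity in interactions:
        # Sort so (A, B) and (B, A) accumulate into the same edge.
        pair = tuple(sorted((drug_class[drug_a], drug_class[drug_b])))
        totals[pair] += SEVERITY_WEIGHT[severity]
    return dict(totals)


# Hypothetical records: (drug, interacting drug, severity level).
drug_class = {
    "drug_a": "CNS stimulants",
    "drug_b": "Anorexiants",
    "drug_c": "Anorexiants",
}
interactions = [
    ("drug_a", "drug_b", "major"),
    ("drug_a", "drug_c", "moderate"),
    ("drug_b", "drug_c", "minor"),
]
weights = total_edge_weights(interactions, drug_class)
```

Each key in the result corresponds to one edge of the class-level graph, so the dictionary could be loaded into a graph library as weighted edges.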

To enhance the accuracy of my analysis, I used scispaCy, an NLP library built on spaCy for scientific and biomedical text. Its entity linker was particularly useful for extracting diseases and drug uses from raw text, giving me a more concise dataset to work with. The raw text still plays a role later in the analysis.

Insights and Surprises

One of my most intriguing findings was the relationship between CNS stimulants and Anorexiants. Despite having the highest edge weight, they exhibited a surprisingly low average cosine similarity of just 0.21. To me, this suggested that the processed uses list might not sufficiently capture the complexity of drug interactions. It became evident that including data on ingredients and side effects could offer a more complete picture, although these were not available at the time of my analysis.

I began to suspect that scispaCy may have filtered out important information that would affect these scores. So I constructed two graphs: one computing cosine similarity from the raw text, and the other from the preprocessed uses list.

Visualizing the Data:

My visualizations — two graphs comparing total edge weight with cosine similarity for drug classes — revealed a surprising insight: the choice of raw text versus a preprocessed uses list had little impact on the results. More importantly, these visualizations highlighted a minimal correlation between drug interactions and their listed uses, suggesting that regulatory agencies might need to focus more on ingredients and adverse effects.
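
One way to put a number on the relationship the graphs show is a Pearson correlation between total edge weight and average cosine similarity across drug-class pairs. This is a minimal sketch rather than the method behind the graphs, which were read visually; the function name and the sample values are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values, one entry per drug-class pair (not real results).
edge_weights = [120, 45, 300, 80]
similarities = [0.35, 0.40, 0.25, 0.30]
r = pearson_r(edge_weights, similarities)
```

A value of r near zero would be consistent with the weak relationship seen in the graphs, while running the same calculation on both the raw-text and preprocessed similarities would show how much the preprocessing matters.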

Conclusion

This analysis, while limited by the absence of certain data, underscores the need for a more holistic approach to analyzing drug data, one that goes beyond the metrics that make intuitive sense. For regulatory bodies, these insights suggest prioritizing other variables, such as ingredients and adverse effects, over listed drug uses. In the future, incorporating additional data layers such as ingredients and side effects into my model could further refine the analysis.

https://github.com/dvc0310/Assignment3
