Enhancing Drug Safety: Utilizing Logistic Regression for Prescription Drug Classification

Introduction

David Chun
INST414: Data Science Techniques
Dec 17, 2023


This project analyzes over-the-counter (OTC) drugs in an effort to strengthen public health protections. By assessing these medications for potential risks, such as adverse drug interactions and harmful side effects, I aim to identify any OTC drugs that may be more appropriately managed as prescription-only.

The classification model is developed with a focus on public safety and aims to prevent the misuse of drugs that would otherwise be readily available without professional oversight. Its purpose is to ensure that medications are used safely and effectively, with the ultimate goal of minimizing health risks.

Objective: Development of a Logistic Regression Classifier Model

Overview of the Model

The core objective of this project is the development and implementation of a logistic regression classifier model. It was developed to distinguish between OTC drugs that are safe for general public use and those that present substantial risks. These risks might necessitate their reclassification as prescription-only medications, which is a decision with significant public health implications.

The primary function of the logistic regression model is to analyze various health and usage metrics related to OTC drugs. These metrics include potential interactions with other drugs and the prevalence or severity of side effects. By processing this data, the model aims to identify patterns and correlations that might indicate a higher risk profile for certain OTC drugs.

The output from this logistic regression model is designed not as a definitive verdict on drug classification but as an initial set of recommendations. These recommendations are intended to serve as a valuable tool for healthcare regulatory bodies and medical professionals. They provide a data-driven starting point for further in-depth analysis and deliberation regarding the classification of drugs.

By utilizing this model, regulatory authorities can make more informed decisions about drug classification, and can potentially shift some OTC drugs to prescription status where necessary. This approach aims to allow for public health and safety by ensuring that drugs with higher risk profiles are appropriately regulated and monitored.

Data Collection

Web Scraping: Challenges and Strategies

Web scraping, a critical component of this project, proved to be the most time-consuming aspect. After looking at my options, I chose Scrapy over BeautifulSoup for several reasons. Scrapy excelled in error handling and offered better performance due to its asynchronous nature. Additionally, Scrapy’s built-in support for exporting data in various formats like CSV and JSON was highly beneficial. However, Scrapy had a steeper learning curve compared to BeautifulSoup, which added to the complexity of the task.

The implementation process involved creating five distinct spiders, primarily to navigate drugs.com, with one dedicated to rxlist.com. Initially, I designed four different types of items for data collection. In hindsight, this was unnecessary and stemmed from my inexperience. These spiders gathered a wide range of data, including generic names, drug classes, indications, prescription statuses, interactions, and side effects.
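As a rough illustration of the setup, the sketch below shows what one of these spiders might look like. The start URL, CSS selectors, and item fields are placeholders rather than the ones actually used against drugs.com.

```python
import scrapy


class DrugSpider(scrapy.Spider):
    """Illustrative spider; the real project used five spiders with
    site-specific selectors for drugs.com and rxlist.com."""

    name = "drug_info"
    allowed_domains = ["drugs.com"]
    start_urls = ["https://www.drugs.com/drug_information.html"]  # placeholder entry page

    def parse(self, response):
        # Follow links to individual drug pages (selector is hypothetical).
        for href in response.css("a.drug-link::attr(href)").getall():
            yield response.follow(href, callback=self.parse_drug)

    def parse_drug(self, response):
        # One item per drug; Scrapy's feed exports can write these to CSV/JSON,
        # e.g. `scrapy crawl drug_info -O drugs.json`.
        yield {
            "generic_name": response.css("h1::text").get(),
            "drug_class": response.css(".drug-class::text").get(),
            "rx_status": response.css(".rx-status::text").get(),
            "side_effects_text": " ".join(
                response.css("#side-effects p::text").getall()
            ),
        }
```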

Overcoming Data Collection Challenges

My initial approach to collecting data on indications and side effects involved detecting links that led to condition or symptom pages. However, this method failed to gather sufficient information on OTC drugs, since many lacked detailed data on side effects or indications, which posed a risk of bias in the classification model. To mitigate this, I shifted to scraping entire paragraphs and used natural language processing to extract uses and side effects.

Integration with OpenFDA API

While the web-scraped data was extensive, it lacked information on drug ingredients. To fill this gap, I turned to the OpenFDA API, which provided the necessary ingredient data. However, the data from OpenFDA was unstructured and messy, which made it difficult to work with.

One significant issue was the imprecision of the API’s search queries. For example, a query for a COVID vaccine would sometimes return unrelated results. Additionally, many drug names included dosage forms, complicating data retrieval. To overcome these issues, I developed several backup queries, which eventually yielded a more complete list of ingredients for the majority of the drugs in the study.
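A minimal sketch of this fallback strategy is shown below, using the public openFDA drug label endpoint. The exact query order and fields used in the project may have differed; the function and its fallback list here are illustrative.

```python
import requests

OPENFDA_LABEL_URL = "https://api.fda.gov/drug/label.json"


def fetch_ingredients(drug_name):
    """Try progressively looser openFDA label searches for a drug's ingredients."""
    # Fallback queries, from most to least specific; the field names come from
    # the openFDA drug label endpoint.
    searches = [
        f'openfda.brand_name.exact:"{drug_name}"',
        f'openfda.generic_name:"{drug_name}"',
        f'openfda.substance_name:"{drug_name}"',
    ]
    for search in searches:
        resp = requests.get(
            OPENFDA_LABEL_URL, params={"search": search, "limit": 1}, timeout=10
        )
        if resp.status_code != 200:
            continue  # no hit for this query; try the next, looser one
        results = resp.json().get("results", [])
        if results:
            # Drug labels expose ingredients as free-text fields.
            return results[0].get("active_ingredient", [])
    return []


print(fetch_ingredients("ibuprofen"))
```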

Data Preparation

The common sentiment in data science that data preparation (collection, cleaning, wrangling, and preprocessing) consumes the majority of project time is not just a cliché but a reality, and this project was no exception. While the data collection phase was complete, transforming the raw data into a format suitable for analysis proved challenging.

The raw data included extensive paragraphs detailing indications, ingredients, and side effects. At first glance, this unstructured data seemed daunting, but then I discovered SciSpacy, a natural language processing library with models trained on biomedical literature. One of SciSpacy’s features, Named Entity Recognition, proved especially useful: it could accurately categorize words into specific groups such as diseases and conditions.

The process involved converting the messy paragraphs into structured lists of words. A significant task was to identify and eliminate false positives to ensure the integrity of the data. The result was a dataset that was more structured than before, but it still needed further refinement. The challenge at this stage was dealing with the vast array of diseases, conditions, and ingredients, all of which were still in text form.
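A minimal sketch of this step is shown below, assuming the scispacy package is installed along with one of its pretrained NER models (here en_ner_bc5cdr_md, which tags DISEASE and CHEMICAL entities); the model actually used in the project may have been different.

```python
import spacy

# A scispaCy NER model trained on the BC5CDR corpus; it labels DISEASE and
# CHEMICAL entities. The model used in the project may have been different.
nlp = spacy.load("en_ner_bc5cdr_md")

paragraph = (
    "This medication is used to relieve headache and reduce fever. "
    "Common side effects include nausea, dizziness, and stomach pain."
)

doc = nlp(paragraph)

# Turn the free-text paragraph into a structured list of (entity, label) pairs,
# which can then be screened for false positives.
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```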

Feature Engineering

The initial step in feature engineering involved simplifying the complex dataset. I chose frequency counters as the primary method for structuring the data. This approach categorized and quantified side effects and usages, grouping them by disease/condition and ingredient type. The frequency counters served as vectors, which were then used as features in the logistic regression model.
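A simplified sketch of this idea, with hypothetical drug names and extracted terms, might look like the following; the project’s actual category groupings were more elaborate.

```python
from collections import Counter

from sklearn.feature_extraction import DictVectorizer

# Hypothetical per-drug term lists produced by the NER step.
drug_terms = {
    "drug_a": ["headache", "fever", "nausea", "nausea"],
    "drug_b": ["rash", "itching", "nausea"],
}

# One frequency counter per drug; each counter becomes one row of the feature matrix.
counts = [Counter(terms) for terms in drug_terms.values()]

vectorizer = DictVectorizer(sparse=True)
X = vectorizer.fit_transform(counts)

print(vectorizer.get_feature_names_out())
print(X.toarray())
```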

As I delved deeper into the dataset, I realized the potential of incorporating the interactions list for each medication. My initial thought was to use frequency counters for this data as well. However, I soon thought of a different approach, which involved creating a network of drug interactions and using various centrality scores as features for the classification model.

To construct this network, I assigned different weights to the edges based on the severity of interactions: severe interactions were weighted as 3, moderate as 2, and minor as 1. This weighting system, although somewhat arbitrary due to the lack of clear numeric definitions for interaction severity, provided a structured means to quantify the impact of these interactions.
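A minimal sketch of building such a weighted graph with NetworkX is shown below; the interaction records are made up for illustration.

```python
import networkx as nx

# Severity weights as described above: severe = 3, moderate = 2, minor = 1.
SEVERITY_WEIGHTS = {"severe": 3, "moderate": 2, "minor": 1}

# Hypothetical scraped interaction records for pairs of drugs.
interactions = [
    ("warfarin", "aspirin", "severe"),
    ("ibuprofen", "aspirin", "moderate"),
    ("loratadine", "ibuprofen", "minor"),
]

G = nx.Graph()
for drug_a, drug_b, severity in interactions:
    G.add_edge(drug_a, drug_b, weight=SEVERITY_WEIGHTS[severity])

print(list(G.edges(data=True)))
```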

Using Centrality Scores in Assessing Drug Importance

Initially, my most promising idea was to use PageRank scores to discern the most significant drugs in the interaction network. This idea hinged on quantifying the popularity of each drug and using it as a basis for assigning initial importance scores. By feeding these scores into the PageRank algorithm, I thought I could generate centrality scores that reflected each drug’s importance, not just in terms of its number of interactions but also its prominence in medical usage.

This approach aimed to provide a dual perspective on drug importance: one aspect being the drug’s popularity or frequency of use in medical treatments, the other its centrality in terms of interactions with other drugs. I thought the combination of these factors would yield a more nuanced understanding of each drug’s role and significance in the context of drug usage.

However, I realized that this idea was difficult to execute due to challenges in accurately obtaining data on the popularity or widespread usage of each drug. Despite its potential for providing a deeper insight into drug interactions and importance, I had to set this method aside in favor of more feasible centrality measures.

In the end, I decided to use these measures:

Degree Centrality: This measure identifies drugs with a high number of interactions, indicating their influential role in the network.

Closeness Centrality: This metric provides insights into how quickly effects might propagate through the network, highlighting drugs in central positions.

Betweenness Centrality: This approach emphasizes drugs that act as crucial connectors within the network, offering a deeper understanding of interaction pathways.

Combined, these centrality measures offer a more complete view of the network’s structure and illuminate the significance and roles of individual drugs within it.
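A sketch of computing these three measures with NetworkX follows. The conversion of severity weights into distances for closeness and betweenness is my own assumption about how the weights might be applied; the original implementation may have handled weighting differently.

```python
import networkx as nx

# Rebuild (or reuse) the weighted interaction graph from the earlier sketch.
G = nx.Graph()
G.add_weighted_edges_from([
    ("warfarin", "aspirin", 3),
    ("ibuprofen", "aspirin", 2),
    ("loratadine", "ibuprofen", 1),
])

# Degree centrality counts interaction partners (unweighted by default).
degree = nx.degree_centrality(G)

# Closeness and betweenness treat edge weights as distances, so the severity
# weights are inverted here: more severe interactions act as "shorter" edges.
for _, _, data in G.edges(data=True):
    data["distance"] = 1.0 / data["weight"]

closeness = nx.closeness_centrality(G, distance="distance")
betweenness = nx.betweenness_centrality(G, weight="distance")

# Assemble one row of centrality features per drug.
centrality_features = {
    drug: {
        "degree": degree[drug],
        "closeness": closeness[drug],
        "betweenness": betweenness[drug],
    }
    for drug in G.nodes
}
print(centrality_features)
```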

The Classification Model

Contrasting starkly with the extensive data preparation phase, the actual modeling process was swift and efficient. The thorough data preparation proved to be a significant advantage and set a solid foundation for the modeling stage.

While constructing the model, I defined the target variable with a binary classification system. Drugs classified as “prescription only” and those that were “discontinued” were labeled as 1, both of which signified a higher risk category. All other drugs were assigned the 0 label, which indicated a lower risk profile. The rationale for grouping discontinued drugs with prescription-only drugs stemmed from the fact that many discontinued drugs had been deemed too hazardous for general medical use.

The final model had 498 features. The majority of these features were frequency counts of specific categorical variables related to drug characteristics, while the remaining features were the centrality measures derived from the drug interaction network.

To ensure the model’s reliability, I used k-fold cross-validation with 10 different splits. The cross-validation process yielded an average score, providing a quantitative measure of the model’s predictive power.
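A minimal sketch of this setup with scikit-learn is shown below, using synthetic stand-in data in place of the 498 engineered features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; in the project, X held the 498 engineered features
# and y was 1 for prescription-only/discontinued drugs, 0 otherwise.
rng = np.random.default_rng(0)
X = rng.random((200, 10))
y = rng.integers(0, 2, size=200)

model = LogisticRegression(max_iter=1000)

# 10-fold cross-validation; F1 is shown here, but precision and recall can be
# computed the same way by changing the `scoring` argument.
scores = cross_val_score(model, X, y, cv=10, scoring="f1")
print(f"Mean F1 across folds: {scores.mean():.3f}")
```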

Model Metrics: Precision, Recall, and F1 Scores

The logistic regression model demonstrated good performance, with precision, recall, and F1 scores all exceeding 90%. However, this high level of accuracy came with certain caveats. A notable issue was the imbalance in the dataset between prescription and non-prescription drugs. This imbalance adversely affected the performance metrics for OTC drugs, leading to lower scores across all three measures.

[Table: Average precision, recall, and F1 scores from 10-fold cross-validation]
[Figure: Confusion matrix]

Addressing Data Imbalance

To tackle the challenge posed by data imbalance, I explored several strategies:

Classification Threshold Adjustments: Initially, I considered modifying the classification threshold as a potential solution. To assess whether this approach was viable, I plotted various metrics against different threshold levels (a sketch of this sweep appears after this list of strategies). However, the analysis revealed that altering the threshold would not significantly enhance the model’s performance.

[Figure: Threshold vs. various metrics]

Hyperparameter Tuning: Next, I turned to hyperparameter tuning, assuming it might yield better results. Surprisingly, this approach had a negligible impact on the model and sometimes slightly degraded its performance.

Regularization Techniques: I experimented with L1 and L2 regularization, though I was skeptical that the model was overfitting, given the reasonable ratio of features to data points. As anticipated, regularization did not substantially improve the model.
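Below is a sketch of the threshold sweep mentioned in the first strategy, again with synthetic stand-in data; the plotted version simply charts these same values against the threshold.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the real sweep used the prepared drug dataset.
rng = np.random.default_rng(0)
X = rng.random((500, 10))
y = rng.integers(0, 2, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]

# Sweep classification thresholds and record precision/recall/F1 at each one,
# which shows whether moving away from the default 0.5 cutoff helps.
for threshold in np.arange(0.1, 0.9, 0.1):
    preds = (probs >= threshold).astype(int)
    print(
        f"t={threshold:.1f}  "
        f"precision={precision_score(y_test, preds, zero_division=0):.2f}  "
        f"recall={recall_score(y_test, preds, zero_division=0):.2f}  "
        f"f1={f1_score(y_test, preds, zero_division=0):.2f}"
    )
```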

Final Model Selection

After considering various methods to refine the model, I ultimately decided to proceed with the basic logistic regression model, setting it to run for 1000 iterations. This decision was based on the realization that despite the attempts to enhance the model through various adjustments, the original version provided the most reliable and robust results.

Discussion

Model Application and Evaluation

This project sits at the intersection of supervised learning and network analysis, primarily utilizing centrality measures within a drug interaction network. The three centrality scores (degree, closeness, and betweenness) turned out to be the most effective predictors of a drug’s prescription status. By nature, these measures offer a more nuanced view of drug interactions than simple frequency counts.

Supervised learning was the other pillar of this project and was implemented through logistic regression. Despite its name, logistic regression is fundamentally a binary classification tool, and its main strength lies in striking a balance between power and interpretability. This interpretability is vital, especially in a field where understanding and justifying the model’s recommendations is as important as the recommendations themselves.

Insights for Decision Makers

The key finding from this model for decision-makers and stakeholders is that a drug’s degree centrality is the most significant factor in determining its classification as either over-the-counter or prescription-only. A less influential, yet still notable factor is whether the drug is a viral vaccine, followed by its classification as an anti-inflammatory agent for eyes. On the other end, the features least associated with prescription status are miscellaneous analgesics, topical rubefacients, and herbal products.

[Figure: Top 5 most influential features (positive correlation)]
[Figure: Top 5 most influential features (negative correlation)]

This suggests that the extent of a drug’s interactions with other drugs, as indicated by its degree centrality, should be the primary consideration in determining its classification as either prescription-only or over-the-counter. It implies that stakeholders, when evaluating a drug’s classification, should prioritize examining its interaction network.
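A sketch of how such influential features can be read off a fitted logistic regression is shown below, with synthetic stand-in features; comparing coefficient magnitudes is most meaningful when the features are on comparable scales.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: in the project, the columns were the 498 engineered
# features (frequency counts plus the centrality scores).
rng = np.random.default_rng(0)
feature_names = [f"feature_{i}" for i in range(10)]
X = rng.random((200, len(feature_names)))
y = rng.integers(0, 2, size=200)

model = LogisticRegression(max_iter=1000).fit(X, y)

# Rank features by their signed coefficients: large positive values push a drug
# toward the prescription-only/discontinued class, large negative values toward OTC.
coefs = pd.Series(model.coef_[0], index=feature_names)
print("Most positive:\n", coefs.nlargest(5))
print("Most negative:\n", coefs.nsmallest(5))
```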

Project Limitations and Ethical Considerations

Limitations

The reliance on web-scraped and free API data introduced additional hurdles, especially with respect to the completeness and reliability of the information gathered. For example, as mentioned earlier, I wanted to use PageRank but decided against it because I lacked popularity data. That was not the only data I failed to obtain: I also tried to incorporate severity levels for side effects sourced from the Common Terminology Criteria for Adverse Events, but its list of side effects did not match up with the UMLS terms, which led me to scrap the idea altogether.

Another limitation of this project comes from the classification model of choice: logistic regression. While logistic regression is a great starting point to build on, other models could perform better. For example, the XGBoost classification model showed superior performance, as shown in the comparison below:

[Figure: XGBoost classifier performance metrics]

It also showed slightly different feature importance scores, with degree centrality still being the top feature.
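A sketch of this kind of comparison is shown below, assuming the xgboost package; the hyperparameters and data are illustrative, not the ones used in the actual experiment.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# Synthetic stand-in data; the real comparison reused the same engineered features.
rng = np.random.default_rng(0)
X = rng.random((500, 20))
y = rng.integers(0, 2, size=500)

# A baseline XGBoost classifier evaluated with the same 10-fold scheme.
xgb = XGBClassifier(n_estimators=200, eval_metric="logloss")
scores = cross_val_score(xgb, X, y, cv=10, scoring="f1")
print(f"Mean F1 across folds: {scores.mean():.3f}")

# XGBoost exposes its own feature-importance scores, which can be compared
# against the logistic regression coefficients.
xgb.fit(X, y)
print(xgb.feature_importances_)
```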

Ethical Concerns

An ethical concern in using models for drug prescription classification lies in the high stakes attached to their accuracy. Over-reliance on such models without supplementing them with expert medical advice could lead to regulatory oversights, especially with OTC drugs that might require stricter control. Therefore, we must treat the outputs of this model as preliminary recommendations rather than definitive conclusions.

The involvement of medical experts in the decision-making process is essential, and we must acknowledge the fact that no machine learning model, regardless of its sophistication, can guarantee absolute accuracy.

Conclusion

The project has demonstrated that logistic regression can be a valuable tool for identifying OTC drugs that might need reclassification, but it has several limitations. The model performs well in identifying potential safety concerns based on the drug interaction network and usage patterns, as shown by its metrics. However, the data imbalance affected these metrics, and although various strategies for addressing this imbalance were explored, they didn’t significantly improve the model’s outcome.

The centrality measures derived from the drug interaction network proved to be the strongest indicators of reclassification potential. The project also encountered data limitations: incomplete datasets, reliability issues, and the unavailability of certain information such as drug popularity.

It’s clear from these outcomes that while the logistic regression model is promising, it should not be the sole basis for reclassification decisions. The model should be viewed as a supplementary tool to guide initial recommendations.

In sum, the project sheds light on how data science can contribute to public health, but it also highlights the complexities and responsibilities that come with its applications. Decision-makers in healthcare regulation should consider these findings as a starting point for more in-depth evaluation processes involving expert consultation.

Web Scraper and Jupyter notebooks:

https://github.com/dvc0310/drug-classification
