Navigating the Complex World of Drug Reactions

Published in

INST414: Data Science Techniques

8 min read · Dec 9, 2023

Introduction

The selection of medications is a complex process, weighed down by concerns about potential adverse reactions. To address this, our project tapped into the rich data available from the OpenFDA API, focusing on drug adverse events. Central to our analysis were Python libraries: Pandas for data structuring, Matplotlib for visual representations, and Scikit-learn, particularly its KMeans class, for uncovering patterns in complex datasets.

TextBlob played a pivotal role in processing the reactions’ descriptions, assigning sentiment scores to these qualitative inputs. Complementing this, the ‘sentence_transformers’ library, specifically its SentenceTransformer class running on PyTorch, enabled us to transform textual reactions into meaningful vector representations. This approach was crucial in analyzing the semantics embedded in the data.

By combining K-means clustering with sentiment analysis, the project aimed to convert intricate datasets into actionable insights. This methodology was intended to assist individuals in making informed decisions regarding their medication choices, focusing on minimizing adverse reactions and enhancing overall health outcomes. This initiative bridges the gap between data science and healthcare, offering a pathway to more informed healthcare decisions.

Data Collection

Data collection began by connecting to the OpenFDA API with an API key, stored in the variable api_key. A purpose-built function, fetch_adverse_events, then queried the API's drug adverse event endpoint for each drug of interest.

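import requests

# api_key is assumed to hold your OpenFDA API key (defined elsewhere)
def fetch_adverse_events(drug_name, limit=100):
    """Return up to `limit` adverse event reports for a brand name, or None on failure."""
    base_url = "https://api.fda.gov/drug/event.json"
    query = f"?api_key={api_key}&search=patient.drug.openfda.brand_name:\"{drug_name}\"&limit={limit}"
    url = base_url + query
    response = requests.get(url)
    if response.status_code == 200:
        return response.json()['results']
    else:
        # Any non-200 response is treated as "no data"
        return None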

The primary focus of this data extraction was on the list of drugs, complemented by demographic information (age and gender) of the individuals reporting adverse reactions, and the detailed descriptions of these reactions. This extracted data was meticulously structured into a DataFrame, ensuring a systematic organization where each row encapsulated a unique event, precisely cataloged with attributes like ‘drug_name’, ‘age’, ‘gender’, and ‘reactions’. This methodical arrangement was crucial for the subsequent stages of analysis.

import pandas as pd

# drugs is assumed to be the list of drug names to query (defined elsewhere)
all_events = []
for drug in drugs:
    events = fetch_adverse_events(drug)
    if events:
        for event in events:
            # Extracting relevant information from the nested 'patient' record
            patient = event.get('patient', {})
            age = patient.get('patientonsetage', None)
            gender = patient.get('patientsex', None)
            reactions = [r.get('reactionmeddrapt') for r in patient.get('reaction', [])]
            all_events.append({'drug_name': drug, 'age': age, 'gender': gender, 'reactions': reactions})

# Convert to DataFrame: one row per reported event
df = pd.DataFrame(all_events)
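
The raw fields collected above arrive as strings and lists, so a small cleanup step can make them easier to analyze. The sketch below is not taken from the project code. It assumes that patientsex uses the codes 1 for male and 2 for female, that the age can be read as a plain number (the age unit field is ignored here), and that the combined_reactions column used by the sentiment snippet later in this post is a comma-joined string of the reaction terms. The helper name normalize_event and the sample record are hypothetical.

```python
def normalize_event(drug_name, event):
    """Flatten one openFDA event record into a row with typed fields."""
    patient = event.get("patient", {})

    # The onset age arrives as a string; convert it when possible.
    raw_age = patient.get("patientonsetage")
    try:
        age = float(raw_age)
    except (TypeError, ValueError):
        age = None

    # Assumed code mapping: "1" = male, "2" = female, anything else = unknown.
    sex_codes = {"1": "male", "2": "female"}
    gender = sex_codes.get(str(patient.get("patientsex")), "unknown")

    # Keep only reaction entries that actually carry a term.
    terms = [r.get("reactionmeddrapt") for r in patient.get("reaction", [])]
    reactions = [t for t in terms if t]

    return {
        "drug_name": drug_name,
        "age": age,
        "gender": gender,
        "reactions": reactions,
        # Assumption: one text field per event, joined with ", ".
        "combined_reactions": ", ".join(reactions),
    }


# Hypothetical record, shaped like the fields read in the loop above.
sample_event = {
    "patient": {
        "patientonsetage": "54",
        "patientsex": "2",
        "reaction": [
            {"reactionmeddrapt": "Nausea"},
            {"reactionmeddrapt": "Headache"},
            {},
        ],
    }
}
row = normalize_event("morphine", sample_event)
print(row)
```

Rows built this way can be appended to all_events in place of the raw dictionaries, so the DataFrame starts out with numeric ages and a ready-made text column.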

Challenges Along the Way

The journey through this analytical process was marked by several challenges. Beyond the usual technical difficulties such as occasional code malfunctions and library import issues, the project faced a significant challenge in data interpretation. The primary concern was the conversion of qualitative data — complex and multifaceted descriptions of drug reactions — into a quantifiable format suitable for computational analysis. The crux of the problem was devising a method to numerically represent subjective experiences and symptoms, a task that required both creative and analytical thinking to ensure accuracy and relevance in the subsequent analysis.

Preprocessing: A Bridge Between Qualitative and Quantitative

To address the challenge of quantifying qualitative data, sentiment analysis emerged as a pivotal solution. I had used the same technique in an earlier study of charting song lyrics, and it suited this scenario as well. Using the TextBlob library, each event's drug reactions were analyzed and assigned a sentiment score. This score numerically captured the sentiment conveyed in the reaction descriptions, ranging from -1 for highly negative sentiment to +1 for highly positive sentiment, with 0 as neutral. This approach preserved the unique characteristics of each reaction while providing a consistent quantitative measure for comparison and further processing.

In parallel, the sentence_transformers library played a crucial role in converting the textual data into a more analytically tractable format. It transformed complex text descriptions into high-dimensional vector representations, making them amenable to clustering algorithms. This step bridged the gap between raw textual data and the structured, numerical data that machine learning methods such as clustering require.

Results and Insights

With the data preprocessed, the next step was to uncover the hidden patterns. The Elbow Method guided the choice for the number of clusters in KMeans, a clustering algorithm that groups similar adverse events together. This segmentation was the key to understanding the larger picture that individual data points painted when viewed together.

Based on this chart, I decided to create 10 clusters. Because sentiment would be measured afterward, and sentiment scores fall on a spectrum, a larger number of clusters seemed reasonable.
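
To make the Elbow Method concrete, here is a minimal, self-contained sketch. It is not the project's code: the analysis used scikit-learn's KMeans on the sentence embeddings, whereas this version implements a basic Lloyd's algorithm in plain Python, uses a deterministic farthest-first seeding instead of scikit-learn's default initialization, and runs on made-up 2-D points. It only illustrates how the within-cluster sum of squares (inertia) is compared across values of k.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(group):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(group) for coords in zip(*group))


def farthest_first(points, k):
    """Deterministic seeding: start at points[0], then repeatedly add
    the point that is farthest from every center chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers


def kmeans_inertia(points, k, max_iter=100):
    """Run Lloyd's algorithm and return the within-cluster sum of squares."""
    centers = farthest_first(points, k)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its previous center).
        new_centers = [mean_point(g) if g else centers[j] for j, g in enumerate(groups)]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


# Toy data (made up for illustration): three tight groups of four points.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (0, 10), (0, 11), (1, 10), (1, 11),
]

inertias = {k: kmeans_inertia(points, k) for k in range(1, 5)}
for k, inertia in inertias.items():
    print(f"k={k}: inertia={inertia:.2f}")
# For this toy data the inertia falls sharply up to k = 3 and then
# flattens, which is the "elbow" one looks for in the chart.
```

On real embeddings the bend is rarely this clean, which is part of why choosing the number of clusters remains a judgment call.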

Below is the k-means clustered graph.

In our quest to make sense of the complex data on adverse drug reactions (ADRs), we turned to a technique known as t-SNE (t-distributed Stochastic Neighbor Embedding) for a clearer visual representation. This tool reduces the dimensionality of our data: imagine going from a multi-page spreadsheet to a single, insightful image. The colorful scatter plot you see is the result of the t-SNE algorithm transforming our detailed ADR data into a two-dimensional map. Each dot represents a unique report of a drug reaction, and the colors differentiate the various clusters, groups of reactions that the algorithm has identified as similar.

In this t-SNE map, the axes are labeled ‘Feature 1’ and ‘Feature 2’. Unlike typical graphs where axes might represent straightforward measures like time or height, here they are abstract dimensions that t-SNE derives from our data, with no direct real-world unit. They’re like coordinates on a map drawn by t-SNE, helping us navigate the terrain of our data.

The clusters reveal patterns in how people experience and report drug reactions. For instance, dots huddled closely together might indicate a common type of reaction shared by a group of drugs, while isolated dots could represent unique or rare reactions. Through this visualization, we can start to discern which drug reactions are frequently reported together and, perhaps more importantly, which ones stand out due to their uniqueness. It’s a stepping stone towards understanding the bigger picture of drug safety and patient experience, translating numbers and narratives into actionable insights.

TextBlob was used again to measure the average sentiment of each cluster: the polarity of each report's reaction text is computed, and the scores are then averaged per cluster.

from textblob import TextBlob

# Function to calculate sentiment polarity
def get_sentiment(text):
    return TextBlob(text).sentiment.polarity

# Applying the function to the reactions column
df['sentiment_score'] = df['combined_reactions'].apply(get_sentiment)

# Now you can calculate the average sentiment score for each cluster
grouped_df = df.groupby('cluster')
average_sentiment = grouped_df['sentiment_score'].mean()
print(average_sentiment)

And below is the output.

cluster
0   -0.031513
1   -0.010340
2   -0.048769
3   -0.332039
4    0.011517
5    0.003259
6   -0.032309
7   -0.002428
8   -0.000219
9   -0.022333

The average sentiment for most of the clusters is negative, though most values are so close to zero that some may call them negligible. Cluster 3 has by far the lowest score, which appears to indicate that this group contains the most severe reactions. Cluster 3 is followed by Cluster 2, Cluster 6, and Cluster 0. Cluster 9 comes in fifth, and the remaining clusters sit closer to 0, with Cluster 5 and Cluster 4 slightly above zero. Given this information, could we see whether certain drugs are the cause of such poor sentiment?

Given the information in our data frame, we can count which drug appears most often in each cluster. The “get_most_least_frequent_drugs” function finds the most frequent and the least frequent drug in a cluster. It returns the names of the drugs with the highest and lowest counts and is applied to the data frame for each cluster.
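
As a quick check on the ordering, the averages printed above can be ranked from most negative to least negative. This is a small sketch using only the values shown in the output; the dictionary simply restates them.

```python
# Average sentiment per cluster, copied from the output above.
average_sentiment = {
    0: -0.031513,
    1: -0.010340,
    2: -0.048769,
    3: -0.332039,
    4: 0.011517,
    5: 0.003259,
    6: -0.032309,
    7: -0.002428,
    8: -0.000219,
    9: -0.022333,
}

# Sort clusters from the most negative average to the most positive.
ranked = sorted(average_sentiment.items(), key=lambda item: item[1])
for rank, (cluster, score) in enumerate(ranked, start=1):
    print(f"{rank:>2}. cluster {cluster}: {score:+.6f}")

order = [cluster for cluster, _ in ranked]
```

With the pandas Series from the earlier snippet, calling average_sentiment.sort_values() would give the same ordering.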

def get_most_least_frequent_drugs(sub_df):
    drug_counts = sub_df['drug_name'].value_counts()
    most_frequent = drug_counts.idxmax()  # Drug with the highest count
    least_frequent = drug_counts.idxmin()  # Drug with the lowest count
    return pd.Series({'Most Frequent Drug': most_frequent, 'Least Frequent Drug': least_frequent})

# Applying the function to each cluster
cluster_drug_info = df.groupby('cluster').apply(get_most_least_frequent_drugs)

# Optionally, save this information to a CSV
cluster_drug_info.to_csv("cluster_drug_info.csv")

The output of this function is below.

           Most Frequent Drug Least Frequent Drug
cluster                                       
0                 morphine      spironolactone
1                 naproxen             heparin
2              hydrocodone           thyroxine
3            levothyroxine        lansoprazole
4                  heparin        atorvastatin
5            acetaminophen            apixaban
6               citalopram           thyroxine
7            levothyroxine          fluoxetine
8             atorvastatin        liothyronine
9             fexofenadine        lansoprazole

Morphine’s position as the most frequent drug in Cluster 0 could suggest a need for caution among prescribers and patients. The logic is straightforward: a drug that appears often in reports of adverse reactions might be considered riskier. This reading rests on the thousands of recorded reactions in our dataset, where each report ties a reaction to a drug and is quantified by a sentiment score, a numerical proxy for the reaction’s impact.

However, interpretation demands nuance. Despite morphine’s prevalence in Cluster 0, the varied average sentiment scores reveal a complex picture — one where the frequency of mentions doesn’t directly translate to a higher risk. The drug’s commonality in treatment regimens across the board could explain its numerical prominence in our findings.

Our investigation brings to light patterns that warrant further study, highlighting the importance of layered analysis in the evaluation of drug safety. As we continue to unravel the data, our insights form a guidepost for deeper inquiry, rather than a conclusive endpoint.

Limitations

Limitations are always present in data exploration, and this journey was no different. The analysis was constrained by the range of features considered, which limited the ways the data could be quantified. Assumptions had to be made, such as equating frequency and sentiment scores with the risk profile of drugs, an inference that, while logical, might not encompass the full spectrum of drug safety. Furthermore, the limited demographic information made it hard to paint a comprehensive picture of who is most affected by these reactions.

Conclusion

Our journey through the data of adverse drug reactions has illuminated the intricate patterns of medication impacts, yet it’s clear that each discovery is but a piece in a larger puzzle. Our analysis, while thorough, was bound by the scope of variables at hand and the depth of the available dataset. We navigated these confines by drawing on sentiment scores and drug frequency as proxies for risk profiles, an approach that, despite its logical basis, may not capture the entirety of drug safety nuances.

Moreover, the demographic details at our disposal were limited, which meant our portrait of those most affected by adverse reactions could only be sketched in broad strokes. Such constraints remind us that in data science, as in medicine, there’s always room for more understanding, more data, and more refinement.

As we conclude this phase of exploration, we recognize that our insights are stepping stones. They are invitations to further research, to peer deeper into the complexities of pharmacology, patient experiences, and the broader healthcare landscape. The patterns we’ve discerned offer a map for future inquiries, guiding us towards more informed decisions for those seeking medical treatment.

Source Code

For those interested in delving deeper into the analysis or exploring the code that made it all happen, I invite you to visit the GitHub:

JA-414/JA-assignment4(414) at main · jeadjani/JA-414
