Medication Ratings and Their Implications: An Exploration Through Web Scraping

David Chun
INST414: Data Science Techniques
3 min read · Sep 17, 2023

Every medication has a purpose, but can we predict its addictive nature from user ratings and its prescribed uses? It’s common knowledge that controlled substances like Adderall can be addictive, which may influence their user ratings. Similarly, over-the-counter medications like melatonin, while not inherently addictive, can lead to dependence. My objective was to find out whether user ratings on drugs.com hint at a drug’s potential for abuse or dependence. Such insights could be valuable for agencies like the Drug Enforcement Administration when identifying which medications warrant closer scrutiny.

Web Scraping Methodology:

I started developing the web scraper as a personal project back in January, and it turned out to be a good fit for this analysis. The project was split into three Python files: drug_scraper.py, organize_drugs.py, and standardize.py. The core of the project resided in organize_drugs.py, which collected and structured the scraped information into a JSON file. drug_scraper.py laid the groundwork, sourcing a comprehensive list of drug URLs from https://www.drugs.com/alpha.
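To illustrate the URL-collection step, here is a minimal sketch using only the standard library. It is not the code from drug_scraper.py: the LinkCollector class, the collect_links function, and the rule that drug pages end in ".html" are all assumptions made for the example, and the HTML it is tested on is made up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Assumption: drug pages are the links ending in ".html".
        if href and href.endswith(".html"):
            self.links.append(urljoin(self.base_url, href))


def collect_links(html, base_url):
    """Return the drug-page URLs found in an index page, without duplicates."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    # dict.fromkeys drops duplicates while keeping the original order.
    return list(dict.fromkeys(parser.links))
```

The page text would come from whatever HTTP client the scraper uses; keeping the parsing separate from the download makes it easy to test against saved pages.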

Navigating through these webpages presented its own set of challenges. Although a majority shared a similar layout, nuances in their structure meant that uniform scraping methods often fell short. For instance, while many pages featured a “treatments” header detailing common drug applications, others lacked it altogether or presented it in a different format. Adapting to these variations was a monumental task, and I worked through a slew of errors before arriving at a consistent scraping method.

Another challenge I faced was scraping the data without overburdening the site. Continual rapid requests risked an IP ban, which would have halted my project. To counter this, I introduced a random 1 to 2-second delay between requests, mimicking organic traffic. This tactic had a drawback, though. Any bug I encountered meant restarting the scrape, which, coupled with the intentional delays, made troubleshooting a slow process.
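The delay logic can be sketched in a few lines. This is an illustration rather than the scraper's actual code: fetch_politely and its parameters are hypothetical names, and fetch stands in for whatever function downloads a page.

```python
import random
import time


def fetch_politely(urls, fetch, min_delay=1.0, max_delay=2.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random interval between requests."""
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            # Wait 1 to 2 seconds (by default) before every request after the first.
            sleep(random.uniform(min_delay, max_delay))
        results[url] = fetch(url)
    return results
```

Passing sleep as a parameter is a design choice for the example: a test can swap in a stub so the delays are recorded instead of actually waited out.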

Addressing Data Inconsistencies:

Another hurdle was the varied naming conventions for conditions. ADHD, for instance, appeared under various aliases. Lacking expertise in natural language processing, I initially tried a sequence matcher algorithm to identify and consolidate similar terms. However, it proved difficult to find a similarity threshold that merged true aliases without also merging distinct conditions and losing valuable data. The final solution, albeit tedious, was a hand-built dictionary mapping each condition to its potential aliases. This ensured the JSON file wasn’t inundated with redundant entries.
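A small sketch shows why the dictionary worked where string similarity struggled. The alias table below is a made-up stand-in for the real one, the function names are hypothetical, and difflib.SequenceMatcher is assumed to be the kind of sequence matcher I tried.

```python
from difflib import SequenceMatcher

# Hypothetical alias table; the real dictionary was built by hand
# from the condition names that appeared in the scraped data.
ALIASES = {
    "ADHD": [
        "adhd",
        "attention deficit hyperactivity disorder",
        "attention-deficit/hyperactivity disorder",
    ],
}

# Invert the table once so every alias points at its canonical name.
LOOKUP = {
    alias: canonical
    for canonical, aliases in ALIASES.items()
    for alias in aliases
}


def normalize_condition(name):
    """Map a scraped condition name to its canonical form, if one is known."""
    key = " ".join(name.lower().split())  # lowercase and collapse whitespace
    return LOOKUP.get(key, name.strip())


def similarity(a, b):
    """Similarity ratio between 0 and 1, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()
```

An abbreviation and its spelled-out form share very few characters, so similarity("ADHD", "Attention Deficit Hyperactivity Disorder") comes out below 0.2, far too low for any sensible threshold. The lookup table handles that case directly, and unknown names pass through unchanged.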

The final JSON structure included the drug’s name, class, uses, review count, rating, status, and schedule. Admittedly, I had planned to dive into data visualization right after scraping, but academic commitments, coupled with the need to learn data visualization and analysis in Python, kept me from exploring the data immediately.
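For reference, one entry might look roughly like the record below. The key names and every value are placeholders chosen for illustration; they are not copied from the actual output file.

```python
import json

# Placeholder record: field names follow the list above, values are invented.
record = {
    "name": "ExampleDrug",
    "class": "example drug class",
    "uses": ["ADHD"],
    "review_count": 0,
    "rating": None,  # no rating yet for this placeholder entry
    "status": "example status",
    "schedule": "Not a controlled drug",
}

# Serialize the record the way it would appear in a JSON file.
serialized = json.dumps(record, indent=2)
```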

Data Insights:

A primary focus was understanding the correlation between drug schedules and their user ratings. Drug schedules in the data ranged from “Not a controlled drug” to “Schedule 4”, with Schedule 1 drugs excluded because they are prohibited for medical use. For context, among the controlled drugs covered here, Schedule 2 drugs have the highest potential for abuse, while Schedule 4 drugs have the lowest.

Initial visualizations, particularly boxplots, indicated scheduled drugs generally garnered higher ratings. To solidify these findings, I employed the Tukey HSD test, revealing statistically significant differences in ratings between Schedule 2 drugs and non-controlled substances. Interestingly, other categories didn’t exhibit similar stark contrasts.

Visually, we can see that Schedule 2 drugs have high ratings.
Tukey’s test shows that non-controlled drugs and Schedule 2 drugs have a statistically significant difference in means.
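
To make the comparison concrete, here is a minimal sketch of the statistic behind Tukey’s HSD, in the Tukey-Kramer form that allows unequal group sizes. It is not the code used for the analysis: the function name is hypothetical and the ratings in the usage example are invented. It only computes the q statistic for each pair of groups. Turning q into a p-value requires the studentized range distribution, which statistics libraries (for example, scipy.stats.tukey_hsd in SciPy) handle for you.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def tukey_kramer_q(groups):
    """Return {(a, b): q} for every pair of groups.

    groups maps a group name to a list of ratings.
    """
    k = len(groups)
    n_total = sum(len(values) for values in groups.values())
    means = {name: mean(values) for name, values in groups.items()}

    # Pooled within-group variance (mean square within), with N - k degrees of freedom.
    ss_within = sum(
        (x - means[name]) ** 2
        for name, values in groups.items()
        for x in values
    )
    ms_within = ss_within / (n_total - k)

    q_stats = {}
    for a, b in combinations(groups, 2):
        # Standard error of the difference in means, Tukey-Kramer form.
        se = sqrt(ms_within / 2 * (1 / len(groups[a]) + 1 / len(groups[b])))
        q_stats[(a, b)] = abs(means[a] - means[b]) / se
    return q_stats


# Invented ratings, purely to show the call.
example = {
    "not_controlled": [6, 7, 6, 7],
    "schedule_2": [8, 9, 8, 9],
    "schedule_4": [7, 7, 7, 7],
}
q_values = tukey_kramer_q(example)
```

A pair is declared significantly different when its q exceeds the critical value of the studentized range distribution for k groups and N - k degrees of freedom.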

Limitations:

The analysis, while insightful, has its limitations. High ratings might not solely indicate a drug’s addictive nature but could reflect its efficacy and minimal side effects. A deeper dive would benefit from incorporating side effect data. However, the unstructured presentation of side effects on drugs.com, often in paragraph form, posed challenges. Also, the skewed distribution of reviews, with a vast majority of drugs receiving fewer than 100 reviews, suggests a potential bias towards popular drugs.

Repository for the web scraper:

The first link leads to the final scraper that outputs data in JSON, used for the analysis. The second link is the initial scraper version that saves data in CSV format.

https://github.com/dvc0310/drug_analysis

https://github.com/dvc0310/Medication-Webscraper
