Web-Scraping for Healthcare Data: How to Pull Data From the Internet?

Dr David Ryan

davidkdryan
The Startup
9 min read · Jan 18, 2021

Health data comes in all shapes and sizes. Web-scraping represents an efficient tool to obtain data from hard-to-reach places.

Image credit: https://unsplash.com/photos/iar-afB0QQw

Scraping the barrel for every byte of information

In today’s world, the ability to obtain large amounts of clean data efficiently is a key skill.

One of the most important methods for obtaining data at scale is web-scraping: the automated extraction of targeted information from a website. It is essentially a hide-and-seek exercise. We build a programme or “bot” that searches the HTML behind a web page and harvests the information in a format that we can manipulate into our dataset.

This has been applied in many fields. For example, we can develop a bot that pulls stock prices for a certain date, the average daily temperature, or the average number of commuters on the London Underground. Web-scraping unlocks huge potential for us to add more features to our data and rapidly build up richer datasets.

Simply put, web-scraping enables us to expand our datasets using search queries on the internet.

In medicine, there is huge potential to expand datasets in the same way. For example, we can obtain data relating to genetic variants online or enrich datasets with medication side-effects.

A non-technical introduction

As an example, this notebook demonstrates web-scraping with the Python Selenium package, using the following use-case:

Develop code that takes electronic prescribing data (from the US open-source National Health and Nutrition Examination Survey) and returns the drug details from the NHS website.

The code: Prescribed Medication

As always, we start with importing our libraries:

#import library
import pandas as pd
import numpy as np
#!pip install selenium
from selenium import webdriver

Next we import the data set (available at: https://www.kaggle.com/dkryan/webscraping/). This is a list of medication prescribed for each individual in the National Health and Nutrition Examination Survey. This is a publicly-available, cross-sectional health database, studying the health and nutritional status of a representative sample of the US population.

#import drug chart
df = pd.read_csv('drug_chart.csv')

Note the prescriptions are imported as a string, whereas it is easier to deal with a list of prescriptions for each individual patient.

#prescription list formation
def prescription_list(row):
    """Return a list of all the prescription medication an individual is prescribed."""
    # pd.isna is a more reliable missing-value check than `is np.nan`
    if pd.isna(row['Prescriptions']):
        return np.nan
    # str.split already returns a list, one entry per drug
    return row['Prescriptions'].split(", ")

df['prescription_list'] = df.apply(prescription_list, axis=1)
#display
df.head()
National Health and Nutritional Examination Survey prescribed medication lists
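
Splitting on ", " works when every entry is separated by exactly a comma and a space. A minimal sketch of a more forgiving parser, assuming the raw strings might carry stray spaces or empty entries; the name parse_prescriptions is hypothetical and not part of the original notebook:

```python
def parse_prescriptions(value):
    """Split a comma-separated prescription string into a clean list of drug names."""
    # Missing values arrive from pandas as float NaN, so any non-string yields an empty list
    if not isinstance(value, str):
        return []
    # Strip whitespace around each name and drop empty entries
    return [name.strip() for name in value.split(",") if name.strip()]


print(parse_prescriptions("AMLODIPINE, LOSARTAN,SIMVASTATIN "))
print(parse_prescriptions(float("nan")))
```

Returning an empty list rather than NaN is a design choice here: it lets later code loop over the result without a separate missing-value check.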

Selenium is a browser-automation library, available in Python, that works by opening an automated Google Chrome window. We can then programmatically control this window to do our searching for us. We instantiate it using:

from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service("/usr/local/bin/chromedriver"))

We can then tell the driver to load a specific page. Here it loads the NHS page on metformin (a common anti-diabetic medication).

driver.get("https://www.nhs.uk/medicines/metformin/")

If you now go to the Chrome window that was opened by Selenium, you can see that it has loaded the metformin page. Right-click the information on the page that you want to extract and choose Inspect, then right-click the highlighted HTML and choose Copy > Copy XPath. This copies the address of the relevant HTML element, meaning that you can then access that specific part of the website. This is the scraping in web-scraping.
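
XPath expressions can also be tried out without a browser. A minimal sketch using Python's built-in xml.etree.ElementTree, which supports a subset of XPath; the HTML fragment and its ids are made up for illustration and are not copied from the NHS site:

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed fragment that imitates a page with labelled sections
page = """
<html><body>
  <section id="about-example">
    <h2>About the example drug</h2>
    <div><p>First paragraph.</p><p>Second paragraph.</p></div>
  </section>
  <section id="key-facts"><div><p>Key facts.</p></div></section>
</body></html>
"""

root = ET.fromstring(page.strip())

# ".//*" searches every descendant; the predicate keeps the element with the given id
about = root.find(".//*[@id='about-example']/div")

# itertext() walks all nested text nodes, similar in spirit to Selenium's .text
text = " ".join(t.strip() for t in about.itertext() if t.strip())
print(text)
```

Real web pages are rarely well-formed XML, so this only illustrates how an XPath addresses an element; Selenium evaluates the same kind of expression inside the live browser.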

For example, I want to return all the drug information from the NHS website for metformin. This information is found under the XPath:

"""//*[@id="about-metformin"]/div"""

Therefore, we can return this using:

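#scrape
from selenium.webdriver.common.by import By
# find_element(By.XPATH, ...) replaces find_element_by_xpath, which was removed in Selenium 4.3
metformin = driver.find_element(By.XPATH, """//*[@id="about-metformin"]/div""")
metformin.text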
"Metformin is a medicine used to treat type 2 diabetes, and to help prevent type 2 diabetes if you're at high risk of developing it.\nMetformin is used when treating polycystic ovary syndrome (PCOS), although it's not officially approved for PCOS.\nType 2 diabetes is an illness where the body does not make enough insulin, or the insulin that it makes does not work properly. This can cause high blood sugar levels (hyperglycaemia).\nPCOS is a condition that affects how the ovaries work.\nMetformin lowers your blood sugar levels by improving the way your body handles insulin.\nIt's usually prescribed for diabetes when diet and exercise alone have not been enough to control your blood sugar levels.\nFor women with PCOS, metformin lowers insulin and blood sugar levels, and can also stimulate ovulation.\nMetformin is available on prescription as tablets and as a liquid that you drink."

We can then use regex and other manipulations (e.g. replace functions) to tidy up the returned text.

metformin.text.replace("\n", " ")
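
For messier text, a regular expression can collapse every run of whitespace in one pass. A minimal sketch, assuming the scraped text is an ordinary Python string; the helper name tidy and the sample string are made up for illustration:

```python
import re


def tidy(text):
    """Collapse newlines, tabs and repeated spaces into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


raw = "Metformin is a medicine.\n\nIt is available  on prescription.\n"
print(tidy(raw))
```

Unlike a single replace call, this also handles blank lines and doubled spaces, and it trims whitespace from both ends.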

Now we can use this understanding to pull multiple sections from the NHS website. This nhs_details function returns the drug details for all the prescribed medication in the dataset.

It is important to note that this approach assumes every drug page has the same structure. The XPath lookup raises an error if the HTML of a page differs from what we expect. To address this, I have included several fallback options for finding the drug information, using try and except clauses. These were discovered through trial and error and by learning the basic structure of the HTML behind the NHS website.

from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

def nhs_details(drug):
    """Return the 'about', 'key facts' and 'who can take it' text from the NHS page for a drug."""
    drug = drug.lower()
    try:
        driver.get(f"https://www.nhs.uk/medicines/{drug}/")
        section_1 = driver.find_element(By.XPATH, f"""//*[@id="about-{drug}"]/div""")
        section_1_text = section_1.text.replace("\n", " ")
        section_2 = driver.find_element(By.XPATH, """//*[@id="key-facts"]/div""")
        section_2_text = section_2.text.replace("\n", " ")
        try:
            section_3 = driver.find_element(By.XPATH, f"""//*[@id="who-can-and-cannot-take-{drug}"]/div""")
            section_3_text = section_3.text.replace("\n", " ")
        except NoSuchElementException:
            # some pages use "cant" rather than "cannot" in the section id
            section_3 = driver.find_element(By.XPATH, f"""//*[@id="who-can-and-cant-take-{drug}"]/div""")
            section_3_text = section_3.text.replace("\n", " ")

        return (section_1_text, section_2_text, section_3_text)

    except NoSuchElementException:
        # fall back to the "-for-adults" version of the page
        driver.get(f"https://www.nhs.uk/medicines/{drug}-for-adults/")
        section_1 = driver.find_element(By.XPATH, f"""//*[@id="about-{drug}-for-adults"]/div""")
        section_1_text = section_1.text.replace("\n", " ")
        section_2 = driver.find_element(By.XPATH, """//*[@id="key-facts"]/div""")
        section_2_text = section_2.text.replace("\n", " ")
        section_3 = driver.find_element(By.XPATH, f"""//*[@id="who-can-and-cannot-take-{drug}"]/div""")
        section_3_text = section_3.text.replace("\n", " ")

        return (section_1_text, section_2_text, section_3_text)

Some key parts of this code:

  • .lower() standardises the input to the webdriver
  • f-strings enable us to insert any drug we want into the URL
  • The XPath lookup returns the element of interest from the HTML as a Selenium WebElement object
  • Its .text attribute gives the visible text, which is then cleaned to remove newline characters
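
The nested try and except fallbacks in nhs_details could also be written as a loop over candidate XPaths. A minimal sketch of that idea, kept independent of Selenium so it runs anywhere; first_match, lookup and missing are hypothetical names, and a plain dictionary stands in for the web page:

```python
def first_match(lookup, candidates, missing=LookupError):
    """Return lookup(candidate) for the first candidate that does not raise `missing`."""
    candidates = list(candidates)
    last_error = None
    for candidate in candidates:
        try:
            return lookup(candidate)
        except missing as error:
            last_error = error  # remember the failure and try the next candidate
    raise LookupError(f"none of the candidates matched: {candidates}") from last_error


# A dictionary standing in for a page: only the "cant" variant of the id exists
fake_page = {'//*[@id="who-can-and-cant-take-metformin"]/div': "placeholder text"}

drug = "metformin"
xpaths = [
    f'//*[@id="who-can-and-cannot-take-{drug}"]/div',
    f'//*[@id="who-can-and-cant-take-{drug}"]/div',
]

result = first_match(fake_page.__getitem__, xpaths)
print(result)
```

With Selenium, lookup could be something like `lambda xp: driver.find_element(By.XPATH, xp)` with `missing=NoSuchElementException`, so that adding a new page variant only means adding one more XPath to the list.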
nhs_details('SITAGLIPTIN')
('Sitagliptin is a medicine used to treat type 2 diabetes. Type 2 diabetes is an illness where the body does not make enough insulin, or the insulin that it makes does not work properly. This can cause high blood sugar levels (hyperglycaemia). Sitagliptin is prescribed for people who still have high blood sugar, even though they have a sensible diet and exercise regularly. Sitagliptin is only available on prescription. It comes as tablets that you swallow. It also comes as tablets containing a mixture of sitagliptin and metformin. Metformin is another drug used to treat diabetes.',
"Sitagliptin works by increasing the amount of insulin that your body makes. Insulin is the hormone that controls sugar levels in your blood. You take sitagliptin once a day. The most common side effect of sitagliptin is headaches. This medicine does not usually make you put on weight. Sitagliptin is also called by the brand name Januvia. When combined with metformin it's called Janumet.",
"Sitagliptin can be taken by adults (aged 18 years and older). Sitagliptin is not suitable for some people. To make sure it's safe for you, tell your doctor if you: have had an allergic reaction to sitagliptin or any other medicines in the past have problems with your pancreas have gallstones or very high levels of triglycerides (a type of fat) in your blood are a heavy drinker or dependent on alcohol have (or have previously had) any problems with your kidneys are pregnant or breastfeeding, or trying to get pregnant This medicine is not used to treat type 1 diabetes (when your body does not produce insulin).")

Now let’s build a function that returns the NHS website advice for all the drugs a patient is prescribed in the anonymised NHANES extract.

#build a function that returns information for all medication prescribed
def drug_information(patient_number):
    """webscrapes NHS website and prints drug information"""
    drugs = df.loc[patient_number]['prescription_list']
    print(drugs)

    for drug in drugs:
        print('\nPrescription medication:', drug)
        print('\nAccessing NHS drug information')

        try:
            print(nhs_details(drug))

        except Exception:
            # any failure is treated as the drug having no NHS page
            print('No NHS details available')

drug_information(0)
['AMLODIPINE', 'LOSARTAN', 'SIMVASTATIN']

Prescription medication: AMLODIPINE

Accessing NHS drug information
('Amlodipine is a medicine used to treat high blood pressure (hypertension). If you have high blood pressure, taking amlodipine helps prevent future heart disease, heart attacks and strokes. Amlodipine is also used to prevent chest pain caused by heart disease (angina). This medicine is only available on prescription. It comes as tablets or as a liquid to swallow.', "Amlodipine lowers your blood pressure and makes it easier for your heart to pump blood around your body. It's usual to take amlodipine once a day. You can take it at any time of day, but try to make sure it's around the same time each day. The most common side effects include headache, flushing, feeling tired and swollen ankles. These usually improve after a few days. Amlodipine can be called amlodipine besilate, amlodipine maleate or amlodipine mesilate. This is because the medicine contains another chemical to make it easier for your body to take up and use it. It doesn't matter what your amlodipine is called. They all work as well as each other. Amlodipine is also called by the brand names Istin and Amlostin.", "Amlodipine can be taken by adults and children aged 6 years and over. Amlodipine is not suitable for some people. To make sure amlodipine is safe for you, tell your doctor if you: have had an allergic reaction to amlodipine or any other medicines in the past are trying to get pregnant, are already pregnant or you're breastfeeding have liver or kidney disease have heart failure or you have recently had a heart attack")

Prescription medication: LOSARTAN

Accessing NHS drug information
("Losartan is a medicine widely used to treat high blood pressure and heart failure, and to protect your kidneys if you have both kidney disease and diabetes. Losartan helps to prevent future strokes, heart attacks and kidney problems. It also improves your survival if you're taking it for heart failure or after a heart attack. This medicine is only available on prescription. It comes as tablets.", "Losartan lowers your blood pressure and makes it easier for your heart to pump blood around your body. It's often used as a second-choice treatment if you had to stop taking another blood pressure-lowering medicine because it gave you a dry, irritating cough. If you have diarrhoea and vomiting from a stomach bug or illness while taking losartan, tell your doctor. You may need to stop taking it until you feel better. The main side effects of losartan are dizziness and fatigue, but they're usually mild and shortlived. Losartan is not normally recommended in pregnancy or while breastfeeding. Talk to your doctor if you're trying to get pregnant, you're already pregnant or you're breastfeeding. Losartan is also called by the brand name Cozaar.", "Losartan can be taken by adults aged 18 years and over. Children aged 6 years and older can take it, but only to treat high blood pressure. Your doctor may prescribe losartan if you've tried taking similar blood pressure-lowering medicines such as ramipril and lisinopril in the past, but had to stop taking them because of side effects such as a dry, irritating cough. Losartan isn't suitable for some people. To make sure losartan is safe for you, tell your doctor if you: have had an allergic reaction to losartan or other medicines in the past have diabetes have heart, liver or kidney problems recently had a kidney transplant have had diarrhoea or vomiting have been on a low salt diet have low blood pressure are trying to get pregnant, are already pregnant or you are breastfeeding")

Prescription medication: SIMVASTATIN

Accessing NHS drug information
("Simvastatin belongs to a group of medicines called statins. It's used to lower cholesterol if you've been diagnosed with high blood cholesterol. It's also taken to prevent heart disease, including heart attacks and strokes. Your doctor may prescribe simvastatin if you have a family history of heart disease, or a long-term health condition such as rheumatoid arthritis, or type 1 or type 2 diabetes. The medicine is available on prescription as tablets. You can also buy a low-strength 10mg tablet from a pharmacy.", "Simvastatin seems to be a very safe medicine. It's unusual to have any side effects. Keep taking simvastatin even if you feel well, as you will still be getting the benefits. Most people with high cholesterol don't have any symptoms. Do not take simvastatin if you're pregnant, trying to get pregnant or breastfeeding. Do not drink grapefruit juice while you're taking simvastatin. It doesn't mix well with this medicine. Simvastatin is also called Zocor and Simvador.", "Simvastatin can be taken by adults and children over the age of 10 years. Simvastatin isn't suitable for some people. Tell your doctor if you: have had an allergic reaction to simvastatin or any other medicines in the past have liver or kidney problems are trying to get pregnant, think you might be pregnant, you're already pregnant, or you're breastfeeding have severe lung disease regularly drink large amounts of alcohol have an underactive thyroid have, or have had, a muscle disorder (including fibromyalgia)")
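
Many patients share the same drugs, so when scraping a whole dataset it may be worth fetching each page only once and pausing between requests. A minimal sketch of that idea; collect_details, fetch and delay are hypothetical names, and a fake fetch function stands in for nhs_details so the example runs without a browser:

```python
import time


def collect_details(drugs, fetch, delay=1.0):
    """Fetch details once per unique drug name and return a {drug: details} dict."""
    results = {}
    for drug in drugs:
        key = drug.lower()
        if key in results:
            continue  # already scraped, so do not request the page again
        try:
            results[key] = fetch(key)
        except Exception:
            results[key] = None  # no page found for this drug
        time.sleep(delay)  # pause between requests to avoid hammering the site
    return results


calls = []


def fake_fetch(name):
    """Stand-in for nhs_details: records each call and fails for one name."""
    calls.append(name)
    if name == "unknown":
        raise ValueError("no page")
    return (f"about {name}",)


details = collect_details(["AMLODIPINE", "Amlodipine", "UNKNOWN"], fake_fetch, delay=0)
print(details)
```

In practice fetch would be nhs_details, and the resulting dictionary could be mapped back onto the prescription lists instead of printing each result.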

So that’s all folks! This post described how to build a basic web-scraping tool to download drug data from the NHS website. I’ve used this template in other situations as well, including finding the chromosome location for 49K genetic variants (single nucleotide polymorphisms). Web-scraping is a pretty nice tool to have available when data is tricky to obtain in clean and user-friendly formats.

Read the docs on selenium: https://selenium-python.readthedocs.io/

Kaggle data: https://www.kaggle.com/dkryan/webscraping

Irish doctor working in London. Biomedical informatics. Big Data. AI. Clinical Pharmacology and Therapeutics. Cork. Edinburgh. London.