Photo by Mimi Thian on Unsplash

Simple Web Scraping in Python

by Dwarkesh Natarajan

Opex Analytics
9 min read · Apr 17, 2020


As supply chains are disrupted all around the world, the fastest, richest data source for the people making key business decisions in the face of all that uncertainty is the internet.

For the last few weeks, I’ve been tasked with learning about web scraping to extract COVID-19-related data for our clients. To fill the void in my (non-existent) social life left by this terrible virus, I decided to compile what I’ve learned to share with others.

In this post, I’ll cover how you can leverage Python libraries like Beautiful Soup, Selenium, and pandas to get relevant information off the web and into your (figurative) hands.

Process Overview and HTML

The process for web scraping can be broadly categorized into three steps:

  1. Understand and inspect the web page to find the HTML markers associated with the information we want.
  2. Use Beautiful Soup, Selenium, and/or other Python libraries to scrape the HTML page.
  3. Manipulate the scraped data to get it in the form we need.

The first step involves investigating a web page’s HTML, the markup language used to define and structure the content of a page. HTML is made up of elements, each represented by tags (check out all the different types of tags and their attributes). Tags usually come in pairs; for example, a paragraph element on a web page will usually look something like this:

<p>This is a paragraph</p>

Using these tags, our main scraping libraries can target and parse information effectively throughout the scraping process. The following examples will help illustrate the above process in greater detail.

Photo by Florian Olivo on Unsplash

Example A

Our first example task is to get a table from Nepia.com’s shipping impact page, which describes port closures resulting from COVID-19.

Graphic Credit: Nepia.com

The packages that we need for this task are requests, beautifulsoup4, and pandas. Here’s a quick look at how to download this information:

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.nepia.com/industry-news/coronavirus-outbreak-impact-on-shipping/"
# makes a request to the web page and gets its HTML
r = requests.get(url)
# stores the HTML page in 'soup', a BeautifulSoup object
soup = BeautifulSoup(r.content, 'html.parser')

The above code stores the HTML content of our web page into a BeautifulSoup object. The HTML of the page can be printed out clearly by running the following line:

print(soup.prettify())

To check which tags contain our information, right-click on the element in your browser and select Inspect. In this case, the relevant information in the table is associated with the tag <td>. Use the find_all method to check it out:

print(soup.find_all('td'))

This method returns all matching information and outputs it as a list.
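
As a quick sanity check, you can peek at that list before building anything on top of it. A minimal sketch (the cell text you see depends on the page at the time you scrape it):

cells = soup.find_all('td')
print(len(cells))            # total number of table cells found
for cell in cells[:4]:       # the first four cells should form the first row
    print(cell.get_text())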

Now that we know how to find all such tags in a page, we’re nearly good to go. The following snippet will parse the HTML page to get the data we need in a pandas dataframe with four columns — Country, Informing Party, Date, and Summary of Advice:

a = []
df = pd.DataFrame(columns=['Country', 'Informing Party', 'Date', 'Summary of Advice'])

for link in soup.find_all('td'):
    a.append(link.get_text())
    if len(a) == 4:
        df_length = len(df)
        df.loc[df_length] = a   # append the completed row to the dataframe
        a = []

This code goes through each element in the list generated by find_all and gets the text associated with that element using the get_text method. The if statement makes sure that the data retrieved is in the form we need.

Unfortunately, even though Beautiful Soup can get us all the information we need in this case, it can’t automatically put it in the form we would like. Specifically, it can’t tell us where one row ends and the next one begins — the find_all method only returns a list of selected elements. Hence, we must use our understanding of the web page to manipulate the output.
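
If you prefer building the rows up front, an equivalent way to express the same row-splitting logic is to chunk the flat list of cell texts into groups of four. This sketch assumes the table really does have exactly four columns and no stray <td> tags:

cells = [td.get_text() for td in soup.find_all('td')]
rows = [cells[i:i + 4] for i in range(0, len(cells), 4)]   # chunk into rows of four
df = pd.DataFrame(rows, columns=['Country', 'Informing Party', 'Date', 'Summary of Advice'])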

(Note — the above task can be done using pd.read_html, a pandas function that pulls tables from web pages. However, not all data can be scraped with pandas, and it never hurts to learn the hard way!)
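
For reference, the pandas-only route looks roughly like this. read_html returns a list of every table it can parse on the page (and needs a parser such as lxml installed), so you may have to pick the right one by index:

tables = pd.read_html(url)   # one DataFrame per <table> found on the page
df_alt = tables[0]           # assuming the table we want is the first one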

Example B

In this example, let’s say that we want to get each country and its associated travel restrictions from this URL. The challenging part is that there’s no nice structure like the table in the previous example.

Graphic Credit: Aljazeera.com

A quick browser inspection tells us that the countries are tagged with <h3>, and their associated restrictions come under the tag <p>. Furthermore, one country can have multiple paragraphs, as in the case of Angola (seen above). Our goal is to get this information into a dataframe with two columns: Country and Restriction.

The first step is the same as the previous example: make a request to the URL and store the page’s HTML as a BeautifulSoup object.
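
In code, that first step looks just like before; I’m calling the new object soup2 to match the snippet below, and the URL is the travel-restrictions page linked above (not repeated here):

url2 = "<URL of the travel-restrictions page>"   # linked above
r2 = requests.get(url2)
soup2 = BeautifulSoup(r2.content, 'html.parser')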

Once you have this, I encourage you to explore the HTML a little bit. A good first step might be to get the text of all <h3> tags. After doing so, I was happy to see that only countries were present in the list — no extraneous information.
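
A quick way to do that check (the exact contents will depend on the page when you run it):

# pull the text of every <h3> heading to see what we're working with
countries_preview = [h3.get_text() for h3 in soup2.find_all('h3')]
print(countries_preview[:10])   # peek at the first ten entries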

With our countries found, the second step is to find the text for all <p> tags present in the HTML document. Unfortunately, not all <p>s contain information we care about. So we have to figure out a way to filter down to the paragraphs that come after the <h3> tags, and the following script does precisely that:

df = pd.DataFrame(columns=['Country', 'Restriction'])
country = []
restriction = []
flag = 0

for link in soup2.find_all('h3'):
    if flag == 0:
        country.append(link.get_text())
        next_sib = link.find_next_sibling(['p', 'h3'])
        print(next_sib.get_text())
        text = []
        # loop through all paragraphs that come after an h3 tag
        while next_sib.name != 'h3':
            text.append(next_sib.get_text())
            next_sib = next_sib.find_next_sibling(['p', 'h3'])
            if next_sib is None:
                flag = 1
                break
        restriction.append(' '.join(text))
    else:
        break

df['Country'] = country
df['Restriction'] = restriction

The above snippet uses a method called find_next_sibling, which checks for the list of tags provided in the parameter and returns the first match. In the above example, we first find all <h3> tags, and then we can obtain the next <p> or <h3> tag using find_next_sibling. If it’s a <p> tag, we store the text, but otherwise, we do nothing.

(Note — the find_next_sibling method returns None if it can’t find any of the tags given in the parameter.)

Example C

Our final example is about extracting the dates at which various governmental safety orders were imposed in each state from the IHME website.

Graphic Credit: covid19.healthdata.org

The website in question has a dropdown element in which the user can specify their state of interest. Crucially, this website is designed such that when the user changes this dropdown value, the URL and the underlying HTML of the page change — the information isn’t stored nicely all on one single page.

There are two typical ways to scrape a website with a dynamic layout like this:

  1. See how the URL changes when different states are selected, and then automate separate requests to each of those URLs to scrape them. (As you might imagine, this method is fairly involved.)
  2. Use another Python library called Selenium, which automates user interaction with the browser. With this method, we can programmatically input a state into the dropdown box and get the associated HTML page as a BeautifulSoup object.

Let’s give the second method a shot.

For Selenium to work, we need to have a browser (in this case, Google Chrome) and a browser driver installed on our machine. The driver allows our code to actually create and control a browser window (the driver required for your system can be downloaded here).

from selenium import webdriver
import time
drv_path = "/Users/dn/Desktop/chromedriver_win32/chromedriver.exe"
driver = webdriver.Chrome(drv_path) # full path of browser driver
url = 'https://covid19.healthdata.org/projections'
# opening the url
driver.get(url)
time.sleep(5)

The above snippet, the first section of our final script, simply opens the web page in question in a browser window. The time.sleep call gives the page time to load fully before any further action is taken; the browser always lags a little behind the code, and pausing helps avoid errors.
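
If you’d rather not rely on fixed sleeps everywhere, Selenium also supports explicit waits. Here’s a minimal sketch that waits for the dropdown we target below to appear, assuming the same XPath used later in this post:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 15 seconds for the dropdown to show up, then move on immediately
wait = WebDriverWait(driver, 15)
wait.until(EC.presence_of_element_located((By.XPATH, '//*[@id="rc_select_0"]')))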

In order to automate the input of different states in the dropdown, we must find the corresponding HTML element and send each state as an input. There are multiple ways to locate elements using Selenium, but for this example, I’ll focus on the find_element_by_xpath method.

As the method’s name might imply, you need to have the dropdown’s XPath, a type of XML/HTML element identifier. In order to find it, inspect the dropdown in the browser, then right-click on the element of interest and select copy XPath. Put it into your code and it should look something like this:

dropdown = driver.find_element_by_xpath('//*[@id="rc_select_0"]')

In order to send a state value to this dropdown, use send_keys:

from selenium.webdriver.common.keys import Keys   # needed for Keys.ENTER

# type California into the dropdown
dropdown.send_keys('California')
time.sleep(5)
# press Enter to select it
dropdown.send_keys(Keys.ENTER)

Now that we can select a state automatically, we need to identify the tags with the information we need. With some browser inspection, you’ll see that the dates on which the orders took effect come under <span> tags with the attribute class="ant-statistic-content-value", which in turn sit inside <div> tags with class="_30-NvbQKNSCT3T454z5_47".

Let’s put all this together in one script that gets the information into a dataframe with four columns: State, stay_at_home, educational_facilities_closed, and non_essential_services_closed.

from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import StaleElementReferenceException

a = []
df = pd.DataFrame(columns=['State', 'stay_at_home',
                           'educational_facilities_closed',
                           'non_essential_services_closed'])  # output dataframe
dd_xpath = '//*[@id="rc_select_0"]'

for state in states['State'].unique():
    # finding the dropdown
    dropdown = driver.find_element_by_xpath(dd_xpath)
    time.sleep(5)
    a.append(state)
    try:
        dropdown.send_keys(state)
        time.sleep(5)
    except StaleElementReferenceException:
        print('Trying to find element again')
        dropdown = driver.find_element_by_xpath(dd_xpath)
        time.sleep(5)
        dropdown.send_keys(state)
        time.sleep(5)
    try:
        dropdown.send_keys(Keys.ENTER)
        time.sleep(5)
    except StaleElementReferenceException:
        print('Trying to find element again')
        dropdown = driver.find_element_by_xpath(dd_xpath)
        time.sleep(5)
        dropdown.send_keys(Keys.ENTER)
        time.sleep(5)
    # storing the html page as a BeautifulSoup object
    soup_level1 = BeautifulSoup(driver.page_source, 'html.parser')
    print(state)
    div_class = '_30-NvbQKNSCT3T454z5_47'
    span_class = 'ant-statistic-content-value'
    for link in soup_level1.find_all('div', {'class': div_class}):
        a.append(link.find('span', {'class': span_class}).get_text())
        print(link.find('span', {'class': span_class}).get_text())
        if len(a) == 4:
            df_length = len(df)
            df.loc[df_length] = a   # appending the completed row to the dataframe
            a = []
            break

Notice here that I’ve done some error handling. When we don’t specify enough time between changes, a common error known as the StaleElementReferenceException can occur. This exception can usually be safely handled by selecting the dropdown again. (Check out a list of common Selenium exceptions here.)
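
If that try/except pattern starts repeating itself, one way to tidy it up is a small helper that re-locates the element and retries. This is just a sketch under the same assumptions as the script above, not part of it:

def send_with_retry(driver, xpath, keys, pause=5, retries=2):
    # re-locate the element and resend the keys if it goes stale
    for _ in range(retries):
        try:
            element = driver.find_element_by_xpath(xpath)
            element.send_keys(keys)
            time.sleep(pause)
            return
        except StaleElementReferenceException:
            time.sleep(pause)   # give the page a moment before re-locating
    raise StaleElementReferenceException('element kept going stale: ' + xpath)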

(N.B. In the above example, I used a states dataframe which I downloaded as a CSV from Kaggle to get the list of all the states that I loop through.)
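
Loading that dataframe is a one-liner; the filename here is hypothetical (whatever you saved the Kaggle download as), with a single column named State:

states = pd.read_csv('us_states.csv')    # hypothetical filename
print(states['State'].unique()[:5])      # sanity check the state names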

Photo by Pankaj Patel on Unsplash

Conclusion

If you’re new to web scraping, the above examples can serve as a starting point for your future scraping adventures. All web pages are different, so the above scripts will naturally have to be modified for other pages, but the overall process should be the same.

In addition, other powerful Python scraping libraries exist. For example, Scrapy is an incredibly powerful tool for large-scale scraping. For beginners, it’s best to start with the stuff discussed here and then build up to Scrapy later as needed.

Stay safe and happy scraping!

_________________________________________________________________

If you liked this blog post, check out more of our work, follow us on social media (Twitter, LinkedIn, and Facebook), or join us for our free monthly Academy webinars.
