Web-scraping with Selenium

Brian Yee
5 min read · Aug 18, 2019


Delicious, delicious data. Images from Alpes London.

Web-scraping is a technique programmers use to parse webpages and automatically retrieve data. Gathering data is one of the first steps when data scientists start a new problem, so web-scraping is a valuable skill to have: the internet is a vast resource of data waiting to be used.

Websites are all written using HTML and CSS, and like other coding languages, they follow patterns and a specific syntax that let us easily locate the information we need. Using Python libraries like requests and BeautifulSoup, we can pull a website’s HTML into our program and extract and clean the information we need. However, most sites nowadays include JavaScript that only runs in the client’s browser, so simply fetching the raw HTML will not give us access to all the data available on the webpage.
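
As a rough illustration of that static approach, here is a minimal requests/BeautifulSoup sketch (the URL is just a placeholder):

import requests
from bs4 import BeautifulSoup

# Fetch the raw HTML that the server returns; no JavaScript runs here
response = requests.get('https://example.com')
soup = BeautifulSoup(response.text, 'html.parser')

# Anything rendered later by client-side JavaScript will be missing from soup
print(soup.title.text)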

Selenium Webdriver

Mercury was the primary browser automator before Selenium.

Selenium is a tool that lets us automate browsers with code by mimicking keystrokes and button presses. For web-scraping in particular, this gives us a way to load and interact with the JavaScript elements in websites.

Installing Selenium

Like most Python libraries, Selenium can be installed with

pip install selenium

You will also need to install the required driver and put it into your PATH (/usr/local/bin or /usr/bin) so that Python can interact with your browser of choice. Here are the drivers for Google Chrome and Mozilla Firefox.
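
If you would rather not touch your PATH, the Selenium 3.x API used in this article also accepts an explicit path to the driver executable; the location below is only an example, so point it at wherever you saved the driver:

from selenium import webdriver

# Hypothetical location; adjust to wherever you saved chromedriver
driver = webdriver.Chrome(executable_path='/usr/local/bin/chromedriver')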

Pre-Scraping Checklist

Before writing your web-scraper, it is important to check whether the site you want to scrape has a good public API that can get you the data you need. If it doesn’t, the next step is to check the site’s rules about web-scraping. Check the site’s Terms of Service and also its robots.txt file (if the site has one, it can be reached by typing site_url.com/robots.txt). The robots.txt file for Medium looks like this:

Medium’s robots.txt file

The robots.txt file will define restrictions for webcrawlers that visit. For more information about the syntax of robots.txt files, you can check out this site.
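
Python’s standard library can also read a robots.txt file for you. A minimal sketch using urllib.robotparser (the URLs here are just examples):

from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url('https://medium.com/robots.txt')
parser.read()

# True if the given user agent is allowed to fetch the URL
print(parser.can_fetch('*', 'https://medium.com/search'))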

Web-scraping Etiquette

While web-scraping, there are some guidelines that should be followed regardless of what site you’re scraping.

  • Don’t scrape anything that breaks Terms of Service or copyrights.
  • Limit the requests/second of your program.
    Don’t overload the site’s servers with your requests. Include waits between requests or process the data between requests to space them out.
  • Look only at what you need.
    Avoid scraping everything from a site and search specifically for what you need. Some sites include “honeypots,” links that are invisible to normal visitors and only appear in the HTML code. These can either trap a crawler in an infinite loop or alert the admins and flag your IP address.

Using Selenium

The first thing to do is create a new webdriver that will execute your code. This is done by running the following:

from selenium import webdriver

driver = webdriver.Chrome()

Wait a second..

It is important to put waits into your web-scraper for two reasons:

  1. Elements load sequentially, so the element you are looking for might not be loaded yet when your code searches for it.
  2. Too many requests/second will overload the server.

To handle the first case, we can use Selenium’s built-in implicit and explicit waits. An implicit wait tells the webdriver how many seconds it should keep trying to find any element (the default is 0 seconds), while an explicit wait defines how long the webdriver should keep trying to find one specific element. Implicit waits are set on the driver object, while explicit waits are written into individual element searches. I will use an implicit wait here because it is much easier to read.

driver.implicitly_wait(2) # Wait 2 seconds for elements to load
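
For comparison, an explicit wait targets a single element and gives up after a timeout. A short sketch using WebDriverWait (the element name ‘q’ is just the Google search bar used later in this post):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for an element named 'q' to appear, then return it
search_bar = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.NAME, 'q'))
)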

To handle the second case, we can either use time.sleep() from Python’s time library before every request, or simply process our data in between requests so the processing itself acts as a buffer. Since we are not going to be scraping multiple pages in this example, this won’t be an issue.
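
If you were scraping several pages, a simple pause between requests is usually enough. A rough sketch (the URL list is hypothetical):

import time

urls = ['https://example.com/page1', 'https://example.com/page2']

for url in urls:
    driver.get(url)
    # ... scrape and process the page here ...
    time.sleep(2)  # pause so we don't hammer the server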

Requesting Pages

Unlike BeautifulSoup, which needs help from the requests library to fetch websites, Selenium has its own built-in method for requesting pages, simply called .get(). Let’s make our webdriver go to Google.

driver.get('https://google.com')

Interacting with Elements

Now we need a way for our driver to find elements to interact with. Selenium webdrivers have the following methods to find elements:

# These return the first instance
driver.find_element_by_id()
driver.find_element_by_name()
driver.find_element_by_xpath()
driver.find_element_by_link_text()
driver.find_element_by_partial_link_text()
driver.find_element_by_tag_name()
driver.find_element_by_class_name()
driver.find_element_by_css_selector()
# These return a list of all instances
driver.find_elements_by_name()
driver.find_elements_by_xpath()
driver.find_elements_by_link_text()
driver.find_elements_by_partial_link_text()
driver.find_elements_by_tag_name()
driver.find_elements_by_class_name()
driver.find_elements_by_css_selector()

By using our browser’s inspect tool, we can get the appropriate arguments to pass into our find methods. From inspecting the search bar, we can see that the name of the element is q.

search_bar = driver.find_element_by_name('q')

Selenium also allows us to input keystrokes and mouse clicks to elements.

search_bar.send_keys('Flatiron School')

An advantage Selenium has over BeautifulSoup is the ability to interact with the webpage in real time. After entering ‘Flatiron School’ into the search bar, the search button will have moved to the bottom of the suggestions. The HTML for this new search button will not be in the initial HTML code that requests.get() returns, so BeautifulSoup would have no way of finding this element.

driver.find_element_by_class_name('gNO89b').click()

The browser should now have navigated to the search results for ‘Flatiron School,’ and we can start scraping the search result links.

all_results = driver.find_elements_by_class_name('r')
first_link = all_results[0].find_element_by_tag_name('a')
print(first_link.get_attribute('text'))
print(first_link.get_attribute('href'))
first_link.click()
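
Note that clicking a link navigates away from the results page, which makes the previously found elements stale. If you wanted to keep every result, one option is to collect the text and href attributes first (this assumes the same ‘r’ class structure as above):

links = []
for result in driver.find_elements_by_class_name('r'):
    anchor = result.find_element_by_tag_name('a')
    links.append((anchor.get_attribute('text'), anchor.get_attribute('href')))

print(links)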

A gist of this example can be found on my GitHub here.

Pitfalls

Selenium was not built with web-scraping in mind. BeautifulSoup is very lightweight since it only parses the HTML that the website’s server returns, while Selenium spins up a fully-functioning web browser, which is very resource-intensive for most scraping jobs. There are other packages for Python, like Scrapy and Splash, that are lighter-weight tools built for web-scraping. This post was meant to introduce the basics of Selenium and the basics of web-scraping together.
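
One way to trim Selenium’s overhead is to run the browser headless, so no window is drawn. A sketch using Chrome options (recent Selenium 3.x releases accept the options keyword; older ones call it chrome_options):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')  # run Chrome without opening a window

driver = webdriver.Chrome(options=options)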
