Web-scraping is a technique that programmers use to parse through webpages and automatically retrieve data. Data gathering is an important first step whenever data scientists tackle a new problem, so web-scraping is a very valuable skill to have: the internet is a vast resource of data waiting to be used.
Websites are all written in HTML and CSS, and as in other coding languages, there are patterns and specific syntax that let us easily locate the information we need. Using Python libraries like requests and BeautifulSoup, we can pass the HTML of websites into our program to extract and clean the information we need. However, most sites nowadays include JavaScript elements that are only executed by the client, so simply passing the HTML code into our program will not give us access to all the data available on the webpage.
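As a small illustration of how HTML's regular structure makes data easy to locate, even the standard library's html.parser can pull links out of a page. The HTML string below is made up for the example; a real scraper would fetch a live page with requests and typically use BeautifulSoup instead:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href of every anchor tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

# A made-up HTML snippet standing in for a downloaded page
html = '<html><body><a href="/about">About</a><a href="/blog">Blog</a></body></html>'
collector = LinkCollector()
collector.feed(html)
print(collector.links)  # ['/about', '/blog']
```

BeautifulSoup wraps this kind of parsing in a much friendlier API, but the underlying idea is the same: walk the tag structure and pick out the pieces you care about.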
Selenium Webdriver
Selenium is a tool that allows us to automate browsers using code by mimicking keystrokes and button presses. For web-scraping specifically, this gives us a way to load and interact with JavaScript elements on websites.
Installing Selenium
Just like with most libraries in Python, Selenium can be installed with
pip install selenium
You will also need to install the required driver and put it into your PATH (/usr/local/bin
or /usr/bin
) to allow Python to interact with your browser of choice. Here are the drivers for Google Chrome (ChromeDriver) and Mozilla Firefox (geckodriver).
Pre-Scraping Checklist
Before writing your web-scraper, it is important to check whether the site you are thinking of scraping has a good public API that can get you the data you need. If it doesn't, the next step is to check the site's rules about web-scraping. Check the site's Terms of Service and also the site's robots.txt
file (if the site has a robots.txt
, it can be reached by typing site_url.com/robots.txt
). The robots.txt
file for Medium looks like this:
The robots.txt
file defines restrictions for web crawlers that visit the site. For more information about the syntax of robots.txt
files, you can check out this site.
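Python's standard library can also read these rules programmatically. As a sketch, urllib.robotparser is used below to decide whether a path may be crawled; the robots.txt content here is made up for the example rather than fetched from a real site:

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt standing in for one fetched from site_url.com/robots.txt
rules = """
User-agent: *
Disallow: /private/
Allow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# Check specific URLs against the rules for a given user agent
print(parser.can_fetch('*', 'https://example.com/blog'))       # True
print(parser.can_fetch('*', 'https://example.com/private/x'))  # False
```

Checking can_fetch() before each request is an easy way to make a crawler respect a site's stated restrictions automatically.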
Web-scraping Etiquette
While web-scraping, there are some guidelines that should be followed regardless of what site you’re scraping.
- Don’t scrape anything that breaks Terms of Service or copyrights.
- Limit the requests/second of your program. Don’t overload the site’s servers with your requests. Include waits between requests or process the data between requests to space them out.
- Look only at what you need. Avoid scraping everything from a site and search specifically for what you need. Some sites include “honeypots,” which are links that can only be found through the HTML code. These can either trap the crawler in an infinite loop or alert the admins and flag your IP address.
Using Selenium
The first thing to do is create a new webdriver that will execute your code. This is done by executing the following:
from selenium import webdriver

driver = webdriver.Chrome()
Wait a second..
It is important to put waits into your web-scraper for two reasons:
- Elements load in sequentially, so the element you are looking for might not load immediately.
- Too many requests/second will overload the server.
To handle the first case, we can use Selenium’s built-in implicit and explicit waits. An implicit wait tells the webdriver how many seconds it should keep trying to find elements (the default is 0 seconds), while an explicit wait defines how long the webdriver should wait for one specific element. Implicit waits are set on the driver object, while explicit waits are made during element searches. I will use implicit waits here because they are much easier to read.
driver.implicitly_wait(2) # Wait 2 seconds for elements to load
To handle the second case, we can either use Python’s time
library and put time.sleep()
before every request, or just process our data in between requests to act as a buffer. Since we are not going to be scraping multiple pages in this example, this won’t be an issue.
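For scrapers that do visit many pages, the delay can be wrapped in a small helper. The sketch below uses a stand-in driver object (named fake_driver to make clear it is not a real browser) so the pattern can be shown without launching Chrome; with a real Selenium driver, polite_get would be called the same way:

```python
import time

REQUEST_DELAY = 1.0  # seconds between requests; tune this for the target site

def polite_get(driver, url, delay=REQUEST_DELAY):
    """Pause before each request so the server is never hammered."""
    time.sleep(delay)
    driver.get(url)

# A stand-in for a Selenium webdriver, used here only for illustration
class FakeDriver:
    def __init__(self):
        self.visited = []
    def get(self, url):
        self.visited.append(url)

fake_driver = FakeDriver()
start = time.monotonic()
for url in ['https://example.com/a', 'https://example.com/b']:
    polite_get(fake_driver, url, delay=0.05)  # short delay for the demo
elapsed = time.monotonic() - start
print(fake_driver.visited)
```

Since time.sleep() guarantees at least the requested pause, the loop above takes at least 0.1 seconds for two requests, spacing them out as the etiquette section recommends.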
Requesting Pages
Unlike BeautifulSoup
, which needs help from the requests
library to make requests to websites, Selenium
has its own built-in function for fetching pages, simply called .get()
. Let’s make our webdriver go to Google.
driver.get('https://google.com')
Interacting with Elements
Now we need a way for our driver to find elements to interact with. Selenium webdrivers have the following methods to find elements:
# These return the first instance
driver.find_element_by_id()
driver.find_element_by_name()
driver.find_element_by_xpath()
driver.find_element_by_link_text()
driver.find_element_by_partial_link_text()
driver.find_element_by_tag_name()
driver.find_element_by_class_name()
driver.find_element_by_css_selector()

# These return a list of all instances
driver.find_elements_by_name()
driver.find_elements_by_xpath()
driver.find_elements_by_link_text()
driver.find_elements_by_partial_link_text()
driver.find_elements_by_tag_name()
driver.find_elements_by_class_name()
driver.find_elements_by_css_selector()
By using our browser’s inspect tool, we can get the appropriate arguments to pass into our find methods. Inspecting the search bar shows that the name of the element is q
.
search_bar = driver.find_element_by_name('q')
Selenium also allows us to input keystrokes and mouse clicks to elements.
search_bar.send_keys('Flatiron School')
An advantage Selenium has over BeautifulSoup
is the ability to interact with the webpage in real time. After entering ‘Flatiron School’ into the search bar, the search button will have moved to the bottom of the suggestions. The HTML for this new search button is not in the initial HTML code that requests.get()
returns, so BeautifulSoup
has no way of finding this element.
driver.find_element_by_class_name('gNO89b').click()
The browser should now have navigated to the search results for ‘Flatiron School,’ and we can start scraping the links of all the search results.
all_results = driver.find_elements_by_class_name('r')
first_link = all_results[0].find_element_by_tag_name('a')
print(first_link.get_attribute('text'))
print(first_link.get_attribute('href'))
first_link.click()
A gist of this example can be found on my github here.
Pitfalls
Selenium was not built with web-scraping in mind. BeautifulSoup
is very lightweight since it only works with the HTML requested from the website’s server. Selenium, by contrast, creates an instance of a fully functioning web browser, so it is very resource-intensive for what you usually need. There are other Python packages, like Scrapy
and Splash
, which are lightweight libraries built for web-scraping. This post was meant to introduce the basics of Selenium and the basics of web-scraping together.
Selenium’s Python documentation:
Official Selenium site: