An Introduction to Selenium and Beautiful Soup
One of the many things you might have heard about while learning Python is web scraping. When I was working on a personal project, I had to collect weather data from multiple counties, and I turned to web scraping to gather that data. Two of the most useful tools you can use for web scraping are Selenium and Beautiful Soup.
Beautiful Soup is a Python library that makes it easy to scrape data from web pages. It parses HTML or XML documents into a tree-like structure that you can navigate and search: you can look up specific tags, attributes, or ids, and move through relationships such as parents, children, and siblings.
Selenium is a set of tools that allows the user to automate a web driver. To truly understand the importance of this while web scraping, let us take an example. In my project, when I needed weather data, I required multiple months of data. However, the information was not on just one web page. I had to move to multiple pages to collect this data. The last thing you want to do is move to each site by hand, grab the specific data you are looking for, and then move on to the next site. The tools from selenium allow the user to automate the process of going to each site and collecting the necessary information. This makes collecting the data a lot faster and smoother. Not to mention you don’t have to stay on your computer while collecting the data. You could set a program and allow it collect the data for you.
Of course, the first step is to install the package. The current release as of October 3rd, 2020 is Beautiful Soup 4.9.3. An easy way to install Beautiful Soup is with
pip install beautifulsoup4
If you are using a Jupyter notebook, after you set up a new notebook, import the package with the following code:
from bs4 import BeautifulSoup
So now that the library is set up, we want to be able to access a web page. If you are just beginning and want to scrape only a single page, I would suggest the Requests module, which lets you download the contents of a web page.
The next step is to find a website to scrape. Certain sites are specifically designed for people to practice scraping. One example is Books to Scrape. It is only a demo website, so some of the information, such as prices and ratings, was assigned random values, but you won't be in danger of being kicked off the site.
import requests

html_page = requests.get('http://books.toscrape.com/')
Once we have a website requested, we want to pass the page through beautiful soup.
soup = BeautifulSoup(html_page.content, 'html.parser')
When the page goes through the Beautiful Soup constructor, it is converted into Unicode. Unicode is "an international encoding standard for using different languages and scripts." We can take a quick look at the information and see its structure by using the following code.
print(soup.prettify())
This will show all the information on the page, which is a lot. When scraping, we only want a specific portion of that information. So instead of trying to read through the entire page, we can use the browser's Inspect Element feature. When you are on a page, if you right-click it, there is an Inspect option. For example, on the Books to Scrape page, if you right-click the price of the first book, you will see an Inspect button.
Pressing this button opens a side panel that navigates directly to that element's position in the document and highlights it.
If you hover over different portions of the Elements panel, the respective portions of the page are highlighted.
So, if we want to look specifically for the prices of the books, we need a variable that narrows our search down to that selection. The first step is the find method. Using Inspect Element on the website, we can locate the <div> that contains all the books. Just above it, there is a unique <div> with an alert warning.
select = soup.find('div', class_="alert alert-warning")
From there we can navigate down to the container that has the books.
books = select.next_sibling.next_sibling
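To see why next_sibling is called twice, note that the whitespace between tags is itself a node in the parse tree, so the first sibling is a text node rather than the next tag. Here is a minimal sketch with hypothetical markup (the class names mirror the real page, but the HTML itself is made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: a newline separates the two divs,
# just as whitespace separates tags in the real page source.
html = '<div class="alert alert-warning">Warning</div>\n<div class="row">books here</div>'
soup = BeautifulSoup(html, 'html.parser')

select = soup.find('div', class_="alert alert-warning")
print(repr(select.next_sibling))          # '\n' -- a text node, not a tag
books = select.next_sibling.next_sibling  # skip the text node to reach the next div
print(books['class'])                     # ['row']
```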
Now that we have all the books, we can take a closer look at the prices. If you look carefully at the elements, you can see that each price is in a paragraph with the class price_color. The find_all method lets us find every matching case within the books container.
prices = books.find_all('p', class_='price_color')
This gives us a list of prices. If we wanted to find information like the min, max, or mean of the prices on the page, we would first need to extract the text and then convert the strings to floats.
prices_text = [price.text for price in prices]
Now that we have a list of prices as strings, we can convert them to floats once we remove the pound symbol (£).
prices_fl = [float(price[1:]) for price in prices_text]
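Putting the pricing steps together, here is a self-contained sketch that runs without a network connection; the inline HTML is hypothetical and simply mimics the price_color markup on Books to Scrape:

```python
from statistics import mean
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the price paragraphs on the real page
html = """
<div>
  <p class="price_color">£51.77</p>
  <p class="price_color">£53.74</p>
  <p class="price_color">£50.10</p>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
prices = soup.find_all('p', class_='price_color')          # each price paragraph
prices_text = [price.text for price in prices]             # e.g. '£51.77'
prices_fl = [float(price[1:]) for price in prices_text]    # strip '£', cast to float

print(min(prices_fl), max(prices_fl), round(mean(prices_fl), 2))
```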
Now we have a list of the prices on that page, and we can learn more by finding information such as which book in the list was the most expensive. However, as mentioned above, sometimes you need information from multiple pages, and that is where Selenium comes in.
As before, the first step is to install the suite of tools.
pip install selenium
pip install webdriver-manager
pip install selectorlib
Next we import a number of tools into our notebook.
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import TimeoutException
from selectorlib import Extractor
Now, this may seem like a lot of imports, but if you are visiting multiple sites and your internet connection is slow, you don't want a site to be skipped because the previous page did not finish loading.
Going back to my example of collecting weather data, I ran a function that grabbed a table from multiple pages. However, when I returned after the function finished, I found that half my data set was duplicated because the next page had not yet loaded when the table was scraped.
For a basic example of how you can visit multiple sites, let us go back to the Books to Scrape website. This site has multiple pages of books to iterate through. The first step in this case is to decide which browser you want to use.
driver = webdriver.Chrome(ChromeDriverManager().install())
This will open a new Chrome browser window that is controlled by the program. Next, we tell the browser to go to the website.
driver.get('http://books.toscrape.com/')
Using Selenium, we can have the browser click certain buttons, as long as you know which element you are trying to click. On the driver, you use the find_element method, choosing how you want to locate the element and then what that element is. In this case we will use the XPath. To find the XPath, use Inspect Element on the button you want to click, then right-click the highlighted element and choose Copy XPath.
The following code then will click the button and take the browser to the next page.
click_next = driver.find_element(By.XPATH, '/html/body/div/div/div/div/section/div/div/ul/li/a')
click_next.click()
The page source can then be passed through Beautiful Soup and taken through the same process shown above.
soup = BeautifulSoup(driver.page_source, 'html.parser')
Confirming Your Page Has Loaded
As mentioned before, sometimes the page will not load at the same speed throughout the program. To confirm that the page has loaded, one way is to wait for an element to be displayed.
timeout = 5
try:
    element_present = EC.presence_of_element_located((By.ID, 'main'))
    WebDriverWait(driver, timeout).until(element_present)
except TimeoutException:
    print("Timed out waiting for page to load")
Looking at this code, the presence_of_element_located() method tells us whether a given element is present on the page. WebDriverWait forces the program to wait, via its until method, until the element is located, which is what confirms that the page has loaded.
When used together, Selenium and Beautiful Soup are powerful tools that allow the user to web scrape data efficiently and quickly. I hope this introduction will help you feel a bit more comfortable about web scraping with these great tools.
For more information, you can check the documentation of the tools found below:
Beautiful Soup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Selenium: https://www.selenium.dev/documentation/