BeautifulSoup + Selenium = Superpower!
Web Scraping 101: Basic functions and how to use them.
I know it’s an old joke, but it was exactly how I felt when I learned web scraping for the first time. In the beginning, I was taught to scrape data with BeautifulSoup. It was powerful enough to extract all the information I needed, until my project got more complex. Then I found Selenium, which can interact with the browser and automate actions. Today, I’m going to share my experience of using both packages to scrape data from an interactive website.
⭕ Backstory
Convenience stores in Taiwan are a cultural phenomenon. Not only can a store do almost anything you can imagine, but the density is also remarkable: a single street can have three convenience stores, each within a one-minute walk of the next. Sounds crazy? I think so too. I was curious why they do this and what their strategy is, so I decided to find out the secret behind it. To start this project, I needed to get every store’s information, including the store’s name, address, service information, etc. Here comes the challenge, and it’s also today’s topic.
⭕ Target
7–11 and FamilyMart are the two major and most popular brands in Taiwan. Their services cover almost everything people need daily.
⭕ Challenge
Both companies use an interactive map as their store search page.
The tricky part about these map pages is that they don’t contain real hyperlinks, so there is no URL I can extract to go deeper. A typical link looks like this:
<a href="#" onclick="showAdminArea('宜蘭縣')">宜蘭縣</a>
This is where Selenium comes in handy. There are a few things we need before we start.
First, we need a driver for interacting with the browser: ChromeDriver if you are using Chrome, or geckodriver if you are using Firefox. Remember, you need to check your browser version and download the compatible driver. In today’s topic, I’m using ChromeDriver as the example.
After downloading the driver, here is the basic structure for using Selenium.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

# open driver
PATH_TO_DRIVER = '/Users/guest/code/chromedriver'
driver = webdriver.Chrome(executable_path=PATH_TO_DRIVER)

# launch url using driver
url = 'https://www.family.com.tw/Marketing/inquiry.aspx'
driver.get(url)

# save the currently loaded page into "page_source"
page_source = driver.page_source
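A quick side note: the executable_path argument and the find_element_by_* helpers used throughout this article belong to Selenium 3 and were removed in Selenium 4. If you are on a newer version, a minimal equivalent of the setup above looks like this (same driver path and URL):
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Selenium 4 style: the driver path goes through a Service object
driver = webdriver.Chrome(service=Service('/Users/guest/code/chromedriver'))
driver.get('https://www.family.com.tw/Marketing/inquiry.aspx')

# locators go through find_element(By.<strategy>, ...)
driver.find_element(By.XPATH, '//*[@id="taiwanMap"]/div[1]/a').click()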
Once the page is loaded, page_source lets us view the HTML DOM. However, that is not our goal today. We want to click the buttons, go deeper into the page, narrow down the area, and then scrape the data we need.
“Selenium” has many methods to locate an element. According to the official documentation, you can locate elements by id, name, XPath, link text, partial link text, tag name, class name, or CSS selector.
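For illustration, here is what a few of those strategies look like in the Selenium 3 style API used in this article (the id, class, and selector values below are hypothetical examples, not all taken from the real pages):
# each strategy has a matching find_element_by_* method (Selenium 3 style);
# the locator values here are only illustrative
driver.find_element_by_id('taiwanMap')
driver.find_element_by_class_name('graybox')
driver.find_element_by_css_selector('#showShopList table tr')
driver.find_element_by_link_text('宜蘭縣')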
Most of the time, I use XPath to locate the button element and perform the click.
driver.find_element_by_xpath('//*[@id="taiwanMap"]/div[1]/a').click()
❗For beginners: you can find the XPath (or any other locator) of an element in the browser’s developer tools. Right-click the element on the page, choose Inspect, then right-click the highlighted node in the Elements panel and choose Copy → Copy XPath.
By using these locators, we can click an element or assign it to a variable. For example, I clicked two buttons to reach a sub-page with table information, and then assigned the first row to the “row” variable.
import time

# click the first element
driver.find_element_by_xpath('//*[@id="taiwanMap"]/div[1]/a').click()
time.sleep(5)  # use sleep to wait for the page to finish loading

# click the second element
driver.find_element_by_xpath('//*[@id="showTownList"]/li[1]/a').click()
time.sleep(5)  # use sleep to wait for the page to finish loading

# assign the first row to the "row" variable
row = driver.find_element_by_xpath('//*[@id="showShopList"]/table/tbody[2]/tr[1]')
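By the way, fixed time.sleep calls waste time when the page loads quickly and can still break when it loads slowly. A more robust option, sketched here as an alternative rather than what my original code did, is Selenium’s explicit wait:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for the town list to appear instead of sleeping blindly
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, '//*[@id="showTownList"]/li[1]/a'))
)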
When we print out the row, it shows a Selenium WebElement object. There are a couple of ways to extract data from it. I’m going to give two examples I used in my project.
row.get_attribute('innerHTML')  # first example: get an attribute
From the above code, I get a result like the one below:
' <td class="graybox">全家宜蘭大福店</td> <td class="graybox"> <table width="100%"> <tbody><tr> <td> 店舖號:019762</td> <td align="right"><div class="shop_add_map"><a href="#" onclick="showMap(0)"><span class="add_map_word">地圖檢視</span></a></div></td> </tr> </tbody></table> 服務編號:15012<br> 地址:宜蘭縣宜蘭市大福路一段43號,45號<br> 電話:03-9108022 , 03-9301476</td> <td class="graybox"><span class="store02"></span><span class="store07"></span><span class="store05"></span><span class="store03"></span><span class="store04"></span><span class="store10"></span><span class="store12"></span><span class="store21"></span><span class="store30"></span></td>'
It prints out the original HTML of the first table row I located.
row.text
When I use text instead of get_attribute, I get a different result:
'全家宜蘭大福店\n店舖號:019762\n地圖檢視\n服務編號:15012\n地址:宜蘭縣宜蘭市大福路一段43號,45號\n電話:03-9108022 , 03-9301476'
Once I get the above result, it’s easy to use pandas to turn it into a DataFrame.
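For example, a minimal sketch of that step could look like the following; the field positions are an assumption based on the row.text output shown above, so adjust them if a row has a different layout:
import pandas as pd

# split the row text on newlines; the indices follow the sample output above
fields = row.text.split('\n')
record = {
    'store_name': fields[0],
    'store_id': fields[1].replace('店舖號:', ''),
    'address': fields[4].replace('地址:', ''),
    'phone': fields[5].replace('電話:', ''),
}
df = pd.DataFrame([record])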
The above examples are just very basic functions, but they were already powerful enough to achieve what I needed. If you want to go deeper, I strongly recommend checking the official Selenium website and documentation.
How about BeautifulSoup?
I used BeautifulSoup to store the page information, search it for keywords, and turn those keywords into condition rules.
For example, sometimes the results spanned multiple pages and I needed to locate the pagination element. So I saved the page and used a condition to check whether a certain string exists.
# looking for the total number of pages
# ("soup" is the parsed page source; see the next snippet for how it is built)
total_pages = 1
find_pages = soup.find_all('a')
for p in find_pages:
    if 'onclick="chgPage' in str(p):
        total_pages += 1

# if the pagination block is visible and there is more than one page,
# click through each remaining page
if 'div class="page_bu" id="page_bu" style="display: block;"' in str(soup) and total_pages > 1:
    for page in range(2, total_pages + 1):
        driver.find_element_by_xpath(f'//*[@id="page_bu_content"]/li[{page}]/a').click()
        print(f'We are at page {page}...')
I also used this method together with regex to locate elements.
import re
import time
from bs4 import BeautifulSoup

page_source = driver.page_source
soup = BeautifulSoup(page_source, 'html.parser')
time.sleep(5)

# grab the pagination cell and pull the page numbers out of it
pages_string = str(soup.find("td", attrs={"class": "nb_pagination"}))
pages_extract = re.findall(r'>\d+?<', pages_string)
pages_lst = [re.sub('>|<', '', i) for i in pages_extract]
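To make the regex step concrete, here is a self-contained example; the HTML string is invented for illustration, and only the nb_pagination class name comes from the real page:
import re

# hypothetical pagination HTML, just to show what the two regex calls produce
pages_string = '<td class="nb_pagination"><a>1</a><a>2</a><a>3</a></td>'
pages_extract = re.findall(r'>\d+?<', pages_string)        # ['>1<', '>2<', '>3<']
pages_lst = [re.sub('>|<', '', i) for i in pages_extract]  # ['1', '2', '3']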
This project relied mostly on Selenium because of the interactive webpage, but that doesn’t mean Selenium is better than BeautifulSoup. I once used only BeautifulSoup to scrape a film image library, and it worked perfectly. I will share that experience in a future article.
Thank you for reading, and please leave a comment below to let me know your thoughts about this method, or whether there is a better way to do this kind of work.