Use Python with Selenium to scrape JavaScript-heavy websites
Selenium is a wonderful tool that lets you automate website testing by reproducing a user's actions. That said, many people actually use it for other purposes, such as web scraping.
Under the hood, Selenium uses a browser's driver to render the fetched page, JavaScript included. As a result, your program can easily access data that would be complicated to obtain with a traditional scraper or crawler.
Finding information by searching on the website
Let’s jump directly into today’s project. We are going to retrieve the list of Dragon Ball Son Goku action figures, along with their prices, from the website I built for the experiment described here.
First, we import the packages and prepare some helper functions to output our data. We are going to write the data to a CSV file to keep the process simple.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
import csv
import time


def setCSV(output_file, headers):
    # Prepare a CSV writer and write the header row
    writer = csv.DictWriter(output_file, fieldnames=headers)
    writer.writeheader()
    return writer


def writeRowsCSV(writer, data):
    # Write each scraped product as one CSV row
    for row in data:
        writer.writerow(row)
We imported a number of utilities from Selenium, which have the following roles in our script (a short illustrative sketch follows the list):
- webdriver: the WebDriver constructor, in other words what initializes Selenium with a driver.
- Options: helps set options on the webdriver.
- Keys: a collection of keyboard keys.
- By: a collection of strategies used to locate elements.
- WebDriverWait: makes the given driver wait explicitly; methods like “until()” can be chained to stop waiting once a condition is true.
- expected_conditions: a collection of conditions we can use with the above-mentioned “until()” method.
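To see how these pieces fit together, here is a minimal, illustrative sketch. The URL and the “q” field name are placeholders, not taken from the site we scrape below:
chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(options=chrome_options)
driver.get("https://example.com")
# Wait up to 10 s for a (hypothetical) search field to become clickable
search = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.NAME, "q"))
)
search.send_keys("dragon ball", Keys.RETURN)
driver.quit()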
Here is the process we will follow to get our data; it simply mirrors what anyone would do by hand to find the required information. To prepare, I use the Google Chrome DevTools to inspect the page and note which attribute, element, or structure I can use to distinguish the element I need from the others. Below is a screenshot of me looking for the search bar.
Our plan is to get all of Goku’s action figures. Instinctively, we would type our keyword into the search bar and press Enter, and that is exactly what we are going to do. But first, we need to initialize Selenium with the driver of the browser we want to use.
chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(options=chrome_options)
As a note, I use the Chrome driver, which I downloaded separately from Selenium as explained in Selenium’s setup documentation. I also added the “--headless” argument because I don’t want to watch a browser window open and perform the actions I ask for; I just want them done in the background.
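If the chromedriver binary you downloaded is not on your PATH, you can point Selenium at it explicitly. A minimal sketch, assuming Selenium 4; the path below is a placeholder for wherever you saved the driver:
from selenium.webdriver.chrome.service import Service

chrome_options = Options()
chrome_options.add_argument("--headless")
# "/path/to/chromedriver" is a placeholder, not a real location
driver = webdriver.Chrome(service=Service("/path/to/chromedriver"), options=chrome_options)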
Once Selenium is initialized, we load the webpage, type our keyword into the search bar, and press the “Enter” key. You may notice a few time.sleep() calls: they make the script behave a bit more like a human and give the webpage time to process the input, in case some kind of “onChange” event is attached to it.
driver.get("https://www.figurines-maniac.com/")
time.sleep(2)
elem = WebDriverWait(driver, 30).until(
EC.presence_of_element_located((By.XPATH,'//input[@class="header-search-input"]'))
)
elem.send_keys("Goku")
time.sleep(1)
elem.send_keys(Keys.RETURN)
time.sleep(4)
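As a side note, instead of the fixed four-second sleep after pressing Enter, you could wait explicitly for the results to show up. A minimal sketch, reusing the products-list selector we target in the next step:
# Wait up to 30 s for the results list rather than sleeping a fixed time
WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.XPATH, '//ul[contains(@class, "products")]'))
)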
Next, we need to locate the list of action figures on the page. Our eyes are naturally trained for this, but once again our little script needs our help to find them. Since it doesn’t know what we are looking for, we have to tell it which information we need. I want three pieces of information: the product name, the price, and the link to the product, as I may be interested in buying it later on.
The names I chose for these fields are straightforward: “product_name”, “price” and “url”.
# Looking for the list of products
products_ul = driver.find_elements(By.XPATH, '//ul[contains(@class, "products")]')
if len(products_ul) > 0:
    product_list = []
    # Getting the data for every product in the list
    for product in products_ul[0].find_elements(By.XPATH, './/li'):
        product_title = product.find_elements(By.XPATH, './/h2')
        if len(product_title) > 0:
            product_name = product_title[0].text
            prices = product.find_elements(By.XPATH, './/bdi')
            price = prices[0].text if len(prices) > 0 else "Price not found"
            links = product.find_elements(By.XPATH, './/a')
            url = links[0].get_attribute("href") if len(links) > 0 else "URL not found"
            product_list.append({
                "product_name": product_name,
                "price": price,
                "url": url
            })
    # Outputting scraped data to the CSV file
    # newline="" keeps the csv module from writing blank lines on Windows
    with open("products_output.csv", "w", encoding="utf-8", newline="") as output_file:
        headers = ["product_name", "price", "url"]
        writer = setCSV(output_file, headers)
        writeRowsCSV(writer, product_list)
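To sanity-check the output, you can read the file back with csv.DictReader. A minimal sketch:
import csv

with open("products_output.csv", encoding="utf-8", newline="") as f:
    for row in csv.DictReader(f):
        print(row["product_name"], row["price"], row["url"])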
Let’s not forget to clean up and close the browser once we’re done. Note that quit() shuts down the driver and all its windows, whereas close() only closes the current window.
if driver is not None:
    driver.quit()
And we are done! Selenium is easy to use and is really helpful for collecting data from websites where you need to perform complete actions, like logging in or searching, before getting to the data. You can find the full code below, for you to enjoy!
If you enjoyed the article or found it useful, it would be kind of you to support me by following me here (Jonathan Mondaut). More articles are coming very soon!
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
import csv
import time


def setCSV(output_file, headers):
    # Prepare a CSV writer and write the header row
    writer = csv.DictWriter(output_file, fieldnames=headers)
    writer.writeheader()
    return writer


def writeRowsCSV(writer, data):
    # Write each scraped product as one CSV row
    for row in data:
        writer.writerow(row)


driver = None
try:
    # Initialize Selenium with a headless Chrome driver
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    driver = webdriver.Chrome(options=chrome_options)

    # Load the page, type the keyword into the search bar and press Enter
    driver.get("https://www.figurines-maniac.com/")
    time.sleep(2)
    elem = WebDriverWait(driver, 30).until(
        EC.presence_of_element_located((By.XPATH, '//input[@class="header-search-input"]'))
    )
    elem.send_keys("Goku")
    time.sleep(1)
    elem.send_keys(Keys.RETURN)
    time.sleep(4)

    # Looking for the list of products
    products_ul = driver.find_elements(By.XPATH, '//ul[contains(@class, "products")]')
    if len(products_ul) > 0:
        product_list = []
        # Getting the data for every product in the list
        for product in products_ul[0].find_elements(By.XPATH, './/li'):
            product_title = product.find_elements(By.XPATH, './/h2')
            if len(product_title) > 0:
                product_name = product_title[0].text
                prices = product.find_elements(By.XPATH, './/bdi')
                price = prices[0].text if len(prices) > 0 else "Price not found"
                links = product.find_elements(By.XPATH, './/a')
                url = links[0].get_attribute("href") if len(links) > 0 else "URL not found"
                product_list.append({
                    "product_name": product_name,
                    "price": price,
                    "url": url
                })
        # Outputting scraped data to the CSV file
        # newline="" keeps the csv module from writing blank lines on Windows
        with open("products_output.csv", "w", encoding="utf-8", newline="") as output_file:
            headers = ["product_name", "price", "url"]
            writer = setCSV(output_file, headers)
            writeRowsCSV(writer, product_list)
except Exception as error:
    print(error)
finally:
    # Quit the browser whether or not the scrape succeeded
    if driver is not None:
        driver.quit()