WEB SCRAPING SERIES

Web Scraping With Selenium & Scrapy

A hands-on guide to combining Selenium with Scrapy

Karthikeyan P
The Startup


In the previous tutorials, we understood and worked with Scrapy and Selenium individually. In this tutorial, I shall highlight the need to combine the two and explain how to do it. Let us start with why Selenium needs to be combined with Scrapy.

This is the final part of a 4-part tutorial series on web scraping using Scrapy and Selenium. The previous parts can be found at

Part 1: Web scraping with Scrapy: Theoretical Understanding

Part 2: Web scraping with Scrapy: Practical Understanding

Part 3: Web scraping with Selenium

Why combine Selenium with Scrapy?

The main drawback of Scrapy is its inability to natively handle dynamic websites, i.e. websites that use JavaScript (React, Vue, etc.) to render content as and when needed. For example, trying to extract the list of countries from http://openaq.org/#/countries using Scrapy would return an empty list. To demonstrate this, the Scrapy shell is used with the command

scrapy shell https://openaq.org/#/countries

The processing of this command is shown below.

In [1]: response.xpath('//h1[@class="card__title"]/a/text()').get()
In [2]: response.xpath('//h1[@class="card__title"]/a/text()').getall()
Out[2]: []
In [5]: print(response)
<200 https://openaq.org/>
In [6]: print(response.text)
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />
<meta name="description" content="OpenAQ is a community of scientists, software developers, and lovers of open environmental data" />
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no" />
<title>OpenAQ</title>
<!-- Twitter -->
<meta name="twitter:card" content="summary" />
<meta name="twitter:site" content="@OpenAQ" />
<meta name="twitter:title" content="OpenAQ">
<meta name="twitter:description" content="OpenAQ is a community of scientists, software developers, and lovers of open environmental data" />
<meta name="twitter:image:src" content="assets/graphics/meta/default-meta-image.png" />
<!--/ Twitter -->
<!-- OG -->
<meta property="og:site_name" content="OpenAQ" />
<meta property="og:title" content="OpenAQ" />
<meta property="og:url" content="https://openaq.org/" />
<meta property="og:type" content="website" />
<meta property="og:description" content="OpenAQ is a community of scientists, software developers, and lovers of open environmental data" />
<meta property="og:image" content="assets/graphics/meta/default-meta-image.png" />
<!--/ OG -->
<link rel="icon" type="image/png" sizes="96x96" href="assets/graphics/meta/favicon.png" />
<link rel="icon" type="image/png" sizes="192x192" href="assets/graphics/meta/android-chrome.png" />
<link rel="apple-touch-icon" sizes="180x180" href="assets/graphics/meta/apple-touch-icon.png" />
<link href="https://fonts.googleapis.com/css?family=Oxygen:400,700|Source+Sans+Pro:300,300i,400,400i,700,700i&display=swap" rel="stylesheet">
<link rel="stylesheet" href="/assets/styles/main-e543e76bd1.css">
</head>
<body>
<div id="app-container">
<!-- page -->
</div>
<script>
(function(b,o,i,l,e,r){b.GoogleAnalyticsObject=l;b[l]||(b[l]=
function(){(b[l].q=b[l].q||[]).push(arguments)});b[l].l=+new Date;
e=o.createElement(i);r=o.getElementsByTagName(i)[0];
e.src='//www.google-analytics.com/analytics.js';
r.parentNode.insertBefore(e,r)}(window,document,'script','ga'));
ga('create','UA-66787377-1');ga('send','pageview');
</script>
<script src="/assets/scripts/vendor-48750b5367.js"></script>
<script src="/assets/scripts/bundle-44f67f4c5c.js"></script>
</body>
</html>

The XPath that identifies each country's name on OpenAQ's countries webpage is //h1[@class="card__title"]/a. Therefore, response.xpath('//h1[@class="card__title"]/a/text()').get() is a valid statement, but it produces no output. Trying to extract all the country names with response.xpath('//h1[@class="card__title"]/a/text()').getall() returns an empty list. This raises the doubt of whether the URL is even correct. Lines In [5]: and In [6]: make it clear that there is indeed a valid response from the webpage, just not the one needed to extract country names. This is because OpenAQ uses React to render its webpages. Upon inspecting the webpage with the browser's developer tools, all the country names are present inside the <div id="app-container"> tag, but Scrapy sees that tag containing nothing other than <!-- page --> (refer to In [6]: in the block above). All of this is because Scrapy cannot handle webpages that render their content with JavaScript.

Selenium is an automation tool for testing web applications. It uses a WebDriver as an interface to control webpages through programming languages, which gives Selenium the capability to handle dynamic webpages effectively. From the previous tutorial, it may seem like Selenium is capable of extracting data on its own. That is true, but it has its caveats: Selenium does not handle large volumes of data well, whereas Scrapy handles large data with ease, and Selenium is much slower than Scrapy.
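As a quick sanity check of that first claim, here is a minimal sketch (using the same headless Chrome setup and Selenium 3 style API as the spiders later in this tutorial) that loads the countries page in a real browser and counts the rendered country cards. Unlike the Scrapy shell session above, it finds them, because Chrome executes the page's JavaScript before we look at the DOM.

# Minimal sketch: confirm that Selenium sees the JS-rendered country cards
# that the Scrapy shell above could not see.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument("headless")
driver = webdriver.Chrome(desired_capabilities=options.to_capabilities())

driver.get("https://openaq.org/#/countries")
# Wait until React has rendered at least one country card
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "card__title"))
)
cards = driver.find_elements_by_class_name("card__title")
print(f"Rendered country cards: {len(cards)}")  # non-zero, unlike the empty list from Scrapy
driver.quit()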

So, the smart choice is to use Selenium together with Scrapy: Selenium renders the dynamic webpage, and Scrapy extracts the large amount of data from it in less time.

How to combine Selenium with Scrapy?

Combining Selenium with Scrapy is a fairly simple process. All that needs to be done is to let Selenium render the webpage and, once it is done, pass the webpage's source to create a Scrapy Selector object. From there on, Scrapy can crawl the page with ease and effectively extract a large amount of data. A general skeleton of this combination is presented below.

# SKELETON FOR COMBINING SELENIUM WITH SCRAPY
from scrapy import Selector
# Other Selenium and Scrapy imports
...

driver = webdriver.Chrome()

# Selenium tasks and actions to render the webpage with required content

selenium_response_text = driver.page_source
new_selector = Selector(text=selenium_response_text)

# Scrapy tasks to extract data from Selector
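
Filling in that skeleton for the OpenAQ countries page gives a small standalone script along these lines (a sketch, not one of the spiders from this tutorial; it assumes the same headless Chrome setup used later):

# Sketch: Selenium renders the page, then Scrapy's Selector extracts the data
from scrapy import Selector
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument("headless")
driver = webdriver.Chrome(desired_capabilities=options.to_capabilities())

driver.get("https://openaq.org/#/countries")
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "card__title"))
)

# Hand-off: pass the rendered HTML to Scrapy
selenium_response_text = driver.page_source
new_selector = Selector(text=selenium_response_text)

# The same XPath that returned an empty list in the Scrapy shell now works
print(new_selector.xpath('//h1[@class="card__title"]/a/text()').getall()[:5])
driver.quit()

In the spiders below, the same hand-off happens inside a Scrapy callback, so the extracted data can be yielded straight into Scrapy's feed exports.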

Example using OpenAQ

Similar to the Selenium tutorial, this example also uses 3 steps to extract PM2.5 data from http://openaq.org. These 3 steps are:

  1. Collecting country names as displayed on OpenAQ countries webpage. This would be used in selecting appropriate checkboxes.
  2. Collecting URLs that contain PM2.5 data from each country. Some countries report more than 20 PM2.5 readings from various locations, which requires further manipulation of the webpage; this is explained in the code section.
  3. Opening up the individual URL and extracting PM2.5 data.

These steps would be performed by 3 basic spiders and the output from these spiders would be stored in JSON format. NOTE: A prior working knowledge of Selenium & Scrapy is required to understand this example. The code for this example can be found in my GitHub repository.

countries_spider

Below is the code for the spider that extracts country names and stores them in a JSON file. Note that this spider does not adhere to the skeleton for combining Selenium with Scrapy. The main reason is that it is more efficient to pass the cards already extracted by Selenium straight to Scrapy's yield, which writes them to a JSON file implicitly. The skeleton is followed in the spider that extracts PM2.5 values from individual locations.

import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from logzero import logfile, logger

class CountriesSpiderSpider(scrapy.Spider):
    # Initializing log file
    logfile("openaq_spider.log", maxBytes=1e6, backupCount=3)

    name = "countries_spider"
    allowed_domains = ["toscrape.com"]

    # Using a dummy website to start scrapy request
    def start_requests(self):
        url = "http://quotes.toscrape.com"
        yield scrapy.Request(url=url, callback=self.parse_countries)

    def parse_countries(self, response):
        # driver = webdriver.Chrome()  # To open a new browser window and navigate it
        # Use headless option to not open a new browser window
        options = webdriver.ChromeOptions()
        options.add_argument("headless")
        desired_capabilities = options.to_capabilities()
        driver = webdriver.Chrome(desired_capabilities=desired_capabilities)

        # Getting list of Countries
        driver.get("https://openaq.org/#/countries")

        # Implicit wait
        driver.implicitly_wait(10)

        # Explicit wait
        wait = WebDriverWait(driver, 5)
        wait.until(EC.presence_of_element_located((By.CLASS_NAME, "card__title")))

        countries = driver.find_elements_by_class_name("card__title")
        countries_count = 0

        # Using Scrapy's yield to store output instead of explicitly writing to a JSON file
        for country in countries:
            yield {
                "country": country.text,
            }
            countries_count += 1

        driver.quit()
        logger.info(f"Total number of Countries in openaq.org: {countries_count}")

This spider is executed using the following command

scrapy crawl countries_spider -o countries_list.json

The countries_list.json file would look like this.

countries_list.json
[
{"country": "Afghanistan"},
{"country": "Algeria"},
{"country": "Andorra"},
...
]

urls_spider

This spider opens the https://openaq.org/#/locations webpage and filters countries and PM2.5 data using the checkboxes on the left-side panel. Locations reporting PM2.5 readings in the corresponding country are rendered as cards on the right-side panel; each card opens a new webpage that renders the recorded values. Similar to the spider above, this one does not follow the skeleton either and uses Scrapy's yield to return the URLs.

import scrapy
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
import selenium.common.exceptions as exception
from logzero import logfile, logger
import time
import json

class UrlsSpiderSpider(scrapy.Spider):
    logfile("openaq_spider.log", maxBytes=1e6, backupCount=3)

    name = "urls_spider"
    allowed_domains = ["toscrape.com"]

    def start_requests(self):
        url = "http://quotes.toscrape.com"
        yield scrapy.Request(url=url, callback=self.parse_urls)

    def parse_urls(self, response):
        # Use headless option to not open a new browser window
        options = webdriver.ChromeOptions()
        options.add_argument("headless")
        desired_capabilities = options.to_capabilities()
        driver = webdriver.Chrome(desired_capabilities=desired_capabilities)

        # Load the countries list written by countries_spider.py
        with open("countries_list.json", "r") as f:
            temp_list = json.load(f)
        countries_list = list(map(lambda x: x["country"], temp_list))

        total_url_count = 0
        for i, country in enumerate(countries_list):
            # Opening locations webpage
            driver.get("https://openaq.org/#/locations")
            driver.implicitly_wait(5)
            country_url_count = 0

            # Scrolling down the country filter till the country is visible
            action = ActionChains(driver)
            action.move_to_element(driver.find_element_by_xpath("//span[contains(text()," + '"' + country + '"' + ")]"))
            action.perform()

            # Identifying country and PM2.5 checkboxes
            country_button = driver.find_element_by_xpath("//label[contains(@for," + '"' + country + '"' + ")]")
            values_button = driver.find_element_by_xpath("//span[contains(text(),'PM2.5')]")

            # Clicking the checkboxes
            country_button.click()
            time.sleep(2)
            values_button.click()
            time.sleep(2)

            while True:
                # Navigating subpages where there are more PM2.5 data
                locations = driver.find_elements_by_xpath("//h1[@class='card__title']/a")
                for loc in locations:
                    link = loc.get_attribute("href")
                    country_url_count += 1
                    yield {
                        "url": link,
                    }
                try:
                    next_button = driver.find_element_by_xpath("//li[@class='next']")
                    next_button.click()
                except exception.NoSuchElementException:
                    logger.debug(f"Last page reached for {country}")
                    break

            logger.info(f"{country} has {country_url_count} PM2.5 URLs")
            total_url_count += country_url_count

        logger.info(f"Total PM2.5 URLs: {total_url_count}")
        driver.quit()

When executed with the following command, this spider produces a JSON file that looks like this.

scrapy crawl urls_spider -o urls.json

urls.json
[
{"url": "https://openaq.org/#/location/US%20Diplomatic%20Post%3A%20Kabul"},
{"url": "https://openaq.org/#/location/Kabul"},
{"url": "https://openaq.org/#/location/US%20Diplomatic%20Post%3A%20Algiers"},
{"url": "https://openaq.org/#/location/Algiers"},
...
]

pm_data_spider

This final spider crawls all the URLs extracted by the spider above and extracts the PM2.5 values. It also extracts the country, city, location, and the date and time at which each PM2.5 value was recorded. All of these are stored in a JSON file.

import scrapy
from scrapy.selector import Selector
from ..items import OpenaqItem
from selenium import webdriver
from logzero import logger, logfile
import json
import time

class PmDataSpiderSpider(scrapy.Spider):
    logfile("openaq_spider.log", maxBytes=1e6, backupCount=3)

    name = "pm_data_spider"
    allowed_domains = ["toscrape.com"]

    def start_requests(self):
        url = "http://quotes.toscrape.com"
        yield scrapy.Request(url=url, callback=self.parse_pm_data)

    def parse_pm_data(self, response):
        # Use headless option to not open a new browser window
        options = webdriver.ChromeOptions()
        options.add_argument("headless")
        desired_capabilities = options.to_capabilities()
        driver = webdriver.Chrome(desired_capabilities=desired_capabilities)

        with open("urls.json", "r") as f:
            temp_list = json.load(f)
        urls = list(map(lambda x: x["url"], temp_list))

        count = 0
        for i, url in enumerate(urls):
            driver.get(url)
            driver.implicitly_wait(10)
            time.sleep(2)

            # Hand-off between Selenium and Scrapy happens here
            sel = Selector(text=driver.page_source)

            # Extract Location and City
            location = sel.xpath("//h1[@class='inpage__title']/text()").get()
            city_full = sel.xpath("//h1[@class='inpage__title']/small/text()").getall()
            city = city_full[1]
            country = city_full[3]

            pm = sel.xpath("//dt[text()='PM2.5']/following-sibling::dd[1]/text()").getall()
            if len(pm) != 0:
                # Extract PM2.5 value, Date and Time of recording
                pm25 = pm[0]
                date_time = pm[3].split(" ")
                date_pm = date_time[0]
                time_pm = date_time[1]
                count += 1

                item = OpenaqItem()
                item["country"] = country
                item["city"] = city
                item["location"] = location
                item["pm25"] = pm25
                item["date"] = date_pm
                item["time"] = time_pm
                yield item
            else:
                # Logging the info of locations that do not have PM2.5 data for manual checking
                logger.error(f"{location} in {city},{country} does not have PM2.5")

            # Terminating and reinstantiating webdriver every 200 URLs to reduce the load on RAM
            if (i != 0) and (i % 200 == 0):
                driver.quit()
                driver = webdriver.Chrome(desired_capabilities=desired_capabilities)
                logger.info("Chromedriver restarted")

        logger.info(f"Scraped {count} PM2.5 readings.")
        driver.quit()

This spider follows the skeleton for combining Selenium with Scrapy and uses Scrapy's Selector on the webpage source at the line sel = Selector(text=driver.page_source). This is where Scrapy takes over from Selenium and extracts the required data. This spider also makes use of Item to return the extracted data. The items.py file consists of the following code.

import scrapy

class OpenaqItem(scrapy.Item):
    country = scrapy.Field()
    city = scrapy.Field()
    location = scrapy.Field()
    pm25 = scrapy.Field()
    date = scrapy.Field()
    time = scrapy.Field()

When this spider is executed using the following command, an output JSON file is generated.

scrapy crawl pm_data_spider -o output.json

Contents of the output file will look like this.

output.json
[
{
"country": "Afghanistan",
"city": "Kabul",
"location": "US Diplomatic Post: Kabul",
"pm25": "17",
"date": "2020/07/18",
"time": "19:00"
},
{
"country": "Algeria",
"city": "Algiers",
"location": "US Diplomatic Post: Algiers",
"pm25": "13",
"date": "2020/07/18",
"time": "18:30"
},
{
"country": "Antigua and Barbuda",
"city": "N/A",
"location": "Algiers",
"pm25": "21.1",
"date": "2020/03/17",
"time": "22:30"
},
...
]

When to combine Selenium with Scrapy?

You might have noticed that the above example and the Selenium example from the previous part have more similarities than differences. You might ask, "Why can't I use Selenium for all my web scraping projects? It seems to get the job done." or "Why learn Scrapy? It is complex to learn." The answer is that you can use Selenium for all your web scraping projects, but, as stated at the beginning of this tutorial, Selenium does not handle large amounts of data well and it is slow. I shall show you the performance difference between the two by extracting books' details from books.toscrape.com. For this, I shall reuse the Scrapy project from Part 2 of this tutorial series.

# Crawling spider to extract books' details
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ..items import BookstoscrapeItem

class CrawlSpiderSpider(CrawlSpider):
    name = "crawl_spider"
    allowed_domains = ["books.toscrape.com"]
    # start_urls = ["http://books.toscrape.com/"]  # when trying to use this, comment out start_requests()
    rules = (Rule(LinkExtractor(allow=r"catalogue/"), callback="parse_books", follow=True),)

    def start_requests(self):
        url = "http://books.toscrape.com/"
        yield scrapy.Request(url)

    def parse_books(self, response):
        """Filtering out pages other than books' pages to avoid getting a "NotFound" error,
        because other pages would not have any 'div' tag with attribute 'class="col-sm-6 product_main"'.
        """
        if response.xpath('//div[@class="col-sm-6 product_main"]').get() is not None:
            title = response.xpath('//div[@class="col-sm-6 product_main"]/h1/text()').get()
            price = response.xpath('//div[@class="col-sm-6 product_main"]/p[@class="price_color"]/text()').get()
            stock = (
                response.xpath('//div[@class="col-sm-6 product_main"]/p[@class="instock availability"]/text()')
                .getall()[-1]
                .strip()
            )
            rating = response.xpath('//div[@class="col-sm-6 product_main"]/p[3]/@class').get()

            # Yielding the extracted data as an Item object
            item = BookstoscrapeItem()
            item["title"] = title
            item["price"] = price
            item["rating"] = rating
            item["availability"] = stock
            yield item

The Selenium code to carry out the same task is given below.

# Selenium code to extract books' details
from selenium import webdriver
import json
import time

class BtsScraper:
    def __init__(self):
        # self.driver = webdriver.Chrome()  # To open a new browser window and navigate it
        # Use the headless option to avoid opening a new browser window
        options = webdriver.ChromeOptions()
        options.add_argument("headless")
        desired_capabilities = options.to_capabilities()
        self.driver = webdriver.Chrome(desired_capabilities=desired_capabilities)

    def scrape_urls(self):
        # Extracting URLs of all the 1000 books
        urls = []
        for i in range(1, 51):
            self.driver.get("http://books.toscrape.com/catalogue/page-" + str(i) + ".html")
            links = self.driver.find_elements_by_xpath('//div[@class="image_container"]/a')
            for link in links:
                urls.append(link.get_attribute("href"))
        with open("urls.json", "w") as f:
            json.dump(urls, f)

    def scrape_book_details(self):
        with open("urls.json", "r") as f:
            urls = json.load(f)
        list_data_dict = []
        for i, url in enumerate(urls):
            data_dict = {}
            self.driver.get(url)
            title = self.driver.find_element_by_xpath('//div[@class="col-sm-6 product_main"]/h1').text
            price = self.driver.find_element_by_xpath(
                '//div[@class="col-sm-6 product_main"]/p[@class="price_color"]'
            ).text
            stock = self.driver.find_element_by_xpath(
                '//div[@class="col-sm-6 product_main"]/p[@class="instock availability"]'
            ).text
            rating = self.driver.find_element_by_xpath('//div[@class="col-sm-6 product_main"]/p[3]').get_attribute(
                "class"
            )
            data_dict["title"] = title
            data_dict["price"] = price
            data_dict["stock"] = stock
            data_dict["rating"] = rating
            list_data_dict.append(data_dict)
        with open("selenium_output.json", "w") as f:
            json.dump(list_data_dict, f)

if __name__ == "__main__":
    tic = time.time()
    scraper = BtsScraper()
    scraper.scrape_urls()
    scraper.scrape_book_details()
    scraper.driver.quit()
    toc = time.time()
    print(f"Execution took {toc-tic} seconds")

Execute the Scrapy spider with the following command.

scrapy crawl crawl_spider -o scrapy_output.json

At the end of execution, Scrapy will dump its execution stats. They will look like this.

{'downloader/request_bytes': 432680,
'downloader/request_count': 1195,
'downloader/request_method_count/GET': 1195,
'downloader/response_bytes': 5499087,
'downloader/response_count': 1195,
'downloader/response_status_count/200': 1194,
'downloader/response_status_count/404': 1,
'dupefilter/filtered': 20230,
'elapsed_time_seconds': 42.833584,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 8, 4, 6, 11, 53, 126402),
'item_scraped_count': 1000,
'log_count/DEBUG': 2196,
'log_count/INFO': 11,
'memusage/max': 56664064,
'memusage/startup': 56664064,
'request_depth_max': 51,
'response_received_count': 1195,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/404': 1,
'scheduler/dequeued': 1194,
'scheduler/dequeued/memory': 1194,
'scheduler/enqueued': 1194,
'scheduler/enqueued/memory': 1194,
'start_time': datetime.datetime(2020, 8, 4, 6, 11, 10, 292818)}

Take a look at elapsed_time_seconds. Let us now execute the Selenium code and look at its output.

Execution took 579.296879529953 seconds

Our crawling spider took about 43 seconds, whereas the Selenium code took 579 seconds to extract the details of 1,000 books; Scrapy was roughly 13.5 times faster here. But here is the catch: books.toscrape.com is a static website, so Scrapy had the upper hand. If the site used JavaScript to render the book details, Selenium would be the only choice of the two. To answer this section's question: combine Selenium with Scrapy when the website uses JavaScript to render the data you need; if the website is static, Scrapy alone would be a wise choice.
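A quick way to make that call for an unfamiliar site is to fetch the page without a browser and check whether the data you want is already present in the raw HTML. The sketch below uses the requests library for the plain fetch (an assumption; it is not part of this tutorial's projects) together with Scrapy's Selector.

# Rough heuristic: fetch the page without a browser and check whether the
# target data is already in the raw HTML. If it is, plain Scrapy is enough;
# if not, the page is likely rendered with JavaScript and needs a browser.
import requests
from scrapy import Selector

# Static site: book titles are present in the raw HTML
static_html = requests.get("http://books.toscrape.com/").text
print(Selector(text=static_html).xpath('//article[@class="product_pod"]/h3/a/@title').getall()[:3])

# Dynamic site: the country cards are injected by JavaScript, so none appear here
# (the "#/countries" fragment is handled client-side and never reaches the server)
dynamic_html = requests.get("https://openaq.org/").text
print(Selector(text=dynamic_html).xpath('//h1[@class="card__title"]/a/text()').getall())  # []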

If you want to read more about choosing which tool to use, there is an excellent article by fellow writer Sri Manikanta Palakollu. The code for this performance comparison and the combining example can be found in my GitHub repository.

Closing remarks

The above-mentioned method is not the only way of dealing with dynamic webpages in Scrapy. There is ScrapyJS (now scrapy-splash), which integrates Scrapy with JavaScript rendering through Splash, and there is a Scrapy middleware named scrapy-selenium that handles JS pages through Selenium. It is best to first analyze the use case and choose the method that suits the need.
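For reference, here is a rough sketch of the scrapy-selenium approach, based on the settings and request class documented in that project's README (verify against the version you install, as names may change); the chromedriver path below is hypothetical.

# settings.py (sketch): wire the scrapy-selenium downloader middleware into the project
SELENIUM_DRIVER_NAME = "chrome"
SELENIUM_DRIVER_EXECUTABLE_PATH = "/path/to/chromedriver"  # hypothetical path
SELENIUM_DRIVER_ARGUMENTS = ["--headless"]
DOWNLOADER_MIDDLEWARES = {"scrapy_selenium.SeleniumMiddleware": 800}

# spider (sketch): yield SeleniumRequest instead of scrapy.Request
import scrapy
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

class CountriesSeleniumSpider(scrapy.Spider):
    name = "countries_selenium"

    def start_requests(self):
        # The middleware renders the page in a browser before calling parse()
        yield SeleniumRequest(
            url="https://openaq.org/#/countries",
            callback=self.parse,
            wait_time=10,
            wait_until=EC.presence_of_element_located((By.CLASS_NAME, "card__title")),
        )

    def parse(self, response):
        # response now holds the JS-rendered HTML, so the usual XPath works
        for name in response.xpath('//h1[@class="card__title"]/a/text()').getall():
            yield {"country": name}

With this approach the browser handling lives in the middleware, so the spider itself stays free of webdriver code.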

I hope these tutorials on collecting data through web scraping, using Scrapy, Selenium and a combination of both, have helped you understand the process and given you the confidence to collect data from webpages on your own.

Until next time, good luck. Stay safe and happy learning!
