Scraping a Dynamic Website, It’s Selenium

Irfan Ahmad · Published in The Startup · Jan 24, 2021

This is Part-I of a series about Dynamic Web Scraping.

This story introduces dynamic websites and describes my first approach to scraping one. Let’s begin with the introduction to dynamic websites.

Dynamic Websites

Dynamic websites produce their content in response to user actions. For example, when a webpage only finishes loading as you scroll down or move the mouse across the screen, there is dynamic programming behind it. When hovering the mouse pointer over some text pops up extra options, that is dynamic behaviour too. One such website is here. Here is a very good and detailed article about dynamic webpages.
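
To see the difference in practice, compare the HTML a plain HTTP request returns with the page source a browser produces after running the page’s JavaScript. A minimal sketch, assuming the requests package is installed (doordash.com is only used as the example from this article):

#sketch: raw HTML from a plain request vs. HTML rendered by a browser
import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = 'https://www.doordash.com/en-US'
raw_html = requests.get(url).text #HTML exactly as the server sends it

opts = Options()
opts.add_argument('--headless')
driver = webdriver.Chrome('chromedriver', options=opts)
driver.get(url)
rendered_html = driver.page_source #HTML after the JavaScript has run
driver.quit()

print(len(raw_html), 'bytes raw vs', len(rendered_html), 'bytes rendered')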

Scraping a Dynamic Website

You can find many articles on the internet about scraping dynamic websites. This article walks through my approach to scraping doordash.com, step by step.

A necessary condition for scraping dynamic web pages is letting their JavaScript run in a browser. This is done with a headless browser (explained shortly).

My target was to scrape 50k+ menus from doordash.com.

[Remember that Python is case-sensitive.]

Let’s start coding by importing the necessary libraries, plus a few accessory libraries we may need. As the title indicates, I am going to use the Selenium library (more about Selenium will follow in a separate article).

#importing required libraries
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.remote.webelement import WebElement
from selenium_move_cursor.MouseActions import move_to_element_chrome #third-party helper to move the cursor to an element
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
import json
import numpy as np
import time
import pandas as pd #to save CSV file
from bs4 import BeautifulSoup
import ctypes #to create text popup

(I will explain each module in a separate article.)

The “webdriver” module of Selenium is the most important one because it controls the browser. Each browser needs its own driver, such as “chromedriver” for Google Chrome. I am going to use “chromedriver”, and we need to tell “webdriver” where to find it.

Let’s define this browser for “webdriver” and set its options to ‘--headless’.

#defining browser and adding the "--headless" argument
opts = Options()
opts.add_argument('--headless')
driver = webdriver.Chrome('chromedriver', options=opts)

This “--headless” argument runs Chrome without opening a visible window; the page’s JavaScript still executes in the browser, which is what dynamic pages need.

Here are the URL and the code to open the URL with the “webdriver”.

url = 'https://www.doordash.com/en-US'
driver.maximize_window() #maximize the window
driver.get(url) #open the URL
driver.implicitly_wait(220) #maximum time to wait for elements to appear

I put chromedriver in the project directory to keep the path simple. Alternatively, a full path can be passed in place of “chromedriver”, built with the “os” module, as sketched below.
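
If chromedriver lives somewhere else, here is a hedged sketch of passing an explicit path built with the “os” module (the “drivers” folder is only an example, and the call matches the Selenium 3 style used in this article):

import os
#example only: build an absolute path to chromedriver instead of relying on the working directory
driver_path = os.path.join(os.getcwd(), 'drivers', 'chromedriver')
driver = webdriver.Chrome(driver_path, options=opts)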

First approach:

I took an overview of doordash.com to understand where our results, i.e. menus, are located and how they can be accessed.

This script will

1- open the browser

#defining browser and adding the "--headless" argument
opts = Options()
opts.add_argument('--headless')
driver = webdriver.Chrome('chromedriver', options=opts)

2- open the URL (doordash.com)

url = 'https://www.doordash.com/en-US'
driver.maximize_window() #maximize the window
driver.get(url) #open the URL
driver.implicitly_wait(220) #maximum time to wait for elements to appear

3- scroll down to load the whole page

driver.execute_script("window.scrollTo(0, document.body.scrollHeight,)")

4- navigate to “Top Cuisines Near You”

5- click on “Pizza Near Me” (I assume this will be enough for 50k+ menus)

time.sleep(5)
element = driver.find_element_by_xpath('//h2[text()="Top Cuisines Near You"]').find_element_by_xpath('//a[@class="sc-hrWEMg fFHnHa"]')
time.sleep(5)
element.click()
driver.implicitly_wait(220)
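
The fixed time.sleep calls work, but the explicit-wait helpers imported earlier (WebDriverWait and expected_conditions) can make this step less fragile. A sketch, assuming the same heading text and link class as above:

#sketch: wait for the heading and a clickable link instead of sleeping for a fixed time
wait = WebDriverWait(driver, 30)
wait.until(EC.presence_of_element_located((By.XPATH, '//h2[text()="Top Cuisines Near You"]')))
link = wait.until(EC.element_to_be_clickable((By.XPATH, '//a[@class="sc-hrWEMg fFHnHa"]')))
link.click()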

6- load the page and get the range of result pages

#define the lists
names = []
prices = []
#extract the number of pages for the searched product
driver.implicitly_wait(120)
time.sleep(3)
result = driver.page_source
soup = BeautifulSoup(result, 'html.parser')
page = list(soup.findAll('div', class_="sc-cvbbAY htjLED"))
start = int(page[2].text)
print('1st page:',start)
last = int(page[-2].text)
final = last +1
print('last page:',final)
#getting numbers out of string of pages
print(f'first page:{start}, and last page with + 1: {final}')

7- click on each store (the page has set the default location of New York, therefore no need to worry about location)

#set the page range and
#loop over all the store pages
for i in range(start, final, 1):
    time.sleep(7)
    #find the number of stores per page
    list_length = len(driver.find_elements_by_xpath("//div[@class='StoreCard_root___1p3uN']"))
    products_per_page = list_length+1
    #loop through the menus of each store on a page
    for x in range(0, list_length, 1):
        time.sleep(7)
        driver.execute_script("window.scrollTo({top:75, behavior:'smooth'})")
        store_name = driver.find_elements_by_xpath('//div[@class="StoreCard_storeDetail___3C0TX"]')
        strnm = store_name[x]
        print(f'{x}- ', strnm.text)
        time.sleep(4)
        element = driver.find_elements_by_xpath("//div[@class='StoreCard_storeDetail___3C0TX']")
        click = element[x]
        move_to_element_chrome(driver, click, display_scaling=100)
        time.sleep(7)
        click.click()
        driver.implicitly_wait(360)

8- scrape menus and return to the page of stores after scraping

time.sleep(20)
result = driver.page_source
time.sleep(11)
soup = BeautifulSoup(result, 'html.parser')
div = soup.find('div', class_="sc-jwJjzT kjdEnq")
if div is not None:
    time.sleep(25)
    #use a different loop variable than the outer page counter 'i'
    for item in div.findAll('div', class_="sc-htpNat Ieerz"):
        pros = item.find('div', class_="sc-jEdsij hukZqW")
        print('writing (', pros.text, ') to disk')
        names.append(pros.text)
        rates = item.find('span', class_="sc-bdVaJa eEdxFA")
        #if there is no price for the food, append 'N/A' to the list of prices
        if rates is not None:
            print('price: ', rates.text)
            rate = rates.text
        else:
            print('N/A')
            rate = 'N/A'
        prices.append(rate)
driver.back()

9- check the number of menus in the list of names

length = len(names)

Break the loop once about 10,000 menus have been collected and show a popup to inform us; otherwise, repeat the loop.

#if the menu record reaches the target, exit the script and show a completion message box
if ((length > 10000) and (length < 10050)):
    ctypes.windll.user32.MessageBoxW(0, f"Congratulations! We have successfully scraped {length} menus.", "Project Completion", 1)
    break
else:
    driver.back()
    continue

10- the whole process keeps looping until we get about 10,000 menus.

11- if the 10,000 target is not reached after scraping all the stores on one page, click the “next” button and keep scraping

#after scraping each store on a page, announce that the script is moving to the next page
print(f'Now moving to page number {i}')
#click the next page button
driver.find_elements_by_xpath('//div[@class="sc-gGBfsJ jFaVNA"]')[1].click()

12- save the results as a CSV file.

#save to a dataframe
df = pd.DataFrame({'Name':names, 'Price':prices})
#export as a CSV file
df.to_csv('doordash_menues.csv')
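
A quick way to confirm the export is to read the file back and check its size:

#sketch: read the exported CSV back to confirm the number of rows
check = pd.read_csv('doordash_menues.csv')
print(check.shape)
print(check.head())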

This approach has worked for me. Other approaches will follow in later parts of this series.

Irfan Ahmad is a freelance Python programmer, web developer and web scraper, and a data science and bioinformatics student.