How to Master Web Scraping of Complex JavaScript Websites with Scrapy and Selenium in Python

Rexsy Bima Trima Wahyu
8 min read · Sep 3, 2023


Introduction

Web scraping is a programming skill whose main purpose is to extract data from a website. It is closely related to data entry, except that the collection is automated. After scraping, we save the data to a local file or a database, typically in a format like CSV or Excel. In this article we will use Python with Scrapy, Selenium, and BeautifulSoup4 to scrape the products on NIKE Indonesia.

NIKE Indonesia is a JavaScript-heavy website, which makes it harder to scrape than a static page. The site uses "infinite scrolling": it keeps loading new product data as you scroll down the page.

How scraping works

I propose that there are 3 main steps in the process of web scraping:

1. Understanding how the website works

Nike Indonesia employs “infinite scrolling,” where new product data loads continuously as you scroll down. To scrape efficiently, we must first load all products.

2. Fetching the data

Since Nike Indonesia heavily relies on JavaScript, we will use Selenium to save the fully scrolled data as a local HTML file for later scraping by Scrapy.

3. Extracting the data

After obtaining the local HTML files with Selenium, Scrapy will be used to parse them and extract the desired information. The scraped data can then be saved in a format such as JSON or CSV; here we will save it as CSV.

What data we will scrape

If we take a look at the front page, we can see the catalogue is already divided into Men, Women and Kids. Here we will scrape every catalogue: all shoes, clothing, accessories and equipment.

Now, if we take a look at one of the product pages:

We will scrape the product name, category, description, colour, style, price, image, and the URL of the product itself.

Execution

We will need Selenium to deal with NIKE Indonesia's infinite scrolling and to save the fully scrolled page as a local HTML file:


from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time, os
from bs4 import BeautifulSoup


class Scroller:
    def __init__(self):
        options = Options()
        options.add_argument("--start-maximized")
        self.driver = webdriver.Chrome(options=options)

    def scroll(self, url):
        target_url = url
        self.driver.get(target_url)
        print(self.driver.execute_script("return navigator.userAgent"))
        time.sleep(5)
        last_height = self.driver.execute_script("return document.body.scrollHeight")

        while True:
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(5)  # or adjust to a higher value if the page needs longer to load

            new_height = self.driver.execute_script("return document.body.scrollHeight")
            if new_height == last_height:
                break
            last_height = new_height

    def save_html(self, webname):
        time.sleep(10)
        root_folder = os.path.abspath(os.path.join(os.path.dirname(__file__), "../../../../"))
        folder_path = os.path.join(root_folder, ".temp", "html_output")
        os.makedirs(folder_path, exist_ok=True)
        format_file = os.path.join(folder_path, f"{webname}_NIKE.html")
        with open(format_file, 'w', encoding="utf-8") as f:
            soup = BeautifulSoup(self.driver.page_source, 'html.parser')
            f.write(soup.prettify())

    def close(self):
        self.driver.close()

    def quit(self):
        self.driver.quit()


urls = {
    'men_shoes': 'https://www.nike.com/id/w/mens-shoes-nik1zy7ok',
    'men_clothing': 'https://www.nike.com/id/w/mens-clothing-6ymx6znik1',
    'men_accessories_equipment': 'https://www.nike.com/id/w/mens-accessories-equipment-awwpwznik1',
    'women_shoes': 'https://www.nike.com/id/w/womens-shoes-5e1x6zy7ok',  # note: only ~260 of 420 products loaded for this page
    'women_clothing': 'https://www.nike.com/id/w/womens-clothing-5e1x6z6ymx6',
    'women_accessories_equipment': 'https://www.nike.com/id/w/womens-accessories-equipment-5e1x6zawwpw',
    'kids_boys_clothing': "https://www.nike.com/id/w/boys-clothing-4413nz6ymx6",
    'kids_boys_shoes': "https://www.nike.com/id/w/boys-shoes-4413nzy7ok",
    "kids_girls_clothing": "https://www.nike.com/id/w/girls-clothing-6bnmbz6ymx6",
    "kids_girls_shoes": "https://www.nike.com/id/w/girls-shoes-6bnmbzy7ok",
    'kids_accessories_equipment': 'https://www.nike.com/id/w/kids-accessories-equipment-awwpwzv4dh'
}

if __name__ == "__main__":
    scroller = Scroller()
    for webname, url in urls.items():
        scroller.scroll(url=url)
        scroller.save_html(webname)  # the file is saved as "<webname>_NIKE.html"
        print(f"success accessing NIKE {webname} at URL : {url}")
    scroller.quit()
    # scroller.close()

The reason we need options.add_argument("--start-maximized") is that NIKE Indonesia will not trigger the infinite scrolling unless the Chrome webdriver window is maximized.

Now, we can track the infinitely loading page by first recording its height:

last_height = self.driver.execute_script("return document.body.scrollHeight")

This stores the current height of the page in the variable 'last_height'; we then use a while True loop to check whether the page has been fully scrolled:

while True:
    self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)  # or adjust to a higher value if the page needs longer to load

    new_height = self.driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

The reason for 'time.sleep(5)' is the internet connection: sometimes we need to wait a moment for the next batch of infinitely scrolled data to be fetched. The 5 stands for 5 seconds.
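If a fixed sleep turns out to be too rigid, one alternative (not part of the original script, just a sketch) is to wait explicitly until the page height grows, using Selenium's WebDriverWait with a timeout as the stopping condition:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException

def scroll_until_stable(driver, timeout=15):
    # Keep scrolling until the page height stops growing within `timeout` seconds
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        try:
            # Wait until newly loaded products push the height past its previous value
            WebDriverWait(driver, timeout).until(
                lambda d: d.execute_script("return document.body.scrollHeight") > last_height
            )
        except TimeoutException:
            # Height did not grow within the timeout, assume everything has loaded
            break
        last_height = driver.execute_script("return document.body.scrollHeight")

The same idea could replace the fixed sleep inside Scroller.scroll if your connection speed varies a lot.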

To save the HTML file, we use beautifulsoup4 to prettify the markup before writing it to disk.

Now, with Scrapy, we can open those HTML files with this configuration:

import scrapy
import os
import subprocess
import logging
from pathlib import Path

# Run the Selenium scroller first so the local HTML files exist before the spider starts
subprocess.run(['python', 'browser.py'])


def get_html_folder_files():
    root_folder = os.path.abspath(os.path.join(os.path.dirname(__file__), "../../../../"))
    folder_path = os.path.join(root_folder, ".temp", "html_output")
    return folder_path


def get_files_in_folder(folder_path):
    file_list = []

    # List all files and directories in the given folder path
    entries = os.listdir(folder_path)

    for entry in entries:
        # Check if the entry is a file (not a directory)
        full_path = os.path.join(folder_path, entry)
        if os.path.isfile(full_path):
            # Build a file:// URI so Scrapy can open the local HTML file
            file_list.append(Path(full_path).as_uri())

    return file_list


class NikespiderSpider(scrapy.Spider):
    name = "nikespider"
    allowed_domains = ["www.nike.com"]
    start_urls = get_files_in_folder(folder_path=get_html_folder_files())

The end goal of this code is to hand Scrapy a list of local HTML file URIs, so the spider can open them one after another.

Now, let's go back to one of the products and try to scrape its data with Scrapy.

After setting up Scrapy, we can fetch the page from the terminal using the Scrapy shell:

scrapy shell
fetch('https://www.nike.com/id/w/mens-shoes-nik1zy7ok')

If the fetch is successful, it returns data like this:
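Since the spider will actually crawl the locally saved HTML files rather than the live site, we can also point the Scrapy shell at one of those files. The path below is only an example; use the actual location of your .temp/html_output folder:

scrapy shell
fetch('file:///absolute/path/to/.temp/html_output/men_shoes_NIKE.html')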

If we inspect the product name and its category, they are located in an h1 tag with the class headline-2 and an h2 tag with the class headline-5, respectively.

To scrape them, we can use Scrapy's response.css: response.css('h1.headline-2 ::text').get() for the product's name, and response.css('h2.headline-5 ::text').get() for the product's category.
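For reference, this is how those two selectors look in the Scrapy shell (the strings returned will depend on which product page you fetched):

response.css('h1.headline-2 ::text').get()   # product name
response.css('h2.headline-5 ::text').get()   # product category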

If we highlight the price using Chrome's inspector, the element shows up in a div tag with the class product-price.

Now we can check the price of the product with response.css('div.product-price::text').get(), but it returns 'Rp\xa01,549,000'. To fix this, we can use .replace to get rid of the 'Rp\xa0' prefix and the commas, and int to turn the string into an integer.
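The final spider delegates this to a self.parse_price helper whose body is not shown here, so this is a minimal sketch based on the cleanup just described:

def parse_price(self, response):
    # Raw value looks like 'Rp\xa01,549,000'; strip the currency prefix and the commas
    raw_price = response.css('div.product-price::text').get()
    if raw_price is None:
        return None
    return int(raw_price.replace('Rp\xa0', '').replace(',', ''))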

Next, we will scrape the colour information. As we can see, it is located in an li tag with the class description-preview__color-description.

Now, with Scrapy, we can check it again with response.css.

As we can see, if we use the selector as is, the returned string still contains the 'Colour Shown: ' prefix. To get rid of it, we can save the value into a variable and use .replace to strip 'Colour Shown: ' out.
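The spider does this inside a self.parse_colour helper; its body is not shown here either, but a sketch consistent with the description above could be:

def parse_colour(self, response):
    # The raw string starts with 'Colour Shown: '; drop that label and keep the colours
    raw_colour = response.css('li.description-preview__color-description::text').get()
    if raw_colour is None:
        return None
    return raw_colour.replace('Colour Shown: ', '').strip()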

Now we will scrape the image link. If we take a look here:

The image data is located in an img tag, in its src attribute. To deal with this we can use response.css combined with xpath from Scrapy.

Unfortunately, this returns a list of links, so the current solution is to check each link and find the best-quality image. It turns out that list index 4 has the best quality, so that is the one we will scrape.
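Wrapped into the self.parse_img_url helper, that logic might look like this (a sketch: the img selector is an assumption, since the exact expression is not reproduced in the text, and index 4 reflects the page layout at the time of writing):

def parse_img_url(self, response):
    # Collect every img src on the page and keep index 4, which had the best quality
    img_urls = response.css('img').xpath('@src').getall()
    return img_urls[4] if len(img_urls) > 4 else None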

Next, we will scrape an ID so that each product has its own unique identifier after scraping; we can get it from the product's style code.

As you can see, the style is located in an li tag with the class description-preview__style-color.

With Scrapy, we can parse that using response.css; to get rid of the leftover 'Style: ' prefix, we can use Python's .replace again.
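A matching sketch for the self.parse_id helper, which strips the 'Style: ' label and keeps only the style code:

def parse_id(self, response):
    # The raw string starts with 'Style: '; remove the label so only the code remains
    raw_style = response.css('li.description-preview__style-color::text').get()
    if raw_style is None:
        return None
    return raw_style.replace('Style: ', '').strip()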

To get the URL of the product itself, we can simply use response.url from Scrapy.

Now that scraping a single product is finished, the next step is figuring out the URLs of all products from the HTML files saved by Selenium.

We will take men's shoes as the example. If we inspect the page, we can see that each product in the catalogue is located in a div tag with the class product-card.

Now, if we access it in Scrapy with response.css('div.product-card'), it returns a list like this:

It turns out that the href of each product sits inside div.product-card, specifically in an 'a' tag with the class product-card__link-overlay. We can follow each URL with response.follow, using a callback to the product-page parsing we built earlier:

def parse(self, response):
    products = response.css('div.product-card')
    for product in products:
        relative_url = product.css('a.product-card__link-overlay::attr(href)').get()
        yield response.follow(relative_url, callback=self.parse_product_page)

Inside self.parse_product_page, the final parsing should look like this:

def parse_product_page(self, response):
    output = {
        'id': self.parse_id(response=response),
        'title': response.css('h1.headline-2 ::text').get(),
        'category': response.css('h2.headline-5 ::text').get(),
        'price (RP)': self.parse_price(response=response),
        'description': response.css('div.description-preview ::text').get(),
        'colour': self.parse_colour(response=response),
        'url': response.url,
        'img_url': self.parse_img_url(response=response)
    }
    yield output

Now, if we run the spider and save the output to CSV, the result looks something like this:

scrapy crawl nikespider -O output.csv

Conclusion

Finally, if we set up Scrapy correctly, the output of the program contains each product's name, category, price, colour, description, image link, and URL. I hope this article helps others who are focused on Python web scraping.
