Web Scraping with Selenium in Python — Amazon Search Result (Part 1)

Ranchana Kiriyapong
9 min read · Sep 13, 2021


Photo by Nathan da Silva on Unsplash

Updated: 15/05/2022

Every time I want to buy something, I spend a lot of time scrolling through many pages to find a product I want at a reasonable price with good ratings. What if we let the computer collect this data for us?

There are several web-scraping libraries in Python, such as BeautifulSoup, Selenium, and Scrapy. In this article I am going to use Selenium, as it can act like a real user: opening a browser, typing a keyword into a search box, and then clicking to get the results.

Outline:

  1. Install Selenium and download a web driver
  2. Access the Amazon website
  3. Specify WebElement(s)
  4. Extract the data from WebElement(s)

1. Install Selenium and Download a Web Driver

To install Selenium, open your command-line interface and run:

pip install selenium

What is a web driver? A web driver is a tool that drives the browser of your choice (Chrome, Firefox, Edge, or Safari). Whichever browser you want to use, you need the web driver specific to it. Moreover, the driver must be compatible with both your browser version and your operating system.

Previously, we had to download a web driver, store it on our machine, and update it whenever the browser version changed. Luckily, Python now has a library called webdriver-manager that fetches the right web driver automatically when you run your automation code.

Traditional Way (You can skip this!)

In the past, I had to download the driver myself. Let's say I use Chrome: I would download the ChromeDriver matching my setup from its official download page.

Check your Chrome browser version by clicking the three vertical dots at the top-right corner > Help > About Google Chrome. Then download the web driver for your Chrome version.

For other browsers, you can find the download links on this page: https://pypi.org/project/selenium/

After downloading the web driver you want, extract it and save the driver in a folder whose path you know.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# assign the driver path
driver_path = 'YOUR_DRIVER_PATH'
# create a driver object, passing the driver path to a Service
driver = webdriver.Chrome(service=Service(executable_path=driver_path))

New version

Install webdriver-manager with pip install webdriver-manager (note the hyphen).

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
# import from webdriver_manager (note the underscore)
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

Note: You can use any browser you prefer; webdriver-manager supports the other major browsers as well. A Firefox example is sketched below.
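
For instance, here is a minimal sketch for Firefox instead of Chrome (assuming you have Firefox installed):

from selenium import webdriver
from selenium.webdriver.firefox.service import Service as FirefoxService
from webdriver_manager.firefox import GeckoDriverManager

# webdriver-manager downloads the geckodriver matching your Firefox version
driver = webdriver.Firefox(service=FirefoxService(GeckoDriverManager().install()))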

2. Access the Amazon website

Once you are ready, let's get started!

Option 1: Open a browser normally

# assign your website to scrape
web = 'https://www.amazon.com'
driver.get(web)

# keep this line of code at the bottom
driver.quit()

If you run the snippet above, the web driver will actually open a browser window by itself. However, if you want the driver to run in the background, we can use different settings.

Option 2: Open a browser in the background

Before creating the driver, add ChromeOptions:

options = webdriver.ChromeOptions()
options.add_argument('--headless')
# create a driver object with the options
driver = webdriver.Chrome(options=options, service=Service(ChromeDriverManager().install()))

# assign your website to scrape
web = 'https://www.amazon.com'
driver.get(web)
driver.quit()

In option 2, we set the Chrome web driver's options to include --headless, which opens Chrome in headless mode. When scraping in this mode, the Chrome browser works in the background. There are also other options, such as opening in incognito mode or starting maximized, as sketched below.
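
A short sketch of some other commonly used ChromeOptions arguments (these are standard Chrome command-line switches):

options = webdriver.ChromeOptions()
options.add_argument('--headless')         # run Chrome in the background
options.add_argument('--incognito')        # open in incognito mode
options.add_argument('--start-maximized')  # start with a maximized window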

3. Specify WebElement(s)

Imagine you are browsing the Amazon website; what do you do next? To search for a product, you click the search box, type a keyword, and then click the search button. Selenium can do these things as well, but we need to tell it which part of the page to select, that is, which HTML element in the DOM (Document Object Model), by creating an object called a WebElement. To find elements, Selenium provides many selectors:

  • ID
  • XPATH
  • TAG_NAME
  • CLASS_NAME
  • CSS_SELECTOR, etc.

We will use these selectors together with two main methods: find_element and find_elements.

Note: find_element matches the first WebElement it finds on the page, while find_elements finds every element matching the selector and returns them as a list. (The difference is the "s" at the end of "elements".) The sketch below shows the difference.
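
A minimal sketch of the difference, using the ids and tags that appear later in this article:

from selenium.webdriver.common.by import By

first_match = driver.find_element(By.ID, 'twotabsearchtextbox')  # a single WebElement; raises NoSuchElementException if absent
all_matches = driver.find_elements(By.TAG_NAME, 'span')          # a list of WebElements; an empty list if none are found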

In order to select the search box at the top of the page, right-click the box and select Inspect to see the HTML code.

(Screenshots: right-click the search box and click Inspect to view the element's HTML.)

Looking at the HTML, we find that the id of the search box is "twotabsearchtextbox". Doing the same for the search button gives the id "nav-search-submit-button". Now we can tell Selenium how to locate these elements.

# import the By class for locating elements
from selenium.webdriver.common.by import By

# assign any keyword for searching
keyword = "wireless charger"
# create a WebElement for the search box
search_box = driver.find_element(By.ID, 'twotabsearchtextbox')
# type the keyword into the search box
search_box.send_keys(keyword)
# create a WebElement for the search button
search_button = driver.find_element(By.ID, 'nav-search-submit-button')
# click the search button
search_button.click()
# wait for the page to load
driver.implicitly_wait(5)
# quit the driver after finishing scraping (please keep this line at the bottom)
driver.quit()

If you chose option 1, you will now see the Amazon page with the search results for your keyword.

Next, we are going to select all the item details (name, ASIN, price, ratings, and link) from each result.

Before that, we will create empty lists to hold the data we'd like to scrape.

product_name = []
product_asin = []
product_price = []
product_ratings = []
product_ratings_num = []
product_link = []

After inspecting the HTML, you will see that all the items have the classes "s-result-item s-asin …". Thus, we create a list of WebElements called "items" using find_elements(By.XPATH, ...).

Shortcuts for XPath:

  1. XPath is written like a general path, using a forward slash / to navigate through elements in the HTML.

  2. A basic XPath starts with a double forward slash // followed by a tag name with attribute value(s). It means "find a tag whose specified attribute equals the given value(s)". For example, you can scrape a page's header by using an XPath such as '//div[@class="title"]/h1'.

  3. You can access a child node using a forward slash, or skip to a node nested deeper by using a double forward slash //. Following the example above, you can create a WebElement for the header's inner <span> by using the XPath '//div[@class="title"]/h1/span' or '//div[@class="title"]//span'.

  4. You can use contains to find elements whose attribute includes the given value. Since the wanted elements may be complex, contains can identify them using only a part of the attribute; see the sketch after this list.
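
To make items 2-4 concrete, here is a small hedged sketch (the class names "title" and "AdHolder" follow the examples above; any page with a matching structure would work):

# child via a single slash: the <h1> directly under <div class="title">
header = driver.find_element(By.XPATH, '//div[@class="title"]/h1')

# nested via a double slash: any <span> anywhere inside <div class="title">
inner_span = driver.find_element(By.XPATH, '//div[@class="title"]//span')

# partial match via contains(): elements whose class attribute includes "AdHolder"
sponsored = driver.find_elements(By.XPATH, '//div[contains(@class, "AdHolder")]')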

We know that the elements we want all share several classes, and some items additionally have the class AdHolder (i.e., featured or sponsored items) while others do not. Using contains in the XPath makes it easier to identify all the products on the page. Therefore, we create the WebElements using the XPath '//div[contains(@class, "s-result-item s-asin")]'.

items = driver.find_elements(By.XPATH, '//div[contains(@class, "s-result-item s-asin")]')

However, since the required elements may be loaded dynamically and take a varying amount of time to appear, running this code may raise a NoSuchElementException because the elements we want are not present yet. In this case we need an explicit wait, as follows:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

items = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.XPATH, '//div[contains(@class, "s-result-item s-asin")]'))
)

This code means that Selenium will wait a maximum of 10 seconds until it can locate all the elements we want, and then store those WebElements in the variable "items". Otherwise, it raises a TimeoutException. There are many expected conditions (EC) we can use depending on the circumstances; see the Selenium documentation for more. For now, you have all the items on the first page of search results.
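
As an illustration, another commonly used expected condition is element_to_be_clickable, which waits until an element is both present and clickable (a sketch reusing the search-box id from earlier):

search_box = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.ID, 'twotabsearchtextbox'))
)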

4. Extract the data from WebElement(s)

To get data from elements, there are two common approaches:

  1. Get the text within the element(s)
  2. Get the attribute’s value

To get the data of each item, such as the product name, ASIN number, or price, we iterate over the items in a for loop and then extract the data.

Finding a product name

The first piece of data we want to scrape is the product name, which is located in a <span> tag with the class "a-size-medium a-color-base a-text-normal". The property that returns the text inside a WebElement is .text.

for item in items:
    name = item.find_element(By.XPATH, './/span[@class="a-size-medium a-color-base a-text-normal"]')
    product_name.append(name.text)

Note: This time we call find_element on the WebElement "item" we already acquired, not on the "driver", meaning that we are looking for sub-elements of that WebElement. Hence, we start the XPath with .// instead of // to tell Selenium to search only within that item. Otherwise, Selenium would search the entire page.

Finding a product’s ASIN number

The ASIN is the unique number used to identify a product on Amazon. In the HTML document, it is the value of the data-asin attribute of each item. To get an attribute's value, we can use the method .get_attribute("ATTR_NAME").

for item in items:
    name = item.find_element(By.XPATH, './/span[@class="a-size-medium a-color-base a-text-normal"]')
    product_name.append(name.text)
    data_asin = item.get_attribute("data-asin")
    product_asin.append(data_asin)

# the following print statements are for checking that we correctly scraped the data we want
print(product_name)
print(product_asin)

Finding Price

(Screenshot: an item that has no price.)

Not every product has a price, which raises a small issue with the function .find_element. As mentioned before, find_element returns a single WebElement, while find_elements returns a list. Since some products have no price, if Selenium cannot locate the price element within an item, find_element will throw an error, whereas find_elements will simply return an empty list. For the price, we will scrape the whole price and the fraction price, which are the numbers before and after the decimal point, respectively.

    # find prices (inside the for loop)
    whole_price = item.find_elements(By.XPATH, './/span[@class="a-price-whole"]')
    fraction_price = item.find_elements(By.XPATH, './/span[@class="a-price-fraction"]')
    if whole_price != [] and fraction_price != []:
        price = '.'.join([whole_price[0].text, fraction_price[0].text])
    else:
        price = 0
    product_price.append(price)

Finding ratings and ratings number

Personally, I prefer to buy products with ratings above 4. Moreover, the number of ratings shows me which products most customers decided to buy and how many buyers the rating is based on.

In the item's <div> tag, there are two <span> tags. To scrape the ratings and the number of ratings, we create a WebElement list called ratings_box that grabs those two <span> tags, and then get the aria-label attribute from each <span>, as follows:

    # find the ratings box (inside the for loop)
    ratings_box = item.find_elements(By.XPATH, './/div[@class="a-row a-size-small"]/span')
    if ratings_box != []:
        ratings = ratings_box[0].get_attribute('aria-label')
        ratings_num = ratings_box[1].get_attribute('aria-label')
    else:
        ratings, ratings_num = 0, 0
    product_ratings.append(ratings)
    product_ratings_num.append(str(ratings_num))
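
If you want numeric values instead of raw strings, the aria-label values typically look like "4.5 out of 5 stars" and "1,234" (this format is an assumption about Amazon's current markup), so a small hedged helper could parse them:

def parse_rating(label):
    # assumes labels like "4.5 out of 5 stars" or "1,234"; returns 0.0 if the format differs
    try:
        return float(str(label).split()[0].replace(',', ''))
    except (ValueError, IndexError):
        return 0.0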

Finding the details link

Finally, we might want to see a product's information in detail. Scraping each product's specific link gives us a way to read more about it. You can find the <a> tag containing the product link by inspecting the product name or the product image.

    # find the product link (inside the for loop)
    link = item.find_element(By.XPATH, './/a[@class="a-link-normal a-text-normal"]').get_attribute("href")
    product_link.append(link)

driver.quit()

All my code:
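
The following assembles the snippets above into one runnable script (headless mode with webdriver-manager, as introduced earlier); the ids and class names are the ones inspected in this article and may change as Amazon updates its markup:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager

# open Chrome in headless mode
options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options, service=Service(ChromeDriverManager().install()))

# search for the keyword on Amazon
driver.get('https://www.amazon.com')
keyword = "wireless charger"
driver.find_element(By.ID, 'twotabsearchtextbox').send_keys(keyword)
driver.find_element(By.ID, 'nav-search-submit-button').click()

# wait up to 10 seconds for the result items to be present
items = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located(
    (By.XPATH, '//div[contains(@class, "s-result-item s-asin")]')))

product_name, product_asin, product_price = [], [], []
product_ratings, product_ratings_num, product_link = [], [], []

for item in items:
    # product name and ASIN
    name = item.find_element(By.XPATH, './/span[@class="a-size-medium a-color-base a-text-normal"]')
    product_name.append(name.text)
    product_asin.append(item.get_attribute("data-asin"))

    # price (0 if the item has no price)
    whole_price = item.find_elements(By.XPATH, './/span[@class="a-price-whole"]')
    fraction_price = item.find_elements(By.XPATH, './/span[@class="a-price-fraction"]')
    if whole_price != [] and fraction_price != []:
        price = '.'.join([whole_price[0].text, fraction_price[0].text])
    else:
        price = 0
    product_price.append(price)

    # ratings and number of ratings (0 if the item has none)
    ratings_box = item.find_elements(By.XPATH, './/div[@class="a-row a-size-small"]/span')
    if ratings_box != []:
        ratings = ratings_box[0].get_attribute('aria-label')
        ratings_num = ratings_box[1].get_attribute('aria-label')
    else:
        ratings, ratings_num = 0, 0
    product_ratings.append(ratings)
    product_ratings_num.append(str(ratings_num))

    # product details link
    link = item.find_element(By.XPATH, './/a[@class="a-link-normal a-text-normal"]').get_attribute("href")
    product_link.append(link)

driver.quit()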

Thanks for reading! I hope this tutorial is useful. The upcoming part 2 will scrape through multiple pages and store the data in an SQLite3 database.

If you enjoy this blog, please consider following me on Medium, and feel free to reach out or leave any recommendations.
