Web Scraping with Selenium in Python — Amazon Search Result (Part 2)
Scrape multiple pages and store the data in a database
In part 1, we scraped all the items in a search result on Amazon.com and stored them in lists, from which you could create a dataframe and save the data to a CSV file. In this part, we will scrape through multiple pages and store the results in a relational database using sqlite3.
The overall web-scraping process in this tutorial is shown in the diagram. After we scrape the items' details on the first page, we will store them in an sqlite3 database. For that purpose, the function store_db
was created.
import sqlite3

def store_db(product_asin, product_name, product_price, product_ratings, product_ratings_num, product_link):
    conn = sqlite3.connect('amazon_search.db')
    curr = conn.cursor()
    # create the table if it does not exist yet
    curr.execute('''CREATE TABLE IF NOT EXISTS search_result (ASIN text, name text, price real, ratings text, ratings_num text, details_link text)''')
    # insert one row per scraped item
    curr.executemany("INSERT INTO search_result (ASIN, name, price, ratings, ratings_num, details_link) VALUES (?,?,?,?,?,?)",
                     list(zip(product_asin, product_name, product_price, product_ratings, product_ratings_num, product_link)))
    conn.commit()
    conn.close()
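Before wiring store_db into the scraper, you can sanity-check it with small hand-made lists and read the rows back. The sketch below uses the same schema and insert logic, but makes the database path a parameter so it can point at a scratch file instead of your real amazon_search.db (the sample items and the test_search.db name are made up for illustration):

```python
import os
import sqlite3
import tempfile

def store_db(db_path, product_asin, product_name, product_price,
             product_ratings, product_ratings_num, product_link):
    # same logic as the tutorial's store_db, with the database path
    # made a parameter so we can use a throwaway file here
    conn = sqlite3.connect(db_path)
    curr = conn.cursor()
    curr.execute('''CREATE TABLE IF NOT EXISTS search_result
                    (ASIN text, name text, price real,
                     ratings text, ratings_num text, details_link text)''')
    curr.executemany(
        "INSERT INTO search_result VALUES (?,?,?,?,?,?)",
        list(zip(product_asin, product_name, product_price,
                 product_ratings, product_ratings_num, product_link)))
    conn.commit()
    conn.close()

# two fake items, shaped like the lists the scraper builds
db_path = os.path.join(tempfile.mkdtemp(), 'test_search.db')
store_db(db_path,
         ['B001', 'B002'], ['Charger A', 'Charger B'],
         [19.99, 25.5], ['4.5', '4.0'], ['1,200', '310'],
         ['https://example.com/a', 'https://example.com/b'])

conn = sqlite3.connect(db_path)
rows = conn.execute(
    'SELECT ASIN, name, price FROM search_result ORDER BY ASIN').fetchall()
conn.close()
print(rows)  # → [('B001', 'Charger A', 19.99), ('B002', 'Charger B', 25.5)]
```

Note that zip pairs the parallel lists into row tuples, which is exactly the shape executemany expects.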
Next, we need to find the pagination button next to our current page number in order to get the link to the next page.
Find the next_page element
At the bottom of the first page, we inspect the page 2 button as below.
As we are now at the <li class="a-selected"> element, the next page button will be its following sibling, which is also an <li> tag, and the link to the next page is stored as the value of the href attribute of its child <a> tag. Our XPath for the next_page element will be '//li[@class="a-selected"]/following-sibling::li/a'.
Continuing from part 1, after scraping a page we can call the store_db
function with the data lists as arguments. Then, we have Selenium find the next-page link. However, since we are scraping multiple pages, the next_page
link changes every time we scrape a page. Therefore, we need to define a global variable called next_page
and initialize it to an empty string at the beginning of the code.
next_page = ''
driver = webdriver.Chrome(options=options, executable_path=driver_path)
...
product_asin = []
product_name = []
product_price = []
product_ratings = []
product_ratings_num = []
product_link = []

items = wait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, '//div[contains(@class, "s-result-item s-asin")]')))

for item in items:
    ...

# store data from lists to database
store_db(product_asin, product_name, product_price, product_ratings, product_ratings_num, product_link)

global next_page
next_page = driver.find_element_by_xpath('//li[@class="a-selected"]/following-sibling::li/a').get_attribute("href")
Now we have finished the code for scraping one page. Wrap all of this code into a function scrape_page
and create the main scraping function named scrape_amazon
.
def scrape_amazon(keyword, max_pages):
    global next_page  # updated by scrape_page after each page
    page_number = 1
    next_page = ''
    driver = webdriver.Chrome(options=options, executable_path=driver_path)
    driver.get(web)
    driver.implicitly_wait(5)
    search = driver.find_element_by_xpath('//*[@id="twotabsearchtextbox"]')
    search.send_keys(keyword)
    # click search button
    search_button = driver.find_element_by_id('nav-search-submit-button')
    search_button.click()
    driver.implicitly_wait(5)
    while page_number <= max_pages:
        scrape_page(driver)
        page_number += 1
        if page_number <= max_pages:  # skip loading a page we will not scrape
            driver.get(next_page)
            driver.implicitly_wait(5)
    driver.quit()
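The while loop above is really just a "scrape the page, then follow the next link" pattern. Stripped of the browser, the same control flow can be sketched as below (crawl, find_next, and the page names are made up for illustration; in the tutorial the role of find_next is played by the following-sibling XPath lookup inside scrape_page):

```python
def crawl(start_url, find_next, max_pages):
    """Follow 'next page' links up to max_pages, collecting visited URLs.

    find_next is any callable that, given the current URL, returns the
    next page's URL, or None when there is no further page.
    """
    visited = []
    url = start_url
    page_number = 1
    while url is not None and page_number <= max_pages:
        visited.append(url)   # stand-in for "scrape this page"
        url = find_next(url)  # stand-in for the next-page XPath lookup
        page_number += 1
    return visited

# fake site with three result pages chained together
links = {'p1': 'p2', 'p2': 'p3', 'p3': None}
print(crawl('p1', links.get, max_pages=3))  # → ['p1', 'p2', 'p3']
print(crawl('p1', links.get, max_pages=2))  # → ['p1', 'p2']
```

Checking url against None also stops the loop early when the search result has fewer pages than max_pages, which is worth handling in the real scraper too.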
Run the function scrape_amazon with the keyword and max_pages you want.
scrape_amazon('wireless charger',3)
The amazon_search.db
file will appear in your project folder. To open an sqlite3 database, you can use a program like DB Browser. Alternatively, for a quick look, you can open amazon_search.db
(File > Open DB) and run a query on www.sqliteonline.com
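You can also query the database straight from Python with the standard-library sqlite3 module, with no extra tools. The snippet below builds a small in-memory table with the same schema (the sample rows are made up) and pulls out the two cheapest items; pointing sqlite3.connect at 'amazon_search.db' would run the same query against the real scraped data:

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # use 'amazon_search.db' for the real file
conn.execute('''CREATE TABLE search_result
                (ASIN text, name text, price real,
                 ratings text, ratings_num text, details_link text)''')
conn.executemany("INSERT INTO search_result VALUES (?,?,?,?,?,?)", [
    ('B001', 'Charger A', 29.99, '4.6', '2,100', 'https://example.com/a'),
    ('B002', 'Charger B', 15.49, '4.2', '850',   'https://example.com/b'),
    ('B003', 'Charger C', 21.00, '4.8', '5,400', 'https://example.com/c'),
])

# two cheapest items in the result set
cheapest = conn.execute(
    'SELECT name, price FROM search_result ORDER BY price LIMIT 2').fetchall()
print(cheapest)  # → [('Charger B', 15.49), ('Charger C', 21.0)]
conn.close()
```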
You can see all my code on my GitHub profile.
Note:
There are some restrictions on scraping along with this tutorial. Some web elements might change their XPath. While writing this content, I saw a few changes in the website's structure. However, if we grasp the basics of Selenium and HTML structure, we will be able to adapt the code.
Summary
There are many tools that can do web scraping. Some are more appropriate and stable for scraping data from Amazon.com, especially when you want to scrape a lot of data. Scraping with Selenium is a bit challenging, but it teaches us a lot as well.
If you enjoyed this blog, please consider following me on Medium, and feel free to reach out or share any recommendations.