Web Scraping with Selenium in Python — Amazon Search Result (Part 2)
Scrape multiple pages and store the data in a database
In part 1, we scraped all the items in a search result on Amazon.com and stored them in lists, from which you could create a dataframe and save the data to a CSV file. In this part, we will scrape through multiple pages and store the results in a relational database using sqlite3.
The overall web-scraping process in this tutorial is shown in the diagram. After we scrape the items' details on the first page, we will store them in an sqlite3 database. For that purpose, the function store_db
was created.
import sqlite3

def store_db(product_asin, product_name, product_price, product_ratings, product_ratings_num, product_link):
    conn = sqlite3.connect('amazon_search.db')
    curr = conn.cursor()
    # create the table if it does not exist yet
    curr.execute('''CREATE TABLE IF NOT EXISTS search_result (ASIN text, name text, price real, ratings text, ratings_num text, details_link text)''')
    # insert one row per scraped item
    curr.executemany("INSERT INTO search_result (ASIN, name, price, ratings, ratings_num, details_link) VALUES (?,?,?,?,?,?)",
                     list(zip(product_asin, product_name, product_price, product_ratings, product_ratings_num, product_link)))
    conn.commit()
    conn.close()
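Before wiring store_db into the scraper, you can sanity-check it with small hand-made lists and read the rows back. The sketch below uses the same schema and insert logic, but makes the database path a parameter so it can point at a scratch file instead of your real amazon_search.db (the sample items and the test_search.db name are made up for illustration):

```python
import os
import sqlite3
import tempfile

def store_db(db_path, product_asin, product_name, product_price,
             product_ratings, product_ratings_num, product_link):
    # same logic as the tutorial's store_db, with the database path
    # made a parameter so we can use a throwaway file here
    conn = sqlite3.connect(db_path)
    curr = conn.cursor()
    curr.execute('''CREATE TABLE IF NOT EXISTS search_result
                    (ASIN text, name text, price real,
                     ratings text, ratings_num text, details_link text)''')
    curr.executemany(
        "INSERT INTO search_result VALUES (?,?,?,?,?,?)",
        list(zip(product_asin, product_name, product_price,
                 product_ratings, product_ratings_num, product_link)))
    conn.commit()
    conn.close()

# two fake items, shaped like the lists the scraper builds
db_path = os.path.join(tempfile.mkdtemp(), 'test_search.db')
store_db(db_path,
         ['B001', 'B002'], ['Charger A', 'Charger B'],
         [19.99, 25.5], ['4.5', '4.0'], ['1,200', '310'],
         ['https://example.com/a', 'https://example.com/b'])

conn = sqlite3.connect(db_path)
rows = conn.execute(
    'SELECT ASIN, name, price FROM search_result ORDER BY ASIN').fetchall()
conn.close()
print(rows)  # → [('B001', 'Charger A', 19.99), ('B002', 'Charger B', 25.5)]
```

Note that zip pairs the parallel lists into row tuples, which is exactly the shape executemany expects.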
Next, we need to find the pagination button next to our current page number in order to get the link to the next page.
Find the next_page element
At the bottom of the first page, we inspect the page 2 button as below.
As we are now at the <li class="a-selected"> element, the next page button will be its following sibling, which is also an <li> tag, and the link to the next page is stored as the value of the href attribute of its child <a> tag. Our XPath for the next_page element will be '//li[@class="a-selected"]/following-sibling::li/a'.
Continuing from part 1, after scraping a page we can call the store_db
function with the data lists as arguments. Then, we have Selenium find the next-page link. However, since we are scraping multiple pages, the next_page
link changes every time we scrape a page. Therefore, we need to define a global variable called next_page
and initialize it to an empty string at the beginning of the code.
next_page = ''
driver = webdriver.Chrome(options=options, executable_path=driver_path)
...
product_asin = []
product_name = []
product_price = []
product_ratings = []
product_ratings_num = []
product_link = []

items = wait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, '//div[contains(@class, "s-result-item s-asin")]')))

for item in items:
    ...

# store data from lists to database
store_db(product_asin, product_name, product_price, product_ratings, product_ratings_num, product_link)

global next_page
next_page = driver.find_element_by_xpath('//li[@class="a-selected"]/following-sibling::li/a').get_attribute("href")
Now we have finished the code for scraping one page. Wrap all of this code into a function scrape_page
and create the main scraping function named scrape_amazon
.
def scrape_amazon(keyword, max_pages):
    global next_page  # updated by scrape_page after each page
    page_number = 1
    next_page = ''
    driver = webdriver.Chrome(options=options, executable_path=driver_path)
    driver.get(web)
    driver.implicitly_wait(5)
    search = driver.find_element_by_xpath('//*[@id="twotabsearchtextbox"]')
    search.send_keys(keyword)
    # click search button
    search_button = driver.find_element_by_id('nav-search-submit-button')
    search_button.click()
    driver.implicitly_wait(5)
    while page_number <= max_pages:
        scrape_page(driver)
        page_number += 1
        if page_number <= max_pages:  # skip loading a page we will not scrape
            driver.get(next_page)
            driver.implicitly_wait(5)
    driver.quit()
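The while loop above is really just a "scrape the page, then follow the next link" pattern. Stripped of the browser, the same control flow can be sketched as below (crawl, find_next, and the page names are made up for illustration; in the tutorial the role of find_next is played by the following-sibling XPath lookup inside scrape_page):

```python
def crawl(start_url, find_next, max_pages):
    """Follow 'next page' links up to max_pages, collecting visited URLs.

    find_next is any callable that, given the current URL, returns the
    next page's URL, or None when there is no further page.
    """
    visited = []
    url = start_url
    page_number = 1
    while url is not None and page_number <= max_pages:
        visited.append(url)   # stand-in for "scrape this page"
        url = find_next(url)  # stand-in for the next-page XPath lookup
        page_number += 1
    return visited

# fake site with three result pages chained together
links = {'p1': 'p2', 'p2': 'p3', 'p3': None}
print(crawl('p1', links.get, max_pages=3))  # → ['p1', 'p2', 'p3']
print(crawl('p1', links.get, max_pages=2))  # → ['p1', 'p2']
```

Checking url against None also stops the loop early when the search result has fewer pages than max_pages, which is worth handling in the real scraper too.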
Run the function scrape_amazon with the keyword and max_pages you want.
scrape_amazon('wireless charger',3)
The amazon_search.db
file will appear in your project folder. To open an sqlite3 database, you can use a program like DB Browser. Alternatively, for a quick look, you can open amazon_search.db
(File > Open DB) and run a query on www.sqliteonline.com
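You can also query the database straight from Python with the standard-library sqlite3 module, with no extra tools. The snippet below builds a small in-memory table with the same schema (the sample rows are made up) and pulls out the two cheapest items; pointing sqlite3.connect at 'amazon_search.db' would run the same query against the real scraped data:

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # use 'amazon_search.db' for the real file
conn.execute('''CREATE TABLE search_result
                (ASIN text, name text, price real,
                 ratings text, ratings_num text, details_link text)''')
conn.executemany("INSERT INTO search_result VALUES (?,?,?,?,?,?)", [
    ('B001', 'Charger A', 29.99, '4.6', '2,100', 'https://example.com/a'),
    ('B002', 'Charger B', 15.49, '4.2', '850',   'https://example.com/b'),
    ('B003', 'Charger C', 21.00, '4.8', '5,400', 'https://example.com/c'),
])

# two cheapest items in the result set
cheapest = conn.execute(
    'SELECT name, price FROM search_result ORDER BY price LIMIT 2').fetchall()
print(cheapest)  # → [('Charger B', 15.49), ('Charger C', 21.0)]
conn.close()
```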
You can see all my code on my GitHub profile.
Note:
There are some restrictions on scraping along with this tutorial. Some web elements might change their XPath. While writing this content, I saw a few changes in the website's structure. However, if we grasp the basics of Selenium and HTML structure, we will be able to adapt the code.
Summary
There are many tools that can do web scraping. Some are more appropriate and stable for scraping data from Amazon.com, especially when you want to scrape a lot of data. Scraping with Selenium is a bit challenging, but it teaches us a lot as well.
If you enjoyed this blog, please consider following me on Medium, and feel free to reach out or share any recommendations.