Web Scraping — Amazon Customer Reviews

Vaisakh Nambiar
Published in Analytics Vidhya
4 min read · Nov 20, 2019

We search for a lot of things on the internet. This information is readily available, but it cannot easily be saved for later use.

One way is to copy the data manually and save it on your computer. However, this is a very time-consuming job. Web scraping is handy in such cases.

What is web scraping?

Web scraping is a technique used to extract large amounts of data from websites and store it on your computer. This data can later be used for analysis.

In this blog, I will show you how to scrape the reviews of a particular product from the Amazon website using Python.

The first step is to check whether the website allows scraping. You can do this by appending /robots.txt to the site's root URL:

https://www.amazon.in/robots.txt
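
Instead of reading the file by eye, we can also let Python interpret the rules. The snippet below is a minimal sketch using the standard library's urllib.robotparser. The sample rules and the helper name is_allowed are made up for illustration; they are not Amazon's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only to illustrate the check.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /gp/cart
Allow: /
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "https://www.amazon.in/gp/cart"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://www.amazon.in/s?k=phone"))
```

To check the live file, RobotFileParser also offers set_url() and read(), which download the rules before can_fetch() is called.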

A URL can be split into five parts: protocol, domain, path, query string, and fragment. We will focus mainly on three of them: the domain, the path, and the query string.
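
To see these parts in practice, here is a small sketch that splits a URL with the standard library's urllib.parse. The example URL is an assumption made up for this illustration.

```python
from urllib.parse import urlsplit

# Hypothetical URL, chosen only to show all five parts.
url = "https://www.amazon.in/s?k=headphones&page=2#reviews"
parts = urlsplit(url)

print("protocol:", parts.scheme)
print("domain:", parts.netloc)
print("path:", parts.path)
print("query string:", parts.query)
print("fragment:", parts.fragment)
```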

STEPS:

  1. Get the URL of the page to be scraped.
  2. Inspect the elements of the page and identify the tags required.
  3. Access the URL.
  4. Get the element from the required tags.

Let us begin coding now!!

We begin by importing two libraries: requests and BeautifulSoup. The requests library is used to get the content of a web page. We send a request to the URL and get a response, which contains a status code along with the web page content. BeautifulSoup parses that content into a structure that is easy to search.
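
A minimal sketch of the imports, followed by BeautifulSoup parsing a tiny hand-written HTML string. The string is an assumption for illustration, and "html.parser" is Python's built-in parser.

```python
import requests  # used later to download pages
from bs4 import BeautifulSoup

# Hand-written HTML standing in for a downloaded page.
html = "<html><body><h1>Hello</h1><p>First paragraph.</p></body></html>"

soup = BeautifulSoup(html, "html.parser")
print(soup.h1.text)  # text inside the first <h1> tag
print(soup.p.text)   # text inside the first <p> tag
```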

Headers and Cookies

Normally, Python requests do not need headers and cookies. But in some situations, when we request the page content, we get a status code of 403 or 503, which means we cannot access the web page contents. In such cases we pass headers and cookies as arguments to the requests.get() function.

To find your headers and cookies, go to the Amazon website and search for a particular product. Then right-click any element and select Inspect (or use the shortcut Ctrl+Shift+I). The headers and cookies can be found in the Network tab.

Never share your cookies with anyone.

A function is used to get the page content and status code for the required search query. A status code of 200 is required to continue with the process.
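
Such a function could look like the sketch below. The names build_search_url and get_search_page, the header and cookie values, and the /s?k= search path are assumptions for illustration; replace the placeholders with the values from your own Network tab.

```python
from urllib.parse import urlencode

import requests

# Placeholder values (assumptions): copy your own from the browser's Network tab.
HEADERS = {"User-Agent": "Mozilla/5.0 (placeholder user agent)"}
COOKIES = {}  # e.g. {"session-id": "..."}; never publish real cookie values

SEARCH_URL = "https://www.amazon.in/s"  # assumed search endpoint


def build_search_url(query):
    """Return the search URL for a query, with the query string URL-encoded."""
    return SEARCH_URL + "?" + urlencode({"k": query})


def get_search_page(query):
    """Fetch the search results page; return the response, or None on failure."""
    response = requests.get(
        build_search_url(query), headers=HEADERS, cookies=COOKIES, timeout=10
    )
    if response.status_code == 200:
        return response
    print("Request failed with status code", response.status_code)
    return None
```

Calling get_search_page("wireless headphones") sends the request. If it returns None, the status code was not 200 and the headers or cookies probably need updating.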

Scraping product names and ASIN numbers

Every product on Amazon has a unique identification number called the ASIN (Amazon Standard Identification Number). Using the ASIN, we can directly access each individual product.

The function described above can be used to fetch the search results page, from which we extract the product names and ASINs.

The find_all() function finds all the HTML tags that match the tag name (such as span), attribute, and value given as arguments. These tag parameters are the same for every product name and ASIN across all the product pages. We simply add the data-asin value of each match to a new list. Using this list, we can access the individual ASINs and hence their individual pages.
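
The extraction step can be sketched on a small hand-written HTML sample. The sample markup, the class name, and the helper name extract_asins are assumptions for illustration; the real page structure must be confirmed with Inspect and may change over time.

```python
from bs4 import BeautifulSoup

# Hand-written sample standing in for a search results page.
SAMPLE_SEARCH_HTML = """
<div data-asin="B000000001"><span class="a-size-medium">Product one</span></div>
<div data-asin="B000000002"><span class="a-size-medium">Product two</span></div>
<div data-asin=""><span>Banner without a product</span></div>
"""


def extract_asins(html):
    """Collect the non-empty data-asin values found in the page."""
    soup = BeautifulSoup(html, "html.parser")
    asins = []
    # attrs={"data-asin": True} matches every <div> that has the attribute.
    for tag in soup.find_all("div", attrs={"data-asin": True}):
        asin = tag["data-asin"]
        if asin:  # skip tags where the attribute is present but empty
            asins.append(asin)
    return asins


print(extract_asins(SAMPLE_SEARCH_HTML))
```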

Scraping customer review links

Customer reviews are present on each product page, but only a few of them. We want all the customer reviews for the products, so we have to scrape the ‘see all customer reviews’ link.

To do this, we first define a function that goes to the page of each product using its ASIN.

Now we do the same thing as we did for the ASINs. We extract the ‘see all customer reviews’ link of each product using the corresponding HTML tag and add its href value to a new list.
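
A rough sketch of both steps follows. The /dp/ product path, the data-hook value, the sample markup, and the helper names product_url and extract_review_links are all assumptions for illustration; check the actual tag with Inspect before relying on it.

```python
from bs4 import BeautifulSoup

BASE_URL = "https://www.amazon.in"


def product_url(asin):
    """Build the product page URL for an ASIN (assumed /dp/<ASIN> pattern)."""
    return BASE_URL + "/dp/" + asin


# Hand-written sample standing in for a product page.
SAMPLE_PRODUCT_HTML = """
<div>
  <a href="/dp/B000000001">Product one</a>
  <a data-hook="see-all-reviews-link-foot" href="/product-reviews/B000000001">See all reviews</a>
</div>
"""


def extract_review_links(html):
    """Return the href of every 'see all customer reviews' link in the page."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for tag in soup.find_all("a", attrs={"data-hook": "see-all-reviews-link-foot"}):
        links.append(tag["href"])
    return links


print(product_url("B000000001"))
print(extract_review_links(SAMPLE_PRODUCT_HTML))
```

The extracted links are relative, so they need to be prefixed with the site's root URL before being requested.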

Scraping all customer reviews

We now have the review links for every product. Using these links, we can scrape all the reviews of each product. So, we define a function (similar to the previous ones) that extracts all the reviews of all products.

We use the above function to extract all the customer reviews and store them in a list.
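
The review extraction can be sketched the same way on a hand-written sample. The data-hook value "review-body", the sample markup, and the helper name extract_reviews are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written sample standing in for a reviews page.
SAMPLE_REVIEW_HTML = """
<div><span data-hook="review-body">Great sound quality.</span></div>
<div><span data-hook="review-body">  Stopped working after a week.  </span></div>
"""


def extract_reviews(html):
    """Return the text of every review body found in the page."""
    soup = BeautifulSoup(html, "html.parser")
    reviews = []
    for tag in soup.find_all("span", attrs={"data-hook": "review-body"}):
        # strip=True removes the surrounding whitespace from the review text
        reviews.append(tag.get_text(strip=True))
    return reviews


print(extract_reviews(SAMPLE_REVIEW_HTML))
```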

We can access any number of product result pages by adding ‘&page=2’, ‘&page=3’, and so on to the search query and repeating the steps from scraping the ASINs.

Saving reviews in a CSV file

We have now scraped all the reviews. Next, we have to save them in a file for further analysis.

We convert the reviews list into a dictionary. Then we import the pandas library and use it to convert the dictionary into a data frame. Finally, using the to_csv() function, we write it to a CSV file on our computer.
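
As a sketch, the paginated search URLs can be generated in a loop. The helper name search_page_urls and the /s?k= search path are assumptions for illustration.

```python
from urllib.parse import urlencode


def search_page_urls(query, last_page):
    """Return the search URLs for result pages 1 through last_page."""
    urls = []
    for page in range(1, last_page + 1):
        params = {"k": query}
        if page > 1:  # the first page needs no page parameter
            params["page"] = page
        urls.append("https://www.amazon.in/s?" + urlencode(params))
    return urls


for url in search_page_urls("phone", 3):
    print(url)
```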
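
A minimal sketch of this step, using two made-up reviews. The column name "reviews" and the helper name reviews_to_dataframe are assumptions for illustration.

```python
import pandas as pd


def reviews_to_dataframe(reviews):
    """Wrap the list of reviews in a dictionary and build a data frame from it."""
    return pd.DataFrame({"reviews": reviews})


# Made-up reviews standing in for the scraped list.
reviews = ["Great sound quality.", "Stopped working after a week."]
review_data = reviews_to_dataframe(reviews)

# Without a file name, to_csv() returns the CSV content as a string.
csv_text = review_data.to_csv(index=False)
print(csv_text)
```

To write the file to disk, pass a file name instead, for example review_data.to_csv("reviews.csv", index=False).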

The complete code for scraping Amazon product reviews is available at:

https://github.com/vaisakhnambiar/Web-scraping
