Data Science

Scraping Amazon “Best Seller” Books Using Python

We will be using Python libraries: Requests, BeautifulSoup, and Pandas

Tavishi
12 min read · Jun 8, 2022
Index of Contents
Introduction to Web Scraping
Is Web Scraping Legal?
Can we scrape data from everywhere?
Web Scraping Libraries
Download the webpage using the Requests library
Parse the HTML source code using Beautiful Soup
Inspect the web page to extract the information
Extracting the Titles and URLs of the books
Extracting Books’ Topics and Titles
Extracting Book Topics’ URLs
Compile the data and create a CSV file using the Pandas library
Extracting information about books, such as the title, author, edition type, price, star rating, and reviews
Summary
References

Extracting content from websites by hand is not simple, which is where data scraping comes in. Web scraping has become widely popular because it makes data easy to access and offers an efficient way to copy large amounts of information online. Most businesses use web scraping techniques to quickly examine competitors and their business portfolios.

“Where there is data smoke, there is business fire.” — Thomas Redman

“You can have data without information, but you cannot have information without data.” — Daniel Keys Moran

Did you know that data scraping began with a different goal in mind? The origin of very basic web scraping dates back to 1989, when English scientist Tim Berners-Lee created the World Wide Web. Initially, the idea was to share information automatically between scientists at universities and institutes worldwide. It took more than two decades for this to evolve into the web scraping we know today, which has helped firms across many sectors explore their options.

Introduction to Web Scraping

Web scraping is the gathering of useful information from a website of interest and presenting it in a meaningful way. It is also called web data mining or web harvesting. More precisely, web scraping is the process of extracting, parsing, downloading, and organizing useful information from the web automatically. The extracted data is typically saved as an Excel spreadsheet or a CSV file, but it can also be saved in other formats, such as JSON.

Whether you are a data scientist, engineer, or anybody who analyses vast amounts of datasets, the ability to scrape data from the web is a valuable skill to have. Let’s imagine you find data on the web that you can’t download directly; web scraping with Python is a technique you can use to extract the data into a format that can be imported and used in a variety of ways. Web scraping may be done using a variety of techniques, and Python has some wonderful and useful packages that make the work a lot easier.

Beautiful Soup is one of the most useful Python libraries for beginners. It is user-friendly and simple to grasp at all levels of Python coding. I hope that by the end of this post, you will be convinced that “anyone can scrape the data with any level of coding.”

Is Web Scraping Legal?

People frequently ask whether web scraping is permitted or not. Because it involves duplicating data, you would expect some kind of protection to be in place. The answer can be both yes and no: there is no blanket regulation prohibiting the scraping of publicly available website data, but websites come with Terms & Conditions, so you must exercise extreme caution when working with them.

Can we scrape data from everywhere?

Before you get too deep into the process of scraping, bear in mind that scraping causes a spike in website traffic and may cause the website server to crash. As a result, not all websites allow scraping. So, how can you know which websites are permitted and which are prohibited? The website’s robots.txt file can be examined: simply add “/robots.txt” to the end of the URL you want to scrape to find out whether the website’s host allows scraping.

Let’s take an example of “poetryfoundation.org/robots.txt”.

You will come across a page similar to this. Here you will find all of the website’s terms and conditions, as well as what the website allows and does not allow.

Another option for finding out whether scraping is allowed is to go to the website’s user agreement page, where you’ll find their terms and conditions for web scraping. Some web pages prohibit scraping in general, but if you email the administrators and explain that you want to scrape the data for research or study purposes, they may agree. Otherwise, if none of these circumstances apply and you attempt to scrape the data for your own purposes, the administrators may block your IP address from accessing their server, and instead of the pages you want, you’ll start seeing CAPTCHA pages.

Web Scraping Libraries

Data can be scraped in several ways, and there are dozens of web-scraping libraries and tools for Python. Some of the most notable are “Requests,” “Beautiful Soup,” “Scrapy,” “lxml,” “Selenium,” and “AWS Lambda.” Requests lets you communicate with web servers; which of the others you use depends on your use case:

  1. Beautiful Soup: The Beautiful Soup library is an essential addition to your data science toolset since it is a basic and easy-to-use but powerful library that allows you to scrape data in just a few hours of practice. Its biggest strength is undoubtedly its simplicity.
  2. Scrapy: Scrapy is a Python-based open-source web scraping framework. It’s used to create sophisticated web scrapers. You’ll find all of the tools you need to extract data from websites, process it as needed, and store it in the structure and format you wish. It’s built on top of Twisted, an asynchronous networking framework, which is one of its key features. Scrapy is the ideal choice for tasks that demand a lot of web scraping. It also includes several useful built-in exports, including JSON, XML, and CSV. Data scraping is significantly faster here, and it can be used for a variety of purposes, from data mining to monitoring and automated testing. However, it is not for novices because it is a full-fledged framework.
  3. Selenium: Complex and dynamic codes are present on websites. Furthermore, it is preferable to render all of the website content using a browser first. To reach the webpage, Selenium uses a genuine web browser. This gives the impression that a real person is accessing data in the same way. The browser uses the web driver to load all of the online resources and runs the javascript on the page. Simultaneously, it stores all cookies established by websites and sends complete HTTP headers, just like any other browser. While Selenium is mostly used for testing, it may also be used to scrape dynamic web pages. Running it is the best way to figure out whether a website is compatible with various browsers.
  4. lxml: lxml is a production-quality HTML and XML parsing library with outstanding performance. You can rely on it to be useful regardless of which web page you are scraping. lxml is more efficient than Beautiful Soup, which is commonly used in the data science industry.
  5. AWS Lambda: For simpler tasks, AWS Lambda is wonderful. It integrates with all of Amazon’s services. A Docker container is used to run the scraper, and CloudWatch event rules deploy the scraping jobs to Lambda every day. You can run the scraper on a schedule rather than starting and stopping a server manually.
    It also supports cron-style scheduling, similar to a local setup on a Mac. However, using Lambda can be difficult since it lacks persistent local storage like EC2, which means it suits data transformation but suffers from delays in data transmission and storage. In addition, the documentation can be difficult to understand.

Steps Involved:

  • Install and import the required libraries.
  • Parse a specific bestseller category page using Requests and Beautiful Soup.
  • Fetch information about the item including name, price, reviews, rating, and URL.
  • Store all the information into a CSV file using pandas.

We want to scrape data from Amazon’s Best Seller Books.

Amazon is one of the largest online businesses, selling millions of different goods all over the world through its platform, Amazon.com. It offers many categories of products, including fashion, books, electronics, toys, jewelry, and more. Popular products are listed in Amazon’s bestseller category, which helps sellers find the bestselling products and helps customers find high-quality bestselling products to buy. Here we are going to scrape the bestselling books in different categories using Python and Beautiful Soup.

We’ve chosen https://www.amazon.in/gp/bestsellers/books/page, which contains information about the Best Seller categories of books. In this blog post, I will retrieve information from this page using web scraping. We will use the “Requests” and “Beautiful Soup” libraries for our project.

You can view and execute the code for this tutorial here:

tavishi-1402/web-scrapping-amazon-best-seller-booksfinal — Jovian

Major-Projects/web_scrapping-amazon-best-seller-booksFINAL.ipynb at main · tavi1402/Major-Projects (github.com)

Let’s start the journey.

Download the webpage using the Requests library.

We will start our journey by installing and importing the Requests library to download the page https://www.amazon.in/gp/bestsellers/books.

Let’s install and import the requests library.

The Requests library establishes a connection between your notebook (or, say, the machine on which you are writing the code) and the web page’s server, fetching the content in HTML format (the format in which a web page is written). We save all of the HTML content in a variable so that we can access it at any time.

We use requests.get to download the web page URL.

The response object contains the result of the requests, the status code, and other information. We can access the contents of the page using response.text.

We can check whether the web page was downloaded successfully and is suitable for scraping. If the status code falls between 200 and 299, the request succeeded; otherwise, it did not. The status code refers to the status of a Hypertext Transfer Protocol (HTTP) response: a server issues a status code in response to a client’s request. A complete guide to these status codes can be found here.
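
Here is a minimal sketch of this step; the variable names (topics_url, page_contents) and the User-Agent header are my own additions for illustration, not necessarily what the original notebook uses:

```python
# Install with: pip install requests
import requests

# The bestseller books page we want to download
topics_url = 'https://www.amazon.in/gp/bestsellers/books'

# Amazon often rejects requests without a browser-like User-Agent header
headers = {'User-Agent': 'Mozilla/5.0'}

response = requests.get(topics_url, headers=headers)

# Status codes between 200 and 299 mean the request succeeded
if not 200 <= response.status_code <= 299:
    raise Exception(f'Failed to load page, status code: {response.status_code}')

# response.text holds the HTML source of the page
page_contents = response.text
print(page_contents[:500])  # preview the first 500 characters
```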

The web page contains the HTML code.

We have successfully downloaded the web page using requests.

Parse the HTML source code using Beautiful Soup

We’ll use the Beautiful Soup Python library to parse the HTML source code of the web page downloaded in the previous section.

Let’s install and import the Beautiful Soup library.

We can use the BeautifulSoup class to parse the HTML document.
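
A minimal sketch of the parsing step, assuming the page_contents string from the previous sketch (install the library with pip install beautifulsoup4):

```python
from bs4 import BeautifulSoup

# Parse the downloaded HTML source into a navigable tree
doc = BeautifulSoup(page_contents, 'html.parser')

# The <title> tag is a quick sanity check that we parsed the right page
print(doc.title)
```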

Inspect the web page to extract the information.

Now we’ll inspect our web page. Since HTML contains tags for each attribute and our downloaded web page is also in the form of HTML code, we’ll try to extract all of the data from the web page using the HTML tags for each attribute. To find the HTML tag for any element on a web page, simply right-click on that element; you’ll see a menu with an Inspect option, and when you click on it, a developer panel appears on the left, bottom, or right side of your current window, as shown below.

Inspect Element

So far, we’ve completed half of our work: we’ve downloaded the web page and learned about the HTML tags for each attribute; now it’s time to collect data from the web page using these tags. Here is a sneak peek at the code cell below.

Once parsed, we can use the doc object to extract the information from the web page.
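
For example, here is a rough sketch of pulling the first image out of doc; the exact output naturally depends on Amazon’s current markup:

```python
# Find every <img> tag on the page and look at the first one
img_tags = doc.find_all('img')
print(len(img_tags))   # how many images the page contains
print(img_tags[0])     # the first image tag and its attributes
```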

As we can see above, we were able to retrieve the data and the first image from the web page.

Extracting the Titles and URLs of the books.

We start by getting data out of doc using helper functions and collecting the information into lists.

Extracting Books’ Topics and Titles.

For this, we have created a helper function get_topic_titles to extract the list of all the titles of books present on the web page.
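
A hedged sketch of what such a helper might look like; the class name in the selector is an assumption about Amazon’s markup at the time of writing and may well have changed:

```python
def get_topic_titles(doc):
    """Return the list of title strings found on the bestseller page."""
    # NOTE: this class name is illustrative; verify it against the live page
    selection_class = 'p13n-sc-truncate-desktop-type2'
    title_tags = doc.find_all('div', {'class': selection_class})
    return [tag.text.strip() for tag in title_tags]

titles = get_topic_titles(doc)
print(len(titles))
print(titles[:5])
```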

In the above code cell, we have fetched all the titles of the books present on the web page https://www.amazon.in/gp/bestsellers/books

Extracting Book Topics’ URLs

We create a helper function get_topic_urls to extract the list of all the URLs present on the web page.
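
A similar sketch for the URL helper; again, the class name is a placeholder to be checked against the live page:

```python
def get_topic_urls(doc):
    """Return the list of absolute URLs found on the bestseller page."""
    base_url = 'https://www.amazon.in'
    # NOTE: this class name is illustrative; verify it against the live page
    link_tags = doc.find_all('a', {'class': 'a-link-normal'})
    return [base_url + tag['href'] for tag in link_tags if tag.get('href')]

urls = get_topic_urls(doc)
print(urls[:5])
```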

In the above code cell, we have fetched all the URLs of the books present on the web page https://www.amazon.in/gp/bestsellers/books

Compile the data and create a CSV file using the Pandas library.

Install and Import the Pandas library

We have combined all the collected data above into a single function called scrape_topics. Inside it, we create a dictionary, load the extracted titles and URLs into it, and then build a DataFrame using the Pandas library.
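
A rough sketch of how scrape_topics might combine the two helpers into a DataFrame (install pandas with pip install pandas; this is a sketch, not the exact notebook code):

```python
import pandas as pd

def scrape_topics(doc):
    """Combine the extracted titles and URLs into a single pandas DataFrame."""
    topics_dict = {
        'title': get_topic_titles(doc),
        'url': get_topic_urls(doc),
    }
    # Both lists must be the same length for the DataFrame to be built
    return pd.DataFrame(topics_dict)

topics_df = scrape_topics(doc)
topics_df.head()
```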

We’ve also written a helper function that downloads a page in the same way as before: it verifies the status code and returns a Beautiful Soup object if the status code is valid. To collect data from each page, we built a few such functions.

So far, we’ve only scraped data from a single web page. Let’s go deeper into web scraping and construct some helper functions to automate the job. Instead of writing code for each page, let’s create helper functions that extract the data from several almost identical web pages.
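
Here is a sketch of that pattern: a small helper that downloads any page, checks the status code, and returns a parsed Beautiful Soup object ready for the extraction functions (the name get_topic_page is my own choice, not necessarily the one used in the notebook):

```python
def get_topic_page(topic_url):
    """Download a page, check the status code, and return the parsed document."""
    response = requests.get(topic_url, headers={'User-Agent': 'Mozilla/5.0'})
    if not 200 <= response.status_code <= 299:
        raise Exception(f'Failed to load {topic_url}, status code: {response.status_code}')
    return BeautifulSoup(response.text, 'html.parser')
```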

Extracting information about books, such as the title, author, URL, edition type, price, star rating, and reviews.

To get the data we wanted, we used CSS selectors. Hover over the page with the inspector open to discover the CSS selectors we need. We require selectors for the item div tag, book name, price, edition type, reviews, rating, and URL, as shown in the image. The selectors used in the code matched the web page at the time of writing, although this can change.
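
A sketch of one way to express this; every selector below is an assumption about Amazon’s markup (which changes frequently), so treat the class names as placeholders to verify against the live page:

```python
def parse_book(item):
    """Extract the fields we want from a single bestseller item <div>."""
    # All class names below are illustrative placeholders, not guaranteed selectors
    title_tag  = item.select_one('div.p13n-sc-truncate-desktop-type2')
    author_tag = item.select_one('a.a-size-small.a-link-child')
    price_tag  = item.select_one('span.p13n-sc-price')
    rating_tag = item.select_one('span.a-icon-alt')
    review_tag = item.select_one('span.a-size-small')
    url_tag    = item.select_one('a.a-link-normal')

    return {
        'title':   title_tag.text.strip() if title_tag else None,
        'author':  author_tag.text.strip() if author_tag else None,
        'price':   price_tag.text.strip() if price_tag else None,
        'rating':  rating_tag.text.strip() if rating_tag else None,
        'reviews': review_tag.text.strip() if review_tag else None,
        'url': ('https://www.amazon.in' + url_tag['href']) if url_tag else None,
    }
```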

We have collected all the data from each category’s web page and written it into a Python dictionary. Using the Pandas library, we can then obtain a DataFrame.

Here is a snip of what the pandas DataFrame looks like:

Well, finally we have successfully scraped the data from different pages, and we have created a CSV file using the pandas DataFrame.

We have created a data folder and stored all the files in it in CSV format. Let’s take a look:
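
A small sketch of this final step, reusing topics_df, get_topic_page, and parse_book from the earlier sketches (the item selector is again a placeholder):

```python
import os

os.makedirs('data', exist_ok=True)   # create the data folder if it doesn't exist

for _, topic in topics_df.iterrows():
    topic_doc = get_topic_page(topic['url'])
    # NOTE: this item selector is illustrative; verify it against the live page
    items = topic_doc.select('div.zg-grid-general-faceout')
    books_df = pd.DataFrame([parse_book(item) for item in items])
    books_df.to_csv(f"data/{topic['title']}.csv", index=False)
```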

Here is a snip of what the CSV file looks like:

Summary:

From beginning to end, this is a very brief description:

  1. First, we installed all of the necessary libraries in our Jupyter notebook.
  2. Using the Requests library, we downloaded the web page into our notebook.
  3. We inspected the web page to find the HTML tags for every attribute we wanted to scrape.
  4. The data from each HTML tag was then collected and written into a Python dictionary.
  5. To collect data from the different pages, we wrote several helper functions, and then a parser function that extracts all of the data from each page and parses it into a Python dictionary.
  6. Finally, we created a CSV file using the Pandas library.

We have extracted information about the Amazon best seller books using the Beautiful Soup library in Python: the book names, URLs, author names, edition types, prices, star ratings, and reviews. At the end, using pandas, we created a data folder and saved a CSV file of 50 rows and 7 columns covering the best seller books on Amazon.

References:

  1. To see the complete notebook with the code and all the user guides, please visit my notebook on Jovian or my notebook on GitHub.
  2. A big thanks to Aakash N S, who taught me how to scrape web pages; here is a link to his tutorial: https://youtu.be/RKsLLG-bzEY
  3. Complete documentation for the Requests library is available here.
  4. Documentation for the Beautiful Soup 4 library is available here.
  5. A big thanks to Anushree K and the jovian.ai team as well, who supported me in learning all of this as part of a group.
