The Numbers Game: Exploring Amazon’s Data Analysis Book Trends

Isaac Oresanya
6 min read · Oct 27, 2023

Books are more than just sources of information or entertainment. They can also influence our thoughts, feelings, and behaviors in various ways. But what factors make a book appealing or influential for readers? To answer this question, I scraped data from Amazon Books, one of the largest online book retailers in the world. I collected information on 166 books about data analysis across different genres, languages, and ratings. Then I analyzed the data to explore the relationships between variables such as price, rating, and page count.

In this article, I will present the details of my data collection and analysis process, as well as the limitations and implications of my study. For each of the different analyses, I will share the PostgreSQL queries that I used to extract and manipulate the data from the database, and the corresponding Tableau charts that I created to visualize and interpret the results. I will also discuss the main findings and conclusions of my analysis and how they answer my research question.

DATA SOURCE

Using the Scrapy web scraping framework, I collected data from Amazon’s website. I searched for books related to data analysis using the keyword "data analysis". I scraped the data of more than 250 books (with at least 100 ratings each) into a "books" table in my database. I also scraped the reviews of all the books (2993 reviews in total) into another "reviews" table.
The "books" table contains information like title, author, description, price, rating, and number of ratings for each book. The "reviews" table contains information like reviewer’s name, rating, title, date, and content for each review. Each review in the "reviews" table is linked to its corresponding book in the "books" table using a foreign key.
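To make the two-table structure concrete, here is a minimal sketch of what such a schema could look like. It uses Python's built-in sqlite3 module with an in-memory database as a stand-in for PostgreSQL. The column names `price`, `rating`, `num_of_rating`, and `paperback` appear in the queries in this article; the key columns (`id`, `book_id`), the names `reviewer_name` and `review_date`, and all column types are assumptions made for illustration.

```python
import sqlite3

# In-memory SQLite stands in for the PostgreSQL database described in the article.
# Key columns, reviewer_name, review_date, and all types are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.executescript("""
CREATE TABLE books (
    id            INTEGER PRIMARY KEY,
    title         TEXT NOT NULL,
    author        TEXT,
    description   TEXT,
    price         REAL,
    rating        REAL,
    num_of_rating INTEGER,
    paperback     TEXT  -- page count; a later query casts it to integer
);

CREATE TABLE reviews (
    id            INTEGER PRIMARY KEY,
    book_id       INTEGER NOT NULL REFERENCES books(id),  -- link back to books
    reviewer_name TEXT,
    rating        REAL,
    title         TEXT,
    review_date   TEXT,
    content       TEXT
);
""")
```

With the foreign key in place, a review cannot be stored unless the book it points to exists, which keeps joins between the two tables reliable.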

ANALYSIS REPORT

What is the distribution of book prices on Amazon?

-- Categorize books into price ranges and count the number of books in each range.
-- Half-open comparisons (price < 30, price < 60, ...) keep prices with cents,
-- such as 29.99, from falling through to the last bucket.
SELECT
    price_range,
    COUNT(*) AS num_of_books
FROM (
    SELECT
        CASE
            WHEN price < 30 THEN 'Under $30'
            WHEN price < 60 THEN '$30-$59'
            WHEN price < 90 THEN '$60-$89'
            WHEN price < 120 THEN '$90-$119'
            WHEN price < 150 THEN '$120-$149'
            WHEN price < 180 THEN '$150-$179'
            WHEN price < 210 THEN '$180-$209'
            WHEN price < 240 THEN '$210-$239'
            WHEN price < 270 THEN '$240-$269'
            ELSE 'Above 270'
        END AS price_range
    FROM books
) AS grouped_books
GROUP BY price_range
ORDER BY num_of_books DESC;

Book Price Distribution

The chart shows the distribution of book prices. The vast majority of books are priced under $30, and a smaller share falls between $30 and $59. Only a few books are priced above $270.

The chart also shows that the number of books in each price range generally decreases as the price increases. This suggests a smaller market for high-priced books, at least within this sample.
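The bucketing logic can also be checked outside the database. The sketch below is a hypothetical Python helper, not part of the original project, that assigns a price to a $30-wide range using half-open intervals, so a price with cents such as 29.99 lands in 'Under $30' rather than slipping between integer boundaries. The labels mirror the ones used in the query.

```python
def price_range(price: float) -> str:
    """Return the price bucket label, using half-open $30-wide ranges [low, low + 30)."""
    if price < 0:
        raise ValueError("price must be non-negative")
    if price < 30:
        return "Under $30"
    if price >= 270:
        return "Above 270"
    low = int(price // 30) * 30  # lower bound of the $30-wide bucket
    return f"${low}-${low + 29}"
```

For example, `price_range(29.99)` gives 'Under $30' and `price_range(59.5)` gives '$30-$59'.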

Which authors have received the most ratings on Amazon?

-- Sum up the number of ratings for each author and display in descending order of total ratings
SELECT author, SUM(num_of_rating) AS total_rating
FROM books
GROUP BY author
ORDER BY total_rating DESC;
Top Rated Authors

The chart ranks authors by the total number of ratings across their books, which measures reach rather than average star rating. Cole Nussbaumer Knaflic leads with 10,398 ratings. Kam Knight is in second place with 5,504, followed by Martin Kleppmann with 4,574. Neil Dagger and Wayne C. Booth et al. are in fourth and fifth place, with 4,376 and 2,470 ratings respectively.
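For readers who want to see what the aggregation does step by step, here is a small Python sketch of the same GROUP BY and SUM logic. The rows are made-up placeholders rather than values from the scraped dataset, and `total_ratings_by_author` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical (author, num_of_rating) rows; not taken from the scraped dataset.
rows = [
    ("Author A", 1200),
    ("Author B", 800),
    ("Author A", 300),
    ("Author C", 950),
]

def total_ratings_by_author(rows):
    """Sum the number of ratings per author, sorted from most to fewest."""
    totals = defaultdict(int)
    for author, num_of_rating in rows:
        totals[author] += num_of_rating
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

An author with several moderately rated books can outrank an author with one heavily rated book, because the totals add up across titles.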

What is the distribution of book ratings on Amazon?

-- Count books based on their rating ranges
SELECT
    CASE
        WHEN rating >= 4.5 THEN '4.5+'
        WHEN rating >= 4.0 THEN '4.0 - 4.4'
        WHEN rating >= 3.5 THEN '3.5 - 3.9'
        ELSE 'Below 3.5'
    END AS rating_range,
    COUNT(*) AS num_books
FROM books
GROUP BY rating_range;

Distribution of Book Ratings

The pie chart shows the distribution of book ratings on Amazon. The vast majority of books are rated 4.0 or above.
The chart suggests that most data analysis books in this sample are well received by readers. Part of this may reflect how the data was collected: only books with at least 100 ratings were scraped, so less popular titles, which may be rated differently, are not represented.
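A pie chart displays shares rather than raw counts, so it can help to see how the counts turn into percentages. The following Python sketch mirrors the CASE expression in the query (the first matching threshold wins) and then computes each range's share. Both function names are hypothetical, and the sample ratings in the usage note are made up.

```python
from collections import Counter

def rating_range(rating: float) -> str:
    """Map a star rating to the same labels used in the query."""
    if rating >= 4.5:
        return "4.5+"
    if rating >= 4.0:
        return "4.0 - 4.4"
    if rating >= 3.5:
        return "3.5 - 3.9"
    return "Below 3.5"

def rating_shares(ratings):
    """Return the percentage of books in each rating range, rounded to one decimal."""
    counts = Counter(rating_range(r) for r in ratings)
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}
```

For instance, `rating_shares([4.6, 4.7, 4.2, 3.0])` puts half of the books in '4.5+' and a quarter each in '4.0 - 4.4' and 'Below 3.5'.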

How are book page ranges distributed?

-- Categorize books into page count ranges and count the number of books in each range.
-- The paperback column holds the page count and is cast to integer once.
-- Books under 100 pages get their own range instead of falling into '1500+'.
SELECT
    page_range,
    COUNT(*) AS num_of_books
FROM (
    SELECT
        CASE
            WHEN pages < 100 THEN 'Under 100'
            WHEN pages < 300 THEN '100-299'
            WHEN pages < 500 THEN '300-499'
            WHEN pages < 700 THEN '500-699'
            WHEN pages < 900 THEN '700-899'
            WHEN pages < 1100 THEN '900-1099'
            WHEN pages < 1300 THEN '1100-1299'
            WHEN pages < 1500 THEN '1300-1499'
            ELSE '1500+'
        END AS page_range
    FROM (
        SELECT CAST(paperback AS integer) AS pages
        FROM books
    ) AS page_counts
) AS grouped_books
GROUP BY page_range
ORDER BY num_of_books DESC;

Book Page Range Distribution

The chart shows the distribution of book page ranges. The most common range is 100-299 pages, followed by 300-499 pages and 500-699 pages. A small number of books have more than 1100 pages.

The chart suggests that most books in the sample are relatively short. Possible explanations include the increasing popularity of e-books, the attention span of modern readers, and the cost of publishing longer books, though this data cannot confirm any of them.
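The page ranges follow a regular pattern of 200-page steps starting at 100, so the label can be computed instead of listed case by case. The sketch below is a hypothetical Python helper that assumes the page count arrives as a numeric string, as the cast in the query implies. It gives books shorter than 100 pages their own label, an edge case that a plain range check starting at 100 would send to the catch-all bucket.

```python
def page_range(paperback: str) -> str:
    """Return the 200-page range label for a page count stored as text."""
    pages = int(paperback.strip())
    if pages < 100:
        return "Under 100"
    if pages >= 1500:
        return "1500+"
    low = 100 + ((pages - 100) // 200) * 200  # lower bound of the 200-page range
    return f"{low}-{low + 199}"
```

For example, `page_range("256")` gives '100-299' and `page_range("1600")` gives '1500+'.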

LIMITATIONS

One of the limitations of my study is the small size of my dataset. I only collected data on 163 books about data analysis, which may not be representative of the entire population of books on this topic. Moreover, I only scraped data from one source, Amazon books, which may have some bias or error in the data quality or availability. Therefore, my results may not be generalizable or reliable for other books or sources. A larger and more diverse dataset would be needed to validate and extend my findings.

CONCLUSION

This article has analyzed the distribution of book prices, ratings, and page counts, along with the most-rated authors, for data analysis books on Amazon. The results suggest that this segment is dominated by well-received, relatively short, and inexpensive books.
The insights from this analysis can be useful for authors, publishers, and readers alike. Authors can use the insights to make informed decisions about their writing and publishing strategies. Publishers can use the insights to identify trends in the book market and to develop marketing campaigns that appeal to target audiences. Readers can use the insights to discover new authors and books that are likely to interest them.

The source code for the project is available on my GitHub repository.
