Data Analysis for Bestselling Books

Insights discovered by using SQL

Amy Yang
10 min readJan 22, 2022
Photo by Renee Fisher on Unsplash

Walking into a book store, our attention is always caught by the bestselling books. Have you ever wondered about similar questions like below?

  • Which books were ranked on the bestseller over multiple years?
  • Who are the authors of these books?
  • Are these books with a high rating and a large number of reviews?

In this article, we will look at the dataset Amazon top 50 Bestselling Books 2009–2019, ask some interesting questions from different aspects of bestsellers, and find out the answers with SQL analysis. The SQL query covered here includes:

  • Select…From…
  • Group by
  • Where and Having
  • Order by
  • Limit
  • Join and Union
  • Aggregations
  • Case statement
  • Common Table Expression (CTE)
  • Analytic functions

For SQL syntax details, you could refer to the link.

Dataset

The dataset is made up of 7 measures: Name, Author, User Rating, Reviews, Price, Year, and Genre. Reviews represent the number of written reviews on Amazon and Year stands for the Year(s) the book ranked on the bestseller. Genre is either fiction or non-fiction. The first 10 records are extracted and included in the table below.

Questions and Findings

To make it easier for us to come up with a question, the measures are assigned into three groups:

  • relating to the book itself: Name, Price, Year and Genre
  • relating to the author(s) of the book: Author
  • relating to the feedback from the readers of the book: User Rating, Reviews

We then focus on each group and ask questions from these three perspectives.

Book Itself 📚

Number of books

Firstly we would like to know how many books were listed as top 50 bestsellers from 2009 to 2019. As there may be some books ranked on the bestseller for multiple years, DISTINCT is used to count the number of unique books in the query below. As shown in the result, there are 350 books in total.

SELECT count(DISTINCT name)total_number
FROM bestsellers;
Result:
+--------------+
| total_number |
+==============+
| 350 |
+--------------+

Among these books, can we know how many of them are fiction and nonfiction respectively? Yes, we can use GROUP BY genre or GROUP BY 1 to split the books into two groups and count the number of unique books for each group. Note that number 1 could replace ‘genre’ in the GROUP BY statement as ‘genre’ is the first column in the SELECT statement. We could see that 160 books are fiction and the other 190 ones are nonfiction.

SELECT genre, count(DISTINCT name) number_of_books
FROM bestsellers
GROUP BY 1;
Result:
+-------------+------------------+
| genre | number_of_books |
+=============+==================+
| Fiction | 160 |
| Non Fiction | 190 |
+-------------+------------------+

Books ranked on the bestsellers more than one year

If we add up the top 50 books for 11 years (2009–2019), we should get 550 in total. Since there are only 350 instead, there must be some books ranked as bestsellers over multiple years. What are these books? Which genre do they belong to? To find out the answer, we could use the GROUP BY clause for both book name and genre and count the number of years that the books were ranked on the bestseller. HAVING statement ensures only the result satisfying the criteria is returned. In this case, only the books being bestsellers for more than one year will be filtered in. ORDER BY is used to sort the exported data in descending order by 3rd column (number_of_years) and then in ascending order by 1st column (name). By default, the order will be ascending and thus ASC can be omitted after number 1.

SELECT name, genre, count("Year") number_of_years
FROM bestsellers
GROUP BY 1,2
HAVING count("Year")>=2
ORDER BY 3 DESC,1;

As shown in the table above, 95 books are being ranked in the top 50 bestsellers for multiple years. Could we know how many of them were ranked for 2 years? Are there more fiction books being ranked for 2 years than nonfiction ones? Here we could use Method 1 Common Table Expression (CTE) to name the result of the temporary table ‘more_than_1year’ and then refer to this table in SELECT…FROM…statement and GROUP BY number_of_years and genre to count the books. The result could also be achieved by using Method 2 Subquery.

Method 1 - CTEWITH more_than_1year AS (
SELECT name, genre, count("Year") number_of_years
FROM bestsellers
GROUP BY name, genre
HAVING count("Year")>=2
)
SELECT number_of_years, genre, count(*) number_of_books
FROM more_than_1year
GROUP BY 1,2
ORDER BY 1 DESC,2;
-----------------------------------------------------------Method 2 - SubquerySELECT number_of_years, genre, count(*) number_of_books
FROM (
SELECT name, genre, count("Year") number_of_years
FROM bestsellers
GROUP BY name, genre
HAVING count("Year")>=2
) AS more_than_1year
GROUP BY 1,2
ORDER BY 1 DESC,2;

The result is as below:

A graph is made to provide a better visualization to compare the fiction and nonfiction books by the number of years they were ranked on the bestseller. It is found that more nonfiction books were ranked as bestsellers for over 4 years compared to fiction books.

Picture by the author

Price Summary

We may wonder what the price of bestselling books looks like. What is the minimum or maximum price? Do fiction and nonfiction books differ a lot in the average price? By using the aggregate function such as MIN(), MAX(), AVG(), we could calculate the minimum, maximum and average price respectively. Median is calculated as the 50th percentile by using the PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY NUM_VALUE) suggested by Postgresql wiki. Note this is only applied to calculate the median in Postgresql. For BigQuery users, this article explains how the median is calculated in detail.

SELECT
genre,
min(price) min_price,
max(price) max_price,
round(avg(price),2) mean_price,
PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER BY price) median_price
FROM
bestsellers
GROUP BY 1;
Result:
+-------------+-----------+-----------+------------+--------------+
| genre | min_price | max_price | mean_price | median_price |
+=============+===========+===========+============+==============+
| Fiction | 0 | 82 | 10.85 | 9 |
| Non Fiction | 0 | 105 | 14.84 | 12 |
+-------------+-----------+-----------+------------+--------------+

As the result shows, both fiction and nonfiction bestselling books could be without a cost. What are these books? We could use the WHERE statement and add the criteria ‘price = 0’ to extract the free books only.

SELECT genre, name, author, "Year", price
FROM bestsellers
WHERE price = 0
ORDER BY 1,2,4

In the query statements above, first of all, the minimum price is calculated and then the books with the minimum price are extracted. Is it possible for us to combine these two steps into one query? The answer is Yes!

One solution is to use CTE and Join together. In the query below, firstly CTE is used to keep the result of the minimum price by each genre with the table named ‘min_price’. This table is then joined with the original table ‘bestsellers’ to extract the information of those free books including name, author, the year ranked on the bestseller, and its price. To make the query a bit shorter, aliases ‘m’ and ‘b’ are used to stand for table ‘min_price’ and ‘bestsellers’ respectively. We can see the book ‘To Kill a Mockingbird’ is listed 4 times and this is because the book was a bestseller for multiple years (2013–2016). To get a unique list of books with minimum price, we can use the GROUP BY clause with columns ‘genre’ and ‘name’ in the query.

WITH min_price AS
(SELECT genre, min(price) min_price
FROM
bestsellers
GROUP BY
genre)
SELECT m.genre, b.name, b.author, b."Year", b.price
FROM min_price m LEFT JOIN bestsellers b
ON m.genre=b.genre AND m.min_price=b.price
ORDER BY 1,2,4

The result is as below:

Price distribution

Can we have a brief idea about how the prices are distributed, similar to a histogram? Yes, we can!

As each of the books in the table above has 2 or more prices, first of all, a temporary table ‘books_with_median_price’ is created to make sure each book has a unique price by using its median price. Then we allocate the prices into four groups: 0, 0–20, 20–40, and 40+ and count the number of books for each price group. Two different methods are used below to achieve the same result.

Method 1 - Case statementWITH books_with_median_price AS(
SELECT name,
PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER BY price) price
FROM bestsellers
GROUP BY name
)
SELECT
CASE WHEN price =0 THEN '0'
WHEN price >0 AND price<=20 THEN '0-20'
WHEN price >20 AND price<=40 THEN '20-40'
ELSE '40+' END AS price_range,
count(name) count
FROM books_with_median_price
GROUP BY price_range
ORDER BY price_range;
-----------------------------------------------------------------Method 2 - UnionWITH books_with_median_price AS(
SELECT name,
PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER BY price) price
FROM bestsellers
GROUP BY name
)
SELECT
'0'AS price_range,
count(name) AS Count
FROM books_with_median_price
WHERE price=0
UNION (
SELECT '0-20' AS price_range,
count(name) AS Count
FROM books_with_median_price
WHERE price>0 AND price<=20
)
UNION (
SELECT
'20-40' AS price_range,
count(name) AS Count
FROM books_with_median_price
WHERE price>20 AND price<=40
)
UNION (
SELECT
'40+' AS price_range,
count(name) AS Count
FROM books_with_median_price
WHERE price>40)
ORDER BY price_range;
Result:
+-------------+-------+
| price_range | count |
+=============+=======+
| 0 | 9 |
| 0-20 | 301 |
| 20-40 | 33 |
| 40+ | 7 |
+-------------+-------+

As the result reveals above, over 88% of bestselling books are sold at a price equal to or below 20.

The author(s) of the book ✍

Author with the most bestselling books

We wonder who is the author having the most bestselling books within the period 2009–2019. The query below will help us find the answer. We use GROUP BY and COUNT(DISTINCT name) to find the number of bestselling books for each author and then sort the number from largest to smallest with ORDER BY 2 DESC. This is followed by statement LIMIT 1 which extracts the top row as the returned result. Based on our analysis, Jeff Kinney has the most books on Amazon’s Top 50 bestsellers during the period.

SELECT author, count(DISTINCT name) number_of_books
FROM bestsellers
GROUP BY 1
ORDER BY 2 DESC
LIMIT 1;
Result:
+-------------+-----------------+
| author | number_of_books |
+=============+=================+
| Jeff Kinney | 12 |
+-------------+-----------------+

What about his books? Which of them were on the bestseller? In the query below, we use the previous query in the WHERE clause to extract the bestselling books that were written by Jeff.

SELECT name, "Year", genre, price, author
FROM bestsellers
WHERE author =
(
SELECT author
FROM bestsellers
GROUP BY 1
ORDER BY count(DISTINCT name) DESC
LIMIT 1
)
ORDER BY 2,1

The result is as below:

The feedback from the readers of the book 👀

Highest rating for each author and rating variance among their books

Do authors with bestselling books have a consistent rating? What is the highest rating for each author? We can figure out the answer by using the Analytic function MAX() with an OVER clause which includes three parts:

  • PARTITION BY clause divides the rows of table ‘bestsellers’ by author
  • ORDER BY clause sorts the records in each partition using Year in ascending order
  • ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING is a window frame clause which means all rows in the partition are used for calculation. In our case, all the books with the same author are considered to calculate the highest rating.
WITH max_rating AS (
SELECT author,
name,
"Year",
"User Rating",
max("User Rating") OVER (PARTITION BY author ORDER BY "Year"
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS max_rating
FROM bestsellers
ORDER BY 5 DESC,1)
SELECT *,
"User Rating"-max_rating AS rating_variance
FROM max_rating
ORDER BY 5 DESC,1;

The result is as below:

We could see that the rating is as high as 4.9 for a few books such as Little Blue Truck written by Alice Schertle, Humans of New York by Brandon Stanton, etc. If the export is sorted by column rating_variance in ascending order, we could see that the rating of books for the same author can be 0 to 1.6 below their highest-rated books.

It is discovered that authors have a relatively consistent rating for their bestselling books during the years 2009–2019. This is mainly because numbers of books were ranked on the bestsellers for multiple years and that all the same books have an identical rating except the following two books:

  • The 7 Habits of Highly Effective People: Powerful Lessons in Personal Change
  • The 5 Love Languages: The Secret to Love that Lasts

These two books can be identified using the query below.

SELECT name
FROM bestsellers
GROUP BY 1
HAVING count(DISTINCT "User Rating" )>1
ORDER BY 1

Books with top 10 reviews each year

What are the books with the most reviews each year? For example, top 10 books based on the number of reviews. The following query applies the Analytic function RANK() with an OVER clause including:

  • PARTITION BY clause dividing the rows of table ‘bestseller’ by Year; and
  • ORDER BY clause sorting the records in each partition by reviews in descending order

The rank of each book by the number of reviews is recorded in the new column ‘rank_of_reviews’. The WHERE clause filters the rank ranging from 1 to 10 so that only the books with top 10 reviews are captured each year.

WITH top10_reviews AS(
SELECT
"Year",
name,
author,
reviews,
RANK() OVER (
PARTITION BY "Year"
ORDER BY reviews DESC
) AS rank_of_reviews
FROM bestsellers
)
SELECT *
FROM top10_reviews
WHERE rank_of_reviews BETWEEN 1 AND 10
ORDER BY 1 DESC,5

The result is as below:

Conclusion

By using SQL analysis, we succeeded in finding the answers to our questions about bestselling books and discovering some interesting insights on its price, genre, author, rating, reviews, and so on. Feel free to share the questions you are interested in regarding bestsellers and your solution to tackle the questions. Hope you enjoy reading this article. Appreciate any feedback!

--

--

Amy Yang

Research Assistant @ UNSW | Data science student @UTS. Enthusiastic about data-driven insights and data science applications