SQL Databases (Goodreads Part IV)

Connor Higgins
Connor Higgins
Published in
4 min readDec 11, 2019

Almost all analysts will need to interact with relational databases nowadays. R has excellent options for data manipulation in dplyr and data.table but the issue it runs into is it needs to hold all that data in memory … meaning that if you are working with a particularly large database then SQL queries are still your most likely route for initially extracting the data. This makes it a perfect time to return to the Goodreads data collected in past posts, and organize the data into a relational database (RDB).

Let’s take a look at Science Fiction this time around. We have here data collected from the Best Science Fiction reading list by the webscraper described by previous posts.

To get a basic feel for the data, lets print the first 20 titles. We can also use SQLite’s PRAGMA function for a quick glimpse at the table structure.

PRAGMA table_info(bestScienceFiction);SELECT *  FROM bestScienceFiction
LIMIT 20;

Of course with a database of science fiction authors we can look up specific authors. Isaac Asimov’s Foundation series was one of the reasons I gained an interest in statistics in the first place. Let’s see which of his books are present in the database.

SELECT * FROM bestScienceFiction
WHERE authors = "Isaac Asimov";

We can update tables quite easily with SQL as well. As an example let’s say we want to replace an author’s pen name with their actual name. There are a huge number of authors in science fiction who use(d) pen names, either for economic or social reasons. One ongoing sci-fi series that is being written under a pen name is also one of my favorites — the Expanse.

SELECT * FROM bestScienceFiction
WHERE authors LIKE "%S.A. Corey";

This is a series written by authors Daniel Abraham and Ty Franck, who both decided it would be easier to sell their work under a single author. Based on their middle names they created “James S.A. Corey.” We can update our database to reflect the actual authors’ names with the following code.

UPDATE bestScienceFiction
SET authors = "Daniel Abraham, Ty Franck"
WHERE authors = "James S.A. Corey";
/* SELECT an author */
SELECT * FROM bestScienceFiction
WHERE authors = "Daniel Abraham, Ty Franck";
COMMIT;

We can see the titles with the most number of reviews (which can serve as a proxy for the most read books).

However you’ll note it’s slightly inconvenient. The data comes from a webscraper written in R, where having a period in a field name poses no issue. In SQL though, having a period requires square brackets to select it.

SELECT *  FROM bestScienceFiction
ORDER BY [total.ratings] DESC
LIMIT 200;

Let’s alter the field to a name that is easier to work with.

ALTER TABLE bestScienceFiction
RENAME [total.ratings] TO "total_ratings";
SELECT * FROM bestScienceFiction
ORDER BY total_ratings DESC
LIMIT 200;

Much nicer.

Now that we can see the most popular titles, let’s also compare the most popular authors. For this step we will use the average_rating column to compute the average of all ratings received by an author. We will also only include authors with more than 4 titles in the list.

/* SELECT top average ratings per author (and sum number of titles) */
SELECT authors,
avg(average_rating) as Overall_Average,
COUNT(DISTINCT title) as DISTINCT_TITLES FROM bestScienceFiction
GROUP BY authors
HAVING DISTINCT_TITLES > 4
ORDER BY Overall_Average DESC
;

This was accomplished with an SQL window function. But its worth noting that in many cases it would be preferable to transfer tasks such as these to either R (dplyr/data.table) or Python (pandas). In many cases doing so offers greater utility–either for graphing purposes, for use in prediction or simply speed.

We can also look at merging data from this table with that of another speculative fiction genre — fantasy. As the two sub-genres fall within another wider genre, we can be fairly confident that there will be overlap, and there is.

SELECT * FROM bestScienceFiction 
INNER JOIN bestEpicFantasy
ON bestScienceFiction.goodreadsID = bestEpicFantasy.goodreadsID
ORDER BY authors;

This concludes the first SQL post and final (planned) Goodreads post. Next we are moving on from the Goodreads dataset and onto more topics in SQL and Python as well.

--

--

Connor Higgins
Connor Higgins

Current graduate student at Northeastern University, pursuing a career in data science. Also an avid reader of speculative fiction!