Best Bookshelf: Data Visualization Adapting Real World Objects with D3.js

An introduction to the data visualization project Best Bookshelf.

What’s the first visual form that comes to mind when you hear “data visualization”? Bar charts or line graphs? Something fancier, like a heatmap or a force-directed network? While these proven techniques are widely used to communicate data effectively, a unique visual form can make better sense when it matches the topic, and it can simply be more visually pleasing, without sacrificing the most important role of data visualization: the truthful encoding of numbers into visual elements.

This post is about the design process of my recent data visualization project, Best Bookshelf. I visualize book and author data by adapting real-world objects: the book and the bookshelf. The adaptation has three aspects.

  1. The data dimensions of each book are encoded into the physical attributes of a book.
  2. The ways of grouping and ordering books are similar to various rules of storing books on a bookshelf (everybody has her or his own way!).
  3. Interactions and the animations that follow replicate the action of picking up and moving books on a bookshelf.
How would you organize your books on the bookshelf? (source)
Curious to know what the height, width, diagonal lines, tags, and colors mean? How would you organize books?

Best books by the New York Times & their metadata

I care about the quality of the content I spend time on. While looking for good book recommendations, I came across an article, The 10 Best Books of 2016, by the New York Times. Starting from it, I listed all of the best books back to 1996, the first year that the NYT announced its best/notable books. Later, using the Goodreads APIs and the NYTimes Books API (and, of course, filling in spreadsheets manually), I collected metadata for each book and its author(s). The Python scripts are available in my GitHub repo.

In the end, I had 206 books and 184 authors. The dataset includes the following fields. I list them here by data type, since these are what will be visually encoded during the design process.

Nominal data

  • Title of the book: e.g., Room
  • Name of the author: e.g., Emma Donoghue
  • Publisher: e.g., Little, Brown, and Company

Categorical data — each type has two values

  • Genre: Fiction or Nonfiction
  • Gender of author: Female or Male
  • Best seller: Whether or not the book is a NYTimes best seller
  • Language: Originally written in English or translated

Quantitative data — ordinal

  • Year of selection: e.g., 2010
  • Publication and Original publication date
  • Birth date of author: e.g., October 24, 1969
  • Death date of author: this can be null

Quantitative data — continuous

  • Number of pages
  • Number of Goodreads users who rated the book
  • Average Goodreads rating
  • Age of author at the original publication: calculated from the original publication date and birth/death date.
  • Length of book title
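
The two derived fields in this list (age of author at original publication and title length) can be computed from the collected fields. The following is a minimal sketch, not the project’s actual script (the data preparation was done in Python). The function names, the ISO date-string format, and the choice to count spaces in the title length are assumptions made for illustration. The post does not say how the death date enters the calculation, so the sketch ignores it.

```javascript
// Hypothetical helper: age in whole years at a given date.
// Assumes both dates are ISO strings such as '1969-10-24'.
function ageAt(birthDate, eventDate) {
  const birth = new Date(birthDate);
  const event = new Date(eventDate);
  let age = event.getUTCFullYear() - birth.getUTCFullYear();
  // Subtract one year if the birthday has not yet occurred in the event year.
  const beforeBirthday =
    event.getUTCMonth() < birth.getUTCMonth() ||
    (event.getUTCMonth() === birth.getUTCMonth() &&
      event.getUTCDate() < birth.getUTCDate());
  if (beforeBirthday) age -= 1;
  return age;
}

// Hypothetical helper: title length in characters (spaces included).
function titleLength(title) {
  return title.trim().length;
}

// The birth date is the example from the list above; the publication
// date is a made-up placeholder, not taken from the dataset.
console.log(ageAt('1969-10-24', '2010-09-13')); // 40
console.log(titleLength('Room')); // 4
```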

Initial Examination of Dataset

At an early phase of visualization design, I usually do a quick examination of the dataset to see if there are any interesting patterns. Common tools I use are R with ggplot and Python with Matplotlib or Seaborn. For this project, I also tried Tableau Public. Through this initial work, I found some facts about the dataset. To list a few: most books have between 200 and 400 pages, and the authors of fiction span a wider range of ages than those of nonfiction. With the data for the eleven years from 1996 to 2006, I did not see much imbalance in the gender of authors (but you’ll see something different at the end of the post!).

Age of author at publication & Number of pages, generated with Python Matplotlib
Gender of authors of the best books from 1996 to 2006

In addition, I created small multiples by year. This is made with D3.js.

Small multiples: visualizing best books by year

At this point, I was not convinced that the dataset was compelling enough for an interactive visualization project. Honestly, I became quite skeptical about continuing this project despite the effort I had put into generating the dataset.

Design Inspiration from Everyday Objects

Having admitted that simply plotting books on a two-dimensional chart wouldn’t be interesting, I came up with a different approach: listing and filtering books by the various data dimensions. In this way, books as data points can be displayed linearly; then I realized this is exactly how people organize books on a bookshelf.

Encoding data fields into physical attributes of a book

Among the physical properties of a book, I focused on those that can actually be seen when a book stands vertically on a shelf. You see only the book spine, a long thin rectangle whose width is determined by the number of pages (assuming the paper thickness is the same). Thus, it was an intuitive decision to encode the number of pages into the width of the rectangle. The choice for height was rather arbitrary, but I wanted it to be realistic: the dimension of a physical attribute should be able to represent the range of a data field. Among the continuous data dimensions, I found that the average Goodreads ratings of the books are mostly above three out of five, so I encoded the rating into the height, mapping ratings from three to five onto a limited range of book heights. From the prior data examination, I also found that the writer’s age at publication could be interesting when visualized. Thus, this continuous variable is presented as a semi-transparent overlay on the spine. The JavaScript code with D3 and Lodash looks like:

import * as d3 from 'd3';
import _ from 'lodash';

// `books` (the loaded dataset) and `ages` (author ages at publication,
// extracted like `pages` below) are assumed to be defined earlier.
const storyH = 100; // story (shelf) height, same as the maximum book height
const bookWRange = [10, 60]; // book thickness range
const bookHRange = [60, storyH]; // book height range
const pages = books.map((d) => d.book.pages);
const pageRange = getRange(pages, 100);
const ageRange = getRange(ages, 10);
// pages to width
const bookW = d3.scaleLinear().domain(pageRange).range(bookWRange);
// average rating to height
const bookH = d3.scaleLinear().domain([3, 5]).range(bookHRange);
// age to the height of the overlay
const middleH = d3.scaleLinear().domain(ageRange)
  .range([10, bookHRange[0]]);
// round the minimum down and the maximum up to multiples of `by`
function getRange(arr, by) {
  return [
    Math.floor(_.min(arr) / by) * by,
    Math.ceil(_.max(arr) / by) * by
  ];
}
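
To make the scale mappings above concrete, here is a dependency-free sketch of what a linear scale computes. `linearScale` is a hypothetical stand-in written for this illustration (it is not D3’s implementation and omits features such as clamping), and the page counts are made-up sample values.

```javascript
// Hypothetical stand-in for d3.scaleLinear().domain(domain).range(range),
// reduced to plain linear interpolation for illustration.
function linearScale(domain, range) {
  const [d0, d1] = domain;
  const [r0, r1] = range;
  return (value) => r0 + ((value - d0) / (d1 - d0)) * (r1 - r0);
}

// Same rounding helper as in the post, written without Lodash.
function getRange(arr, by) {
  return [
    Math.floor(Math.min(...arr) / by) * by,
    Math.ceil(Math.max(...arr) / by) * by,
  ];
}

const samplePages = [212, 368, 905]; // made-up page counts
const pageRange = getRange(samplePages, 100); // [200, 1000]
const bookW = linearScale(pageRange, [10, 60]); // pages to width
const bookH = linearScale([3, 5], [60, 100]); // rating to height

console.log(bookW(600)); // 35, halfway through the page range
console.log(bookH(4)); // 80, halfway between ratings 3 and 5
```

Rounding the domain outward to multiples of 100 pages keeps the thinnest and thickest books away from the exact ends of the width range unless they happen to sit on a round number.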

In the dataset, there are four kinds of categorical data, and I wanted to encode these into other physical attributes of the book as well. All of these dimensions are binary, that is, they have two possible values. Genres are color-coded (green for fiction, red for nonfiction). Books by women authors are filled with diagonal lines (gender is usually represented with colors, but I wanted to avoid something too conventional). The New York Times best sellers are tagged with a star. Translated books are highlighted with a triangle on the top left edge.
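
The four binary encodings can be summarized as a small lookup function. This is only a sketch: the field names (`genre`, `gender`, `bestSeller`, `translated`) and the returned property names are hypothetical and may not match the project’s data structure.

```javascript
// Sketch: map the four binary fields of a book to visual attributes
// of its spine. All field and property names are hypothetical.
function spineStyle(book) {
  return {
    color: book.genre === 'Fiction' ? 'green' : 'red', // genre as color
    hatched: book.gender === 'Female', // diagonal lines for women authors
    star: book.bestSeller === true, // star tag for NYT best sellers
    triangle: book.translated === true, // triangle mark for translations
  };
}

const style = spineStyle({
  genre: 'Fiction',
  gender: 'Female',
  bestSeller: true,
  translated: false,
});
console.log(style);
// { color: 'green', hatched: true, star: true, triangle: false }
```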

In total, seven data dimensions are represented as physical attributes of a book. These legends are introduced on the actual project page.

Organizing Books — Grouping and ordering dataset

Let’s revisit the data types: nominal, categorical, quantitative ordinal, and quantitative continuous. To organize books, categorical and ordinal variables are used for grouping the data points, whereas nominal and continuous variables are used for ordering.

Organizing (grouping or sorting) options in the HTML page: first, the books are grouped by year, then by genre. The labels at both levels are also displayed.

These grouping and ordering functions are implemented as dropdown selections in the HTML page. Once a user selects the dimension to group or order the dataset by, all books are rearranged. If the first selected item is a dimension for grouping, books can be further grouped or ordered within the first category (if the second level is also a grouping, books within each second-level group are ordered as listed in the original dataset). If the user chooses a nominal or continuous dimension, that is, one used for ordering, the second option becomes invisible.
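
The two-level logic described above can be sketched in plain JavaScript. This is an illustration under assumptions, not the project’s code: the option shape (`{ type, key }`), the function names, and the sample records are all hypothetical.

```javascript
// Group items by a key function, keeping dataset order inside each group.
function groupBy(items, keyFn) {
  const groups = new Map();
  for (const item of items) {
    const key = keyFn(item);
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(item);
  }
  return groups;
}

const ascending = (a, b) => (a < b ? -1 : a > b ? 1 : 0);
const sortBy = (items, keyFn) =>
  [...items].sort((a, b) => ascending(keyFn(a), keyFn(b)));

// `first` and `second` are hypothetical option objects:
// { type: 'group' | 'order', key: (book) => value }
function organize(books, first, second) {
  // An ordering as the first option leaves no second level.
  if (first.type === 'order') return sortBy(books, first.key);

  const groups = groupBy(books, first.key);
  return [...groups.keys()].sort(ascending).map((label) => {
    const members = groups.get(label);
    if (!second) return { label, books: members };
    if (second.type === 'order') {
      return { label, books: sortBy(members, second.key) };
    }
    // Second-level grouping: dataset order is kept within each subgroup.
    const sub = groupBy(members, second.key);
    return {
      label,
      groups: [...sub.keys()]
        .sort(ascending)
        .map((subLabel) => ({ label: subLabel, books: sub.get(subLabel) })),
    };
  });
}

// Made-up sample records.
const sample = [
  { title: 'B', year: 2010, genre: 'Fiction', pages: 300 },
  { title: 'A', year: 2009, genre: 'Nonfiction', pages: 250 },
  { title: 'C', year: 2010, genre: 'Nonfiction', pages: 200 },
  { title: 'D', year: 2010, genre: 'Fiction', pages: 150 },
];

const byYearThenGenre = organize(
  sample,
  { type: 'group', key: (d) => d.year },
  { type: 'group', key: (d) => d.genre }
);
const byYearThenPages = organize(
  sample,
  { type: 'group', key: (d) => d.year },
  { type: 'order', key: (d) => d.pages }
);
```

Here `byYearThenGenre` holds two year groups (2009, 2010), and within 2010 the fiction subgroup keeps the dataset order B, D, while `byYearThenPages` orders the 2010 books as D, C, B.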

When the books are sorted by the number of pages (ascending), the dividers appear and the labels are seen as the first-level grouping

The results of sorting or grouping are incorporated into the visualization. I put the labels of the first option above the books, that is, on the upper shelf. For example, when the books are first grouped by year, you’ll see the year and the number of books within that year. The second-level labels are displayed on the dividers between the books. Just as the book title or author’s name is rotated on the book spine, the label of the second-level option is also rotated. When the books are sorted without prior grouping, I added dividers between books for better indexing.

D3 Transition for physical movement of the book

Moving the mouse over a book lifts the book and shows a tooltip.

When the mouse hovers over a book, the vertical position of the book changes, mimicking a book being picked up. This subtle transition carries over a feature of the real-world object.
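
A hover lift of this kind might look like the sketch below. It assumes `d3` is available as a global (for example, loaded via a script tag), that each book is an SVG `<g>` element, and that its resting position is stored in the bound datum as `d.x` and `d.y`. Those field names, the lift distance, and the duration are assumptions, not values from the project.

```javascript
// Vertical offset for a book; in SVG, a negative y offset moves it up.
function liftOffset(isHovered, liftBy = 10) {
  return isHovered ? -liftBy : 0;
}

// Animate one book (a D3 selection of a single <g>) to its lifted
// or resting position.
function move(book, isHovered) {
  book
    .transition()
    .duration(200)
    .attr('transform', (d) => `translate(${d.x}, ${d.y + liftOffset(isHovered)})`);
}

// Attach the hover behaviour to a D3 selection of book groups.
// Uses the global `d3` only when an event actually fires.
function attachLift(bookSelection) {
  bookSelection
    .on('mouseover', function () {
      move(d3.select(this), true);
    })
    .on('mouseout', function () {
      move(d3.select(this), false);
    });
}
```

The handlers rely only on `this`, so the sketch does not depend on the handler argument order, which differs between D3 versions.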

Reorganizing books means changing the positions of the visual elements. Each book’s transition is animated at a random speed: first to the new position on the same story (X position), then to the new story (Y position).
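
The two-step movement can be expressed with chained D3 transitions, where the second transition starts once the first one ends. The sketch below is an assumption about how this could be done, not the project’s implementation. The duration bounds and the `{ x, y }` position objects are hypothetical.

```javascript
// Random duration in milliseconds within [min, max); the bounds are
// assumptions. `random` is injectable to make the function testable.
function randomDuration(min = 400, max = 1200, random = Math.random) {
  return min + random() * (max - min);
}

// Move one book (a D3 selection of a single <g>) to its new slot:
// horizontally along the current story first, then to the new story.
function moveBook(book, from, to) {
  book
    .transition()
    .duration(randomDuration())
    .attr('transform', `translate(${to.x}, ${from.y})`)
    .transition() // chained: scheduled to start when the X move ends
    .duration(randomDuration())
    .attr('transform', `translate(${to.x}, ${to.y})`);
}
```

Because every book draws its own durations, the books arrive at slightly different times, which reads more like shuffling physical books by hand than a synchronized animation.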

Insights from Best Bookshelf

As I described earlier, I did not find many striking facts (patterns, correlations, or outliers) about the way the New York Times selects best books. However, by adapting the logic and aesthetics of a bookshelf, this visualization helps investigate the visually encoded properties of the books. Some fun insights I discovered are:

  • 2009 is the first year that the New York Times chose more books by female authors than by male authors. Since 1996, there had been an imbalance between women and men authors. I personally want to read more of what women say, so this finding is important.
  • Nonfiction books tend to have longer titles, mostly due to their descriptive subtitles. Only three nonfiction books have titles shorter than 10 letters, in contrast to twenty-three fiction books.
  • Books originally written in English are dominant. Only eleven books are translations, none of which were selected before 2003.
  • Being a best seller doesn’t necessarily mean a high number of ratings on Goodreads. However, it seems to be more related to the average rating. The chance of being a best seller increases by about 50% if the average rating is higher than 4 out of 5.
  • Only four nonfiction books were published when the author was 35 years old or younger, whereas fifteen fiction books were written when the writer was 35 or younger. The five youngest authors are all novelists.

In the final design, I added an input form where users can search by author’s name, title, or publisher. As a user types three or more letters, the total number of results appears. In this way, you can learn the number of books by publisher. Knopf has the largest number of books, 25, followed by Penguin Press with 16 and Random House with 15.
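
The search behaviour can be sketched as a simple filter. The field names (`title`, `author`, `publisher`), the case-insensitive substring matching, and the sample records other than the Room example are assumptions for illustration; the project may match differently.

```javascript
// Return the books whose title, author, or publisher contains the query.
// Queries shorter than three characters return null (no search is run).
function searchBooks(books, query) {
  const q = query.trim().toLowerCase();
  if (q.length < 3) return null;
  return books.filter((d) =>
    [d.title, d.author, d.publisher].some((field) =>
      field.toLowerCase().includes(q)
    )
  );
}

// The first record uses the example from the field list; the other
// two are made-up placeholders.
const sampleBooks = [
  { title: 'Room', author: 'Emma Donoghue', publisher: 'Little, Brown, and Company' },
  { title: 'Sample Title A', author: 'Author One', publisher: 'Knopf' },
  { title: 'Sample Title B', author: 'Author Two', publisher: 'Knopf' },
];

const results = searchBooks(sampleBooks, 'kno');
console.log(results.length); // 2
```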

This project is an example of data visualization that does not directly use conventional charts or graphs. Instead, drawing on the theme of the dataset, books, I applied the physical attributes of a book and a bookshelf to the visualization design: a single datum as a book (encoding quantitative continuous data into the size of the book, and categorical data into colors, patterns, and marks), grouping and ordering the dataset as the rules of organizing books, and D3 transitions as the movement of books while organizing them. I hope this article is helpful for those who want to create customized data visualizations that are creative yet not confusing to users.

Did you find anything interesting about the best books? Let us know via responses!