Physical book sales are once again increasing after years of decline amidst disillusionment with social media and mobile-centered lives. While Twitter APIs, Facebook Analytics, and Google offer an endless torrent of data, the book market also offers insights into the Zeitgeist.
According to a sample of 54,000 Amazon e-book titles, the chance of a book rated at less than 3 stars selling a considerable number of copies is miniscule. Highly-rated genre fiction makes up the majority of sales.
Books rated below 4 stars made up 19.4% of titles and 17.6% of gross sales, while books below 3 stars made up 5.6% of sales — indicating that both customers tend to give positive ratings and sellers are incentivized to game the ratings system with fake positive reviews.
Project Gutenberg is ‘a volunteer effort to digitize and archive cultural works, founded in 1971 as the world’s oldest digital library.’ I analyzed a dataset detailing their most popular 1,000 downloads.
Half of them total 65,000 words or less. The truly long books, over half a million words, are the domain of religion (The King James Bible, The Mahabharata, Antiquities of the Jews), lifetime compendiums (The Complete Works of William Shakespeare, Essays of Michel de Montaigne, The Journals of Lewis and Clark), infamously intricate novels (War and Peace, Les Misérables), and curiously, a Victorian-era self- and home-improvement book, Mrs. Beeton’s Book of Household Management.
Statistics, unfortunately, aren’t readily available on the proportion of readers who ambitiously start these tomes only to quickly and indefinitely put them aside.
Most sentences in the books cluster around a normal distribution of just under 20 words per sentence. Extremely verbose outliers predictably include philosophical classics: Aristotle’s Politics: A Treatise on Government and René Descartes’ Discourse on the Method of Rightly Conducting One’s Reason and of Seeking Truth in the Sciences.
Post-war science fiction dominates the low end of the spectrum of under ten words per sentence, including Mr. Spaceship and The Variable Man by Philip K. Dick.
On the surface it seems likely that ostentatiously loquacious writers (sorry) would be equally prone to flights of fancy with word choice, but the main goal of proper comprehension precludes the use of maximizing both word and sentence length.
The average length per word is consistent across the centuries, with 50% of works averaging between 4.7 and 5.0 letters per word. Contrary to reputation, James Joyce’s Ulysses falls within that median range of letters per word and short sentences (13.0 vs 20.8 words per sentence).
Joyce, on the other hand, did statistically minimize common words in favor of unique syntax.
At the turn of the twentieth century, there’s a substantial drop off in the frequency of ‘difficult’ words (defined by their absence on the Dale-Chall Readability Word List, which contains ~3,000 words understandable to 80% of American fourth graders). The median amount of difficult words declined from 7,500 for authors born in the 1800s to 984 for those born in the 1900s.
A variety of explanations are possible: diminishing use of obscure vocabulary as literacy rates grew and reading-as-a-pastime spread beyond the upper classes, words used commonly pre-1900 falling from favor and been replaced by modern analogues, and others.
Project Gutenberg, as a cultivated cultural repository, is certainly biased in favor of noteworthy volumes in the public domain (Congress and/or Disney extended copyright protections in 1998 to 95 years, so even books published in 1923 are just entering).
Language is never stagnant. Jane Austen and modern American children speak the same, yet a vastly different, language.
For a solely recent perspective, The New York Times makes their weekly Best Seller lists available.
Romance, murder/mystery, World War II and dragons rule the cumulative time spent on the past decade’s Best Seller list. The top four authors of the post-recession decade controlled the charts by publishing an astonishing amount of books over their careers: 179 (Danielle Steel), 31 (David Baldacci), 42 (John Grisham), and 60 (Stephen King).
Eleven of the largest publishing houses spent over 300 weeks on Best Seller Lists, while thirty small indie presses and imprints only managed five or fewer weeks over the ten year period.
Analyzing 1,016,221 words from the descriptions of 10,195 Amazon best-sellers reveals the main interests of contemporary buyers: World War II, New York City, new technologies, crime and archetypal heroes and heroines. Readers wish to escape to the past, the future or NYC.