Why do some books, as simple as they may be, succeed in becoming worldwide sensations? Do their authors treat the language differently? How do printed symbols lure us into epic worlds? I had to dig in.
I picked the most successful book series of the last 20 years and applied text mining techniques, looking for patterns and, well, a way to reverse engineer an author’s mind while writing.
I analyzed the Harry Potter books by J. K. Rowling, the Game of Thrones books (ok nerd, “A Song of Ice and Fire”) by George R. R. Martin, the Hunger Games trilogy by Suzanne Collins and the Lord of the Rings trilogy plus The Hobbit by J. R. R. Tolkien.
4 authors. 19 books. 3,896,568 words.
- Common phrases
- Top nouns
- Top verbs
- Top adverbs
- Top adjectives
- Lexical density
The first thought when messing with natural language processing on books is to isolate the most frequent phrases, usually captured as bigrams, trigrams and longer n-grams (sequences of n consecutive words). You may not find many phrases shared across authors, but you get a hint about the story and the significance of some key concepts, such as the ring in LOTR or the arena in Hunger Games. Displayed are the top 2-grams to 7-grams.
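For the curious, here is a minimal sketch of how frequent n-grams can be counted. It is not the exact code behind the charts: it uses only the Python standard library, the regex tokenizer is a simplifying assumption, and the sample sentence is just an illustration rather than a line fed to the real analysis.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (apostrophes kept)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


def top_ngrams(tokens, n, k=3):
    """Return the k most frequent n-grams as (tuple_of_words, count) pairs."""
    # zip over n shifted copies of the token list yields consecutive n-grams
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams).most_common(k)


# Illustrative sample only
sample = "One ring to rule them all, one ring to find them, one ring to bring them all."
tokens = tokenize(sample)

for n in (2, 3):
    print(n, top_ngrams(tokens, n, k=2))
```

On a full book you would first drop stopwords and character names, otherwise phrases like "of the" dominate every list.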
When you run part-of-speech tagging on popular fantasy-adventure books, you expect the nouns to be about lords, kings and death. Gender “equality” is probably apparent in Game of Thrones (lady, queen) as well as in Hunger Games (mother), our only series sporting a lead heroine.
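A rough sketch of how the top words per part of speech can be pulled out once the text is tagged. It assumes the tokens are already (word, tag) pairs with Penn Treebank tags, the kind of output a tagger such as NLTK's `pos_tag` produces. The sample list, the `POS_PREFIXES` mapping and the `top_words` helper are made up for illustration.

```python
from collections import Counter

# Hypothetical (word, tag) pairs in Penn Treebank style; not taken from the books.
tagged = [
    ("the", "DT"), ("king", "NN"), ("rode", "VBD"), ("swiftly", "RB"),
    ("to", "TO"), ("the", "DT"), ("dark", "JJ"), ("tower", "NN"),
    ("and", "CC"), ("the", "DT"), ("king", "NN"), ("stood", "VBD"),
    ("before", "IN"), ("the", "DT"), ("lords", "NNS"),
]

# Penn Treebank tags of one word class share a prefix (NN, NNS, NNP, ...).
POS_PREFIXES = {"noun": "NN", "verb": "VB", "adjective": "JJ", "adverb": "RB"}


def top_words(tagged_tokens, pos, k=10, exclude=frozenset()):
    """Return the k most frequent words of one part of speech.

    `exclude` holds lowercase words to skip, e.g. stopwords or character names.
    """
    prefix = POS_PREFIXES[pos]
    counts = Counter(
        word.lower()
        for word, tag in tagged_tokens
        if tag.startswith(prefix) and word.lower() not in exclude
    )
    return counts.most_common(k)


print(top_words(tagged, "noun", k=3))
print(top_words(tagged, "verb", k=3))
```

The same helper covers verbs, adverbs and adjectives by changing the `pos` argument.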
Verbs in adventure books are probably the most important lexical feature, since they move the action and the story forward. Our authors use their rare verbs sparingly, but some of them unintentionally show their soft spots. For instance, Tolkien seems to prefer visual verbs such as stood and fell.
The first observation is the lack of -ly adverbs in Game of Thrones. Perhaps George R. R. Martin follows Stephen King’s advice: “The adverb is not your friend. They seem to have been created with the timid writer in mind.”
(I love Tolkien abusing swiftly)
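One way to put a number on the -ly observation is the share of adverbs that end in -ly. The sketch below is an assumption about how to measure it, not the code behind the chart. It expects Penn Treebank style (word, tag) pairs, and `ly_adverb_share` is a made-up helper name.

```python
def ly_adverb_share(tagged_tokens):
    """Fraction of adverb tokens (tags starting with RB) that end in -ly."""
    adverbs = [word.lower() for word, tag in tagged_tokens if tag.startswith("RB")]
    if not adverbs:
        return 0.0
    return sum(word.endswith("ly") for word in adverbs) / len(adverbs)


# Hypothetical tagged sample, for illustration only
sample = [
    ("he", "PRP"), ("ran", "VBD"), ("swiftly", "RB"), ("and", "CC"),
    ("then", "RB"), ("spoke", "VBD"), ("softly", "RB"), ("again", "RB"),
]

print(ly_adverb_share(sample))
```

Keep in mind that a plain suffix check also catches words like only or early, so treat the result as a rough indicator.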
Like the nouns, the adjectives follow the same gruesome pattern of death, darkness, survival and hardness. The use of colors by each author is notable: Martin mentions red, white and grey most, Rowling black, Collins white, and Tolkien black, white, grey and green.
6) Lexical Density
Lexical density measures the share of content words among all the words of a text. Content words are nouns, adjectives, most verbs, and most adverbs. Grammatical words are pronouns, prepositions, conjunctions, auxiliary verbs, interjections etc.
The formula for estimating lexical density:
Ld = (Nlex / N) × 100
where Nlex is the number of lexical word tokens (nouns, adjectives, verbs, adverbs) and N is the total number of words in the analysed text.
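The formula translates almost directly into code. This is a minimal sketch under some assumptions: tokens arrive as Penn Treebank style (word, tag) pairs, punctuation has already been removed, and the small `AUXILIARIES` set is my own shortcut for the "most verbs" caveat, since the tagger labels auxiliaries as ordinary verbs.

```python
# Tag prefixes treated as lexical (content) words: nouns, adjectives, verbs, adverbs
LEXICAL_PREFIXES = ("NN", "JJ", "VB", "RB")

# Assumed list of auxiliary verb forms that count as grammatical words
AUXILIARIES = frozenset({
    "be", "am", "is", "are", "was", "were", "been", "being",
    "have", "has", "had", "do", "does", "did",
})


def is_lexical(word, tag):
    """True for content words; auxiliaries tagged as verbs are excluded."""
    if tag.startswith("VB") and word.lower() in AUXILIARIES:
        return False
    return tag.startswith(LEXICAL_PREFIXES)


def lexical_density(tagged_tokens):
    """Ld = (Nlex / N) * 100 over a list of (word, tag) pairs."""
    total = len(tagged_tokens)
    if total == 0:
        return 0.0
    lexical = sum(is_lexical(word, tag) for word, tag in tagged_tokens)
    return lexical / total * 100


# Hypothetical tagged sentence, for illustration only
sentence = [
    ("The", "DT"), ("old", "JJ"), ("wizard", "NN"), ("was", "VBD"),
    ("walking", "VBG"), ("slowly", "RB"), ("to", "TO"), ("the", "DT"),
    ("tower", "NN"),
]

print(round(lexical_density(sentence), 1))
```

Five of the nine tokens here are content words (old, wizard, walking, slowly, tower), so the score is 5 / 9 × 100.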
It is tempting to draw parallels between a book's length and its lexical density. You cannot keep using unique words for one thousand pages (I am pointing at you, James Joyce). Indeed, the largest books of our analysis score poorly. But this is not always the case: for instance, the second Harry Potter book achieves a high score despite not being the smallest.
In order to suggest suitable books to schoolchildren, a group of researchers in the 1960s proposed the Automated Readability Index (ARI).
The formula for calculating the ARI is:
ARI = 4.71 × (characters / words) + 0.5 × (words / sentences) − 21.43
The whole concept is to assign a score to books or documents that approximates the age needed to understand that text. The ARI score corresponds with the school grade. For example:
11–12 yrs. old — ARI 6, sixth grade student
12–13 yrs. old — ARI 7, seventh grade student
So, on average, the thresholds for understanding are: Hunger Games corresponds to 10-year-olds, Game of Thrones to 11-year-olds, Lord of the Rings to 14-year-olds and Harry Potter to 15-year-olds.
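For reference, a small sketch of how an ARI score can be computed. It uses the commonly published form of the formula, 4.71 × (characters / words) + 0.5 × (words / sentences) − 21.43. The regex-based sentence and word splitting is a simplifying assumption of mine, so scores will differ slightly from tools that tokenize differently.

```python
import re


def automated_readability_index(text):
    """ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43."""
    # Naive sentence split on runs of ., ! or ? (abbreviations are not handled)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    # Count letters and digits only, ignoring apostrophes
    characters = sum(ch.isalnum() for word in words for ch in word)
    return 4.71 * characters / len(words) + 0.5 * len(words) / len(sentences) - 21.43


print(automated_readability_index("The cat sat. The dog ran."))
```

Very short, simple sentences can produce scores below zero, which is why the index only makes sense on longer passages.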
- Game of Thrones and Hunger Games are more gender-balanced, judging by the high occurrence of words such as lady, queen and mother.
- Lord of the Rings and Game of Thrones are more centered around key story concepts like the king and the ring. The Harry Potter universe is spread across many characters.
- Lord of the Rings features more visual verbs like stood and fell.
- Game of Thrones contains significantly fewer -ly adverbs.
- Hunger Games is the most lexically dense series, but at the same time features the most “childish” writing. The two most lexically dense books were both written by women.
- Harry Potter and Lord of the Rings feature the most grown-up writing according to the ARI. The Harry Potter series, up to the 6th book, seems to “mature” along with the lead hero.
Text mining and analysis were conducted in Python, assisted by the powerful NLTK library. The graphs were made by tinkering with matplotlib styles. The phrase word clouds were drawn with pytagcloud. The book manuscripts come from the official epubs. Very frequent words and phrases such as of, the or i don’t, would you etc. were eliminated as stopwords, as were character names. If a word appears in various forms, e.g. say and said, only the most frequent form is taken into account.
Dimitris Spathis studies Computer Science at Ionian University. He is fascinated by the intersection of code, arts and cognition.