One Hundred Years of Solitude: how I analyzed my favorite book
Ok, you got me: I love music and TV series. What you probably don’t know is that I’m a stubborn reader. I can’t list all of the books I’ve read in my life, but I can answer the question “what is your favourite book?”.
It is One Hundred Years of Solitude by Gabriel García Márquez. One Hundred Years of Solitude is THE book: it contains everything you need or nothing at all, it is a truly fantastic novel, it is practically infinite in its content, it is the definition of time. Moreover, it represents my solitude.
I found a very nice edition a few days ago in France. I can’t read it anymore, but curiosity always kills the cat, so I wondered whether it would be possible to analyze it by means of some sort of textual analysis.
I did it, and this post is the result. Let’s jump to the introduction!
I analyzed the book in two ways: a textual analysis and a graph-theory analysis. The former takes the words of the book as the atomic elements of the analysis. The latter builds on the former to create a graph (or network) modelling the interactions between the characters in the book.
Before starting, a disclaimer: this post does not represent a scientific approach to the problems it touches. It is driven only by a lot of curiosity.
First of all, we need something. Yes, you guessed it: we need the book in some textual format. We ask, Google answers: here’s a .txt file containing the English translation of the original book.
The text mining was carried out using Python and the nltk library. The visualizations were made with matplotlib and Tableau. The graph was built using the networkx library and visualized with Gephi.
The main steps of the approach are the following:
- reading the book and tokenizing it using Python and nltk,
- carrying out the textual analysis with the same tools,
- building a graph on top of the extracted information about the characters.
The textual analysis
Let’s start with the textual analysis, that is, take all of the text, feed it to the computer and see if something comes out. It turns out that something interesting really did come out!
Symbols, words, lexical richness
One simple question: how many symbols, total words and distinct words does the book contain? Márquez, for his last draft of the book, pressed the keys of his typewriter 809644 times. One Hundred Years of Solitude contains 144739 words, and the number of distinct words is 11027. There are more than 11k different words!
We can extract a nice measure called lexical richness, which is the ratio between the number of distinct words and the total number of words. In this case, we get 11027 / 144739 = 0.07618540959934779, which means that the number of distinct words is about 7.6% of the total number of words in the text!
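Here is a minimal sketch of how these counts can be computed. It is not the original code: the post used nltk for tokenizing, while this sketch swaps in a simple regular expression from the standard library, so the exact numbers it gives on the book may differ. The function names and the toy sentence are made up for illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and return its word tokens.

    Assumption: a word is a run of letters, optionally joined by inner
    hyphens or apostrophes (e.g. "tap-dancing"). nltk tokenizes differently.
    """
    return re.findall(r"[^\W\d_]+(?:[-'][^\W\d_]+)*", text.lower())

def lexical_richness(tokens):
    """Ratio between the number of distinct words and the total number of words."""
    return len(set(tokens)) / len(tokens)

# Toy text standing in for the book (hypothetical, not a quote).
text = "The rain fell and the rain stayed."
tokens = tokenize(text)

print("symbols:", len(text))
print("total words:", len(tokens))
print("distinct words:", len(set(tokens)))
print("lexical richness:", lexical_richness(tokens))
```

On the real book, `text` would simply be the content of the .txt file read into a string.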
Márquez entitled his work One Hundred Years of Solitude. That could make you think that the word solitude appears a lot in the text. We can check by looking at the dispersion of the word across the whole text. I’m curious, so I also tried four other words: time, love, life and death.
The plot shows the dispersion of the words in the text. Intuitively, a blue line marks a point of the book where the word occurs, and nothing is drawn otherwise. From the plot we can see that the old fox Márquez uses solitude only a few times, but makes the reader fall into the concept of time by using that word practically throughout the whole book. Furthermore, we see that love is the main theme from the end of the fifth generation of the Buendías until the end of the book.
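The data behind a dispersion plot is just the list of positions at which each word occurs. Below is a small sketch of that idea in plain Python; the function name and the toy token list are assumptions for illustration. The post does not say how the plot was drawn, but nltk's `Text.dispersion_plot` produces this kind of figure from a list of words.

```python
def dispersion_offsets(tokens, targets):
    """Map each target word to the token positions where it occurs."""
    offsets = {t.lower(): [] for t in targets}
    for position, token in enumerate(tokens):
        token = token.lower()
        if token in offsets:
            offsets[token].append(position)
    return offsets

# Toy token list (hypothetical); on the book this would be all of its tokens.
tokens = "time passed and love came but time and death won".split()
offsets = dispersion_offsets(tokens, ["solitude", "time", "love", "life", "death"])

for word, positions in offsets.items():
    # Each position would become one vertical tick on the word's row of the plot.
    print(word, positions)
```

A word that never occurs simply gets an empty list, which corresponds to an empty row in the plot.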
Hapax legomena
Erh… what? A hapax legomenon is a word that occurs only once within a context, either in the written record of an entire language, in the works of an author, or in a single text.
One Hundred Years of Solitude contains 4741 hapax legomena. Here are 40 of them, picked at random:
epaulets, upsetting, civilization, motilón, marshal, domain, gluttons, despised, secretary, consulting, vise, forty-seven, modify, parrot, thirty-five, jeopardize, highest-flying, docility, wishing, rebuffs, wisely, walter, piglets, cans, dainties, demented, ports, chalice, mitigated, paragraphs, riddle, huts, alexandria, shuttered, consummate, adulterous, hoof, drugstore, tap-dancing, fabric.
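Finding the hapax legomena boils down to counting every word and keeping those with a count of one. A minimal sketch with the standard library follows (the function name and toy tokens are made up for illustration); nltk offers the same thing through `FreqDist.hapaxes()`, which may well be what was used here.

```python
import random
from collections import Counter

def hapax_legomena(tokens):
    """Return, in alphabetical order, the words that occur exactly once."""
    counts = Counter(tokens)
    return sorted(word for word, count in counts.items() if count == 1)

# Toy token list (hypothetical); on the book this would be all of its tokens.
tokens = "the parrot saw the chalice and the parrot left".split()
hapaxes = hapax_legomena(tokens)
print(len(hapaxes), hapaxes)

# Picking a few of them at random, as done for the list above.
print(random.sample(hapaxes, k=min(3, len(hapaxes))))
```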
Collocations
Ehm… what? I’m a computer scientist, come on! A collocation is a sequence of words or terms that co-occur more often than would be expected by chance.
In One Hundred Years of Solitude, we find the following collocations:
josé arcadio, aureliano segundo, colonel aureliano, aureliano buendía, arcadio buendía, gerineldo márquez, santa sofía, pietro crespi, petra cotes, pilar ternera, colonel gerineldo, arcadio segundo, amaranta úrsula, chestnut tree, banana company, mauricio babilonia, apolinar moscote, father nicanor, prudencio aguilar, many years.
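To make "more often than would be expected by chance" concrete, here is a rough sketch that scores pairs of adjacent words by pointwise mutual information (PMI): how much more frequent the pair is than it would be if the two words were independent. This is a swapped-in technique, not necessarily the scoring used for the list above (nltk ships its own collocation finder with other scoring options), and the function name, the `min_count` threshold and the toy sentence are assumptions.

```python
import math
from collections import Counter

def top_collocations(tokens, min_count=2, top_n=20):
    """Rank adjacent word pairs by PMI, ignoring pairs seen fewer than min_count times."""
    word_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_bigrams = n_words - 1
    scored = []
    for (first, second), count in bigram_counts.items():
        if count < min_count:
            continue  # rare pairs get unreliable, inflated PMI scores
        observed = count / n_bigrams
        expected = (word_counts[first] / n_words) * (word_counts[second] / n_words)
        scored.append((math.log2(observed / expected), (first, second)))
    scored.sort(reverse=True)
    return [bigram for _, bigram in scored[:top_n]]

# Toy token list (hypothetical); on the book this would be all of its tokens.
tokens = ("josé arcadio saw the tree and josé arcadio saw the rain "
          "and the sun and the dog").split()
for first, second in top_collocations(tokens):
    print(first, second)
```

Pairs made of very common words, such as "and the", end up at the bottom of the ranking because their high individual frequencies make the pair unsurprising.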
Well, now we’re having fun! The idea behind this kind of analysis is to model the relationships between the characters of One Hundred Years of Solitude as a network (a graph).
Before starting, we define what a graph is and what the relationships between characters are.
A network is just a set of objects connected to one another by some sort of relationship. As an example, you, my dear reader, and your friends make a network: you’re entities which are connected by past experiences, common interests, etc. Throughout this post, I will call these entities vertices and the relationships edges.
Regarding the relationships between characters: we cannot automatically extract different kinds of relationships. So we say that a relation exists between two characters A and B if both occur in the text and B appears at most 30 words after an occurrence of A. Furthermore, I considered only the characters with at least one interaction.
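The 30-word rule can be sketched as follows. This is an illustration under stated assumptions, not the original code: it assumes every character name has already been reduced to a single token (multi-word names such as "josé arcadio" would need to be merged beforehand), and the function name and toy text are made up. The post built the graph with networkx; plain dictionaries are used here to keep the sketch dependency-free.

```python
from collections import Counter

def cooccurrence_edges(tokens, characters, window=30):
    """Count how often two different characters are mentioned within `window` words.

    Returns a Counter mapping an alphabetically sorted pair of names to the
    number of times the second mention follows the first by at most `window` tokens.
    """
    characters = set(characters)
    # Positions of every character mention, in reading order.
    mentions = [(i, tok) for i, tok in enumerate(tokens) if tok in characters]
    edges = Counter()
    for k, (pos_a, name_a) in enumerate(mentions):
        for pos_b, name_b in mentions[k + 1:]:
            if pos_b - pos_a > window:
                break  # later mentions are even farther away
            if name_a != name_b:
                edges[tuple(sorted((name_a, name_b)))] += 1
    return edges

# Toy text (hypothetical): two characters close together, a third far away.
tokens = ["úrsula"] + ["word"] * 5 + ["amaranta"] + ["word"] * 40 + ["remedios"]
edges = cooccurrence_edges(tokens, {"úrsula", "amaranta", "remedios"})
print(edges)

# Characters with at least one interaction, as kept in the graph.
connected = {name for pair in edges for name in pair}
print(connected)
```

The counts can serve as edge weights, and characters that never appear in any pair are the ones left out of the graph.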
Oh my, we have a graph!
You were waiting for it! Here’s the graph modelling the relationships between the characters of One Hundred Years of Solitude:
Vertices (circles) represent characters: the size of a vertex is proportional to its total number of interactions. Edges represent relationships: the thickness of the edge between two vertices is proportional to the number of interactions between those two characters. But… we have an issue: there are no names! Did you read the book? Then make your guess!
Done? Here are the names:
The Holy Trinity is there: Úrsula, Amaranta and Remedios are strongly connected to one another. We also see that the model gives a nice visual hint regarding the importance of late-introduced characters: if a character is introduced late in the book, the importance of that character falls.
Characters and statistics
Let’s draw some simple statistics:
- there are 62 characters with at least one interaction, while the total number of characters is 71;
- each of them has at least 1 and at most 33 interactions; the average number of interactions is 6.6451.
Let’s see if we can get some more interesting insights!
Diameter of a graph
The diameter of a graph is the maximum eccentricity of any vertex in the graph, that is, the greatest distance between any pair of vertices. Intuitively, we can think of the diameter as the length of the longest among all the shortest paths: the number of steps needed to go from one side of the graph to the other.
The diameter of our graph is equal to 6, and an example of a path which defines the diameter is the following:
Bruno Crespi, Pietro Crespi, Remedios, Aureliano Segundo, Mr. Herbert, Mr. Jack Brown, Dagoberto Fonseca.
In other words, the shortest path which starts at Bruno Crespi and ends at Dagoberto Fonseca crosses six edges.
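For an unweighted graph, the diameter can be found by running a breadth-first search from every vertex and keeping the largest distance seen. The sketch below does this in plain Python on a toy graph made only of the seven vertices of the path listed above, not on the full character graph; the function names are made up, and networkx provides the same computation as `networkx.diameter`.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path distance (in edges) from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def diameter(adj):
    """Greatest shortest-path distance between any pair of vertices (connected graph assumed)."""
    return max(max(bfs_distances(adj, s).values()) for s in adj)

# Toy graph: just the chain of characters from the example path.
path = ["Bruno Crespi", "Pietro Crespi", "Remedios", "Aureliano Segundo",
        "Mr. Herbert", "Mr. Jack Brown", "Dagoberto Fonseca"]
adj = {name: set() for name in path}
for a, b in zip(path, path[1:]):
    adj[a].add(b)
    adj[b].add(a)

print(diameter(adj))  # 6: seven vertices in a chain are joined by six edges
```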
Is there a way to understand the importance of a character in a book? Yeah, sure there is: by reading the book, analyzing the relationships between the characters, etc. Is there a way to define importance in the graph of One Hundred Years of Solitude? SURE there is! We can compute the centrality of a vertex in the network.
Centrality measures the importance of a vertex in the network. Let’s apply two well-known centrality measures: degree centrality and betweenness centrality.
Degree centrality for a given vertex is defined as the number of edges of the vertex. Let’s see the most important vertices in our graph. If you’re interested, here’s the visualization of the degree centrality for all of the characters.
Úrsula (Iguarán): 33
Colonel Aureliano Buendía: 25
Aureliano Segundo: 23
José Arcadio Buendía: 18
José Arcadio Segundo: 15
Gerineldo (Márquez): 15
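With the graph stored as a dictionary of neighbours, degree centrality as defined here is just the size of each neighbour set. A minimal sketch follows; the function names and the toy graph with vertices A to D are hypothetical. Note that `networkx.degree_centrality` returns these counts divided by the number of other vertices, so raw counts like the ones above correspond to plain degrees.

```python
def degree_centrality(adj):
    """Number of edges attached to each vertex."""
    return {vertex: len(neighbours) for vertex, neighbours in adj.items()}

def top_k(scores, k=10):
    """The k highest-scoring vertices, ties broken alphabetically."""
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:k]

# Toy graph (hypothetical): A is connected to everyone, B and C also to each other.
adj = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
degrees = degree_centrality(adj)
for name, degree in top_k(degrees, k=3):
    print(f"{name}: {degree}")

# Average number of interactions per character.
print(sum(degrees.values()) / len(degrees))
```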
Betweenness centrality for a given vertex is defined as the number of shortest paths between pairs of other vertices which pass through the given vertex; the values below are normalized, so they fall between 0 and 1. Here are the most important vertices in our graph, and here’s the visualization for all of the characters.
Úrsula (Iguarán): 0.26675225258687174
Colonel Aureliano Buendía: 0.24285723082589036
Aureliano Segundo: 0.1333021576177314
José Arcadio Segundo: 0.12999823188347778
Mr. Herbert: 0.06448087431693988
Amparo (Moscote): 0.043136425398238344
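For the curious, here is a sketch of how normalized betweenness can be computed on an unweighted, undirected graph using Brandes' algorithm. It is an illustration in plain Python, not the code behind the numbers above, which most likely came from `networkx.betweenness_centrality`; the function name and the toy graph are assumptions.

```python
from collections import deque

def betweenness_centrality(adj):
    """Normalized betweenness for an undirected, unweighted graph (Brandes' algorithm)."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma)
        # and remembering each vertex's predecessors on those paths.
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        # Walk back from the farthest vertices, accumulating how much each
        # vertex contributes to the shortest paths that start at s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    n = len(adj)
    # Every pair was visited from both of its endpoints, so dividing by
    # (n - 1)(n - 2) gives the share of pairs whose shortest paths use the vertex.
    scale = 1 / ((n - 1) * (n - 2)) if n > 2 else 1.0
    return {v: total * scale for v, total in score.items()}

# Toy graph (hypothetical): a chain a - b - c - d.
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
for vertex, value in betweenness_centrality(adj).items():
    print(vertex, round(value, 4))
```

In the chain, the two inner vertices lie on two of the three shortest paths that do not start or end at them, while the end vertices lie on none.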
Note that Úrsula remains the most important character in the novel. Indeed, she has the greatest number of interactions, and the largest share of shortest paths passes through her. It is interesting to see that, for the degree centrality, second and third place go to Amaranta and Remedios, who are main characters but do not bring any information into the network, that is, all of the other characters could have relations without involving them; for the betweenness centrality, Colonel Aureliano Buendía and Aureliano Segundo represent the main threads of the entire novel.
It could be interesting to define relationships in a more accurate way: for example, two characters are related if they are near each other while talking. As you can imagine, dear reader, this approach is more difficult than the one I used, and it remains to be investigated.
The approach presented here could (virtually) be applied to any other book. Be careful, though: tagging, that is, the process of classifying words by their part of speech, is strictly tied to the language. So it may be easy for English, but what about Italian?
I hope you liked this post. Please feel free to suggest anything you would like to see here or to point out errors!