# One Hundred Years of Solitude: how I analyzed my favorite book

Ok, you got me: I love music and TV series. What you probably don’t know is that I’m a stubborn reader. I can’t list all of the books I’ve read in my life, but I can answer the question *what is your favourite book?*.

It is **One Hundred Years of Solitude** by *Gabriel García Márquez*. One Hundred Years of Solitude is THE book: it contains everything you need or neither, it is a very fantastic novel, it is practically infinite in its content, it is the definition of *time*. Moreover, it represents my solitude.

I found a very nice edition some days ago in France. I can’t read it anymore but curiosity always kills the cat, so **I was wondering whether it is possible to analyze it by means of some sort of textual analysis**.

I did it, and this post is the result. Let’s jump to the Introduction!

### Introduction

I analyzed the book in two ways: a **textual analysis** and some sort of **graph theory analysis**. The former takes the words of the book as the atomic elements of the analysis. The latter uses the results of the former to create a graph (or *network*) modelling the interactions between the characters in the book.

Before starting, a disclaimer: this post is not a scientific approach to the problems it discusses. It is driven only by a lot of curiosity.

### The data

Before starting, we need something. Yeah, you guessed well: we need the book in some textual format. We ask, Google answers: here’s a .txt file containing the English translation of the original book.

The text mining was carried out with **Python** and the **nltk** library. Visualizations were made with **matplotlib** and **Tableau**. The graph was built with the **networkx** library and visualized with **Gephi**.

The main steps of the approach are the following:

- reading the book and tokenizing it using Python and nltk,
- making the textual analysis with the same tools,
- building a graph on top of the extracted information about characters.
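To give an idea of the first step, here is a minimal sketch of a tokenizer. It is not the code from the repository: I assume a plain regular expression as a stand-in for nltk's `word_tokenize`, and the file name in the comment is hypothetical. The sample sentence is made up.

```python
import re

def tokenize(text):
    """Split raw text into lowercase word tokens.

    Simplified stand-in for nltk.word_tokenize: keeps letters (accented
    ones too) and digits, allows inner hyphens and apostrophes, and
    drops punctuation.
    """
    return re.findall(r"[^\W_]+(?:[-'’][^\W_]+)*", text.lower())

# Hypothetical usage with the book (the file name is an assumption):
# with open("book.txt", encoding="utf-8") as f:
#     tokens = tokenize(f.read())

sample = "Úrsula looked at the chestnut tree."  # made-up sentence
print(tokenize(sample))
# ['úrsula', 'looked', 'at', 'the', 'chestnut', 'tree']
```

Lowercasing everything is a design choice: it merges *The* and *the*, but it also loses the capital letters that could help in spotting character names.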

The code is available on GitHub.

### The textual analysis

Let’s start with the textual analysis, that is, *take all of the text, spit it onto the computer and see if something comes out*. It seems that something interesting really came out!

#### Symbols, words, lexical richness

One simple question: how many symbols, total words and distinct words does the book contain? Márquez, for his last draft of the book, pressed the keys of his typewriter **809644** times. **One Hundred Years of Solitude** contains **144739** words and the number of distinct words is **11027**. There are more than **11k different words!**

We can extract a nice measure called **lexical richness**, which is the ratio between the number of distinct words and the number of total words. In this case, we have **0.07618540959934779**, which means that the distinct words amount to **7.6%** of the entire text!
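The ratio is easy to reproduce. Here is a small sketch: the function name is mine and the toy sentence is made up, while the last line simply redoes the division with the two counts reported above.

```python
def lexical_richness(tokens):
    """Ratio between the number of distinct tokens and the total number of tokens."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# Toy example: 5 distinct words out of 8 tokens.
toy = "the rain fell and the rain fell again".split()
print(lexical_richness(toy))  # 0.625

# The same ratio with the counts reported for the book.
print(round(11027 / 144739, 4))  # 0.0762
```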

#### Words dispersion

Márquez entitled his work **One Hundred Years of Solitude**. It could make you think that the word *solitude* appears a lot in the text. We can check this by looking at the **dispersion** of the word over the whole text. I’m curious, so I also looked at four other words: *time*, *love*, *life* and *death*.

The plot shows the dispersion of the words in the text. Intuitively, a blue line marks a position in the book where the word occurs; where there is no line, the word does not appear. From the plot we can see that the old fox Márquez uses *solitude* very few times but makes the reader fall into the concept of *time*, using that word practically throughout the whole book. Furthermore, we see that *love* is the main theme from the end of the fifth generation of the Buendía family until the end of the book.
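The data behind such a plot is just the list of positions at which each word occurs. Below is a minimal sketch that computes those positions by hand on a made-up sentence. The function name is my own; as far as I know, nltk can draw the plot directly with `nltk.Text(tokens).dispersion_plot(words)`.

```python
def dispersion_offsets(tokens, targets):
    """Map each target word to the list of token positions where it occurs."""
    wanted = {word.lower() for word in targets}
    offsets = {word: [] for word in wanted}
    for position, token in enumerate(tokens):
        token = token.lower()
        if token in wanted:
            offsets[token].append(position)
    return offsets

# Made-up sentence, only to show the shape of the result.
toy = "time passed and love came then time stopped and death came".split()
offsets = dispersion_offsets(toy, ["time", "love", "death", "solitude"])
print(offsets["time"])      # [0, 6]
print(offsets["solitude"])  # []
```

Each position becomes one vertical line in the plot, so a word with an empty list leaves its row blank.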

#### Hapax legomena

Erh… what? A **hapax legomenon** (plural: *hapax legomena*) is a word that occurs only once within a context, either in the written record of an entire language, in the works of an author, or in a single text.

**One Hundred Years of Solitude** contains **4741** hapax legomena. Here are some of them, picked at random:

epaulets, upsetting, civilization, motilón, marshal, domain, gluttons, despised, secretary, consulting, vise, forty-seven, modify, parrot, thirty-five, jeopardize, highest-flying, docility, wishing, rebuffs, wisely, walter, piglets, cans, dainties, demented, ports, chalice, mitigated, paragraphs, riddle, huts, alexandria, shuttered, consummate, adulterous, hoof, drugstore, tap-dancing, fabric.
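Finding them only requires counting words. Here is a small sketch with `collections.Counter` on a made-up sentence; the function name is mine, and I believe nltk offers the same thing through `nltk.FreqDist(tokens).hapaxes()`.

```python
from collections import Counter

def hapax_legomena(tokens):
    """Return, in alphabetical order, the words that occur exactly once."""
    counts = Counter(tokens)
    return sorted(word for word, n in counts.items() if n == 1)

toy = "the parrot saw the chalice and the parrot left".split()
print(hapax_legomena(toy))  # ['and', 'chalice', 'left', 'saw']
```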

#### Collocations

Ehm… what? I’m a computer scientist, come on! A **collocation** is a sequence of words or terms that co-occur more often than would be expected by chance.

In **One Hundred Years of Solitude**, we find the following collocations:

josé arcadio, aureliano segundo, colonel aureliano, aureliano buendía, arcadio buendía, gerineldo márquez, santa sofía, pietro crespi, petra cotes, pilar ternera, colonel gerineldo, arcadio segundo, amaranta úrsula, chestnut tree, banana company, mauricio babilonia, apolinar moscote, father nicanor, prudencio aguilar, many years.
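To make "more often than would be expected by chance" concrete, here is a sketch that ranks adjacent word pairs by pointwise mutual information (PMI). This is my own simplified stand-in: the list above most likely comes from nltk's `Text.collocations()`, whose scoring may differ, and the function name, the `min_count` threshold and the toy sentence are assumptions.

```python
import math
from collections import Counter

def collocations(tokens, min_count=2, top=5):
    """Rank adjacent word pairs by pointwise mutual information (PMI)."""
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    scored = []
    for (first, second), n in pair_counts.items():
        if n < min_count:
            continue  # pairs seen only once give unreliable scores
        # Observed pair frequency divided by the frequency we would
        # expect if the two words were independent of each other.
        ratio = n * total / (word_counts[first] * word_counts[second])
        scored.append((math.log2(ratio), first, second))
    # Highest PMI first; ties are broken alphabetically.
    scored.sort(key=lambda item: (-item[0], item[1], item[2]))
    return [(first, second) for _, first, second in scored[:top]]

toy = ("the banana company left and the banana company came back "
       "to the chestnut tree near the chestnut tree").split()
print(collocations(toy, top=2))
# [('banana', 'company'), ('chestnut', 'tree')]
```

Pairs starting with *the* score lower here because *the* shows up everywhere, so seeing it next to *banana* is less surprising.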

### Network analysis

Well, now it’s getting fun! The idea behind this kind of analysis is to model the relationships between the characters of **One Hundred Years of Solitude** with a *network* (a graph).

Before starting, let’s define what a graph is and what a relationship between characters is.

A network is just a set of objects connected to each other by some sort of relationship. As an example, you, my dear reader, and your friends make a network: you’re entities which are connected by past experiences, common interests, etc. Throughout this whole post, I will call these entities *vertices* and the relationships *edges*.

Regarding the relationships between characters: we cannot automatically extract different kinds of relationships. So we say that a **relation** exists between two characters *A* and *B* if both occur in the text and *B* appears at most 30 words after an occurrence of *A*. Furthermore, I considered only the characters with at least one interaction.
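The rule can be sketched in a few lines. This is a simplified version under some assumptions of mine: every character is a single token (real names like *josé arcadio buendía* span several words and need extra care), the pair is stored without direction, and the toy sentence uses a window of 3 instead of 30.

```python
from collections import Counter

def cooccurrence_edges(tokens, characters, window=30):
    """Count how often two different characters are mentioned
    within `window` tokens of each other."""
    names = set(characters)
    # Positions of every character mention, in reading order.
    mentions = [(i, token) for i, token in enumerate(tokens) if token in names]
    edges = Counter()
    for index, (pos_a, a) in enumerate(mentions):
        for pos_b, b in mentions[index + 1:]:
            if pos_b - pos_a > window:
                break  # every later mention is even farther away
            if a != b:
                edges[tuple(sorted((a, b)))] += 1
    return edges

# Made-up sentence with a tiny window.
toy = "úrsula spoke to amaranta while rebeca slept".split()
edges = cooccurrence_edges(toy, ["úrsula", "amaranta", "rebeca"], window=3)
print(sorted(edges.items()))
# [(('amaranta', 'rebeca'), 1), (('amaranta', 'úrsula'), 1)]
```

The counts can then be used as edge weights, and each key gives the two vertices of an edge.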

#### Oh my, we have a graph!

You were waiting for it! Here’s the graph modelling the relationships between the characters of **One Hundred Years of Solitude**:

Vertices (circles) represent characters: the size of a vertex is proportional to its total number of interactions. Edges represent relationships: the thickness of the edge between two vertices is proportional to the number of interactions between those two characters. But… we have an issue: there are no names! Did you read the book? Then make your guess!

Done? Here are the names:

The Holy Trinity is there: **Úrsula**, **Amaranta** and **Remedios** are strongly connected to one another. We see that the model gives a nice visual hint regarding the importance of late-introduced characters: if a character is introduced late in the book, the importance of that character drops.

#### Characters and statistics

Let’s draw some simple statistics:

- there are **62 characters with at least one interaction**, while the **total number of characters is 71**;
- we have at least **1** and a maximum of **33** interactions; the average number of interactions is **6.6451**.
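Numbers like these can be read straight off the list of edges. Here is a minimal sketch, assuming that the number of interactions of a character is the number of distinct characters it is connected to; the function name is mine and the four edges are made up.

```python
from collections import defaultdict

def degree_summary(edges):
    """Minimum, maximum and mean number of neighbours per vertex."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    degrees = [len(others) for others in neighbours.values()]
    return {
        "characters": len(degrees),
        "min": min(degrees),
        "max": max(degrees),
        "mean": sum(degrees) / len(degrees),
    }

# Made-up edges, not taken from the real graph.
toy_edges = [
    ("Úrsula", "Amaranta"),
    ("Úrsula", "Rebeca"),
    ("Úrsula", "Melquíades"),
    ("Amaranta", "Rebeca"),
]
print(degree_summary(toy_edges))
# {'characters': 4, 'min': 1, 'max': 3, 'mean': 2.0}
```

Note that the mean is always twice the number of edges divided by the number of vertices, because every edge is counted once for each of its two endpoints.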

Let’s see if we can get some more interesting insights!

#### Diameter of a graph

The diameter of a graph is the maximum eccentricity of any vertex in the graph, that is, the greatest distance between any pair of vertices. Intuitively, we can think of the diameter as the length of the shortest path that takes us from one side of the graph to the other, i.e. the longest of all the shortest paths.

The diameter of our graph is equal to **6** and an example of a path which defines the diameter is the following:

Bruno Crespi, Pietro Crespi, Remedios, Aureliano Segundo, Mr. Herbert, Mr. Jack Brown, Dagoberto Fonseca.

In this case, the path above is the shortest one which starts at **Bruno Crespi** and ends at **Dagoberto Fonseca**.
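The diameter can be computed with one breadth-first search per vertex. Here is a minimal sketch in plain Python (I believe `networkx.diameter(G)` does the same job). The toy graph is a chain built only from the seven names of the path above; the real graph has many more vertices and edges, and the helper names are mine.

```python
from collections import deque

def build_graph(edges):
    """Turn a list of undirected edges into an adjacency dictionary."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def bfs_distances(graph, source):
    """Shortest-path distance (in edges) from `source` to every reachable vertex."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        vertex = queue.popleft()
        for neighbour in graph[vertex]:
            if neighbour not in distances:
                distances[neighbour] = distances[vertex] + 1
                queue.append(neighbour)
    return distances

def diameter(graph):
    """Greatest shortest-path distance between two vertices (connected graph assumed)."""
    return max(max(bfs_distances(graph, v).values()) for v in graph)

path = ["Bruno Crespi", "Pietro Crespi", "Remedios", "Aureliano Segundo",
        "Mr. Herbert", "Mr. Jack Brown", "Dagoberto Fonseca"]
chain = build_graph(list(zip(path, path[1:])))
print(diameter(chain))  # 6
```

Seven vertices in a row are joined by six edges, which is why a path with seven names gives a diameter of 6.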

#### Centrality

Is there a way to understand the importance of a character in a book? Yeah, sure there is, by reading the book, analyzing relationships between the characters, etc. Is there a way to define the importance in the graph of **One Hundred Years of Solitude**? SURE there is! We can compute the **centrality** of a vertex in the network.

Centrality identifies the importance of a vertex in the network. Let’s apply two well-known centrality measures: **degree centrality** and **betweenness centrality**.

**Degree centrality** for a given vertex is defined as the number of edges of the vertex. Let’s see the ten most important vertices in our graph. If you’re interested, here’s the visualization of the degree centrality for all of the characters.

Úrsula (Iguarán): 33

Amaranta: 27

Remedios: 26

Colonel Aureliano Buendía: 25

Aureliano Segundo: 23

Rebeca: 19

José Arcadio Buendía: 18

José Arcadio Segundo: 15

Gerineldo (Márquez): 15

Melquíades: 13
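A ranking like this one is a matter of counting edges per vertex and sorting. Here is a small sketch on four made-up edges; the function names are mine. One thing to keep in mind: as far as I know, `networkx.degree_centrality` returns the degree divided by `n - 1`, while the list above shows raw counts, so the sketch offers both.

```python
from collections import Counter

def degree_centrality(edges, normalized=False):
    """Number of edges per vertex, optionally divided by (n - 1)."""
    unique = {tuple(sorted(edge)) for edge in edges}  # drop duplicates and direction
    degree = Counter()
    for a, b in unique:
        degree[a] += 1
        degree[b] += 1
    if normalized and len(degree) > 1:
        n = len(degree)
        return {vertex: d / (n - 1) for vertex, d in degree.items()}
    return dict(degree)

def top_vertices(scores, top=10):
    """Vertices sorted by decreasing score; ties are broken by name."""
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top]

# Made-up edges, not taken from the real graph.
toy_edges = [
    ("Úrsula", "Amaranta"),
    ("Úrsula", "Remedios"),
    ("Úrsula", "Rebeca"),
    ("Amaranta", "Remedios"),
]
print(top_vertices(degree_centrality(toy_edges), top=3))
# [('Úrsula', 3), ('Amaranta', 2), ('Remedios', 2)]
```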

**Betweenness centrality** for a given vertex is defined as the number of shortest paths between pairs of vertices that pass through the given vertex. Here are the ten most important vertices in our graph, and here’s the visualization for all of the characters.

Úrsula (Iguarán): 0.26675225258687174

Colonel Aureliano Buendía: 0.24285723082589036

Aureliano Segundo: 0.1333021576177314

José Arcadio Segundo: 0.12999823188347778

Remedios, Meme: 0.12191073285263257

Amaranta: 0.09939612889878079

Gabriel: 0.09508196721311475

Mr. Herbert: 0.06448087431693988

Amparo (Moscote): 0.043136425398238344

Melquíades: 0.04185479524582708
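For the curious, betweenness can be computed with Brandes' algorithm, sketched below in plain Python for unweighted, undirected graphs. The values above are all below 1, so I assume they are normalized by the number of vertex pairs, which I believe is the default of `networkx.betweenness_centrality`. The four-vertex chain is made up and the function name is mine.

```python
from collections import deque

def betweenness_centrality(graph, normalized=True):
    """Brandes' algorithm; `graph` maps each vertex to its neighbours."""
    centrality = {v: 0.0 for v in graph}
    for source in graph:
        # Breadth-first search that also counts shortest paths from `source`.
        order = []
        predecessors = {v: [] for v in graph}
        paths = {v: 0 for v in graph}
        paths[source] = 1
        distance = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in distance:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    paths[w] += paths[v]
                    predecessors[w].append(v)
        # Walk back from the farthest vertices, accumulating dependencies.
        dependency = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in predecessors[w]:
                dependency[v] += paths[v] / paths[w] * (1 + dependency[w])
            if w != source:
                centrality[w] += dependency[w]
    n = len(graph)
    for v in centrality:
        centrality[v] /= 2  # each pair was counted from both endpoints
        if normalized and n > 2:
            centrality[v] /= (n - 1) * (n - 2) / 2
    return centrality

# Made-up chain: Amaranta - Úrsula - Aureliano - Gabriel
toy = {
    "Amaranta": ["Úrsula"],
    "Úrsula": ["Amaranta", "Aureliano"],
    "Aureliano": ["Úrsula", "Gabriel"],
    "Gabriel": ["Aureliano"],
}
scores = betweenness_centrality(toy)
print(round(scores["Úrsula"], 4), scores["Gabriel"])  # 0.6667 0.0
```

In the chain, the two middle vertices each sit on two of the three shortest paths that could pass through them, while the endpoints sit on none.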

Note that **Úrsula** remains the most important character in the novel. Indeed, she has the greatest number of interactions and the largest share of shortest paths passes through her. It is interesting to see that, for the degree centrality, second and third place go to **Amaranta** and **Remedios**, which are main characters but bring less information to the network, that is, many of the other characters could be related without involving them; for the betweenness centrality, those places go to **Colonel Aureliano Buendía** and **Aureliano Segundo**, who represent the main threads of the entire novel.

### Conclusion

It could be interesting to define relationships in a more accurate way, for example *two characters are related if they are near each other when they talk*. As you can imagine, dear reader, this approach is more difficult than the one I used, and it could be investigated.

The approach presented here could be (virtually) applied to any other book. Be careful, anyway: *tagging*, that is, the process of classifying words by their part of speech, is strictly tied to the language. So it may be easy for English, but what about Italian?

I hope you liked this post. Please feel free to suggest anything you would like to see here or to point out errors!