This is the second part of a multi-part blog post in which I use various tools of Natural Language Processing to visualize Ulysses, by James Joyce. The aim is to demonstrate the collaborative possibilities between literary criticism and data science. The first part, where I elaborate on that idea as well as the inspiration for this project, can be found here. The code for the whole project can be found in its Github repository.
There are many, many ways to visualize a book. Even in the inspiration project for this post, there are four distinct visualizations of the chapters, using Euclidean, Manhattan, and Canberra distances, as well as Normalized Compression Distance, all computed from mere TF-IDF values (more on this below).
Perhaps the most popularized notion of story visualization is Kurt Vonnegut’s Shapes of Stories idea:
The notion of stories having shapes is deeply appealing. It allows us to draw parallels between seemingly disparate narratives, including parallels to our own life. Philosopher Alasdair MacIntyre believes human life is narrative in nature, and that drawing such parallels allows us to construct new narratives which resolve epistemological crises. For these reasons there is a comfort in storytelling, literary criticism, and comparative literature, and it’s another compelling reason to use data science to corroborate these parallels. But what exactly are these parallels? What does Vonnegut mean by ‘shape of a story?’ Are these emotional timelines measured by sentiment? Are they plot-driven shapes, or shapes corresponding to character development? Or perhaps they involve even more complex measurements such as prose style, or the evolution of an implicit philosophy throughout a story?
These questions elicit the primary difficulty (and critique) of literary criticism: satisfying answers to such questions tend to resort either to vagueness or to the inaccessible argot of the ivory tower. Such criticism may be rigorous, as anyone who has grappled with Foucault will understand, but it lacks the clarity and accessibility that an analytic approach can provide. European (or so-called continental) philosophy presupposes a knowledge of history and European languages, just as analytic philosophy privileges deductive, irrefutable argumentation at the sacrifice of interdisciplinary scope. The ideal approach to such questions straddles (or dissolves) the line between the two attitudes, and in this particular project, I can't help but think of Wittgenstein's famous utterance in the Tractatus Logico-Philosophicus:
The limits of my language are the limits of my world.
I plan on showing (not saying, for the avid Wittgenstein reader) how TF-IDF and NLP can help us begin to recognize some of those limits.
Let us begin.
In this post I will walk through my process, from beginning to end, of visualizing Ulysses using TF-IDF, or term frequency–inverse document frequency. This standard NLP approach to analyzing text is known as a 'bag of words' method: it takes no account of word order, and therefore none of prose style either. A brief summary of how it works follows:
The set of all documents in the scope of a project is called a corpus. The term frequency, or TF, refers to the number of times a particular word appears in a document. That’s it. The word ‘Karenina’ appears far more often in the novel Anna Karenina than in other novels, and so to some extent, the ‘Karenina’ count will be representative of Anna Karenina as a whole.
However, words such as 'the' or 'a' will also appear often. Hence, it's useful to offset a document's term frequencies with another quantity representing how many documents in the corpus contain each word. This is done by taking the logarithmically scaled inverse fraction of the number of documents containing a particular word, known as the inverse document frequency, or IDF, and multiplying it by the term frequency. This process tends to neutralize common words and give greater weight to rarer ones, in the sense that words with high TF-IDF scores become informationally representative of a document's identity.
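The arithmetic above can be sketched in a few lines of plain Python. This is a minimal illustration of one common TF-IDF variant (scikit-learn, used below, applies a smoothed formula and normalization, so its exact numbers differ), and the tiny corpus is invented for the example:

```python
import math

# toy corpus: each "document" is a list of word tokens
corpus = [
    "anna karenina met vronsky".split(),
    "the train left the station".split(),
    "anna left the station".split(),
]

def tf(term, doc):
    # term frequency: raw count of the term in one document
    return doc.count(term)

def idf(term, corpus):
    # log of the inverse fraction of documents containing the term
    # (one common variant; scikit-learn uses a smoothed version)
    df = sum(term in doc for doc in corpus)
    return math.log(len(corpus) / df)

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

print(tfidf("the", corpus[1], corpus))      # common word: score pulled down
print(tfidf("vronsky", corpus[0], corpus))  # rare word: higher score
```

Even though 'the' appears twice in its document, its low IDF drags its score below that of the one-off 'vronsky'.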
While this process misses out on the crucial information latent in prose through word order (i.e. TF-IDF doesn’t capture the actual ideas behind the words, which a neural net would be capable of), TF-IDF nevertheless captures a surprising amount of information necessary to cluster documents, or novel chapters, in a way that matches our intuitions.
For this project, I began with the text of Ulysses, which I scraped from Project Gutenberg and broke up by chapter. The code for that, and for what follows, can be found in the project's Github repository.
I used scikit-learn’s TF-IDF vectorizer, which finds the TF-IDFs for all of the words in the corpus and represents each document as a matrix of those frequencies:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(uchapters)
There are 18 chapters in Ulysses, so the variable 'vectors' is a sparse matrix with 18 rows: one high-dimensional TF-IDF vector per chapter. In order to plot the chapters, I needed to collapse each of these vectors into two dimensions. Rather than measuring distances with cosine similarity, I found it easier to use Principal Component Analysis (PCA) to reduce each chapter's vector to two coordinates: x and y.
# reduce each chapter's TF-IDF vector to two dimensions with PCA
from sklearn.decomposition import PCA

dense = vectors.toarray()  # convert sparse vectors to dense format for PCA
pca = PCA(n_components=2)
reduced = pca.fit_transform(dense)
With each chapter reduced to two dimensions through PCA, we can finally visualize them. Since PCA uses singular value decomposition to choose its components, each bubble's (x, y) coordinates retain the maximum variance of the original TF-IDF scores that any two dimensions could, which means that proximity between bubbles reflects similarity between chapters in terms of word content.
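A minimal sketch of how such a plot can be produced with matplotlib, running the same TF-IDF-then-PCA pipeline end to end. The three 'chapters' here are toy stand-ins for the real text, and the styling (bubble scale, labels, filename) is my own assumption rather than the post's exact plotting code:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# toy stand-ins for the 18 chapter texts
chapters = [
    "stately plump buck mulligan came from the stairhead",
    "mr leopold bloom ate with relish the inner organs",
    "yes i said yes i will yes",
]
vectors = TfidfVectorizer().fit_transform(chapters)
reduced = PCA(n_components=2).fit_transform(vectors.toarray())

sizes = [len(c.split()) * 40 for c in chapters]  # bubble area ~ word count
fig, ax = plt.subplots()
ax.plot(reduced[:, 0], reduced[:, 1], color="green", zorder=1)  # chronological line
ax.scatter(reduced[:, 0], reduced[:, 1], s=sizes, alpha=0.5, zorder=2)
for i, (x, y) in enumerate(reduced, start=1):
    ax.annotate(str(i), (x, y))  # label each chapter in reading order
fig.savefig("ulysses_tfidf_pca.png")
```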
The green line connects the chapters in chronological order. The size of each blue bubble represents the word count of its chapter. If you're familiar with the plot of Ulysses, it may be interesting to try to make sense of this graphic before looking up the chapter summaries found here. I was pleasantly surprised to find that the bubbles cluster together in a way that corresponds, more or less, to the plot. Chapters 3–11 consist of the novel's main characters going about their day in Dublin, so it makes sense that their TF-IDF vectors resemble one another, despite an unequal focus on character perspectives. (One wonders how the graphic might change if we heavily biased character names when calculating our TF-IDF scores.) The first two chapters are solely from the perspective of one of the novel's arguable protagonists, Stephen Dedalus, whose highbrow diction and frequent references to Shakespeare, the Catholic church, and Latin phrases may explain why the bubbles for those chapters sit apart from the others.
On the other hand, Leopold Bloom, the novel's other (arguably primary) protagonist, who is the titular Odysseus, or Ulysses, retracing the events of The Odyssey over this one day in Dublin in 1904, dominates the perspective of the later chapters. What complicates this visually, however, is the degree to which Joyce increasingly distorts those chapters' perspectives with highly experimental prose techniques. In Chapter 15, Circe, for example, the entire chapter is presented as a script for the stage. Because Chapter 15 contains all of the characters in a surreal rotation of styles, with each character's line preceded by that character's name, its word distribution lacks unique term identifiers, and as a bag-of-words representation its TF-IDF scores would be highly uninformative. That may explain why that chapter's bubble sits close to 0 on both principal component axes. It's also worth noting that Chapter 10, Wandering Rocks, is the other chapter-bubble close to 0 on both axes:
This is explained by the fact that Chapter 10 is something of an 'interlude' chapter. Here too we see every character, from different perspectives, rendered in fairly neutral diction from a God's-eye point of view, so its TF-IDF scores would be equally uninformative after PCA reduction, hence coordinates near 0.
Chapter 17 is far removed, and I can’t quite figure out why. It’s the only chapter that puzzles me. I’ll have to dive in further to understand why, but I imagine the diction is unique enough in that chapter to give it a distinctive set of TF-IDFs:
As for Chapters 13 and 18, I was particularly satisfied to see their bubbles stand together, apart from the rest. Chapter 18, Penelope, is arguably the most famous chapter in the novel: one gigantic stream-of-consciousness flow of thoughts from Leopold Bloom's wife, Molly Bloom. Her thoughts naturally involve many of the same things Leopold is thinking about throughout the day, including the affair Molly had earlier that day with Hugh Boylan, so presumably many of the words in Chapter 18 also appear in Chapter 13, which would explain the proximity of Chapter 18's bubble to Chapter 13's, up in the left corner by itself. Chapter 13, Nausicaa, is told solely from Leopold Bloom's perspective, and while its plot involves a comically lewd episode watching fireworks on a beach, the actual content of the chapter is Leopold's brooding thoughts about his life, about Molly, and about her affair, which he's desperately trying not to think about.
Overall, I was very satisfied with these preliminary results. In my next post, I'll compute a sentiment timeline for Ulysses, which will hopefully reveal some Vonnegut-style 'shapes to the story' as mentioned above. I'll also see whether any discernible 'shapes' emerge at the chapter level.
Thanks for reading! And keep an eye out for Part Three.