One rainy Ontarian day a few months ago, I got the itch to visit my local bookstore, as one sometimes does on days when there isn’t much else to do. Grabbing my umbrella and venturing out, I soon found myself covered in raindrops, standing in front of the bookstore computer, unsure of what exactly I was looking for.
Typing in a few nondescript words and finding nothing, I decided to see if the shop had stocked anything by Vladimir Nabokov, one of my favorite authors. Why I often type in the names of long-dead novelists looking for new works, I’ll never know. While my search yielded no new novels, I came across something equally interesting.
Nabokov’s Favorite Word is Mauve by Ben Blatt describes a series of experiments using data science techniques to analyze famous works of literature. The fact that Nabokov’s name is in the title is just a coincidence; he isn’t mentioned much. The main purpose of the book is to determine if one can predict literary success from certain characteristics — use of adverbs, punctuation, and the like. As a software engineer with a passion for literature, this prospect excited me very much.
Before Blatt dives deep into these questions, he shares the story that inspired his book. While Alexander Hamilton is known more as a Broadway superstar these days, he was a founding father of the United States and one of the authors of what are now known as the Federalist Papers. James Madison and John Jay also wrote some of the papers; however, the authorship of twelve of them was disputed.
Using stylometry, the statistical analysis of writing style, Mosteller and Wallace devised a method that finally put the mystery to rest in 1963. Their analysis pointed strongly to Madison, not Hamilton, as the author of all twelve disputed papers.
At the story’s conclusion, my brain was racing. What other insights could we gain through the use of this science? What mysteries could be solved using some simple code and some text samples? In the end, I came right back to where I had started: Nabokov.
Pale Fire, the 1962 masterwork by Vladimir Nabokov, is unique in its structure. The novel is split into two main parts: the titular poem, written by John Shade, and a corresponding commentary, written by Charles Kinbote.
Since its publication, questions have been raised about the authorship of both the poem and the commentary; one widely accepted theory is that the commentary was written not by Kinbote, but rather by his sinister colleague Professor Botkin. Nabokov even endorsed this theory. Still, there are others who believe Kinbote is an alter ego of Shade, or vice versa. While some of these theories are beyond the power of stylometry in its current form, here we will try to determine whether the poem and commentary were written by the same individual.
The question we are trying to answer is related to, but not strictly the same as, the Federalist Papers problem described above. While Mosteller and Wallace were attempting to determine the true author of a document from a pool of potential authors, here we are trying to determine whether two works were written by the same individual. Luckily, in their paper Determining If Two Documents Are Written by the Same Author, Koppel and Winter showed that the second problem can be reduced to the first; to determine if the authors of the poem and commentary are the same, we need to introduce a pool of imposter authors. Not just any imposters, however: we need to choose our lineup of suspects carefully.
First off, this experiment differs from those previously attempted in that both our potential authors and the disputed work were written, at least in our world, by the same person: Nabokov. If we compared Kinbote to Capote, it would be obvious that Kinbote was the man we were looking for. Luckily, Nabokov has provided us with a host of unreliable narrators to choose from. For this experiment, our potential authors will include Kinbote, Humbert Humbert (Lolita), and V (The Real Life of Sebastian Knight).
A second challenge is that each of these authors’ works is centered on another main character and a different locale: Kinbote with Shade and Zembla, Humbert with Lolita, and V with Sebastian. To make this comparison fair, I’ve removed most instances of proper nouns. To do this, I tried to choose passages that were independent of their subject. Kinbote’s section on suicide and Humbert’s description of the perfect murder come to mind. Sometimes, it was necessary to remove a word or two from the sample.
The same can be said for punctuation. Since the stylometry relies so heavily on the words used, I’ve split up all contractions and hyphenated compounds, as they do not play nicely with the algorithm used; for example, self-control would be counted as a single 11-letter word, when in reality splitting it into two words gives better insight into the author’s vocabulary.
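As a rough illustration of this cleanup step, here is a minimal sketch in Python. The `tokenize` helper is a hypothetical name of my own, not necessarily what the linked code does, and it makes simplifying assumptions: hyphens, dashes, and apostrophes all act as word separators, and anything that is not an unaccented letter is dropped.

```python
import re

def tokenize(text):
    """Lowercase the text and return a list of plain words (assumed cleanup rules)."""
    # Turn hyphens and dashes into spaces so "self-control" becomes two words
    text = re.sub(r"[-\u2013\u2014]", " ", text.lower())
    # Keep runs of unaccented letters only; punctuation, digits and apostrophes are dropped
    return re.findall(r"[a-z]+", text)

print(tokenize("Self-control, at last."))  # ['self', 'control', 'at', 'last']
```

Proper nouns are not handled here; in the experiment they were avoided by hand when choosing the passages.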
Koppel and Winter showed their method can be very accurate with short text samples, so the samples chosen are only between 250 and 500 words each. With five samples per author, we can draw from 1,250 to 2,500 words per author, which should be enough for some patterns to form. The samples I used are by no means exhaustive; I’ve provided the code and samples used, so further work can be done:
Stylometry analysis of Nabokov's Pale Fire: Use stylometric methods to determine the true author of Pale Fire. Methods…
Before running the algorithm to determine the writer of Pale Fire, let’s perform a basic computation on our data: a count of word lengths. This technique was first proposed by T. C. Mendenhall, who believed that an author’s stylistic signature could be determined by counting how often they used words of different lengths. This gives a gauge of the author’s vocabulary. See the graphs for our potential authors, as well as for the writer of Pale Fire (the cantos of the title poem), below.
Note that the x-axis denotes the word length, while the y-axis denotes the number of uses of words of that length.
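The counts behind graphs like these can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function names are hypothetical, the sample sentence is a placeholder rather than text from the novels, and ties between equally common lengths are broken in favor of the shorter length.

```python
import re
from collections import Counter

def word_length_counts(text):
    """Map each word length to the number of words of that length."""
    words = re.findall(r"[A-Za-z]+", text)
    return Counter(len(word) for word in words)

def ranked_lengths(counts, top=9):
    """Word lengths ordered from most used to least used (shorter length wins ties)."""
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [length for length, _ in ordered][:top]

# Placeholder sample text, not taken from the novels
counts = word_length_counts("the quick brown fox jumps over the lazy dog")
print(sorted(counts.items()))   # [(3, 4), (4, 2), (5, 3)]
print(ranked_lengths(counts))   # [3, 5, 4]
```

Plotting `counts` with word length on the x-axis and the count on the y-axis gives a curve of the kind described here.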
One can decipher many things from these graphs, but the first thing to take note of is that the graph for the cantos is quite a bit smoother than the others, mainly because the sample size for the poem is bigger than those of the potential authors. This shouldn’t matter much, as the pattern will still be the same. We also note that the vocabulary of the poem is slightly larger than that of our potential authors, perhaps because of the differences between prose and poetry.
To analyze, let’s take the nine word lengths most often used by each author, ordered from most to least frequent. We have:
Cantos: [3,4,2,5,6,7,1,8,9]
Kinbote: [3,2,4,6,5,7,8,1,9]
Humbert: [3,2,4,6,5,7,1,8,9]
V: [3,2,4,5,6,7,1,8,9]
Comparing each list with the cantos position by position, we see that Kinbote matches 3/9, Humbert matches 5/9, and V matches 7/9. By looking at the graphs, we also see that Humbert’s vocabulary contains the longest words, which is no surprise.
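These match counts can be reproduced with a short position-by-position comparison. The sketch below uses the rankings listed above, reading the last two entries of the cantos list as 8 and 9; the helper name is my own.

```python
cantos = [3, 4, 2, 5, 6, 7, 1, 8, 9]
candidates = {
    "Kinbote": [3, 2, 4, 6, 5, 7, 8, 1, 9],
    "Humbert": [3, 2, 4, 6, 5, 7, 1, 8, 9],
    "V":       [3, 2, 4, 5, 6, 7, 1, 8, 9],
}

def positional_matches(reference, ranking):
    """Count the positions at which two rankings hold the same word length."""
    return sum(a == b for a, b in zip(reference, ranking))

scores = {name: positional_matches(cantos, ranking)
          for name, ranking in candidates.items()}
print(scores)  # {'Kinbote': 3, 'Humbert': 5, 'V': 7}
```

All three candidates differ from the cantos in the second and third positions, where the poem favors four-letter words over two-letter ones.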
By this crude analysis, we can see that Kinbote is actually the least likely of our authors to have written Pale Fire. Without looking into the data further, we would conclude that Kinbote is not the author of the poem. Stylometry has advanced quite a bit since Mendenhall’s time, however. While these numbers do provide insight, they do not take any context into account. Nuances in the writings go unnoticed. In order to go deeper, we’ll have to use a different algorithm.
Most algorithms in modern-day stylometry measure the “distance” between two sets of text. This “distance” can be defined in many different ways, usually using different statistical methods. Here, we’ll be using Burrows’ Delta method, which essentially measures how much the disputed sample and the samples written by potential authors diverge from the average of all of them put together. This algorithm considers the exact words that were used, and weights them so that the most common words do not dominate the result. One can find a complete description of the algorithm here.
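To make the idea concrete, here is a minimal sketch of Burrows’ Delta as it is commonly described, using only the Python standard library. It is not necessarily the code used for the results in this article: the function names, the `n_features` default, and the toy word lists are all assumptions of mine.

```python
from collections import Counter
from statistics import mean, stdev

def relative_freqs(tokens, features):
    """Share of the text taken up by each feature word."""
    counts = Counter(tokens)
    return {word: counts[word] / len(tokens) for word in features}

def burrows_delta(candidates, disputed, n_features=30):
    """Return {candidate: delta}; a lower delta means a closer style."""
    # 1. Features are the most frequent words across all candidate texts
    corpus = Counter()
    for tokens in candidates.values():
        corpus.update(tokens)
    features = [word for word, _ in corpus.most_common(n_features)]

    freqs = {name: relative_freqs(tokens, features)
             for name, tokens in candidates.items()}
    test_freqs = relative_freqs(disputed, features)

    totals = {name: 0.0 for name in candidates}
    used = 0
    for word in features:
        values = [freqs[name][word] for name in candidates]
        mu, sigma = mean(values), stdev(values)
        if sigma == 0:
            continue  # every candidate uses this word equally often
        used += 1
        # 2. z-scores put common and rare words on the same scale
        z_test = (test_freqs[word] - mu) / sigma
        for name in candidates:
            z_candidate = (freqs[name][word] - mu) / sigma
            totals[name] += abs(z_candidate - z_test)
    if used == 0:
        raise ValueError("no feature distinguishes the candidates")
    # 3. Delta is the mean absolute difference of z-scores
    return {name: total / used for name, total in totals.items()}

# Toy data for illustration only, not the samples from the experiment
toy = {
    "a": "the cat sat on the mat the end".split(),
    "b": "a dog and a bird and a fish".split(),
}
toy_scores = burrows_delta(toy, toy["a"])
print(min(toy_scores, key=toy_scores.get))  # a
```

In the toy run the disputed text is candidate a’s own sample, so its delta is zero and it is picked as the closest match. With real samples, the tokens for each candidate would be the concatenation of that narrator’s cleaned passages.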
To test that this algorithm works, we will first test our potential authors list against a different disputed work — one that isn’t even disputed. Using the foreword of Pale Fire, written by the same individual who wrote its commentary, we will see if we get the correct result of Kinbote. Note that the individual with the lowest delta score is the most likely author. Running the algorithm with the foreword, we get:
Delta score for candidate humbert is 1.1486340383396685
Delta score for candidate kinbote is 0.9518314112226345
Delta score for candidate v is 1.05899937250705
The algorithm correctly predicts Kinbote (or Botkin) as the most likely author of the foreword, by a rather large margin.
Running the algorithm against Pale Fire, we find our results:
Delta score for candidate humbert is 0.9600294963002932
Delta score for candidate kinbote is 0.9918804751828916
Delta score for candidate v is 1.1230558877047745
The algorithm does not name Kinbote as the most likely author of the poem, but rather Humbert. This does not prove that Kinbote is not the author of the poem, though it does support the notion.
In their paper, Koppel and Winter describe the use of a “threshold” to determine if the result of the algorithm is viable or not. Unfortunately, we do not have such a threshold at this time, as it would require many more samples and much more experimentation. The fact that this is a poem may play a part as well, as an author’s style may not transfer completely between prose and poetry. More work could be done to determine if this is the case.
We do note, however, that Humbert’s score is extremely close to the score Kinbote received above, so it would stand to reason he would be named the author, and not Kinbote. As Humbert is an imposter, this implies that Kinbote did not write the poem, and that Pale Fire was in fact written with two different pens.
Whether or not this person was Kinbote remains up to the reader. Botkin(e) is an anagram of Kinbote, and stylometry has other methods of dealing with those. Perhaps another algorithm could be applied to find more patterns within the pages of Pale Fire. One wonders if Nabokov would agree; it does take some of the fun out of things, doesn’t it?