I dare say you will never use tf-idf again

Tyler Schnoebelen
Apr 10 · 14 min read

Julia Silge, astrophysicist, R guru and maker of beautiful charts, a data scientist with what seem by any account to be comfortable and happy cats, united the best blessings of existence; and had lived in the world with very little to distress or vex her.

I assure you dear reader, it brings no pleasure to vex such a soul. However, I suspect danger to poor Julia from all the hospitality and kindness Mr. TF-IDF has shown her (and she, him!). If she does not take care, I fear she may be required to sink herself forever. For though she finds the company of Mr. TF-IDF very agreeable, I am sure that the match has little to recommend it.

Uff, okay, that was exhausting. I’m back to my voice instead of attempting Jane Austen’s. Hi. This post may be useful to folks doing text analysis, maaaaaybe it’ll be interesting to fans of Jane Austen who want to know which novel was Jane Austen at her most Austenian and which of her precursors/contemporaries were most similar to her in style.

Honestly, the whole post is mainly imagined for an audience of one: Julia Silge, whose work I admire—she and David Robinson put a bunch of useful R code and explanations in a book on text mining recently. This post is basically a follow-up to some tweets she and I exchanged.

If you like the stuff in here and/or in Julia and David’s book, you may also want to check out some of Jason Kessler’s GitHub and Dan Jurafsky et al. using the methods I explore here on food reviews.

You might also want to check out Mark Liberman on Obama’s favorite State of the Union words. As he points out, the weighted log-odds approach used in this blog post in lieu of tf-idf is meant to:

take account of the likely sampling error in our counts, discounting differences that are probably just an accident, and enhancing differences that are genuinely unexpected given the null hypothesis that both X and Y are making random selections from the same vocabulary.

This fun image of Jane Austen from Jordan Andrew Carter

The data

Since I’m interested in the ways Jane Austen’s style comes across, I need to find a comparison set. So I grabbed the Project Gutenberg texts of a variety of authors who were Austen’s contemporaries or influences (please definitely recommend folks I missed!).

Samuel Richardson (1689–1761), Laurence Sterne (1713–1768), Charlotte Lennox (1730–1804), Olaudah Equiano (1745–1797), Charlotte Smith (1749–1806), Fanny Burney (1752–1840), Elizabeth Inchbald (1753–1821), Mary Robinson (1757–1800), Ann Ward Radcliffe (1764–1823), Maria Edgeworth (1768–1849), Amelia Opie (1769–1853), Walter Scott (1771–1832), Mary Brunton (1778–1818), Susan Ferrier (1782–1854), Catherine Crowe (1803–1876), Dinah Maria Mulock Craik (1826–1887), M. E. Braddon (1835–1915).

If you want to tease Austen about the typo in this title, please make sure to include a link to the manuscript you wrote when you were 15

For reference, Austen lived from 1775–1817. She first published in 1811 with Sense and Sensibility. For what it’s worth, I’ve included some of her juvenilia and letters. The most fun of which is from when she was 15 years old and is called (the misspelling is preserved!) Love and Freindship.

Play-by-play with R code

First let’s load our 267,219 rows of data (from 18 authors across 126 different works, although you can quibble with this second number since, for example, Dinah Maria Mulock Craik’s work A Life for a Life is treated as multiple works, volumes 1–3). This code also cleans up an issue where Fanny Burney appears in the data frame as both “Fanny Burney” and “- Fanny Burney”.

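setwd("wherever_you_have_all_your_text_files")
filenames <- list.files(path = ".", pattern = NULL, all.files = FALSE, full.names = FALSE, recursive = FALSE, ignore.case = FALSE)

# Read one file and tag every row with the name of the file it came from
read_text_filename <- function(filename) {
  ret <- read.delim(filename, sep = "~", header = FALSE)
  ret$Source <- filename
  ret
}

library(plyr)
df <- ldply(filenames, read_text_filename)

library(tidyr)
# Split the filename into title and author at " - "
df <- separate(data = df, col = Source, into = c("Title", "Author"), sep = " - ")
# fixed = TRUE so that "." is a literal dot rather than a regex wildcard
df$Author <- gsub(".txt", "", df$Author, fixed = TRUE)
df$Author <- gsub("- ", "", df$Author, fixed = TRUE)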

For the time being, I don’t actually care about the works, just the authors. So I’m going to simplify the data frame to be just author and the text:

df<-df[,c(3,1)]

Now I want to get bigrams (two-word phrases). I like bigrams a bit more than unigrams because they give a bit more context and syntax, but there are disadvantages (especially the less data you have). You can read more about bigrams and negation here or about increases in accuracy from adding bigrams (but not getting much with trigrams) here.
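If you haven’t played with bigrams before, here is a rough base-R sketch of what “two-word phrases” means in practice. It is not the tokenizer used in this post (tidytext handles punctuation and edge cases differently), and make_bigrams is a made-up helper name for illustration only.

```r
# Toy bigram tokenizer in base R (illustration only; the post itself uses tidytext).
# `make_bigrams` is a hypothetical helper, not part of any package.
make_bigrams <- function(text) {
  # Split on anything that is not a letter or apostrophe; capitalization is kept
  words <- unlist(strsplit(text, "[^A-Za-z']+"))
  words <- words[words != ""]
  if (length(words) < 2) return(character(0))
  # Pair each word with the word that follows it
  paste(head(words, -1), tail(words, -1))
}

bigrams <- make_bigrams("It is a truth universally acknowledged")
print(bigrams)
```

A sentence of six words yields five overlapping bigrams, which is why bigram counts get sparse quickly when you don’t have much text.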

You could also go the opposite way — as Monroe et al do — and use stemming. That’ll turn something like cat and cats into just ‘cat’. It’ll turn helpful, helping, helped, helps, and help into ‘help’. And it’ll turn Sense & Sensibility into sens & sens.

In truth, when I do this kind of work on my SMS text messages or Cher’s tweets, I use my own special tokenizer so that I can get emoji, emoticons, and punctuation-clumps all out. (Tokenizing is just “how are you splitting ‘words’ up?” Not all languages use spaces, and even in English, do you want a string of five emoji to be treated as one thing or five?) But basically, the thing I think you should use for modern text tokenization is close to what computational linguists at CMU worked up.

Okay, one other note is that I’m going to keep capitalization, mostly because we’re going to want to be able to filter out proper names later.

library(dplyr)
library(tidytext)

df_bis <- df %>%
  unnest_tokens(bigram, V1, token = "ngrams", n = 2, to_lower = FALSE)

df_bis_more <- df_bis %>%
  count(Author, bigram)

df_bis_more <- df_bis_more %>%
  bind_tf_idf(bigram, Author, n)

df_bis_more <- df_bis_more[order(-df_bis_more$tf_idf), ]

By this method, we see that the highest tf-idf scores overall are “uncle Toby” (Laurence Sterne, used 835 times) and “Miss Woodley” and “Lord Elmwood” by Elizabeth Inchbald (461 times and 421 times, respectively). The highest Jane Austen bigrams are “Mr Knightley”, “Mr Darcy”, and “Miss Crawford” (Austen writes these 269, 244, and 215 times, respectively). I’m not reporting the actual tf-idf scores because they are…pretty meaningless as numbers. (This is another mark against them.)

But these results seem about right. Shouldn’t major characters with unique names come out on top? One problem is that these are so much about reference that they are stylistically pretty uninteresting. Maybe the one thing we can see is that Austen very rarely has Lords in her writing, only: 8 “Lord Barbourne”, 9 “Lord St”, 4 each of “Lord Ravenshaw” (!!!) and “Lord Spencer”. And in case you’re curious, three “Oh lord”s and nine “Lord how”s.

An alternative to TF-IDF

Okay, now let’s do something different. We’ll calculate the weighted log-odds ratio with uninformative Dirichlet prior (see Monroe et al). I think that “uninformative prior” is a bit of a misnomer here, but it captures the fact that we’re just comparing all of these authors against each other. If we were just analyzing Jane Austen’s books against each other as Julia does, we could use Fanny Burney and all the rest for an INFORMATIVE prior since that’d capture a bit more about the background. We could use a big corpus of modern web language but…that’s not right. The main point is that you would draw from a bigger sample that is relevant to the people/texts you are studying.

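# TF: how often each bigram occurs across all authors
termfreq <- aggregate(list(TF = df_bis_more$n), list(bigram = df_bis_more$bigram), FUN = sum)
df2 <- merge(df_bis_more, termfreq, by = "bigram")

# DFnotthem: how often everyone except this author uses the bigram
df2$DFnotthem <- df2$TF - df2$n

# sum: total bigram tokens in the corpus; x: total bigram tokens for this author
df2$sum <- sum(df2$n)
authortermfreq <- aggregate(df_bis_more$n, by = list(Author = df_bis_more$Author), FUN = sum)
df2 <- merge(df2, authortermfreq, by = "Author")
df2$xnotthem <- df2$sum - df2$x

# Odds for this author and for everyone else, with the corpus-wide counts added as the prior
df2$l1them <- (df2$n + df2$TF) / ((df2$sum + df2$x) - (df2$n + df2$TF))
df2$l2notthem <- (df2$DFnotthem + df2$TF) / ((df2$sum + df2$xnotthem) - (df2$DFnotthem + df2$TF))

# Approximate variance, then the log-odds ratio scaled by its standard deviation
df2$sigmasqrd <- 1 / (df2$n + df2$TF) + 1 / (df2$DFnotthem + df2$TF)
df2$sqrt <- sqrt(df2$sigmasqrd)
df2$logodds <- (log(df2$l1them) - log(df2$l2notthem)) / df2$sqrt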

That gives us the main thing we’re going to compare against tf-idf, but let’s do a few other things.

df2$perc <- df2$n / df2$TF

library(plyr)
numauthors <- ddply(df2, 'bigram', function(x) c(numauthors = nrow(x)))
df2 <- merge(df2, numauthors, by = "bigram")

df2 <- df2[order(-df2$tf_idf), ]
df2$ranktfidf <- seq.int(nrow(df2))

df2 <- df2[order(-df2$logodds), ]
df2$ranklogodds <- seq.int(nrow(df2))

df2$rankdiffabsv <- abs(df2$ranktfidf - df2$ranklogodds)

Okay, so the very first thing to note is that the top two by log odds are ranked super-low by tf-idf. These are “said the” and “of the”, which are things that Walter Scott uses a lot (5,824 and 32,468 times, respectively). The log-odds scores for these bigrams are extremely high. They are extremely Scottian (“one of the” and “out of the”; “said the Lady”, “said the King”, “said the page”, and of course the best one, which was used 89 times in The Monastery, “said the Sub”).

One of the problems with using tf-idf for stylistic analysis is that if everyone uses a phrase, it gets a score of 0 even if some people use it a whole lot more than others. That’s because the “idf” in “tf-idf” is for inverse document frequency. As an English reader/speaker, you won’t be surprised that all 18 authors use “of the” and “said the”. The inverse document frequency is calculated as the natural log of the total number of documents (=authors, so 18) divided by the number of documents (authors) who use the phrase (in this case everyone uses it, so 18 again). The natural log of 18/18 = natural log (1) = 0. So you multiply the tf*0=0. The rationale behind this is basically right: tf-idf is commonly used in finding search terms and honestly, it’s not a great search term if it appears EVERYWHERE.
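To make the arithmetic concrete, here is a tiny sketch of that idf calculation. The function name idf and the term frequency of 0.01 are placeholders I made up; only the 18 authors and the natural-log formula come from the description above.

```r
# idf as described in the text: natural log of
# (number of documents) / (number of documents containing the term)
n_authors <- 18

idf <- function(authors_using_term, total_authors = n_authors) {
  log(total_authors / authors_using_term)
}

idf_everyone <- idf(18)  # a phrase all 18 authors use, like "of the"
idf_one      <- idf(1)   # a phrase only one author uses, like a character name

# Hypothetical term frequency, just to show the multiplication
tf <- 0.01
print(tf * idf_everyone)  # zero, no matter how large tf is
print(tf * idf_one)       # positive
```

However often Walter Scott writes “of the”, the zero idf wipes it out, which is the behavior the rest of this section is complaining about.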

However, you and I both know that some people say/write words everyone else does…just way more. That’s like Walter Scott. To my mind, this is a great advantage of log odds: it helps you distinguish these authors. So in tf-idf Jane Austen is also a “0” for said the and of the…but in log-odds her values for these are -6.1 and -6.6. That is, log odds help us discover that although Jane Austen uses these phrases (25 times for said the and 3,699 for of the), she uses them far less often, compared to the other authors, than we’d expect by chance. A question that I won’t pursue is whether Austen declines to have roles speak without names (e.g., she may prefer “said Father Dowling” to “said the priest”). And of course, you could say the cowl of the vigilante, but maybe Austen would prefer the vigilante’s cowl. Or maybe there’s something more specific about the definite article the that Austen doesn’t like.
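Here is a toy version of the same idea, small enough to check by hand. It wraps the column arithmetic from the code earlier in this post into a function; weighted_log_odds is a hypothetical name, the two authors and all the counts are invented, and this is a sketch rather than a faithful reimplementation of Monroe et al.

```r
# Toy counts: two authors, three bigrams (all numbers invented for illustration)
counts <- data.frame(
  Author = c("A", "A", "A", "B", "B", "B"),
  bigram = c("of the", "do not", "said the", "of the", "do not", "said the"),
  n      = c(300, 50, 150, 100, 120, 20)
)

# Hypothetical helper mirroring the column arithmetic in the post:
# the prior for each bigram is its count across the whole corpus.
weighted_log_odds <- function(d) {
  TF      <- ave(d$n, d$bigram, FUN = sum)  # bigram count over all authors
  x       <- ave(d$n, d$Author, FUN = sum)  # each author's total bigram count
  total   <- sum(d$n)                       # corpus-wide total
  n_other <- TF - d$n                       # everyone else's count of the bigram
  x_other <- total - x                      # everyone else's total
  odds_them  <- (d$n + TF) / ((x + total) - (d$n + TF))
  odds_other <- (n_other + TF) / ((x_other + total) - (n_other + TF))
  sigma <- sqrt(1 / (d$n + TF) + 1 / (n_other + TF))
  (log(odds_them) - log(odds_other)) / sigma
}

counts$logodds <- weighted_log_odds(counts)
print(counts)
```

Both toy authors use “of the”, so tf-idf would score it 0 for each of them, but the log-odds come out strongly positive for A and strongly negative for B.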

The highest values of both methods are mostly proper nouns. So let’s do a quick filter to find bigrams that don’t include any capital letters. Here are the top lower-case phrases that are separated by at least 50 spots when you compare the rankings. Which of these do you think is more Austen-y?

  • every thing, her aunt, the general’s, very affectionately, she must, very differently, of there, inclination for, exactly what, a something
  • do not, to be, very much, she could, very well, not be, could not, must be, of it, not know

I’m not sure I have a preference myself. But this will reveal an advantage of log-odds: in this case I can say that either list you choose is full of high log odds. As you add or subtract texts, the log-odds for phrases will shift, but whenever a phrase scores above 1.96 (the usual z-score cutoff for 95% confidence), it’s very Austen-y. The less evidence you have, the more the weighted log-odds will shrink. Tf-idf values are very hard to compare or understand at sight: you have to understand them relative to the corpus. So another benefit of log-odds is human readability.
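A quick sketch of that shrinking, under made-up numbers: the same usage rates (2% versus 1%) scored once with plenty of text and once with a tenth as much. z_score is a hypothetical one-phrase version of the arithmetic used in this post, with the pooled counts standing in for the prior.

```r
# z-score cutoff for a two-sided 95% interval
cutoff <- qnorm(0.975)  # about 1.96

# Hypothetical scalar version of the weighted log-odds arithmetic:
# n_a of x_a bigrams for one author, n_b of x_b for everyone else,
# with the pooled counts acting as the prior.
z_score <- function(n_a, x_a, n_b, x_b) {
  tf    <- n_a + n_b
  total <- x_a + x_b
  odds_a <- (n_a + tf) / ((x_a + total) - (n_a + tf))
  odds_b <- (n_b + tf) / ((x_b + total) - (n_b + tf))
  (log(odds_a) - log(odds_b)) / sqrt(1 / (n_a + tf) + 1 / (n_b + tf))
}

# Same usage rates, but ten times less text in the second call
z_big   <- z_score(200, 10000, 100, 10000)
z_small <- z_score(20, 1000, 10, 1000)

print(c(cutoff = cutoff, z_big = z_big, z_small = z_small))
```

With the larger sample the phrase clears the 1.96 line; with the smaller one it does not, even though the raw proportions are identical.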

Of the top 10,000 terms by tf-idf across all 18 authors, log-odds agrees that 8,607 of them should be very highly ranked (>1.5 log odds). It only thinks two of the top-ranked tf-idf items are dumb (i.e., negative log-odds). Even these are only weakly negative: De Courcy is -0.47 log-odds for Jane Austen and I don’t is -0.033 log-odds for author Catherine Crowe. To be honest, I don’t think this makes the Monroe et al method better. But in the next sections I’ll show some extensions that you could probably figure out how to do with tf-idf, but that aren’t as straightforward there.

Take me to peak Austen

There are 932 bigrams that have log-odds for Jane Austen over 1.5. We can take her novels and count up how many times each of them appears. We’ll normalize the counts by scaling everything to be “per million bigrams”. For example, there are 80 uses of “so very” in Emma from a total of 164,267 bigrams in that novel. If Emma were a much longer book, a million bigrams long, we’d expect there to be 487 “so very”’s. This just lets us compare works of different lengths. We’ll multiply this value by the log-odds we calculate for Austen out of the set of all authors. “So very” is very Austenian: its log-odds score is 7.06. Finally, we sum up each work. We’ve controlled for length and weighted more Austenian phrases more heavily. So we just have to compare sums.
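The scoring is simple enough to sketch in a few lines. The “so very” numbers are the ones quoted above; austen_score is a hypothetical helper name, and the second phrase in the last call is invented purely to show the shape of the sum.

```r
# Numbers for "so very" in Emma, taken from the text above
uses_in_emma    <- 80
bigrams_in_emma <- 164267
logodds_so_very <- 7.06

# Rate per million bigrams, then weight it by how Austenian the phrase is
per_million  <- uses_in_emma / bigrams_in_emma * 1e6  # about 487
contribution <- per_million * logodds_so_very

# Hypothetical helper: sum the weighted rates over every Austenian bigram in one work.
# `n` = counts of each bigram in the work, `logodds` = Austen's score for each bigram.
austen_score <- function(n, logodds, total_bigrams) {
  sum(n / total_bigrams * 1e6 * logodds)
}

# Invented two-phrase example (the second phrase and its score are made up)
toy_score <- austen_score(n = c(80, 10), logodds = c(7.06, 2.0), total_bigrams = 164267)
print(c(per_million = per_million, contribution = contribution, toy_score = toy_score))
```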

It turns out that Jane Austen is at her most Austenian in Emma, the fourth novel she published. Here are her books in order of Austenianishness. This is the exact order, but there are some natural cut-points, so I’ve put them in three groups (in each bullet bundle, the Austenian scores are ordered but close):

  • Emma (1816), Sense and Sensibility (1811), Pride and Prejudice (1813)
  • Mansfield Park (1814), Northanger Abbey (posthumously published, 1818), Select Letters (1796–1816)
  • Persuasion (posthumously published, 1818), Lady Susan (juvenilia, ~1793–1795), Love and Freindship (juvenilia, ~1790)

Perhaps you want a paragraph that is Peak Austen within the Peak Austen Novel? Here ya go, with bolding for especially Austenian turns of phrase:

One thing only was wanting to make the prospect of the ball completely satisfactory to Emma — its being fixed for a day within the granted term of Frank Churchill’s stay in Surry; for, in spite of Mr. Weston’s confidence, she could not think it so very impossible that the Churchills might not allow their nephew to remain a day beyond his fortnight. But this was not judged feasible. The preparations must take their time, nothing could be properly ready till the third week were entered on, and for a few days they must be planning, proceeding and hoping in uncertainty — at the risk — in her opinion, the great risk, of its being all in vain.

The very most Austenian phrases are:

Yes, I feel guilty that I switched from R to Tableau to make this graphic; I put stuff I thought was interesting in purple — there’s something about certainty in Austen and something else about beingness

Some of my other personal favorite Austenian phrases include in love, equal to, exactly what, very pretty, very agreeable, much obliged, and every respect.

The anti-style of an author

I love the ability to see what it is that people DON’T say. The current calculations only report negative log-odds for an author that actually uses a bigram. One way to fill out values is going to be to find the missing rows. For example, Fanny Burney has zero tokens of Mr Knightley—let’s add one for her and everyone else (including Austen).

We can use some R code to do this, but frankly it takes longer to fill in all these bigrams than I want. So I’m going to go ahead and say I only care about terms that are already above 1.5 or below -1.5 log-odds for at least one author. (If you’re being more scientific you probably want to use 1.96 for your cut-offs.)

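# Keep only bigrams that are strongly for/against at least one author
topbottom <- subset(df2, abs(logodds) > 1.4999)
justbis <- as.data.frame(unique(topbottom$bigram))
df3 <- subset(df2, bigram %in% justbis$`unique(topbottom$bigram)`)

library(tidyr)  # complete() comes from tidyr
# Add a row with n = 0 for every Author/bigram pair that never occurs
df3filled <- df3 %>% complete(Author, bigram, fill = list(n = 0))
df3 <- df3filled[, c(1:3)]

# Add one to every count, observed or not
df3$n <- df3$n + 1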

The subsetting reduces the df2 data frame to just 13% of all rows. Then filling in the missing rows quintuples it again.
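If the fill-in step is hard to picture, here is a base-R sketch of the same idea on a toy table. The authors, bigrams, and counts are invented, and expand.grid plus merge is just one way to do it (not the way the code in this post does it).

```r
# Toy observed counts: author B never uses "Mr Knightley" (invented numbers)
observed <- data.frame(
  Author = c("A", "A", "B"),
  bigram = c("Mr Knightley", "do not", "do not"),
  n      = c(5, 3, 7),
  stringsAsFactors = FALSE
)

# Every Author x bigram combination
grid <- expand.grid(
  Author = unique(observed$Author),
  bigram = unique(observed$bigram),
  stringsAsFactors = FALSE
)

# Left-join the observed counts; unseen pairs come back as NA, which we set to 0
filled <- merge(grid, observed, by = c("Author", "bigram"), all.x = TRUE)
filled$n[is.na(filled$n)] <- 0

# Add one to everyone so that no count is zero
filled$n <- filled$n + 1
print(filled)
```

Three observed rows become four, and on the real data the growth is much bigger because most authors never use most character names.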

In the interest of space, I’m going to redo everything above to get new log-odds scores for this subset of ~164k unique bigrams that at least one author had a very strong predilection for/against. If you’re watching carefully, you know that the new numbers ignore a bunch of the “uninteresting” bigrams. The perfectionist in me hates that I’m not using ALL the data, but my intuition is that the results won’t be all that different and the reduction saves computation time. I have not proven that this is true so feel free to contradict me. (I often talk people out of removing ‘stop words’…here, I’m basically saying ‘don’t use predetermined stop words because you don’t really know if they matter or not, but go ahead and ignore words you have some basis for calling…boring’.)

The more words an author has, the bigger deal it is for them not to use a phrase that is popular with others. So for Jane Austen, this n+1 shows that the following phrases — which she never uses — are the most surprising given how often others use them:

  • must needs, of God, of death, thou hast, o the, thou art, I don’t

The phrase must needs is especially associated with Mary Elizabeth Braddon, but Walter Scott uses it a lot, too, and Samuel Richardson, Charlotte Lennox, and Fanny Burney each use it over 20 times.

Olaudah Equiano’s autobiography includes the horrors of slavery but has surprisingly little syntactic negation (e.g., ‘don’t’, ‘do not’) compared to his contemporaries

Meanwhile, I don’t is something Braddonesque, Edgeworthian, and Burneyish. It’s extremely non-Austeny. The only author who uses less I don’t is Olaudah Equiano who never uses it. Note that do not (no contraction) is very Austenian and so is Do not. Olaudah Equiano doesn’t use do not, either. I’m not sure what’s going on with him and negation, but it may be worth looking into.

Which author is most like Jane Austen (who isn’t Jane Austen)?

Our new data frame lets us calculate who is most similar to everyone else. I’m going to make the claim that I don’t really care about terms that are used more than 75% of the time by one author or that occur in the bottom quartile of overall term frequency in this data-frame-of-extremes (“39” will be the limit; recall that this isn’t the exact number because we added +1 for each author that didn’t use a term).

highlyspecificterms <- subset(df3, perc > 0.75 | TF < 40)
highlyspecificterms <- as.data.frame(unique(highlyspecificterms$bigram))
df3reduced <- subset(df3, !(df3$bigram %in% highlyspecificterms$`unique(highlyspecificterms$bigram)`))

df3forcor <- df3reduced[, c(1, 2, 16)]
df3forcor <- spread(df3forcor, Author, logodds, fill = NA)

df3forcor.cor = cor(df3forcor[, -1], method = "spearman")^2
df3forcor.dist = dist(df3forcor.cor)
df3forcor.clust = hclust(df3forcor.dist)
plot(df3forcor.clust)

Here we see that it looks like Fanny Burney and Jane Austen are the most similar. And you’ll notice that, for reasons I don’t know, Project Gutenberg calls an author whose first name is “Elizabeth” simply “Mrs. Inchbald”.
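For a sense of what the correlation-and-clustering step is doing, here is a toy run with three authors and five bigrams. Every score in this matrix is invented, so the grouping it produces says nothing about the real authors; it only shows the mechanics.

```r
# Invented log-odds scores: rows are bigrams, columns are authors
scores <- cbind(
  Austen = c( 3.0,  2.0, -1.0, -2.5,  0.5),
  Burney = c( 2.5,  1.5, -0.5, -2.0,  1.0),
  Scott  = c(-2.0, -1.0,  3.0,  2.0, -0.5)
)

# Squared Spearman correlation between authors, as in the post
sim <- cor(scores, method = "spearman")^2

# Distances between authors' rows of the similarity matrix, then clustering
clust <- hclust(dist(sim))

# The first merge joins the two authors that are most alike on this measure;
# negative entries in clust$merge are indices of the original columns
first_pair <- colnames(scores)[-clust$merge[1, ]]
print(round(sim, 2))
print(first_pair)
```

One thing the toy makes visible: squaring throws away the sign, so an author who is strongly anti-correlated with another still gets a high similarity value. Whether that matters for the real data is something I have not checked.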

I woke up looking this good.

This certainly makes sense — Fanny Burney is someone whose work was well known and influential to Jane Austen. I haven’t dug into the bigrams that unite them (sorry…but isn’t this long enough?!) That said, both of them use the epistolary form in their novels (where some of the novel is a letter a character is writing to another character).

The most famous link between the two is that the end of Burney’s novel Cecilia is probably where Jane Austen got the title of her most famous work (the caps are in the Project Gutenberg text, btw).

“The whole of this unfortunate business,” said Dr Lyster, “has been the result of PRIDE and PREJUDICE. Your uncle, the Dean, began it, by his arbitrary will, as if an ordinance of his own could arrest the course of nature! and as if he had power to keep alive, by the loan of a name, a family in the male branch already extinct. Your father, Mr Mortimer, continued it with the same self-partiality, preferring the wretched gratification of tickling his ear with a favourite sound, to the solid happiness of his son with a rich and deserving wife. Yet this, however, remember; if to PRIDE and PREJUDICE you owe your miseries, so wonderfully is good and evil balanced, that to PRIDE and PREJUDICE you will also owe their termination: for all that I could say to Mr Delvile, either of reasoning or entreaty, — and I said all I could suggest, and I suggested all a man need wish to hear, — was totally thrown away, till I pointed out to him his own disgrace, in having a daughter-in-law immured in these mean lodgings!

Isn’t immured an amazing word? ‘To enclose within walls’ — the ‘mure’ in it is the same root you see in ‘mural’.

Now if you’ll excuse me, I suddenly need to go read some Edgar Allan Poe.
