A word cloud of words used in Abstract

I really enjoyed Abstract, Netflix's documentary series on design. It's the sort of content that makes you want to stop consuming passively and create something yourself.

I've been playing with text analysis in R and decided to apply some of the techniques I had come across recently to create a word cloud of all the words used in the show. Like all data science projects, this one began with acquiring the data. Netflix offers subtitles in multiple languages, and each time you turn subtitles on or change the language, the player downloads them. It took a couple of minutes to locate the request in Chrome's developer tools.

Since the series has just 8 episodes, I downloaded the English subtitles for each episode manually rather than trying to automate the process.

The subtitles are offered in TTML format. For a word cloud, I was only interested in the textual content of each file and not so much in the timing or font data. So rather than try to find a specialised TTML handling library, I wrote a quick Python script that simply treated each file as just another XML file and dumped the text for each episode into one tab-delimited file:

episode script
1 [director] Part of this is I’m trying to figure out some of the...
2 [sea birds calling] [Tinker] I probably think about feet a lot ...
3 [Es] Over the last two decades of working, one of the things I'...
...
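The script itself was only a few lines. The original isn't reproduced here, so treat this as a sketch of the approach; the 1.ttml ... 8.ttml file names and the assumption that the dialogue sits in TTML <p> elements are mine:

import xml.etree.ElementTree as ET

# TTML is XML; the dialogue text lives in <p> elements under the TTML namespace
NS = '{http://www.w3.org/ns/ttml}'

with open('abstract.tsv', 'w') as out:
    out.write('episode\tscript\n')        # header row: episode<TAB>script
    for episode in range(1, 9):           # assumed file names: 1.ttml ... 8.ttml
        tree = ET.parse('%d.ttml' % episode)
        # itertext() walks all text nodes, ignoring the timing and styling markup
        text = ' '.join(''.join(p.itertext()).strip() for p in tree.iter(NS + 'p'))
        out.write('%d\t%s\n' % (episode, text.replace('\t', ' ')))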

The next step was reading this file into R. I used the versatile read.csv, but it didn't work as expected:

> d <- read.csv('~/proj/abstract.tsv', sep="\t")
> str(d)
'data.frame': 3 obs. of 2 variables:
$ episode: int 6 7 8
$ script : Factor w/ 3 levels "[Ilse] Some people think interior design is a look. In fact, It must be really fun buying furniture is somethin"| __truncated__,..: 2 3 1

The data frame had only 3 rows. It had clearly read the whole file, since it picked the column names from the header (episode, script) and got the last 3 rows, but I was quite puzzled as to why read.csv would skip rows 1–5. I tried passing the as.is=T argument to read.csv to prevent it from interpreting strings as factors, but that didn't stop it from skipping rows 1–5:

'data.frame': 3 obs. of  2 variables:
$ episode: int 6 7 8
$ script : chr "[Paula] I walk outside and I see typography everywhere. New York City is a city of signs. Sometimes things writ"| __truncated__ "[Platon] I'm not really a photographer at all. The camera is nothing more than a tool. Communication, simplicit"| __truncated__ "[Ilse] Some people think interior design is a look. In fact, It must be really fun buying furniture is somethin"| __truncated__

Ultimately, I fell back on fread from the data.table package and it worked like a charm:

> library(data.table)
> d <- fread('~/proj/abstract.tsv')
> str(d)
Classes ‘data.table’ and 'data.frame': 8 obs. of 2 variables:
$ episode: int 1 2 3 4 5 6 7 8
$ script : chr "[director] Part of this is I'm trying to figure out some of the big picture things. How aesthetically to tell y"| __truncated__ "[sea birds calling] [Tinker] I probably think about feet a lot more than the average person. As a shoe designer"| __truncated__ "[Es] Over the last two decades of working, one of the things I've discovered is often things are made to fill v"| __truncated__ "[director] Is this going to be a slapstick comedy? Is it an action film? You know, let's have fun with it. -Yea"| __truncated__ ...
- attr(*, ".internal.selfref")=<externalptr>

The next step was to break the script into words. The tidytext package helps us do just that (and more):

> library(tidytext)
> words <- unnest_tokens(d, word, script)
> head(words)
episode word
1: 1 director
2: 1 part
3: 1 of
4: 1 this
5: 1 is
6: 1 i'm

Let’s get a count of each word:

> library(dplyr)
> words %>% group_by(word) %>% summarise(count=n()) %>% arrange(desc(count))
# A tibble: 5,047 x 2
word count
<chr> <int>
1 the 1863
2 and 1262
3 a 1217
4 to 1170
5 i 1135
6 of 985
7 that 878
8 it 796
9 in 681
10 you 681
# ... with 5,037 more rows

So as you'd expect, the most frequent words in the script are stopwords. Removing them is pretty simple with an anti_join against tidytext's stop_words dataset:

> words <- words %>% anti_join(stop_words)
> words %>% group_by(word) %>% summarise(count=n()) %>% arrange(desc(count))
# A tibble: 4,496 x 2
word count
<chr> <int>
1 people 157
2 design 123
3 time 110
4 music 107
5 playing 97
6 car 66
7 yeah 66
8 feel 56
9 idea 50
10 world 50
# ... with 4,486 more rows

Now the thing that caught my attention was that "music" was the 4th most frequent word. As far as I recalled, with the exception of the 3rd episode featuring Es Devlin, there wasn't a lot of talk about music in this series. When I opened the tab-delimited file I had generated from the TTML files in a text editor and searched for "music", the reason for its high frequency became clear: the subtitles include a lot of references to music playing in the background. For example, all of the following phrases appear in the first episode alone:

[electronic music continues playing], [electronic music ends], [ominous music playing], [calming music playing], [upbeat piano music], [jazz music playing], [electronic music playing], [electronic music continues], [electronic music continues], [chime music playing], [calming music playing], [calming music continues], [chime music playing], [chime music continues], [upbeat music playing], [instrumental music playing]

I used a regular expression to remove these phrases (a bracketed phrase containing "music" with up to three words on either side) from the script, and repeated the steps above to get the words and their counts:

> d$script <- gsub('\\[(?:\\w+\\s){0,3}music(?:\\s\\w+){0,3}\\]', "", d$script, perl=T)
> words <- unnest_tokens(d, word, script)
> words <- words %>% anti_join(stop_words)
> words %>% group_by(word) %>% summarise(count=n()) %>% arrange(desc(count))
# A tibble: 4,485 x 2
word count
<chr> <int>
1 people 157
2 design 123
3 time 110
4 car 66
5 yeah 66
6 feel 56
7 idea 50
8 world 50
9 ralph 49
10 love 48
# ... with 4,475 more rows

Much better. Once we have the words and their counts, making a word cloud is easy:

> library(wordcloud)
> words %>% count(word) %>% with(wordcloud(word, n, max.words=50, min.freq=5, colors='purple4', random.order=F))

It’s equally simple to make one word cloud per episode:

> par(mfrow=c(2,4))
> colors <- c("blue","green4","green","gold","orange","orange3","red","red3")
> for (i in 1:8) {
    words %>% filter(episode == i) %>% count(word) %>% with(wordcloud(word, n, max.words=50, min.freq=5, colors=colors[i], random.order=F))
}

There is a lot more that could be done here. For example, notice that in the word cloud for Tinker Hatfield, both "shoe" and "shoes" appear; we could address that by singularising the words before plotting the word cloud.
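One rough way to do that is stemming, which maps "shoes" and "shoe" to the same token, at the cost of occasionally mangling other words. A minimal sketch using the SnowballC package (a stand-in for proper singularisation, not something used in the analysis above):

> library(SnowballC)
> # stem each token before counting, so "shoe" and "shoes" are conflated
> words %>% mutate(word = wordStem(word, language = "english")) %>% count(word) %>% with(wordcloud(word, n, max.words=50, min.freq=5, colors='purple4', random.order=F))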

If you are interested in analysing text with R, I highly recommend the book Text Mining with R: A Tidy Approach.

Update: I figured out why read.csv was skipping the first five rows: the numerous quotation marks inside the file were interacting badly with its default quoting behaviour. So if you'd rather use read.csv than fread, set the quote parameter to an empty string:

d <- read.csv("~/proj/abstract.tsv", sep="\t", quote="", as.is=T)