Yarn, Digging Around
In the last post I looked into aggregated metrics and frequently used words in comparison to other videos. In this post we’ll continue discussing frequently used words as collections. To start, I took the 10 most common terms from each video and then compared them against one another using a frequent item set mining algorithm (RELIM).
This generates a plethora of combinations. Each set is a group of terms shared by a matching set of videos, followed by the number of videos that contain all of those terms. So, a matching set that contains:
([‘right’ ‘okay’ ‘like’ ‘just’ ‘oh’ ‘get’ ‘hey’ ‘yeah’]):3
is matched to only three videos in our system. Since we started with just the top 10 terms for each video, this means we’re matching 8 of the 10 terms (80%), which is fab. It turns out the three videos are all from the TV series Master of None. This makes sense, because for those terms to appear repeatedly requires similar dialogue, consistent writing, or predictable actor improvisation.
Video 1: Master of None, S01E07, Ladies and Gentlemen
The top 10 words and their counts for this video are: “just”: 36, “like”: 31, “oh”: 31, “yeah”: 31, “right”: 30, “got”: 29, “man”: 25, “get”: 24, “hey”: 23, “okay”: 20
Video 2: Master of None, S01E01, Plan B
“yeah”: 53, “right”: 43, “just”: 42, “go”: 38, “like”: 36, “get”: 34, “okay”: 33, “hey”: 30, “know”: 26, “oh”: 26
Video 3: Master of None, S01E05, The Other Man
“like”: 41, “just”: 36, “yeah”: 32, “right”: 32, “know”: 30, “oh”: 28, “hey”: 24, “get”: 21, “okay”: 21, “good”: 20
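To make the 80% figure concrete, here is a minimal sketch that intersects the three top-10 lists above. It is not RELIM, just a plain set intersection on the terms quoted in this post, and the episode keys and the `shared_terms` helper are names I made up for illustration.

```python
from itertools import combinations

# Top-10 terms per video, copied from the lists above (counts dropped).
# The dictionary keys are illustrative labels, not identifiers from any system.
top_terms = {
    "S01E07": {"just", "like", "oh", "yeah", "right", "got", "man", "get", "hey", "okay"},
    "S01E01": {"yeah", "right", "just", "go", "like", "get", "okay", "hey", "know", "oh"},
    "S01E05": {"like", "just", "yeah", "right", "know", "oh", "hey", "get", "okay", "good"},
}


def shared_terms(videos):
    """Return the terms that appear in the top-10 list of every given video."""
    return set.intersection(*(top_terms[v] for v in videos))


# Terms common to all three episodes, and the fraction of a top-10 list they cover.
common = shared_terms(list(top_terms))
print(sorted(common), len(common) / 10)

# Pairwise overlap, to see which two episodes are closest to each other.
for a, b in combinations(top_terms, 2):
    print(a, b, len(shared_terms([a, b])))
```

A real frequent item set miner does the same kind of counting across every candidate combination of terms and videos, but prunes the search so it stays tractable on a large collection.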
Another way to look at this data is to graph it. Below we graph the nodes and edges for the three most frequent terms of each video, using a force-directed layout algorithm to position the nodes. You can start to see some basic relationships between the videos. For example, nodes with lots of connections are drawn larger.
The most prominent nodes in our network collect near the center and are: “know” with 99 connections, “oh” with 81, “get” with 51, and “yeah” with 46.
If you pick out a show like Fargo (S02E01), you’ll see its three most popular terms are: “know”, “yeah”, and “ok”. These terms are also very popular in the graph.
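The connection counts are just node degrees in a video-to-term graph. As a rough sketch, assuming the top three terms for the Master of None episodes are the first three entries of the lists earlier in this post (Video 1 has a three-way tie at 31, so that cut is arbitrary), the counting looks like this. The video labels and variable names are illustrative only.

```python
from collections import Counter

# Top three terms per video. The Fargo terms are quoted in the text; the
# Master of None terms are ASSUMED to be the first three of each top-10 list.
top_three = {
    "Master of None S01E07": ["just", "like", "oh"],
    "Master of None S01E01": ["yeah", "right", "just"],
    "Master of None S01E05": ["like", "just", "yeah"],
    "Fargo S02E01": ["know", "yeah", "ok"],
}

# One edge per (video, term) pair.
edges = [(video, term) for video, terms in top_three.items() for term in terms]

# A term's degree is the number of videos that list it in their top three.
degree = Counter(term for _, term in edges)

for term, connections in degree.most_common():
    print(term, connections)
```

With only four videos the numbers are tiny, but the same count over the whole collection is what makes terms like “know” and “oh” the biggest nodes.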
As you pan away towards the outer edge of the graph, you start to see weakly connected components. The term frequency drops off quickly, and this is reflected in the graph as orphaned islands. You can see that Star Trek (S01E05) and The Powerpuff Girls share “professor” as a term, so they are linked, but other than that they are on their own.
You’ll also see totally isolated islands like Pee Wee’s Big Adventure, with lone nodes of “pee”, “wee”, and “bike”. This is not to say other movies don’t have those terms; it’s just that they are not in the top three most frequent terms of any other video.
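Those islands are the connected components of the graph. Here is a minimal sketch of finding them with a depth-first traversal, using only the terms mentioned in this post. The Star Trek and Powerpuff Girls entries carry just “professor” because their other top terms aren’t listed here, the Master of None terms are assumed to be the first three of its top-10 list, and the `components` helper is a hypothetical name.

```python
from collections import defaultdict

# Video -> top terms. Incomplete on purpose: only terms named in the post.
top_terms = {
    "Fargo S02E01": ["know", "yeah", "ok"],
    "Master of None S01E01": ["yeah", "right", "just"],  # assumed top three
    "Star Trek S01E05": ["professor"],      # other two terms not given
    "The Powerpuff Girls": ["professor"],   # other two terms not given
    "Pee Wee's Big Adventure": ["pee", "wee", "bike"],
}

# Undirected adjacency list. Nodes are tagged so a video and a term
# with the same name could never collide.
adjacency = defaultdict(set)
for video, terms in top_terms.items():
    for term in terms:
        adjacency[("video", video)].add(("term", term))
        adjacency[("term", term)].add(("video", video))


def components(graph):
    """Group nodes into connected components with an iterative depth-first search."""
    seen = set()
    result = []
    for start in list(graph):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            group.add(node)
            stack.extend(graph[node] - seen)
        result.append(group)
    return result


islands = components(adjacency)
for group in islands:
    print(sorted(name for kind, name in group if kind == "video"))
```

Fargo and Master of None end up on the same island through “yeah”, the two “professor” videos form a second one, and Pee Wee sits alone on a third.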
It could be a reflection of my architectural background, but I’m a sucker for information transformation and visual representation of data. I also love the continuum of scale, from all of the aggregated data down to a simple single Yarn.
In a future post, I’ll explore a more comprehensive way of comparing videos and their transcripts. Stay tuned.