Yarn, Digging Around

Jeffrey Krause
The Data Experience
4 min read · Mar 18, 2016


In the last post I looked into aggregated metrics and frequently used words in comparison to other videos. In this post we’ll continue discussing frequently used words as collections. To start, I took the 10 most common terms from each video and then compared them against one another using a frequent item set mining algorithm (RELIM).

This generates a plethora of combinations. Each result pairs a set of terms with the number of videos whose top 10 lists contain every term in that set. So, a matching set that contains:

([‘right’ ‘okay’ ‘like’ ‘just’ ‘oh’ ‘get’ ‘hey’ ‘yeah’]):3

is matched to only three videos in our system. Since we started with just the top 10 terms for each video, this means the three videos share 8 of their 10 terms (80%), which is fab. It turns out the three videos are from the TV series Master of None. This makes sense, because for those terms to appear repeatedly requires similar dialogue, consistent writing, or predictable actor improvisation.
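To make the idea of a term set with a video count concrete, here is a minimal sketch. It is not RELIM; it is a brute-force stand-in that counts every term combination, which is only practical because each video contributes just a handful of terms. The function name `frequent_itemsets` and the toy `videos` data are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support=2, min_size=2):
    """Brute-force frequent item set mining (a stand-in, not RELIM).

    Counts every combination of at least min_size terms and keeps the
    combinations found in at least min_support transactions.
    """
    counts = Counter()
    for terms in transactions:
        unique = sorted(set(terms))
        for size in range(min_size, len(unique) + 1):
            for combo in combinations(unique, size):
                counts[combo] += 1
    return {combo: n for combo, n in counts.items() if n >= min_support}


# Hypothetical toy data: each list stands in for one video's top terms.
videos = [
    ["yeah", "okay", "hey", "man"],
    ["yeah", "okay", "hey", "go"],
    ["yeah", "okay", "know", "good"],
]

result = frequent_itemsets(videos, min_support=2)

# Print the largest sets first, in a shape similar to the output shown above.
for combo, support in sorted(result.items(), key=lambda kv: (-len(kv[0]), kv[0])):
    print(f"({list(combo)}):{support}")
```

With 10 terms per video this enumerates roughly a thousand combinations per video, which is fine for a sketch. A dedicated algorithm such as RELIM avoids that blow-up on larger inputs.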

Video 1: Master of None, S01E07, Ladies and Gentlemen

https://getyarn.io/yarn-clip/e526ea87-91c8-4967-ae0c-02e704358863

The top 10 words and their counts for this video are: “just”: 36, “like”: 31, “oh”: 31, “yeah”: 31, “right”: 30, “got”: 29, “man”: 25, “get”: 24, “hey”: 23, “okay”: 20

Video 2: Master of None, S01E01, Plan B

https://getyarn.io/yarn-clip/063242f2-e9c9-487e-ac46-a8a0365cbd72

“yeah”: 53, “right”: 43, “just”: 42, “go”: 38, “like”: 36, “get”: 34, “okay”: 33, “hey”: 30, “know”: 26, “oh”: 26

Video 3: Master of None, S01E05, The Other Man

https://getyarn.io/yarn-clip/9bfd14b8-7e70-4f0b-ba97-f1792a2855aa

“like”: 41, “just”: 36, “yeah”: 32, “right”: 32, “know”: 30, “oh”: 28, “hey”: 24, “get”: 21, “okay”: 21, “good”: 20
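As a quick check of the 80% figure, the three top 10 lists above can be intersected directly. This is only a sketch using the counts quoted in this post; the variable names are mine.

```python
# Top 10 terms and counts for the three Master of None episodes, as listed above.
top_terms = {
    "S01E07 Ladies and Gentlemen": {
        "just": 36, "like": 31, "oh": 31, "yeah": 31, "right": 30,
        "got": 29, "man": 25, "get": 24, "hey": 23, "okay": 20,
    },
    "S01E01 Plan B": {
        "yeah": 53, "right": 43, "just": 42, "go": 38, "like": 36,
        "get": 34, "okay": 33, "hey": 30, "know": 26, "oh": 26,
    },
    "S01E05 The Other Man": {
        "like": 41, "just": 36, "yeah": 32, "right": 32, "know": 30,
        "oh": 28, "hey": 24, "get": 21, "okay": 21, "good": 20,
    },
}

# Terms that appear in every episode's top 10 list.
shared = set.intersection(*(set(terms) for terms in top_terms.values()))

# Share of each 10-term list covered by the common terms.
overlap = len(shared) / 10

print(sorted(shared))
print(f"{overlap:.0%}")
```

The eight shared terms are the same eight in the matching set, and each episode contributes two terms of its own ("got" and "man", "go" and "know", "know" and "good").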

Another way to look at this data is to graph it. Below we graph the nodes and edges for the three most frequent terms of each video. You can start to see some basic relationships between the videos. For example, nodes with more connections are drawn larger. In this diagram we’re using a force-directed layout algorithm to position the nodes.

The most prominent nodes in our network collect near the center and are: “know” with 99 connections, “oh” with 81, “get” with 51, and “yeah” with 46.

If you pick out a show like Fargo (S02E01), you’ll see its three most popular terms are: “know”, “yeah”, and “ok”. These are also very popular terms in the graph.

https://getyarn.io/yarn-clip/7559e6d2-f4b6-4643-b58f-f1222eee3fea
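The connection counts can be sketched with a simple edge list. I am assuming here that the graph links each video to its top three terms, so a term's connection count is the number of videos it is attached to. The data covers only the four videos named in this post, and for S01E07, where three terms tie at 31, I take the first three in the order listed above.

```python
from collections import Counter

# Top three terms per video, taken from the lists in this post.
top_three = {
    "Master of None S01E07": ["just", "like", "oh"],
    "Master of None S01E01": ["yeah", "right", "just"],
    "Master of None S01E05": ["like", "just", "yeah"],
    "Fargo S02E01": ["know", "yeah", "ok"],
}

# One edge per (video, term) pair.
edges = [(video, term) for video, terms in top_three.items() for term in terms]

# A term's degree is the number of videos connected to it.
degree = Counter(term for _, term in edges)

for term, connections in degree.most_common():
    print(term, connections)
```

Even in this tiny sample, "just" and "yeah" already collect the most connections, which is the same effect that pulls "know" and "oh" toward the center of the full graph.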

As you pan away towards the outer edge of the graph you start to see weakly connected components. The term frequency drops off quickly and this is reflected in the graph as orphaned islands. You can see that Star Trek (S01E05) and The Powerpuff Girls share “professor” as a term, so they are linked, but other than that they are on their own.

https://getyarn.io/yarn-clip/3c289bf3-fe20-4cae-8f8c-b8359d43606f
https://getyarn.io/yarn-clip/61d74453-0cb1-4d40-b1d8-03f41f7719cc

You’ll also see totally isolated islands like Pee Wee’s Big Adventure, with lone nodes of “pee”, “wee”, and “bike”. This is not to say other movies don’t have those terms; it’s just that they are not among the three most frequent terms of any other video.

https://getyarn.io/yarn-clip/4b97394a-ea96-4ea2-9d97-a245365fbc30
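The islands can be found by walking the graph and grouping everything reachable from each node. Below is a minimal sketch, again assuming a video-to-term graph. Only "professor", "pee", "wee", and "bike" come from this post; the `placeholder_*` terms are invented stand-ins for the terms I have not listed.

```python
from collections import defaultdict

# Only "professor", "pee", "wee", and "bike" come from the post.
# The placeholder_* entries are hypothetical stand-ins.
top_three = {
    "Star Trek S01E05": ["professor", "placeholder_a", "placeholder_b"],
    "The Powerpuff Girls": ["professor", "placeholder_c", "placeholder_d"],
    "Pee Wee's Big Adventure": ["pee", "wee", "bike"],
}

# Undirected adjacency between video nodes and term nodes.
adjacency = defaultdict(set)
for video, terms in top_three.items():
    for term in terms:
        adjacency[("video", video)].add(("term", term))
        adjacency[("term", term)].add(("video", video))


def components(adjacency):
    """Group nodes into connected components with a depth-first walk."""
    seen, result = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            group.add(node)
            stack.extend(adjacency[node] - seen)
        result.append(group)
    return result


# Keep only the video names in each island.
islands = [
    sorted(name for kind, name in group if kind == "video")
    for group in components(adjacency)
]
print(islands)
```

Star Trek and The Powerpuff Girls land in one island because of the shared "professor" node, while Pee Wee's Big Adventure sits alone.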

It may be my architectural background showing, but I’m a sucker for information transformation and visual representation of data. I also love the continuum of scale, from all of the aggregated data down to a single Yarn.

https://getyarn.io/yarn-clip/c34d70e1-fd06-4218-a866-b83e9dc72d28

In a future post, I’ll explore a more comprehensive way of comparing videos and their transcripts. Stay tuned.

If you enjoyed this post, please click that little green heart below. Or send a response, [ Typing sounds ].

Written by Jeffrey Krause

Currently making Yarn: https://yarn.co/, deep search in video, entrepreneur, digital product design, developer, generative design