Text Analytics on ‘Friends’ TV Series — 10 Seasons

Apoorva Mishra · Published in The Startup · 5 min read · Nov 24, 2020

If you are a fan of the famous TV show ‘Friends’, you must have found yourself arguing regularly: Who is the most important character? Who is the most complex? Which characters were close? Who was the most positive or negative character?

I am sure we all have opinions, but the only defensible answers come from data. So I decided to apply the text analytics and web scraping principles I learned during my Master’s at UCLA Anderson School of Management to analyze my favorite TV series: 58,251 lines of dialogue across 228 episodes and 10 seasons.

I found a website that contains the dialogue scripts for all 10 seasons of the series.

Now I will walk you through the details of the project.

In terms of technical tools, I used a Python 3.7 environment in Jupyter notebooks (the GitHub link to the notebook is shared at the end of this article). This article covers the overall steps at a broad level and focuses mainly on the process and the results. Feel free to check out the GitHub link to follow the Python code in detail.

Overall, there were three key steps in the process:

1. Scraping the data from the website
2. Latent semantic and textual analysis
3. Extracting the key insights

Let's talk about each of the steps one by one.

1. Scraping the data from the website

I used the ‘requests’ and ‘BeautifulSoup’ libraries in Python to extract the data from the website and store it in a ‘pandas’ dataframe in a structured manner. At the end of this exercise, my data looked like this:

Figure 1: Initial dataframe
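For illustration, a minimal sketch of this kind of scraping step is below. The URL and the page structure (one `<p>` tag per “Character: dialogue” line) are assumptions made for the example, not the actual markup of the site I scraped.

```python
# A minimal sketch of the scraping step, assuming a hypothetical script page
# where each dialogue appears in a <p> tag of the form "Character: dialogue".
import requests
from bs4 import BeautifulSoup
import pandas as pd

def scrape_episode(url, season, episode):
    """Scrape one episode page into (season, episode, character, dialogue) rows."""
    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    rows = []
    for p in soup.find_all("p"):
        text = p.get_text(strip=True)
        # Keep only lines that look like "Character: dialogue"
        if ":" in text:
            character, dialogue = text.split(":", 1)
            rows.append({
                "season": season,
                "episode": episode,
                "character": character.strip(),
                "dialogue": dialogue.strip(),
            })
    return rows

# Example usage (placeholder URL, not the actual site used in the project):
# all_rows = scrape_episode("https://example.com/friends/season-1/episode-1", 1, 1)
# df = pd.DataFrame(all_rows)
```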

2. Latent semantic and textual analysis

In this step, I used the ‘gensim’, ‘TextBlob’, and ‘nltk’ libraries to stem and lemmatize the data, and performed lexical and semantic analysis to extract meaning from the dialogues.

Overall, this step included filtering the dialogues by the six main characters, a topic modeling exercise to understand character complexity, an affinity analysis to understand each character’s relationship with the others, a dialogue frequency analysis, and a sentiment analysis.

Those interested in the concepts behind latent semantic analysis, and topic modeling in particular, can take a look at this excellent Datacamp tutorial.

After filtering the dialogues by character and pre-processing them (lemmatization and stop-word removal), I used the ‘gensim’ library to identify the number of topics for each character. A coherence score analysis determined the ideal number of topics per character, which I call that character’s thought diversity score. The result is somewhat unexpected (more on this in the insights section):

Figure 2: Thought diversity score
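For those curious how such a search might look in code, here is a minimal sketch of picking a topic count by coherence score with ‘gensim’ and ‘nltk’. The column names, preprocessing choices, and candidate topic range are my assumptions, not necessarily those of the original notebook (and the nltk corpora need to be downloaded first).

```python
# Sketch: find the topic count with the best coherence score for one character.
# Assumes a dataframe `df` with 'character' and 'dialogue' columns (names assumed)
# and that nltk's punkt, stopwords, and wordnet data have been downloaded.
from gensim import corpora
from gensim.models import LdaModel, CoherenceModel
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def preprocess(dialogue):
    """Tokenize, lowercase, drop stop words, and lemmatize one dialogue line."""
    tokens = word_tokenize(dialogue.lower())
    return [lemmatizer.lemmatize(t) for t in tokens
            if t.isalpha() and t not in stop_words]

def best_num_topics(docs, candidate_range=range(2, 11)):
    """Return the number of topics with the highest c_v coherence score."""
    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]
    scores = {}
    for k in candidate_range:
        lda = LdaModel(corpus=corpus, id2word=dictionary,
                       num_topics=k, random_state=42)
        coherence = CoherenceModel(model=lda, texts=docs,
                                   dictionary=dictionary, coherence="c_v")
        scores[k] = coherence.get_coherence()
    return max(scores, key=scores.get)

# Example: thought diversity score for one character (column names are assumptions)
# rachel_docs = [preprocess(d) for d in df.loc[df["character"] == "Rachel", "dialogue"]]
# print(best_num_topics(rachel_docs))
```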

Using the same ‘gensim’ library, I extracted the topics for each character and mapped how the characters are associated with one another. In short, I calculated how frequently each character appears in the other characters’ topics. I then prepared a heatmap of this affinity analysis using the ‘seaborn’ library, which looked as follows:

Figure 3: Affinity analysis
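As a rough illustration of one way to compute such an affinity score, the sketch below counts how often each character’s name appears among the top words of another character’s topics and plots the result with ‘seaborn’. The helper names and the exact counting scheme are assumptions, not necessarily the ones used in the original notebook.

```python
# Sketch of the affinity heatmap. Assumes one fitted gensim LdaModel per
# character and that affinity is measured by name mentions in topic words.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

CHARACTERS = ["Rachel", "Ross", "Monica", "Chandler", "Joey", "Phoebe"]

def name_mentions_in_topics(lda_model, num_topics, topn=20):
    """Count how often each character's name shows up in a model's top topic words."""
    counts = {name: 0 for name in CHARACTERS}
    for topic_id in range(num_topics):
        for word, _ in lda_model.show_topic(topic_id, topn=topn):
            for name in CHARACTERS:
                if word == name.lower():
                    counts[name] += 1
    return counts

# Assuming `affinity` maps each character to the counts returned above:
# matrix = pd.DataFrame(affinity).T.loc[CHARACTERS, CHARACTERS]
# sns.heatmap(matrix, annot=True, cmap="Blues")
# plt.title("Character affinity based on topic co-occurrence")
# plt.show()
```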

The next analysis was straightforward: I calculated the dialogue frequency for each character, which looked like this:

Figure 4: Dialogue Frequency
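This is essentially a value count on the character column. A minimal sketch (column name assumed) might look like this:

```python
# Minimal sketch of the dialogue frequency count, assuming the dataframe `df`
# with a 'character' column built in the scraping step (column name assumed).
import matplotlib.pyplot as plt

def plot_dialogue_frequency(df):
    """Plot how many dialogue lines each character has."""
    counts = df["character"].value_counts()
    counts.plot(kind="bar", title="Dialogues per character")
    plt.ylabel("Number of dialogues")
    plt.show()
    return counts
```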

The last step in the analysis was sentiment analysis. For each character, I determined the percentages of negative, positive, and neutral dialogues using the ‘TextBlob’ library. The result looked like this:

Figure 5: Sentiment analysis
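A sketch of how such a split can be computed from ‘TextBlob’ polarity scores follows; the polarity thresholds and column names are assumptions rather than the exact values used in the notebook.

```python
# Sketch of the sentiment split per character using TextBlob polarity scores.
# The +/- 0.05 neutral band and the column names are assumptions.
from textblob import TextBlob

def sentiment_label(dialogue, threshold=0.05):
    """Label a dialogue as positive, negative, or neutral based on polarity."""
    polarity = TextBlob(dialogue).sentiment.polarity
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"

def sentiment_split(df):
    """Percentage of positive/negative/neutral dialogues for each character."""
    labels = df["dialogue"].apply(sentiment_label)
    return (
        df.assign(sentiment=labels)
          .groupby("character")["sentiment"]
          .value_counts(normalize=True)
          .mul(100)
          .round(1)
    )
```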

Now, this brings us to our last step.

3. Key Insights

1. Using the thought diversity score, we can clearly say that ‘Rachel’ was the most complex character. Hard work indeed for Jennifer Aniston.
2. The affinity map clearly shows that Ross and Rachel had strong chemistry and were top of mind for each other, which was evident throughout the show.
3. Dialogue frequency shows Joey as the dominant character. This aligns well with the character’s popularity: a spin-off series focused solely on ‘Joey’ was launched right after ‘Friends’ ended (though it didn’t do well, which is probably scope for another project).
4. ‘Phoebe’ was the most positive character in the show, while ‘Monica’ was the most negative. This matches the views of my friends who have watched the show multiple times and are huge fans.

Overall, it was a fun project where I applied both web scraping and complex text analytics skills to understand my favorite TV series even better.

Moreover, I believe this is a pertinent skill set not just for data scientists and analysts; product and marketing managers can also leverage it to quickly grasp the voice of the customer and act on the insights.

We live in the information age, where it is hard to keep track of customers’ and users’ voices and opinions across multiple channels and at huge volumes. Every product-led company, whether B2B or B2C, tech or non-tech, needs to capture user voices and analyze them to generate actionable insights. This is where web scraping and lexical text analytics become useful.

For my geeky friends who want to explore the code, feel free to access my Jupyter notebook for this project here, and if you have any questions or feedback, feel free to get in touch with me here.
