Use BeautifulSoup and WordCloud to visualize the most frequent words in Firefly

Michael (Misha) Smirnov
2 min read · Feb 22, 2022

Awkward how most of the speaking parts are given to white males

I love Firefly. Does it hold up? I don’t know. But its script is published online, so we’ll use it as an example of how to display text data graphically with WordCloud.

First, we need to download and parse the HTML of the script page. The script is located here: https://firefly2002.weebly.com/scripts.html, and when we look closer at the HTML, all the text sits inside div elements with the “paragraph” class:

We’ll filter by the “paragraph” class

So, let’s import requests and BeautifulSoup and pull out those paragraphs. We’re going to strip out all the extra tags and replace them with spaces.

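import requests
from bs4 import BeautifulSoup

url = 'https://firefly2002.weebly.com/scripts.html'
html_text = requests.get(url).text
soup = BeautifulSoup(html_text, 'html.parser')
# Keep only the text of each "paragraph" div; nested tags are replaced by spaces
paragraphs = [para.get_text(strip=True, separator=" ") for para in soup.find_all('div', {"class": "paragraph"})]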
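To see what that get_text call does without hitting the network, here is a minimal sketch on a made-up HTML snippet. The sample markup and the extract_paragraphs helper are assumptions for illustration only; the real page's markup may differ.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; not copied from the real script page.
SAMPLE_HTML = """
<div class="paragraph"><strong>MAL:</strong> Keep<br/>flying.</div>
<div class="footer">Not part of the script</div>
<div class="paragraph">ZOE: Yes, sir.</div>
"""

def extract_paragraphs(html):
    """Return the text of every div with class "paragraph" (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        div.get_text(strip=True, separator=" ")
        for div in soup.find_all("div", {"class": "paragraph"})
    ]

for text in extract_paragraphs(SAMPLE_HTML):
    print(text)
```

The footer div is skipped because it lacks the “paragraph” class, and the strong and br tags inside the first div turn into plain spaces between the words.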

Running this against the script page gives us a list of strings, one for each paragraph div. We’re just looking at word frequency, so we don’t need anything else. Now, we’ll import and download the right packages, and cut out stopwords.

from wordcloud import WordCloud
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import matplotlib.pyplot as plt

nltk.download('stopwords')
nltk.download('punkt')  # newer NLTK releases may also need nltk.download('punkt_tab')

stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(" ".join(paragraphs))
# The stopword list is lowercase, so compare lowercased tokens to catch "The", "A", etc.
filtered_words = [w for w in word_tokens if w.lower() not in stop_words]
w1_space_split = " ".join(filtered_words)
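One detail worth knowing: NLTK’s English stopword list is all lowercase, so a capitalized “The” at the start of a line only gets caught if you lowercase the token before checking. Here is a rough standard-library sketch of the same filtering idea. The tiny STOP_WORDS set, the regex tokenizer, and the remove_stop_words helper are stand-ins I made up for illustration, not NLTK’s actual list or tokenizer.

```python
import re

# Hypothetical mini stopword list; NLTK's real English list is much longer.
STOP_WORDS = {"the", "a", "we", "have", "is", "to"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Split text into words and drop stopwords, ignoring case."""
    # Simple regex tokenizer (an assumption; word_tokenize behaves differently).
    tokens = re.findall(r"[A-Za-z']+", text)
    return [t for t in tokens if t.lower() not in stop_words]

line = "The ship is a Firefly. We have to keep flying."
print(remove_stop_words(line))
```

The original capitalization of the surviving words is kept, so names still look like names in the final picture.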

Now we have one long string without words like “the” and “a”. All that’s left is to create and format our word cloud, and plot it!

my_wordcloud = WordCloud(width=1200, height=1200, background_color='white', min_font_size=10).generate(w1_space_split)
plt.figure(figsize=(8,8), facecolor=None)
plt.imshow(my_wordcloud)
plt.axis('off')
plt.show()
Easy Peasy!
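The bigger a word is in the cloud, the more often it shows up in the text. If you want the actual numbers behind the picture, a quick sketch with collections.Counter does the job. The top_words helper and the sample list are made up for illustration.

```python
from collections import Counter

def top_words(words, n=2):
    """Return the n most common words, counted case-insensitively."""
    counts = Counter(w.lower() for w in words)
    return counts.most_common(n)

# Made-up sample; in the article's code you would pass filtered_words instead.
sample = ["ship", "Mal", "ship", "Zoe", "mal", "ship", "Jayne"]
print(top_words(sample))  # [('ship', 3), ('mal', 2)]
```

These counts may not match the cloud exactly, since WordCloud does its own tokenizing and stopword filtering when it generates the image from a string.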

And that’s it! Follow me on Twitter @SaladZombie, and here’s the whole code for you to run in Colab:

Michael (Misha) Smirnov

Data Scientist at Amazon. PhD in Neuroscience. Coder, creator, woodworker, all around cool guy who likes high fives.