Use BeautifulSoup and WordCloud to visualize the most frequent words in Firefly
--
I love Firefly. Does it hold up? I don’t know. But its script is published online, so we’ll use it as an example of displaying text data in graphical form with WordCloud.
First, we need to download and parse the HTML of the script page. The script is located here: https://firefly2002.weebly.com/scripts.html, and when we look closer at the HTML, all the text sits inside div elements with the “paragraph” class.
So, let’s import requests and BeautifulSoup and pull out the paragraphs. We’re going to strip out all the tags inside each paragraph and join the remaining pieces of text with spaces.
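
    import requests
    from bs4 import BeautifulSoup

    url = 'https://firefly2002.weebly.com/scripts.html'
    html_text = requests.get(url).text
    soup = BeautifulSoup(html_text, 'html.parser')
    # Keep only the text of each "paragraph" div, with spaces where the tags were
    paragraphs = [para.get_text(strip=True, separator=" ") for para in soup.find_all('div', {"class": "paragraph"})]
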
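To see why the separator matters, here is a small sketch on a made-up HTML snippet (the sample markup and the variable names are mine, not taken from the script page). Without a separator, text on either side of a tag gets glued together, which would create fake words in the cloud.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for the real page
sample_html = """
<div class="paragraph">First line<br/>second line</div>
<div class="other">navigation</div>
<div class="paragraph"><strong>NAME</strong> says hello</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
divs = soup.find_all("div", {"class": "paragraph"})

# Same call as in the article: strip each text fragment, join fragments with a space
with_sep = [d.get_text(strip=True, separator=" ") for d in divs]
# For comparison: no separator, so fragments run into each other
without_sep = [d.get_text(strip=True) for d in divs]

print(with_sep)     # ['First line second line', 'NAME says hello']
print(without_sep)  # ['First linesecond line', 'NAMEsays hello']
```

The div with the class “other” is skipped entirely, which is the same filtering we rely on for the real page.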
This gives us a list with the text of every paragraph div on the page. We’re just looking at word frequency, so we don’t need anything else. Now, we’ll import the right packages, download the NLTK data, and cut out stopwords.
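
    from wordcloud import WordCloud
    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize
    import matplotlib.pyplot as plt

    # Download the stopword list and the tokenizer data
    nltk.download('stopwords')
    nltk.download('punkt')
    nltk.download('punkt_tab')  # newer NLTK releases look for this name instead of 'punkt'

    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(" ".join(paragraphs))
    # NLTK's stopword list is lowercase, so compare lowercased tokens
    filtered_words = [w for w in word_tokens if w.lower() not in stop_words]
    w1_space_split = " ".join(filtered_words)
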
Now we have one long string without words like “the” and “a”. All that’s left is to create and format our word cloud, and plot it!

    my_wordcloud = WordCloud(width=1200, height=1200, background_color='white', min_font_size=10).generate(w1_space_split)

    plt.figure(figsize=(8, 8), facecolor=None)
    plt.imshow(my_wordcloud)
    plt.axis('off')
    plt.show()

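If you want the numbers behind the picture, a plain word count is a quick sanity check. This is only a sketch: the token list and stopword set below are made-up stand-ins for `word_tokens` and `stop_words`, and WordCloud does its own tokenizing internally, so its sizes may not match these counts exactly.

```python
from collections import Counter

# Made-up stand-ins for word_tokens and stop_words from the steps above
sample_tokens = ["The", "ship", "is", "the", "ship", ",", "captain", "Ship"]
sample_stop_words = {"the", "is"}

# Lowercase, drop punctuation tokens, drop stopwords
kept = [
    w.lower()
    for w in sample_tokens
    if w.isalpha() and w.lower() not in sample_stop_words
]

# The two most frequent remaining words
top_words = Counter(kept).most_common(2)
print(top_words)  # [('ship', 3), ('captain', 1)]
```

Swapping in the real `word_tokens` and `stop_words` should list the words you expect to see largest in the cloud.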
And that’s it! Follow me on Twitter @SaladZombie, and here’s the whole code for you to run in Colab: