5 Minute EDA: Word Cloud of Successful and Unsuccessful Shark Tank Companies

Aya Spencer
Published in 5 Minute EDA
4 min read · Feb 19, 2022

I used word clouds to find the terms associated with successful (and unsuccessful) Shark Tank companies across six seasons.

Photo by C Dustin on Unsplash

What is a word cloud?

A word cloud is sometimes also called a “tag cloud.” According to Wikipedia, a word cloud is a

visual representation of text data, which is often used to depict keyword metadata on websites, or to visualize free form text. Tags are usually single words, and the importance of each tag is shown with font size or color. When used as website navigation aids, the terms are hyperlinked to items associated with the tag.

In other words, it’s a way to show how often each keyword appears in a text string or a collection of strings. Word clouds are useful for figuring out which terms are used the most in a body of text. That information can help measure things like community engagement and surface emerging conversations.
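To make the idea concrete, here is a minimal sketch of the counting that sits underneath a word cloud: the more often a word appears, the larger it gets drawn. The `word_frequencies` helper and the sample text are made up for illustration, and the wordcloud library does its own tokenizing, so treat this as the concept rather than the library's exact behavior.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count how often each lowercase word appears in a string."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

# Hypothetical product descriptions, made up for illustration
sample = "Organic dog treats. Gourmet dog food. A leash for every dog."
freq = word_frequencies(sample)

# The most frequent word is the one a word cloud would draw largest
print(freq.most_common(1))
```

In this sample, "dog" appears three times and everything else once, so it would dominate the picture.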

So let’s get started!

Preparing data

I searched Kaggle and found a dataset that includes every startup pitched on the TV show “Shark Tank” from seasons 1 through 6, along with whether or not each company scored a deal with the Sharks. A snapshot of the dataset looks like this:

Deciding which field to run the word cloud on

I’m going to use the “description” field for my word cloud. Notice that the description field includes some stop words, such as “for” and “with”, that I have to remove before generating the word cloud.

A note on stop words

Stop words are words that should be filtered out before doing any natural language processing. There is no universally agreed-upon list of stop words, but in general, words such as “like”, “as”, and “the” are all considered stop words. If you want to add your own words to the default stop word list that ships with the wordcloud library, you can do this:

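from wordcloud import STOPWORDS
# Start from the library's default stop words, then add custom ones
stopwords = set(STOPWORDS)
stopwords.update(["whereas", "because", "business", "startup"])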
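To see what stop word removal does to the counts, here is a small standard-library sketch. The tiny `custom_stopwords` set is a hand-picked stand-in for the wordcloud library's much longer `STOPWORDS` list, and the helper name and sample sentence are hypothetical.

```python
import re
from collections import Counter

# A tiny hand-picked stop word set, standing in for wordcloud's STOPWORDS
custom_stopwords = {"for", "with", "a", "the", "and"}

def count_without_stopwords(text, stopwords):
    """Count words in text, skipping anything in the stop word set."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stopwords)

# Hypothetical description text, made up for illustration
sample = "A cooler for the beach with a blender and speakers"
counts = count_without_stopwords(sample, custom_stopwords)

# Only the content words survive the filter
print(sorted(counts))
```

Without the filter, "a" would be the most frequent word in this sentence, which is exactly the kind of noise stop word lists are meant to remove.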

Generating word cloud

I decided to generate separate word clouds for the companies that scored a deal with the Sharks and those that didn’t, so that I could visualize the differences between the two buckets.

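# Boolean masks: one for companies that got a deal, one for those that didn't
deal = df['deal'] == True
nodeal = df['deal'] == False
# Use the masks to split the dataset into two DataFrames
df_deal = df[deal]
df_nodeal = df[nodeal]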

You can choose a background color for your word cloud. I decided to stick with a basic white background.

from wordcloud import WordCloud
# Join the chosen column's values into one long string per bucket
text = " ".join(review for review in df_deal.category.astype(str))
text_nodeal = " ".join(review for review in df_nodeal.category.astype(str))
wordcloud = WordCloud(stopwords=stopwords, background_color="white", width=800, height=400).generate(text)
wordcloud_nodeal = WordCloud(stopwords=stopwords, background_color="white", width=800, height=400).generate(text_nodeal)
Description word cloud of companies that scored a deal on Shark Tank (season 1–6)
Description word cloud of companies that did not score a deal on Shark Tank (season 1–6)

Make it pretty

Did you know that you can turn your word clouds into any shape by using a JPG image as a mask? I put each of my word clouds into a separate image frame. The cloud for companies that scored a deal went inside a “thumbs up” image, while the cloud for those that didn’t went inside a “thumbs down” image. Here’s an example of a thumbs-up JPEG image that I found online:

I then converted the JPEGs into NumPy arrays:

import numpy as np
from PIL import Image
thumbup = np.array(Image.open("thumbup.jpg"))
thumbdown = np.array(Image.open("thumbdown.jpg"))
example screenshot of array output
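The wordcloud library accepts an array like this through the `mask` argument of `WordCloud`, and it treats pure white pixels as off limits for drawing. JPEG compression can leave a background slightly off-white, so one optional cleanup step is to push near-white pixels to exactly 255 first. The sketch below is one way to do that; the `whiten_background` helper and the threshold of 240 are assumptions for illustration, not part of the original code.

```python
import numpy as np

def whiten_background(mask, threshold=240):
    """Return a copy of the mask with near-white pixels set to pure white (255)."""
    cleaned = mask.copy()
    if cleaned.ndim == 3:
        # Colour image: a pixel is background only if every channel is bright
        background = (cleaned >= threshold).all(axis=-1)
    else:
        # Grayscale image: compare the single channel directly
        background = cleaned >= threshold
    cleaned[background] = 255
    return cleaned

# Tiny made-up 2x2 RGB "image": one off-white pixel and three dark ones
demo = np.array([[[250, 248, 251], [10, 10, 10]],
                 [[0, 0, 0], [30, 40, 50]]], dtype=np.uint8)
cleaned = whiten_background(demo)
```

The cleaned array can then be passed along, for example `WordCloud(stopwords=stopwords, background_color="white", mask=whiten_background(thumbup)).generate(text)`, so that words are only placed on the non-white part of the image.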

Here are the results!

Description word cloud of companies that scored a deal on Shark Tank (season 1–6)
Description word cloud of companies that did not score a deal on Shark Tank (season 1–6)

Do you notice anything interesting? I thought it was interesting that specialty foods showed up large in both buckets, suggesting that this segment is probably a very saturated and competitive market.

Try it out for yourself with a dataset of your choice, and let me know what you find!

This is part of my 5-minute EDA series, where I run quick exploratory data analysis on an interesting dataset. Thanks for reading!
