Using Word Clouds to Represent Textual Data Trends

Yang Lu
INST414: Data Science Techniques
3 min read · Feb 10, 2022

I wanted to see what is trending in the news. This kind of data could help with financial decisions such as investing, or with general purposes such as hobbyists seeing what is trending in their field.

The data that could answer this question is news headlines. Generally, a good headline should be “clear and specific, telling the reader what the story is about” (SPC). Headlines are relevant because each one provides a short summary of its article, and since I am mostly looking for the keywords of a topic rather than the whole article, the headline is a good place to look for data.

The API I used, NewsAPI, has a Python library, which I imported. I also used the json library to parse the output of the API. I used the io library to import StringIO, because pandas's read_json wasn't reading the json.dumps output as a string. I used pandas to read the JSON as a data frame, which I then converted to a .csv file.
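The JSON-to-CSV step described above can be sketched roughly as follows. This is a minimal sketch, not the exact code I ran: the `response` dictionary is a hand-written stand-in for an API result (assuming the response carries a list under an "articles" key), and the headlines in it are placeholders, not real news.

```python
import json
from io import StringIO

import pandas as pd

# Hypothetical stand-in for an API response; the titles are placeholders.
response = {
    "status": "ok",
    "articles": [
        {"title": "Placeholder headline A", "description": "Placeholder text A"},
        {"title": "Placeholder headline B", "description": "Placeholder text B"},
    ],
}

# Serialize only the list of articles to a JSON string.
articles_json = json.dumps(response["articles"])

# Wrap the string in StringIO so pandas treats it as a file-like object.
df = pd.read_json(StringIO(articles_json))

# Convert the data frame to CSV text; passing a path such as
# "headlines.csv" (a hypothetical filename) would write a file instead.
csv_text = df.to_csv(index=False)
print(csv_text)
```

Wrapping the string in StringIO mirrors the workaround described above: pandas then reads the JSON the same way it would read an open file.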

For text data, a common analysis method is to make a word cloud. I used the wordcloud, matplotlib, and pandas libraries to read in the .csv file I had created and plot a word cloud from it. Since the topic of interest is stocks to invest in, I typed in “stocks to invest” as the topic, and it made this word cloud:
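A word cloud sizes each word by how often it appears, so the core of it is a word count. The sketch below shows only that counting step, using the standard library instead of the wordcloud package; the function name `word_frequencies` and the sample headlines are made up for illustration.

```python
import re
from collections import Counter


def word_frequencies(headlines):
    """Count how often each lowercase word appears across all headlines."""
    counts = Counter()
    for headline in headlines:
        # Keep runs of letters only; punctuation and digits are dropped.
        words = re.findall(r"[a-z]+", headline.lower())
        counts.update(words)
    return counts


# Placeholder headlines, not real news data.
headlines = [
    "Placeholder headline about EV stocks",
    "Another placeholder headline about stocks",
]

freqs = word_frequencies(headlines)
# The most frequent words are the ones a word cloud would draw largest.
print(freqs.most_common(3))
```

Counts like these are what decide which words dominate the picture, which is why frequent filler words take over so easily.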

It is not very useful. The main problem I encountered was stop words and other common words that add no value. I wrote a piece of code that got rid of the stop words in the word cloud, but there are still a lot of useless words left. One interesting thing is EV at the top left. The word cloud does not say whether EV is good or bad to invest in. This is most likely because the word cloud only shows what is going on in the news, which has something to do with EV:

that “li” tho

What I learned is that word clouds are not a good way to decide whether or not to invest in a topic. They do, however, provide a rough idea of what is being reported.

Some problems:

~pandas read_json not reading the json.dumps output as a string (used StringIO)

~useless words showing up in the word cloud (made a .txt file of words to blacklist and fed it to the word cloud maker for it to ignore)

~HTML elements in headlines showing up in the word cloud (same as above)

~figuring out that I needed json.dumps and pandas read_json, after hours of trying to format the API output correctly into a .csv
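The blacklist and HTML problems above can be handled before the words ever reach the word cloud. Here is a rough sketch using only the standard library; the helper names (`load_blacklist`, `clean_headline`, `filter_words`) and the sample inputs are hypothetical. Note that it strips the HTML tags directly, which is a different approach from adding them to the blacklist as I did.

```python
import html
import re


def load_blacklist(lines):
    """Build a set of lowercase words from an iterable of lines,
    for example an open .txt file with one word per line."""
    return {line.strip().lower() for line in lines if line.strip()}


def clean_headline(text):
    """Remove HTML tags, then decode entities such as &amp;."""
    without_tags = re.sub(r"<[^>]+>", " ", text)
    return html.unescape(without_tags)


def filter_words(headline, blacklist):
    """Return the lowercase words of a headline that are not blacklisted."""
    words = re.findall(r"[a-z]+", clean_headline(headline).lower())
    return [word for word in words if word not in blacklist]


# Placeholder blacklist and headline for illustration only.
blacklist = load_blacklist(["the", "to", "buy", ""])
kept = filter_words("<li>The EV stocks to buy</li>", blacklist)
print(kept)
```

Stripping tags this way would stop fragments like "li" from showing up as words, and the blacklist set could also be handed to the word cloud maker as its stop word list.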

The limitation of my approach is that it does not provide a definite answer to my question. I believe this is because news is a messy data source to parse. If I continue with this method, I could also try going through each article and making a word cloud from the combined full text, rather than just the headlines. Alternatively, I could have analyzed data from a stock API to find the top performers, and then used an ordinary least squares model to make predictions for those stocks.
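I did not build the ordinary least squares model, but for a single predictor the idea is small enough to sketch. The code below fits a straight line to a handful of points using the closed-form least squares formulas. The function name `fit_ols` is made up, and the numbers are synthetic values chosen to lie on a line, not real stock prices.

```python
def fit_ols(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Synthetic data: day index versus a made-up price.
days = [0, 1, 2, 3]
prices = [1.0, 3.0, 5.0, 7.0]

intercept, slope = fit_ols(days, prices)

# Extrapolate the fitted line one day ahead.
prediction = intercept + slope * 4
print(intercept, slope, prediction)
```

Real prices would not sit on a straight line like this, so a fit of this kind would only give a rough trend, and any prediction from it should be treated with caution.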
