Please, Not Another Word Cloud! How to Build Compelling Visualizations with NLP Models

Published in IBM Data Science in Practice · 6 min read · Mar 29, 2021

Graffiti on a brick wall saying “Everything has beauty but not everyone can see it” — Photo by Annie Spratt on Unsplash

Storytelling is a fundamental but often overlooked skill for a data scientist. We build models, but if we do not communicate their value to stakeholders, clients, and sometimes customers, we lessen both the impact and the value of what we create. In this post, I give a set of tips and ideas for visualizing natural language processing (NLP) models, in particular a keyword analysis model, a sentiment analysis model, and a named entity recognition (NER) model. These ideas can, however, be applied to any type of unstructured data. All of the models and visualizations in this post are available in this Jupyter Notebook.

The dreaded word cloud

Many clients, in my experience, prefer two things when hearing about data: 1) they generally prefer quantitative data over qualitative, and 2) they prefer a clean, crisp visual. These visuals really do shape the conversations you can have with a client.

These visuals are harder to build and demand more work with unstructured data. Say you are building a model to predict earnings at a business. You have data on past earnings from the company, the business forecast, and other discrete and quantitative data. So you build a line chart: this is the go-to graph for such a model, and no one questions it.

a person in front of a screen with a bar chart on it and holding a pencil and using their mobile phone — Photo by Chris Liverani on Unsplash

Now, say you are a pharmaceutical company and you collect Tweets about COVID-19 vaccines and vaccinations (this is a real-life data set that I pulled from Kaggle), and you want to understand what people are saying. How do you visualize this?

The go-to answer here is often a word cloud. As a linguist, I’m not particularly a fan of word clouds. While some can be beautiful (for example: this is a nice post with some really great ones), it is very hard to communicate exact meaning with them. Certain words or phrases can stand out, but much else stays unknown, such as how often each word actually occurs.

Fundamentals

After five years of building NLP models as a researcher and then four years at IBM as a data scientist, there are three things that I’ve learned are key to visualizing natural language data:

1. Really think about what insights your model provides. Try to generate an easy-to-understand metric when possible.

2. Think about how to connect the unstructured text data to structured data when possible.

3. Show what your model is doing in a creative way.

Insights and showing metrics

Number one: Really think about what insights your model provides. Try to generate an easy-to-understand metric, when possible.

Considering the previous section, think about the following question: if you had to choose between a word cloud and a bar chart, which would you choose? The answer is that you should choose the bar chart 99% of the time. The two images below convey exactly the same thing: the same top words (PfizerBioNTech, vaccine, PfizerBioNTech vaccine, Moderna, today). In the word cloud, you can likely gather that the larger words are more prominent, but it’s unclear what their frequencies are. Frequency is part of the model’s output, however, and the bar chart shows it very clearly. I used Plotly for this chart, which is a great Python library for interactive visualization.

a word cloud with the words “PfizerBioNTech”, “vaccine”, and “Moderna” highlighted — Word Cloud Using Keyword Frequency in Tweets

A bar chart showing the frequency of words related to COVID-19 vaccines, such as PfizerBioNTech with a count of 127. — Plotly Bar Chart Using Keyword Frequency in Tweets

This is a great example of an NLP model that generates an easy-to-understand metric: frequency. Make sure that such important bits of information do not get lost in your visualization.
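To make the frequency idea concrete, here is a minimal sketch of how keyword counts could be computed and handed to a Plotly bar chart. This is not the code from the notebook: the function names, the tiny stop-word list, and the sample tweets are all made up for illustration, and the keyword model in the post may tokenize differently (for example, it also surfaces two-word phrases).

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one would be longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "is", "i", "my", "got"}


def keyword_counts(tweets, top_n=10):
    """Return the top_n (keyword, count) pairs across all tweets."""
    counts = Counter()
    for tweet in tweets:
        # Lowercase and split on anything that is not a letter or digit.
        tokens = re.findall(r"[a-z0-9]+", tweet.lower())
        counts.update(token for token in tokens if token not in STOP_WORDS)
    return counts.most_common(top_n)


def keyword_bar_chart(top_keywords):
    """Build a horizontal Plotly bar chart from (keyword, count) pairs."""
    # Imported here so the counting helper also works without Plotly installed.
    import plotly.express as px

    keywords = [keyword for keyword, _ in top_keywords]
    counts = [count for _, count in top_keywords]
    fig = px.bar(
        x=counts,
        y=keywords,
        orientation="h",
        labels={"x": "Frequency", "y": "Keyword"},
        title="Keyword frequency in Tweets",
    )
    fig.update_yaxes(autorange="reversed")  # most frequent keyword at the top
    return fig


# Made-up example tweets, not taken from the Kaggle data set.
sample_tweets = [
    "Got my PfizerBioNTech vaccine today",
    "Moderna vaccine appointment booked",
    "PfizerBioNTech vaccine second dose today",
]
top = keyword_counts(sample_tweets, top_n=3)
print(top)
```

In a notebook, `keyword_bar_chart(top).show()` would render the chart. The point of the design is that the exact counts stay attached to each word, which is the information a word cloud hides.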

Connecting data types

Number two: Think about how to connect the unstructured text data to structured data when possible.

I work a lot with sentiment analysis models at IBM. I used to work with the Chief Analytics Office and built a tool named Clarity that displayed marketing intelligence analytics for marketing teams and offering managers. These users need to understand how products are doing and they often need to know how a product’s performance impacts the revenue stream. Many of our most impactful analyses involved pulling in review data, performing sentiment analysis on it, and then connecting these to the structured data on those products. Those results often led to the most meaningful conversations with our stakeholders.

With our PfizerBioNTech data set, we can build a good sentiment analysis model. But what can we do with it to give actionable insights? Let’s try connecting it to stock prices, a metric that business stakeholders understand well. Combining stock prices with the sentiment scores, we can produce the following interactive visualization. The sentiment analysis model is built using Watson’s Natural Language Understanding API, and the visualization is again built in Plotly.

a stacked bar chart over time showing total sentiment as a sum of negative and positive sentiment about COVID-19 vaccines and a line chart overlain on the same time scale showing stock prices of BioNTech. They appear to correlate between highs and lows. — Sentiment in Tweets vs. Stock Price

You could just show your stakeholders sentiment over time, but that is unlikely to give them real insight or start a conversation. When you bring in the stock prices and combine the two types of data, you can help stakeholders see how Tweet sentiment correlates with, and may influence, the business. In the image above, the peaks and valleys line up across both sets of data. The signal from the Twitter data appears to track the stock price closely, and now you can help your clients or stakeholders understand why.
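One way to sketch the join between the two data types is shown below. It assumes each Tweet has already been scored between -1 and 1 (in the post that scoring comes from Watson’s Natural Language Understanding API; here the scores are simply given). The function names, the dates, the scores, and the prices are all invented for illustration, and a correlation over three toy days says nothing about the real data.

```python
from collections import defaultdict


def daily_sentiment(scored_tweets):
    """Sum positive and negative scores per day from (date, score) pairs."""
    totals = defaultdict(lambda: {"positive": 0.0, "negative": 0.0})
    for day, score in scored_tweets:
        bucket = "positive" if score >= 0 else "negative"
        totals[day][bucket] += score
    return dict(totals)


def join_on_date(sentiment_by_day, close_by_day):
    """Return (day, positive, negative, close) rows for days in both sources."""
    shared_days = sorted(set(sentiment_by_day) & set(close_by_day))
    return [
        (
            day,
            sentiment_by_day[day]["positive"],
            sentiment_by_day[day]["negative"],
            close_by_day[day],
        )
        for day in shared_days
    ]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def sentiment_vs_price_figure(rows):
    """Stacked sentiment bars with the closing price on a second y-axis."""
    # Imported here so the data helpers also work without Plotly installed.
    import plotly.graph_objects as go
    from plotly.subplots import make_subplots

    days = [row[0] for row in rows]
    fig = make_subplots(specs=[[{"secondary_y": True}]])
    fig.add_trace(go.Bar(x=days, y=[row[1] for row in rows], name="Positive"),
                  secondary_y=False)
    fig.add_trace(go.Bar(x=days, y=[row[2] for row in rows], name="Negative"),
                  secondary_y=False)
    fig.add_trace(go.Scatter(x=days, y=[row[3] for row in rows],
                             name="Closing price", mode="lines+markers"),
                  secondary_y=True)
    fig.update_layout(barmode="relative", title_text="Sentiment in Tweets vs. Stock Price")
    fig.update_yaxes(title_text="Sentiment", secondary_y=False)
    fig.update_yaxes(title_text="Closing price", secondary_y=True)
    return fig


# Invented toy values: (date, sentiment score) pairs and closing prices.
scored_tweets = [
    ("2021-01-04", 0.6), ("2021-01-04", -0.2),
    ("2021-01-05", 0.9), ("2021-01-05", 0.5),
    ("2021-01-06", -0.7), ("2021-01-06", 0.1),
    ("2021-01-07", 0.3),  # no price for this day, so the join drops it
]
closing_prices = {"2021-01-04": 100.0, "2021-01-05": 108.0, "2021-01-06": 95.0}

rows = join_on_date(daily_sentiment(scored_tweets), closing_prices)
net_sentiment = [positive + negative for _, positive, negative, _ in rows]
r = pearson(net_sentiment, [close for *_, close in rows])
print(f"Days joined: {len(rows)}, correlation: {r:.2f}")
```

Calling `sentiment_vs_price_figure(rows).show()` in a notebook would draw the combined chart. Reporting a number such as the correlation next to the chart gives stakeholders something quantified to discuss, rather than relying only on peaks and valleys that look alike.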

Creativity

Number three: Show what your model is doing in a creative way.

For this last section, let’s say you want to identify sentiment towards the FDA so that you can understand how people feel about the choices the FDA is making. One thing you will want to pull in is a named-entity recognition (NER) model. NER models identify entities in the text by type of entity, such as organizations.

spaCy is a great, easy-to-use open-source Python library for natural language processing. It has a cool built-in visualizer called displaCy. You run your text through a spaCy pipeline, and displaCy then highlights the named entities in it. So, instead of building a diagram that shows what an NER model does, you can give an example. This provides a really easy way to explain what the model is doing.

A sentence with NER highlighting — The original sentence is “The US FDA has approved two coronavirus vaccines: Pfizer and Moderna Inc. Over 200 vaccines are being developed.” The phrase “The US FDA” is marked as an Org, the word “two” is marked as a “cardinal”, the word “Pfizer” is marked as a person, the word “Moderna” as an Org, and the word 200 is marked as a “cardinal”. — NER Output from displaCy
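A highlighted example like the one above can be produced with only a few lines. The sketch below assumes spaCy and its small English model `en_core_web_sm` are installed; the post does not say which model was used, so that choice, the function names, and the `mentions_org` filter are my own illustration rather than the notebook’s code.

```python
def extract_entities(text, model_name="en_core_web_sm"):
    """Run a spaCy pipeline and return (entity text, label) pairs plus HTML.

    Assumes the model is installed, for example via
    `python -m spacy download en_core_web_sm`.
    """
    # Imported here so the filtering helper below works without spaCy installed.
    import spacy
    from spacy import displacy

    nlp = spacy.load(model_name)
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    # style="ent" highlights named entities; jupyter=False returns an HTML string.
    html = displacy.render(doc, style="ent", jupyter=False)
    return entities, html


def mentions_org(entities, name):
    """True if any ORG entity contains `name` (case-insensitive)."""
    return any(
        label == "ORG" and name.lower() in text.lower()
        for text, label in entities
    )


# Entity pairs copied from the figure caption above, so no model is needed here.
caption_entities = [
    ("The US FDA", "ORG"),
    ("two", "CARDINAL"),
    ("Pfizer", "PERSON"),
    ("Moderna", "ORG"),
    ("200", "CARDINAL"),
]
print(mentions_org(caption_entities, "FDA"))
```

Calling `extract_entities` on a Tweet returns the entity pairs and the highlighted markup, and `mentions_org(entities, "FDA")` could then select the Tweets whose sentiment you want to track for the FDA. Note that a filter on ORG labels would miss “Pfizer” in the example above, because the model tagged it as a person; showing stakeholders the displaCy output makes that kind of model behavior easy to spot.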

Some caveats

No two use cases are the same, so some of the tips I’ve given here may not apply. It can be hard to tell when they apply and when they don’t. Still, when it comes time to present your results, this is a very useful exercise to go through, and knowing which approach to take does get easier with time. Questions to ask yourself:

· Is your visualization clear?

· Is your visualization quantified?

· Is the intention easily understood?

Keep in mind, you will have to dedicate a lot of time to building visualizations as it’s not a straightforward process. In the end, though, when you go to present your work, you’ll save time. Your visualization will explain your work for you, and you’ll spend more time actually talking about the insights and recommendations themselves. I’ve seen visualizations be the catalyst for the direction a conversation can go with a client. For example, using the same NLP model under the hood, I’ve seen great visualizations guide a good conversation with a client and bad visualizations make a conversation go very poorly. So, it is definitely time well spent.

Conclusion

NLP models can be hard to explain. Hopefully, these tips provide a starting point for data scientists to build cool and creative visualizations. 😎

Below, I’ve provided a list of resources I’ve used in this post and a link to a blog post with common Python NLP libraries. Enjoy and use these as a basis for your own exploration in NLP visualization!

References and useful links:

Overview of NLP libraries

Plotly — the interactive visualization library used in this post

The Watson API I used in this post

The Kaggle dataset I used in this post

GitHub link to the Jupyter notebook with all the code for the models/visualizations