NLP visualizations for clear, immediate insights into text data and outputs
Using Plotly Express and Dash to explore data and present outputs in natural language processing (NLP) projects.
Extracting information from text remains a difficult yet important challenge in the era of big data. Whether the source is customer feedback, social media posts, or the news, the sheer volume of data to be analyzed can bury the information you are trying to extract.
This is where modern natural language processing (NLP) tools come in. They can capture prevailing moods about a particular topic or product (sentiment analysis), identify key topics from texts (summarization/classification), or amazingly even answer context-dependent questions (like Siri or Google Assistant). Their development has provided access to consistent, powerful, and scalable text analysis tools for individuals and organizations.
Still, aspects unique to language can make it difficult to explore data for NLP or to communicate results. For instance, metrics that apply in the numerical domain may not be available for NLP. (What would be the mean, or the standard deviation, of a set of word tokens?) Even if they could be calculated, presenting the data to audiences can be challenging.
So, in this article, we want to share with you ways that Plotly Express and Dash can ease some of this pain.
Plotly Express and Dash were designed with code readability and succinctness as priorities, to enable easy creation of high-quality local (Plotly Express) and web dashboard (Dash) visualizations. In other words, they aim to have data visualization support your work, not have it become a new headache.
With that said, let’s get into it! We use a consumer complaints database corpus for this example, but the concepts and visualizations we discuss should be universally applicable.
(All analysis and notes here are for demonstration purposes only.)
Our dataset contains over 18,000 rows and three columns. While this isn’t large by modern standards, it’s not really possible to ‘eyeball’ this raw data.
Let’s explore this dataset with Plotly Express, starting with the distribution of complaint counts by date (to see the trend over time):
Now we’ll plot a histogram for the 20 companies with the most complaints:
Or by narrative length:
You may have noticed the succinctness of our code. Analyzing by multiple variables or switching to a log scale is also a cinch; just pass additional parameters as shown below:
Even better, these Plotly charts integrate seamlessly into Dash for dashboard generation as you will see later.
Now that we have looked at the distributions, let’s move on to review the text data in substance, starting with n-grams.
N-grams are simply contiguous sequences of n tokens (words), and they have many practical applications as well as being a great exploratory tool. As single words can only tell us so much, let’s move straight to plotting counts of the top bigrams (two-word sequences).
Isn’t that neat? Most of these bigrams appear to indicate sensible groups of complaint types, and the counts show the volume of each group (credit report and credit card related complaints appear to be most common).
To drill down further into this data, a hierarchical visualization, such as a treemap, could be used. This example below divides the data by company and then whether the phrase ‘credit report’ is included. Box sizes indicate group sizing, and color indicates average narrative length.
Notice that the visualization immediately reveals length-related patterns. Credit report related complaints tend to be longer, and a couple of companies’ complaints also stand out generally.
In some cases, you may wish to compare proportions of complaint bigrams for each company, in which case a stacked bar might be useful:
Companies with higher volumes of credit card complaints pop out to the eye, as does one with a high share of student loan-related complaints.
For a closer review, we may even compare two companies directly, as done here for the top 50 bigrams:
This enables an easy comparison of two datasets by subject matter.
While we don’t have time to get into the technical weeds, very broadly speaking, word embeddings (dense embeddings to be precise) enable qualitative comparisons of words. They can represent words, and, by extension, concepts or documents as high dimensional vectors, which also provide opportunities for interesting visualizations. Take a look at this simple representation of bigrams using a bubble chart:
Here, the high-dimensional bigram vectors are projected onto two dimensions using a dimensionality reduction technique called t-SNE.
Similar charts could be produced for any subset to compare text similarities and insights — say, for each company, or by length.
This might be a good opportunity to highlight that each of these charts was created in just a few lines of code using Plotly Express. Not only that, although you see static screenshots here, Plotly will generate interactive charts in your browser or notebook. Crucially, they can easily be incorporated into a live dashboard with Dash.
NLP dashboards made easy with Dash
The value proposition of Dash is similar to, and intertwined with, what made Python the leading language for NLP. It has a gentle learning curve, readable yet succinct code, a thriving community of users, and useful libraries and modules that can be leveraged to create dashboards.
Significantly for data scientists who are not also web developers, Dash abstracts many elements of web development to Python, allowing you and your team to remain in the Pythonic state of mind if desired.
Take a look at this Dash example for a navigation bar. Notice that the HTML/DOM elements are all created from within Python.
This is the web app that the snippet was taken from.
Dash provides Python interfaces to web-based components, while being declarative and reactive. Together, these qualities enable easy creation of flexible, informative front ends that anyone can interact with, whether for data exploration or presentations.
As foreshadowed above, incorporating one of these Plotly Express charts into Dash is straightforward.
For example, the word embedding bubble chart can be implemented in Dash like this:
As implemented, the user can select a parameter (perplexity) as a dropdown item, which initiates the callback function and updates the graph reactively — changing the 2-dimensional representation of the vectors. Below is a comparison of the bubble charts, at two different perplexity values.
This two-company bigram comparison is also incorporated in the Dash application as shown below.
More importantly, we only needed around 30 lines of code to add each Plotly Express chart to the Dash app, including interactivity and formatting, all without ever leaving Python. We think that this will ultimately improve productivity and efficacy for data scientists such as yourself.
Obviously, this is just a quick skim of what is possible in NLP visualization, but we hope to have shown you the kind of simplicity and ease of use that we believe make Dash and Plotly powerful tools for NLP practitioners.