7 NLP Tasks for Information Visualization

Get a grip on the Natural Language Processing landscape! Start your NLP journey with this Periodic Table of 80+ NLP tasks

Rob van Zoest
The Startup
7 min read · Feb 3, 2021


Periodic Table of Natural Language Processing Tasks by www.innerdoc.com, created with the Periodic Table Creator

Russian chemist Dmitri Mendeleev published the first Periodic Table in 1869. Now it’s time for the NLP tasks to be organized in the Periodic Table style!

The variation and structure of NLP tasks are endless. Still, you can build NLP pipelines from standard NLP tasks and divide those tasks into groups. But what do these tasks entail?

More than 80 frequently used NLP tasks are explained!

Group 15: Information Visualization

75. Interactive App Creation

Presenting your NLP task results should be transparent, interactive and fancy. Several solutions are available to share your data and code. Notebooks are open-source web applications that allow you to create and share documents that contain live code, equations, visualizations and narrative text. These Notebooks and web apps are flexible, and you can arrange the user interface with building blocks or plugins for your specific use case.

With Streamlit you can code the UI in the same script as your analysis. When you save the script, the browser refreshes automatically. It’s a great way of building interactive demos. You can also deploy your scripts to the Streamlit cloud platform, although the computing power of the free tier is not suited for heavy apps.

Streamlit self-driving car demo with code on the left and browser UI with a sidebar on the right (source)
Streamlit demo: Controllable face GAN generator (source)
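
As an illustration, here is a minimal Streamlit sketch; the text and token count are placeholders for your own NLP results. Save it as app.py and run it with streamlit run app.py:

```python
# Minimal Streamlit sketch: a text box plus a button that shows a toy "analysis".
import streamlit as st

st.title("NLP demo")
text = st.text_area("Paste some text", "Apple is opening a new office in Amsterdam.")
if st.button("Analyze"):
    # Replace this with your own NLP pipeline; the token count is just a placeholder.
    st.write(f"The text contains {len(text.split())} whitespace-separated tokens.")
```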

Jupyter Notebooks (f.k.a. IPython Notebook) facilitate in-browser interactive computing with immediate results. There is also JupyterLab, a web-based interactive development environment for Jupyter notebooks.

Jupyter Notebook with text-, code- and output blocks (source)
JupyterLab environment (source)

Google’s Colab (short for Colaboratory) notebooks are like Jupyter Notebooks: they are free, run in the cloud and require no setup. In Colab you can choose to run on a (light but free) GPU runtime instead of a CPU.

Colab demo notebook (source)

76. Annotated Text Visualization

Printing text, but prettier. Often you want to show text with emphasis on specific words and their metadata. For Streamlit apps there is a simple word annotations plugin. spaCy also has its Named Entity Visualizer.

Annotations made with a Streamlit plugin (source)
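
For example, spaCy’s Named Entity Visualizer (displaCy) can render entities directly from a parsed document. A minimal sketch, assuming the small English model has been downloaded:

```python
# Minimal sketch of spaCy's Named Entity Visualizer (displaCy).
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The Eiffel Tower was built in Paris in 1889.")

# Outside a notebook, render() returns HTML that you can save and open in a browser;
# inside Jupyter, use displacy.render(doc, style="ent", jupyter=True) instead.
html = displacy.render(doc, style="ent", page=True)
with open("entities.html", "w", encoding="utf-8") as f:
    f.write(html)
```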

77. Wordcloud

The Wordcloud has been around for a long time. Visualizing information is a profession in itself, so there are best practices, but Wordclouds seem to ignore them. Here are some remarks about the (missing) elements of a Wordcloud:

  • Stopwords are excluded, even though a word like don’t carries important meaning in front of another word. Including stopwords would mess up the Wordcloud because of their high frequencies.
  • Multi-word expressions are not calculated. The separate words of a multi-word expression (e.g. New York Times) will be interpreted totally differently.
  • Different colors have no different meaning.
  • Vertical or horizontal words have no different meaning. The same applies to words at the top/bottom/left/right.
  • No context is given to clarify the sense of that word.

Although there is a lot of resistance to Wordclouds, they are still around. You can generate your own with the stylecloud Python library.

Compare two Wordclouds about the State of the Union 2002 vs 2011 (source)
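
Generating a Wordcloud with stylecloud takes only a few lines. A minimal sketch (speech.txt is a hypothetical input file, and the parameter names follow the project README and may differ between versions):

```python
import stylecloud

# Hypothetical input file with raw text; icon_name takes a Font Awesome icon name.
stylecloud.gen_stylecloud(file_path="speech.txt",
                          icon_name="fas fa-comment",
                          output_name="wordcloud.png")
```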

78. Word Embedding Visualization

Word Embeddings are often visualized to inspect the embedding and to get a feel for the cohesiveness of a subset of it. It is all about dimensionality reduction: how to get a 2-D chart from, for example, a 300-dimensional embedding. Three commonly used dimensionality reduction techniques (a minimal sketch follows the list):

  • t-SNE (t-Distributed Stochastic Neighbor Embedding) maps the multi-dimensional data to a lower-dimensional space. This is computationally expensive. After this process, the input features are no longer identifiable, and you cannot make any inference based only on the output of t-SNE. Hence it is mainly a data exploration and visualization technique. t-SNE is good at preserving local context (neighbors).
  • PCA (Principal Component Analysis) is a linear feature extraction technique. It combines your input features in such a way that you can drop the least important features while still retaining the most valuable parts of all of them. As an added benefit, the new features or components created by PCA are all independent of one another.
  • UMAP (Uniform Manifold Approximation and Projection) has some advantages over t-SNE, most importantly its increased speed and better preservation of the data’s local (neighbors) and global (clusters) structure.
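
A minimal sketch of such a projection with scikit-learn’s t-SNE; the random vectors stand in for real 300-dimensional word embeddings:

```python
# Project high-dimensional word vectors to 2-D with t-SNE and label the points.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

words = ["school", "teacher", "student", "car", "truck", "road"]
vectors = np.random.rand(len(words), 300)  # placeholder for real embeddings

coords = TSNE(n_components=2, perplexity=3, init="pca", random_state=0).fit_transform(vectors)

plt.scatter(coords[:, 0], coords[:, 1])
for word, (x, y) in zip(words, coords):
    plt.annotate(word, (x, y))
plt.show()
```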

Scattertext is a popular package for finding distinguishing terms in corpora and presenting them in an interactive HTML scatter plot. It does so by visualizing the difference and overlap between two categories of documents. You can try a demo about Republican vs. Democratic speeches.

Scattertext visualization (source)
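
A minimal sketch, adapted from the Scattertext README, that builds such a plot from the bundled 2012 convention speeches (function and parameter names may differ slightly between versions):

```python
import spacy
import scattertext as st

nlp = spacy.load("en_core_web_sm")

# Bundled sample corpus of 2012 Democratic and Republican convention speeches.
convention_df = st.SampleCorpora.ConventionData2012.get_data()
corpus = st.CorpusFromPandas(convention_df,
                             category_col="party",
                             text_col="text",
                             nlp=nlp).build()

html = st.produce_scattertext_explorer(corpus,
                                       category="democrat",
                                       category_name="Democratic",
                                       not_category_name="Republican",
                                       width_in_pixels=1000)
open("convention_scattertext.html", "w", encoding="utf-8").write(html)
```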

Google’s TensorBoard Embedding Projector graphically represents high-dimensional embeddings. This can be helpful in visualizing, examining, and understanding your embedding layers. A similar but simpler library is RASA’s Whatlies, which also helps to inspect your word embeddings.

Visualization in the Tensorflow projector for the most similar words to ‘school’ (source)
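
The Embedding Projector at projector.tensorflow.org accepts two TSV files: one with the vector values and one with the matching labels. A minimal sketch for exporting them (the random vectors stand in for your own embeddings):

```python
import numpy as np

words = ["school", "teacher", "student"]
vectors = np.random.rand(len(words), 300)  # placeholder for real embeddings

# vectors.tsv: one tab-separated vector per line; metadata.tsv: the matching word per line.
with open("vectors.tsv", "w", encoding="utf-8") as vec_file, \
     open("metadata.tsv", "w", encoding="utf-8") as meta_file:
    for word, vector in zip(words, vectors):
        vec_file.write("\t".join(f"{value:.5f}" for value in vector) + "\n")
        meta_file.write(word + "\n")
```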

79. Events on Timeline

Plotting events chronologically on a timeline increases insight. Some example setups for plots are:

  • Document Timeline: document publishing date vs document title
  • Sentence Timeline: date, timestamp or period vs its sentence (demo)
  • Dispersion Plot: the location (word offset) of a keyword in a text
Dispersion plot for Game of Thrones keywords (source)

Some libraries for inspiration: Seaborn stripplot, Yellowbrick dispersion plot, NLTK dispersion plot, Calmap heatmaps per day from a Pandas time series, and the Knightlab JavaScript timeline for data in Google Sheets.
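
A dispersion plot itself needs little more than Matplotlib. A minimal sketch, where got_script.txt is a hypothetical input file:

```python
# Mark the word offsets at which each keyword occurs in the text.
import matplotlib.pyplot as plt

tokens = open("got_script.txt", encoding="utf-8").read().lower().split()  # hypothetical file
keywords = ["stark", "lannister", "dragon"]

for row, keyword in enumerate(keywords):
    offsets = [i for i, token in enumerate(tokens) if token == keyword]
    plt.plot(offsets, [row] * len(offsets), "|", markersize=12)

plt.yticks(range(len(keywords)), keywords)
plt.xlabel("word offset")
plt.show()
```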

80. Locations on Geomap

Geocoded Named Entities can easily be mapped on a geographical map. There are several services and libraries to do the job:

  • With a Mapbox account (50k web map loads/month free tier) you can plot your coordinates from a Pandas dataframe to a Plotly scatter-mapbox.
  • GeoPandas makes working with geospatial data in python easier. It extends the datatypes used by Pandas to allow spatial operations on geometric types.
  • Folium creates beautiful and interactive maps by using Python and Leaflet, a JavaScript library for interactive maps. Folium has a lot of Jupyter demo notebooks; a minimal sketch follows below.
Folium chart with Python and Leaflet (source)
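
A minimal Folium sketch; the coordinates are hard-coded examples, where in practice they would come from your geocoded Named Entities:

```python
import folium

# Hard-coded example coordinates; replace with your geocoded Named Entities.
locations = {"Eiffel Tower": (48.8584, 2.2945), "Louvre": (48.8606, 2.3376)}

m = folium.Map(location=[48.859, 2.31], zoom_start=13)
for name, (lat, lon) in locations.items():
    folium.Marker(location=[lat, lon], popup=name).add_to(m)
m.save("entities_map.html")
```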

81. Knowledge Graph Visualization

A Knowledge Graph is a knowledge base with interlinked descriptions of entities. It can be used to put data into context and enhance search engines. Keywords and Named Entity Recognition, in combination with relation extraction, are a good source for feeding Knowledge Graphs.

Technically, a Knowledge Graph is a network that represents multiple types of entities (nodes) and relations (edges) in the same graph. Each link of the network represents an (entity, relation, value) triplet. For example: Eiffel Tower (entity) is located in (relation) Paris (value). When you know that A relates to B and B relates to C, you automatically benefit from the indirect connection between A and C.

Knowledge Graph visualization (source)

You can build your own network in Python with Networkx and draw it with pyplot from Matplotlib. If your data gets bigger, you need to scale up to a graph database like Grakn, ArangoDB or Neo4j.
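
A minimal sketch of such a graph with Networkx and Matplotlib, built from two hand-written (entity, relation, value) triplets:

```python
import networkx as nx
import matplotlib.pyplot as plt

# Hand-written example triplets; in practice these come from relation extraction.
triplets = [("Eiffel Tower", "is located in", "Paris"),
            ("Paris", "is capital of", "France")]

G = nx.DiGraph()
for entity, relation, value in triplets:
    G.add_edge(entity, value, relation=relation)

pos = nx.spring_layout(G, seed=42)
nx.draw(G, pos, with_labels=True, node_color="lightblue", node_size=2500, font_size=8)
nx.draw_networkx_edge_labels(G, pos, edge_labels=nx.get_edge_attributes(G, "relation"))
plt.show()
```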

Fraud Detection and Exploratory Data Analysis are important use cases. For example, graph databases were used intensively to explore complex networked data during the Panama Papers investigation.

ABOUT THIS POST

I have tried to make the Periodic Table of NLP tasks as complete as possible. It is therefore more of a long-read than a self-contained blog article, so I split the 80+ tasks into separate articles, one for each group of the Periodic Table.

You can find the other group-articles here!

The set-up and composition of the Periodic Table is subjective. The division of tasks and categories could have been done in multiple other ways. I appreciate your feedback and new ideas in the form below. I tried to write a clear and short description for each task. I omitted the deeper details, but provided links to extra information where possible. If you have improvements, you can add them below or contact me on LinkedIn.

Please drop me a message if you have any additions!

Download the Periodic table of NLP tasks here!

Create your own customized Periodic Table here!

ABOUT ME

Founder @ innerdoc.com | NLP Expert-Engineer-Enthusiast | Writes about how to get value from textual data | Lives in the Netherlands

Feel free to connect with me on LinkedIn, Twitter.com/innerdoc_nlp or follow me here on Medium.
