Photo by @scottwebb from Unsplash with added text by author

Two Python Repositories for Text Visualisation

From well-made to wow for text visualisations

Alex Moltzau
Published in The Startup
3 min read · Aug 17, 2020


It is remarkable how much is freely available on the Internet, especially for programming languages with a large user base such as Python. GitHub even has a topic dedicated to text visualisation. I thought I would examine the two most-starred repositories (repos) in this topic: Texthero and Scattertext.

Texthero

Straight away Texthero starts with an easy introduction.

It is all about:

  1. Text preprocessing.
  2. Representation.
  3. Visualization.

It is a repo that promises to take you from zero to hero.

“Texthero is a python toolkit to work with text-based dataset quickly and effortlessly. Texthero is very simple to learn and designed to be used on top of Pandas.”

It includes tools for several aspects:

  • Preprocess text data: it offers both out-of-the-box solutions but it’s also flexible for custom-solutions.
  • Natural Language Processing: keyphrases and keywords extraction, and named entity recognition.
  • Text representation: TF-IDF, term frequency, and custom word-embeddings (wip)
  • Vector space analysis: clustering (K-means, Meanshift, DBSCAN and Hierarchical), topic modeling (wip) and interpretation.
  • Text visualization: vector space visualization, place localization on maps (wip).
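
To make the first two items more concrete, here is a minimal sketch of what "preprocess" and "TF-IDF representation" mean, written with only the Python standard library. It is not Texthero's implementation: the function names `clean` and `tfidf`, the tiny stopword list and the two example sentences are all assumptions made for illustration, and the weighting is the plain textbook formula tf * log(N / df).

```python
import math
import re
import string
from collections import Counter

# Tiny illustrative stopword list (an assumption, not Texthero's list).
STOPWORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "on"}

# Translation table that maps every ASCII punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean(text):
    """Lowercase, drop digits and punctuation, remove stopwords, collapse spaces."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)
    text = text.translate(PUNCT_TO_SPACE)
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


def tfidf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenised = [doc.split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero for empty documents
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical example sentences.
raw_docs = [
    "The striker scored 2 goals in the final!",
    "The bowler took 5 wickets in the final.",
]
cleaned_docs = [clean(doc) for doc in raw_docs]
vectors = tfidf(cleaned_docs)
```

A word that appears in every document, such as "final" here, gets a weight of zero, while words unique to one document keep a positive weight. That is the property that makes TF-IDF vectors useful input for the clustering and vector space visualisation mentioned in the list.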

It is free, open-source and well documented.

The authors argue that it is hard to juggle several different NLP tools (NLTK, SpaCy, Gensim, TextBlob, Sklearn), which is why they developed Texthero.

Install texthero via pip:

pip install texthero

“☝️Under the hoods, Texthero makes use of multiple NLP and machine learning toolkits such as Gensim, NLTK, SpaCy and scikit-learn. You don’t need to install them all separately, pip will take care of that.

For faster performance, make sure you have installed Spacy version >= 2.2. Also, make sure you have a recent version of python…”
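
If you want to check the spaCy requirement before you start, something like the sketch below may help. It uses only the standard library (`importlib.metadata`, available from Python 3.8). The helper names `parse_version` and `meets_minimum` are my own, and the version parsing is deliberately naive: it only compares the leading numeric parts.

```python
from importlib import metadata


def parse_version(version):
    """Turn '2.2.4' into (2, 2, 4); stop at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(package, minimum):
    """Return True/False for an installed package, or None if it is missing."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
    return parse_version(installed) >= parse_version(minimum)


# Prints True, False, or None depending on what is installed locally.
print("spaCy >= 2.2:", meets_minimum("spacy", "2.2"))
```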

It can look pretty neat.

I recommend checking it out! I will be trying it on my own project.

Scattertext

Another package is Scattertext. I would say this one is complementary to Texthero. It is also quite impressive: it produces an interactive plot, which can be neat if you want to present information in a visually appealing way, provided your data suits it.

“A tool for finding distinguishing terms in corpora, and presenting them in an interactive, HTML scatter plot. Points corresponding to terms are selectively labeled so that they don’t overlap with other labels or points.”
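
To give a feel for what "distinguishing terms" means, here is a small standard-library sketch that compares word usage in two categories. It is not Scattertext's scoring method: I have swapped in a simple smoothed log-odds ratio, and the function names and the two toy categories are assumptions made for illustration. Each returned row holds a term, its count in each category (the kind of pair you could plot as x and y), and a score.

```python
import math
from collections import Counter


def term_counts(documents):
    """Count whitespace-separated, lowercased terms across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    return counts


def distinguishing_terms(docs_a, docs_b, smoothing=1.0):
    """Score terms by smoothed log-odds; positive = category A, negative = B."""
    counts_a = term_counts(docs_a)
    counts_b = term_counts(docs_b)
    vocab = set(counts_a) | set(counts_b)
    # Add-one style smoothing so terms missing from one side do not divide by zero.
    total_a = sum(counts_a.values()) + smoothing * len(vocab)
    total_b = sum(counts_b.values()) + smoothing * len(vocab)
    rows = []
    for term in vocab:
        rate_a = (counts_a[term] + smoothing) / total_a
        rate_b = (counts_b[term] + smoothing) / total_b
        rows.append((term, counts_a[term], counts_b[term], math.log(rate_a / rate_b)))
    # Most A-like terms first, most B-like terms last.
    return sorted(rows, key=lambda row: row[3], reverse=True)


# Hypothetical toy corpora.
football = ["goal goal penalty referee", "goal striker referee"]
cricket = ["wicket wicket bowler umpire", "wicket batsman umpire"]
rows = distinguishing_terms(football, cricket)

for term, count_a, count_b, score in rows:
    print(f"{term:10s} football={count_a} cricket={count_b} score={score:+.2f}")
```

Terms at the top of the list are the ones most characteristic of the first category, and terms at the bottom are most characteristic of the second.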

Scattertext has a lot of demos, so you will find plenty of examples to experiment with or draw inspiration from.

If you have a lot of documents and want to see term frequencies alongside every occurrence of a term, this can be excellent.

The visualisation is interactive and searchable. Check it out here.

It is stunning how much work has gone into making this, and it is helpful that it is shared online.

I hope this was helpful if you are currently working with text.

You could likely have found this yourself, but if you are following my journey I hope you have discovered something you did not previously know about.

This is #500daysofAI and you are reading article 440. I am writing one new article about or related to artificial intelligence every day for 500 days.

Alex Moltzau

Policy Officer at the European AI Office in the European Commission. This is a personal Blog and not the views of the European Commission.