Karthik Vadhri
Published in Intuition Matters
6 min read · Jul 12, 2022


Quick & Easy ways to convert Text Data into Actionable Insights

Text analytics has been evolving fast, with a lot of research going into analysing and visualising text data. The huge focus on text data is partly due to the pace at which it is being created.

With global internet penetration reaching 60% and the population growing at a rate of 1.1%, the internet has already reached 5 billion users across the globe. These users generate almost 2.5 quintillion bytes of data every day (counting everything from uploaded photos to social media posts and product reviews). A significant chunk of it is text: tweets, emails, messages and so on.

500 million tweets are sent across the globe, daily!

Businesses want to draw insights from this text data so that they can make data-informed decisions on the next best action and on how to generate profits from the global internet population.
Text data arrives in many forms, such as user surveys and feedback, and given the complexity of language we have seen business users struggle to mine insights from it.

This has made text analytics a niche area for data scientists, and demand is growing faster than the pace of data creation. My previous article, Natural Language Processing by Intuition, gives an overall understanding of the various NLP techniques and the nuances of each.

This article, however, focuses on the tools and techniques that can help convert text data into actionable insights with little effort and little knowledge of core NLP techniques. It is intended to help the following groups:

1- Business users: who want to quickly uncover the hidden insights in the short text they have access to, and who do not have the privilege of hiring a data scientist. [Note that this requires a basic understanding of Python.]

2- Computer engineers: who know programming, but not data science.

3- Data scientists: with tight deadlines to deliver business value from their work, myself being one of them.

Many algorithm implementations are available as pre-built packages (in Python or other programming languages), which makes life easier and lets us explore this data with very little effort.

A Google search for the top 10 text analytics packages will probably return a lot of licensed tools (credit to the marketing teams), but there are many out-of-the-box implementations that can generate insights from text data with a few clicks or fewer than 5 lines of Python code.

Libraries such as NLTK, spaCy and scikit-learn offer many options that let a data scientist play with text, such as count vectorisation, bigrams, stop words and POS tagging. They are well documented, but they are aimed at data scientists with a deep understanding of NLP techniques. However, there are a few hidden gems that make things easy for anyone (not just a data scientist) who wants to understand a piece of text. A few sample use cases are listed below:
a) Authors who want a summary of an article, along with insights such as average reading time and readability.
b) Operations managers who want to make sense of thousands of pieces of product/service feedback received from customers.
c) Product managers who look at comments from user surveys on a newly launched product to understand the unmet needs of the users and plan their roadmap.

and the list goes on.

1: Word Tree : Keyword in Context

A word tree is a visualisation that displays text data in a hierarchical way: as a tree of elements, usually single words, connected by lines. This helps understand the different contexts a word has been used in, from the corpus of text.

For business users, there are implementations where they can paste the text data directly and get a visual representation.
The Google Visualisation library has a direct implementation of the word tree, which is loaded as a JavaScript library and can be easily integrated into a web page.

Sample Word Tree from Jason Davies — Open Source Implementation

The word tree was introduced by Martin Wattenberg and Fernanda Viégas in their paper. Jason Davies also has an open-source implementation, targeted at business users, where users can just paste text and generate a word tree instantly.
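
For readers comfortable with Python, the idea behind a word tree can be sketched with the standard library alone. The snippet below is only an illustration of the concept, not the Google or Jason Davies implementation: the function names build_word_tree and print_tree and the sample reviews are made up for this example. It collects the words that follow a chosen keyword and nests them, so that shared continuations end up on the same branch.

```python
import re


def build_word_tree(text, root, depth=3):
    """Return a nested dict of the words that follow `root`, up to `depth` words deep."""
    tree = {}
    # Work sentence by sentence so that branches do not run across sentence ends.
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = re.findall(r"[a-z']+", sentence)
        for i, word in enumerate(words):
            if word != root:
                continue
            node = tree
            # Walk down the tree, creating a branch for each following word.
            for follower in words[i + 1 : i + 1 + depth]:
                node = node.setdefault(follower, {})
    return tree


def print_tree(tree, indent=0):
    """Print the nested dict as an indented outline."""
    for word, children in sorted(tree.items()):
        print(" " * indent + word)
        print_tree(children, indent + 2)


reviews = (
    "The battery lasts all day. The battery drains quickly when gaming. "
    "The battery lasts longer than my old phone."
)

tree = build_word_tree(reviews, root="battery", depth=2)
print_tree(tree)
```

Both sentences that continue with "lasts" share one branch, which is what makes the different contexts of a keyword easy to compare at a glance.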

2: textstat

Textstat is an easy-to-use Python library for calculating statistics from text. It helps determine readability, complexity and grade level. It supports several widely used languages (English, French, German, etc.) and is evolving to support more.
It helps understand the basic stats of a text through predefined functions: reading_time(text) estimates the reading time of an article, flesch_reading_ease(text) assesses how easy a document is to read, and so on.

3: textblob

TextBlob is another easy-to-use, open-source Python library. It is built as a wrapper layer around NLTK and pattern, and plays nicely with both.

It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more, in just a few lines of python code.

4: texthero

Still in its beta version, Texthero is a fast-evolving Python library that lets you work with text-based datasets quickly and effortlessly. It is very simple to learn and is designed to be used on top of Pandas. Texthero includes tools for preprocessing data, extracting key words and phrases, visualisation, text segmentation, etc.
Data Scientists & NLP experts can contribute to this fast growing community.

5: textwrap

The textwrap module, part of Python's standard library, is used to format and wrap plain text, making large paragraphs easier to read and understand.
Text wrapping is a useful tool when processing textual languages in applications of natural language processing, data analysis, and even art or design work.
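
Since textwrap ships with Python, it can be tried without installing anything. The sketch below wraps a made-up review to a fixed width and also produces a one-line preview. The review text and the chosen widths are arbitrary.

```python
import textwrap

review = (
    "The product arrived on time and works well, but the instructions were "
    "hard to follow and the customer support line kept me waiting for almost "
    "an hour before anyone answered."
)

# Break the paragraph into lines of at most 40 characters.
wrapped = textwrap.fill(review, width=40)
print(wrapped)

# Collapse the text to a single-line preview of at most 60 characters.
preview = textwrap.shorten(review, width=60, placeholder=" ...")
print(preview)
```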

A few other notable mentions follow. These require hands-on expertise in Python and a detailed understanding of NLP.

scattertext

As the name suggests, scattertext produces a scatter plot of text data. It enables data scientists to find distinguishing terms in corpora and display them in an interactive HTML scatter plot.

Scatter Text in Action

BERTopic
BERTopic is a topic modelling technique that leverages Hugging Face transformers and c-TF-IDF to create dense clusters, allowing for easily interpretable topics whilst keeping important words in the topic descriptions.

Initially developed by Maarten Grootendorst in 2020, this algorithm has been steadily gaining traction for all text segregation/topic modelling use cases.

A data scientist can visualise the clusters/topics formed with just one extra line of code. This functionality is a significant value add, alongside the efficient embeddings created with its default sentence transformer model, paraphrase-MiniLM-L6-v2. The remaining steps are integrated into the algorithm: reducing the dimensionality of the embeddings with UMAP, clustering semantically similar documents with HDBSCAN, and creating topic representations with c-TF-IDF (class-based term frequency, inverse document frequency).
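
To give some intuition for the last step, the sketch below computes a simplified c-TF-IDF score in plain Python. It follows the formula given in the BERTopic documentation (the term frequency within a topic, multiplied by log(1 + A / f), where A is the average number of words per topic and f is the frequency of the term across all topics). It is not the library's own code: normalisation and tokenisation details are left out, and the clusters and the helper names c_tf_idf and top_terms are made up for illustration.

```python
import math
from collections import Counter


def c_tf_idf(docs_per_topic):
    """Score the terms of each topic with a simplified class-based TF-IDF."""
    # Treat all documents of a topic as one long document and count its terms.
    term_counts = {
        topic: Counter(" ".join(docs).lower().split())
        for topic, docs in docs_per_topic.items()
    }
    # f: how often each term occurs across all topics.
    total_counts = Counter()
    for counts in term_counts.values():
        total_counts.update(counts)
    # A: average number of words per topic.
    avg_words = sum(total_counts.values()) / len(term_counts)

    return {
        topic: {
            term: tf * math.log(1 + avg_words / total_counts[term])
            for term, tf in counts.items()
        }
        for topic, counts in term_counts.items()
    }


def top_terms(term_scores, n=2):
    """Return the n highest-scoring terms of one topic."""
    return sorted(term_scores, key=term_scores.get, reverse=True)[:n]


# Hypothetical clusters, standing in for the output of the HDBSCAN step.
clusters = {
    "delivery": ["the parcel arrived late", "late delivery and the parcel was damaged"],
    "billing": ["the invoice was wrong", "wrong amount on the invoice again"],
}

scores = c_tf_idf(clusters)
for topic, term_scores in scores.items():
    print(topic, top_terms(term_scores))
```

Words that are frequent inside one cluster but rare elsewhere, such as "parcel" or "invoice", score higher than words like "the" that appear in every cluster. In the library itself, the whole pipeline runs with BERTopic().fit_transform(docs), and the extra line for the visualisation is a call to visualize_topics() on the fitted model.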

LIT — Language Interpretability Tool
This visual, interactive model-understanding tool for NLP use cases can be run either as a standalone server or inside notebook environments like Jupyter, Colab or GCP Vertex AI.

LIT is built to answer questions such as:
a) What kind of examples does my model perform poorly on?
b) Why did my model make this prediction? Can this prediction be attributed to adversarial behaviour, or to undesirable priors in the training set?
c) Does my model behave consistently if I change things like textual style, verb tense, or pronoun gender?

Although all companies are investing in building data science teams, business stakeholders and product managers find it difficult to get data scientists' time for their use cases. This is where leveraging existing open-source implementations can help convert data into actionable insights without much effort.
Data scientists are also under pressure to showcase business value from data, and in shorter sprints. Leveraging such tools and pre-built applications can help DS teams deliver value through actionable insights.

Please feel free to comment if you find other NLP implementations that helped you deliver business value in a snap!

Stay tuned to Intuition Matters for more informative articles.
