Image by Gerd Altmann from Pixabay

Automated Text Analysis using Streamlit

An Efficient and Quick Text Analysis Tool Built with Streamlit, Including Text Summarization, POS Tagging and Named Entity Recognition.

Introduction to Text Analysis

Text analytics is the automated process of translating large volumes of unstructured text into quantitative data to uncover insights, trends, and patterns. Combined with data visualization tools, this technique enables companies to understand the story behind the numbers and make better decisions. It is an artificial intelligence (AI) technology that uses natural language processing (NLP) to transform the unstructured text in documents and databases into normalized, structured data suitable for analysis or to drive machine learning (ML) algorithms. [1]

The need to automate the text analysis process

Anyone who has worked on an NLP problem knows that text analysis is an essential step before building any ML solution. Writing all of this exploration code from scratch can take a lot of time, so it is important to be able to move through the process quickly. An automated text analysis app has clear benefits: anyone who wants to get started with data exploration no longer has to write pipelines to visualize their data before modelling, which shortens the time between Exploratory Data Analysis (EDA) and model building.

To that end, I have built an application for automated text analysis using Streamlit. If you’re not familiar with Streamlit, you can check out this article I wrote on a powerful use case for building data apps with it.

Text Analysis Application

My application bundles many tools for data visualization and analysis. The user interface is a key part of it: the tools can be selected from a dropdown in the left sidebar.

Preview of the user interface of my application

In this article I provide a run-through of the application I have built. You can find the code for it here.

Requirements

There are a bunch of libraries you need to run this application; I have listed them all here.

Tools that can be accessed from the application

There are many tools integrated into this application, and they can all be exposed through a single dropdown, as in the sketch below.
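A minimal sketch of how such a dropdown can be wired up with Streamlit’s sidebar; the tool names and the dispatch on the selected value are illustrative rather than the app’s exact code:

```python
import streamlit as st

# Illustrative tool names; the real app defines its own list.
TOOLS = [
    "Word Cloud Generator",
    "N-Gram Analysis",
    "Sentiment Analysis",
    "Text Summarization",
    "POS Tagging",
    "Named Entity Recognition",
]

choice = st.sidebar.selectbox("Select a tool", TOOLS)

if choice == "Word Cloud Generator":
    st.header("Word Cloud Generator")
    # ...each branch renders the UI for the chosen tool
```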

1. Word Cloud Generator ☁️

A word cloud is a data visualization method that forms an integral part of text analytics. It is a text representation in which words are shown in varying sizes depending on how often they appear in our corpus. The words with higher frequency are given a bigger font and stand out from the words with lower frequencies. The prerequisite for this task is a set of text cleaning and processing functions.

I have integrated these functions into my tool; they run as soon as text is added to the application. Another feature I have added is that the output can be drawn in a particular shape, based on an image provided to the generator.

The function create_wordcloud is defined in a separate file and can be found here.
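The repository version includes its own cleaning steps, but a minimal sketch of the masked word cloud idea, using the wordcloud library and a hypothetical mask_path parameter, could look like this:

```python
import numpy as np
from PIL import Image
from wordcloud import WordCloud, STOPWORDS

def create_wordcloud(text, mask_path=None):
    # Non-white pixels of the mask image define the drawable shape.
    mask = np.array(Image.open(mask_path)) if mask_path else None
    wc = WordCloud(
        background_color="white",
        stopwords=STOPWORDS,
        mask=mask,
        contour_width=1,
        contour_color="black",
    )
    return wc.generate(text)
```

The resulting image can then be plotted with matplotlib’s imshow and displayed in Streamlit via st.pyplot.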

Word cloud output for Harry Potter data and a Snape mask

2. N-Gram Analysis

An n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs depending on the application, and n-grams are typically collected from a text or speech corpus. In this analysis, I try to identify the most commonly occurring n-grams. While a word cloud focuses on single words, n-gram analysis can yield multi-word phrases, which makes it useful for analyzing a person’s writing style: how repetitive it is and which patterns recur. It’s implemented in a similar manner to the word cloud, so I won’t go over it again; the code can be found here and the output here.
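For reference, here is one common way to count the most frequent n-grams using scikit-learn’s CountVectorizer; the function name and toy input are mine, not the app’s:

```python
from sklearn.feature_extraction.text import CountVectorizer

def top_ngrams(texts, n=2, k=10):
    # Count all n-grams of length n across the documents,
    # then return the k most frequent ones.
    vec = CountVectorizer(ngram_range=(n, n))
    counts = vec.fit_transform(texts).sum(axis=0).A1
    pairs = zip(vec.get_feature_names_out(), counts.tolist())
    return sorted(pairs, key=lambda p: p[1], reverse=True)[:k]

print(top_ngrams(["the boy who lived", "the boy who cried wolf"], n=2, k=3))
# e.g. [('boy who', 2), ('the boy', 2), ('cried wolf', 1)]
```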

3. Sentiment Analysis

This is a two-part task. The first part is sentiment extraction: given a sentence, identify whether it is positive or negative. The second is identifying which keywords in the sentence drive that sentiment. The keywords are extracted based on each word’s contribution to the final sentiment value of the statement.

The models were trained on a large corpus first to make them more robust and generalizable. The data was collected using a scraper that I built to scrape Google Play Store Reviews. The scraper collected roughly 12,500 reviews. The implementation of the scraper can be found in the Notebooks folder under the name of sentiment_analysis_data_collection.ipynb.

The code for the sentiment analysis task can be found here.
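The repository has the actual models, but as a rough, hypothetical illustration of the keyword-contribution idea: train a linear classifier on labeled reviews, then rank each word in a sentence by its tf-idf weight times the learned coefficient:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data; the app trains on ~12,500 scraped Play Store reviews.
reviews = ["great app, love the design", "terrible update, keeps crashing",
           "works perfectly, very useful", "awful interface, waste of time"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

vec = TfidfVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(reviews), labels)

def sentiment_with_keywords(sentence, k=3):
    # Classify the sentence and rank the words driving the prediction.
    X = vec.transform([sentence])
    label = "positive" if clf.predict(X)[0] == 1 else "negative"
    # Per-word contribution = tf-idf weight * learned coefficient.
    contrib = X.toarray()[0] * clf.coef_[0]
    ranked = sorted(zip(vec.get_feature_names_out(), contrib),
                    key=lambda p: abs(p[1]), reverse=True)
    return label, [w for w, c in ranked[:k] if c != 0]
```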

4. Text Summarization

Given a large corpus, we need to condense it into a summary that covers the key points and conveys the meaning of the original document as faithfully as possible. I have created an extractive summarization tool, which selects the most important sentences in the given text and reuses the original words, phrases, and sentences verbatim. There are different methods of estimating the most important sentences in a large text. The number of sentences to keep is calculated using a compression ratio:

Sentence Count = η * Total Sentence Count

where η is the compression ratio, between 0 and 1. I have used different variants of a text-ranking algorithm that builds a graph over the text: each sentence is a vertex, and vertices are linked to one another. Each vertex casts votes for the vertices it is connected to, and the importance of a vertex is determined by the number of votes it receives. [6] The algorithms I used can be found in the Text_Summarization.ipynb notebook. The summarization code in Streamlit can be found here.
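A minimal TextRank-style sketch of this idea, assuming NLTK for sentence splitting, tf-idf cosine similarity for the edge weights, and networkx’s PageRank for the voting:

```python
import networkx as nx
import numpy as np
from nltk.tokenize import sent_tokenize   # needs nltk's "punkt" data downloaded
from sklearn.feature_extraction.text import TfidfVectorizer

def summarize(text, eta=0.3):
    # Keep roughly eta * N sentences, chosen by a PageRank-style vote.
    sents = sent_tokenize(text)
    n_keep = max(1, int(eta * len(sents)))
    tfidf = TfidfVectorizer().fit_transform(sents)
    sim = (tfidf @ tfidf.T).toarray()   # cosine similarity: tf-idf rows are L2-normalized
    np.fill_diagonal(sim, 0.0)          # a sentence cannot vote for itself
    scores = nx.pagerank(nx.from_numpy_array(sim))
    top = sorted(sorted(scores, key=scores.get, reverse=True)[:n_keep])
    return " ".join(sents[i] for i in top)  # restore original order
```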

5. POS Tagging

In traditional grammar, a part of speech (POS) is a category of words that share similar grammatical properties and syntactic behaviour. Parts of speech are useful because they reveal a lot about a word and its neighbors: knowing whether a word is a noun or a verb tells us about likely neighboring words. The activity of assigning POS tags to the words in a corpus is known as POS tagging. I do the tagging based on the Penn Treebank tagset [4], a 45-tag set that has been used to label many corpora.
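To give a flavour of Penn Treebank-style tags, here is how NLTK’s default tagger (one library that emits this tagset) can be used; resource names can vary slightly across NLTK versions:

```python
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("Streamlit makes building data apps easy.")
print(nltk.pos_tag(tokens))
# e.g. [('Streamlit', 'NNP'), ('makes', 'VBZ'), ('building', 'VBG'),
#       ('data', 'NNS'), ('apps', 'NNS'), ('easy', 'JJ'), ('.', '.')]
```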

The implementation can be found in the text_analysis.ipynb notebook. The result in the application can be seen below.

6. Named Entity Recognition

Named entity recognition (NER) is a very important aspect of information extraction, and it feeds into knowledge graphs, chatbots and many other applications. The task involves classifying text into pre-defined categories such as the names of persons, organizations and locations, expressions of time, quantities, monetary values, etc.

The library I use for this is spaCy, which performs named entity annotation. I have also added a selection box from which we can choose which named entities to display. spaCy’s named entity recognition model was trained on the OntoNotes 5 corpus [5]. Once the user inputs a sentence, it is tokenized, tagged and then passed to the NER function, which handles the selection and visualization of the different entities. The result can be seen below.
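A small sketch of the spaCy flow, with an illustrative entity filter standing in for the app’s selection box:

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")  # small English model trained on OntoNotes 5
doc = nlp("Apple opened a new office in London in September 2021.")

# Mimic the app's selection box by filtering to chosen entity types.
selected = {"ORG", "GPE", "DATE"}
print([(ent.text, ent.label_) for ent in doc.ents if ent.label_ in selected])
# [('Apple', 'ORG'), ('London', 'GPE'), ('September 2021', 'DATE')]

# displacy renders the highlighted entities as HTML for display in Streamlit.
html = displacy.render(doc, style="ent")
```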

The code for both POS tagging and NER can be found here. The aim of this article wasn’t to hand you all the code, because this is a large application with many facets; you are free to explore the repository and the power of Streamlit on your own. Rather, the article is intended to introduce you to the concepts of text analysis and to deploying your ML tools with Streamlit. Hope you liked it!

References

[1] Brimacombe, J. M. (2019, December 13). What is text mining, text analytics and natural language processing? Linguamatics.

[2] Metsis, V., Androutsopoulos, I. and Paliouras, G. (2006) Spam Filtering with Naive Bayes — which Naive Bayes? Third Conference on Email and Anti-Spam (CEAS), Mountain View, July 27–28 2006, 28–69.

[3] Jurafsky, D. (2000). Speech & language processing. Pearson Education India.

[4] Marcus, M. P., Marcinkiewicz, M. A. and Santorini, B. (1993). Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2).

[5] Weischedel, Ralph, et al. OntoNotes Release 5.0 LDC2013T19. Web Download. Philadelphia: Linguistic Data Consortium, 2013.

[6] Mihalcea, Rada & Tarau, Paul. (2004). TextRank: Bringing Order into Text.
