Text Mining: A Basic Beginner’s Guide

Teena Mary
Budding Data Scientist
7 min read · Feb 4, 2020

What is Text Mining?

One of the domains that has created a lot of buzz in today’s technological field is Text Mining. It is also called Text Data Mining and is closely related to Information Extraction and Knowledge Discovery in Databases (KDD). For a newbie, trying to understand this vast domain might seem daunting, so let us look into it from scratch.

According to Wikipedia, ‘Text Mining is the discovery, by computer, of new, previously unknown information, by automatically extracting information from different written resources’. This mainly involves finding novel insights, trends or patterns in text-based data. Such insights can be highly valuable in fields like business. The main sources of data for text mining include customer and technical support records, emails and memos, advertising and marketing material, human resources documents and information about competitors.

Index

  1. Process of Text Mining
  2. Relevance and Applications of Text Mining
  3. A Few Software Tools Used in Text Mining

The Process of Text Mining

The process of text mining mainly involves five steps:

i) Text Pre-processing: The raw text data obtained will be unstructured in nature, so it first needs to be cleaned. This pre-processing involves a few steps.

  • Text Normalization: This process converts the data into a standard format. The whole text is converted to upper or lower case, and numbers, punctuation, accent marks, extra white spaces, stop words and other diacritics are removed. Python can be used to implement this.
  • Tokenization: In this process, the whole text is split into smaller parts called tokens. The numbers, punctuation marks, words, etc. can be considered as tokens. Natural Language Toolkit (NLTK), Spacy and Gensim are a few tools that can be used for tokenization.
  • Stemming: It is the process of reducing words to their stem, base or root form. The two main algorithms used for this are the Porter and Lancaster stemming algorithms. NLTK as well as Snowball can be used for this.
  • Lemmatization: The aim of lemmatization, like stemming, is to reduce inflectional forms to a common base form. But, as compared to stemming, lemmatization does not simply remove the inflections. Instead, it uses information from different computational repositories to get the correct base forms of words.
  • Part-of-speech Tagging: It aims to assign a part of speech to each word of a given text based on its meaning and context. NLTK, spaCy and Pattern are a few tools that can be used for this.
  • Chunking: It is a natural language process that identifies constituent parts of sentences and links them to higher order units that have discrete grammatical meanings. NLTK is a good tool for this.
  • Named Entity Recognition (NER): It aims to find named entities in text and classify them into pre-defined categories. NLTK, spaCy can be used for this.
  • Relationship Extraction: This helps in identifying relations among named entities like people, organizations, etc. It allows us to extract structured information from unstructured sources such as raw text.

For more details on text pre-processing, see reference [3].
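To make the first two pre-processing steps concrete, here is a minimal sketch of normalization and tokenization using only the Python standard library. The stop-word list is a small illustrative one of my own; real pipelines would use NLTK's or spaCy's fuller resources.

```python
import re

# Small illustrative stop-word list; real pipelines use fuller lists
# (e.g. NLTK's stopwords corpus).
STOP_WORDS = {"the", "a", "an", "is", "are", "and", "or", "of", "to"}

def normalize(text):
    """Lower-case the text and strip numbers and punctuation."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)      # drop digits/punctuation
    return re.sub(r"\s+", " ", text).strip()   # collapse white space

def tokenize(text):
    """Split normalized text into word tokens and drop stop words."""
    return [tok for tok in text.split() if tok not in STOP_WORDS]

raw = "The 2 quick foxes ARE jumping over the lazy dog!"
tokens = tokenize(normalize(raw))
print(tokens)  # ['quick', 'foxes', 'jumping', 'over', 'lazy', 'dog']
```

The later steps (stemming, tagging, NER) need linguistic resources, which is why libraries like NLTK and spaCy are the practical choice there.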

ii) Text Transformation: This step represents each document by the terms it contains and their frequencies of occurrence. There are mainly two approaches.

  • Bag of Words: A text is represented as a bag (multiset) of its words, disregarding grammar and even word order, but keeping multiplicity.
  • Vector Space: In this model, a document is converted into a vector of index terms derived from its words. Each dimension of the vector corresponds to a term that appears in the text, and its weight records the importance of that term to the text.
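Both representations can be sketched in a few lines of plain Python on a toy two-document corpus (weights here are raw term frequencies; TF-IDF is the common refinement):

```python
from collections import Counter

docs = ["the cat sat on the mat",
        "the dog sat on the log"]

# Bag of words: each document becomes a multiset of its tokens.
bags = [Counter(doc.split()) for doc in docs]
print(bags[0]["the"])  # 2 -- multiplicity is kept, word order is not

# Vector space: one dimension per vocabulary term, with the term
# frequency in the document as the weight.
vocab = sorted(set(w for doc in docs for w in doc.split()))
vectors = [[bag[term] for term in vocab] for bag in bags]
print(vocab)       # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
print(vectors[0])  # [1, 0, 0, 1, 1, 1, 2]
```

In practice, tools like scikit-learn's CountVectorizer and TfidfVectorizer do this at scale.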

iii) Feature Selection: Feature selection is also known as attribute selection or variable selection. It is the selection of the subset of available features that contributes the most information to the prediction target. Irrelevant features can increase the complexity and decrease the accuracy of the analysis. The Pearson correlation coefficient, chi-squared tests, recursive feature elimination, lasso regression and tree-based algorithms are a few methods that can be used for this, and Python can be used for all of the analysis. For more details on feature selection, see reference [4].
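As one concrete example of these methods, the chi-squared score of a term against a binary class label can be computed from a 2x2 contingency table. This is a minimal sketch on hypothetical toy data; terms that appear only in one class score highest and would be kept as features.

```python
# Hypothetical labeled documents: spam (1) vs. ham (0).
docs = [("win money now", 1), ("win a prize", 1),
        ("meeting at noon", 0), ("lunch at noon", 0)]

def chi2_score(term, docs):
    """Chi-squared statistic for a term vs. a binary class label,
    from the 2x2 table of (term present/absent, class 1/0)."""
    n = len(docs)
    a = sum(1 for d, y in docs if term in d.split() and y == 1)
    b = sum(1 for d, y in docs if term in d.split() and y == 0)
    c = sum(1 for d, y in docs if term not in d.split() and y == 1)
    d_ = sum(1 for d, y in docs if term not in d.split() and y == 0)
    num = n * (a * d_ - b * c) ** 2
    den = (a + b) * (c + d_) * (a + c) * (b + d_)
    return num / den if den else 0.0

# Discriminative terms ('win', 'noon') outscore neutral ones ('a').
for term in ("win", "noon", "a"):
    print(term, chi2_score(term, docs))
```

scikit-learn's chi2 and SelectKBest provide the production version of this idea.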

iv) Data Mining: Here we combine text mining with traditional data mining techniques. Once the data has been structured by the above processes, classic data mining techniques are applied to it to retrieve information. These techniques include classification, clustering, regression, outlier detection, sequential pattern mining, prediction and association rules. For detailed information on data mining, see reference [5].
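As one illustration of the classification technique, here is a minimal multinomial Naive Bayes classifier over bag-of-words counts, trained on hypothetical toy data. This is a sketch of the idea, not a production implementation.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labeled training documents.
train = [("free money offer", "spam"), ("free prize inside", "spam"),
         ("project meeting today", "ham"), ("meeting notes attached", "ham")]

# Count word occurrences per class (the bag-of-words structure).
word_counts = defaultdict(Counter)
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def predict(text):
    """Pick the class with the highest log-probability, using
    add-one (Laplace) smoothing for unseen words."""
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / sum(class_counts.values()))
        total = sum(word_counts[label].values())
        for w in text.split():
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(predict("free offer today"))  # spam
print(predict("meeting today"))     # ham
```

Libraries such as scikit-learn offer this and the other techniques (clustering, association rules, etc.) ready-made.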

v) Evaluation: After the data mining techniques are applied, we get an end result. That result is evaluated and checked for predictive accuracy.
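A simple evaluation might compare predicted labels against true labels, as sketched below on hypothetical predictions. Precision and recall complement plain accuracy, especially on imbalanced data.

```python
# Hypothetical true labels vs. predictions from the mining step.
y_true = ["spam", "ham", "spam", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "ham", "spam"]

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.80

# Precision and recall for the 'spam' class.
tp = sum(t == p == "spam" for t, p in zip(y_true, y_pred))
precision = tp / sum(p == "spam" for p in y_pred)
recall = tp / sum(t == "spam" for t in y_true)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```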

Relevance and Applications of Text Mining

Huge amounts of data are created every day through economic, academic and social activities. All this information can be utilized optimally with the correct combination of skill sets, and data and text mining and analytics can help with this. Text mining has been used extensively in today’s business and corporate domains. Some of its applications are given below.

1. Risk Management: The humongous amount of textual data that is available helps companies take a deeper look into their health and performance. Risk analysis is an important factor in the development of every company, and insufficient risk analysis can result in major failures. By analyzing the documents and profiles of various clients, text mining can enable a company to mitigate risk factors and help it decide which firms to invest in, which people to give loans to and much more.

2. Customer Care Services: Text mining and natural language processing have been used extensively to enhance the customer experience. Nowadays, chatbots that mimic human customer care officers are used on many websites to make the user experience more customized. Text mining is used to provide rapid, automated responses to customers, which has reduced their reliance on call center operators to solve problems.

3. Personalized Advertising: The field of digital advertising has been revolutionized by the development of text and web mining, and this is one of the latest applications of text mining. The text data related to everything a person types or searches online is shared with other companies, which in turn show ads that have a higher probability of being clicked and converted into a sale.

4. Spam Filtering: E-mail is one of the most widely used means of official communication. It has a really wide range of applications, but a darker side to it is the spam mail that infests users’ inboxes. Spam mails use up a lot of storage and can also be an entry point for viruses and scams. Various companies use intelligent text mining software as well as traditional keyword-matching techniques to identify and filter spam mails.
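The traditional keyword-matching approach mentioned above can be sketched in a few lines. The keyword list and threshold here are hypothetical; real filters combine such rules with statistical text mining models like the Naive Bayes classifier shown earlier.

```python
# Hypothetical spam keyword list for illustration.
SPAM_KEYWORDS = {"winner", "free", "jackpot", "urgent", "lottery"}

def is_spam(message, threshold=2):
    """Flag a message containing at least `threshold` spam keywords."""
    words = set(message.lower().split())
    return len(words & SPAM_KEYWORDS) >= threshold

print(is_spam("urgent you are a lottery winner"))  # True (3 keywords)
print(is_spam("free lunch at the cafeteria"))      # False (only 1)
```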

5. Social Media Analysis and Crime Prevention: Social media has been trending for a long time, and millions of users rely on it as a means of communication. The anonymous nature of the internet has made it easy for many criminals to plan their strategies online. Identifying potentially threatening messages among normal ones has been made possible by the use of advanced text mining software. Online text analysis is also a good way to discover what is ‘hot’ or trending at a particular time, which can be highly beneficial for various commercial companies.

A Few Software Tools Used in Text Mining

i) DiscoverText: DiscoverText combines flexible and adaptive software algorithms with human-based coding to provide a framework for conducting accurate and reliable large-scale analysis. The software has the capability to merge data from various sources, such as text files, emails, open-ended answers on surveys, and online sources including Facebook, Google+, blogs, Tumblr, Disqus and Twitter. This ability to pull text from diverse sources combines information and associated structured metadata from multiple and unique information channels.

ii) Google Cloud Natural Language API: Google Cloud Natural Language API reveals the structure and meaning of text by offering powerful machine learning models in an easy-to-use REST API. You can use it to extract information about people, places, events and much more mentioned in documents, news articles or blog posts. You can also use it to understand sentiment about your product on social media or to parse intent from customer conversations happening in a call center or a messaging app.

iii) Lexalytics Salience: Lexalytics is a leader in text analytics software solutions, providing entity extraction, sentiment analysis, document summarization and thematic extraction for today’s businesses. Lexalytics builds a multilingual text analytics engine, Salience. Salience is currently integrated into systems for market research, social media monitoring and sentiment analysis, survey analysis/voice of the customer, enterprise search and public policy.

References:

1. https://en.wikipedia.org/wiki/Text_mining

2. https://data-flair.training/blogs/text-mining/

3. https://medium.com/@datamonsters/text-preprocessing-in-python-steps-tools-and-examples-bf025f872908

4. https://towardsdatascience.com/the-5-feature-selection-algorithms-every-data-scientist-need-to-know-3a6b566efd2

5. https://www.guru99.com/data-mining-tutorial.html#11

6. https://www.promptcloud.com/blog/9-best-examples-of-text-mining-analysis/

7. https://www.predictiveanalyticstoday.com/top-software-for-text-analysis-text-mining-text-analytics/


I’m a post graduate student doing Data Science from Christ (Deemed to be University) in Bengaluru.