Classification Techniques in Text Mining

Isha Gupta
Aug 20, 2020

NMIMS’s Mukesh Patel School of Technology Management and Engineering.

A mind map of Text Mining and Analytics with various techniques and algorithms

Introduction

Text mining is the process of extracting knowledge from large collections of unstructured text data.

As technology advances, text mining has become a key tool for industry to discover new information and to help answer specific research questions.

This article gives a brief overview of classification techniques for mining text data and the algorithms they use.

Text Classification

Text classification is a technique in which documents are assigned to predefined classes. Text classification techniques fall broadly into two types: supervised document classification and unsupervised document classification.

In supervised classification, an external mechanism (such as human feedback) supplies the correct classes for the training documents, while in unsupervised classification the documents are grouped without any external reference.

There is also semi-supervised document classification, in which only some of the documents are labeled by the external mechanism; the classifier learns from these labeled documents together with the unlabeled ones.

Different Classifiers

Decision Trees

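A decision tree is a hierarchical tree structure used to classify text documents. A decision tree text classifier (Russell Greiner and Jonathan Schaffer, 2001) is a tree whose internal nodes test a feature of the document (such as the weight of a term), whose branches correspond to the outcomes of that test, and whose ending nodes, called leaf nodes, assign the class.
Decision trees use a 'divide and conquer' approach to classification: each test splits the documents into smaller groups until a group can be assigned to a class.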

Advantages:

  1. Decision trees are simple to understand and easy to interpret.
  2. New scenarios can easily be added and existing ones modified.

Disadvantages:

  1. As the number of levels in a tree increases, the complexity of the calculations also increases.
  2. Decision-tree learning algorithms rely on heuristics such as greedy search, where the decision at each node is made locally, so they cannot guarantee to return the globally optimal decision tree.
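To make the greedy, node-by-node choice concrete, here is a minimal sketch of a single splitting step. It assumes documents are represented as sets of words and that splits are scored by information gain; both are illustrative choices, and the function names and toy data are hypothetical rather than taken from the cited work.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def best_split_term(docs, labels):
    """One greedy step: choose the term whose presence/absence test
    yields the largest information gain on the current documents."""
    base = entropy(labels)
    vocabulary = set().union(*docs)
    best_term, best_gain = None, 0.0
    for term in sorted(vocabulary):
        with_term = [y for d, y in zip(docs, labels) if term in d]
        without_term = [y for d, y in zip(docs, labels) if term not in d]
        if not with_term or not without_term:
            continue  # the test does not split the documents at all
        remainder = (len(with_term) * entropy(with_term)
                     + len(without_term) * entropy(without_term)) / len(labels)
        gain = base - remainder
        if gain > best_gain:
            best_term, best_gain = term, gain
    return best_term, best_gain

# Toy data (hypothetical): each document is a set of words.
docs = [
    {"cheap", "pills", "offer"},
    {"offer", "win", "prize"},
    {"meeting", "agenda", "offer"},
    {"project", "meeting", "notes"},
]
labels = ["spam", "spam", "ham", "ham"]

term, gain = best_split_term(docs, labels)
print(term, gain)
```

A full tree learner would apply this step recursively to each resulting group. Because every step only looks at the current node, the finished tree is not guaranteed to be the best one overall.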

Naïve Bayes Classifier

Bayesian classifiers are statistical, probabilistic classifiers used for text categorization. They use Bayes' theorem to compute the posterior probability of each class given the document. The document is assigned to the class with the maximum posterior probability.
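The maximum-posterior rule can be sketched in a few lines. This is only an illustration, assuming a multinomial word-count model with add-one (Laplace) smoothing and log probabilities to avoid underflow; the function names and the toy training data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for words, label in zip(docs, labels):
        word_counts[label].update(words)
    vocabulary = {w for words in docs for w in words}
    return class_counts, word_counts, vocabulary

def classify(words, model):
    """Return the class with the highest (log) posterior score."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for w in words:
            if w in vocabulary:  # words never seen in training are skipped
                # add-one smoothing keeps unseen class/word pairs above zero
                score += math.log((word_counts[label][w] + 1)
                                  / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get), scores

# Toy training data (hypothetical).
train_docs = [
    ["cheap", "pills", "offer"],
    ["win", "prize", "offer"],
    ["meeting", "agenda"],
    ["project", "meeting", "notes"],
]
train_labels = ["spam", "spam", "ham", "ham"]

model = train_naive_bayes(train_docs, train_labels)
predicted, _ = classify(["cheap", "offer"], model)
print(predicted)
```

Training is a single counting pass over the documents, which is why this kind of classifier is quick to train and can be updated as new documents arrive.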

Advantages:

  1. It is fast to train and classify the data or documents.
  2. It is relatively robust to irrelevant features.
  3. Streaming data is handled well.

Disadvantages:

  1. It is an independent feature model: it assumes that the presence of one feature does not affect the other features given the class. This assumption rarely holds for text, where words are often correlated.

Nearest Neighbor Classifier

The similarity between a new document and every document in the training set is calculated. If the K most similar documents are considered, it is called a K-nearest-neighbor classifier.
It uses the local neighborhood to predict the class of an object. The majority vote of its neighbors decides the class of an object.
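The neighbor vote can be sketched as follows. The sketch assumes documents are compared as bag-of-words vectors using cosine similarity, which is one common choice among several; the function names and the toy training set are hypothetical.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def knn_classify(text, training, k=3):
    """Label a document by majority vote of its k most similar
    training documents."""
    query = Counter(text.lower().split())
    ranked = sorted(
        training,
        key=lambda item: cosine_similarity(query, Counter(item[0].lower().split())),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy training set (hypothetical): (document text, class label) pairs.
training = [
    ("cheap pills special offer", "spam"),
    ("win a prize offer today", "spam"),
    ("cheap prize offer", "spam"),
    ("project meeting agenda", "ham"),
    ("meeting notes for the project", "ham"),
]

print(knn_classify("cheap offer today", training, k=3))
```

Note that there is no training function at all: every call to `knn_classify` compares the query against the whole training set, which is where the classification-time cost comes from.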

Advantages:

  1. There is essentially no training cost, because the classifier simply stores the training documents.
  2. No assumptions need to be made about the characteristics of the concepts to be learned.

Disadvantages:

  1. The model cannot be interpreted.
  2. It is computationally expensive at classification time and takes longer to find the k nearest neighbors when the training set is large.

PostScript

With the drastic increase in digitization around the world, the volume of documents has exploded. Text classification is therefore needed to sort documents into predefined classes based on their content.

References

Nihar Ranjan, Abhishek Gupta, Ishwari Dhumale, Payal Gogawale and Rugved Gramopadhye. A Survey on Text Analytics and Classification Techniques for Text Documents.

Text Mining and Analytics using Natural Language Processing.

https://medium.com/@ishagu10/text-mining-and-analytics-using-natural-language-processing-f2df7233b5b
