Understanding Text Categorization, Part 3 (Natural Language Processing)

Monodeep Mukherjee
2 min read · Jan 3, 2023
1. Handling Imbalanced Dataset in Multi-label Text Categorization using Bagging and Adaptive Boosting (arXiv)

Authors: Genta Indra Winata, Masayu Leylia Khodra

Abstract: Imbalanced datasets arise from the uneven distribution of data in the real world, such as the disposition of complaints across government offices in Bandung. Consequently, multi-label text categorization algorithms may not produce the best performance, because classifiers tend to be weighed down by the majority of the data and ignore the minority. In this paper, Bagging and Adaptive Boosting algorithms are employed to handle this issue and improve the performance of text categorization. The results are evaluated with four metrics: Hamming loss, subset accuracy, example-based accuracy, and micro-averaged F-measure. Bagging ML-LP with an SMO weak classifier is the best performer in terms of subset accuracy and example-based accuracy, while Bagging ML-BR with an SMO weak classifier achieves the best micro-averaged F-measure. On the other hand, AdaBoost.MH with a J48 weak classifier has the lowest Hamming loss. Thus, both algorithms have high potential to boost the performance of text categorization, but only with certain weak classifiers. However, bagging shows more potential than adaptive boosting for increasing the accuracy of minority labels.
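To make the setup concrete, here is a minimal sketch of the bagging-plus-binary-relevance idea from the abstract, written with scikit-learn rather than the Weka-style tooling the paper's classifier names (SMO, J48, ML-BR) suggest; LinearSVC stands in for the SMO weak classifier. The toy complaint corpus, labels, and hyperparameters are invented for illustration, and subsampling is done without replacement so every label survives in each tiny subsample (classic bagging would use bootstrap resampling on a real dataset).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import hamming_loss, accuracy_score, jaccard_score, f1_score
import numpy as np

# Toy complaint corpus (invented); columns of Y = [infrastructure, sanitation, noise].
docs = [
    "broken street lights on the main road",
    "garbage not collected for two weeks",
    "flooded street and blocked drains after heavy rain",
    "loud construction noise late at night",
    "pothole on the highway damaged my tire",
    "overflowing trash bins in the city park",
    "neighbors playing loud music all night",
    "cracked pavement and loud roadwork drilling",
]
Y = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
])

X = TfidfVectorizer().fit_transform(docs)

# Binary relevance (BR): one bagged ensemble of linear SVMs per label.
# Sampling without replacement keeps both classes present in every tiny
# subsample; set bootstrap=True for classic bagging on a real dataset.
clf = OneVsRestClassifier(
    BaggingClassifier(LinearSVC(), n_estimators=10,
                      max_samples=0.75, bootstrap=False, random_state=0)
)
clf.fit(X, Y)
pred = clf.predict(X)  # evaluated on the training data just to keep the demo short

# The four metrics named in the abstract.
print("Hamming loss:          ", hamming_loss(Y, pred))
print("subset accuracy:       ", accuracy_score(Y, pred))
print("example-based accuracy:", jaccard_score(Y, pred, average="samples"))
print("micro-averaged F1:     ", f1_score(Y, pred, average="micro"))
```

Binary relevance trains one classifier per label, which is exactly where class imbalance bites: labels with few positive examples get drowned out, and resampling inside each bag is what gives minority labels a chance to be learned.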

2. On the Role of Text Preprocessing in Neural Network Architectures: An Evaluation Study on Text Categorization and Sentiment Analysis (arXiv)

Authors: Jose Camacho-Collados, Mohammad Taher Pilehvar

Abstract: Text preprocessing is often the first step in the pipeline of a Natural Language Processing (NLP) system, with potential impact on its final performance. Despite its importance, text preprocessing has not received much attention in the deep learning literature. In this paper we investigate the impact of simple text preprocessing decisions (particularly tokenizing, lemmatizing, lowercasing and multiword grouping) on the performance of a standard neural text classifier. We perform an extensive evaluation on standard benchmarks from text categorization and sentiment analysis. While our experiments show that a simple tokenization of input text is generally adequate, they also highlight significant degrees of variability across preprocessing techniques. This reveals the importance of paying attention to this usually overlooked step in the pipeline, particularly when comparing different models. Finally, our evaluation provides insights into the best preprocessing practices for training word embeddings.
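As a complement, here is a minimal sketch of three of the preprocessing variants the paper compares (plain tokenization, lowercasing, lemmatization), using NLTK rather than the authors' exact tooling; multiword grouping is omitted since it requires an external phrase lexicon, and the example sentence is invented.

```python
import nltk
from nltk.stem import WordNetLemmatizer

# One-time resource downloads; the tokenizer data is named "punkt" in older
# NLTK releases and "punkt_tab" in newer ones, so we request both.
for resource in ("punkt", "punkt_tab", "wordnet"):
    nltk.download(resource, quiet=True)

lemmatizer = WordNetLemmatizer()

def preprocess(text, lowercase=False, lemmatize=False):
    """Tokenize, then optionally lowercase and/or lemmatize each token.
    (WordNetLemmatizer defaults to noun POS, so verbs stay mostly intact.)"""
    tokens = nltk.word_tokenize(text)
    if lowercase:
        tokens = [t.lower() for t in tokens]
    if lemmatize:
        tokens = [lemmatizer.lemmatize(t) for t in tokens]
    return tokens

sentence = "The movies were surprisingly good, despite the mixed reviews."
for lc in (False, True):
    for lem in (False, True):
        print(f"lowercase={lc}, lemmatize={lem} ->",
              preprocess(sentence, lowercase=lc, lemmatize=lem))
```

Each combination yields a different token stream, hence a different vocabulary and different word-embedding lookups downstream, which is why the paper argues these seemingly trivial choices must be reported when comparing models.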

