Text Classification in NLP using Cross Validation and BERT

Nandan Grover
14 min read · Feb 15, 2023

Introduction

Text classification is a common task in natural language processing (NLP). Different classifiers perform better or worse depending on the data they are given (e.g. Uysal and Gunal, 2014). In some datasets, however, we would expect a correlation between the (vectorised) texts and their classes, yet that expectation does not hold and the classifiers perform poorly. The main reason is that there are many classes and a wide variety of texts. We’ll look at a variety of preprocessing strategies as well as techniques for improving the classifier’s performance.

To gain useful insights from our data, we built four classifiers:

  1. Binary classifier for predicting whether a sentence was uttered by the therapist or the client
  2. Binary classifier for labelling a conversation as high quality or low quality
  3. Multi-class classifier for assigning text to the appropriate behaviour type
  4. BERT transformer for assigning text to the appropriate behaviour type
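
To make the cross-validation idea from the title concrete, here is a minimal sketch of stratified k-fold evaluation for a binary therapist/client classifier, using only the Python standard library. It is an illustration and not the code from the notebooks: the toy utterances and labels are made up, the helper names (`stratified_folds`, `train_naive_bayes`, `predict`, `cross_validate`) are hypothetical, and a simple naive Bayes model stands in for the ensemble and BERT models described in this article.

```python
import math
from collections import Counter, defaultdict


def stratified_folds(labels, k):
    """Split example indices into k folds with similar class ratios."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    # Deal each class's examples round-robin across the folds.
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def train_naive_bayes(texts, labels):
    """Count words per class (a stand-in model, not the article's)."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for text, label in zip(texts, labels):
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab


def predict(model, text):
    """Return the class with the highest log-probability (add-one smoothing)."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(n / total)
        for word in text.lower().split():
            if word in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def cross_validate(texts, labels, k=3):
    """Train on k-1 folds, test on the held-out fold, return per-fold accuracy."""
    scores = []
    for held_out in stratified_folds(labels, k):
        held = set(held_out)
        train_idx = [i for i in range(len(texts)) if i not in held]
        model = train_naive_bayes(
            [texts[i] for i in train_idx], [labels[i] for i in train_idx]
        )
        correct = sum(predict(model, texts[i]) == labels[i] for i in held_out)
        scores.append(correct / len(held_out))
    return scores


# Made-up toy data, for illustration only.
texts = [
    "how does that make you feel",
    "i feel tired all the time",
    "what would you like to change",
    "i want to stop drinking",
    "tell me more about that",
    "i do not know where to start",
]
labels = ["therapist", "client", "therapist", "client", "therapist", "client"]

scores = cross_validate(texts, labels, k=3)
print("accuracy per fold:", scores)
print("mean accuracy:", sum(scores) / len(scores))
```

With a dataset this small the per-fold accuracies are not meaningful; the point is only the shape of the loop, in which every example is held out exactly once and each fold keeps a similar class balance.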

The folder structure for the architecture is shown in Figure 1. The “binary_classifier_interlocutor.ipynb” file stores our binary classifier, which uses ensemble learning to classify whether a text was uttered by the therapist or the client, while “binary_classifier_quality.ipynb” determines whether the overall conversation between a therapist and client is of high or low quality. The…
