How to Apply a Data-Centric AI Mindset to Text Classification Problems

Mastering customer complaint classification with various data-centric Python packages and AI tricks

John Leung
9 min read · Jun 22, 2023

If you were opening a steak house, you would want to serve the customers the most flavorful steak for dinner. Naturally, you would be concerned about the cooking equipment — whether the frying pan has a thick base and non-stick coating. However, you probably wouldn’t put all your effort and time into choosing the perfect pan.

Photo by Nanxi wei on Unsplash

Similarly, in machine learning, finding the best model isn’t the sole target when providing business solutions. You would:

  • Care about which cut of steak to pick — ribeye or T-bone? (the appropriate dataset)
  • Consider how to season the steak — sea salt or ground pepper? (the data engineering process)
  • Spend lots of time learning to cook better and innovating with recipes (error analysis)
  • Design how to cook steaks with consistent quality at an industrial scale (MLOps)

Machine Learning from Hype to Reality

Building state-of-the-art models is fantastic as it leverages the knowledge from the algorithms research side. However, creating business value from machine learning on the applied side is an entirely different discipline.

Many of the challenges of applying AI in the real world stem from the imperfections in the data.

The cruel reality may include:

  • scarce data, with fewer than 5000 examples
  • data bias, such as unclear data lineage or samples that do not accurately represent the entire population
  • noisy data, such as class imbalance, outliers, skewed distribution, and missing/ duplicated data
Photo by Oskar Kadaksoo on Unsplash

Data isn’t magic, and there is no guarantee that we’ll obtain anything useful from it. If we ignore the limits of the insights our existing data can provide and jump straight into model development, we risk “garbage in, garbage out”: poor results even with a fancy model. To ensure the model sees the most relevant, diversified, and representative examples, adopting a data-centric AI approach is a noteworthy choice. Although collecting more data often helps, it is usually expensive and cumbersome, so we should focus on enhancing the quality of the data we have.

Improve Data Iteratively with a Fixed Model

When it comes to machine learning problems, I typically follow these steps for applying data-centric AI techniques:

1. Profile the data to identify any imperfections

2. Clean out any dirty data

3. Train using an easy-to-implement model

4. Conduct error analysis to focus on a subset of data/ edge cases

5. (Iteratively) Improve the data through augmentation, annotation, and further pre-processing, then retrain with the same model

The data-centric approach goes beyond mere data preprocessing. It requires iteratively analyzing the prediction/classification gaps and improving the dataset through trial and error, with a growth-oriented and creative mindset.

Photo by Alex Kondratiev on Unsplash

The winners of the data-centric AI competition held by Andrew Ng, chairman and co-founder of Coursera, have widely shared their strategies and techniques for improving the handwriting dataset (here and here). A handful of experts are also shifting their focus toward publishing innovative ideas and developing tools for data-centric AI.

To extend this topic from a different angle, I would like to demonstrate how to systematically adopt the data-centric approach on a textual dataset by leveraging Python packages.

Text Classification Problem Overview

Renowned organizations in the financial sector (e.g. Equifax) constantly receive large volumes of consumer complaints. These span various products and services, including credit reports, student loans, money transfers, etc. To streamline the complaint-handling process, we can apply machine learning to classify each complaint narrative into one of the product/service categories.

The dataset that we will be working on is available on Kaggle.

Data Profiling & Data Cleansing

We can initially perform a low-code Exploratory Data Analysis (EDA) by using the ydata-profiling package:

It produces an HTML/JSON report that includes an overview section (e.g. data distributions and visualizations) and an alerts section (e.g. potential data quality issues).

1. Overview section

  • Data distribution (after skipping the missing data)
Complaints distribution (Image by author)

This dataset is quite imbalanced: over half of the customer complaints belong to the top 3 of the 18 product categories. We cannot be sure whether the skewed class proportions accurately reflect the complaint distribution in real life. Additionally, classification performance on the minority classes may be questionable.

  • Word cloud
A word cloud (Image by author)

It is not surprising that common stop words like “the” and “and” appear most frequently in the source complaints. However, an interesting observation is that the token “XXXX” has the highest frequency. It is a placeholder used to mask sensitive information, and it could negatively impact model performance too.

  • Complaint example with the highest occurrence

Complaint narrative: “There are many mistakes appear in my report without my understanding.”

Product categories it belongs to: “Credit reporting, credit repair services, or other personal consumer reports” and “Debt collection”

2. Alert section

  • Missing data: Missing values account for approximately 70% of the dataset, leaving only around 380k records that can be used for classification.
  • Duplicated data: Around 95% of records are unique, while the remaining 5% are duplicates.

Model Preprocessing & Training

I used Term Frequency–Inverse Document Frequency (TF-IDF), a standard pre-processing technique in natural language processing (NLP). It assigns each word a score that is high when the word appears often in a given complaint narrative but rarely across other documents. For example, “loan” and “home” could be important words in narratives for the product category “mortgage”.

After preprocessing the text, I trained a Multinomial Naive Bayes (MNB) classifier. MNB is a probabilistic learning method based on Bayes’ theorem, the conditional probability formula you might have learned in secondary school. With this algorithm, we can calculate the probability of each product category given a complaint narrative and output the category with the highest probability.

You may wonder whether we should pick word embeddings or TF-IDF for preprocessing, and whether to choose a deep learning model (such as a transformer) or MNB. It is completely acceptable to use any processing and training strategy as a starting point. Here we only aim to set a baseline, so these were chosen for two simple reasons: they are easy to implement and have a low computation cost.
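The TF-IDF plus MNB baseline can be sketched as a scikit-learn pipeline. The texts and labels below are toy stand-ins, not rows from the actual dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for complaint narratives and their product labels
texts = [
    "mistakes appear on my credit report",
    "my mortgage payment was misapplied to the wrong loan",
    "a debt collector keeps calling about a paid debt",
    "errors on my credit report were never fixed",
]
labels = ["Credit reporting", "Mortgage", "Debt collection", "Credit reporting"]

# TF-IDF turns each narrative into a sparse vector of word weights;
# MNB then learns per-category word likelihoods from those vectors
baseline = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
baseline.fit(texts, labels)

print(baseline.predict(["wrong information on my credit report"]))
```

The pipeline object keeps the vectorizer and classifier fitted together, so the same transformation is applied at prediction time.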

Error Analysis

We analyzed the baseline performance using the widely recognized Python package scikit-learn:
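A minimal sketch of producing the classification report, using toy ground-truth and predicted labels in place of the real test split:

```python
from sklearn.metrics import classification_report

# Toy stand-ins for the test split's true and predicted categories
y_true = ["Mortgage", "Debt collection", "Mortgage", "Credit reporting"]
y_pred = ["Mortgage", "Mortgage", "Mortgage", "Credit reporting"]

# Per-label and aggregate precision, recall, and F1-score;
# zero_division=0 avoids warnings for labels never predicted
print(classification_report(y_true, y_pred, zero_division=0))
```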

The classification report summarizes the model’s performance with precision, recall, and F1-score, both for individual labels and at the global level.

Classification report (Image by author)

Here are some observations regarding test performance:

  • “Credit reporting” is the fourth largest product category, with approximately 7,900 complaints labeled as such. However, its performance is extremely poor, with only a 0.01 F1-score.
  • The recall score for the top three product categories is better than their precision scores, indicating that the model has a tendency to classify complaints into those three categories even if they actually belong to other categories.
  • Some categories have similar or ambiguous names, resulting in poor classification performance. Examples include “Credit card”, “Prepaid card”, and “Credit card or prepaid card”.
  • No customer complaints are classified correctly into certain product categories, including “Money transfer” and “Payday loan”.

Now, let’s explore how we can approach the problem in a data-centric manner.

Data-centric AI techniques

I iteratively applied several approaches to improve the quality of the customer complaint dataset.

Data annotation

Inconsistent labeling in visual applications, such as varying bounding box sizes or counts, can impact an algorithm’s performance. When it comes to text classification, we likewise need to carefully review the text labels before proceeding. By combining and renaming ambiguous categories, I reduced the number of categories from 18 to 12.
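Merging ambiguous categories can be done with a simple mapping over the label column. The groupings below are a hypothetical illustration, not the article’s exact 18-to-12 mapping:

```python
import pandas as pd

# Hypothetical mapping that folds ambiguous categories into one label
merge_map = {
    "Credit card": "Credit card or prepaid card",
    "Prepaid card": "Credit card or prepaid card",
}

df = pd.DataFrame({"Product": ["Credit card", "Prepaid card", "Mortgage"]})

# Replace old category names with their merged equivalents
df["Product"] = df["Product"].replace(merge_map)
print(df["Product"].nunique())  # → 2
```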

Data annotation — Before and after (Image by author)

Text data processing

A wide range of standard NLP pre-processing techniques are used, including removing punctuation and stop words, lowercasing, tokenization, stemming, and lemmatization. One effective technique for improving the dataset is Part-of-Speech (POS) tagging using the NLTK library.

This step processes the input text and attaches additional information about each sentence’s grammatical structure.

Since some complaints are lengthy, I transformed and shortened the sentences by keeping only the words tagged as nouns. This change significantly improved the accuracy.

Remove “bad” dataset examples

After performing data annotation and text data processing, it was apparent that noise was still present in the data. Screening the complaint data manually would be both ineffective and subjective. Instead, to clean up messy real-world data examples, we can utilize Cleanlab, an open-source AI tool built around the confident learning algorithm. We feed it the feature embeddings from a sentence transformer model, together with the predicted probabilities obtained from our MNB classifier:

The library provides example-level scores and creates a shortlist of the most likely issues in the dataset, based on factors such as label quality, out-of-scope data, and near-duplicates.

Examples of complaints with low label quality (Image by author)

I tested and utilized this package to identify data with lower label quality. With over 280k training examples available, it is crucial that the examples used to train the model are both relevant and informative. I therefore removed 5k bad examples, which accounted for less than 2% of the whole training dataset. Because of the imbalanced distribution, only examples from the five product categories with the most complaints were filtered out.

Text augmentation

There still exist some edge cases that could not be classified well due to their small proportions. Therefore, we need to augment these edge case samples.

Thanks to the lightweight Python package nlpaug, we can easily oversample the complaints of minority products at the character level, word level, and sentence level.

After several rounds of trial and error, I found that the dataset improves significantly when applying the synonym augmenter, which leverages semantic meaning to substitute words via word embeddings. Two important things to keep in mind: (1) try not to overfit to the augmented data, and (2) never allow augmented data to appear in the validation and test sets.

There are many other possibilities and directions for augmenting text. For example, drawing on recent research publications, we can make use of ChatGPT with few-shot learning to synthesize data.

Re-train with the same model

Classification report (Image by author)

After applying the data-centric AI approaches, the overall F1-score increased from 0.61 to 0.80, which marks a significant improvement across product categories:

(1) For “Credit reporting, repair, or other” (with the highest frequency), the F1-score increased from 0.67 to 0.84.

(2) For “Other financial service” (with the lowest frequency), the F1-score improved from 0 to 0.47.

Key Takeaways

To apply machine learning solutions to real-world problems, we can use various data-centric AI techniques such as pre-processing, data annotation, data augmentation, and removing bad examples. These strategies are generally applicable across various data domains and dataset types. However, it is important to remember that each machine-learning problem is unique, so we need to generate analytical ideas using business acumen and continually inject creativity to improve our solutions.

Before you go

If you enjoyed this read, I welcome you to follow my Medium page to stay in the loop on exciting content around data science, project management, and self-improvement.



John Leung

An avid learner who delves into the DS/DE world and believes in the power of marginal adjustment | linkedin.com/in/john-leung-639800115