Comprehensive Guide: Creating an ML-Based Text Classification Model


In the previous blog post, we defined the problem of customer support ticket classification that Atlantbh had the opportunity to solve. After presenting the business goal, we briefly described the proposed approach and the results obtained. This blog post serves as a follow-up: we delve into the individual steps of the proposed approach, highlighting best practices and providing insights gained from our experience.

Note that although we will remain in the context of customer support ticket classification, most of the steps described below apply to almost any text classification task, regardless of the specific problem domain.

Figure 4 in the previous blog post shows an overview of the text classification flow using the ML approach. That figure outlines six steps, from problem definition and business goal to classifier deployment. The figure below, on the other hand, shows a more granular flow of text classification by splitting the feature engineering and model construction phases into multiple individual steps.

Figure 1. Text Classification Flow (Anja Plakalovic)

Exploratory Data Analysis

Exploratory Data Analysis (EDA) holds significant practical value for any ML task, providing insight into the underlying structure and characteristics of the data. EDA enables identification of the class distribution and detection of a potentially imbalanced dataset. However, an imbalanced dataset is not necessarily a “red flag”. In practice, imbalanced datasets are quite common, and ours is no exception. For example, customers are more likely to report shipping issues than account settings issues. There are many ways to “combat” an imbalanced dataset, such as oversampling and undersampling, which we will not explain in detail here. Nevertheless, it is important to point out that in the case of an imbalanced dataset, selecting appropriate metrics for model performance evaluation is of great importance. This will be discussed further as we go through the model construction process.

EDA can also provide valuable information and basic statistics about the text corpus (i.e., the unstructured text dataset), such as discovering the various data structures within it. It is important to note that EDA is not a one-time process performed only during the project’s initial phase, as one might assume from the simplified version of the text classification flowchart (Figure 1). On the contrary, there is a feedback loop between the EDA and feature engineering steps (Figure 2).

Figure 2. Feedback Loop Between EDA and Feature Engineering Steps (Anja Plakalovic)

Insights gained from the initial EDA directly influence the selection of steps we perform in the feature engineering phase. Suppose we discover through EDA that some records are irrelevant; these records must then be removed in the data cleaning phase. Conversely, the feature engineering steps also affect EDA. As the data changes throughout the feature engineering phase, it prompts a re-examination of the EDA to assess the impact of the feature engineering on the dataset and yield new insights. For example, stop word removal or lemmatization changes the previously established characteristics of the text corpus, which then need to be re-established. This feedback loop between EDA and feature engineering allows for the development of a robust text classification model that more effectively captures the underlying patterns in text data.

Feature Engineering Phase

Several fundamental steps are often performed during the feature engineering phase of a text classification task to ensure the quality and relevance of features extracted from raw text data. However, steps such as language translation or certain data cleaning and text preprocessing techniques may vary depending on the dataset characteristics and task-specific requirements. For example, language translation is crucial when dealing with multilingual datasets but is irrelevant for monolingual ones. Similarly, if a dataset contains different data formats, additional data cleaning and text preprocessing steps may be required to resolve this variability. By recognizing the balance between commonly applicable procedures and task-specific requirements, we can tailor our feature engineering strategies to effectively address the unique challenges posed by each NLP task and dataset.

Data Cleaning

After performing EDA on the initially obtained data, we quickly realized that our dataset contained records in various formats. In addition to plain text, which accounts for most of the dataset, we concluded that around 10% of the records were inquiries in HTML format. This is common in real-world settings: when collecting data from websites or emails, the text usually contains HTML tags, which are irrelevant for analysis and may even interfere with further preprocessing steps. Removing these HTML tags and converting such text into a plain, readable format was one of the first steps in data cleaning. The Beautiful Soup Python library proved to be a powerful tool for efficiently identifying HTML-like messages, while the html2text library demonstrated its effectiveness in converting these messages to plain text.
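As an illustration, a minimal sketch of this detection-and-conversion step might look as follows. The helper names are ours, and the detection heuristic (treat a record as HTML if Beautiful Soup finds any tag in it) is a simplification of what a real pipeline would need:

```python
from bs4 import BeautifulSoup
import html2text


def looks_like_html(text: str) -> bool:
    # Heuristic: if the parser finds at least one tag, treat the record as HTML.
    return BeautifulSoup(text, "html.parser").find() is not None


converter = html2text.HTML2Text()
converter.ignore_links = True   # drop hyperlink targets, keep the visible text
converter.ignore_images = True  # image references carry no signal for classification


def to_plain_text(text: str) -> str:
    # Convert HTML-like records to plain text; pass plain records through unchanged.
    return converter.handle(text) if looks_like_html(text) else text
```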

The second part of data cleaning involved identifying and removing duplicates and irrelevant or noisy records from the dataset to ensure data quality. In our scenario, the dataset included different types of non-standard records, such as test messages or other forms of inquiries that do not contribute to the classification objective, so it was necessary to develop specific conditions for filtering out such records. Manual inspection also enabled us to verify and flag potentially irrelevant records. Once identified, such records were excluded from further analysis to prevent them from influencing the model training process and compromising classification performance.
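With the dataset loaded into a pandas DataFrame, this kind of filtering can be expressed as a few rules. The column name, noise pattern, and length threshold below are hypothetical stand-ins; the actual conditions came from manual inspection of our data:

```python
import pandas as pd

df = pd.read_csv("tickets.csv")  # hypothetical file with a "text" column

# Drop exact duplicates of the inquiry text.
df = df.drop_duplicates(subset="text")

# Hypothetical filter for obvious noise, e.g. internal test messages.
noise_pattern = r"^(test|testing)\b"
df = df[~df["text"].str.contains(noise_pattern, case=False, na=False, regex=True)]

# Drop records too short to carry a meaningful inquiry (threshold is illustrative).
df = df[df["text"].str.len() >= 10]
```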

Data cleaning usually reduces the dimensionality of the initial dataset by removing irrelevant features and records. In general, by filtering unnecessary columns (i.e., features) and discarding rows (i.e., records), the dataset becomes more streamlined. However, one should be very careful not to overdo the cleaning process, as excessive data removal can result in the loss of potentially valuable information, which can also negatively affect the performance of the classification model. Therefore, finding the right balance between the need for data cleaning and the preservation of relevant data is essential.

Language Detection & Translation

In the customer support domain, where inquiries may be in different languages, translation is necessary when creating a classification model. Instead of creating different models for each language, using language detection and translation tools is a more cost-effective and pragmatic approach. This strategy consolidates inquiries in multiple languages into a single dataset, facilitating comprehensive analysis. Since we worked with the data of an international company, it is not surprising that our dataset is multilingual. More precisely, our dataset contains customer inquiries in over 20 languages, where about 13% of records are in a language other than English.

When using APIs for language detection and translation, it is critical to consider data confidentiality and security. Ensuring that sensitive information remains protected during translation is paramount and requires careful selection of APIs and robust security protocols to protect data integrity and privacy.
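As one possible toolchain (not necessarily the one we used in production), the langdetect and deep-translator Python libraries can implement this detect-then-translate flow. Note that the translate call below sends text to an external service, which is exactly where the confidentiality concerns above apply:

```python
from langdetect import detect
from deep_translator import GoogleTranslator


def translate_to_english(text: str) -> str:
    try:
        lang = detect(text)  # ISO 639-1 code, e.g. "de", "fr"
    except Exception:        # langdetect raises on empty or undetectable input
        return text
    if lang == "en":
        return text
    # Sends data to an external API; verify this is acceptable for
    # your data confidentiality requirements before using it.
    return GoogleTranslator(source=lang, target="en").translate(text)
```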

Text Preprocessing

As already mentioned, text preprocessing steps depend heavily on the specifics of the task and the dataset we are working with. In-depth EDA typically makes the necessary text preprocessing steps intuitively clear. We used the power of the Natural Language Toolkit (NLTK) Python library, which is popular and widely used for various NLP tasks, to perform around 15 individual steps as part of the text preprocessing phase. These steps range from converting all text to lowercase to removing emojis, URLs, special characters, irrelevant phrases, stop words, and more (Figure 3).

Figure 3. Some of the Basic Text Preprocessing Steps (Anja Plakalovic)
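To make a few of these steps concrete, here is a condensed sketch using NLTK and regular expressions. It is illustrative rather than our exact 15-step pipeline, and its rules (e.g., treating all non-ASCII characters as emoji noise) are simplifications:

```python
import re
import string

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))


def preprocess(text: str) -> str:
    text = text.lower()                                         # lowercase
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)          # remove URLs
    text = re.sub(r"[^\x00-\x7F]+", " ", text)                  # remove emojis / non-ASCII
    text = text.translate(
        str.maketrans("", "", string.punctuation)               # remove special characters
    )
    tokens = [w for w in text.split() if w not in STOP_WORDS]   # remove stop words
    return " ".join(tokens)
```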

Here, we will focus on two crucial steps in almost every text classification task: lemmatization and tokenization.

Lemmatization

Lemmatization is a text preprocessing technique that reduces words to their base or root forms (i.e., lemmas). Applying this step normalizes the text and reduces vocabulary dimensionality. A similar text preprocessing technique is stemming. Stemming removes suffixes to find the root form of a word, which does not necessarily result in valid words. Lemmatization, on the other hand, considers word context and part of speech, resulting in more precise transformations that yield valid words (Figure 4). However, although lemmatization is a more sophisticated approach than stemming, the downside is that it is more computationally intensive and much slower, which should be kept in mind when working with high-dimensional text datasets.

Figure 4. Lemmatization vs. Stemming (Anja Plakalovic)
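The difference is easy to see with NLTK’s PorterStemmer and WordNetLemmatizer; this is a small illustrative example, not our full pipeline:

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("studies"))                   # "studi" -- not a valid word
print(lemmatizer.lemmatize("studies"))           # "study" -- a valid lemma
print(stemmer.stem("running"))                   # "run"
print(lemmatizer.lemmatize("running", pos="v"))  # "run"   -- part of speech matters
print(lemmatizer.lemmatize("better", pos="a"))   # "good"  -- context-aware mapping
```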

Tokenization

The term “tokenization”, without any additional context, refers to word tokenization. It is the most commonly used form of tokenization and involves splitting text into individual words, or tokens. However, there are also other types of tokenization; for example, NLTK also implements a sentence tokenization method. By breaking text down into smaller chunks, tokenization enables more granular analysis and further processing of text data. Tokenization is a fundamental technique in every text classification task, as it serves as the preliminary step in converting raw text data into a format suitable for analysis by an ML algorithm.
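A short example of both forms with NLTK (the sample ticket text is made up):

```python
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt", quiet=True)

text = "My package never arrived. Can you check the shipping status?"

print(sent_tokenize(text))
# ['My package never arrived.', 'Can you check the shipping status?']

print(word_tokenize(text))
# ['My', 'package', 'never', 'arrived', '.', 'Can', 'you', 'check', ...]
```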

Train-Test Split

After tokenization, it is time to split the dataset into training and test subsets. It is always better to perform this step before vectorization, especially when training our own vectorization model rather than using a pre-trained one. We leave the test dataset aside and use it only at the very end to evaluate the performance of the classification model. This way, we ensure that the training and test datasets remain independent, thus preventing any information leakage from the test dataset into the training dataset.

In general, the train-test split ratio hovers around 80/20. This ratio has its roots in the well-known Pareto principle (also known as the 80/20 rule), which states that approximately 80% of consequences come from 20% of causes (the “vital few”). We decided to split our dataset using a 75/25 ratio. In the case of an imbalanced dataset like ours, it is necessary to preserve the class distribution in both the training and test datasets. This ensures that the minority class does not end up exclusively in the training or the test dataset. Fortunately, this can be easily achieved using the “shuffle” and “stratify” arguments of sklearn’s train_test_split method.
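Concretely, the split described above can be written as follows, where X stands for the preprocessed, tokenized records and y for their ticket categories:

```python
from sklearn.model_selection import train_test_split

# X: list of token lists (after preprocessing), y: ticket categories.
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,   # our 75/25 split
    shuffle=True,     # randomize record order before splitting
    stratify=y,       # preserve class proportions in both subsets
    random_state=42,  # for reproducibility
)
```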

Vectorization

Since ML algorithms cannot directly process text, converting text into a numerical representation is an essential part of every text classification task. This process is called vectorization and is the final step in the feature engineering phase. We decided to use Word2Vec and train it on our data instead of directly using a pre-trained model. For training the Word2Vec model, we used the Gensim library. This Python library has a module that efficiently implements the Word2Vec family of algorithms.
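A sketch of training such a model with Gensim might look like this; the hyperparameter values shown are illustrative, not necessarily the ones we ultimately used, and tokenized_train is assumed to be the tokenized training corpus:

```python
from gensim.models import Word2Vec

# tokenized_train: a list of token lists, one per training record.
w2v = Word2Vec(
    sentences=tokenized_train,
    vector_size=100,  # dimensionality of the word embeddings
    window=5,         # context window size
    min_count=2,      # ignore words appearing fewer than 2 times
    workers=4,        # parallel training threads
    epochs=10,
)

vector = w2v.wv["shipping"]  # 100-dimensional embedding, if the word is in the vocabulary
```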

Once trained, the Word2Vec model is used on both the training and test datasets to generate word embeddings for each word in the dataset vocabulary. This way, words are represented as high-dimensional vectors of numbers optimized for our domain-specific semantics. At the same time, Word2Vec captures relationships between words, so that words with similar meanings or contexts have similar vector representations.

Given that the previously tokenized dataset contains customer inquiries of variable length, word embeddings are first generated for each word and then aggregated to obtain a single vector representation for the entire record. This aggregation combines the information from all words in the record into a fixed-length vector whose dimensionality is specified during the training of the Word2Vec model. One of the most frequently used aggregation methods is averaging the word embeddings (Figure 5). This fixed-size vector of numbers captures the semantic information within the record and represents a suitable input to ML models.

Figure 5. Text Vectorization (Anja Plakalovic)
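A minimal sketch of this averaging step, assuming the w2v model from the previous sketch and tokenized_train/tokenized_test token lists:

```python
import numpy as np


def record_vector(tokens, w2v):
    # Keep only tokens present in the Word2Vec vocabulary.
    vectors = [w2v.wv[t] for t in tokens if t in w2v.wv]
    if not vectors:
        # Fall back to a zero vector for fully out-of-vocabulary records.
        return np.zeros(w2v.vector_size)
    return np.mean(vectors, axis=0)  # fixed-length average of word embeddings


X_train_vec = np.array([record_vector(t, w2v) for t in tokenized_train])
X_test_vec = np.array([record_vector(t, w2v) for t in tokenized_test])
```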

Model Construction Phase

The model construction phase, which follows feature engineering, consists of six steps (Figure 1.). Although these steps are closely related, they are presented separately on the flowchart to underline their importance. In the following sections, we will clarify these steps by emphasizing two decisions: classification algorithm selection and evaluation metrics.

Classification Algorithm Selection

Choosing the appropriate classification algorithm is a vital decision that is heavily influenced by the data scientist’s experience and domain knowledge. Experienced data scientists use their understanding of the dataset characteristics, the nature of the problem, and the known strengths and limitations of various algorithms to decide which one to use. However, experimenting with different classification algorithms is often worthwhile, as it provides valuable insights into which algorithms work best for a given task.

After evaluating the strengths and weaknesses of various algorithms, we chose the Support Vector Machine (SVM) as our classification algorithm for its proven effectiveness in different NLP tasks. Several of SVM’s advantages make it a compelling choice for text classification. Firstly, SVM is well-known for its versatility and robustness: it can handle high-dimensional data and capture complex relationships. SVM can find the optimal hyperplane for separating classes in a feature space, even when the data is not linearly separable, by using different kernel functions (e.g., linear, polynomial, radial basis function). Moreover, SVM’s regularization parameter allows for fine-tuning the trade-off between model complexity and generalization. This flexibility helps avoid overfitting while maintaining good performance on unseen data.

First, a baseline SVM model is trained and evaluated to establish a performance benchmark. The performance of this baseline model is then used to estimate the level of improvement in subsequent iterations. Hyperparameter tuning enabled us to determine the optimal SVM model parameters (e.g., kernel type, regularization parameter, and gamma value). More precisely, hyperparameter tuning entails multiple iterations of model training and evaluation using different combinations of hyperparameters. Once these iterations are complete, the best-performing SVM model is selected. When experimenting with several classification algorithms, the steps described above are performed for each algorithm, and the final model is chosen by comparing the results.
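Using scikit-learn, this baseline-then-tune loop can be sketched as follows; the search grid is hypothetical, and the actual ranges depend on the dataset and compute budget:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Baseline model with default hyperparameters as the performance benchmark.
baseline = SVC().fit(X_train_vec, y_train)

# Hypothetical search grid over kernel type, regularization, and gamma.
param_grid = {
    "kernel": ["linear", "rbf", "poly"],
    "C": [0.1, 1, 10],           # regularization parameter
    "gamma": ["scale", "auto"],  # kernel coefficient for rbf/poly
}

search = GridSearchCV(SVC(), param_grid, cv=5, scoring="f1_weighted")
search.fit(X_train_vec, y_train)
best_model = search.best_estimator_  # the best-performing SVM configuration
```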

It is important to note that if we are not satisfied with the obtained results, we should consider going back to EDA, reviewing the steps performed in the feature engineering phase, modifying some of the preprocessing steps, or perhaps considering using a different algorithm to generate word embeddings.

Evaluation Metrics

When assessing the performance of a classifier, choosing the right evaluation metrics is vital, especially in the context of imbalanced datasets. Stakeholder input becomes crucial here, as their priorities and goals often determine the evaluation metrics. In our scenario, the stakeholders’ input primarily focused on accuracy as the key metric. However, we also relied on class-wise metrics such as precision, recall, and F1-score to ensure a comprehensive assessment. In addition, by using micro, macro, and weighted averages, we gained overall insight into model effectiveness.
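These metrics are straightforward to compute with scikit-learn, continuing from the previous sketches:

```python
from sklearn.metrics import accuracy_score, classification_report

y_pred = best_model.predict(X_test_vec)

print("Accuracy:", accuracy_score(y_test, y_pred))

# Per-class precision, recall, and F1-score, plus macro and weighted averages.
# (In the single-label multiclass case, the micro average equals accuracy.)
print(classification_report(y_test, y_pred))
```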

Classifier Deployment

After the final model is selected, the classifier is deployed to the production environment. Applying the classifier to new customer support tickets (i.e., unseen text) consists of several steps. First, the language of the new text must be detected, and the text translated if it is not in English. After that, the exact same text preprocessing steps used during classifier construction are applied. Finally comes vectorization, which converts the text into a form suitable as input for the classifier. Applying the classifier then yields the category of the customer support ticket.

Figure 6. Using the Text Classifier in Production
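Putting these deployment-time steps together, and reusing the hypothetical helpers from the earlier sketches (to_plain_text, translate_to_english, preprocess, record_vector), the inference flow for a new ticket might look like this:

```python
from nltk.tokenize import word_tokenize


def classify_ticket(raw_text: str) -> str:
    text = to_plain_text(raw_text)             # strip HTML if present
    text = translate_to_english(text)          # detect language, translate if needed
    tokens = word_tokenize(preprocess(text))   # same preprocessing as in training
    features = record_vector(tokens, w2v).reshape(1, -1)  # single-record input
    return best_model.predict(features)[0]     # predicted ticket category
```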

In the classifier deployment phase, cooperation between different teams is crucial. Software engineers developing the system are responsible for the seamless classifier integration, while DevOps engineers manage the deployment pipeline and infrastructure. Data scientists who build the model work alongside them to ensure a smooth transition and continuous monitoring of the deployed classifier.

Proper implementation of MLOps practices ensures continuous integration, deployment, and monitoring of classifier performance. In addition, client expectations regarding the frequency of retraining should be clearly defined and agreed upon to maintain the model’s accuracy and relevance over time. This collaborative effort and adherence to best practices in implementation are essential to achieving sustainable and efficient ML-based solutions in real-world applications.

Originally published at https://www.atlantbh.com on April 9, 2024.

Blog by Anja Plakalović, Data Analyst at Atlantbh.
