Decoding User Complaints: A Case Study with Mistral 7B LLM
Analyzing user complaints is a critical task for many companies, involving not just understanding but also effectively responding to customer concerns. One of the main challenges lies in managing the diversity and complexity of these complaints, which vary widely in language and content. Moreover, the high volume of complaints calls for automated processes for efficient management. In this context, Large Language Models (LLMs) are emerging as fundamental tools for this task.
Recently, we have witnessed a significant increase in the development of open-source LLMs, redefining the boundaries of Artificial Intelligence (AI) and Natural Language Processing (NLP). Among the notable innovations in this area is model quantization, a technique that reduces model size and computational requirements, allowing these models to run on more accessible machines at reduced cost [1].
This study aims to explore these technological advances, employing the open-source Mistral 7B model [2] to analyze a large set of complaints from the Reclame Aqui website. The objective is to demonstrate the potential and effectiveness of LLMs in interpreting and managing user feedback at scale. You can follow this study step by step through the Google Colab notebook available on my GitHub.
Note: This article was also published on my LinkedIn (Portuguese only), and a Portuguese version is also available on Medium here.
About the Data
We collected data from Reclame Aqui, a Brazilian site with over 30 million consumers and 500,000 registered companies, which has 1.5 billion page views per year [3]. A total of 7,000 distinct complaints, classified by users into 14 categories, were collected from a telecommunications provider, resulting in a balanced dataset with 500 complaints per category.
Data Extraction
To extract data from the company, we developed a Python script using the Selenium and BeautifulSoup packages. The extraction methodology is beyond the scope of this study, but the script is available in the repository as 01. webscrapping_reclameaqui.ipynb, should you wish to use it.
Data Cleaning
Initially, we performed basic data cleaning, removing double spacing and converting all text to lowercase. We replaced the mask inserted by Reclame Aqui with a shortened version, aiming to reduce text size. Texts with fewer than 3 characters were removed, as they did not provide sufficient information for analysis.
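A minimal sketch of these cleaning steps, assuming the scraped complaints sit in a pandas DataFrame with a text column (the file name, column name, and mask string are illustrative, not taken from the notebook):

import re
import pandas as pd

# hypothetical output of the scraping step (01. webscrapping_reclameaqui.ipynb)
df = pd.read_csv("reclamacoes.csv")

def clean_text(text: str) -> str:
    text = text.lower()                    # convert everything to lowercase
    text = re.sub(r"\s+", " ", text)       # collapse double spacing
    # replace the privacy mask inserted by Reclame Aqui with a shorter token
    text = text.replace("[editado pelo reclame aqui]", "[mask]")
    return text.strip()

df["text"] = df["text"].apply(clean_text)
df = df[df["text"].str.len() >= 3]         # drop texts with fewer than 3 characters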
Sampling
Given that this work is geared towards study purposes, and considering the limitations of the free Colab tier, we conducted a stratified random sampling of the complaints, divided into two datasets. The first set, containing 202 samples, was used as the validation dataset, and the second, with 2,000 samples, as the test dataset. We maintained the word distribution characteristics of the original set, as illustrated in Figure 1.
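Continuing from the cleaning sketch, the stratified split can be reproduced with scikit-learn. The sample sizes follow the text; the category column name, the non-overlapping split, and the random seed are assumptions:

from sklearn.model_selection import train_test_split

# stratified sampling keeps the per-category proportions of the original data
validation_df, remainder_df = train_test_split(
    df, train_size=202, stratify=df["category"], random_state=42)
test_df, _ = train_test_split(
    remainder_df, train_size=2000, stratify=remainder_df["category"], random_state=42)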
Data Analysis
LLMs have a context limit. To optimize the classification performed by Mistral 7B, we restricted the input text to 2,000 characters. This choice was based on the sample distribution shown in Figure 2. When analyzing the statistical distribution of the data, we noted that most complaints (up to the third quartile) had about 143 words, which is approximately 1,000 characters. Therefore, we decided to limit the text to 2,000 characters as a conservative measure.
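A short sketch of how this cut-off can be checked and applied, using the same assumed DataFrame as above:

# words per complaint: the third quartile sits around 143 words (~1,000 characters)
word_counts = df["text"].str.split().str.len()
print(word_counts.describe())

# conservative cap on the text passed to the model
df["text"] = df["text"].str.slice(0, 2000)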
Model Choice
Google Colab, in its free version, offers a T4 GPU with 16GB of VRAM, which is enough to load 7-billion-parameter models, especially in quantized form. There are various open-source models available, such as Mistral, Falcon, Zephyr, and Openchat. For this study, we chose Mistral due to its excellent performance on various benchmarks. Running this test with other models is straightforward, requiring only adjustments to the prompt structure.
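A minimal sketch of loading the quantized checkpoint from [6] on the Colab T4; loading it through the transformers GPTQ integration (rather than a dedicated GPTQ loader) is an assumption:

# requires: pip install transformers optimum auto-gptq accelerate
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    revision="main",    # 4-bit quantization, group size 128 (see [6])
    device_map="auto",  # place the model on the T4 GPU
)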
Prompt Engineering
As shown in Figure 3, Mistral 7B requires a specific input text pattern to achieve better performance. We developed various prompt patterns to test on the validation dataset, including zero-shot and few-shot approaches with one and two dialogues [4]. We also experimented with prompts with the task positioned before and after the complaint, to assess the model's ability to retain information.
Basically, each prompt is composed of three blocks: the task, the customer complaint, and a warning limiting the model's response.
# complain_template, task_template, and warning hold the three blocks described above
prompt_task_after = ('[INST]' + complain_template + '\n'
                     + task_template + '\n'
                     + warning + '\n\nRótulos:[/INST]')

prompt_task_before = ('[INST]' + task_template + '\n'
                      + complain_template + '\n'
                      + warning + '\n\nRótulos:[/INST]')
We also included in the prompt all the categories the model could use for classification. Since these are complaints about a mobile operator, the categories were:
tags = ['sinal/conexão de rede', 'cobrança indevida',
'consumo saldo/crédito', 'plano/benefício',
'cancelamento linha/plano', 'chip/sim card', 'spam',
'portabilidade', 'recarga/pagamento', 'dificuldade de contato']
The excerpt below exemplifies one of the prompts used; note that the model can choose one or more labels.
<s>[INST]
Reclamação: Esse plano é ruim!
Tarefa: Classifique a reclamação. Atenção use apenas as categorias abaixo.
sinal/conexão de rede, cobrança indevida, consumo saldo/crédito,
plano/benefício, cancelamento linha/plano, chip/sim card, spam,
portabilidade, recarga/pagamento, dificuldade de contato
Importante, apenas classifique sem explicar!
Rótulos:[/INST] plano/benefício
</s>
[INST]
Reclamação: {user_complain}
Tarefa: Classifique a reclamação. Atenção use apenas as categorias abaixo.
sinal/conexão de rede, cobrança indevida, consumo saldo/crédito,
plano/benefício, cancelamento linha/plano, chip/sim card, spam,
portabilidade, recarga/pagamento, dificuldade de contato
Importante, apenas classifique sem explicar!
Rótulos:[/INST]
Table 1 describes each of the prompts tested.
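A sketch of how a prompt like the one shown above can be sent to the model, assuming the template is stored in a string few_shot_template with the {user_complain} placeholder and that model and tokenizer come from the loading sketch; the generation parameters are illustrative:

def classify(user_complain: str) -> list[str]:
    # fill the few-shot template with the complaint, capped at 2,000 characters
    prompt = few_shot_template.format(user_complain=user_complain[:2000])
    # the template already contains <s>, so skip the automatic special tokens
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
    output = model.generate(**inputs, max_new_tokens=30, do_sample=False)
    # keep only the newly generated tokens and split the comma-separated labels
    answer = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                              skip_special_tokens=True)
    return [label.strip() for label in answer.split(",") if label.strip()]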
Multi-label Evaluation
To evaluate the prompts, we manually labeled the 202 cases in the validation set, a process that took approximately 3 hours. Due to the multi-label nature of the problem, we used the precision, recall, and f1-score metrics provided by sklearn's classification_report function.
There are various ways to aggregate metrics, such as micro-average, macro-average, weighted average, and samples average. We opted for the samples average, an approach specifically designed for multi-label scenarios, which calculates metrics individually for each instance and then averages them. This methodology is particularly useful for evaluating the model’s effectiveness in predicting the set of labels for each individual sample [5]. In Table 2, we present the score comparison for each of the prompts.
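A sketch of this scoring, assuming the manual labels and the model outputs are stored as lists of tag lists (y_true and y_pred are illustrative names):

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import classification_report

# y_true: labels assigned manually, y_pred: labels returned by the model,
# each a list of tag lists per complaint
mlb = MultiLabelBinarizer(classes=tags)
y_true_bin = mlb.fit_transform(y_true)
y_pred_bin = mlb.transform(y_pred)

# the "samples avg" row computes precision/recall/f1 per complaint, then averages them
print(classification_report(y_true_bin, y_pred_bin, target_names=tags, zero_division=0))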
As a reminder:
- Precision: the proportion of positive identifications that were actually correct. Low precision indicates too many False Positives (FP), meaning the model predicted many instances as positive that were actually negative.
- Recall: the proportion of actual positives that were identified correctly. Low recall indicates too many False Negatives (FN), meaning the model failed to identify many actual positive instances.
- F1 Score: the harmonic mean of precision and recall, balancing the two.
As per Table 2, the prompt with the highest f1-score was the one with the task positioned after the complaint description, using a two-example few-shot prompt (p_tsk_aft_2s). The details of the scoring are described below.
We can note in Table 3 that the model scores better for some categories, which is expected as some complaints are more straightforward, while others are more implicit.
Results and Conclusions
After selecting the best-performing prompt based on the f1-score, we applied the model to the test set with 2,000 examples. The process took about 1 hour and 30 minutes.
Interestingly, although the dataset presented a balanced distribution of labels, the analysis in Figure 4 revealed that most complaints at some point mentioned issues related to recharge/payment, line/plan cancellation, wrongful charges, and plan/benefit. This suggests that most problems are more related to commercial aspects than to technical issues like signal quality.
As we used a labeled dataset, it was possible to compare the labels applied by the model with those made by the customers. In Figure 5, we have a co-occurrence matrix, where the data in the rows represent the customer labels, while the columns reflect the labels applied by the model. We can see an alignment between the Mistral 7B labels and the customer labels, especially in the areas highlighted in dark blue, indicating that the model did a good job.
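A co-occurrence matrix like the one in Figure 5 can be built directly from binarized label matrices; a sketch reusing the (assumed) y_true_bin and y_pred_bin variables from the evaluation step, now computed on the test set:

import pandas as pd

# rows: labels chosen by customers, columns: labels applied by Mistral 7B
co_occurrence = pd.DataFrame(y_true_bin.T @ y_pred_bin, index=tags, columns=tags)
print(co_occurrence)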
In addition to verifying the effectiveness of the Mistral 7B, we identified some important insights:
- Prompt Sensitivity: The model proved to be highly sensitive to the structure of the prompt. The order of the sentences significantly impacts its performance.
- Effectiveness of Few-Shot Prompt: The inclusion of examples in two-interaction conversations, as part of the few-shot strategy, notably improved the model’s score.
- Bias in Tag Usage: The model tends to favor the tags provided in the examples used in the few-shot, increasing the recall rate.
Proposals for Future Improvements: We used the Mistral 7B model from the "main" branch, with 4-bit quantization and a group size of 128 [6]. Other versions may offer better results, at the cost of longer processing time. Alternative LLMs, such as Falcon, Zephyr, and Openchat, might also be more effective. A comparative analysis of these models would be a valuable exercise, expanding our understanding of the capabilities and limitations of each in the task of processing and analyzing user complaints.
References
- [1] Introduction to Weight Quantization: https://towardsdatascience.com/introduction-to-weight-quantization-2494701b9c0c
- [2] Mistral 7B: https://arxiv.org/abs/2310.06825
- [3] Reclame AQUI bate recorde de reclamações no mês de dezembro de 2021: https://blog.reclameaqui.com.br/reclame-aqui-bate-recorde-de-reclamacoes-em-dezembro-de-2021/
- [4] Harness the Power of LLMs: Zero-shot and Few-shot Prompting: https://www.analyticsvidhya.com/blog/2023/09/power-of-llms-zero-shot-and-few-shot-prompting/
- [5] Evaluating Multi-label Classifiers: https://towardsdatascience.com/evaluating-multi-label-classifiers-a31be83da6ea
- [6] TheBloke/Mistral-7B-Instruct-v0.2-GPTQ: https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GPTQ