Strategies for Effective Multi-Language Text Classification Using the Best Prompt Techniques, LLMs, and ChatGPT Models
As everyone who hasn’t been living under a rock in recent months knows, Large Language Models (LLMs), and specifically ChatGPT, are machine learning models trained with a single, straightforward goal in mind: to predict the next word given a text.
In order to excel at this task, a model must not only understand the provided text but also generate new, well-structured text and “store” and “retrieve” a vast amount of contextual knowledge about “the world” as understood by the text’s author.
Given massive amounts of data and a simple completion objective, these models have developed disruptive capabilities in text understanding, knowledge retrieval, and text generation.
The Hotels Network and LLMs
At The Hotels Network, we provide over 16,000 hotels in 100 countries with a software-as-a-service platform that customizes the content of their web pages in real time to adapt to each user, providing a better user experience and maximizing the performance of their websites in terms of conversion rate.
To achieve this, it is essential that we automatically understand the peculiarities of our clients’ web pages as well as the behavior and needs of their users. This is where the ability of LLMs to understand text in a variety of languages plays a very relevant role.
With this in mind, we have conducted a series of experiments to objectively compare the automatic classification capabilities of different models, as well as strategies for generating prompts for automatic text classification based on their content. Our goal is to continually improve the performance of our platform, allowing our clients to provide the best possible user experience to their customers.
Experimental framework:
We asked ChatGPT 3.5 to generate messages simulating user reviews, specifying which aspect of the hotel they should talk about and in which language they should be written.
With this premise, we generated a dataset consisting of 450 messages, written in 5 languages (Spanish, French, Russian, German, and English), covering 9 categories related to the user’s stay in the hotel (“Hotel Location”, “Amenities”, “Customer Service”, “Cleanliness”, “Price”, “Comfort”, “Safety”, “Accessibility”, and “Food and Restaurants”).
For each language-category combination, we generated ten alternatives, asking the generator to make the content of the different messages diverse and varied.
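As an illustration, below is a minimal sketch of this generation step using the openai Python package’s ChatCompletion interface (which varies across library versions); the prompt wording and the generate_reviews helper are assumptions for illustration, not the exact code we used.

```python
# Hypothetical sketch of the review-generation step; the prompt wording
# is illustrative, not the exact one used in our experiments.
import openai

LANGUAGES = ["Spanish", "French", "Russian", "German", "English"]
CATEGORIES = [
    "Hotel Location", "Amenities", "Customer Service", "Cleanliness",
    "Price", "Comfort", "Safety", "Accessibility", "Food and Restaurants",
]

def generate_reviews(language: str, category: str, n: int = 10) -> list[str]:
    """Ask the model for n diverse synthetic hotel reviews."""
    prompt = (
        f"Write {n} short, varied hotel guest reviews in {language}. "
        f"Each review must focus only on this aspect of the stay: {category}. "
        "Return one review per line, with no numbering or extra text."
    )
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,  # some randomness encourages diverse content
    )
    # One review per non-empty line of the completion.
    return [line for line in
            response.choices[0].message.content.strip().splitlines() if line]
```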
We then tasked the LLMs with classifying both the language and the category of each message.
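A minimal sketch of the kind of classification request involved follows; again, the exact prompt text and the classify_message helper name are illustrative assumptions rather than our production code.

```python
# Hypothetical classification request; the response-format constraint
# mirrors the one described in the results section below.
import openai

CATEGORIES = ["Hotel Location", "Amenities", "Customer Service",
              "Cleanliness", "Price", "Comfort", "Safety",
              "Accessibility", "Food and Restaurants"]

def classify_message(message: str) -> str:
    """Ask the model for the category and language of a single review."""
    prompt = (
        "Classify the following hotel review into exactly one of these "
        "categories: " + ", ".join(CATEGORIES) + ". "
        "Also identify the language it is written in. "
        "Answer only as 'category | language', with no other text.\n\n"
        f"Review: {message}"
    )
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # near-deterministic output for evaluation
    )
    return response.choices[0].message.content.strip()
```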
Results obtained:
Model comparison:
We compared the classification results in different languages among the Meta LLaMA-based alpaca-lora 7B [1], GPT-3 (text-davinci-003), ChatGPT 3.5 (gpt-3.5-turbo), and Google’s flan-ul2 [2].
The results show a significant improvement (10%) for ChatGPT 3.5 compared to flan-ul2 and GPT-3. The results of flan-ul2 are comparable to or slightly better than those obtained with GPT-3. The smaller alpaca-lora model shows results that are not comparable to those of ChatGPT and flan-ul2; additionally, it is notably sensitive to the language of the message being processed.
Strategy for prompt generation:
The experiments conducted to measure the effectiveness of different prompting strategies were performed only with the gpt-3.5-turbo model.
When asked to classify the language in which the message is written, ChatGPT 3.5 identifies it without any problems (98.9% accuracy).
It is advisable to write the prompt in English (we obtain an improvement of around 10% in accuracy compared to writing it in Spanish), and this improvement is observed even when the message to be processed is written in Spanish. These gains are far above the inherent variance of the method.
To estimate the variance of the method, we repeated the same prompts multiple times and measured the variation in the per-language classification accuracies. The results showed that the variance of the method was around 0.6% of the average accuracy, indicating a high level of consistency in the measurements.
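As a sketch, this repeatability check boils down to repeating identical runs and measuring the spread of the resulting accuracies; the helper names and the number of runs below are illustrative assumptions.

```python
# Illustrative repeatability check; `classify` stands for any function
# mapping a message to a single predicted label (e.g. a model call).
from statistics import mean, stdev

def accuracy(dataset: list[tuple[str, str]], classify) -> float:
    """dataset: (message, expected_label) pairs."""
    hits = sum(classify(msg) == expected for msg, expected in dataset)
    return hits / len(dataset)

def repeatability(dataset, classify, n_runs: int = 5):
    """Run identical classifications several times; report mean and spread."""
    runs = [accuracy(dataset, classify) for _ in range(n_runs)]
    return mean(runs), stdev(runs)
```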
The temperature does not affect the classification result: experiments with temperatures of 0, 0.4, and 0.7 produced exactly the same average accuracies.
A slight improvement (around 1%) in accuracy is obtained if, in addition to writing the prompt in English, we provide an English translation of the message to be processed. In our experiments, mixing the translation and the classification in a single prompt gives much worse results, so it is recommended to translate the message (if it is not in English) in a separate request and use the result of that translation in the classification request.
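A sketch of this two-request pipeline is shown below; the prompt wording and helper names are assumptions, and the ChatCompletion interface depends on the openai library version.

```python
# Hypothetical translate-then-classify pipeline: two separate requests,
# since mixing both steps in one prompt performed noticeably worse.
import openai

def ask(prompt: str) -> str:
    """Single chat completion request with near-deterministic settings."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

def classify_via_translation(message: str, categories: list[str]) -> str:
    # Request 1: translation only, nothing else in the prompt.
    english = ask("Translate the following text to English. "
                  f"Answer only with the translation:\n\n{message}")
    # Request 2: classify the already-translated text.
    return ask("Classify the following hotel review into exactly one of "
               f"these categories: {', '.join(categories)}. "
               f"Answer only with the category name:\n\n{english}")
```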
In our study, we conducted experiments to aggregate the responses obtained from different models, or from the same models with different parameters, determining the category by majority voting; the results are displayed in the figure below. Interestingly, we observed that applying voting methods to the different responses did not improve the results obtained by writing the prompt, categories, and message in English.
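The aggregation itself is a simple majority vote over the (normalized) answers; a minimal sketch, with an illustrative helper name:

```python
# Majority vote over candidate answers from different models or settings.
from collections import Counter

def majority_vote(responses: list[str]) -> str:
    """Return the most frequent category among the candidate responses."""
    return Counter(responses).most_common(1)[0][0]

# e.g. majority_vote(["PRICE", "PRICE", "COMFORT"]) -> "PRICE"
```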
Furthermore, we investigated the effect of introducing phrases such as “act as a marketing specialist” or “act as an expert classifier” into the prompt. Surprisingly, we found that these phrases had no effect on the classifier’s responses.
It is worth noting that all prompts used in our experiments included explicit instructions to answer strictly with the names of the permitted categories and not to add any additional comments or text. Moreover, the obtained responses were lightly post-processed (uppercased, trimmed, and stripped of characters such as quotes and punctuation marks) so they could be handled reliably by automatic processing systems.
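A minimal sketch of this normalization step, in which the exact set of stripped characters is an assumption:

```python
# Illustrative cleanup of a raw model answer before matching it against
# the permitted category names.
import string

def normalize(response: str) -> str:
    """Uppercase, trim, and strip quotes/punctuation from a model answer."""
    cleaned = response.strip().upper()
    # Remove ASCII punctuation plus common typographic quote characters.
    return cleaned.translate(
        str.maketrans("", "", string.punctuation + "«»\u201c\u201d\u2018\u2019"))
```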