Image by the author

Turbocharge your user intent classification NLU pipeline with efficient machine learning and fast embeddings (no GPU required)

Darío Andrés Muñoz Prudant
8 min read · Jun 30, 2024


First of all, thank you for taking the time to read this article. If you found it helpful and interesting, please give it a few claps 👏 and follow me 🚀 for more insights on efficient machine learning techniques and natural language understanding innovations.

Introduction

In the ever-evolving world of natural language understanding (NLU), classifiers for intent detection play a crucial role in numerous applications, from chatbots to virtual assistants. However, large language models (LLMs) that excel at this task are often slow, costly, and require substantial computational resources. While these models offer the advantage of not needing training or fine-tuning, their performance is typically optimal only with top-of-the-line or proprietary solutions (and a huge amount of VRAM).

My approach seeks to bridge this gap by leveraging machine learning (ML) classifiers combined with embeddings as a proxy for NLU. By doing so, I achieve a user intent classifier that runs over 20 times faster on a CPU and more than 100 times faster on a GPU compared to a sufficiently powerful LLM. This method maintains competitive accuracy without the hefty resource demands, making it an attractive alternative for efficient and fast intent classification.

This approach is also useful and more practical than training a BERT or SBERT classifier, as the embeddings and this ensemble do not require significant computational power. By leveraging pre-trained embeddings and an efficient ensemble of traditional machine learning models, we achieve high accuracy without the need for extensive computational resources.
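The core idea can be sketched in a few lines: embed each utterance once, then treat the vectors as ordinary features for a scikit-learn classifier. The embedding call below is shown only as a comment because it assumes the sentence-transformers package and downloads the real Jina model; the classifier part runs on stand-in vectors.

```python
# Sketch of the two-step pipeline. The real embedding step would be:
#
#   from sentence_transformers import SentenceTransformer
#   encoder = SentenceTransformer("jinaai/jina-embeddings-v2-base-es",
#                                 trust_remote_code=True)
#   X = encoder.encode(train_texts)  # shape: (n_texts, 768)
#
# Once texts are embedded, any scikit-learn classifier works on the vectors:
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 8))     # stand-in for 768-d sentence embeddings
y = (X[:, 0] > 0).astype(int)    # stand-in intent labels

clf = LogisticRegression(max_iter=1000).fit(X, y)
probs = clf.predict_proba(X)     # calibrated-ish class probabilities
```

At inference time, only one forward pass through the embedding model plus a cheap scikit-learn prediction is needed, which is where the speedup over an LLM call comes from.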

The motivation

Conversational AI pipelines built around retrieval-augmented generation (RAG) are pivotal. These pipelines enhance conversational interactions by retrieving relevant information and generating coherent responses. Detecting user intent is crucial for effective RAG, particularly in stages such as censorship, intent detection, and even sub-intent detection. These tasks often necessitate multiple calls to large language models (LLMs) and sophisticated prompt engineering, which can be both time-consuming and costly.

My motivation stems from the need to drastically reduce the time and number of calls to LLMs within these pipelines. By implementing a fast and efficient solution, I can achieve a high classification rate without the resource-intensive overhead of LLMs. This solution, although currently focused on classifying requests and questions, lays the foundation for future models that can support multi-turn conversations and chat history.

To illustrate the utility of my approach, I have developed an ensemble model specifically designed for the Spanish language, using embeddings from jinaai/jina-embeddings-v2-base-es. This model efficiently classifies the following intents: censorship, others, lead (for potential commercial leads), contact (when the user requests contact information), directions (when the user needs to know how to get somewhere), meet, negation, affirmation, and casual chat.

At the end of the article, I will share the training dataset link, the training and evaluation notebook, and the trained model, making it accessible for those interested in exploring and utilizing this approach. By focusing on these specific intents, my model ensures a streamlined and cost-effective solution for intent classification in Spanish-speaking contexts. This not only improves the efficiency of RAG pipelines but also sets the stage for more advanced conversational AI systems in the future.

The model

To address the challenge of efficient and accurate user intent classification, I leveraged a combination of three powerful machine learning algorithms: support vector machine (SVM), logistic regression, and k-nearest neighbors (k-NN). These algorithms are particularly well-suited for working with high-dimensional text embeddings due to their distinct strengths and complementary characteristics.

Support vector machine (SVM): SVM is renowned for its effectiveness in high-dimensional spaces and its robustness in scenarios with limited data. Given the synthetic nature of the dataset generated using a proprietary LLM, which is not excessively large, the SVM’s ability to perform well with fewer data points was crucial. Its kernel trick allows it to operate effectively in high-dimensional spaces, making it an excellent candidate for handling the text embeddings produced by the jinaai/jina-embeddings-v2-base-es model. Moreover, SVMs are known for their generalization capabilities, which helped achieve good results despite the dataset’s size.

Logistic regression: Logistic regression is a fundamental and interpretable algorithm that excels in binary and multiclass classification tasks. Its linear decision boundary works well with text embeddings, capturing the underlying patterns in the data. Logistic regression’s simplicity and efficiency make it a reliable choice for quick predictions, especially when computational resources are a concern. By calibrating the probabilities, I ensured that the model’s output probabilities were well-calibrated, enhancing its reliability in real-world applications.

k-Nearest neighbors (k-NN): k-NN is a non-parametric algorithm that makes predictions based on the closest training examples in the feature space. Its intuitive approach of looking at the nearest neighbors makes it robust to outliers and noise, which can be beneficial when dealing with synthetic data. k-NN’s ability to capture local patterns in the data complements the global perspective offered by SVM and logistic regression. By calibrating its probabilities using isotonic regression, I further improved its probabilistic predictions.

Ensemble model: Combining these three classifiers into an ensemble model using a soft voting mechanism significantly boosted the overall performance. Each algorithm contributed its strengths: SVM’s robustness with limited data, logistic regression’s efficiency and interpretability, and k-NN’s local pattern recognition. The ensemble model aggregated the calibrated probabilities from each classifier, leading to a more accurate and reliable final prediction. This approach reduced the variance and improved the generalization of the model, resulting in superior performance compared to any single classifier.
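A minimal sketch of this ensemble in scikit-learn follows. It is an illustration of the described design (soft voting over calibrated SVM, logistic regression, and k-NN classifiers), not the exact hyperparameters of the released model; the kernel, neighbor count, and calibration folds are assumptions.

```python
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import VotingClassifier

def build_ensemble() -> VotingClassifier:
    # SVM with probability estimates (internally Platt-scaled when
    # probability=True).
    svm = SVC(kernel="rbf", probability=True, random_state=0)
    # Logistic regression wrapped in cross-validated probability calibration.
    logreg = CalibratedClassifierCV(LogisticRegression(max_iter=1000), cv=3)
    # k-NN calibrated with isotonic regression, as described above.
    knn = CalibratedClassifierCV(
        KNeighborsClassifier(n_neighbors=5), method="isotonic", cv=3
    )
    # Soft voting averages the calibrated probabilities of the three models.
    return VotingClassifier(
        estimators=[("svm", svm), ("logreg", logreg), ("knn", knn)],
        voting="soft",
    )
```

Fitting this ensemble on the embedding vectors is a single `build_ensemble().fit(X, y)` call, and `predict_proba` then returns the averaged, calibrated class probabilities used for the final decision.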

The synthetic training/testing dataset was generated with a proprietary LLM, providing a diverse range of examples and allowing the ensemble model to learn and generalize well. By sharing the training dataset and pre-trained model, I aim to enable others to explore and benefit from this efficient approach to user intent classification.

Dataset

To give you a better understanding of the dataset used for training the model, I will include an image below that illustrates the size and composition of the dataset. This visual representation will provide insight into the distribution of different intents and the overall structure of the data, highlighting how the synthetic examples generated with the LLM played a crucial role in building an effective classification system.

It’s important to note that the dataset has an intentional class imbalance.

Some intents have fewer linguistic expressions to convey their meaning, necessitating fewer samples. For example, the “censorship” intent requires a larger and more diverse set of samples covering various questionable topics to ensure comprehensive detection. This imbalance was carefully designed to reflect the real-world scenario where certain intents are more complex and varied than others.

Classifier performance

The performance of the ensemble classifier, combining logistic regression, SVM, and k-NN, demonstrates remarkable accuracy across various intents.

Below, you will find the classification report and confusion matrix which detail the model’s precision, recall, and f1-score for each intent. This comprehensive evaluation highlights the model’s capability to accurately classify a diverse set of user intents, even with the intentional class imbalance designed to reflect real-world scenarios.
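For readers reproducing the evaluation, this kind of report comes straight from scikit-learn; the snippet below uses a toy set of labels purely for illustration (the article's own report is computed on the real test split with the full set of intents):

```python
from sklearn.metrics import classification_report, confusion_matrix

# Toy ground-truth and predicted intents, for illustration only.
y_true = ["lead", "contact", "lead", "negation", "contact", "lead"]
y_pred = ["lead", "contact", "lead", "negation", "lead", "lead"]

# Per-class precision, recall, and f1-score in one call.
report = classification_report(y_true, y_pred)

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_true, y_pred,
                          labels=["contact", "lead", "negation"])
print(report)
print(matrix)
```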

Below is the detailed classification report and confusion matrix for further interpretation:

Classification report (image by the author)
Confusion matrix (image by the author)

The class "Desconocido" ("unknown") was programmatically assigned to every test sample whose top class probability fell below 0.5.
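That rejection rule is simple to implement on top of any classifier exposing `predict_proba`; a minimal sketch (the function name and threshold parameter are mine, not from the released code):

```python
import numpy as np

def predict_with_unknown(clf, X, threshold=0.5, unknown_label="Desconocido"):
    """Return the classifier's label, or the unknown label when the
    top class probability falls below the threshold."""
    probs = clf.predict_proba(X)
    labels = clf.classes_[probs.argmax(axis=1)]
    # Reject low-confidence predictions instead of guessing an intent.
    return np.where(probs.max(axis=1) >= threshold, labels, unknown_label)
```

This keeps out-of-scope or ambiguous user messages from being silently forced into one of the known intents.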

Some manual examples I generated for testing:

Image by Google
Image by the author

Quality of the embeddings

To evaluate the quality of the embeddings used in my model, I employed UMAP (Uniform Manifold Approximation and Projection) to reduce the high-dimensional embeddings to a 2D space. This visualization technique allows us to see how well the embeddings capture the underlying structure of the data.

The UMAP visualization below shows that the embeddings separate the different classes quite distinctly, indicating that the jinaai/jina-embeddings-v2-base-es model is highly effective in capturing semantic nuances. This clear separation of classes in a reduced dimensional space suggests that the embeddings are of high quality, significantly contributing to the overall performance of the classifier.

The distinct clusters corresponding to each intent class affirm the embeddings’ robustness, providing a solid foundation for the SVM, logistic regression, and k-NN classifiers to perform their tasks effectively. The embeddings’ ability to delineate classes with minimal overlap is crucial for the success of the ensemble model, leading to the impressive accuracy and reliability observed in our results.

UMAP projection of the embeddings (image by the author)

Conclusions

While I have achieved a model that classifies user intents with high accuracy, there is always room for improvement, particularly in the training dataset. Enhancing the dataset to better generalize could further improve the model’s performance, making it more robust to a wider range of linguistic variations. Although extensive quality tests were not conducted, it is possible that certain areas of the language are not fully covered by the embeddings and the training samples.

Despite these limitations, I have laid out a strategy and model architecture that is both cost-effective and holds significant potential. By leveraging an ensemble of logistic regression, SVM, and k-NN, calibrated for probability accuracy, this system runs significantly faster and is more resource-efficient than large language models, even on a notebook CPU.

This approach demonstrates that high-performing intent classification can be achieved without the heavy computational burden of LLMs or other expensive Transformers. The solution provides a foundation for further enhancements, such as supporting multi-turn conversations and integrating more comprehensive language data.

Repositories and final words

I encourage you to clone the repository, test the model, and join me in improving the dataset. Together, we can refine and enhance this model, making it even more robust and generalizable. Let’s collaborate to push the boundaries of what’s possible in user intent classification and create a more powerful, efficient system.

Feel free to reach out with your feedback, suggestions, or contributions. Let’s work together to make this model the best it can be. Thank you, and happy coding!

Git link: https://github.com/puppetm4st3r/intentclassification

HF link: https://huggingface.co/prudant/es_intent_classification

From Latam with ❤️
