Two minutes NLP: Effective intent identification in short texts with unsupervised learning
LDA, USE, Sentence-BERT, PCA, UMAP, and HDBSCAN
There are two main unsupervised learning approaches for understanding what short texts are about: topic modeling and clustering of embeddings.
Topic modeling
Topic modeling is used to discover latent topics in a collection of documents. A very common topic modeling algorithm is LDA (Latent Dirichlet Allocation). The number of topics to find is a hyperparameter of LDA, and it can be tuned by optimizing a suitable metric, such as topic coherence. Airbnb, for example, uses LDA for this purpose.
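As a rough illustration of tuning the number of topics, the sketch below fits LDA for a few candidate values and keeps the one with the highest coherence. It rests on several assumptions: scikit-learn as the library, UMass coherence as the metric (it is only one of several coherence measures), and a handful of made-up messages as the corpus. The helper names `umass_coherence` and `pick_num_topics` are hypothetical.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer


def umass_coherence(topic_word, doc_term, top_n=5):
    """Average UMass coherence over all topics (higher is better)."""
    # Binary document-word matrix; dense is fine for a small corpus.
    present = (doc_term.toarray() > 0).astype(float)
    # co_docs[i, j] = number of documents containing both word i and word j;
    # the diagonal holds plain document frequencies.
    co_docs = present.T @ present
    scores = []
    for weights in topic_word:
        top = np.argsort(weights)[::-1][:top_n]  # highest-weighted words first
        score = 0.0
        for m in range(1, len(top)):
            for l in range(m):
                joint = co_docs[top[m], top[l]]
                score += np.log((joint + 1.0) / co_docs[top[l], top[l]])
        scores.append(score)
    return float(np.mean(scores))


def pick_num_topics(texts, candidates=(2, 3, 4), top_n=5, seed=0):
    """Fit one LDA model per candidate and return (best_k, scores_by_k)."""
    doc_term = CountVectorizer(stop_words="english").fit_transform(texts)
    scores = {}
    for k in candidates:
        lda = LatentDirichletAllocation(n_components=k, random_state=seed)
        lda.fit(doc_term)
        scores[k] = umass_coherence(lda.components_, doc_term, top_n)
    return max(scores, key=scores.get), scores


# Made-up messages, for illustration only.
texts = [
    "my card was stolen yesterday",
    "someone stole my credit card",
    "I lost my debit card",
    "what exchange rate do you apply",
    "exchange rate for euro transfers",
    "which currency exchange rate applies",
    "transfer money to another account",
    "my bank transfer is still pending",
    "pending transfer to savings account",
]
best_k, scores = pick_num_topics(texts)
print(best_k, scores)
```

On a corpus this small the scores are not meaningful; the point is only the shape of the search loop over candidate topic counts.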
However, intents are often more specific than topics, so clustering of embeddings can be a useful alternative.
Clustering of embeddings
Intents can be identified by finding precise and narrow clusters. This is typically done in three steps:
- Obtain an embedding for each document. Google’s Universal Sentence Encoder (USE) and Sentence-BERT are popular sentence encoders for this purpose.
- Reduce the dimensionality of the embeddings with techniques such as PCA or UMAP. This has been observed to improve the clustering results in the next step.
- Cluster the embeddings. Density-based clustering algorithms such as HDBSCAN are typically used.
Datasets
The PolyAI team published a banking dataset (https://github.com/PolyAI-LDN/task-specific-datasets) that contains 10,000+ messages spanning 77 intents, which you can use to test your algorithms. Keep in mind that in a real-world setting you would face additional challenges, such as identifying which message of each conversation contains the intent.
Other public datasets are hard to find because real-world data needs to be anonymized.
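One way to use a labeled dataset like the PolyAI banking data to test an unsupervised method is to compare the discovered clusters with the known intents. The sketch below is a minimal example using only the Python standard library. It assumes a CSV file with `text` and `category` columns (check this against the actual files in the repository) and scores clusters by purity, which is just one simple choice of metric. The sample rows and the cluster ids are made up.

```python
import csv
import io
from collections import Counter, defaultdict


def load_labeled_messages(csv_file, text_col="text", label_col="category"):
    """Read (message, intent) pairs from an open CSV file with a header row."""
    reader = csv.DictReader(csv_file)
    return [(row[text_col], row[label_col]) for row in reader]


def cluster_purity(cluster_ids, intents, noise_label=-1):
    """Share of clustered messages whose intent is the majority intent of their cluster."""
    by_cluster = defaultdict(Counter)
    for cluster, intent in zip(cluster_ids, intents):
        if cluster != noise_label:  # skip messages the clusterer marked as noise
            by_cluster[cluster][intent] += 1
    total = sum(sum(counts.values()) for counts in by_cluster.values())
    if total == 0:
        return 0.0
    majority = sum(counts.most_common(1)[0][1] for counts in by_cluster.values())
    return majority / total


# Tiny in-memory stand-in for a real CSV file; the rows are made up.
sample = io.StringIO(
    "text,category\n"
    "I lost my card,lost_or_stolen_card\n"
    "My card was stolen,lost_or_stolen_card\n"
    "What is the exchange rate?,exchange_rate\n"
    "Which rate do you use for USD?,exchange_rate\n"
)
pairs = load_labeled_messages(sample)
intents = [intent for _, intent in pairs]
predicted = [0, 0, 1, 0]  # hypothetical cluster ids from a clustering step
print(cluster_purity(predicted, intents))  # 0.75
```

Purity rewards many tiny clusters, so it is best read alongside the number of clusters and the share of messages marked as noise.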
Code examples