Two minutes NLP — Topic Modeling and Semantic Search with Top2Vec
Top2Vec, Doc2Vec, UMAP, HDBSCAN, and topic vectors
3 min read · Dec 4, 2021
Top2Vec is an algorithm for topic modeling and semantic search. It automatically detects the topics present in a set of documents and generates jointly embedded topic, document, and word vectors. It’s implemented in Python in this open-source repository.
How it works
- Create jointly embedded document and word vectors using sentence embeddings models, such as Doc2Vec, Universal Sentence Encoder, or BERT Sentence Transformer.
- Reduce the dimensionality of the document embeddings with UMAP, a general non-linear dimension reduction algorithm. Since document vectors in high-dimensional space are very sparse, dimension reduction helps find dense areas.
- Find clusters of documents using the clustering algorithm HDBSCAN. HDBSCAN performs DBSCAN over varying epsilon values and integrates the result to find a clustering that gives the best stability over epsilon. This allows HDBSCAN to find clusters of varying densities (unlike DBSCAN), and be more robust to parameter selection.
- For each cluster, calculate the centroid of its document vectors in the original (non-reduced) space: we call this vector the topic vector.
- Find the n closest word vectors to the topic vector. The closest word vectors, in order of proximity, become the topic words.
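The last two steps can be sketched in a few lines of NumPy. This is an illustrative toy, not the library's code: the two-dimensional vectors, the vocabulary, and the cluster labels (which would come from HDBSCAN) are all assumptions made up for the example.

```python
import numpy as np

def normalize(v):
    # Scale vectors to unit length so dot products are cosine similarities
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Toy jointly embedded vectors (in practice produced by Doc2Vec, USE, or SBERT)
doc_vectors = np.array([[0.9, 0.1], [0.8, 0.2],   # cluster 0
                        [0.1, 0.9], [0.2, 0.8]])  # cluster 1
labels = np.array([0, 0, 1, 1])                   # hypothetical HDBSCAN output
word_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
vocab = ["sports", "politics", "news"]

def topic_vector(cluster):
    # Step 4: centroid of the cluster's document vectors, in the original dims
    return doc_vectors[labels == cluster].mean(axis=0)

def topic_words(cluster, n=2):
    # Step 5: the n word vectors closest (by cosine) to the topic vector
    sims = normalize(word_vectors) @ normalize(topic_vector(cluster))
    return [vocab[i] for i in np.argsort(-sims)[:n]]

print(topic_words(0))  # → ['sports', 'news']
print(topic_words(1))  # → ['politics', 'news']
```

The centroid is deliberately taken over the original high-dimensional vectors, not the UMAP-reduced ones: the reduced space is used only to find the clusters, while distances to word vectors are measured in the shared embedding space.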
What can be done with the Top2Vec library
- Get hierarchical topics from a set of documents.
- Search topics by keywords.
- Search documents by topic or keywords.
- Find similar documents.
Considerations
- Top2Vec automatically finds the number of topics, unlike other topic modeling algorithms such as LDA, which require it as an input parameter.
- Because sentence embeddings are used, there's no need for stop-word removal, stemming, or lemmatization.
- Since Top2Vec creates jointly embedded topic, document, and word vectors, they can be used interchangeably for search.
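That interchangeability can be made concrete: one and the same nearest-neighbour routine answers queries in any direction, because every vector lives in the same space. A toy NumPy sketch, with made-up vectors and labels:

```python
import numpy as np

def nearest(query, candidates, names, n=1):
    # One cosine-similarity routine, regardless of vector type
    sims = candidates @ query / (
        np.linalg.norm(candidates, axis=1) * np.linalg.norm(query))
    return [names[i] for i in np.argsort(-sims)[:n]]

# Toy jointly embedded vectors (assumptions for illustration)
word_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])
words = ["sports", "politics"]
doc_vecs = np.array([[0.9, 0.2], [0.1, 0.8]])
docs = ["doc_a", "doc_b"]

# word -> documents: find the document closest to "sports"
print(nearest(word_vecs[0], doc_vecs, docs))   # → ['doc_a']
# document -> words: find the word closest to doc_b
print(nearest(doc_vecs[1], word_vecs, words))  # → ['politics']
```

The same function could take a topic vector as the query, or return topic vectors as candidates, with no change to the code.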