Two minutes NLP — Topic Modeling and Semantic Search with Top2Vec
Top2Vec, Doc2Vec, UMAP, HDBSCAN, and topic vectors
3 min read · Dec 4, 2021
Top2Vec is an algorithm for topic modeling and semantic search. It automatically detects the topics present in a set of documents and generates jointly embedded topic, document, and word vectors. It’s implemented in Python in this open-source repository.
How it works
- Create jointly embedded document and word vectors using sentence embeddings models, such as Doc2Vec, Universal Sentence Encoder, or BERT Sentence Transformer.
- Reduce the dimensionality of the document embeddings with UMAP, a general non-linear dimension reduction algorithm. Since document vectors in high-dimensional space are very sparse, dimension reduction helps find dense areas.
- Find clusters of documents using the clustering algorithm HDBSCAN. HDBSCAN performs DBSCAN over varying epsilon values and integrates the result to find a clustering that gives the best stability over epsilon. This allows HDBSCAN to find clusters of varying densities (unlike DBSCAN), and be more robust to parameter selection.
- For each cluster, calculate the centroid of its document vectors in the original (non-reduced) space: we call this vector the topic vector.
- Find the n closest word vectors to the topic vector. The closest word vectors, in order of proximity, become the topic words.
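The last two steps can be sketched in a few lines of NumPy. This is an illustrative toy, not the library's code: the two-dimensional vectors, the vocabulary, and the cluster labels (which would come from HDBSCAN) are all assumptions made up for the example.

```python
import numpy as np

def normalize(v):
    # Scale vectors to unit length so dot products are cosine similarities
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Toy jointly embedded vectors (in practice produced by Doc2Vec, USE, or SBERT)
doc_vectors = np.array([[0.9, 0.1], [0.8, 0.2],   # cluster 0
                        [0.1, 0.9], [0.2, 0.8]])  # cluster 1
labels = np.array([0, 0, 1, 1])                   # hypothetical HDBSCAN output
word_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
vocab = ["sports", "politics", "news"]

def topic_vector(cluster):
    # Step 4: centroid of the cluster's document vectors, in the original dims
    return doc_vectors[labels == cluster].mean(axis=0)

def topic_words(cluster, n=2):
    # Step 5: the n word vectors closest (by cosine) to the topic vector
    sims = normalize(word_vectors) @ normalize(topic_vector(cluster))
    return [vocab[i] for i in np.argsort(-sims)[:n]]

print(topic_words(0))  # → ['sports', 'news']
print(topic_words(1))  # → ['politics', 'news']
```

The centroid is deliberately taken over the original high-dimensional vectors, not the UMAP-reduced ones: the reduced space is used only to find the clusters, while distances to word vectors are measured in the shared embedding space.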
What can be done with the Top2Vec library
- Get hierarchical topics from a set of documents.
- Search topics by keywords.
- Search documents by topic or keywords.
- Find similar documents.
Considerations
- Top2Vec automatically finds the number of topics, unlike other topic modeling algorithms such as LDA, which require it as an input parameter.
- Because sentence embeddings are used, there's no need for stop-word removal, stemming, or lemmatization.
- Since Top2Vec creates jointly embedded topic, document, and word vectors, they can be used interchangeably for search.
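That interchangeability can be made concrete: one and the same nearest-neighbour routine answers queries in any direction, because every vector lives in the same space. A toy NumPy sketch, with made-up vectors and labels:

```python
import numpy as np

def nearest(query, candidates, names, n=1):
    # One cosine-similarity routine, regardless of vector type
    sims = candidates @ query / (
        np.linalg.norm(candidates, axis=1) * np.linalg.norm(query))
    return [names[i] for i in np.argsort(-sims)[:n]]

# Toy jointly embedded vectors (assumptions for illustration)
word_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])
words = ["sports", "politics"]
doc_vecs = np.array([[0.9, 0.2], [0.1, 0.8]])
docs = ["doc_a", "doc_b"]

# word -> documents: find the document closest to "sports"
print(nearest(word_vecs[0], doc_vecs, docs))   # → ['doc_a']
# document -> words: find the word closest to doc_b
print(nearest(doc_vecs[1], word_vecs, words))  # → ['politics']
```

The same function could take a topic vector as the query, or return topic vectors as candidates, with no change to the code.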