Two minutes NLP — Topic Modeling and Semantic Search with Top2Vec

Top2Vec, Doc2Vec, UMAP, HDBSCAN, and topic vectors

Fabio Chiusano
NLPlanet
3 min readDec 4, 2021

--

Photo by Universal Eye on Unsplash

Top2Vec is an algorithm for topic modeling and semantic search. It can automatically detect topics present in documents and generates jointly embedded topics, documents, and word vectors. It’s implemented in Python in this open-source repository.

How it works

Image from https://github.com/ddangelov/Top2Vec
  • Reduce the dimensionality of the embeddings of the documents using the general non-linear dimension reduction algorithm UMAP. Since document vectors in high dimensional space are very sparse, dimension reduction helps for finding dense areas.
Image from https://github.com/ddangelov/Top2Vec
  • Find clusters of documents using the clustering algorithm HDBSCAN. HDBSCAN performs DBSCAN over varying epsilon values and integrates the result to find a clustering that gives the best stability over epsilon. This allows HDBSCAN to find clusters of varying densities (unlike DBSCAN), and be more robust to parameter selection.
Image from https://github.com/ddangelov/Top2Vec
  • For each cluster, calculate the centroid of document vectors with the non-reduced dimensions: we call this vector the topic vector.
Image from https://github.com/ddangelov/Top2Vec
  • Find n-closest word vectors to the topic vector. The closest word vectors in order of proximity become the topic words.
Image from https://github.com/ddangelov/Top2Vec

What can be done with the Top2Vec library

  • Get hierarchical topics from a set of documents.
  • Search topics by keywords.
  • Search documents by topic or keywords.
  • Find similar documents.

Considerations

  • Top2Vec automatically finds the number of topics, differently from other topic modeling algorithms like LDA.
  • Because of sentence embeddings, there’s no need to remove stop words and for stemming/lemmatization.
  • Top2Vec creates jointly embedded topic, document, and word vectors.
  • Since Top2Vec creates jointly embedded topic, document, and word vectors, they can be used interchangeably for search.

--

--

Fabio Chiusano
NLPlanet

Freelance data scientist — Top Medium writer in Artificial Intelligence