Dimensionality Reduction with UMAP
UMAP is like t-SNE, but faster and more general-purpose.
When it comes to visualizing high-dimensional data, there are a number of options available. The most tried-and-true technique is PCA, which stands for Principal Component Analysis. PCA has been around for over a century. It is fast, deterministic, and linear. Because the projection is linear, it can also be inverted to (approximately) reconstruct the original data. However, this linearity puts a limit on its usefulness in complex domains like natural language or images, where non-linear structure is the norm.
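To make that concrete, here is a minimal sketch of PCA with scikit-learn (the data here is random and purely illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))  # 100 points in 10 dimensions

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)           # project down to 2 dimensions
X_back = pca.inverse_transform(X_2d)  # linearity makes the projection (approximately) reversible

print(X_2d.shape)   # (100, 2)
print(X_back.shape) # (100, 10)
```

The round trip through `inverse_transform` is lossy (we dropped 8 of the 10 components), but it exists at all only because PCA is linear.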
A more recent technique that does capture non-linear structure is t-SNE, which stands for t-distributed Stochastic Neighbor Embedding. This technique is great at capturing the non-linear structure in high-dimensional data, at least at a local level, meaning that if two points are close together in the high-dimensional space, they have a high probability of being close together in the low-dimensional embedding space.
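A quick sketch with scikit-learn's t-SNE implementation (the two-cluster data and the parameter values are illustrative; `perplexity` roughly controls the size of the local neighborhood that gets preserved):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Two well-separated clusters in 10 dimensions
X = np.vstack([rng.normal(0, 1, size=(30, 10)),
               rng.normal(8, 1, size=(30, 10))])

# perplexity balances attention between local and global structure
tsne = TSNE(n_components=2, perplexity=10, random_state=42)
X_2d = tsne.fit_transform(X)

print(X_2d.shape)  # (60, 2)
```

Note that, unlike PCA, t-SNE is stochastic (hence the `random_state`) and has no `inverse_transform`: the embedding is for looking at, not for reconstructing.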
UMAP stands for Uniform Manifold Approximation and Projection. It's the new kid on the dimensionality reduction block, introduced in 2018, and it is very similar to t-SNE. If you compare visualizations created with t-SNE and UMAP, you might have a hard time telling them apart.
However, UMAP appears to have some significant advantages over t-SNE:
- It’s faster than t-SNE.
- It captures global structure better than t-SNE.
- Best of all, while t-SNE doesn’t have much use outside of visualization, UMAP is a general-purpose dimensionality reduction technique that can be used as preprocessing for machine learning.
- UMAP also has a solid theoretical backing as a manifold approximation technique, whereas t-SNE is primarily a visualization heuristic.
The main disadvantage of UMAP is its lack of maturity. It is a very new technique, so the libraries and best practices are not yet firmly established or robust. However, if you’re willing to be an early adopter, UMAP has a lot to offer.
If you enjoyed this blog post, check out my side project Calculist.