Dimensionality Reduction with UMAP

UMAP is like t-SNE, but faster and more general-purpose.

Dan Allison
Sep 22, 2018 · 2 min read

When it comes to visualizing high-dimensional data, there are a number of options available. The most tried-and-true technique is PCA, which stands for Principal Component Analysis. PCA has been around for over a century. It is fast, deterministic, and linear. Because the projection is linear, it can also be inverted: a reduced point can be mapped back to the original space, exactly if every component is kept and approximately otherwise. However, this linearity puts a limit on its usefulness in complex domains like natural language or images, where non-linear structure is the norm.
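
To make "linear" and "reversible" concrete, here is a minimal sketch of PCA written directly with NumPy's SVD. The helper names (pca_fit, pca_transform, pca_inverse) and the toy data are made up for illustration; in practice you would more likely reach for a library implementation.

```python
import numpy as np

def pca_fit(X, n_components):
    """Return the feature means and the top principal axes (one per row)."""
    mean = X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by explained variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def pca_transform(X, mean, axes):
    """Project onto the principal axes: a single matrix multiplication."""
    return (X - mean) @ axes.T

def pca_inverse(Z, mean, axes):
    """Map reduced points back to the original feature space."""
    return Z @ axes + mean

# Hypothetical toy data: 200 points in 5 correlated dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

# Keeping all 5 components: the round trip is exact up to rounding.
mean, axes = pca_fit(X, n_components=5)
X_full = pca_inverse(pca_transform(X, mean, axes), mean, axes)
print(np.allclose(X_full, X))

# Keeping only 2 components: the round trip is an approximation.
mean, axes2 = pca_fit(X, n_components=2)
Z = pca_transform(X, mean, axes2)
X_approx = pca_inverse(Z, mean, axes2)
print(Z.shape, np.abs(X_approx - X).max())
```

The whole transform is a subtraction and a matrix product, which is why PCA is fast and gives the same answer on every run, and also why it cannot bend to follow curved structure in the data.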

A more recent technique that does capture non-linear structure is t-SNE, which stands for t-distributed Stochastic Neighbor Embedding. It preserves structure in high-dimensional data mainly at a local level: if two points are close together in the high-dimensional space, they have a high probability of being close together in the low-dimensional embedding space.
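
As a rough illustration of that local behavior, the sketch below runs scikit-learn's TSNE on a slice of the handwritten digits dataset and then checks how often a point's nearest neighbor in the embedding was already one of its 10 nearest neighbors in the original space. The dataset, the perplexity value, and the "nearest" helper are choices made for this example only.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.neighbors import NearestNeighbors

# 500 handwritten digits, each a 64-dimensional pixel vector.
X = load_digits().data[:500]

# Embed into 2 dimensions; random_state pins the stochastic optimization.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

def nearest(points, k):
    """Indices of each point's k nearest neighbors, excluding the point itself."""
    model = NearestNeighbors(n_neighbors=k).fit(points)
    return model.kneighbors(return_distance=False)

# Fraction of points whose nearest neighbor in the embedding is also
# one of their 10 nearest neighbors in the original 64-dimensional space.
high_dim = nearest(X, 10)
low_dim = nearest(embedding, 1)[:, 0]
kept = sum(low_dim[i] in high_dim[i] for i in range(len(X))) / len(X)
print(embedding.shape, round(kept, 2))
```

Note that this only measures local neighborhoods. It says nothing about whether distances between far-apart clusters in the plot are meaningful.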

UMAP stands for Uniform Manifold Approximation and Projection. It’s the new kid on the dimensionality reduction block (in 2018), and it is very similar to t-SNE. If you compare visualizations created with t-SNE and UMAP, you might have a hard time telling them apart.

However, UMAP appears to have some significant advantages over t-SNE:

  • It’s faster than t-SNE.
  • It captures global structure better than t-SNE.
  • Best of all, while t-SNE doesn’t have much use outside of visualization, UMAP is a general-purpose dimensionality reduction technique that can be used as preprocessing for machine learning.
  • UMAP also has a solid theoretical backing as a manifold approximation technique, whereas t-SNE is primarily a visualization heuristic.

The main disadvantage of UMAP is its lack of maturity. It is a very new technique, so the libraries and best practices are not yet firmly established or robust. However, if you’re willing to be an early adopter, UMAP has a lot to offer.

Check out the paper on arXiv and the corresponding Python package on GitHub.

If you enjoyed this blog post, check out my side project Calculist.

Written by Dan Allison: Data Engineer, Coffee Drinker, Sketchbook Doodler
