UMAP: An alternative dimensionality reduction technique

Published in

MCD-UNISON

4 min read · May 30, 2023

UMAP, short for Uniform Manifold Approximation and Projection, is a powerful dimensionality reduction technique that has gained significant attention in the field of data analysis and visualization. Unfortunately, it is not very well known yet.

It is similar to t-SNE: it is a nonlinear method for dimensionality reduction that aims to represent high-dimensional data in a lower-dimensional space while preserving both local and global structure. However, UMAP uses a different mathematical approach than t-SNE, which can lead to different trade-offs and results.

An example of how we can map data from a high-dimensional space into a low-dimensional space.

UMAP is based on the concept of constructing a fuzzy topological representation of the high-dimensional data and then optimizing the low-dimensional representation to be as close as possible to this fuzzy topological structure. It leverages ideas from manifold learning, graph theory, and Riemannian geometry.
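
To make "as close as possible" more concrete: the closeness between the two fuzzy structures is usually described as a cross-entropy over edge membership strengths. The snippet below is only a minimal sketch of that objective, assuming the membership strengths are already given as numbers between 0 and 1. The function name fuzzy_cross_entropy and the example weights are made up for illustration, and the real library optimizes this quantity with stochastic gradient descent rather than evaluating it like this.

```python
import numpy as np

def fuzzy_cross_entropy(w_high, w_low, eps=1e-12):
    """Cross-entropy between two sets of edge membership strengths in [0, 1]."""
    w_high = np.asarray(w_high, dtype=float)
    # Clip to avoid log(0) and division by zero at the boundaries.
    w_high_c = np.clip(w_high, eps, 1.0 - eps)
    w_low = np.clip(np.asarray(w_low, dtype=float), eps, 1.0 - eps)
    # First term pulls together points that are neighbors in high dimensions.
    attract = w_high * np.log(w_high_c / w_low)
    # Second term pushes apart points that are not neighbors.
    repel = (1.0 - w_high) * np.log((1.0 - w_high_c) / (1.0 - w_low))
    return float(np.sum(attract + repel))

# Hypothetical membership strengths for four edges of the high-dimensional graph.
w_high = np.array([0.9, 0.8, 0.1, 0.05])

good_layout = np.array([0.85, 0.75, 0.15, 0.05])  # close to the high-dimensional graph
bad_layout = np.array([0.10, 0.20, 0.90, 0.95])   # neighbors and non-neighbors swapped

loss_good = fuzzy_cross_entropy(w_high, good_layout)
loss_bad = fuzzy_cross_entropy(w_high, bad_layout)
print(loss_good, loss_bad)
```

A layout whose edge strengths match the high-dimensional graph gets a loss near zero, while a layout that swaps neighbors and non-neighbors gets a much larger one.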

Here are some key features of UMAP:

Preserves Local and Global Structure: UMAP aims to preserve both local and global structure in the data when mapping it to a lower-dimensional space.
Flexibility in Parameter Tuning: UMAP provides several parameters that can be adjusted to control the trade-off between preserving local versus global structure, as well as other aspects of the embedding.
Scalability: UMAP is known for its scalability and ability to handle large datasets efficiently. It utilizes an approximate nearest neighbor search algorithm, which allows it to scale to millions of data points.
Speed: UMAP is generally faster than some other dimensionality reduction techniques, such as t-SNE.

Using UMAP in Python

I will show you how to use UMAP in Python. It’s very easy! You only have to follow these steps:

1. How can I use it? Of course, you have to install it first! (The leading ! runs the command from a notebook cell.)

!pip install umap-learn

2. Yeah! It’s installed now, but we still have to import the library. I had some problems using the library because I was importing it incorrectly, but with the following code you should be able to use it.

import umap.umap_ as umap

3. For this example, we also have to import some extra libraries to create sample data and plot the results.

import numpy as np
import matplotlib.pyplot as plt

4. We generate random data as an example: 100 samples in a 10-dimensional space.

np.random.seed(0)
n_samples = 100
n_features = 10
data = np.random.rand(n_samples, n_features)
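
Uniform random data like this has no cluster structure, so its embedding tends to look like a featureless cloud. If you would rather see groups in the final plot, here is a small sketch that builds three Gaussian clusters in 10 dimensions using only NumPy. The names clustered_data and labels, the number of clusters, and the noise levels are all my own choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_clusters = 3
points_per_cluster = 30
n_features = 10

# Hypothetical cluster centers, spread far apart relative to the noise level.
centers = rng.normal(loc=0.0, scale=5.0, size=(n_clusters, n_features))

# Each cluster is its center plus small Gaussian noise.
clustered_data = np.vstack([
    center + rng.normal(scale=0.5, size=(points_per_cluster, n_features))
    for center in centers
])

# One integer label per row, useful later for coloring a scatter plot.
labels = np.repeat(np.arange(n_clusters), points_per_cluster)

print(clustered_data.shape, labels.shape)
```

You could pass clustered_data to fit_transform instead of data, and give c=labels to plt.scatter to color each point by its cluster.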

5. We create our UMAP object with the following parameters. Don’t worry if you don’t know what these parameters mean. I will explain them later.

umap_obj = umap.UMAP(n_neighbors=5, min_dist=0.3, random_state=42)

6. In this step you only have to call the usual fit_transform method.

embedding = umap_obj.fit_transform(data)
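
A fitted UMAP object can also place new samples into the same low-dimensional space with its transform method. The following is a minimal sketch, assuming umap-learn is installed. The names reducer, train and new_points are only illustrative, and the try/except is there just so the script does not crash if the library is missing.

```python
import numpy as np

try:
    import umap.umap_ as umap  # installed by the umap-learn package
except ImportError:
    umap = None  # lets the sketch run (and do nothing) without umap-learn

rng = np.random.default_rng(0)
train = rng.random((100, 10))     # same shape as the example data in this article
new_points = rng.random((5, 10))  # hypothetical unseen samples

embedding = None
new_embedding = None
if umap is not None:
    reducer = umap.UMAP(n_neighbors=5, min_dist=0.3, random_state=42)
    # Learn the mapping and embed the training data.
    embedding = reducer.fit_transform(train)
    # Place the new samples in the same 2-dimensional space.
    new_embedding = reducer.transform(new_points)
    print(embedding.shape, new_embedding.shape)
```

Each result has one row per sample and one column per embedding dimension, which is 2 by default.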

7. Now you have your data in the embedding array, which is a 2-dimensional dataset. We plot it with matplotlib.

plt.scatter(embedding[:, 0], embedding[:, 1])
plt.title('UMAP Embedding')
plt.xlabel('UMAP Dimension 1')
plt.ylabel('UMAP Dimension 2')
plt.show()

If we execute the code, we’ll get a plot like the one shown below. All of the points are now in a 2-dimensional space. That’s so cool!

UMAP parameters

We have some parameters that we can adjust depending on the data we are using. For example, the code above sets only three of them: n_neighbors, min_dist, and random_state.

n_neighbors: This parameter determines the number of nearest neighbors used to construct the neighborhood graph. Increasing n_neighbors can capture more global structure, but may also increase computation time. A typical value is between 5 and 50, depending on the size and density of the dataset.
min_dist: This parameter controls how tightly points may be packed together in the low-dimensional embedding. A smaller value of min_dist allows for tighter clustering, but can result in crowded visualizations. Increasing min_dist encourages more even spacing between points.
n_components: This parameter specifies the dimensionality of the low-dimensional embedding. By default, it is set to 2, which allows for visualization in a 2D plot. However, you can choose a higher value to obtain a higher-dimensional embedding if needed.
metric: UMAP supports various distance metrics to measure similarity between data points in the high-dimensional space. The default is Euclidean distance, but you can try other metrics such as manhattan, cosine, or mahalanobis.
random_state: If we need reproducible results, we can set a seed with random_state.
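
To get a feeling for what n_neighbors, metric and min_dist do, here is a rough NumPy-only sketch. It is not how the library is implemented. It uses a brute-force neighbor search instead of an approximate one, it excludes each point from its own neighbor list for simplicity, and the helper names (distance_matrix, knn_indices, min_dist_target) are hypothetical. The min_dist curve reflects my understanding of the target shape that umap-learn fits a smooth function to, so treat it as an assumption.

```python
import numpy as np

def distance_matrix(data, metric="euclidean"):
    """Brute-force distance matrix for a few common metrics."""
    diff = data[:, None, :] - data[None, :, :]
    if metric == "euclidean":
        return np.sqrt((diff ** 2).sum(axis=-1))
    if metric == "manhattan":
        return np.abs(diff).sum(axis=-1)
    if metric == "cosine":
        unit = data / np.linalg.norm(data, axis=1, keepdims=True)
        return 1.0 - unit @ unit.T
    raise ValueError(f"unsupported metric: {metric}")

def knn_indices(data, n_neighbors, metric="euclidean"):
    """Indices of each point's n_neighbors nearest neighbors, itself excluded."""
    dist = distance_matrix(data, metric)
    np.fill_diagonal(dist, np.inf)  # simplification: a point is not its own neighbor
    return np.argsort(dist, axis=1)[:, :n_neighbors]

def min_dist_target(d, min_dist, spread=1.0):
    """Target similarity in the embedding: flat at 1 up to min_dist, then decaying."""
    d = np.asarray(d, dtype=float)
    return np.where(d <= min_dist, 1.0, np.exp(-(d - min_dist) / spread))

rng = np.random.default_rng(0)
data = rng.random((100, 10))

# A larger n_neighbors links every point to more of the dataset.
small_graph = knn_indices(data, n_neighbors=5)
large_graph = knn_indices(data, n_neighbors=50)
print(small_graph.shape, large_graph.shape)

# A different metric can change which points count as neighbors.
manhattan_graph = knn_indices(data, n_neighbors=5, metric="manhattan")

# With a larger min_dist, the similarity stays at 1 over a wider range of distances.
distances = np.array([0.0, 0.1, 0.3, 1.0])
print(min_dist_target(distances, min_dist=0.1))
print(min_dist_target(distances, min_dist=0.3))
```

With n_neighbors=50 each point is connected to half of this small dataset, which is why larger values emphasize global structure. With a larger min_dist, points gain nothing from moving closer together than that distance, so the layout spreads out more evenly.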

If you want to see how some of these parameters work, I built a Streamlit app where you can play with some random datasets by changing their values. You can follow the link below.

Feel free to use it!

Final thoughts

UMAP is a powerful dimensionality reduction technique and it’s my favorite, but like other techniques it is not perfect for every dataset, so we should try different methods depending on the dataset we are working with. If you want to know more about this method, you can visit the following page.


Written by Fernando Luna