Explainability for Text Data: 3D Visualization of Token Embeddings using PCA, t-SNE, and UMAP

Madhu Rajesh
Aug 2, 2023


Token embeddings play a crucial role in natural language processing (NLP) tasks, as they encode the contextual information of words and phrases into dense numerical vectors. These embeddings are often high-dimensional, making it challenging to gain meaningful insights directly from the data.

To address this issue, dimensionality reduction techniques such as Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), and Uniform Manifold Approximation and Projection (UMAP) are employed to transform the embeddings into lower-dimensional representations while preserving their essential characteristics.

Visualizing token embeddings in a reduced-dimensional space is a powerful way to uncover the relationships, clusters, and patterns hidden within text data. The aim is to capture semantic similarities and differences among tokens, which can aid in tasks like text classification, sentiment analysis, and information retrieval.

  1. PCA (Principal Component Analysis): PCA is a linear dimensionality reduction technique that transforms high-dimensional data into a lower-dimensional representation by finding orthogonal axes (principal components) that capture the most significant variance in the data.
  2. t-SNE (t-Distributed Stochastic Neighbor Embedding): t-SNE is a nonlinear dimensionality reduction method that preserves local similarities among data points in the original high-dimensional space, making it effective for visualizing complex data patterns and identifying clusters.
  3. UMAP (Uniform Manifold Approximation and Projection): UMAP is a nonlinear dimensionality reduction technique that preserves both local and global structure in the data by approximating the manifold on which the data points lie, making it well-suited for visualizing high-dimensional data with complex relationships. A minimal code sketch of all three techniques follows below.
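The sketch uses synthetic data in place of real embeddings and illustrative parameter values; it simply shows that all three reducers expose the same fit_transform interface, which the rest of the article applies to actual BERT token embeddings.

# Minimal sketch: the three reducers share a common fit_transform interface.
# The data here is synthetic and stands in for 768-dimensional token embeddings.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap

X = np.random.RandomState(0).normal(size=(100, 768))

X_pca = PCA(n_components=3).fit_transform(X)
X_tsne = TSNE(n_components=3, perplexity=5, random_state=42).fit_transform(X)
X_umap = umap.UMAP(n_components=3, n_neighbors=5, random_state=42).fit_transform(X)

print(X_pca.shape, X_tsne.shape, X_umap.shape)  # each (100, 3)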

Here is a comparative look at PCA, t-SNE, and UMAP, and at how to choose among them.

Choosing the right visualization:

Selecting the best dimensionality reduction technique for your data depends on the specific goals and characteristics of your data, as well as the insights you want to gain from the visualization. Each technique has its own strengths and weaknesses, and the “best” one is context-dependent. Here are some factors to consider when choosing the most suitable technique for your sample data:

Data Structure and Dimensionality: Consider the inherent structure of your data and its dimensionality. If your data has a linear structure, PCA might be effective in preserving variance. For complex nonlinear relationships, t-SNE or UMAP might be more suitable.

Local vs. Global Structure: Decide whether you want to emphasize local or global structures in your visualization. t-SNE is known for preserving local structures, while UMAP aims to balance local and global preservation. If global patterns are essential, UMAP could be preferred.

Clustering and Separability: If you are interested in finding distinct clusters in your data, t-SNE and UMAP often perform well in creating well-separated groups. Consider how well the technique separates different clusters.

Runtime and Efficiency: Some techniques might be computationally more expensive than others, especially for large datasets. Consider the time and resources available for computation.

Interpretability and Simplicity: PCA is interpretable and provides easily understandable principal components, whereas t-SNE and UMAP are more complex and may require further exploration to interpret.

Robustness to Hyperparameters: Some techniques, like t-SNE, have hyperparameters (e.g., perplexity, learning rate) that can significantly impact the results. Consider how sensitive the chosen technique is to these parameters.

Reproducibility: Stochastic techniques like t-SNE (and, by default, UMAP) can give slightly different results across runs. If you need reproducibility, fix the random seed; both libraries accept one, as in the sketch at the end of this section.

Domain Knowledge: Finally, consider your domain knowledge and the context of your data. The most suitable technique might vary based on the specific characteristics of the data and the insights you seek.

A common approach is to try multiple techniques and visually inspect the resulting plots to gain insights and identify patterns. Additionally, you can evaluate the effectiveness of each technique based on how well it aligns with your knowledge of the data and the insights it reveals. There’s no one-size-fits-all answer, so it’s often beneficial to experiment with different techniques and choose the one that provides the most meaningful and interpretable representation of your data.
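Beyond visual inspection, scikit-learn's trustworthiness score gives a rough quantitative check of how well local neighborhoods survive the reduction. The sketch below uses synthetic stand-in data and illustrative parameters, and fixes random_state so that the stochastic methods are repeatable:

# Trustworthiness close to 1.0 means points that were neighbors in the original
# high-dimensional space are still neighbors in the reduced space.
import numpy as np
import umap
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, trustworthiness

X = np.random.RandomState(0).normal(size=(100, 768))  # stand-in for token embeddings

reduced = {
    "PCA": PCA(n_components=3).fit_transform(X),
    "t-SNE": TSNE(n_components=3, perplexity=5, random_state=42).fit_transform(X),
    "UMAP": umap.UMAP(n_components=3, n_neighbors=5, random_state=42).fit_transform(X),
}
for name, emb in reduced.items():
    print(name, round(trustworthiness(X, emb, n_neighbors=5), 3))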

Visualization

Let’s demonstrate the step-by-step process of visualizing token embeddings obtained from the BERT (Bidirectional Encoder Representations from Transformers) language model.

Before performing dimensionality reduction, we preprocess the data by removing special tokens, subword tokens, and stopwords. This ensures that the visualizations focus on the most meaningful and informative tokens in the text.

By creating interactive 3D scatter plots with Plotly, the code facilitates an exploratory approach to understanding the token embeddings in reduced dimensions. Users can interact with the plots, zoom in on specific clusters, and hover over data points to examine individual tokens. The comparative visualization using PCA, t-SNE, and UMAP empowers users to evaluate the strengths and limitations of each technique for their specific use cases.

import numpy as np
import torch
import plotly.express as px
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap
from transformers import BertTokenizer, BertModel
import pandas as pd
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

# Load BERT model and tokenizer
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)


text = "The diligent student diligently studied hard for his upcoming exams He was incredibly conscientious in his efforts and committed himself to mastering every subject"

# Tokenize and get BERT embeddings
tokens = tokenizer(text, return_tensors='pt', padding=True, truncation=True)
with torch.no_grad():
    outputs = model(**tokens)
embeddings = outputs.last_hidden_state.squeeze(0).numpy()  # Shape: (num_tokens, 768) for BERT-base

# Get token labels and keep only meaningful tokens: drop special tokens ([CLS], [SEP]),
# subword pieces (##...), and stopwords, remembering the index of each retained token
labels = tokenizer.convert_ids_to_tokens(tokens.input_ids[0].tolist())
stop_words = set(stopwords.words('english'))
keep_idx = [i for i, label in enumerate(labels)
            if not (label.startswith('[') and label.endswith(']'))
            and not label.startswith('##')
            and label.lower() not in stop_words]
filtered_labels = [labels[i] for i in keep_idx]
filtered_embeddings = embeddings[keep_idx]  # embeddings stay aligned with the retained labels

# Perform PCA for dimensionality reduction (3D)
pca = PCA(n_components=3)
embeddings_pca = pca.fit_transform(filtered_embeddings)

# Convert embeddings and labels to DataFrame for Plotly
data_pca = {'x': embeddings_pca[:, 0], 'y': embeddings_pca[:, 1], 'z': embeddings_pca[:, 2], 'label': filtered_labels}
df_pca = pd.DataFrame(data_pca)

# Plot PCA in 3D with Plotly (interactive)
fig_pca = px.scatter_3d(df_pca, x='x', y='y', z='z', text='label', title='PCA 3D Visualization of Token Embeddings',
                        labels={'x': 'Dimension 1', 'y': 'Dimension 2', 'z': 'Dimension 3'}, hover_name='label')
fig_pca.update_traces(marker=dict(size=5), textfont=dict(size=8))
fig_pca.show()

# Perform t-SNE for dimensionality reduction (3D)
tsne = TSNE(n_components=3, perplexity=5, learning_rate=200)
embeddings_tsne = tsne.fit_transform(filtered_embeddings)

# Convert embeddings and labels to DataFrame for Plotly
data_tsne = {'x': embeddings_tsne[:, 0], 'y': embeddings_tsne[:, 1], 'z': embeddings_tsne[:, 2], 'label': filtered_labels}
df_tsne = pd.DataFrame(data_tsne)

# Plot t-SNE in 3D with Plotly (interactive)
fig_tsne = px.scatter_3d(df_tsne, x='x', y='y', z='z', text='label', title='t-SNE 3D Visualization of Token Embeddings',
                         labels={'x': 'Dimension 1', 'y': 'Dimension 2', 'z': 'Dimension 3'}, hover_name='label')
fig_tsne.update_traces(marker=dict(size=5), textfont=dict(size=8))
fig_tsne.show()

# Perform UMAP for dimensionality reduction (3D)
umap_model = umap.UMAP(n_neighbors=3, min_dist=0.1, n_components=3)
embeddings_umap = umap_model.fit_transform(filtered_embeddings)

# Convert embeddings and labels to DataFrame for Plotly
data_umap = {'x': embeddings_umap[:, 0], 'y': embeddings_umap[:, 1], 'z': embeddings_umap[:, 2], 'label': filtered_labels}
df_umap = pd.DataFrame(data_umap)

# Plot UMAP in 3D with Plotly (interactive)
fig_umap = px.scatter_3d(df_umap, x='x', y='y', z='z', text='label', title='UMAP 3D Visualization of Token Embeddings',
                         labels={'x': 'Dimension 1', 'y': 'Dimension 2', 'z': 'Dimension 3'}, hover_name='label')
fig_umap.update_traces(marker=dict(size=5), textfont=dict(size=8))
fig_umap.show()

Hyperparameters

# Perform PCA for dimensionality reduction (3D)
pca = PCA(n_components=3)
embeddings_pca = pca.fit_transform(filtered_embeddings)

  • n_components: The number of components to reduce the data to. Here, we set it to 3 to visualize the data in a 3D space.
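It is also worth checking how much of the original variance those three components retain; a quick check reusing the pca object fitted above:

# explained_variance_ratio_ reports the share of variance captured by each component
print(pca.explained_variance_ratio_)        # three values, one per component
print(pca.explained_variance_ratio_.sum())  # total variance retained in the 3D view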

# Perform t-SNE for dimensionality reduction (3D)
tsne = TSNE(n_components=3, perplexity=5, learning_rate=200)
embeddings_tsne = tsne.fit_transform(filtered_embeddings)
  • n_components: The number of components to reduce the data to. Here, we set it to 3 for 3D visualization.
  • perplexity: Balances the emphasis on local versus global structure. Higher values treat more points as neighbors, yielding a more global view; perplexity must be smaller than the number of data points.
  • learning_rate: Controls the step size during optimization. Values that are too high or too low can both produce distorted embeddings, so moderate values usually work best. A small perplexity sweep is sketched below.
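Because t-SNE is sensitive to these settings, a quick sweep (reusing filtered_embeddings from the code above; the values are illustrative) helps gauge how stable the layout is:

# Refit t-SNE with a few perplexity values and report the final KL divergence.
# KL values are only a rough guide: they are not strictly comparable across perplexities.
for p in (2, 5, 8):
    t = TSNE(n_components=3, perplexity=p, learning_rate=200, random_state=42)
    t.fit_transform(filtered_embeddings)
    print(f"perplexity={p}, KL divergence={t.kl_divergence_:.3f}")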
# Perform UMAP for dimensionality reduction (3D)
umap_model = umap.UMAP(n_neighbors=3, min_dist=0.1, n_components=3)
embeddings_umap = umap_model.fit_transform(filtered_embeddings)
  • n_components: The number of components to reduce the data to. Here, we set it to 3 for 3D visualization.
  • n_neighbors: The number of neighbors to consider when constructing the local neighborhood for each data point. Higher values can capture more global structure.
  • min_dist: Controls how tightly UMAP packs points in the low-dimensional space. Smaller values result in more compact clusters. A short sweep over these settings is sketched below.
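The sketch below refits UMAP with a few illustrative values of n_neighbors (again reusing filtered_embeddings) to make the local-versus-global trade-off visible:

# Smaller n_neighbors emphasizes fine local detail; larger values pull in more global structure.
for n in (2, 3, 5):
    reducer = umap.UMAP(n_neighbors=n, min_dist=0.1, n_components=3, random_state=42)
    emb = reducer.fit_transform(filtered_embeddings)
    print(f"n_neighbors={n}, per-axis spread={emb.std(axis=0).round(2)}")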

For all three techniques, we use filtered_embeddings, which is the reduced set of token embeddings after removing special tokens, subword tokens, and stopwords from the original paragraph.

Finally, the reduced embeddings are converted to a DataFrame using Pandas, and we use Plotly to create interactive 3D scatter plots to visualize the token embeddings in the reduced 3D space. The hover_name parameter in the Plotly scatter plots allows us to see the corresponding token labels when hovering over data points in the visualization.
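The interactive figures can also be exported as standalone HTML files for sharing; the file names below are illustrative.

# Each figure becomes a self-contained HTML page that opens in any browser
fig_pca.write_html("tokens_pca_3d.html")
fig_tsne.write_html("tokens_tsne_3d.html")
fig_umap.write_html("tokens_umap_3d.html")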

In summary, this code provides a hands-on demonstration of how dimensionality reduction techniques can be applied to visualize token embeddings in 3D space.

The resulting visualizations enable data scientists, NLP researchers, and practitioners to gain valuable insights into the structure and relationships within the text data, thus contributing to more effective and accurate NLP applications.

Conclusion

In this article, we explored the visualization of token embeddings using dimensionality reduction techniques — PCA, t-SNE, and UMAP. By leveraging these methods, we successfully transformed high-dimensional token embeddings into lower-dimensional representations, facilitating a more intuitive understanding of the underlying patterns and relationships within the text data.

Through a comparative study of PCA, t-SNE, and UMAP visualizations, we gained valuable insights into the strengths and limitations of each technique. PCA, being a linear method, provided a clear reduction in dimensions while preserving global variances, but may struggle to capture intricate local patterns. On the other hand, t-SNE’s nonlinear approach excelled at revealing local structures and clusters, making it an ideal choice for visualizing complex and densely packed data points. UMAP, with its focus on both local and global structures, showcased a balanced representation, particularly suitable for handling large datasets.

One of the key benefits of data visualization in this context was model explainability. The visual representation of token embeddings allowed us to observe how words and phrases clustered together based on their contextual similarities. This clustering aided in understanding the semantic relationships among tokens, enabling us to interpret why certain tokens were grouped together or distinct from others. The transparency provided by visualizations empowered us to comprehend how the NLP model’s decision-making process was influenced by the input tokens.

Furthermore, the interactive nature of the plots allowed for an exploratory approach, enabling researchers to identify specific tokens of interest and explore their context in the dataset. This level of granularity and interpretability is invaluable when analyzing the model’s performance and fine-tuning the NLP tasks for better results.

In conclusion, the insights gained from this study, made possible by data visualization, not only improved our understanding of token embeddings and their relationships but also enhanced the model’s explainability. With this enhanced explainability, we can confidently apply and interpret the NLP model’s predictions, thereby building more transparent, robust, and reliable natural language processing applications. The combination of dimensionality reduction and data visualization serves as a powerful tool for researchers and practitioners seeking to unlock the mysteries of text data and make well-informed decisions in the field of NLP.

#TokenEmbeddings #DimensionalityReduction #PCA #tSNE #UMAP #BERT #NLP #NaturalLanguageProcessing #DataVisualization #MachineLearning #TextAnalytics #DataScience #Plotly #Python #Colab #InteractivePlots #ModelExplainability #SemanticSimilarity #TextClustering #WordEmbeddings #VisualizingWordEmbeddings #DataPreprocessing #ExploratoryDataAnalysis #ModelInterpretability
