Visualizing Higher Dimensional Data Using t-SNE On TensorBoard

Chiranjeevi Vegi
5 min read · Feb 16, 2018


Image source: https://www.quora.com/What-is-the-curse-of-dimensionality

With data increasing at an exponential rate, datasets now have millions of observations and attributes/features. One might argue the more data, the merrier, but that is not always the case. For example, when working on a text classification problem, reducing the dimensions of my dataset from 114,432 to 40,000 resulted in a 1% increase in accuracy. Datasets with many dimensions/features are subject to what is called the “curse of dimensionality.” This problem pushed researchers to explore dimensionality reduction procedures. Two of these dimensionality reduction techniques are:

Principal Component Analysis

Image source: https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/principal_component_analysis.html

t-distributed stochastic neighbor embedding (t-SNE)

Image source: https://zaburo-ch.github.io/post/parametric-tsne-keras/

The math behind these dimensionality reduction techniques is explained in detail in this article and, more intuitively, in this video.

Now that we have established that dimensionality reduction is important, let’s think about how one can visualize higher-dimensional data (more than 3 dimensions, obviously!).

Image source: http://www.startalkallaccess.com/

More importantly, visualizing higher-dimensional data helps us understand the results (clusters) from unsupervised learning and thus make improvements to the model. While many methods are available for evaluating supervised learning results, very few are available to help visually assess unsupervised results. To aid our cause, t-SNE does an outstanding job of projecting higher-dimensional data into 3-D. For this, we have well-established libraries in Python and R. However, those visualizations are static, have limited features, and are dull. Fortunately, we have TensorBoard, which can help us visualize higher-dimensional data using PCA and t-SNE with very minimal code or no code at all. Here’s an example of a visualization with TensorBoard.

Image source: http://motojapan.hateblo.jp/entry/2017/09/04/083814

Let’s get started generating t-SNE visualizations on TensorBoard with our own data. The steps involved:

Required libraries: TensorFlow, Pandas, NumPy, sklearn (PCA, StandardScaler). You can also create an environment using the .yml file found here. To build the environment, run the following command in the terminal (macOS) or the Anaconda prompt (Windows):

conda env create -f filename.yml

Before jumping into the code to visualize higher-dimensional data:

  1. Apply a standard scaler to numeric features and create dummy variables for categorical data.
  2. For better results with t-SNE, first reduce your dataset to about 50 features, or to the number of PCA components that explain at least 80% of the variance in your data.
  3. If your data is not labeled, predict clusters/labels using unsupervised learning methods. In fact, this visualization method helps immensely in understanding clustering results. A short preprocessing sketch follows this list.
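For illustration, here is a minimal sketch of those preprocessing steps. The input file raw_data.csv, the DataFrame name raw_df, the categorical column names in cat_cols, and the choice of 3 clusters are all hypothetical placeholders, not part of the original article:

## Hypothetical preprocessing sketch: one-hot encode categorical columns,
## scale all features, reduce to 50 PCA components, and generate cluster labels.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

raw_df = pd.read_csv("raw_data.csv")        # hypothetical input file
cat_cols = ["category_a", "category_b"]     # hypothetical categorical columns

## One-hot encode categorical columns
encoded = pd.get_dummies(raw_df, columns=cat_cols)

## Standardize features so PCA/t-SNE are not dominated by large-scale columns
scaled = StandardScaler().fit_transform(encoded)

## Keep 50 components (or pass n_components=0.8 to keep 80% of the variance)
reduced = PCA(n_components=50, random_state=123).fit_transform(scaled)

## If the data has no labels, generate cluster labels for the metadata file
labels = KMeans(n_clusters=3, random_state=123).fit_predict(reduced)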

One can generate t-SNE visualizations on TensorBoard using two methods

First method: The Pythonic Way

Running the code below generates the necessary files, such as the data embeddings, metadata, checkpoints, and TensorFlow variables, that TensorBoard reads during startup.

CODE

## Importing required libraries
import os
import tensorflow as tf
from tensorflow.contrib.tensorboard.plugins import projector
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

## Get working directory
PATH = os.getcwd()

## Path to save the embeddings and checkpoints generated
LOG_DIR = PATH + '/project-tensorboard/log-1/'
## Create the log directory if it does not exist yet
if not os.path.exists(LOG_DIR):
    os.makedirs(LOG_DIR)

## Load data
df = pd.read_csv("scaled_data.csv", index_col=0)

## Path to the metadata file. Metadata consists of your labels and is optional.
## Metadata helps us visualize (color) the different clusters that t-SNE forms.
metadata = os.path.join(LOG_DIR, 'df_labels.tsv')

## Generating PCA components
pca = PCA(n_components=50,
          random_state=123,
          svd_solver='auto')
df_pca = pd.DataFrame(pca.fit_transform(df))
df_pca = df_pca.values

## TensorFlow Variable from data
tf_data = tf.Variable(df_pca)

## Running TensorFlow Session
with tf.Session() as sess:
    saver = tf.train.Saver([tf_data])
    sess.run(tf_data.initializer)
    saver.save(sess, os.path.join(LOG_DIR, 'tf_data.ckpt'))
    config = projector.ProjectorConfig()
    # One can add multiple embeddings.
    embedding = config.embeddings.add()
    embedding.tensor_name = tf_data.name
    # Link this tensor to its metadata (labels) file
    embedding.metadata_path = metadata
    # Saves a config file that TensorBoard will read during startup.
    projector.visualize_embeddings(tf.summary.FileWriter(LOG_DIR), config)
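Note that the script above assumes a metadata file named df_labels.tsv already exists in LOG_DIR. The snippet below that writes it is my own sketch, not the author's code; for a single column of labels, the projector expects one label per line with no header. Here, labels is assumed to be an iterable of labels aligned with the rows of df_pca (for example, the KMeans output from the earlier preprocessing sketch):

## Hypothetical sketch: write one label per row to the metadata file that the
## projector config points at.
with open(os.path.join(LOG_DIR, 'df_labels.tsv'), 'w') as f:
    for label in labels:
        f.write('{}\n'.format(label))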

Now, open the terminal and run the following command

tensorboard --logdir=C:\Users\name\Desktop\Files\project-tensorboard/log-1 --port=6006

The Result:

We see three clusters being formed. However, our unsupervised learning did not do a good job of identifying these clusters. Thus, this lets us rely on a visual aid alongside popular unsupervised performance metrics to improve our model.

In the above visualization, the different colors come from the metadata (label) embeddings. TensorBoard supports multiple embeddings such as images, text, etc. Check out the pre-built examples on visualizing with multiple embeddings.
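As an example of one of those extra embedding types, the projector config can also point at a sprite sheet when each row of your data corresponds to a small image. The file name and thumbnail size below are hypothetical:

## Hypothetical sketch: attach a sprite sheet of thumbnails to the embedding so
## the projector draws an image for each point instead of a dot.
embedding.sprite.image_path = os.path.join(LOG_DIR, 'sprite.png')  # assumed sprite file
embedding.sprite.single_image_dim.extend([28, 28])                 # thumbnail size in pixels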

The second method of obtaining the above plots is to manually load the data and metadata (labels) into the TensorBoard projector interface.
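For that route, the projector’s Load panel expects tab-separated files. One way to prepare the data file is a sketch like this (df_pca is the PCA-reduced array from the code above; the df_labels.tsv file written earlier can be reused as the metadata upload):

## Hypothetical sketch: export the PCA-reduced data as a TSV file that can be
## uploaded manually through the projector's "Load" panel.
import numpy as np

np.savetxt('data.tsv', df_pca, delimiter='\t')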

Points to remember about t-SNE:

Distill published the best document there is on using t-SNE effectively, with interactive visualizations. Here is the link to the article. The important points from the article are:

  1. Hyperparameters really matter

t-SNE has a few important parameters (a scikit-learn sketch showing where to set them follows this list):

a. Perplexity, which loosely balances attention between local and global aspects of your data. The parameter is, in a sense, a guess about the number of close neighbors each point has.

b. Epsilon, also known as the learning rate.

c. Iterations or steps: always train until the embedding stabilizes.

2. The t-SNE algorithm doesn’t always produce similar output on successive runs, and there are additional hyperparameters related to the optimization process.

3. Cluster sizes in a t-SNE plot mean nothing

4. Distances between clusters might not mean anything
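If you want to experiment with these hyperparameters outside TensorBoard, scikit-learn’s TSNE exposes them directly. A minimal sketch, with parameter values chosen purely for illustration:

## Minimal sketch of a standalone t-SNE run; the values below are illustrative,
## not recommendations.
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2,      # embed into 2-D
            perplexity=30,       # guess at the number of close neighbors per point
            learning_rate=200,   # the "epsilon" mentioned above
            n_iter=1000,         # train until the embedding stabilizes
            random_state=123)    # successive runs can still look different
embedding_2d = tsne.fit_transform(df_pca)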

Summary

Visualizing our data using t-SNE helps us understand how our unsupervised learning models are performing. In the video, the colored labels are predicted by unsupervised learning methods, and the t-SNE plot shows significant overlap between clusters. This is also verified by the silhouette coefficient of the trained model, which is 0.1765 in this case. Other use cases include evaluating features generated in NLP by different methods such as bag of words, TF-IDF, Word2Vec, etc.
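For reference, the silhouette coefficient mentioned above can be computed with scikit-learn. A sketch, assuming df_pca holds the features and labels holds the predicted clusters from the earlier steps:

## Sketch: compute the silhouette coefficient for the predicted clusters.
## Values near 1 indicate well-separated clusters; values near 0 (like the
## 0.1765 reported above) indicate heavily overlapping clusters.
from sklearn.metrics import silhouette_score

score = silhouette_score(df_pca, labels)
print("Silhouette coefficient: {:.4f}".format(score))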

So, that’s all for now. Hope you find this helpful and that it fits into your data science arsenal.

You can find the code for using images as embeddings on my GitHub. This article is co-authored by Ashish Khan, a machine learning enthusiast with interests in Android apps, web design, and data science. Check out his website here for fun and exciting things one can do with DATA.


Chiranjeevi Vegi

Creating "Value" from data. In my 9-5 role, leading a Data Science & Data Engineering team @ BMC Software.