Interactive Data Visualization
Prerequisites for this post: basic knowledge of Python and Jupyter Notebook; basic understanding of machine learning.
In this post, I will show how to build fancy, interactive data visualizations like this one.
But before we jump into that, let’s start by understanding why we need to visualize data.
Why Do We Need Data Visualization?
Data visualization is a group of techniques that convert giant datasets with hundreds or thousands of dimensions into a 2D/3D representation, so that we tiny human beings can actually understand them.
There are usually two reasons for using it:
- Visualizing raw data. Once the data is represented in 2D/3D space, it is easy to identify patterns. If there are categorical labels for each data point, we can easily tell which categories are similar and which are further apart, and this gives us a sense of how complicated a model we need for the dataset. For example, if the different categories already sit in their own well-separated chunks on our 2D/3D graph, we probably do not need to bring out Excalibur to kill a chicken.
- Visualizing encoded features/embeddings computed from the raw data. Here we want to know whether these features make sense, i.e. whether the encoder model has learned the right things. This is especially relevant for image datasets, where the graph can tell us whether the encoded features correspond to semantic features that are recognizable by humans.
In this post, I will introduce techniques for data visualization, give some hands-on examples, and, of course, make it fun by making it interactive.
Data Visualization Techniques
There are many ways to visualize data; the two most popular techniques are PCA and t-SNE:
- PCA: Principal Component Analysis. As the name suggests, it finds the most important and relevant components to represent the data. In essence, the algorithm reduces dimensionality in a way that minimizes information loss. There is a more detailed explanation for those who are curious.
- t-SNE: t-Distributed Stochastic Neighbor Embedding, a more complicated name. Essentially, this algorithm looks at the distribution of the data and tries to represent the same distribution with fewer dimensions. Again, there is more detailed information for those who want to dig deeper.
The t-SNE algorithm is more computationally expensive and time-consuming than PCA. However, PCA has its own limitation: it is a linear dimensionality-reduction technique. In general, t-SNE can model the true distribution of the data better than PCA.
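To make the PCA idea less abstract, here is a minimal numpy sketch of my own (not part of the tutorial code): center the data, then project it onto the two directions that capture the most variance.
import numpy as np

np.random.seed(0)
data = np.random.randn(100, 50)  # 100 points living in 50 dimensions

# PCA via SVD: center the data, then keep the top-2 right singular vectors
centered = data - data.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
data_2d = centered.dot(Vt[:2].T)  # (100, 2) projection with minimal squared-error loss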
The Static Data Visualization
I would like to start with non-interactive data visualization, as it is simpler to implement. Another reason is that the benefits of interactive data visualization are better appreciated once you have some experience with the non-interactive version.
But if you feel adventurous and ready to take on the challenge, feel free to jump to the next section :)
Here is the GitHub link to the full code, but please allow me to guide you through it step by step.
The Python packages we need for static data visualization are numpy, matplotlib and sklearn. Please pip install them if you have not done so, and import them into the project:
from time import time  # used below to measure how long each projection takes

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import offsetbox
from sklearn import manifold, datasets, decomposition
We will use the classic handwritten digits dataset that ships with sklearn (8 × 8 images, in the spirit of MNIST) to demonstrate raw data visualization, so let’s get the data ready:
digits = datasets.load_digits(n_class=10)
X = digits.data
y = digits.target
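It can be handy to sanity-check what we just loaded; each digit is an 8 × 8 grayscale image flattened into a 64-dimensional vector:
print(X.shape, y.shape)  # (1797, 64) (1797,): 1797 digits, 64 dimensions each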
Let’s define the plotting function we will use later. This simple function mainly does three things. First, it scales the values to fit the plot. Then it draws each data point as colored text showing its label. Finally, it adds an AnnotationBbox for some samples, showing the actual image of the data point.
def plot_MNIST(X, title=None):
    # scale the values to fit the plot
    x_min, x_max = np.min(X, 0), np.max(X, 0)
    X = (X - x_min) / (x_max - x_min)

    plt.figure(figsize=(10, 10))
    ax = plt.subplot(111)
    # draw each data point as its label, colored by category
    for i in range(X.shape[0]):
        plt.text(X[i, 0], X[i, 1], str(y[i]),
                 color=plt.cm.Set1(y[i] / 10.),
                 fontdict={'weight': 'bold', 'size': 9})

    if hasattr(offsetbox, 'AnnotationBbox'):
        # only print thumbnails with matplotlib > 1.0
        shown_images = np.array([[1., 1.]])  # just something big
        for i in range(X.shape[0]):
            dist = np.sum((X[i] - shown_images) ** 2, 1)
            if np.min(dist) < 5e-3:
                # don't show points that are too close
                continue
            shown_images = np.r_[shown_images, [X[i]]]
            imagebox = offsetbox.AnnotationBbox(
                offsetbox.OffsetImage(digits.images[i], cmap=plt.cm.gray_r),
                X[i])
            ax.add_artist(imagebox)

    plt.xticks([]), plt.yticks([])
    if title is not None:
        plt.title(title)
Then we use a pre-built sklearn function to compute the PCA projection. As you can guess, n_components=2 indicates that the result should have 2 dimensions.
print("Computing PCA projection")
t0 = time()
X_pca = decomposition.TruncatedSVD(n_components=2).fit_transform(X)
plot_MNIST(X_pca,
"Principal Components projection of the digits (time %.2fs)" %
(time() - t0))plt.show()
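A quick aside: TruncatedSVD is used here as a stand-in for PCA. The practical difference is that sklearn’s PCA centers the data (subtracts the feature means) before decomposing it, while TruncatedSVD works on the raw matrix. If you prefer the textbook version, this one-line swap should produce a very similar projection for this dataset:
# decomposition.PCA centers the data first, unlike TruncatedSVD
X_pca = decomposition.PCA(n_components=2).fit_transform(X)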
Below is the result of the PCA visualization. Processing is blazing fast, taking only 0.01s. We can tell that digits of the same category gather together, but the clusters overlap each other, and there is a big mess in the center of the graph.
Let’s try t-SNE then. Again we use a pre-built sklearn function to compute t-SNE. Besides our old friend n_components, there is another parameter, init='pca', which initializes the embedding with PCA to better preserve the global structure of the distribution.
print("Computing t-SNE embedding")
t0 = time()
X_tsne = manifold.TSNE(n_components=2, init='pca').fit_transform(X)
plot_MNIST(X_tsne,
"t-SNE embedding of the digits (time %.2fs)" %
(time() - t0))
plt.show()
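One knob worth knowing about, although the code above leaves it at its default, is perplexity (default 30), which roughly controls how many neighbors each point pays attention to. A hedged variant for experimenting:
# smaller perplexity emphasizes local structure, larger values favor global structure;
# random_state pins down the otherwise stochastic result
X_tsne = manifold.TSNE(n_components=2, init='pca',
                       perplexity=30, random_state=0).fit_transform(X)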
In the t-SNE result, the different handwritten digits are well separated. It is a much clearer representation than PCA’s, and also a clear indication that this dataset does not need a very complicated model for classification.
The Interactive Data Visualization
We successfully implemented PCA and t-SNE data visualization in the previous section. But such visualization is static: we cannot zoom in to an area we are interested in, we cannot do cool things like rotating the data points in 3D space, and it definitely lacks a beautiful interface.
TensorBoard is the magical tool that provides all of this. It is the visualization tool in the TensorFlow family. It natively supports TensorFlow, and has recently opened its borders to PyTorch too. We will use TensorFlow in this post. Here is the link to the full code, and below is the step-by-step tutorial.
Besides the packages we installed earlier, we also need TensorFlow, so please pip install it if you have not done so.
%matplotlib inline
import matplotlib.pyplot as plt
import tensorflow as tf
import numpy as np
import os

from tensorflow.contrib.tensorboard.plugins import projector
from tensorflow.examples.tutorials.mnist import input_data
In order to use TensorBoard for data visualization, we need to prepare three things:
- the data in ckpt form. These are the data points we are going to visualize.
- the metadata in tsv form (optional). The metadata can be the categorical labels, so that we can later color the points by label. It can also be text: if we are visualizing a word embedding, it lets us check which word each data point represents.
- the sprite in png form (optional). This is for image datasets: when we want to see the actual image samples instead of just points, we need this.
All of them need to be placed in a single log directory. So let’s define the directory and filenames, and configure the projector (which TensorBoard will use) to point to the right paths.
LOG_DIR = os.getcwd()+'/mnist_log'
path_for_mnist_checkpoint = os.path.join(LOG_DIR, "model.ckpt")
path_for_mnist_metadata = os.path.join(LOG_DIR,'metadata.tsv')
path_for_mnist_sprites = os.path.join(LOG_DIR,'mnistdigits.png')
tensor_name = 'mnist_embeddings'

summary_writer = tf.summary.FileWriter(LOG_DIR)
config = projector.ProjectorConfig()
embedding = config.embeddings.add()
embedding.tensor_name = tensor_name
embedding.metadata_path = path_for_mnist_metadata
embedding.sprite.image_path = path_for_mnist_sprites
embedding.sprite.single_image_dim.extend([28,28])
projector.visualize_embeddings(summary_writer, config)
Let’s load the MNIST dataset.
samples_to_visualize = 500
mnist = input_data.read_data_sets("MNIST_data/", one_hot=False)
batch_xs, batch_ys = mnist.train.next_batch(samples_to_visualize)
Let’s first create the data in ckpt (checkpoint) form.
embedding_var = tf.Variable(batch_xs, name=tensor_name)
sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())
saver = tf.train.Saver()
saver.save(sess, path_for_mnist_checkpoint, 1)
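A small detail worth noticing: the final argument 1 in saver.save is the global step, so the checkpoint files on disk will be named model.ckpt-1.* rather than plain model.ckpt.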
For the metadata, we will use the target labels and write them in tsv format.
with open(path_for_mnist_metadata, 'w') as f:
    f.write("Index\tLabel\n")
    for index, label in enumerate(batch_ys):
        f.write("%d\t%d\n" % (index, label))
Then, we create the sprite image. The sprite image is like a big container holding all the sample images. Since we already set the single-image dimensions in the projector config, TensorBoard will scan through the big sprite image using our specified dimensions as the window size. The sprite image needs to be square, but does not have to be completely filled; that is why you see blanks at the bottom-right corner of the image.
to_visualise = batch_xs
# reshape the MNIST digits from (batch, 28*28) back to image shape (batch, 28, 28)
to_visualise = np.reshape(to_visualise, (-1, 28, 28))
# invert black and white so the digits are dark on a light background
to_visualise = 1 - to_visualise
to_visualise = np.array(to_visualise)

img_h = to_visualise.shape[1]
img_w = to_visualise.shape[2]
n_plots = int(np.ceil(np.sqrt(to_visualise.shape[0])))

# create the big (square) sprite template
sprite_image = np.ones((img_h * n_plots, img_w * n_plots))
# fill the sprite template with the handwritten digits
for i in range(n_plots):
    for j in range(n_plots):
        this_filter = i * n_plots + j
        if this_filter < to_visualise.shape[0]:
            this_img = to_visualise[this_filter]
            sprite_image[i * img_h:(i + 1) * img_h,
                         j * img_w:(j + 1) * img_w] = this_img

# save the sprite image
plt.imsave(path_for_mnist_sprites, sprite_image, cmap='gray')
plt.imshow(sprite_image, cmap='gray')
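As a quick sanity check on the blanks mentioned above: for our 500 samples, n_plots = ceil(√500) = 23, so the sprite grid has 23 × 23 = 529 cells and the last 29 stay empty, which is exactly the blank strip at the bottom-right corner.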
Once all the files are prepared, the log directory should look something like this.
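Assuming the steps above ran without errors, it should contain roughly these files (exact checkpoint suffixes can vary with the TensorFlow version):
checkpoint
events.out.tfevents.* (written by the FileWriter)
metadata.tsv
mnistdigits.png
model.ckpt-1.data-00000-of-00001
model.ckpt-1.index
model.ckpt-1.meta
projector_config.pbtxt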
Finally, we cd into the directory that contains the log directory and type the command:
tensorboard --logdir=mnist_log
The output should be:
TensorBoard 1.12.0 at http://YOUR_PC_NAME:6006 (Press CTRL+C to quit)
Congrats! TensorBoard is running successfully, and all you need to do is go to localhost:6006 (the port may differ; just follow the one shown in the output) to check out your work.
One more thing that needs changing is the “Color by” option: choose our metadata Label as the color indicator.
And after that, you will see this:
If you switch from the PCA tab to the t-SNE tab at the bottom-left corner, it will show something like this.
It keeps changing because the t-SNE optimization is still running while you are visualizing it. Once it is relatively stable, you can click the “Pause” button at the bottom-left corner to freeze the dancing data.
TensorBoard is rather powerful, and there are quite a few things you can tune. I will not go into the details here; happy playing around!
Conclusion
In this post, I have briefly introduced data visualization and demonstrated both its static and interactive forms.
So far, we have only visualized raw data. As I do not want this post to get too lengthy, I have left out the part about visualizing embeddings/encoded features. The major change there is that instead of using the raw data, we use an encoder to convert the raw data into a feature space, and visualize the converted data. Those who would like to try it can check out word2vec, which converts words to embeddings, or a pretrained VGG/ResNet model, which converts images to embeddings. Happy exploring :)
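As a parting sketch, and not part of the walkthrough above, here is roughly what that encoding step could look like with a pretrained VGG16, assuming a TensorFlow version that ships tf.keras; the random images are just stand-ins for a real dataset:
import numpy as np
import tensorflow as tf

# hypothetical example: encode images with a pretrained VGG16 (classifier head removed),
# then feed the resulting feature vectors to PCA/t-SNE/TensorBoard instead of raw pixels
encoder = tf.keras.applications.VGG16(include_top=False, weights='imagenet', pooling='avg')
images = np.random.rand(8, 224, 224, 3).astype(np.float32)  # stand-ins for real images
embeddings = encoder.predict(images)  # (8, 512) feature vectors, ready to visualize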