Learning a useful representation for words

Davide Fiorino · DeepDreaming · Apr 26, 2020
t-SNE visualisation of word embeddings generated using 19th-century literature (Siobhán Grayson)

The field of representation learning is built around the idea that the way data are represented can strongly impact the performance of the tasks we apply to them. In this article, I am going to show you an example of learning a useful representation for English words.

How do we define the “usefulness” of a representation? Perhaps there is no single representation of data that is always better: it depends on what we want to do with the data. In general, we can say that a good representation is one that makes it easier to extract useful information when building classifiers or other predictors. But a representation can also be useful if it captures semantic meaning or is easily interpretable by humans.

Let’s take a toy example: suppose we want to recognize the status of a traffic light from a photo of it. If we encode the image as raw pixel values, we would need a complex model, probably a CNN, to predict it. But what if we encoded just the color present in the image as RGB values? We would immediately recognize red and green from a high value in their respective channel, and we could get yellow by exclusion. Each image would be represented by just 3 values (instead of a huge number of pixels) and the predictive model would be very simple.
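
As a rough sketch of this idea (the traffic_light_state helper and its thresholds below are invented for illustration, not taken from any real system), the 3-value RGB representation reduces the predictor to a couple of comparisons:

```python
import numpy as np

def traffic_light_state(image):
    """Guess the state of a traffic light from the average colour of a photo.

    `image` is an H x W x 3 array of RGB values in [0, 255]. The thresholds
    are invented for illustration and would need tuning on real photos.
    """
    r, g, b = image.reshape(-1, 3).mean(axis=0)  # the whole 3-value representation
    if r > 150 and g < 100:
        return "red"
    if g > 150 and r < 100:
        return "green"
    return "yellow"  # by exclusion, as in the example above
```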

Word embedding

Word embedding is a way of representing words as vectors in an N-dimensional space. Why do we want to do such a thing? Let’s first try to understand how words are represented when we feed them into a learning model. The easiest option might be: supposing that we have X words in our vocabulary, each word can be represented as a vector of all zeros with a 1 in the position corresponding to its index in the vocabulary. This representation is called one-hot encoding and is extremely inefficient, since it amounts to working in a space of huge dimension (the size of the vocabulary) where vectors are very sparse and can occupy only discrete positions. One can instead design a lower-dimensional space in which to map the words and try to position them according to their semantic meaning. This is the space created by a word embedding algorithm, and it usually has from 100 to 1000 dimensions. I’m not going to go into the details of word embedding; instead, I will show a simple method to obtain one.

The position of word vectors in the embedding space encodes semantic information.
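
To make the contrast concrete, here is a minimal NumPy sketch of the two representations (the word index is made up, and the embedding matrix is just random noise standing in for the learned parameters):

```python
import numpy as np

vocab_size, embedding_dim = 8185, 16   # the sizes used later in this article
word_index = 42                        # hypothetical index of a word in the vocabulary

# One-hot encoding: a huge, sparse vector with a single 1.
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

# Embedding: the same word becomes a row of a dense (vocab_size x embedding_dim)
# matrix. Here the matrix is random; during training it is a learned parameter.
embedding_matrix = np.random.normal(size=(vocab_size, embedding_dim))
word_vector = embedding_matrix[word_index]   # shape (16,)
```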

Obtaining the embedding

In this example, I have obtained a 16-dimensional embedding of the words from the IMDB movie reviews dataset, by using a shallow neural network trained to classify positive vs. negative reviews. The notebook is inspired by this original TensorFlow documentation example. The network architecture is shown below.

The words in the dataset are tokenized, which means that each word has been assigned an integer index from the dictionary of words present in the dataset. In the dataset we used, the dictionary contains 8185 sub-words (some words are split in order to obtain a smaller vocabulary), so each one is associated with an integer from 1 to 8185. Each movie review is then a sequence of tokens of the same length, since we padded the shorter ones with zeros. The input is therefore a vector of sequence length containing a tokenized review.
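
The notebook loads this tokenized dataset via tensorflow_datasets, roughly as in the TensorFlow tutorial it is based on (the exact API calls may differ slightly between library versions):

```python
import tensorflow_datasets as tfds

# IMDB reviews already tokenized into a vocabulary of ~8k sub-words.
(train_data, test_data), info = tfds.load(
    'imdb_reviews/subwords8k',
    split=(tfds.Split.TRAIN, tfds.Split.TEST),
    with_info=True, as_supervised=True)

encoder = info.features['text'].encoder
print(encoder.vocab_size)  # 8185

# Batch the reviews, zero-padding each batch to its longest review.
train_batches = train_data.shuffle(1000).padded_batch(10)
test_batches = test_data.padded_batch(10)
```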

The network architecture used to extract the word embedding. All the tensors shown additionally have a dimension for the batch size.

The first layer is the Embedding layer, which is designed exactly for this purpose. We can imagine it as a lookup table of parameters of size vocab_size × embedding_dimension: for each word in the vocabulary, it contains the corresponding embedding vector. In our case, we fixed the embedding dimension to 16. The vectors are initialized randomly and, at every step of the training, they are updated through backpropagation. Afterward, the output matrix of this layer is averaged over the sequence dimension by the “global average pooling” layer, to obtain a 16-D vector. This vector is then passed through two fully connected layers to obtain the final output, a probability indicating whether the review is positive.
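
In Keras, the architecture described above amounts to just a few layers. The sketch below follows the TensorFlow tutorial the notebook is based on; the optimizer and the number of epochs are indicative rather than prescriptive:

```python
import tensorflow as tf

embedding_dim = 16

model = tf.keras.Sequential([
    # Lookup table of shape (vocab_size, 16): one trainable vector per sub-word.
    tf.keras.layers.Embedding(encoder.vocab_size, embedding_dim),
    # Average the word vectors of each review into a single 16-D vector.
    tf.keras.layers.GlobalAveragePooling1D(),
    # Two fully connected layers ending in the probability of a positive review.
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

history = model.fit(train_batches,
                    epochs=10,
                    validation_data=test_batches)
```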

Word embeddings are usually obtained with unsupervised methods, where the network is trained to identify neighboring words in sentences. Here, by training in a supervised way, I will obtain a representation that is biased toward separating the words associated with positive reviews from the ones associated with negative reviews. This shows how supervised representation learning provides a way to obtain very task-specific representations. The notebook is available below; running it all takes a few minutes.

Visualizing the representation

At the end of the notebook, the vector coordinates of the sub-words and the sub-words themselves are written into two TSV files (vecs.tsv and meta.tsv) that you can upload to the amazing TensorFlow Embedding Projector to visualize. With our network we derived vectors in a 16-D space, so how do we visualize them? The Embedding Projector performs Principal Component Analysis (PCA) on the data to obtain the orthogonal components along which the variance is greatest. It then visualizes the word vectors projected onto the first 3 principal components, in order to obtain a nice 3D representation. A full description of PCA is available in my article here. You should now click on the “Load” button and upload the 2 TSV files, and then select the “Sphereize data” option, which applies a normalization to the data.
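
For reference, the notebook produces the two files roughly like this, again following the tutorial (the + 1 offset assumes that index 0 is reserved for the padding token):

```python
import io

# Each row of this matrix is the learned 16-D vector of one sub-word.
weights = model.layers[0].get_weights()[0]   # shape: (vocab_size, 16)

with io.open('vecs.tsv', 'w', encoding='utf-8') as out_v, \
     io.open('meta.tsv', 'w', encoding='utf-8') as out_m:
    for num, word in enumerate(encoder.subwords):
        vec = weights[num + 1]               # index 0 is the padding token
        out_m.write(word + '\n')
        out_v.write('\t'.join(str(x) for x in vec) + '\n')
```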

The Embedding Projector interface. Use the Load button to load the 2 TSV files and then check the Sphereize data option.

The resulting representation should look like the sphere below, where each point represents a sub-word present in the dictionary.

3D PCA visualization of the 16-D word embedding. Each point represents a word.

Words are clustered in two opposite regions, which is the result of our supervised training on the binary classification of positive vs. negative reviews. You can scroll around, click on points to see the words, and search for them. For example, if you search for the words “good” and “bad”, you will notice they appear on opposite sides. The neighbors of “good” are words like “wonderful”; the neighbors of “bad” are words like “boring”. Of course, this is just a lower-dimensional representation of the original 16-D embedding, which should be a lot more expressive.

Two distinct clusters are recognizable containing words associated with positive and negative reviews.

Applications of word embedding

All of this looks pretty cool, but is it useful? First of all, it is used a lot in natural language processing (NLP), in tasks like automatic summarization, machine translation, sentiment analysis, named entity recognition, and speech recognition. NLP models benefit a lot from using a meaningful word representation. This approach also extends to paragraphs of text or even entire documents, allowing for tasks like content-based information retrieval. On this matter, in an article published in Nature [1], some researchers used this technique to discover new chemical compounds with certain properties by applying word embedding to a dataset containing the text of scientific papers. The authors used an unsupervised word embedding algorithm, so they managed to acquire knowledge from the data without additional labeling.

Without any explicit insertion of chemical knowledge, these embeddings capture complex materials science concepts such as the underlying structure of the periodic table and structure–property relationships in materials. Furthermore, we demonstrate that an unsupervised method can recommend materials for functional applications several years before their discovery. [1]

A graph showing how the context words of materials predicted to be thermoelectrics connect to the word thermoelectric. The width of the edges between ‘thermoelectric’ and the context words (blue) is proportional to the cosine similarity between the word embeddings of the nodes, whereas the width of the edges between the materials and the context words (red, green and purple) is proportional to the cosine similarity between the word embeddings of context words and the output embedding of the material. [1]

But the concept of embedding, and in general of learning a latent representation, can also be ported to different domains like images, movie titles, songs, or even shopping items. Researchers from Alibaba have built a recommendation system based on similarity in a learned embedding of items [2]. The image below, for example, highlights how different categories of shoes have been successfully separated by the representation.

Visualization of badminton, table tennis and football shoes. Items in gray do not belong to any of the three categories. [2]

In conclusion, we have seen what a useful representation is, how we can extract one for words, and how this concept can be applied to other fields with important practical results.

References

[1] Tshitoyan, Vahe, et al. “Unsupervised word embeddings capture latent knowledge from materials science literature.” Nature 571.7763 (2019): 95–98.

[2] Wang, Jizhe, et al. “Billion-scale commodity embedding for e-commerce recommendation in alibaba.” Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018.

Davide Fiorino is a Computer Engineering student at Politecnico di Torino, Italy, interested in Machine Learning and Data Science.