Embeddings, briefly.

Iva @ Tesla Institute · Published in Artificialis · Feb 25, 2022

Not long ago I was chatting in a restaurant with two friends, fellow programmers, about a model I was working on at the time. Trying to explain a possible solution to a task, I found myself lacking the words to explain the concept of embeddings. The thing is, they are traditional programmers, not ML oriented. In trying to make it familiar to them I got myself trapped: in my head it all seemed pretty clear, but my definition was not really understandable. It was (for me, at least) a slightly embarrassing moment, so I dedicate this article to them, as I want to get this sorted once and for all. I’m certain plenty of people will gain something from reading it.

There are all sorts of texts, blogs and articles on this matter. I would certainly recommend the one where I first got familiar with the embedding concept in the context of Transformers in NLP. It’s written by the well-known AI researcher Jay Alammar in his famous and often-cited article “The Illustrated Word2vec”: https://jalammar.github.io/illustrated-word2vec/

Clean data — accurate model

Yep, you know the rule: the cleaner the data, the better the chance the model will eventually give accurate results. Neural networks accept various kinds of input features:

  • categorical features are commonly handled with one-hot encoding, or more efficiently with dimensionality embeddings
  • images — to find patterns in images, all the data has to be turned into numbers, or more precisely into embedded vectors
  • textual data inputs for NLP
  • time series data
  • audio — sound waves and signals (frequency information)

To represent data we need numeric values to supply to the machine learning model or neural network, depending on the complexity of the task.

I will take the example of one-hot encoding and show why it isn’t practical in more complex models; I reckon it’s the best way to get introduced to dimensionality embeddings.

One-Hot Encoding in Contrast to Dimensionality Embeddings

Encoding categories as plain integers and letting the model assume a natural ordering between them may result in poor performance or unexpected results.

One-hot encoding can be applied to the integer representation instead. The integer-encoded variable is removed and a new binary variable is added for each unique integer value. Let’s look at a basic example:

*Binary variables are often called “dummy variables”.*
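In code, the same idea looks roughly like this; the `color` feature and its values are made up for illustration:

```python
import pandas as pd

# A small categorical feature with three unique values.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-hot encoding: one binary "dummy" column per unique category.
one_hot = pd.get_dummies(df["color"], dtype=int)
print(one_hot)
#    blue  green  red
# 0     0      0    1
# 1     0      1    0
# 2     1      0    0
# 3     0      1    0
```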

Problem #1

When data is encoded this way, we’d have a serious issue once there are more than three dimensions, as in this particular example. What if we have many, many more dimensions? Every additional category adds a whole new column, so the representation grows with the cardinality of the feature.

Problem #2

The second problem is that one-hot encoding treats categories as independent: every pair of one-hot vectors is equally far apart, so the encoding carries no notion of similarity between related categories.

Solution:

The Embeddings design pattern solves the dimensionality problem by passing the input data through an embedding layer with trainable weights, which places the data densely in a lower-dimensional space.

The main advantage of the embedding method is that it captures closeness relationships in the input data, so such a layer can be used as a replacement for clustering techniques and for dimensionality-reduction methods such as principal component analysis (PCA).
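As a minimal sketch (using Keras, with an invented vocabulary size and output dimension), an embedding layer is just a trainable lookup table that maps each integer-encoded category to a dense vector:

```python
import numpy as np
import tensorflow as tf

# Hypothetical setup: 10,000 possible categories squeezed into 8 dimensions.
num_categories = 10_000
embedding_dim = 8

embedding_layer = tf.keras.layers.Embedding(input_dim=num_categories,
                                            output_dim=embedding_dim)

category_ids = np.array([[3], [42], [7]])      # integer-encoded inputs
dense_vectors = embedding_layer(category_ids)  # lookup into the weight table

print(dense_vectors.shape)  # (3, 1, 8)
# Instead of three sparse 10,000-dimensional one-hot vectors, we get three
# dense 8-dimensional vectors whose values are learned during training.
```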

Text Embeddings in NLP

Given the cardinality of a vocabulary, one-hot encoding isn’t practical: it would create an insanely large, high-dimensional and sparse matrix for training.

In NLP it’s essential that words similar in meaning sit close together in the embedding space, and that words unrelated in meaning (and in their position in the sentence) sit far apart.

Therefore, we use a dense word embedding to vectorize the text input before passing it to our model.

Tokenization is the process of representing raw text as smaller units called tokens. These tokens can then be mapped to numbers and fed to an NLP model.

*Tokens are the building blocks of Natural Language.*

More details about this process, which is essential before tokenized sentences or characters are passed through an embedding layer, are explained in an understandable manner in the post The evolution of Tokenization in NLP — Byte Pair Encoding in NLP by @Harshit Tyagi; I highly encourage you to read it.

So, to conclude: in most cases the tokenizer will normalize the string to lowercase, remove punctuation and split words into subword parts (such as character n-grams). Look it up if you’re curious; we have to get back to the embedding magic.
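Here is a toy version of that normalization step in plain Python; real subword tokenizers (like the Byte Pair Encoding described in the post above) are considerably more sophisticated:

```python
import re

def simple_tokenize(text: str) -> list[str]:
    """Toy tokenizer: lowercase, strip punctuation, split on whitespace.
    Real subword tokenizers (e.g. Byte Pair Encoding) go further and
    break rare words into smaller pieces."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # remove punctuation
    return text.split()

tokens = simple_tokenize("The black cat sat on the couch!")
print(tokens)  # ['the', 'black', 'cat', 'sat', 'on', 'the', 'couch']

# Map each token to an integer id so it can be fed to an embedding layer.
vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}
ids = [vocab[t] for t in tokens]
print(ids)     # [5, 0, 1, 4, 3, 5, 2]
```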

Input Embedding and Positional Encoding

Suppose we need to perform embedding for the following sentence:

The black cat sat on the couch and the brown dog slept on the rug.

Let’s focus on two words in the sentence: black and brown. The word embedding vectors of these two words should be similar. I suppose you are familiar with cosine similarity, which compares vectors through the angle between them, computed from their dot product and Euclidean (L2) norms; if not, I suggest you give it a quick look in this short video: Euclidean Distance & Cosine Similarity by @datasciencedojo.
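For intuition, here is cosine similarity computed with NumPy on toy 3-dimensional vectors; the numbers stand in for real word embeddings and are invented for illustration:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between a and b, using their Euclidean (L2) norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for the embeddings of "black", "brown" and "couch".
black = np.array([0.9, 0.1, 0.3])
brown = np.array([0.8, 0.2, 0.35])
couch = np.array([0.1, 0.9, 0.7])

print(cosine_similarity(black, brown))  # close to 1: similar words
print(cosine_similarity(black, couch))  # noticeably smaller
```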

However, a big chunk of information is missing because no additional vector or information indicates a word’s position in a sentence!

Positional encoding:

If we go back to our example sentence, we can see that black is at position pos=2 and brown at position pos=10. By applying the sine and cosine functions for pos=2 we obtain a positional encoding vector of the same size as the model’s embedding (d_model = 512 in the original Transformer). Cosine similarity comes in handy for visualizing the proximity of the positions.

cosine_similarity(pc(black), pc(brown)) = [[0.9627094]]
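Below is a sketch of the sinusoidal positional encoding from the original Transformer, with d_model = 512. The value quoted above comes from the notebook linked below, where the positional encoding is also combined with the word embeddings, so this standalone sketch will print a different number.

```python
import numpy as np

def positional_encoding(pos: int, d_model: int = 512) -> np.ndarray:
    """Sinusoidal positional encoding from 'Attention Is All You Need':
    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    """
    pe = np.zeros(d_model)
    even_dims = np.arange(0, d_model, 2)           # 0, 2, 4, ..., d_model-2
    angle = pos / np.power(10000.0, even_dims / d_model)
    pe[0::2] = np.sin(angle)                       # even dimensions
    pe[1::2] = np.cos(angle)                       # odd dimensions
    return pe

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

pe_black = positional_encoding(pos=2)   # "black" at position 2
pe_brown = positional_encoding(pos=10)  # "brown" at position 10
print(cosine_similarity(pe_black, pe_brown))
```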

If you wish to explore more you can find the Google Colaboratory notebook here:

Transformers-for-Natural-Language-Processing/positional_encoding.ipynb at main · PacktPublishing/Transformers-for-Natural-Language-Processing

Image and Audio Data in the Embedding Layer

While text is a very sparse input, other data types (images or audio) consist of dense, high-dimensional vectors with multiple channels containing raw pixel or frequency information. In a setting like this, the embedding provides a relevant low-dimensional representation of the input.

In the speech-reconstruction model of Ephrat et al. (2017), for example, an embedding is created by concatenating the outputs of each tower. It is subsequently fed into a decoder consisting of fully connected layers that output a mel-scale spectrogram, followed by a post-processing network that outputs a linear-scale spectrogram.

The embedding layer is just another hidden layer in the neural network. Its weights are learned through gradient descent, which means the resulting vector embeddings are the most efficient low-dimensional representation of the feature values with respect to the learning task.
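As a sketch (Keras, with invented layer sizes), here is a tiny convolutional “tower” that maps a raw image to a 64-dimensional embedding; since it is just another part of the network, its weights are updated by the same gradient descent step as every other layer:

```python
import tensorflow as tf

# Hypothetical image tower: 64x64 RGB image -> 64-dimensional embedding.
image_encoder = tf.keras.Sequential([
    tf.keras.Input(shape=(64, 64, 3)),
    tf.keras.layers.Conv2D(16, 3, strides=2, activation="relu"),
    tf.keras.layers.Conv2D(32, 3, strides=2, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(64),  # the embedding vector
])

images = tf.random.uniform((8, 64, 64, 3))  # a fake batch of 8 images
embeddings = image_encoder(images)
print(embeddings.shape)                     # (8, 64)
```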

Autoencoders

Autoencoders provide one way to get around the need for a large labeled dataset. The most typical autoencoder architecture contains a hidden layer, commonly known as the “bottleneck” layer, and essentially that is the layer performing the embedding function.

The encoder maps the high-dimensional input into a lower-dimensional embedding layer, while the decoder maps that representation back to the original, higher dimension. The latent vector is essentially the embedding layer’s output.

When training an autoencoder, the feature and the label are the same, and the loss is the reconstruction error. In this sense, the autoencoder achieves nonlinear dimensionality reduction.
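A minimal Keras sketch of that idea, with invented sizes (784-dimensional inputs squeezed through a 32-dimensional bottleneck); the bottleneck activations are the embedding:

```python
import tensorflow as tf

input_dim, bottleneck_dim = 784, 32  # e.g. flattened 28x28 images

inputs = tf.keras.Input(shape=(input_dim,))
# Encoder: high-dimensional input -> low-dimensional embedding ("bottleneck").
embedding = tf.keras.layers.Dense(bottleneck_dim, activation="relu")(inputs)
# Decoder: embedding -> reconstruction of the original input.
outputs = tf.keras.layers.Dense(input_dim, activation="sigmoid")(embedding)

autoencoder = tf.keras.Model(inputs, outputs)
encoder = tf.keras.Model(inputs, embedding)  # reuse the bottleneck as an embedder

# Feature and label are the same; the loss is the reconstruction error.
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(x_train, x_train, epochs=10)  # x_train: unlabeled data
```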

You can find more on autoencoders and the typical “bottleneck” shrinking in this blog post by @Cedric De Boom:

Shrinking Variational Autoencoder Bottlenecks On-the-Fly

Trade-offs and Conclusion

There is certainly information loss in going from a high- to a low-dimensional representation, but in return we gain information about the closeness and context of the items. The dimensionality of the embedding space is something that is learned through practice and experiment. Choosing a very small output dimension means too much information is compressed into a small vector space, so context can be lost. On the other hand, when the embedding dimension is too large, the model also tends to lose contextual importance.

References and resources:

  • Ephrat, Ariel & Halperin, Tavi & Peleg, Shmuel (2017). “Improved Speech Reconstruction from Silent Video.”
  • Aurélien Géron, “Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow” (2019).
  • Denis Rothman, “Transformers for Natural Language Processing” (2021).
