An Exploratory Look at Vector Embeddings

The Art of Turning Data Into Language

Ryan Partridge
8 min read · Jul 31, 2023

This article is part of a series about the Transformer architecture. If you haven’t read the others, refer to the introductory article here.

Today, we work with data in many forms, such as videos, images, audio, and documents. Technology has become so seamless that we can open them on our computers or stream them online at the click of a button. However, how a computer represents them and how we perceive them in the real world are very different.

Take images and videos, for example. A computer represents a coloured image as matrices of pixel values, one for each colour channel (RGB), and a video as a stack of these images. Sound and text are no different: on their own, they mean nothing to a computer and must first be converted into numeric representations before they can be interpreted. Things become more complex when we feed this information into Deep Learning (DL) models, where each data type presents unique challenges for capturing its inherent characteristics.

Usually, they require separate specialist models to effectively capture their distinct attributes, but what if we could convert them into a single format? Think of the possibilities! We could use different data sources for the same model and even replace specialist models with a single architecture (e.g., the Transformer)!

So, what is this single format I speak of? It is none other than the legendary Vector Embeddings! Without further ado, let’s dive right in!

Vector Embeddings

Since the Transformer was first introduced in the “Attention Is All You Need” paper (Vaswani et al., 2017), vector embeddings have become the standard input representation for text-based DL models. A vector embedding is an object (e.g., an image, video, or document) represented as a vector of numeric values, where each value captures a characteristic of that object. For example, given three animals (a cat, a dog, and a horse), we may have attributes for the length of their tails, the number of legs they have, or even what they eat for breakfast! As another example, the word ‘great’ can carry a positive or a negative meaning, as in ‘Wow, this is great!’ versus ‘I’ve broken it, great’. In this case, we can allocate values for both sentiments.

In short, the embeddings act as quantifiable metrics to identify how close (semantically similar) the objects are. In practice, we assess their proximity within a vector space.

Figure 1.2. An example of animals in a 3-dimensional vector space.
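To make this concrete, here is a toy PyTorch sketch: the three animals from Figure 1.2 are given hand-made, entirely invented attribute values, and cosine similarity is used as one common way of measuring how close two vectors sit in the vector space.

```python
import torch
import torch.nn.functional as F

# Hypothetical, hand-crafted 3-D embeddings: [tail length, number of legs, diet score].
# The attribute values are invented purely for illustration.
animals = {
    "cat":   torch.tensor([0.30, 4.0, 0.9]),
    "dog":   torch.tensor([0.35, 4.0, 0.8]),
    "horse": torch.tensor([1.00, 4.0, 0.1]),
}

# Cosine similarity measures how closely two vectors point in the same direction,
# which we use here as a proxy for semantic similarity.
print(F.cosine_similarity(animals["cat"], animals["dog"], dim=0))    # close to 1 -> similar
print(F.cosine_similarity(animals["cat"], animals["horse"], dim=0))  # lower -> less similar
```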

Creating Embeddings

A PyTorch implementation of word and patch embeddings can be found on GitHub here.

We can create vector embeddings manually, but that can become exhausting fast. Imagine creating 100 unique characteristics for a dataset and assigning them numeric values based on personal preference. It not only introduces bias but would take a long time to complete. Instead, it’s much easier to incorporate them as trainable parameters into our model or use an embedding model.

Word Embeddings

With text data, we don’t need a complex solution. We can simply map each token to a set of Gaussian-distributed weights and treat the result as a lookup table of vector embeddings. Then, during training, we optimise the embeddings as extra learnable parameters of the model.

Figure 2.1.1. An example of converting documents into word embeddings.
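As a rough sketch of this lookup-table idea, PyTorch’s `nn.Embedding` holds a matrix of normally (Gaussian) distributed weights, indexed by token id, that is trained alongside the rest of the model. The vocabulary size, embedding dimension, and token ids below are placeholder values, not taken from any particular dataset.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 10_000, 128  # hypothetical values

# nn.Embedding is a lookup table of trainable weights, initialised from a
# normal (Gaussian) distribution by default. Each row is one token's embedding.
embedding = nn.Embedding(vocab_size, embed_dim)

# A batch of token ids (already mapped from words to integers by a tokenizer).
token_ids = torch.tensor([[5, 42, 7, 911]])

word_embeddings = embedding(token_ids)
print(word_embeddings.shape)  # torch.Size([1, 4, 128])
```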

This approach is fantastic for model flexibility, especially when juggling between different problems. Admittedly though, sometimes it isn’t the best solution. Who wants to manually train embeddings all the time?! Instead, why not use a set of embeddings that are already trained? Sometimes, this can be easier and much faster. In this case, models such as Word2Vec, GloVe and FastText are effective options (Mikolov et al., 2013; Pennington et al., 2014; Bojanowski et al., 2017).

Patch Embeddings

What about images and audio files? Well, this is where things become a little more complicated. Recall that an image is a matrix of numeric values, typically with three channels (colours). Flattening it into a single dimension and mapping it to a set of embeddings would be extremely inefficient and would throw away a lot of meaningful spatial information. So, what’s the alternative? Traditionally, Computer Vision models use several Convolutional layers to extract significant features by sliding a fixed-size box (kernel) over the image. Surely we can do something similar?

Well, guess what? We can! Dosovitskiy et al. (2021), the creators of the Vision Transformer (ViT), built a solution around this exact process. They found that by converting an image into patches, they could retain its information while passing it into Transformer-based models. They used a single Convolutional layer to split the image into patches, then flattened and transposed the output to create the patch embeddings. Each patch embedding captures the visual information of its own segment of the image.
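Here is a minimal sketch of that patching step, loosely following the ViT recipe: a single Conv2d whose kernel size and stride both equal the patch size, followed by a flatten and a transpose. The image size, patch size, and embedding dimension are assumptions (they mirror the ViT-Base defaults), not values taken from this article’s repository.

```python
import torch
import torch.nn as nn

img_size, patch_size, embed_dim = 224, 16, 768  # assumed ViT-Base-style values

# One Conv2d with kernel_size == stride == patch_size slices the image into
# non-overlapping patches and projects each one to `embed_dim` in a single step.
patcher = nn.Conv2d(in_channels=3, out_channels=embed_dim,
                    kernel_size=patch_size, stride=patch_size)

image = torch.randn(1, 3, img_size, img_size)          # (batch, channels, height, width)
patches = patcher(image)                                # (1, 768, 14, 14)
patch_embeddings = patches.flatten(2).transpose(1, 2)   # (1, 196, 768) -> (batch, num_patches, embed_dim)
print(patch_embeddings.shape)
```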

However, they didn’t stop there! They also added a token (‘<cls>’, representing classification) at the beginning of the sequence of patch embeddings. But why bother? Well, this is where things get really crazy! It acts as a representation of the entire image by providing global context and semantic information. In other words, we can think of this token as a marker to inform the model that it needs to make predictions based on the whole image. Pretty cool, right? But wait, there’s more! The token also allows us to treat the patch embeddings as word embeddings, meaning we don’t need to change the model architecture! Furthermore, it serves as a placeholder for token-level predictions, meaning we can adapt it for other tasks such as object detection or semantic segmentation — not just classification! And it enables Transfer Learning capabilities between models. Who would have thought that one small component would have such an impact!

Figure 2.2. An example of converting images into patch embeddings.
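Continuing the sketch above, the ‘<cls>’ token is just one extra learnable vector prepended to every sequence of patch embeddings; the shapes below simply carry on from the previous snippet.

```python
import torch
import torch.nn as nn

batch_size, num_patches, embed_dim = 1, 196, 768  # matches the previous sketch

# A single learnable vector, shared across the batch and prepended to every
# sequence of patch embeddings.
cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

patch_embeddings = torch.randn(batch_size, num_patches, embed_dim)
cls_tokens = cls_token.expand(batch_size, -1, -1)            # (batch, 1, embed_dim)
sequence = torch.cat([cls_tokens, patch_embeddings], dim=1)  # (batch, 197, embed_dim)
print(sequence.shape)  # the Transformer now sees 196 patches + 1 <cls> token
```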

For audio, we can use the same process. Given a spectrogram of the audio (an image of its frequency content over time), we can turn it into patch embeddings and pass it into a Transformer-based model just like before, with no extra steps needed. A simple, quick and easy approach for converting audio into language!
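The same patcher works here too, assuming we treat the spectrogram as a one-channel image; the 128×128 shape below is invented purely for illustration.

```python
import torch
import torch.nn as nn

# A mel spectrogram treated as a single-channel image: (batch, 1, mel_bins, time_frames).
# The 128x128 shape is assumed purely for illustration.
spectrogram = torch.randn(1, 1, 128, 128)

embed_dim, patch_size = 768, 16
patcher = nn.Conv2d(in_channels=1, out_channels=embed_dim,
                    kernel_size=patch_size, stride=patch_size)

audio_patches = patcher(spectrogram).flatten(2).transpose(1, 2)
print(audio_patches.shape)  # (1, 64, 768) -> 8x8 patches, each an "audio word"
```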

Selecting An Embedding Size

One of the big problems with Vector Embeddings is selecting the best size for your corpus. Embeddings that are too small are typically unable to capture all the relevant characteristics, while ones that are too large risk overfitting. Moreover, large dimensions increase model complexity, training time, and inference latency, potentially constraining the model’s applicability.

A common rule of thumb is to select an embedding dimension between 50 and 1000 based on specific factors, such as data size, language complexity, the type of task the model is trained for, and the availability of computational resources. Additionally, you would typically trial different dimensions in increments, such as 50, 100, 200 and 300, then adjust further if required. Personally, I find this approach very counterintuitive. Why pick an embedding dimension almost at random, without any clear indication of how it impacts the model and its dataset? It provides no real guidance and can even create more problems when optimising the model for maximum performance. Surely, there’s a technique for finding that desired sweet spot?

Fortunately, there is: use an embedding loss. One example is the Pairwise Inner Product (PIP) loss, a metric designed to measure the dissimilarity between embeddings based on their unitary invariance (Yin and Shen, 2018). The idea is to characterise an embedding by its PIP matrix, formed by multiplying the embedding with its own transpose; this matrix stays the same under any unitary transformation of the embedding, as shown in equation 2.3.1.

Equation 2.3.1. The Pairwise Inner Product (PIP) matrix of a single vector embedding E: PIP(E) = EEᵀ (ibid.).

Then, to compare two embeddings, we take the norm of the difference between their PIP matrices (equation 2.3.2).

Equation 2.3.2. The PIP loss between two vector embeddings E₁ and E₂: ‖E₁E₁ᵀ − E₂E₂ᵀ‖ (ibid.).
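Here is a minimal PyTorch sketch of the two equations, assuming the formulation from Yin and Shen (2018): the PIP matrix is the embedding multiplied by its own transpose, and the loss is the Frobenius norm of the difference between two such matrices.

```python
import torch

def pip_matrix(E: torch.Tensor) -> torch.Tensor:
    """Pairwise Inner Product matrix: (vocab, dim) @ (dim, vocab) -> (vocab, vocab)."""
    return E @ E.T

def pip_loss(E1: torch.Tensor, E2: torch.Tensor) -> torch.Tensor:
    """Frobenius norm of the difference between the two PIP matrices."""
    return torch.linalg.norm(pip_matrix(E1) - pip_matrix(E2))

# Two embeddings over the same vocabulary but with different dimensions --
# the PIP matrices are always (vocab, vocab), so they remain comparable.
E1 = torch.randn(1000, 100)
E2 = torch.randn(1000, 300)
print(pip_loss(E1, E2))
```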

Seems simple enough, right? Well, not entirely. For this loss to work, we need a pre-trained set of embeddings for our corpus. Realistically, this makes sense: imagine searching for the optimal embedding dimension using randomly initialised embeddings. They would carry no information about how the tokens relate to one another, so it would be impossible to judge which dimension fits best. So how do we use this method effectively? My advice: find the simplest embedding model for pre-training, use it to obtain the optimal dimension, and then use that dimension to get the best embedding fit for your corpus of data.
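To illustrate this advice, here is a toy simulation. It is not Yin and Shen’s exact procedure (their method estimates the signal and noise from data); instead, we pretend to know a noise-free co-occurrence matrix, build LSA-style embeddings from a noisy version of it at several candidate dimensions, and keep the dimension whose PIP matrix is closest to the reference. The vocabulary size, signal rank, and noise level are all invented.

```python
import torch

def pip_loss(E1: torch.Tensor, E2: torch.Tensor) -> torch.Tensor:
    # Frobenius norm between the two PIP matrices (see the previous sketch).
    return torch.linalg.norm(E1 @ E1.T - E2 @ E2.T)

# --- Toy setup (all values invented for illustration) ---------------------
# Pretend the "true" co-occurrence signal has rank 20 and that the matrix we
# actually observe is a noisy version of it.
torch.manual_seed(0)
vocab_size, true_rank, noise_level = 500, 20, 5.0
A = torch.randn(vocab_size, true_rank)
signal = A @ A.T
noise = torch.randn(vocab_size, vocab_size)
observed = signal + noise_level * (noise + noise.T) / 2

def lsa_embedding(matrix: torch.Tensor, k: int, alpha: float = 0.5) -> torch.Tensor:
    # LSA-style embedding via truncated SVD: E_k = U_k * diag(S_k)^alpha.
    U, S, _ = torch.linalg.svd(matrix)
    return U[:, :k] * S[:k].pow(alpha)

# Reference embedding built from the (normally unknown) clean signal. In
# practice, this role is played by a pre-trained embedding for your corpus.
reference = lsa_embedding(signal, true_rank)

# Sweep candidate dimensions and keep the one with the lowest PIP loss:
# too few dimensions misses signal, too many mostly adds noise.
losses = {k: pip_loss(lsa_embedding(observed, k), reference).item()
          for k in (5, 10, 20, 50, 100, 200)}
best_dim = min(losses, key=losses.get)
print(losses, best_dim)
```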

Yin and Shen (2018) accompany their research with a code implementation on GitHub here. The repository includes embedding algorithms, such as Word2Vec, GloVe, and Latent Semantic Analysis (LSA), to use with their PIP loss implementation. It uses NumPy and works incredibly well but lacks clarity on the operations performed. As such, I’ve adapted the simplest algorithm (LSA) and the PIP loss implementation to PyTorch, with guiding comments for extra clarity and flexibility. You can view the code in my PyTorch Transformer GitHub repository here. And with that, you now have a better understanding of Vector Embeddings and a method for finding the optimal embedding dimension!

What’s Next?

Interested in learning more about the other Transformer components? Check out my introductory article here that acts as a roadmap through the architecture.

References

Baevski, A., Hsu, W.-N., Xu, Q., Babu, A., Gu, J., and Auli, M., 2022. Data2Vec: A General Framework For Self-Supervised Learning in Speech, Vision and Language. arXiv.org. Available from: https://arxiv.org/abs/2202.03555.

Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T., 2017. Enriching Word Vectors with Subword Information. arXiv.org. Available from: https://arxiv.org/abs/1607.04606v2.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N., 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv.org. Available from: https://arxiv.org/abs/2010.11929.

Mikolov, T., Chen, K., Corrado, G., and Dean, J., 2013. Efficient Estimation of Word Representations in Vector Space. arXiv.org. Available from: https://arxiv.org/abs/1301.3781.

Pennington, J., Socher, R., and Manning, C. D., 2014. GloVe: Global Vectors for Word Representation. Available from: https://nlp.stanford.edu/pubs/glove.pdf

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., and Polosukhin, I., 2017. Attention is All You Need. arXiv.org. Available from: https://arxiv.org/abs/1706.03762.

Yin, Z., and Shen, Y., 2018. On The Dimensionality of Word Embedding. arXiv.org. Available from: https://arxiv.org/abs/1812.04224.

