Deep dive into embeddings

Kevin François
Published in neoxia · 15 min read · Mar 21, 2024


One-Hot Encoding

In the domain of natural language processing (NLP), the intricate structure of language poses a significant challenge for computational analysis. While humans effortlessly comprehend linguistic expressions, computers inherently operate in a numerical realm, which necessitates translating textual data into mathematical representations. These representations, termed vectors or embeddings, encapsulate the semantic and syntactic attributes of words within a multi-dimensional space. The term “embedding” originates from mathematical topology, where it denotes the mapping of entities into a structured mathematical space. The process essentially embeds linguistic elements into a mathematical framework, facilitating computational analysis and manipulation. Over the course of NLP’s evolution, considerable advances have been made in techniques for generating embeddings, ranging from traditional statistical methods to deep learning approaches.

As an example, let’s consider the N-gram model, a fundamental technique in NLP. N-grams represent sequences of N tokens (e.g., words, characters) within a corpus, capturing the contextual dependencies between adjacent tokens. For instance, in a bigram model (N=2), the sentence “The cat sat on the mat” would generate the following bigrams: “The cat”, “cat sat”, “sat on”, “on the”, “the mat”. By analyzing the frequency of these sequences, N-gram models provide insights into the structure and patterns of language usage. However, N-gram models often suffer from sparsity issues and struggle to capture long-range dependencies in language, leading to limited performance in certain tasks. Embeddings address these limitations by representing words as dense, continuous vectors, enabling more nuanced analysis and interpretation of textual data. By incorporating embeddings into N-gram models, we can enhance their ability to capture richer contextual information and improve overall performance in various NLP applications.
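
To make this concrete, here is a minimal sketch in plain Python (no NLP library assumed; the `ngrams` helper is purely illustrative) that extracts the bigrams of the example sentence and counts their frequencies:

```python
from collections import Counter

def ngrams(tokens, n=2):
    """Return the list of n-grams (as tuples) for a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "The cat sat on the mat"
tokens = sentence.lower().split()

bigrams = ngrams(tokens, n=2)
counts = Counter(bigrams)

print(bigrams)                 # [('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]
print(counts.most_common(3))   # frequency of each bigram in this tiny corpus
```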

However, counting-based representations such as one-hot vectors or N-gram frequencies aren’t the same kind of ‘embedding’ we’re discussing in AI, because they don’t reflect the meaning of words. There’s no way to tell from their mathematical representations that the words ‘BAD’ and ‘WORSE’ are more closely related in meaning than ‘BAD’ and ‘BED’.
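
A minimal one-hot encoding sketch, using a tiny hypothetical vocabulary, makes the problem concrete: every pair of distinct words receives orthogonal vectors, so ‘BAD’ sits exactly as far from ‘WORSE’ as it does from ‘BED’.

```python
import numpy as np

# A tiny illustrative vocabulary; real vocabularies contain tens of thousands of words.
vocab = ["bad", "worse", "bed", "cat"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Encode a word as a one-hot vector over the vocabulary."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Every pair of distinct one-hot vectors is orthogonal, so similarity is always 0:
print(cosine_similarity(one_hot("bad"), one_hot("worse")))  # 0.0
print(cosine_similarity(one_hot("bad"), one_hot("bed")))    # 0.0
```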

Word Embedding

What we call a context-aware embedding represents the input word as a list of numerical values while respecting a set of fundamental principles:

  1. Uniqueness: Each word is represented by a unique vector, containing a list of numerical values. This unique representation ensures that each word’s semantic meaning is encoded distinctly, enabling models to differentiate between them effectively.
  2. Multidimensionality: Word embeddings are typically multidimensional, allowing for the nuanced representation of word meanings and capturing various semantic nuances and relationships.
  3. Semantic Capture: Each dimension within the embedding vector contributes to different aspects of the word’s meaning, allowing models to understand and infer relationships between words based on their embeddings.
  4. Similarity Preservation: Words with similar meanings or contexts should have embeddings that are closer together in the embedding space, facilitating tasks such as word similarity comparison and semantic analysis.

Different techniques are employed to generate context-aware word embeddings. These methods consider the context in which words appear, generating embeddings that capture contextual information. Examples include ELMo, the Transformer, BERT, and Transformer-XL, which use sophisticated neural network architectures to produce embeddings sensitive to context.

Word embedding example [2]

BERT Model

In 2018, a significant breakthrough in Natural Language Processing (NLP) came with the publication of Bidirectional Encoder Representations from Transformers (BERT) by the Google AI team. This landmark paper introduced a novel approach to language modeling, employing bidirectional training within the Transformer architecture. BERT quickly gained recognition for its pragmatic design and exceptional performance, achieving state-of-the-art results across various NLP tasks. Unlike traditional models, BERT’s bidirectional training enables it to grasp intricate language context and relationships more effectively. For instance, consider the word “bank” in the sentences “The man was accused of robbing a bank” and “The man went fishing by the bank of the river.” While conventional models might assign the same representation to “bank” in both sentences, BERT’s context-aware embeddings capture the distinct meanings, reflecting the semantic nuances of the word in each context. This adaptability to varying contexts enhances the accuracy of feature representations and contributes to BERT’s superior model performance.

Transformers — Encoder

BERT leverages the Transformer architecture, initially presented in 2017 by Vaswani et al. [6]. The Transformer comprises two core components: an encoder and a decoder. The encoder is tasked with processing the input text, while the decoder generates predictions tailored to specific tasks.

Both the encoder and the decoder are built from a succession of attention mechanisms that discern contextual relationships among words or sub-words within a given text. At each prediction step, the attention mechanism computes attention weights over all previous tokens; this allows the model to ‘focus’ on the most relevant parts of the input sequence when generating each element of the output sequence. The attention mechanism operates on the principle of mapping a query alongside a series of key-value pairs to produce an output, where each component is represented as a vector.

Let’s consider a practical example using a machine translation task, where we aim to translate a sentence from English to French.

Suppose we have the English sentence “How was your day” and we want to translate it into French.

1. Encoding: First, each word in the input sentence is converted into its corresponding token. Tokenisation means splitting text into smaller units called tokens (e.g., words or word segments) in order to turn an unstructured input string into a sequence of discrete elements suitable for the Transformer model.

2. Query, Key, Value: In the attention mechanism, for each word in the input sentence, three vectors are derived: the query vector, the key vector, and the value vector. These vectors are generated by linear transformations of the word embeddings.

  • Query: The query vector represents the current word we want to focus on during translation.
  • Key: The key vector represents all words in the input sentence, providing information about their relationships with the current word.
  • Value: The value vector holds the actual information associated with each word in the input sentence.

3. Attention Scores: These scores, computed from the query and key vectors, represent the relevance or importance of each word in the input sentence with respect to the current word (a code sketch of steps 2–5 follows this list).

4. Softmax Normalization: The attention scores are then passed through a softmax function to obtain normalized weights. These weights determine the contribution of each word’s value vector to the output.

5. Weighted Sum: Finally, the weighted sum of the value vectors, using the computed weights, is calculated. This results in a context vector that encapsulates the relevant information from the input sentence for the current word being translated.

6. Decoding: The context vector, along with the embeddings of the previous translated words (if any), is used as input to the decoder to generate the next word in the target language (French in this case).
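
The sketch below, written with PyTorch and toy dimensions, illustrates steps 2–5 above (query/key/value projections, attention scores, softmax, weighted sum); the random projection matrices stand in for learned weights and are not those of any real trained model.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

seq_len, d_model = 4, 8             # 4 tokens ("How was your day"), toy embedding size
x = torch.randn(seq_len, d_model)   # stand-in for the token embeddings

# Step 2: linear projections producing query, key and value vectors
W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Step 3: attention scores = scaled dot product between queries and keys
scores = Q @ K.T / (d_model ** 0.5)   # shape (seq_len, seq_len)

# Step 4: softmax turns scores into normalized weights
weights = F.softmax(scores, dim=-1)

# Step 5: weighted sum of the value vectors gives one context vector per token
context = weights @ V                  # shape (seq_len, d_model)
print(weights.shape, context.shape)
```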

Concerning the BERT architecture, the primary aim is to build a language embedding model, which requires a deep understanding of textual context. BERT therefore selectively uses only the encoder part of the Transformer architecture. By focusing solely on the encoder, BERT concentrates on comprehensively capturing and representing intricate language nuances. The first step in this pipeline is to tokenize and embed the input text.

Tokenisation

BERT utilizes only the encoding component of the Transformer architecture. In this framework, the input to the encoder consists of a sequence of tokens. To produce this sequence, BERT splits the raw text with WordPiece tokenization and then enriches the resulting tokens with three types of embeddings (a short tokenization sketch follows the list):

  1. Token embeddings: BERT introduces special tokens, such as [CLS] and [SEP], to the input sequence. The [CLS] token is added at the beginning of the first sentence, while a [SEP] token is inserted at the end of each sentence. These tokens serve as markers to delineate the boundaries between sentences and to indicate the start and end of each sequence.
  2. Segment embeddings: To enable the encoder to differentiate between sentences, BERT adds a segment embedding to each token. This embedding acts as a marker, indicating whether a token belongs to Sentence A or Sentence B. By incorporating segment embeddings, BERT ensures that the model can discern the contextual information associated with each sentence within the input sequence.
  3. Positional embeddings: BERT includes positional embeddings for each token to convey its positional information within the sentence. These embeddings encode the sequential order of tokens, enabling the model to understand the relative positions of words within the input sequence. By incorporating positional embeddings, BERT enhances its ability to capture the contextual relationships between tokens in the input text.
BERT tokenization representation. [5]
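
As a rough illustration of this preparation step, the Hugging Face transformers tokenizer exposes the [CLS]/[SEP] insertion and the segment ids directly; the exact sub-word splits depend on the pretrained vocabulary, so the printed tokens are indicative only (positional embeddings are added inside the model itself).

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Encoding a sentence pair adds the [CLS] and [SEP] markers and the segment ids automatically.
encoded = tokenizer("You are reading a notion on embedding",
                    "In this article, BERT key concepts are presented")

print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# e.g. ['[CLS]', 'you', 'are', 'reading', ..., '[SEP]', 'in', 'this', ..., '[SEP]']
print(encoded["token_type_ids"])  # 0s for sentence A, 1s for sentence B (segment embeddings)
```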

Training

When training language models, there is a challenge in defining a prediction goal. Many models predict the next word in a sequence (e.g. “The child came home from ___”), a directional approach which inherently limits context learning. To overcome this challenge, BERT uses two training strategies:

  1. Masked LM

The core concept involves picking 15% of the words in each training sequence and substituting them with a [MASK] token. During training, the model endeavors to predict the original masked word based on the contextual cues provided by the entire sequence. In practice, the BERT implementation is slightly more intricate and doesn’t replace all of the 15% masked words. This is because with a simplistic masking approach like this, the model tends to focus solely on predicting when the [MASK] token appears in the input, whereas the aim is to predict the correct tokens regardless of their presence. To address this, out of the 15% tokens selected for masking:

  • 80% of the tokens are indeed replaced with the [MASK] token.
  • 10% of the time, tokens are substituted with a random token.
  • 10% of the time, tokens remain unchanged.

The BERT loss function exclusively considers the prediction of the masked values and disregards predictions of the non-masked words.
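
A minimal sketch of this 80/10/10 rule, written as plain Python over a token list with a tiny hypothetical vocabulary for the random-replacement case, might look like this:

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "sat", "mat", "dog", "ran"]  # toy vocabulary for random replacement

def mask_tokens(tokens, mask_prob=0.15):
    """Return (masked_tokens, labels) following BERT's 80/10/10 masking rule."""
    masked, labels = list(tokens), [None] * len(tokens)
    for i, token in enumerate(tokens):
        if random.random() < mask_prob:           # select ~15% of positions
            labels[i] = token                     # the loss only looks at these positions
            r = random.random()
            if r < 0.8:
                masked[i] = MASK                  # 80%: replace with [MASK]
            elif r < 0.9:
                masked[i] = random.choice(VOCAB)  # 10%: replace with a random token
            # remaining 10%: keep the original token unchanged
    return masked, labels

print(mask_tokens("the cat sat on the mat".split()))
```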

  2. Next Sentence Prediction (NSP)

In BERT’s next sentence prediction, the model learns to understand the relationship between two sentences. It’s crucial for tasks like question answering. During training, BERT is fed pairs of sentences and learns to predict if the second sentence follows the first in the original text, with the second sentence occurring after the first 50% of the time, while the remaining 50% constitutes a random sentence from the corpus.

Input = [CLS] You are currently [MASK] a notion on embedding [SEP]

In this [MASK], BERT’s key concepts are presented [SEP]

Label = IsNext

Input = [CLS] You are currently [MASK] a notion on embedding [SEP]

Most sharks have to keep [MASK] to pump water over their gills [SEP]

Label = NotNext

BERT then predicts whether the second sentence is random or not. This prediction is made using the Transformer-based model, where the output of the [CLS] token is turned into a 2×1 shaped vector through a simple classification layer, and the IsNext-Label is determined using softmax. The overall model is trained with both Masked LM and Next Sentence Prediction together to minimize loss.
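
A minimal sketch of that NSP head, assuming the 768-dimensional [CLS] output of BERT-Base and an untrained, purely illustrative linear layer:

```python
import torch
import torch.nn as nn

hidden_size = 768                           # BERT-Base hidden size
cls_output = torch.randn(1, hidden_size)    # stand-in for the [CLS] token's final hidden state

nsp_head = nn.Linear(hidden_size, 2)        # maps [CLS] to a 2-way (IsNext / NotNext) score vector
logits = nsp_head(cls_output)
probs = torch.softmax(logits, dim=-1)
print(probs)                                # e.g. tensor([[0.48, 0.52]]) -> IsNext vs NotNext
```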

BERT architecture

BERT architecture encompasses two primary variants: BERT-Base and BERT-Large, with distinct layer configurations and parameter counts:

  • BERT-Base comprises 12 transformer encoder layers, each housing 768 hidden nodes and 12 attention heads, summing up to 110 million parameters.
  • BERT-Large is equipped with 24 transformer encoder layers, 1024 hidden nodes, 16 attention heads, and 340 million parameters.

During training, BERT-Base utilized 4 cloud TPUs over 4 days, while BERT-Large employed 16 TPUs for the same duration. The model’s hidden states are structured as a multidimensional object with 13 layers for BERT-Base: the input embeddings plus the outputs of the 12 encoder layers. This architectural design enables BERT to proficiently capture contextual nuances in various natural language processing tasks.
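
Those 13 hidden-state layers can be inspected with the Hugging Face transformers library, roughly as follows:

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

inputs = tokenizer("Deep dive into embeddings", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

hidden_states = outputs.hidden_states   # tuple: input embeddings + one tensor per encoder layer
print(len(hidden_states))               # 13 for BERT-Base
print(hidden_states[-1].shape)          # (batch, sequence_length, 768)
```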

How to use BERT

As seen in the training process, BERT’s output provides contextualized representations of input tokens and sentences, capturing rich semantic information. Specifically, BERT provides token-level and sentence-level embeddings. Token-level embeddings represent the contextualized representation of each token in a sequence, considering its surrounding context within the sentence. Sentence-level embeddings aggregate information from all tokens in the sentence to provide a comprehensive representation of the entire sentence.

To utilize BERT embeddings in a specific downstream task, such as text classification, sentiment analysis, or named entity recognition, you can follow these steps:

  1. Tokenization: First, tokenize the input text into BERT-compatible tokens. BERT requires special tokenization, including adding a [CLS] token at the beginning of the sequence and [SEP] tokens to separate sentences or indicate the end of a sentence.
  2. Passage through BERT Model: Feed the formatted input into the BERT model. The output consists of embeddings, which include the contextualized representations of tokens and sentences.
  3. Feature Extraction: Depending on the specific task, extract relevant features from the BERT embeddings. For instance, for text classification, you might use the embedding corresponding to the [CLS] token as the sentence-level representation and feed it into a classifier to predict the class label.
  4. Task-Specific Processing: Finally, incorporate the extracted features into the desired downstream task pipeline, such as feeding them into a neural network for classification, regression, or any other task-specific operation.

By following these steps, BERT embeddings can be used in a wide variety of tasks.
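
A minimal sketch of steps 1–4 with the Hugging Face transformers library, using the [CLS] embedding as the sentence-level feature; the classifier on top is a hypothetical, untrained placeholder:

```python
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

# 1. Tokenization: [CLS] and [SEP] are added automatically
inputs = tokenizer("This movie was surprisingly good!", return_tensors="pt")

# 2. Passage through the BERT model
with torch.no_grad():
    outputs = bert(**inputs)

# 3. Feature extraction: the [CLS] token (position 0) as a sentence-level representation
cls_embedding = outputs.last_hidden_state[:, 0, :]   # shape (1, 768)

# 4. Task-specific processing: feed it into a (placeholder) classifier
classifier = nn.Linear(768, 2)                       # e.g. positive / negative sentiment
print(torch.softmax(classifier(cls_embedding), dim=-1))
```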

Image Embedding

Now that we’ve delved into word embedding and established a solid understanding of its principles, let’s extend our exploration to image embedding. While word embedding captures semantic relationships and contextual information within text data, image embedding aims to represent visual content in a similar manner. Just as words are encoded into dense vectors in a high-dimensional space to preserve their meaning, image embedding involves transforming images into compact representations while retaining essential visual features.

One key advancement in image embedding technology is the CLIP model, which has gained significant attention for its ability to understand and interpret images in a semantically meaningful way. CLIP plays a crucial role in enabling the creation of models such as Stable Diffusion and DALL-E. These models rely on the quality and richness of image embeddings to generate coherent and contextually relevant outputs.

CLIP Model

CLIP (Contrastive Language–Image Pre-training) diverges from the typical fine-tuning of pretrained models, opting for zero-shot learning instead. This means the model can perform tasks it wasn’t explicitly trained on. The CLIP model was trained on a vast dataset of 400 million image-text pairs, using a simplified version of the ConVIRT model. Its core idea is to learn visual representation from natural language data, showing that simple pre-training tasks can significantly boost zero-shot learning performance.

The main aim of CLIP is to associate images with text snippets. For example, given an image, it predicts which of 32,768 text snippets best describes it. This relies on learning the connections between visual data and language from a large corpus.

Architecture

CLIP employs a dual-model approach, comprising a 12-layer text Transformer for generating text embeddings and either a ResNet or a Vision Transformer (ViT) for producing image embeddings. The authors tested both ResNet and ViT architectures for CLIP’s image encoder; the ViT-based encoder proved better in terms of both efficiency and performance.

While the text encoder transformer architecture has been discussed previously, let’s delve into the image encoding components.

  1. ResNet

ResNet, short for Residual Network, stands as a prominent deep learning model in computer vision tasks. Utilizing a Convolutional Neural Network (CNN) architecture, ResNet employs a sequence of convolution, pooling, activation, and fully-connected layers to progressively reduce the dimensionality of the input (image) to yield a single vector (embedding).

ResNet’s key feature lies in its use of residual blocks within the convolution process, aimed at enhancing the stability of training convolutional networks by mitigating issues like vanishing gradients. The essence of a residual block is a skip connection that adds the block’s input to the output of its convolutional layers.
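
A simplified residual block in PyTorch shows this skip connection; real ResNet blocks also include batch normalization and channel changes, which are omitted here:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A simplified residual block: output = activation(F(x) + x)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)   # the skip connection mitigates vanishing gradients

block = ResidualBlock(channels=16)
print(block(torch.randn(1, 16, 32, 32)).shape)   # torch.Size([1, 16, 32, 32])
```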

  2. Vision Transformers

Introduced in 2021 by Dosovitskiy et al. [11], Vision Transformers adopt a novel approach by applying the standard Transformer architecture directly to images. Instead of treating words as input, Vision Transformers decompose images into patches and tokenize them before feeding them into a Transformer encoder (a patch-decomposition sketch follows the figure below).

Vision Transformers Model overview [11]
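
The patch decomposition itself can be sketched with a plain tensor reshape; the 16×16 patch size follows the ViT paper [11], while the linear projection below is an untrained placeholder:

```python
import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)   # (batch, channels, height, width)
patch_size, d_model = 16, 768

# Split the image into non-overlapping 16x16 patches and flatten each one
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.contiguous().view(1, 3, -1, patch_size, patch_size)   # (1, 3, 196, 16, 16)
patches = patches.permute(0, 2, 1, 3, 4).flatten(2)                     # (1, 196, 768)

# A linear projection turns each flattened patch into a "token" embedding for the encoder
projection = nn.Linear(3 * patch_size * patch_size, d_model)
tokens = projection(patches)
print(tokens.shape)   # torch.Size([1, 196, 768])
```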

Contrastive pre-training

Firstly, CLIP begins by assembling paired texts and images as input data. These paired instances are then sampled randomly at each iteration from a comprehensive dataset, laying the groundwork for subsequent processing. During training, the text encoder and image encoder operate in tandem, each tasked with transforming their respective inputs into feature vectors. The text encoder generates text feature vectors (T1, T2, …, TN), while the image encoder produces corresponding image feature vectors (I1, I2, …, IN).

CLIP pre-trains an image encoder and a text encoder to predict which images were paired with which texts in our dataset. [1]

CLIP’s training objective lies in evaluating the cosine similarities between these text and image feature vectors. In the figure above, matching pairs of texts and images are highlighted in blue along the diagonal of the similarity matrix. The objective during contrastive pre-training is to increase these matched similarity values, indicating a close alignment between paired texts and images within the feature space, while preserving clear distinctions between dissimilar pairs. To operationalize this objective, CLIP employs a softmax classifier to maximize the probability of the correct match for each image and each text. The loss function used in CLIP training is a cross-entropy loss, minimized across all pairs of images and texts in the batch.
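
A minimal sketch of this symmetric contrastive objective, following the pseudocode given in the CLIP paper [9], with random tensors standing in for the two encoders’ outputs:

```python
import torch
import torch.nn.functional as F

batch_size, dim = 8, 512
image_features = torch.randn(batch_size, dim)   # stand-in for image encoder outputs I_1..I_N
text_features = torch.randn(batch_size, dim)    # stand-in for text encoder outputs T_1..T_N

# L2-normalize so that dot products are cosine similarities
image_features = F.normalize(image_features, dim=-1)
text_features = F.normalize(text_features, dim=-1)

# Pairwise cosine similarities, scaled by a (learned) temperature
temperature = 0.07
logits = image_features @ text_features.T / temperature   # (N, N); diagonal = matched pairs

# Symmetric cross-entropy: each image should match its own text, and vice versa
targets = torch.arange(batch_size)
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
print(loss.item())
```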

Zero-Shot Classification

Zero-shot image classification, as implemented by the Contrastive Language Image Pre-training (CLIP) model, relies on a sophisticated process to accurately predict class labels for input images without prior training on specific datasets. The methodology behind CLIP involves several intricate steps:

  1. Text Encoding: Initially, potential label text classes are encoded using a pre-trained text encoder. This encoder transforms the textual descriptions into high-dimensional feature vectors (T1, T2, T3, …, TN), capturing semantic information and contextual nuances.
  2. Image Encoding: Simultaneously, the input image is passed through a pre-trained image encoder, generating an image feature vector (I1). This vector represents the visual characteristics and features extracted from the image.
  3. Cosine Similarity Calculation: The next crucial step involves computing the cosine similarity between each text feature vector and the image feature vector. Cosine similarity measures the cosine of the angle between two vectors, indicating their degree of similarity.
  4. Label Prediction: The text feature vector that exhibits the highest cosine similarity with the image feature vector is selected as the predicted label for the input image. Essentially, this process identifies the text description that aligns most closely with the visual content of the image.
Conversion of a dataset’s classes into captions such as “a photo of a dog”; CLIP predicts the class whose caption it estimates best pairs with a given image. [1]
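
These steps map closely onto the Hugging Face transformers CLIP interface; a rough sketch (the checkpoint name, image path, and candidate captions are illustrative choices):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")   # any local image
captions = ["a photo of a dog", "a photo of a cat", "a photo of a car"]

# Text and image encoding happen inside the model; the processor prepares both inputs
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image are the (scaled) image-text similarities; softmax gives label probabilities
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(captions, probs[0].tolist())))
```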

By leveraging the relationship between textual descriptions and visual features encoded in both the text and image spaces, CLIP achieves remarkable performance in zero-shot image classification tasks. This approach enables the model to generalize across diverse datasets without the need for explicit training on each individual dataset. Additionally, CLIP’s sensitivity to the nuances of textual descriptions ensures nuanced and accurate predictions, even in scenarios where multiple textual descriptions may correspond to the same visual content. Overall, CLIP represents a groundbreaking advancement in image classification, offering engineers a powerful tool for tackling complex real-world problems in computer vision.

How to use CLIP

Let’s take the example of image retrieval, a computer vision task of browsing, searching, filtering, and querying large datasets of images. To apply image retrieval using the CLIP model and Hugging Face, the following key points are essential (a code sketch follows the list):

  1. Image Preprocessing: Preprocess the images to be queried or retrieved to ensure compatibility with the CLIP model’s input requirements. This typically involves resizing, normalization, and converting images to tensors.
  2. Text Encoding: Encode textual descriptions of images using the CLIP model’s text encoder. These descriptions serve as queries for image retrieval.
  3. Image Encoding: Encode the preprocessed images using the CLIP model’s image encoder.
  4. Similarity Calculation: Compute the similarity between the encoded text embeddings (query) and image embeddings (database) using a similarity metric such as cosine similarity.
  5. Retrieval: Rank the images based on their similarity scores with the query. The images with the highest similarity scores are retrieved as the top results.
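
A rough sketch of this retrieval pipeline with the Hugging Face transformers CLIP interface; the image file names and the text query are placeholders:

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image_paths = ["img_0.jpg", "img_1.jpg", "img_2.jpg"]   # placeholder image database
images = [Image.open(p) for p in image_paths]

with torch.no_grad():
    # Steps 1-3: preprocess and encode the image database and the text query
    image_inputs = processor(images=images, return_tensors="pt")
    image_embeddings = F.normalize(model.get_image_features(**image_inputs), dim=-1)

    text_inputs = processor(text=["a cat sleeping on a sofa"], return_tensors="pt", padding=True)
    text_embedding = F.normalize(model.get_text_features(**text_inputs), dim=-1)

# Steps 4-5: cosine similarity and ranking
similarities = (text_embedding @ image_embeddings.T).squeeze(0)
ranked = similarities.argsort(descending=True)
print([image_paths[i] for i in ranked])
```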

Conclusion

To summarize, the advancements in embedding techniques represent a significant leap forward in AI capabilities, particularly within the realms of NLP and computer vision. Models like BERT have showcased the effectiveness of contextual embeddings in capturing intricate language nuances, while CLIP’s zero-shot image classification approach highlights the potential of integrating textual and visual information. These developments power applications such as sentiment analysis, text prediction, and image generation.

References:

  1. https://openai.com/research/clip
  2. https://openclassrooms.com/fr/courses/6532301-introduction-to-natural-language-processing/8082110-discover-the-power-of-word-embeddings
  3. https://jalammar.github.io/illustrated-bert/
  4. https://blog.research.google/2021/12/a-fast-wordpiece-tokenization-system.html?m=1
  5. DEVLIN, Jacob, CHANG, Ming-Wei, LEE, Kenton, et al. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  6. VASWANI, Ashish, SHAZEER, Noam, PARMAR, Niki, et al. Attention is all you need. Advances in neural information processing systems, 2017, vol. 30.
  7. WU, Yonghui, SCHUSTER, Mike, CHEN, Zhifeng, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
  8. JAIN, Abhilash, RUOHE, Aku, GRÖNROOS, Stig-Arne, et al. Finnish Language Modeling with Deep Transformer Models. arXiv preprint arXiv:2003.11562, 2020.
  9. RADFORD, Alec, KIM, Jong Wook, HALLACY, Chris, et al. Learning transferable visual models from natural language supervision. In : International conference on machine learning. PMLR, 2021. p. 8748–8763.
  10. HE, Kaiming, ZHANG, Xiangyu, REN, Shaoqing, et al. Deep residual learning for image recognition. In : Proceedings of the IEEE conference on computer vision and pattern recognition. 2016. p. 770–778.
  11. DOSOVITSKIY, Alexey, BEYER, Lucas, KOLESNIKOV, Alexander, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
