The Art of Embeddings: Transforming Text for Vector Databases (Part 2)

David Gutsch
12 min read · Jul 28, 2023


In the first part of this series, we embarked on a journey through the cosmos of vector databases. We explored how these databases transform data into vectors in a multi-dimensional space, enabling a nuanced understanding of data and the ability to compare almost anything. Now, as we continue our exploration, we’ll delve deeper into a crucial component of this process: embeddings. Then in the last segment of our series, we’ll take a look at the algorithms that store, organize, and search these embeddings in the database itself.

Embeddings are a fundamental concept in deep learning that enable us to capture rich context in ones and zeros. They are powerful, flexible, and while their implementations are necessarily complex, their purpose is beautifully simple. As Roy Keyes succinctly puts it, “Embeddings are learned transformations to make data more useful.” This definition encapsulates three key aspects of embeddings: they are learned, they transform data, and they make data more useful. They are usually learned via some variation of a neural network. They transform raw data, whether it be images, text, entity data, product data, or audio. And they make the data more useful by capturing meaning and context in a machine-readable, indexable format.

In this article we will cover the following topics:

  • How text is tokenized, starting with the Word2Vec method, comparing it with newer sub-word methods, and analyzing their trade-offs along the way.
  • We’ll then introduce how transformer models are used to take these tokenized words and phrases and transform them into their final embeddings that are stored in vector databases.
  • Finally we will explore the relationship between tokenization and embeddings, namely how the tokenization method chosen will influence the size and efficacy of the transformer model’s embeddings on search.

Tokenization

The process of transforming text into embeddings begins with tokenization, which is the process of breaking down text into smaller parts, or “tokens.” These tokens can be as small as individual characters or as large as entire sentences; in most cases, however, they represent individual words or sub-words. A pioneering method that evolved this process is Word2Vec, which was developed at Google in 2013. It operates by grouping the vectors of similar words together in a vector space. This is achieved by creating dense vector representations of word features, such as the context of individual words. Given enough data and a variety of contexts, Word2Vec can make accurate predictions about a word’s meaning based on its past appearances. For instance, it can infer that “man” is to “boy” what “woman” is to “girl” based on the contexts in which these words appear.

Word2Vec uses a neural network to train words against other words that neighbor them in the input corpus. It does this in one of two ways: either using the context to predict a target word, a method known as Continuous Bag of Words (CBOW), or using a word to predict a target context, which is called Skip-Gram. For example, in the sentence “The quick brown fox jumps over the lazy dog,” the CBOW model would take “The,” “quick,” “brown,” “fox,” “over,” “the,” “lazy,” “dog” as context words to predict the target word “jumps.” Conversely, the Skip-Gram model would use “jumps” to predict the surrounding context words. When the feature vector assigned to a word cannot accurately predict that word’s context, the components of the vector are adjusted, refining the model’s understanding of semantic relationships. This iterative process of adjustment and refinement is at the heart of Word2Vec’s power and effectiveness.

[Figure: CBOW and Skip-Gram architectures]
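
To make this concrete, here is a minimal sketch using the gensim library (a tooling choice on my part, not something prescribed above) that trains both a CBOW and a Skip-Gram model on a toy corpus. With a corpus this small the vectors are noisy, but the API mirrors the ideas described above.

```python
# Minimal Word2Vec sketch with gensim (illustrative only; a real model needs a large corpus).
from gensim.models import Word2Vec

# Toy corpus: each document is a pre-tokenized list of words.
corpus = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["the", "man", "walks", "with", "the", "boy"],
    ["the", "woman", "walks", "with", "the", "girl"],
]

# sg=0 trains CBOW (context predicts target); sg=1 trains Skip-Gram (target predicts context).
cbow_model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0, epochs=100)
skipgram_model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1, epochs=100)

# Each word now has a dense vector; analogy queries combine and compare those vectors.
print(cbow_model.wv["fox"][:5])
# On a real corpus this kind of query tends to surface "girl" for man : boy :: woman : ?
print(skipgram_model.wv.most_similar(positive=["woman", "boy"], negative=["man"], topn=3))
```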

Despite this novel approach, Word2Vec has some limitations. It cannot handle polysemy, where a single word or phrase has multiple meanings (e.g. river “bank” vs. money “bank”), so it cannot differentiate between the meanings of a word based on context. Additionally, it must store a vector for every unique word in the vocabulary, so the model grows with the size of the corpus vocabulary, which becomes a limiting factor for larger datasets. It also struggles with out-of-vocabulary words, i.e. words that were not present in the training corpus, which can lead to inaccurate representations. Lastly, Word2Vec does not account for morphological variations of words. For instance, it treats “run,” “runs,” and “running” as entirely separate words with no inherent relationship, which can lead to a loss of semantic understanding.

You might be wondering: if Word2Vec doesn’t scale and cannot make the simple distinctions between words and contexts that we humans perform so easily, how do we better encode word context into our embeddings? We do this with sub-word tokenization, a hybrid approach between word-level and character-level tokenization. It is based on the principle that frequently used words should not be split into smaller sub-words, while rare words should be decomposed into meaningful sub-words. For instance, a rare word like “annoyingly” might be decomposed into “annoying” and “ly”. This is especially useful in agglutinative languages such as Turkish, where you can form (almost) arbitrarily long complex words by stringing together sub-words.

[Figure: a simple sub-word tokenization example]

Sub-word tokenization allows the model to have a reasonable vocabulary size while being able to learn meaningful context-independent representations. For instance, BERT and GPT-2 limit the vocabulary to roughly 30,000 and 50,000 tokens by using WordPiece and Byte Pair Encoding respectively. In addition, sub-word tokenization enables the model to process words it has never seen before by decomposing them into known sub-words. For instance, if a model trained with sub-word tokenization encounters the word “unseenword”, it could potentially break it down into known sub-words like “un”, “seen”, and “word”. There are a number of different methodologies that use the sub-word approach to tokenize words, too many to address in this article, but for those of you who are interested I’ve included them in an appendix at the end.
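
As a quick illustration, here is a sketch using the Hugging Face transformers library (my choice of tooling; nothing above prescribes one). The exact splits depend on each model’s learned vocabulary, so treat the commented outputs as indicative rather than guaranteed.

```python
# Compare sub-word splits from a WordPiece tokenizer (BERT) and a byte-level BPE tokenizer (GPT-2).
from transformers import AutoTokenizer

bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")  # WordPiece vocabulary
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")               # byte-level BPE vocabulary

# Rare or unseen words fall back to known sub-word pieces.
print(bert_tok.tokenize("annoyingly"))   # e.g. something like ['annoying', '##ly']
print(bert_tok.tokenize("unseenword"))   # decomposed into known pieces, possibly ['unseen', '##word'] or smaller units
print(gpt2_tok.tokenize("annoyingly"))   # BPE pieces; splits differ because the learned merges differ
```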

Transformer Models

Now that we have a deeper understanding of the mechanisms by which words are tokenized, we need to address how text is transformed into its final embeddings while preserving the semantic meaning of the text on a larger scale. Forgive me in advance: there will be some hand-waving over some of the intricacies of transformer models, in the interest of focusing on how they enable vector databases to preserve meaning.

Vector databases generally use encoder-only transformer models; an example of this would be BERT (Bidirectional Encoder Representations from Transformers). We only need to encode the text so that it can be compared with the other embedded bodies of text in the database. Once we know which embeddings are most similar we can use their unique ids to look up the original raw text. These models leverage the power of self-attention mechanisms and positional encodings to understand the context and semantics of words in a sentence. Let’s break down this process into its key steps:

  1. Tokenization: We’ve already addressed this part of the process in depth. The number of tokens fed to the model at one time can range anywhere from a sentence or a paragraph all the way up to a small document; for instance, GPT-3 can currently accommodate a maximum of roughly 4k tokens.
  2. Embedding Lookup: Once the text is tokenized, each token is mapped to an initial embedding. These embeddings are not random but are pre-trained representations learned during the pre-training phase of the Transformer model. They serve as the starting point for understanding the semantics of each token.
  3. Positional Encoding: Transformers, by design, lack an inherent understanding of the order of tokens. To overcome this, positional encodings are added to the initial embeddings. These encodings provide information about the position of each token within the sequence, enabling the model to understand the order of words, while freeing us from the constraint of sequential processing of a text that limited processing in pre-transformer NLP models like RNNs.
  4. Self-Attention Mechanism: The next step involves the application of the self-attention mechanism. This mechanism allows each token to ‘look’ at other tokens in the sequence and weigh their influence based on their relevance. Essentially, it enables the model to determine which tokens contribute significantly to the meaning of each individual token.
  5. Aggregation: Following the self-attention step, the outputs for each token are aggregated, typically by summing them up. This aggregation results in a new set of embeddings for each token. These embeddings capture both the individual meanings of the tokens and their context within the sequence. The aggregation step combines the context-aware embeddings from the self-attention mechanism into a single vector.
  6. Feed-Forward Neural Network: The final step involves passing these aggregated embeddings through a feed-forward neural network, which is shared across all positions. This network further transforms the embeddings, enabling the model to learn more abstract representations and helping it generalize to unseen data. (A minimal code sketch of this end-to-end pipeline follows the figure below.)
[Figure: one layer of the original Transformer]
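
To tie the six steps together, here is a minimal sketch of the pipeline using BERT via the Hugging Face transformers library. The model and the mean-pooling step are my own illustrative choices (mean pooling is one common way to collapse per-token vectors into a single sentence vector); production systems often use purpose-built embedding models instead.

```python
# Tokenize -> encode -> pool into one vector per sentence, then compare the vectors.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["The quick brown fox jumps over the lazy dog.",
             "A fast auburn fox leaps above a sleepy hound."]

# Steps 1-3: tokenization, embedding lookup, and positional encoding
# (the last two happen inside the model's embedding layer).
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    # Steps 4-6: self-attention, aggregation, and feed-forward layers
    # produce one context-aware vector per token.
    token_embeddings = model(**batch).last_hidden_state       # (batch, tokens, hidden)

# Mean-pool over real (non-padding) tokens to get one vector per sentence.
mask = batch["attention_mask"].unsqueeze(-1)                   # (batch, tokens, 1)
sentence_embeddings = (token_embeddings * mask).sum(1) / mask.sum(1)

# Cosine similarity between the two sentence vectors; this is the kind of
# comparison a vector database performs at query time.
sim = torch.nn.functional.cosine_similarity(
    sentence_embeddings[0], sentence_embeddings[1], dim=0)
print(sim.item())
```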

The resulting embeddings are rich in semantic and contextual information, making them incredibly useful for a wide range of natural language processing tasks. In the context of vector databases, these embeddings serve as the high-dimensional vectors that are stored and queried to retrieve semantically similar results. Now let’s take a look at how the chosen tokenization method influences the performance of the embedding process as well as the quality and utility of the resulting embeddings.

Effect of Tokenization on Embedding Efficacy

The choice of tokenization method can greatly influence the size and effectiveness of a model’s embeddings. There are three areas which are most affected: vocabulary size, handling of out-of-vocabulary words, and the overall quality and usefulness of the embeddings.

The choice of tokenization method determines the vocabulary size. A larger vocabulary equates to more embeddings, which in turn increases the model’s size and the computational resources needed for training and inference. This is why models like BERT and GPT use sub-word tokenization methods: they can train on a huge corpus of text while keeping the vocabulary relatively small.
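
As a rough, back-of-the-envelope illustration (my own numbers, not figures from the article), the token-embedding matrix alone holds vocab_size × hidden_size parameters, so vocabulary size translates directly into model size:

```python
# How vocabulary size feeds into model size via the token-embedding matrix.
def embedding_matrix_params(vocab_size: int, hidden_size: int) -> int:
    """Number of parameters in the token-embedding matrix alone."""
    return vocab_size * hidden_size

# A sub-word vocabulary (BERT-base uses ~30k tokens with hidden size 768).
print(embedding_matrix_params(30_000, 768))     # ~23 million parameters
# A word-level vocabulary over a large corpus can easily reach a million types.
print(embedding_matrix_params(1_000_000, 768))  # ~768 million parameters for embeddings alone
```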

The issue of out-of-vocabulary words can also impact the quality of the embeddings. This is where sub-word tokenization comes into play, as it allows the model to construct representations for unseen words from the sub-word units it has encountered. Furthermore, certain tokenization methods may be more suitable for specific tasks or languages, which can result in more accurate embeddings and improved performance on tasks such as search. If you’re interested in the specific trade-offs of sub-word tokenization methods, check out the appendix at the end of the article.

The tokenization method can significantly affect the size and effectiveness of a transformer model’s embeddings. It’s a crucial consideration when designing and training these models, and the choice should be guided by the specific requirements of the task and the characteristics of the language of the text. Many vector databases make this determination for you, but depending on your use case you may achieve superior performance in vector search by experimenting with different tokenization methods and transformer models.

Recap

Today we introduced how word tokenization was revolutionized twice: first by Word2Vec, which allowed us to transform words into machine-readable vectors, and then again by breaking text down into sub-words to cut down on vocabulary size and provide better context.

Then we introduced encoder-only transformer models, which are fundamental to transforming the tokenized words into indexable, comparable embeddings that capture the context of a larger body of text. Through a series of steps — tokenization, embedding lookup, positional encoding, self-attention mechanism, aggregation, and a feed-forward neural network — these models create embeddings that capture both the semantic meaning of each token and the context in which it appears in the sequence.

Lastly we refined our understanding of the effect our choice in tokenization method has on the overall utility of the embeddings the transformer models will produce. By choosing the right tokenization method we can create a nuanced understanding of text that captures both the meaning of individual words and the relationships between them. We’ve seen how this choice can affect the vocabulary size, handling of out-of-vocabulary words, and the overall quality and usefulness of the embeddings.

Conclusion

In this comprehensive guide to embeddings, we’ve journeyed through the intricate process of transforming raw text into useful embeddings, the key component for turning human-readable text into a format computers can reason about. As we conclude this part of our series, we hope you’ve gained a deeper understanding of the embedding processes that underpin vector databases. These processes, while complex, are fundamental to the AI revolution, enabling machines to understand and process language in ways that were once the exclusive domain of humans.

Zooming out from this article: in our first article we explained how these embeddings can be compared to produce query results of similar text data, and in this article we learned how the vector embeddings themselves are created. That leaves one last component unexplored: the algorithms that store and search the vectors within the database.

Tune in next week for Part 3, the last installment of this series, where we address the algorithms that allow us to store and query these embeddings!

Appendix: Specific sub-word tokenization methods

  1. Byte Pair Encoding (BPE)
  • How it works: BPE starts with a vocabulary of individual characters and iteratively merges the most frequent pair of symbols to produce a new symbol. This process continues until a predefined number of merges have been made.
  • Advantages: BPE can handle out-of-vocabulary words and morphological variations. It’s flexible and can adapt to the specificities of the language it’s trained on.
  • Disadvantages: BPE can sometimes produce strange splits of words, especially for languages with complex morphology. It also requires a two-step process of first learning the BPE merges and then training the model.
  • Used in: GPT-2, RoBERTa.
  • Example: Given the word “lowers”, if the most frequent pair is (“o”, “w”), BPE merges it into a new symbol “ow”, and “lowers” is then tokenized into “l”, “ow”, “e”, “r”, “s”.
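
For the curious, here is a toy sketch of the BPE merge loop, simplified from the original algorithm; real implementations operate over corpus-wide frequency counts and store the merge order for later tokenization.

```python
# Toy BPE: repeatedly merge the most frequent adjacent pair of symbols.
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Word frequencies, each word pre-split into characters.
words = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6, tuple("wider"): 3}
for _ in range(4):
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print("merged", pair, "->", list(words.keys()))
```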

2. Byte-Level Encoding

  • How it works: Byte-level encoding uses a vocabulary of all possible byte values (256 unique bytes). It can handle any string of bytes, making it particularly useful for multilingual models or models that need to handle non-text inputs.
  • Advantages: Byte-level encoding can handle any kind of input and doesn’t require any special handling for out-of-vocabulary words. It’s also very memory-efficient.
  • Disadvantages: Byte-level encoding can sometimes produce very long sequences for languages that use multi-byte characters (like Chinese or Japanese).
  • Used in: GPT-3.
  • Example: The word “hello” will be tokenized into the corresponding byte values of each character: 104, 101, 108, 108, 111.
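
The raw byte view that byte-level encodings build on is easy to see in plain Python; byte-level BPE (as used by the GPT family) then merges frequent byte sequences on top of this representation:

```python
# Byte-level "tokenization" in its simplest form: UTF-8 byte values.
print(list("hello".encode("utf-8")))        # [104, 101, 108, 108, 111]
# Multi-byte scripts expand into longer sequences, the trade-off noted above.
print(list("こんにちは".encode("utf-8")))   # 15 bytes for five characters
```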

3. WordPiece

  • How it works: WordPiece is similar to BPE, but instead of merging the most frequent pair it chooses the merge that most increases the likelihood of the training data, which tends to keep common whole words intact. It starts with a base vocabulary of individual characters and then learns a fixed number of merges, similar to BPE.
  • Advantages: WordPiece can handle out-of-vocabulary words, and it’s less likely to split words in strange ways compared to BPE.
  • Disadvantages: WordPiece can still produce unexpected splits, and it requires a two-step process of first learning the merges and then training the model.
  • Used in: BERT, DistilBERT.
  • Example: Given the word “lowers”, if (“low”, “ers”) is the highest-scoring pair, WordPiece will merge them into the single symbol “lowers”.

4. Unigram

  • How it works: Unigram tokenization is a subword regularization method that starts from a large candidate vocabulary and prunes it, keeping the subwords that minimize the loss in likelihood of the training data.
  • Advantages: Unigram can handle out-of-vocabulary words, and it’s more flexible than BPE or WordPiece because it allows for multiple possible segmentations of a word.
  • Disadvantages: Unigram can sometimes produce unexpected splits, and it requires a two-step process of first learning the vocabulary and then training the model.
  • Used in: SentencePiece.
  • Example: Given the sentence “I adore machine learning”, a Unigram model whose vocabulary lacks “adore” might segment it into [“I”, “ ”, “a”, “d”, “o”, “r”, “e”, “ ”, “machine”, “ ”, “learning”].

5. SentencePiece

  • How it works: SentencePiece is a language-independent subword tokenizer and detokenizer. It treats the input as a raw input string, so you don’t need to pre-tokenize the text. SentencePiece implements both BPE and unigram language model with the extension of direct training from raw sentences.
  • Advantages: SentencePiece allows for the flexibility of BPE and unigram language model while also being able to handle multiple languages in one model. It doesn’t require any pre-tokenization.
  • Disadvantages: SentencePiece can sometimes produce unexpected splits, and the choice between BPE and unigram may not be clear for every application.
  • Used in: ALBERT, XLNet, T5.
  • Example: Given the sentence “This is a test.”, SentencePiece might tokenize it into [“▁This”, “▁is”, “▁a”, “▁test”, “.”], where “▁” represents a space.
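
A minimal SentencePiece sketch might look like the following; the file names and vocabulary size are illustrative assumptions, not values from the article.

```python
# Train a SentencePiece model directly on raw text and tokenize with it.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",        # hypothetical path to one-sentence-per-line raw text
    model_prefix="demo_sp",    # writes demo_sp.model and demo_sp.vocab
    vocab_size=8000,           # must be smaller than the number of distinct pieces in the corpus
    model_type="unigram",      # or "bpe"
)

sp = spm.SentencePieceProcessor(model_file="demo_sp.model")
# Spaces are encoded with the "▁" marker, so detokenization is lossless.
print(sp.encode("This is a test.", out_type=str))
```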
