Text Embedding — What, Why and How?

Fanghua (Joshua) Yu
8 min read · Apr 22, 2023


Introducing GPT-3 Text Embeddings to Your Next Knowledge Project

The magnificent ceiling of the Grand Hall of the National Gallery of Victoria, which is the world’s largest stained-glass ceiling. Photo by the author.

Embedding Is the Knowledge for AI

If I ask people what the relationship is between the 1st and 2nd lines below:

#1 What is text embedding?

#2 [-0.03156438, 0.0013196499, -0.01716885, -0.0008197554, 0.011872382, 0.0036221128, -0.022915626, -0.005925469, … (1,528 more values) …]

I believe hardly anyone could answer. The first line is apparently a question in English asking for the meaning of embedding, but the second line seems totally meaningless to a human being.

In fact, the 2nd line is the embedding of the 1st line, produced by OpenAI GPT-3’s embedding model, i.e. text-embedding-ada-002.

Embedding is a machine learning process that converts complex, high-dimensional data, e.g. text, images etc., into lower-dimensional representations while preserving essential relationships and structure. Embedding is the knowledge for AI, as it is produced, understood and used by various algorithms and AI systems.

Basically ANY DATA can be embedded. For text, some well-known embedding techniques include (a short Word2Vec sketch follows the list):

  1. Word2Vec: Developed by researchers at Google, Word2Vec is a family of algorithms that learns word embeddings from large text corpora using either the continuous bag-of-words (CBOW) or the skip-gram architecture.
  2. GloVe (Global Vectors for Word Representation): Developed by researchers at Stanford University, GloVe is another popular word embedding technique that learns vector representations by considering global co-occurrence statistics in a text corpus.
  3. FastText: Developed by Facebook AI Research, FastText is an extension of Word2Vec that learns embeddings for words, subwords, and even out-of-vocabulary terms. FastText is especially effective for languages with rich morphologies, as it can handle morphological variations by leveraging subword information.
  4. ELMo (Embeddings from Language Models): Developed by researchers at the Allen Institute for Artificial Intelligence, ELMo is a contextual word embedding technique that learns embeddings by training a bidirectional LSTM (Long Short-Term Memory) language model on large text corpora. ELMo embeddings are context-dependent, capturing different meanings for the same word based on the surrounding context.
  5. BERT (Bidirectional Encoder Representations from Transformers): Developed by researchers at Google, BERT is a powerful pre-trained contextual embedding technique based on the Transformer architecture. BERT embeddings are bidirectional, meaning they consider both the left and right context of a word, resulting in richer representations that capture complex language structures and relationships.
  6. GPT (Generative Pre-trained Transformer): Developed by OpenAI, GPT is a series of models (GPT, GPT-2, and GPT-3) that use the Transformer architecture to generate text. While primarily known for text generation, GPT models can also produce contextual embeddings by using the pre-trained weights and the internal hidden states of the model.
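As a quick illustration of the first technique above, here is a minimal sketch of training skip-gram Word2Vec embeddings with the gensim library on a toy corpus; the library, corpus and parameters are my own illustrative choices, not something used later in this article.

# Minimal Word2Vec sketch using gensim (illustrative only)
from gensim.models import Word2Vec

# A toy corpus: each document is a list of tokens
corpus = [
    ["what", "is", "text", "embedding"],
    ["embedding", "converts", "text", "into", "vectors"],
    ["vectors", "capture", "semantic", "meaning"],
]

# sg=1 selects the skip-gram architecture (sg=0 would be CBOW)
model = Word2Vec(corpus, vector_size=50, window=3, min_count=1, sg=1)

# The learned embedding of a word is a fixed-size numeric vector
print(model.wv["embedding"][:5])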

Why Generate and Store Embeddings?

In the past, we used computer systems to capture and store data created by and for human beings. Now, generating and storing text embeddings has obvious reasons and benefits too.

Efficient search and retrieval. Text embeddings can be used to search and retrieve similar or relevant documents quickly in large databases, as they represent the semantic meaning of the text. This is particularly useful for applications such as search engines, content recommendation systems, and document clustering.

Reduced storage requirements. Text embeddings are often represented as fixed-size numerical vectors, which are more compact than storing raw text. This can help reduce storage requirements, especially when dealing with large volumes of text data.

Faster computation. Many machine learning and natural language processing tasks can be performed more efficiently using text embeddings, as they are numerical representations that can be easily processed by machine learning algorithms.

Improved performance. Text embeddings can capture the semantic relationships between words and phrases, which can lead to better performance in various natural language processing tasks. By leveraging pre-trained embeddings or fine-tuning them on domain-specific data, it is possible to achieve higher accuracy and more relevant results.

Transfer learning. Pre-trained text embeddings can be used as a starting point for training domain-specific models. This transfer learning approach can save time and computational resources, as it enables the new model to benefit from the knowledge gained during the pre-training phase.

Language-agnostic processing. Text embeddings can be used to work with multiple languages, as they can capture semantic similarities even across different languages. This enables the development of multilingual applications and systems that can process and analyze text data from various sources.

Interoperability. Text embeddings can be used as a common representation for different text sources and formats, making it easier to combine and process data from various sources. This can help facilitate interoperability between different systems and applications.

Conceptually, saving the embedding of a text for future searches should be cheaper than regenerating it every time, but there is no empirical study comparing the two yet. The answer is also greatly impacted by the model used to generate the embeddings. If you are interested in knowing more, I found this article quite informative:

Things to Consider When Introducing Embeddings into Your Solution

In my previous articles, I’ve demonstrated several use cases of using GPT-3 embeddings to enrich the features of knowledge-graph-based solutions. Below are the links to them:

Here I am summarizing some findings from actual implementation.

Which Model to Choose?

If you don’t want to train your own language model, and simply want to call an API to get the embedding of a piece of text without worrying about how it is done behind the scenes, the OpenAI GPT-3 Embedding API is a good choice.
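As a minimal sketch, this is roughly what such a call looks like with the openai Python package (the pre-1.0 interface current at the time of writing), assuming an API key is available in the environment; the model used is the text-embedding-ada-002 model discussed below.

# Minimal sketch: requesting an embedding from the OpenAI Embedding API
# (pre-1.0 openai Python SDK; API key assumed to be set in the environment)
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

response = openai.Embedding.create(
    model="text-embedding-ada-002",
    input="What is text embedding?",
)

vector = response["data"][0]["embedding"]
print(len(vector))  # 1536 dimensions for text-embedding-ada-002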

For now, there are several embedding models to choose from, but the older first-generation ones are not recommended, so I used the text-embedding-ada-002 model. As per OpenAI, it has the best performance compared to earlier models:

Source: OpenAI, as of Mar. 2023

BEIR (Benchmarking-IR) is a heterogeneous benchmark containing diverse IR tasks.

Source: BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models. https://arxiv.org/abs/2104.08663

There is also the MTEB (Massive Text Embedding Benchmark) Leaderboard published on Hugging Face for the public to assess the performance of various language models on each natural language task. To put text-embedding-ada-002 into context, below is a chart of how it (in bold blue) ranks across various test datasets / tasks compared with other top models.

Performance chart of text-embedding-ada-002, as of Mar. 2023. Data collected from https://huggingface.co/spaces/mteb/leaderboard

Even though text-embedding-ada-002 doesn’t top most of the tasks, it is never far from the best and always appears in the upper part of the bars.

Storage Type

Embeddings are vectors, i.e. arrays / lists of numbers, represented in computer systems as floats, which usually take 4 bytes each in single precision. Vectors as a data type can be stored and processed by most databases today, but certain capabilities are required to make searching over vectors efficient.
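As a rough back-of-the-envelope estimate (my own assumptions, ignoring database overhead and indexes): a text-embedding-ada-002 vector has 1,536 dimensions, so a single embedding takes about 6 KB, and embedding a few million texts quickly adds up to double-digit gigabytes.

# Back-of-the-envelope storage estimate (illustrative assumptions)
dimensions = 1536          # output size of text-embedding-ada-002
bytes_per_float = 4        # single-precision float
num_texts = 2_200_000      # e.g. titles + abstracts of ~2.2M arXiv papers

bytes_per_vector = dimensions * bytes_per_float        # 6,144 bytes, ~6 KB
total_gb = num_texts * bytes_per_vector / (1024 ** 3)  # ~12.6 GB

print(bytes_per_vector, round(total_gb, 1))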

There are also so-called vector databases on the market today. With the help of the Graph Data Science library, the Neo4j Graph Database can also store and query vectors in an efficient way. You may find more details in another of my posts.

Retrieval

In fact, you can’t really search for a vector, but rather find the ones that are most similar to it. The most commonly used method to measure the similarity between two vectors is cosine similarity.
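For reference, cosine similarity is the dot product of the two vectors divided by the product of their lengths, so it measures the angle between them regardless of their magnitudes. A minimal NumPy sketch, independent of any particular database:

# Cosine similarity between two embedding vectors (NumPy sketch)
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors for illustration; real ada-002 embeddings have 1,536 dimensions
print(cosine_similarity([0.1, 0.2, 0.3], [0.1, 0.25, 0.28]))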

For example, for the question at the beginning of this article:

Question: what is text embedding?

Answer #1: Embedding is a machine learning process that converts complex, high-dimensional data, e.g. text, images etc., into lower-dimensional representations while preserving essential relationships and structure.

Answer #2: Embedding is to implant (an idea or feeling) so that it becomes ingrained within a particular context. (Source: Oxford Dictionary)

We can use the following function from the Neo4j Graph Data Science library to calculate the similarity between the question and each answer:

// Assume text and embeddings are already stored in the database
MATCH (q1:Question) WHERE id(q1) = 1234
MATCH (a1:Answer) WHERE id(a1) = 5678
RETURN gds.similarity.cosine(q1.embedding, a1.embedding) AS similarity

The similarity between the question and answer #1 is 0.8643625, and between the question and answer #2 it is 0.823399. A higher score means the two pieces of text are semantically more relevant / similar, so answer #1 is the better one.

Cost

There is a cost every time the OpenAI Embedding API is called, based on the number of tokens in the request and the model used. For text-embedding-ada-002, it is 0.04 cents per 1,000 tokens, which is roughly 4,000 characters.

The arXiv repository contains about 2.2 million papers. The worksheet below gives a cost estimate for generating embeddings for the titles and abstracts of all of them.

Cost estimate worksheet sample. The unit of min, max and avg in the table is the number of characters.
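To make the arithmetic concrete, here is a rough estimate with illustrative numbers; the average text length is my own assumption, not the figure from the worksheet above.

# Rough cost estimate for embedding ~2.2M titles + abstracts (illustrative)
num_papers = 2_200_000
avg_chars = 1_500                 # assumed average title + abstract length
chars_per_token = 4               # roughly 4 English characters per token
price_per_1k_tokens = 0.0004      # USD, text-embedding-ada-002

total_tokens = num_papers * avg_chars / chars_per_token   # ~825M tokens
total_cost = total_tokens / 1000 * price_per_1k_tokens    # ~USD 330

print(round(total_tokens), round(total_cost, 2))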

Even for paid clients, there is initially a monthly usage cap of US$120, which is apparently not enough for even half of the job above. A request and approval are required to increase the cap.

Throughput

Another consideration when using the OpenAI Embedding API is its throughput. Based on my tests with a normal paid account, I can get about 60 requests per minute in each session, capped at about 2 requests per second. OpenAI also puts a limit on the number of requests from each account.

The workaround is to batch texts in the request. Instead of sending one piece of text in each API call, it is possible to send a collection / list of texts and receive a list of vectors in the response. However, the total size of the texts is also limited by the API to 8,191 tokens (about 6,000 English words), which means some chunking logic needs to be put in place too.
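A minimal sketch of such batching, again with the pre-1.0 openai Python package; the batch contents and size are placeholders, and the chunking of over-long texts is left out.

# Batched embedding requests: send a list of texts, get a list of vectors back
# (pre-1.0 openai Python SDK; texts and batch size are illustrative)
import openai

def embed_batch(texts, model="text-embedding-ada-002"):
    response = openai.Embedding.create(model=model, input=texts)
    # Each returned item carries an index matching its position in the input
    data = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in data]

texts = ["first abstract ...", "second abstract ...", "third abstract ..."]
vectors = embed_batch(texts)
print(len(vectors), len(vectors[0]))  # 3 vectors, 1,536 dimensions each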

Once texts are batched, the throughput can reach 100 to 200 sentences per second. This feature can easily be missed, so I share it here:

Source: OpenAI API Reference, https://platform.openai.com/docs/api-reference/embeddings

Summary

Embedding is the knowledge for AI. With continuous advancements in techniques and applications, AI systems will become more complex and capable. The role of embeddings in capturing essential relationships, semantics, and patterns within data will become even more crucial.


Fanghua (Joshua) Yu

I believe our lives become more meaningful when we are connected, so is data. Happy to connect and share: https://www.linkedin.com/in/joshuayu/