OpenAI GPT-3 Text Embeddings - Really a new state-of-the-art in dense text embeddings?

Nils Reimers
Jan 28, 2022


This week, OpenAI announced an embeddings endpoint (paper) for GPT-3 that allows users to derive dense text embeddings for a given input text, with allegedly state-of-the-art performance on several relevant tasks. In this post, I will review how good these new GPT-3 embeddings really are. Are they really a new state of the art?

Figure: Encoding costs & average performance on 14 sentence embedding tasks of the OpenAI GPT-3 embedding models in comparison to open alternatives.

Dense text embeddings are useful for many tasks, including clustering, topic modeling, deduplication, paraphrase mining and semantic search. As part of my research, I’ve worked on dense text embeddings since 2019 and released my research as part of the sentence-transformers framework, which provides open & free state-of-the-art text embedding models for many use-cases.

OpenAI provides endpoints for three different use-cases:

  • Text Similarity — E.g. useful for clustering, deduplication and topic modeling.
  • Text Search — E.g. useful for retrieving information from a large corpus
  • Code Search — E.g. useful for finding a function for a given search query

I wanted to investigate how well these GPT-3 based embeddings would work so I benchmarked the text similarity on 14 datasets and text search embeddings on 6 datasets from various domains: Twitter, StackExchange, Reddit, emails, news, scientific publications and many more.

Summary

While I was excited about OpenAI’s new release, the results were not what I expected:

  • The OpenAI text similarity models perform poorly and much worse than the state of the art (all-mpnet-base-v2 / all-roberta-large-v1). In fact, they perform worse than models from 2018 such as the Universal Sentence Encoder. They are even 6 points weaker than extremely small models with just 22M parameters that can run in your browser.
  • The text search models perform quite well, giving good results on several benchmarks. But they are not quite state-of-the-art compared to recent, freely available models.
  • The embedding models are slow and expensive: Encoding 10 million documents with the smallest OpenAI model will cost about $80,000. In comparison, using an equally strong open model and running it in the cloud will cost as little as $1. The operating costs are also tremendous: Using the OpenAI models for an application with 1 million monthly queries costs up to $9,000 / month. Open models, which perform better at much lower latencies, cost just $300 / month for the same use-case.
  • They generate extremely high-dimensional embeddings, significantly slowing down downstream applications while requiring much more memory.

Available Models & Dimensionality

Via a REST API endpoint, you can access four types of models from OpenAI:

  • Ada (1024 dimensions)
  • Babbage (2048 dimensions)
  • Curie (4096 dimensions)
  • Davinci (12288 dimensions)

Davinci is claimed to be the most capable model (and most expensive), while Ada is the least capable but cheapest model.

With 12288 dimensions, the Davinci embeddings are extremely high-dimensional. For comparison, all-MiniLM-L12-v1 produces embeddings with 384 dimensions, Universal Sentence Encoder with 512 dimensions, and all-mpnet-base-v2 with 768 dimensions.

Dimensions are not free: Assume you want to build a semantic search engine over the English Wikipedia, which has about 21 million passages you need to encode. Using float16 (and no further compression techniques) and 384 dimensions, the resulting embeddings have a size of about 16 GB, which fits easily on a decently sized server (like an n2-highmem-4 for about $150/month on Google Cloud). Using 12288 dimensions, you need at least 516 GB of memory to store the embeddings, increasing your compute cost to $3,000/month for an n2-highmem-80 instance.
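
To make the memory math concrete, here is a quick back-of-the-envelope calculation. It is only a sketch: it counts the raw embedding matrix (21 million passages, float16) and ignores any additional index overhead.

NUM_PASSAGES = 21_000_000
BYTES_PER_VALUE = 2  # float16

def embeddings_size_gb(dim: int) -> float:
    # Raw size of the embedding matrix in GB, without any ANN index overhead
    return NUM_PASSAGES * dim * BYTES_PER_VALUE / 1e9

for dim in (384, 768, 1024, 12288):
    print(f"{dim:>5} dims -> {embeddings_size_gb(dim):7.1f} GB")
# 384 dims -> ~16 GB, 12288 dims -> ~516 GB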

Further, any downstream task like clustering or search is a lot slower in 12288 dimensions than in lower-dimensional vector spaces. Hence, I would only consider Ada (1024 dim) and maybe Babbage (2048 dim) practical for most scenarios. Curie and Davinci produce vectors that are simply too high-dimensional for any larger-scale task. Dimensionality reduction techniques like PCA cannot solve this, as they significantly impact downstream performance.

Computing Embeddings

OpenAI has made it easy to compute embeddings via a REST API:

import openai

response = openai.Embedding.create(
    input="This is an example",
    engine="text-similarity-davinci-001"
)
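
The response contains the embedding as a plain list of floats. A minimal sketch of extracting it, based on the beta API as I used it; the exact response format may have changed since:

# Response layout as of the beta endpoint; may differ in later versions
embedding = response["data"][0]["embedding"]
print(len(embedding))  # 12288 dimensions for the davinci similarity model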

I used the endpoint in December 2021, when it was still in beta.

Computing embeddings with the open-source framework sentence-transformers is similarly easy and runs on your local machine or server:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
sentence_embeddings = model.encode('This is an example')
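
For similarity tasks, the resulting embeddings are typically compared with cosine similarity. A minimal sketch using the util module of sentence-transformers:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode two sentences and score them with cosine similarity
emb1 = model.encode('This is an example', convert_to_tensor=True)
emb2 = model.encode('This is a sample sentence', convert_to_tensor=True)
print(util.cos_sim(emb1, emb2))  # higher score = more similar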

Sentence-Embedding Performance

First, I tested the OpenAI embedding models for their ability to encode sentences in a semantic vector space. For this, I created a benchmark that consists of 14 complex tasks:

  • Paraphrase Detection: Scoring whether two texts in a pair are paraphrases of each other. Here I evaluate on SprintDuplicateQuestions (question pairs from the Sprint technical forum), and on TwitterSemEval2015 & TwitterURLCorpus (annotated paraphrase pairs of Tweets).
  • Clustering: Clustering of similar sentences into topics. Here I evaluate on Reddit (cluster reddit titles into subreddits), 20NewsGroups (cluster email subjects into topics), and StackExchange (cluster questions into topics).
  • Retrieval: Given a sentence, find related sentences. Here I evaluate on: CQADupStack (finding similar questions on different StackExchange forums), Quora (find related questions on Quora), AskUbuntu (find helpful questions on AskUbuntu), StackOverflowDupQuestions (find duplicate questions on StackOverflow), and SciDocs (find related scientific papers based on titles).
  • Semantic Textual Similarity: STS is the task of annotating text pairs on a scale from 0 to 1 according to their similarity. Here I use BIOSSES (biomedical text pairs), STSBenchmark and SICK-R (general text pairs from many domains).

To excel on this benchmark, text embedding models must be able to understand text from various domains and create vector spaces with different properties, e.g. the properties needed for clustering vs. those needed for retrieval.
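
To make the evaluation protocol concrete, here is a minimal sketch of how the STS tasks are typically scored: encode both sentences of each pair, compute the cosine similarity, and correlate these scores with the human annotations. The two example pairs below are placeholders.

from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

# Placeholder STS-style data: (sentence1, sentence2, human similarity score)
pairs = [
    ("A man is playing a guitar", "A person plays an instrument", 4.2),
    ("A dog runs in the park", "The stock market fell today", 0.1),
]

emb1 = model.encode([p[0] for p in pairs], convert_to_tensor=True)
emb2 = model.encode([p[1] for p in pairs], convert_to_tensor=True)
predicted = util.cos_sim(emb1, emb2).diagonal().tolist()
gold = [p[2] for p in pairs]

# Spearman rank correlation between model scores and human judgments
print(spearmanr(predicted, gold).correlation)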

Baselines / Comparison

For comparison, I include the following models:

  • all-MiniLM-L6-v1: An extremely small (80 MB) and fast model, with only 6 layers and producing embeddings with 384 dimensions.
  • all-mpnet-base-v2: A bert-base sized model (418 MB) with 12 layers and 768 dimensions.
  • all-roberta-large-v1: A model based on RoBERTA-large (1.3 GB) with 24 layers and 1024 dimensions.
  • Universal Sentence Encoder — Large: A transformer-based version of Universal Sentence Encoder with 6 layers and 512 dimensions, published in 2018.
  • Sentence-T5: The most recent text embedding model from Google published in August 2021.

Results

Figure: Encoding costs & average performance on 14 sentence embedding tasks of the OpenAI embedding models in comparison to open alternatives.

As the results show, the sentence similarity models from OpenAI perform a lot worse than models such as the Universal Sentence Encoder, which was published in March 2018, and also much worse than the state-of-the-art models from sentence-transformers & Sentence-T5. In fact, the largest model (davinci) with 175B parameters is around 10 points weaker than the all-MiniLM-L6-v2 with just 22M parameters — a model that you can easily run in your browser.

In the paper, OpenAI evaluated the model on SentEval, a benchmark to test sentence embedding models for text classification.

First, this comparison leaves out many relevant models from 2020 and 2021, which are substantially better than the models they compare against. Second, SentEval tests sentence embeddings for a rather narrow use case.

SentEval tests sentence embeddings for their ability to do text classification by adding a softmax classification head on top and fine-tuning only this head on the available training data. This only makes sense if you want to run many different classifiers on the same text. By pre-computing and sharing the text embeddings across classifiers, you can save a lot of compute time. However, if you only run a single text classifier, it makes much more sense to fully fine-tune your network. For instance, for the Microsoft Research Paraphrase Corpus (MRPC) dataset, a tiny model like MiniLMv2 with just 30M parameters (~60MB in size) achieves an accuracy of 88.7. Using the largest embedding model from OpenAI, cpt-text XL, with 175B parameters (~350 GB in size), you achieve an accuracy of just 78.1.
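
For illustration, a SentEval-style setup keeps the text encoder frozen and only trains a small classification head on the precomputed embeddings. A minimal sketch with scikit-learn; the training data below is a placeholder:

from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

model = SentenceTransformer('all-MiniLM-L6-v2')

# Placeholder data for a binary text classification task
train_texts = ["great movie", "terrible film", "loved it", "waste of time"]
train_labels = [1, 0, 1, 0]

# The embeddings are computed once and stay frozen; only the head is trained
train_emb = model.encode(train_texts)
clf = LogisticRegression(max_iter=1000).fit(train_emb, train_labels)

print(clf.predict(model.encode(["an awful movie"])))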

Furthermore, from SentEval, we cannot conclude how well a model will perform for the advertised downstream applications like clustering, semantic search, or paraphrase mining. Even text encoders with random parameters perform well on SentEval while being unusable for vector space tasks like clustering & search.

From the paper it appears that the text similarity model was trained using a nearly identical approach to DeCLUTR using consecutive texts in documents as positive pairs. While it is interesting to see how these approaches scale to billion parameter models, the produced models are significantly weaker than models which exploit more structure from the data. For example, Sentence-T5 and all-mpnet-base-v2 used question-answer pairs, conversation pairs, and title-body pairs crawled from the web, which yields significantly better models.
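
For context, models like all-mpnet-base-v2 are trained with an in-batch negatives objective on such (anchor, positive) pairs. A minimal sketch of this setup with sentence-transformers; the base model and the two pairs are placeholders:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Any pre-trained encoder can serve as the starting point
model = SentenceTransformer('distilbert-base-uncased')

# Placeholder (anchor, positive) pairs, e.g. question-answer or title-body pairs
train_examples = [
    InputExample(texts=["How do I reset my password?", "Open the settings page and click 'reset password'."]),
    InputExample(texts=["What is the capital of France?", "Paris is the capital of France."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# MultipleNegativesRankingLoss treats the other in-batch positives as negatives
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)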

If we compare the OpenAI models only to models trained on unstructured data, they perform a bit better than the strongest unsupervised model (princeton-nlp/unsup-simcse-bert-large-uncased), which achieves an average of 60.83 on the above benchmark.

Text Search Performance

The next area of focus is text search, for which OpenAI provides dedicated models. Unfortunately, the paper does not clarify how these models were trained or on which datasets.

In the paper we find numbers for 11 out of the 18 datasets in BEIR, a benchmark for zero-shot information retrieval that my research group developed last year. Why 7 datasets from the benchmark were left out is not clear.

The results on retrieval look much better than the results for text similarity, indicating a strong model. In December, I tested the model exposed via the API on the FiQA dataset, but sadly got different results than what was reported in the paper:

It might be that the paper used a different model, or that the model behind the API was different when I tested it in December, or the authors did some different pre-processing. (Update 2022-02-09: The difference in performance is due to different truncation. GPT-3 only supports inputs of up to 2048 word pieces. Sadly, the API does not offer a truncation option, and trying to encode text longer than 2048 word pieces results in an error. It is up to you to figure out how much text you can encode. I used a simple truncation strategy where I only encoded the first two thousand characters. The author later provided a script that uses a GPT-2 tokenizer and iteratively removes words from the end until the input is below 2040 word pieces. With this more careful truncation strategy, the results are supposed to be reproducible.)
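
A minimal sketch of such a token-based truncation, assuming the GPT-2 tokenizer from the transformers library and the 2040 word-piece budget mentioned above (the author's actual script may differ):

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def truncate_to_token_budget(text, max_tokens=2040):
    # Iteratively drop words from the end until the text fits the token budget
    words = text.split()
    while words and len(tokenizer.encode(" ".join(words))) > max_tokens:
        words.pop()
    return " ".join(words)

long_document = "some very long passage " * 1000  # placeholder input
short_document = truncate_to_token_budget(long_document)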

I tested the models available via the API a bit further on several (query, document) retrieval datasets:

  • TREC-Deep Learning (DL) 2019 & 2020: Queries from the Bing search engine annotated with relevant passages from the web.
  • FiQA: Financial question answering
  • TREC-COVID: Retrieval on COVID-19 scientific papers. As discussed in BEIR, the dataset contains a high number of incomplete articles for which only the title is available. Here, I just tested the models on papers that have all fields available.
  • TREC-News: Given a news article, retrieve relevant other news articles providing context and background information.
  • Robust04: It contains especially challenging queries for a large collection of documents.

The average results are depicted below. I was only able to test ada & babbage, as my access was too restricted to run further experiments.

Figure: Average performance on 6 semantic search tasks.

The OpenAI models perform comparably to open dense models. However, the biggest difference lies in the costs.

Operation Costs

When building a search application, two factors are highly relevant: operation costs, i.e. how much it costs to set up & run the index, and latency, i.e. how quickly the search returns results.

I assume we want to do semantic search on the English Wikipedia with about 1 million queries per month. As a free comparison system, I use SpladeV2, a sparse embedding model that performs well for semantic search. According to the OpenAI paper, SpladeV2 and the OpenAI GPT-3 embedding models perform in the following way on BEIR:

As we see, the largest OpenAI model with 175 billion parameters is just 0.1 points better than SpladeV2 which has just 66 million parameters. How the results will change when evaluated on all 18 BEIR datasets remains open.

Encoding Costs

In 2020, the English Wikipedia had around 6 million articles with about 2 billion tokens. When broken down into paragraphs of 100 tokens each, this yields 21M paragraphs.

Using the OpenAI Davinci model, it would cost us over $1 million to encode all English Wikipedia articles. In contrast, SpladeV2 is based on a distilbert-base model, which can encode about 300 paragraphs per second on a T4 GPU. Using a preemptible T4 GPU on Google Cloud, we have costs of $0.13 per hour (as of 27.01.2022). Hence, encoding Wikipedia with SpladeV2 might cost as little as $2.50.
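
The SpladeV2 figure follows from a quick back-of-the-envelope calculation with the numbers above:

num_paragraphs = 21_000_000
paragraphs_per_second = 300   # distilbert-base throughput on a T4 GPU
price_per_hour = 0.13         # preemptible T4 on Google Cloud (Jan 2022)

encoding_hours = num_paragraphs / paragraphs_per_second / 3600
print(f"{encoding_hours:.1f} GPU hours")                   # ~19.4 hours
print(f"${encoding_hours * price_per_hour:.2f} in total")  # ~$2.53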

Operating Costs

Besides encoding, we have operating costs: Search queries must be encoded and queried against your index.

When we assume 1 million queries per month, each with on average 10 tokens, we get the following monthly costs:

The 175B Davinci model would cost us about $6,000 per month. Estimating the costs of SpladeV2 is much harder, as you can run it on your own server; it depends on how much compute you use to encode queries. But in general, SpladeV2 can be run on a CPU server, making it rather cheap. In the above table, I used an n1-standard-2 instance, which costs about $50 / month and can encode around 100 queries / second. With further model quantization, it can encode up to 500 queries / second. When your number of queries doubles, the costs for the OpenAI models will double, while your n1-standard-2 instance can handle it with ease.
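
For a sense of scale, 1 million queries per month translates into a very low average query rate, which is why a single small CPU instance is more than sufficient:

queries_per_month = 1_000_000
seconds_per_month = 30 * 24 * 3600

avg_qps = queries_per_month / seconds_per_month
print(f"{avg_qps:.2f} queries/second on average")  # ~0.39 queries/second
# Even a 50x traffic spike (~20 queries/second) stays far below the
# ~100 queries/second the n1-standard-2 instance can encode with SpladeV2.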

Finally, we also need an index server that stores the embeddings for our 21 million Wikipedia paragraphs. As mentioned above, the Davinci model yields 12288-dimensional vectors, hence we need at least 516 GB of memory to store the embeddings. This adds $3,000/month for an n2-highmem-80 instance to your operating costs.

In contrast, SpladeV2 produces sparse embeddings with about 250 non-zero elements per passage, so storing them requires about 21 GB of memory. Here, you could use an n2-highmem-8 for about $300/month.

As you want to quickly search through these vector spaces, you would need further memory to build a respective index. I left out the memory requirement for this index, as it is non-trivial to compute and depends on many trade-offs like recall, latency, index build time and more.

Code Search

OpenAI also provides an endpoint for code search. I did not run any tests on it, but the issues mentioned above (slow, too many dimensions, extremely expensive) remain the same. Luckily, there is a free alternative: st-codesearch-distilroberta-base.
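
A minimal sketch of using it with sentence-transformers for code search. The model identifier below assumes the Hugging Face Hub name under the flax-sentence-embeddings organization; the code snippets and query are placeholders.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("flax-sentence-embeddings/st-codesearch-distilroberta-base")

code_snippets = [
    "def reverse_list(x):\n    return x[::-1]",
    "def read_file(path):\n    with open(path) as f:\n        return f.read()",
]
query = "how to reverse a list"

code_emb = model.encode(code_snippets, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)

# Retrieve the code snippet that best matches the natural language query
hits = util.semantic_search(query_emb, code_emb, top_k=1)[0]
print(code_snippets[hits[0]["corpus_id"]])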

It would be interesting to see how this model performs on suitable benchmarks.

Conclusion

The text similarity models are weaker than e.g. Universal Sentence Encoder from 2018 and much weaker than text embedding models from 2021. They are even weaker than the all-MiniLM-L6-v1 model, which is so small & efficient that it can run in your browser.

The text-search models perform much stronger, achieving good results. But they are just on par with open models like SPLADEv2 or multi-qa-mpnet-base-dot-v1.

The biggest downsides of the OpenAI embeddings endpoint are the high costs (about 8,000-600,000 times more expensive than open models on your own infrastructure), the high dimensionality of up to 12288 dimensions (which makes downstream applications slow), and the extreme latency when computing embeddings. This hinders the actual use of the embeddings in any search application.

Disclaimer

I ran the experiments in late December 2021, when the embedding endpoint was in beta and not yet publicly announced. At that time, the endpoint could be used without charge. I cannot tell if the endpoint / deployed models have changed with the official release. Maybe the models got significantly better since December. Running the tests now would cost $1,000,000+.


Nils Reimers

Research Scientist at Hugging Face working on Neural Search