OpenAI GPT-3 Text Embeddings - Really a new state-of-the-art in dense text embeddings?

Figure: Encoding costs & average performance on 14 sentence embedding tasks of the OpenAI GPT-3 embedding models in comparison to open alternatives.
OpenAI offers three classes of embedding models:
  • Text Similarity — e.g. useful for clustering, deduplication and topic modeling.
  • Text Search — e.g. useful for retrieving information from a large corpus.
  • Code Search — e.g. useful for finding a function for a given search query.


In summary:
  • The OpenAI text similarity models perform poorly, much worse than the state of the art (all-mpnet-base-v2 / all-roberta-large-v1). In fact, they perform worse than models from 2018 such as the Universal Sentence Encoder. They are also 6 points weaker than extremely small models with just 22M parameters that can run in your browser.
  • The text search models perform quite well, giving good results on several benchmarks. But they are not quite state-of-the-art compared to recent, freely available models.
  • The embedding models are slow and expensive: Encoding 10 million documents with the smallest OpenAI model will cost about $80,000. In comparison, using an equally strong open model and running it in the cloud costs as little as $1. Operating costs are also tremendous: using the OpenAI models for an application with 1 million monthly queries costs up to $9,000 / month. Open models, which perform better at much lower latencies, cost just $300 / month for the same use case.
  • They generate extremely high-dimensional embeddings, significantly slowing down downstream applications while requiring much more memory.
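The cost and memory numbers above follow from simple arithmetic. A minimal sketch in Python (the average document length, the per-1k-token price, and float32 storage are illustrative assumptions, not OpenAI's published pricing):

```python
def encoding_cost_usd(num_docs, avg_tokens_per_doc, price_per_1k_tokens):
    """Total cost to embed a corpus through a pay-per-token API."""
    total_tokens = num_docs * avg_tokens_per_doc
    return total_tokens / 1000 * price_per_1k_tokens

def index_size_gb(num_docs, dims, bytes_per_float=4):
    """Memory needed to hold all embeddings as float32 vectors."""
    return num_docs * dims * bytes_per_float / 1e9

# 10M documents, ~400 tokens each, at an assumed $0.02 / 1k tokens
print(encoding_cost_usd(10_000_000, 400, 0.02))   # ≈ 80,000 USD

# 10M Davinci embeddings (12288 dims) vs. all-MiniLM-L6-v2 (384 dims)
print(index_size_gb(10_000_000, 12288))  # ≈ 491.5 GB
print(index_size_gb(10_000_000, 384))    # ≈ 15.4 GB
```

The dimensionality gap alone explains the downstream slowdown: every similarity computation and every stored vector scales linearly with the embedding size.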

Available Models & Dimensionality

  • Ada (1024 dimensions)
  • Babbage (2048 dimensions)
  • Curie (4096 dimensions)
  • Davinci (12288 dimensions)

Computing Embeddings

With the OpenAI API (the engine name below is one of the published text similarity models):

import openai
response = openai.Embedding.create(
    input="This is an example",
    engine="text-similarity-ada-001"
)
embedding = response["data"][0]["embedding"]

With Sentence Transformers, an open alternative:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
sentence_embedding = model.encode('This is an example')
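Whichever model produced them, the resulting vectors are typically compared with cosine similarity. A self-contained sketch with toy vectors (real embeddings would come from one of the calls above):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings"
query = [0.1, 0.3, 0.5, 0.1]
doc_a = [0.1, 0.3, 0.5, 0.1]   # identical direction -> similarity 1.0
doc_b = [0.5, 0.1, 0.1, 0.3]
print(cosine_similarity(query, doc_a))
print(cosine_similarity(query, doc_b))
```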

Sentence-Embedding Performance

  • Paraphrase Detection: Scoring whether a pair of texts are paraphrases of each other. Here I evaluate on SprintDuplicateQuestions (question pairs from the Sprint technical forum), and on TwitterSemEval2015 & TwitterURLCorpus (annotated paraphrase pairs of tweets).
  • Clustering: Clustering of similar sentences into topics. Here I evaluate on Reddit (cluster reddit titles into subreddits), 20NewsGroups (cluster email subjects into topics), and StackExchange (cluster questions into topics).
  • Retrieval: Given a sentence, find related sentences. Here I evaluate on: CQADupStack (finding similar questions on different StackExchange forums), Quora (find related questions on Quora), AskUbuntu (find helpful questions on AskUbuntu), StackOverflowDupQuestions (find duplicate questions on StackOverflow), and SciDocs (find related scientific papers based on titles).
  • Semantic Textual Similarity: In STS, text pairs are annotated on a scale from 0 to 1 according to their similarity. Here I use BIOSSES (biomedical text pairs), STSBenchmark and SICK-R (general text pairs from many domains).
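STS tasks are conventionally scored with the Spearman rank correlation between the model's cosine similarity scores and the human similarity labels. A minimal sketch (toy scores, no tie handling):

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks.
    Simplified for illustration (assumes no tied values)."""
    rx = np.argsort(np.argsort(x)) - 0.0  # ranks as floats
    ry = np.argsort(np.argsort(y)) - 0.0
    rx -= rx.mean()
    ry -= ry.mean()
    return float(rx @ ry / np.sqrt((rx @ rx) * (ry @ ry)))

# Gold similarity labels vs. model cosine scores for 4 sentence pairs
gold   = [0.9, 0.2, 0.7, 0.1]
scores = [0.8, 0.1, 0.6, 0.0]   # same ranking as gold
print(spearman(gold, scores))   # 1.0
```

Because only the ranking matters, a model can score well on STS even if its raw similarity values are uncalibrated.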

Baselines / Comparison

  • all-MiniLM-L6-v1: An extremely small (80 MB) and fast model, with only 6 layers and producing embeddings with 384 dimensions.
  • all-mpnet-base-v2: A bert-base sized model (418 MB) with 12 layers and 768 dimensions.
  • all-roberta-large-v1: A model based on RoBERTA-large (1.3 GB) with 24 layers and 1024 dimensions.
  • Universal Sentence Encoder — Large: A transformer-based version of the Universal Sentence Encoder with 6 layers and 512 dimensions, published in 2018.
  • Sentence-T5: Google's most recent text embedding model, published in August 2021.


Figure: Encoding costs & average performance on 14 sentence embedding tasks of the OpenAI embedding models in comparison to open alternatives.

Text Search Performance

  • TREC-Deep Learning (DL) 2019 & 2020: Queries from the Bing search engine annotated with relevant passages from the web.
  • FiQA: Financial question answering.
  • TREC-COVID: Retrieval on COVID-19 scientific papers. As discussed in BEIR, the dataset contains a high number of incomplete articles for which only the title is available. Here, I just tested the models on papers that have all fields available.
  • TREC-News: Given a news article, retrieve relevant other news articles providing context and background information.
  • Robust04: Especially challenging queries over a large collection of documents.
Figure: Average performance on 6 semantic search tasks.
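Benchmarks like TREC-DL and the BEIR tasks are usually reported as nDCG@10: a ranking metric that rewards placing relevant documents near the top. A minimal sketch of the metric (toy relevance judgments):

```python
import math

def ndcg_at_k(relevances, k=10):
    """nDCG@k for one query. `relevances` holds the graded relevance of
    the retrieved documents, in the order the system ranked them."""
    def dcg(rels):
        # Each relevant result is discounted by its rank position
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Relevant docs at positions 2 and 4 instead of 1 and 2 -> score below 1.0
print(ndcg_at_k([0, 1, 0, 1], k=10))
# A perfect ranking scores exactly 1.0
print(ndcg_at_k([1, 1, 0, 0], k=10))
```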

Operation Costs

Encoding Costs

Operating Costs

Code Search






Nils Reimers, Research Scientist at Hugging Face working on Neural Search
