An Experiment with Similarity Search and a Vector Database — Part 2

Neel Phadnis · Published in Notes and Points · 3 min read · Jun 20, 2023

In this post, we describe the results of a simple experiment that used a vector database to store text embeddings from different models and perform similarity search using indexes built on different similarity metrics.

This post is a continuation of Part 1, in which we described a similar experiment with similarity search without a vector database.

The experiment

Embeddings were generated using the following text embedding models, and then stored and indexed in the Pinecone Vector Database:

  1. all-MiniLM-L6-v2 (“MiniLM”)
  2. text-embedding-ada-002 (“Ada”)
  3. text-similarity-davinci-001 (“Davinci”)

Similarity search results for different similarity metrics (cosine, dot product, and Euclidean distance) were compared across the embeddings.
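The indexing-and-query flow can be sketched as below. This is a minimal sketch using the 2023-era `pinecone-client` API; the index name, environment, and API key are placeholders, and the exact client calls vary across versions (see the notebook for the actual code).

```python
def build_index_and_query(embeddings, query_vector, metric="cosine"):
    """Upsert (id, vector) pairs into a Pinecone index and return the
    top-6 matches for the query vector under the given metric.
    Sketch only: index name, environment, and key are placeholders."""
    import pinecone  # pip install pinecone-client

    pinecone.init(api_key="YOUR_API_KEY", environment="us-west1-gcp")
    name = f"similarity-demo-{metric}"
    if name not in pinecone.list_indexes():
        # metric is one of "cosine", "dotproduct", "euclidean"
        pinecone.create_index(name, dimension=len(query_vector), metric=metric)
    index = pinecone.Index(name)
    index.upsert(vectors=[(str(i), vec) for i, vec in enumerate(embeddings)])
    return index.query(vector=query_vector, top_k=6)
```

Because the metric is fixed at index-creation time, comparing the three metrics means creating one index per metric over the same vectors.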

You can view the code and details in this Jupyter notebook.

Queries and dataset

We used two queries over a dataset. Each query sentence had 6 close semantic equivalents in the dataset: a seed result sentence and five variations of it. The five variations of each seed result sentence were generated using ChatGPT with the following prompt:

Generate variations without changing the meaning of the sentence: “<original seed sentence>”. Provide the following variations: 1) Paraphrase, 2) Elaboration, 3) Simplification, 4) Synonym, 5) Summary in 20 words or less.

The six target result sentences for each query were included in the dataset. The dataset also contained many other sentences that were close in phrasing and topic, but sufficiently different semantically.
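The variation-generation step can be sketched with the 2023-era `openai` package. The prompt wording follows the experiment's description above; the model name and API key are assumptions, and `generate_variations` is a hypothetical helper, not code from the notebook.

```python
# Prompt template from the experiment; {seed} is filled with the seed sentence.
VARIATION_PROMPT = (
    'Generate variations without changing the meaning of the sentence: '
    '"{seed}". Provide the following variations: 1) Paraphrase, '
    '2) Elaboration, 3) Simplification, 4) Synonym, '
    '5) Summary in 20 words or less.'
)


def generate_variations(seed_sentence):
    """Ask ChatGPT for five meaning-preserving variations of a sentence.
    Sketch only: model name and API key are placeholders."""
    import openai  # pip install openai

    openai.api_key = "YOUR_API_KEY"
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": VARIATION_PROMPT.format(seed=seed_sentence)}],
    )
    return response["choices"][0]["message"]["content"]
```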

Of the two query sentences, the second depended on stronger semantic matching (versus keyword or topic matching) to pick out similar sentences. As discussed below, one of the models performed poorly on the second query.

Results summary

  1. The best-fit sentence differed across models. The top 6 results and their order also differed across embeddings.
  2. All similarity metrics performed identically for each embedding model. This can be attributed to the small number of searches and the very small dataset.
  3. Ada embeddings performed better than MiniLM. This may be attributable to Ada’s higher dimensionality (1536 vs. 384), and to the fact that ChatGPT, which likely uses Ada or a close variant, was used to generate the target result sentences.
  4. Davinci embeddings search did not return any of the target matches for the second query sentence, in spite of its high embedding dimensionality (12,288). A likely reason is that this multi-modal model may not have been trained on as extensive a text corpus as Ada.
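On point 2, identical rankings across metrics are also what one would expect whenever the embeddings are unit-normalized: for unit vectors, cosine equals the dot product, and the squared Euclidean distance is 2 − 2·dot, so all three orderings coincide. A toy illustration with random stand-in vectors (not the actual embeddings):

```python
import numpy as np

# Random stand-ins for document and query embeddings, unit-normalized.
rng = np.random.default_rng(0)
docs = rng.normal(size=(8, 16))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = rng.normal(size=16)
query /= np.linalg.norm(query)

dot = docs @ query                           # dot product (== cosine here)
euclid = np.linalg.norm(docs - query, axis=1)

# Higher dot/cosine similarity corresponds to lower Euclidean distance,
# so the three metrics produce the same ranking.
assert np.array_equal(np.argsort(-dot), np.argsort(euclid))
```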

Interesting specifics

Query 1: “The deer froze in the headlights of the car.”

Top results: All three models returned very similar best-fit sentences.

  • MiniLM: “The car skidded and stopped for the frozen deer in its headlights.”
  • Ada: “The car skidded to stop for the deer that stood frozen in the headlights of the car.”
  • Davinci: “The car skidded and stopped for the frozen deer in its headlights.”

Accuracy: The search on Ada embeddings returned perfect results.

  • MiniLM: Top 6 results had 5 expected sentences: 5/6 = 83%.
  • Ada: Top 6 results had all 6 expected sentences: 6/6 = 100%.
  • Davinci: Top 6 results had 5 expected sentences: 5/6 = 83%.

Query 2: “Dream to solve the world’s problems.”

Top results: MiniLM and Ada results were similar and expected.

  • MiniLM: “The 16-year-old genius, having earned a PhD, contemplated global challenges and recognized science as the solution.”
  • Ada: “Having finished his PhD at 16, the boy genius contemplated the challenges the world faced, and intuited that science must be the solution.”
  • Davinci: “Finding sustainable energy solutions is crucial for a greener future.”

Accuracy: Davinci performed poorly as it did not return any of the expected results.

  • MiniLM: Top 6 results had 5 expected sentences: 5/6 = 83%.
  • Ada: Top 6 results had 5 expected sentences: 5/6 = 83%.
  • Davinci: Top 6 results had 0 expected sentences: 0/6 = 0%.
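The accuracy figures above amount to top-6 recall: the fraction of the six expected sentences that appear among the six returned results. A minimal sketch (the sentence IDs are illustrative, not from the dataset):

```python
def top_k_recall(returned_ids, expected_ids):
    """Fraction of expected items present in the returned list."""
    hits = len(set(returned_ids) & set(expected_ids))
    return hits / len(expected_ids)

# Illustrative IDs: 5 of the 6 expected sentences returned.
expected = {"s1", "s2", "s3", "s4", "s5", "s6"}
minilm_top6 = ["s1", "s2", "s3", "s4", "s5", "x9"]
print(f"{top_k_recall(minilm_top6, expected):.0%}")  # prints "83%"
```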

You can find the details in this Jupyter notebook.


Neel Phadnis

Technologist, engineering leader, and outdoor enthusiast.