An Experiment with Similarity Search and a Vector Database — Part 2

Neel Phadnis · Published in Notes and Points · 3 min read · Jun 20, 2023

In this post, we describe the results of a simple experiment that used a vector database to store text embeddings from different models and perform similarity search using indexes built on different similarity metrics.

This post is a continuation of Part 1, in which we described a similar experiment with similarity search without a vector database.

The experiment

Embeddings were generated using the following text embedding models, and then stored and indexed in the Pinecone Vector Database:

  1. all-MiniLM-L6-v2 (“MiniLM”)
  2. text-embedding-ada-002 (“Ada”)
  3. text-similarity-davinci-001 (“Davinci”)

Similarity search results for different similarity metrics (cosine, dot product, and Euclidean distance) were compared across the embeddings.
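The indexing-and-query flow can be sketched as below. This is a minimal sketch using the 2023-era `pinecone-client` API; the index name, environment, and API key are placeholders, and the exact client calls vary across versions (see the notebook for the actual code).

```python
def build_index_and_query(embeddings, query_vector, metric="cosine"):
    """Upsert (id, vector) pairs into a Pinecone index and return the
    top-6 matches for the query vector under the given metric.
    Sketch only: index name, environment, and key are placeholders."""
    import pinecone  # pip install pinecone-client

    pinecone.init(api_key="YOUR_API_KEY", environment="us-west1-gcp")
    name = f"similarity-demo-{metric}"
    if name not in pinecone.list_indexes():
        # metric is one of "cosine", "dotproduct", "euclidean"
        pinecone.create_index(name, dimension=len(query_vector), metric=metric)
    index = pinecone.Index(name)
    index.upsert(vectors=[(str(i), vec) for i, vec in enumerate(embeddings)])
    return index.query(vector=query_vector, top_k=6)
```

Because the metric is fixed at index-creation time, comparing the three metrics means creating one index per metric over the same vectors.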

You can view the code and details in this Jupyter notebook.

Queries and dataset

We used two queries over a dataset. Each query sentence had 6 close semantic equivalents in the dataset: a seed result sentence and five variations of it. The five variations of each seed result sentence were generated using ChatGPT with the following prompt:

Generate variations without changing the meaning of the sentence: “<original seed sentence>”. Provide the following variations: 1) Paraphrase, 2) Elaboration, 3) Simplification, 4) Synonym, 5) Summary in 20 words or less.

The six target result sentences for each query were included in the dataset. The dataset also contained many other sentences that were close in phrasing and topic, but sufficiently different semantically.
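The variation-generation step can be sketched with the 2023-era `openai` package. The prompt wording follows the experiment's description above; the model name and API key are assumptions, and `generate_variations` is a hypothetical helper, not code from the notebook.

```python
# Prompt template from the experiment; {seed} is filled with the seed sentence.
VARIATION_PROMPT = (
    'Generate variations without changing the meaning of the sentence: '
    '"{seed}". Provide the following variations: 1) Paraphrase, '
    '2) Elaboration, 3) Simplification, 4) Synonym, '
    '5) Summary in 20 words or less.'
)


def generate_variations(seed_sentence):
    """Ask ChatGPT for five meaning-preserving variations of a sentence.
    Sketch only: model name and API key are placeholders."""
    import openai  # pip install openai

    openai.api_key = "YOUR_API_KEY"
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": VARIATION_PROMPT.format(seed=seed_sentence)}],
    )
    return response["choices"][0]["message"]["content"]
```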

Of the two query sentences, the second depended on stronger semantic matching (versus keyword or topic matching) to pick out similar sentences. As discussed below, one of the models performed poorly on the second query.

Results summary

  1. The best-fit sentence differed across models. The top 6 results and their order also differed across embeddings.
  2. All similarity metrics performed identically for each embedding model. This can be attributed to the small number of searches and the very small dataset.
  3. Ada embeddings performed better than MiniLM. This may be attributable to Ada’s higher dimensionality (1536 vs. 384), and to the fact that ChatGPT, which likely uses Ada or a close variant, was used to generate the target result sentences.
  4. Davinci embeddings search did not return any of the target matches for the second query sentence, in spite of its high embedding dimensionality (12,288). A likely reason is that this multi-modal model may not have been trained on as extensive a text corpus as Ada.
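On point 2, identical rankings across metrics are also what one would expect whenever the embeddings are unit-normalized: for unit vectors, cosine equals the dot product, and the squared Euclidean distance is 2 − 2·dot, so all three orderings coincide. A toy illustration with random stand-in vectors (not the actual embeddings):

```python
import numpy as np

# Random stand-ins for document and query embeddings, unit-normalized.
rng = np.random.default_rng(0)
docs = rng.normal(size=(8, 16))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = rng.normal(size=16)
query /= np.linalg.norm(query)

dot = docs @ query                           # dot product (== cosine here)
euclid = np.linalg.norm(docs - query, axis=1)

# Higher dot/cosine similarity corresponds to lower Euclidean distance,
# so the three metrics produce the same ranking.
assert np.array_equal(np.argsort(-dot), np.argsort(euclid))
```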

Interesting specifics

Query 1: “The deer froze in the headlights of the car.”

Top results: All three models returned very similar best-fit sentences.

  • MiniLM: “The car skidded and stopped for the frozen deer in its headlights.”
  • Ada: “The car skidded to stop for the deer that stood frozen in the headlights of the car.”
  • Davinci: “The car skidded and stopped for the frozen deer in its headlights.”

Accuracy: The search on Ada embeddings returned perfect results.

  • MiniLM: Top 6 results had 5 expected sentences: 5/6 = 83%.
  • Ada: Top 6 results had all 6 expected sentences: 6/6 = 100%.
  • Davinci: Top 6 results had 5 expected sentences: 5/6 = 83%.

Query 2: “Dream to solve the world’s problems.”

Top results: MiniLM and Ada results were similar and expected.

  • MiniLM: “The 16-year-old genius, having earned a PhD, contemplated global challenges and recognized science as the solution.”
  • Ada: “Having finished his PhD at 16, the boy genius contemplated the challenges the world faced, and intuited that science must be the solution.”
  • Davinci: “Finding sustainable energy solutions is crucial for a greener future.”

Accuracy: Davinci performed poorly as it did not return any of the expected results.

  • MiniLM: Top 6 results had 5 expected sentences: 5/6 = 83%.
  • Ada: Top 6 results had 5 expected sentences: 5/6 = 83%.
  • Davinci: Top 6 results had 0 expected sentences: 0/6 = 0%.
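The accuracy figures above amount to top-6 recall: the fraction of the six expected sentences that appear among the six returned results. A minimal sketch (the sentence IDs are illustrative, not from the dataset):

```python
def top_k_recall(returned_ids, expected_ids):
    """Fraction of expected items present in the returned list."""
    hits = len(set(returned_ids) & set(expected_ids))
    return hits / len(expected_ids)

# Illustrative IDs: 5 of the 6 expected sentences returned.
expected = {"s1", "s2", "s3", "s4", "s5", "s6"}
minilm_top6 = ["s1", "s2", "s3", "s4", "s5", "x9"]
print(f"{top_k_recall(minilm_top6, expected):.0%}")  # prints "83%"
```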

You can find the details in this Jupyter notebook.


Neel Phadnis

Technologist, engineering leader, and outdoor enthusiast.