Cosine Similarity: boon or bane in Retrieval and LLM prompt caching?

Karthik Ravichandran
Published in CodeX
2 min read · Mar 31, 2024

I’ve started to face challenges with Cosine Similarity as I delve into Query Caching. My attempts to lower the cost of calling the LLM API through prompt caching were hampered by this issue. In general, systems like RAG, or any vector-based system, use caching to bypass the usual retrieval process or LLM pass by searching for a similar query that has been submitted before. If the cosine similarity between the new query and a cached query (with its previous response) exceeds 0.90, we return the cached LLM response as the answer. Here, caching helps avoid another LLM API call. The problem, however, is that changing even a single word (for example, swapping in its opposite) within a lengthy sentence can invert the entire meaning, and cosine similarity fails to capture this.
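To make the scheme concrete, here is a minimal sketch of similarity-based prompt caching, assuming a sentence-transformers embedding model ("all-MiniLM-L6-v2" is my arbitrary choice) and a simple in-memory list as the cache; a real system would use a vector store and a tuned threshold. The opposite-word example below shows exactly where this scheme breaks down.

```python
from sentence_transformers import SentenceTransformer, util

# Assumed embedding model; any sentence embedder would work similarly.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical in-memory cache: list of (query_embedding, cached_llm_response) pairs.
cache = []

def lookup(query, threshold=0.90):
    """Return a cached LLM response if a 'similar enough' query was seen before, else None."""
    q_emb = model.encode(query, convert_to_tensor=True)
    for emb, response in cache:
        if util.cos_sim(q_emb, emb).item() >= threshold:
            return response  # cache hit: skip the LLM API call
    return None  # cache miss: caller calls the LLM and stores the result

def store(query, response):
    """Save a new query/response pair so future similar queries can reuse it."""
    cache.append((model.encode(query, convert_to_tensor=True), response))
```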

For example:

  • Context 1: “I’ve started to face the BENEFIT of Cosine Similarity as I delve into Query Caching. My efforts to reduce the expense of accessing the LLM API using prompt caching were hindered by this issue. In general, systems like RAG or any vector-based system use caching to bypass the usual retrieval process or LLM pass by searching for a similar query that has been submitted before. If the cosine similarity between the new query and the cached query (with the previous response) is above 0.90, then we retrieve the LLM response of the matching query and present it as the response. Here, caching aids in avoiding another LLM API call.”
  • Context 2: “I’ve started to face the DRAWBACK of Cosine Similarity as I delve into Query Caching. My efforts to reduce the expense of accessing the LLM API using prompt caching were hindered by this issue. In general, systems like RAG or any vector-based system use caching to bypass the usual retrieval process or LLM pass by searching for a similar query that has been submitted before. If the cosine similarity between the new query and the cached query (with the previous response) is above 0.90, then we retrieve the LLM response of the matching query and present it as the response. Here, caching aids in avoiding another LLM API call.”

The cosine similarity of Context 1 and Context 2 is 0.985. So, if we return the same cached response for both cases, we defeat the purpose of using the LLM. As far as I can see, relying on cosine similarity alone poses a problem for query/prompt caching.
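You can check this yourself with a sketch like the one below, again assuming a sentence-transformers model; the exact score depends on the embedder (0.985 was measured with the setup I used), but with any general-purpose embedding model the two contexts land well above a 0.90 caching threshold even though their meanings are opposed.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model; exact score varies by embedder

# Shared portion of the two contexts, truncated here for brevity (use the full texts above).
shared = (
    " My efforts to reduce the expense of accessing the LLM API using prompt caching "
    "were hindered by this issue."
)
context_1 = "I've started to face the BENEFIT of Cosine Similarity as I delve into Query Caching." + shared
context_2 = "I've started to face the DRAWBACK of Cosine Similarity as I delve into Query Caching." + shared

emb_1, emb_2 = model.encode([context_1, context_2], convert_to_tensor=True)
print(util.cos_sim(emb_1, emb_2).item())  # high similarity despite the opposite meaning
```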

Burgeoning data science researcher working in the healthcare industry.