Cosine Similarity: boon or bane in Retrieval and LLM prompt caching?

Karthik Ravichandran
Published in CodeX
2 min read · Mar 31, 2024

I’ve started to face challenges with Cosine Similarity as I delve into Query Caching. My attempts to lower the cost of calling the LLM API through prompt caching were hampered by this issue. In general, systems like RAG, or any vector-based system, use caching to bypass the usual retrieval process or LLM pass by searching for a similar query that has been submitted before. If the cosine similarity between the new query and a cached query (with its previous response) exceeds 0.90, we return the cached LLM response as the answer. Here, caching helps avoid another LLM API call. The problem, however, is that changing even a single word (for example, swapping in its opposite) within a lengthy sentence can invert the entire meaning, and cosine similarity fails to capture this.
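To make the scheme concrete, here is a minimal sketch of similarity-based prompt caching, assuming a sentence-transformers embedding model ("all-MiniLM-L6-v2" is my arbitrary choice) and a simple in-memory list as the cache; a real system would use a vector store and a tuned threshold. The opposite-word example below shows exactly where this scheme breaks down.

```python
from sentence_transformers import SentenceTransformer, util

# Assumed embedding model; any sentence embedder would work similarly.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical in-memory cache: list of (query_embedding, cached_llm_response) pairs.
cache = []

def lookup(query, threshold=0.90):
    """Return a cached LLM response if a 'similar enough' query was seen before, else None."""
    q_emb = model.encode(query, convert_to_tensor=True)
    for emb, response in cache:
        if util.cos_sim(q_emb, emb).item() >= threshold:
            return response  # cache hit: skip the LLM API call
    return None  # cache miss: caller calls the LLM and stores the result

def store(query, response):
    """Save a new query/response pair so future similar queries can reuse it."""
    cache.append((model.encode(query, convert_to_tensor=True), response))
```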

For example:

  • Context 1: “I’ve started to face the BENEFIT of Cosine Similarity as I delve into Query Caching. My efforts to reduce the expense of accessing the LLM API using prompt caching were hindered by this issue. In general, systems like RAG or any vector-based system use caching to bypass the usual retrieval process or LLM pass by searching for a similar query that has been submitted before. If the cosine similarity between the new query and the cached query (with the previous response) is above 0.90, then we retrieve the LLM response of the matching query and present it as the response. Here, caching aids in avoiding another LLM API call.”
  • Context 2: “I’ve started to face the DRAWBACK of Cosine Similarity as I delve into Query Caching. My efforts to reduce the expense of accessing the LLM API using prompt caching were hindered by this issue. In general, systems like RAG or any vector-based system use caching to bypass the usual retrieval process or LLM pass by searching for a similar query that has been submitted before. If the cosine similarity between the new query and the cached query (with the previous response) is above 0.90, then we retrieve the LLM response of the matching query and present it as the response. Here, caching aids in avoiding another LLM API call.”

The cosine similarity of Context 1 and Context 2 is 0.985. So, if we return the same cached response for both cases, we defeat the purpose of using the LLM. As far as I can see, relying on cosine similarity alone poses a problem for query/prompt caching.
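You can check this yourself with a sketch like the one below, again assuming a sentence-transformers model; the exact score depends on the embedder (0.985 was measured with the setup I used), but with any general-purpose embedding model the two contexts land well above a 0.90 caching threshold even though their meanings are opposed.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model; exact score varies by embedder

# Shared portion of the two contexts, truncated here for brevity (use the full texts above).
shared = (
    " My efforts to reduce the expense of accessing the LLM API using prompt caching "
    "were hindered by this issue."
)
context_1 = "I've started to face the BENEFIT of Cosine Similarity as I delve into Query Caching." + shared
context_2 = "I've started to face the DRAWBACK of Cosine Similarity as I delve into Query Caching." + shared

emb_1, emb_2 = model.encode([context_1, context_2], convert_to_tensor=True)
print(util.cos_sim(emb_1, emb_2).item())  # high similarity despite the opposite meaning
```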

Burgeoning data science researcher working in the healthcare industry.