Save 70% of OpenAI Costs by Caching Results

Anshuman Tanwar
3 min read · Apr 23, 2023


Asking a semantically identical question over your data will generate a similar answer, but you still pay the full bill. This not only increases your costs but also adds latency to the system (and today's customers have very little patience).

Platforms used

  1. OpenAI
  2. Langchain
  3. Pinecone

Langchain?

Langchain is a wrapper around multiple platforms in the LLM ecosystem. For example, if you want to switch your vector DB from A to B, you can do it by changing a single line; the rest of the heavy lifting, such as data upload and search, is taken care of by Langchain.

Example

For Pinecone

from langchain.vectorstores import Pinecone

db = Pinecone.from_existing_index(index_name="reviews", embedding=embed, namespace=NAMESPACE)

For Deeplake

from langchain.vectorstores import DeepLake

db = DeepLake.from_documents(texts, embeddings, dataset_path="hub://yourhubname/indexname")
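Whichever backend db points to, the calls you make afterwards stay the same. A minimal sketch (the query text below is just an illustration):

# Same call works for the Pinecone and DeepLake instances created above
docs = db.similarity_search("battery life of this laptop", k=3)
for doc in docs:
    print(doc.page_content)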

Problem statement

OpenAI is cool, but imagine you have thousands of daily users; then it can get costly.

Solution

Save your FAQs separately. If we save FAQs (better to use a cloud vector DB so that the FAQs can be reused by other users too), then we do not need to call OpenAI to generate answers for questions we have already answered.


How?

What is a standalone question? We all know a conversation has very little meaning if we do not know the context. LLM platforms use the concept of tokens, which have a certain limit. To use tokens effectively, we cannot send all of the information in a single go. Over time, not only does the conversation get longer, but its context also changes. For example, your first question could be about a computer's RAM, but after 10 questions you might be talking about its processor. Langchain takes care of this by generating a standalone question that combines your current question with the recent chat history (using a summarization technique). You can go through the basics here: https://towardsdatascience.com/4-ways-of-question-answering-in-langchain-188c6707cc5a
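Here is a minimal sketch of that flow with ConversationalRetrievalChain, assuming the Langchain APIs available at the time of writing (db is the vector store from the earlier example and the questions are just illustrations):

from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain

llm = ChatOpenAI(temperature=0)
qa = ConversationalRetrievalChain.from_llm(llm, retriever=db.as_retriever())

chat_history = []
first = qa({"question": "How much RAM does this laptop have?", "chat_history": chat_history})
chat_history.append(("How much RAM does this laptop have?", first["answer"]))

# The follow-up is combined with the history above into a standalone question
# (roughly "Which processor does this laptop have?") before retrieval.
second = qa({"question": "And which processor?", "chat_history": chat_history})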

Context is important!!!

FAQ caching: if asking the same question will not change the answer, then why search again?

Save your questions along with their answers somewhere. But different questions can have the same meaning (like "hello", "how are you", "how you doing"). In such cases, use a vector DB (like Pinecone or Deeplake) so that semantically similar questions still hit the same cached answer.
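A simplified sketch of this lookup-then-generate flow (illustrative only, not the exact code from the repo linked at the end; the index name, score threshold and metadata key are assumptions, and embed / qa are the embedding model and chain from the earlier snippets):

from langchain.vectorstores import Pinecone

# Separate index that acts as the shared FAQ cache
faq_cache = Pinecone.from_existing_index(index_name="faq-cache", embedding=embed)

def answer_with_cache(question, chat_history, threshold=0.9):
    # 1. Look for a semantically similar question in the FAQ cache
    matches = faq_cache.similarity_search_with_score(question, k=1)
    if matches:
        doc, score = matches[0]
        if score >= threshold:  # with cosine similarity, higher means closer; adjust for your index metric
            return doc.metadata["answer"]  # cache hit: no OpenAI call needed
    # 2. Cache miss: generate the answer, then store it for the next user
    result = qa({"question": question, "chat_history": chat_history})
    answer = result["answer"]
    faq_cache.add_texts([question], metadatas=[{"answer": answer}])
    return answer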

Work smart, not hard

Results

Cost reduced from $0.004 to $0.00033.

Response time reduced from 24 seconds to 5.5 seconds.

The answer remained the same.

As you can see, the first time we did not find any suitable match in the FAQ cache, so we generated the answer and uploaded it to the FAQ cache. The next time, there was no need to upload data to the FAQ cache, as we got a match.

NOTE: At present Langchain supports various types of cache (https://python.langchain.com/en/latest/modules/models/llms/examples/llm_caching.html), but there is no cache option available for ConversationalRetrievalChain (which is used for the OpenAI chat APIs). For this, we made changes to the original code. You can download the updated repo from https://github.com/anshumantanwar/langchain.
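For reference, the built-in cache from that documentation page looks like the snippet below; it caches raw LLM calls keyed on the exact prompt string, which is why it does not help ConversationalRetrievalChain out of the box:

import langchain
from langchain.cache import InMemoryCache

# Caches LLM completions by exact prompt string, not by semantic similarity
langchain.llm_cache = InMemoryCache()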

If GPTCache has similar functionality, then why did we not use it here? We can discuss this in the next article.
