[Course Notes] LangChain for LLM Application Development: Part 6

Chanan
Mar 8, 2024

LangChain for LLM Application Development: Evaluation

LangChain for LLM Application Development by DeepLearning.AI

Table of Contents

  1. Introduction
  2. Models, Prompts, and Parsers
  3. Memory
  4. Chains
  5. Question and Answer
  6. Evaluation (This part)
  7. Agents

Evaluation

When building a complex application with an LLM, one important but sometimes tricky step is evaluation: how well is your application actually doing? Is it meeting your accuracy criteria? And if you change your implementation, maybe swap in a different LLM, change how you use the vector database to retrieve chunks, or adjust some other parameter of your system, how do you know whether you are making it better or worse?

from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import CSVLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.vectorstores import DocArrayInMemorySearch

# Load the outdoor clothing catalog used throughout this course
file = "OutdoorClothingCatalog_1000.csv"
loader = CSVLoader(file_path=file)
data = loader.load()

# Build an in-memory vector store index over the documents
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch
).from_loaders([loader])

# Q&A chain that "stuffs" the retrieved documents into the prompt
llm = ChatOpenAI(temperature=0.0)
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=index.vectorstore.as_retriever(),
    verbose=True,
    chain_type_kwargs={
        "document_separator": "<<<<>>>>>"
    },
)

Hard-coded examples

One way to build an evaluation set is to read a few of the documents yourself and write query/answer pairs by hand:

examples = [
    {
        "query": "Do the Cozy Comfort Pullover Set "
                 "have side pockets?",
        "answer": "Yes"
    },
    {
        "query": "What collection is the Ultra-Lofty "
                 "850 Stretch Down Hooded Jacket from?",
        "answer": "The DownTek collection"
    }
]

LLM-Generated examples

Hand-writing examples does not scale well, so you can also use QAGenerateChain to have an LLM generate query/answer pairs directly from the documents:

from langchain.evaluation.qa import QAGenerateChain

example_gen_chain = QAGenerateChain.from_llm(ChatOpenAI())
new_examples = example_gen_chain.apply_and_parse(
    [{"doc": t} for t in data[:5]]
)
new_examples[0]

'''
output:
{'query': "What is the approximate weight of the Women's Campside Oxfords per pair?",
'answer': "The approximate weight of the Women's Campside Oxfords per pair is 1 lb. 1 oz."}
'''

data[0]
'''output: Document(page_content=": 0\nname: Women's Campside Oxfords\ndescription: This ultracomfortable lace-to-toe Oxford boasts a super-soft canvas, thick cushioning, and quality construction for a broken-in feel from the first time you put them on. \n\nSize & Fit: Order regular shoe size. For half sizes not offered, order up to next whole size. \n\nSpecs: Approx. weight: 1 lb.1 oz. per pair. \n\nConstruction: Soft canvas material for a broken-in feel and look. Comfortable EVA innersole with Cleansport NXT® antimicrobial odor control. Vintage hunt, fish and camping motif on innersole. Moderate arch contour of innersole. EVA foam midsole for cushioning and support. Chain-tread-inspired molded rubber outsole with modified chain-tread pattern. Imported. \n\nQuestions? Please contact us for any inquiries.", metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 0})'''
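
Depending on your LangChain version, apply_and_parse may wrap each generated pair in a "qa_pairs" key instead of returning the flat dict shown above. A minimal sketch to flatten the generated examples before combining them with the hand-written ones:

# Flatten {'qa_pairs': {...}} items into plain {'query': ..., 'answer': ...} dicts;
# items that are already flat are passed through unchanged.
new_examples = [
    ex["qa_pairs"] if "qa_pairs" in ex else ex
    for ex in new_examples
]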

# Combine Examples
examples += new_examples
qa.run(examples[0]["query"])

Manual Evaluation

With langchain.debug turned on, running the chain prints the full execution trace, including the retrieved context and the exact prompt sent to the LLM, so you can inspect each step by hand:

import langchain
langchain.debug = True
qa.run(examples[0]["query"])

LLM-Assisted Evaluation

Reading every trace by hand does not scale, so you can instead ask an LLM to grade each predicted answer against the expected answer using QAEvalChain:

# Turn off debug mode
langchain.debug = False

from langchain.evaluation.qa import QAEvalChain

# Run the Q&A chain on all examples to collect predictions
predictions = qa.apply(examples)

# Use an LLM to grade each prediction against the expected answer
llm = ChatOpenAI(temperature=0)
eval_chain = QAEvalChain.from_llm(llm)
graded_outputs = eval_chain.evaluate(examples, predictions)

for i, eg in enumerate(examples):
    print(f"Example {i}:")
    print("Question: " + predictions[i]["query"])
    print("Real Answer: " + predictions[i]["answer"])
    print("Predicted Answer: " + predictions[i]["result"])
    print("Predicted Grade: " + graded_outputs[i]["text"])
    print()
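
To get an aggregate score instead of reading each grade, you can tally the grades. A minimal sketch, assuming each entry in graded_outputs stores its grade as the string "CORRECT" or "INCORRECT" under the "text" key (as in the loop above):

# Count how many predictions the evaluator LLM graded as correct
correct = sum(
    1 for g in graded_outputs
    if g["text"].strip().upper().startswith("CORRECT")
)
print(f"Accuracy: {correct}/{len(graded_outputs)} "
      f"({correct / len(graded_outputs):.0%})")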

That’s it for today! Stay tuned for the next part😊
