Creating a Web Research Chatbot Using LangChain and OpenAI

Shubhamsantoki
Simform Engineering
4 min read · Oct 27, 2023

Learn how to create a chatbot to streamline your research process

What is LangChain?

LangChain is a framework for building applications with large language models (LLMs). Think of it as a developer's toolbox for working with LLMs. It can be used for various purposes, such as analyzing and summarizing documents, creating chatbots, and analyzing structured data.

LangChain Use Cases

LangChain offers a wide variety of use cases, such as:

- Q&A over documents
- Analyzing structured data
- Interacting with APIs
- Code understanding
- Agent simulations
- Autonomous agents
- Chatbots
- Code generation
- Data extraction
- Graph data analysis
- Multi-modal outputs
- Self-checking
- Document summarization
- Content tagging

How Does the LangChain Web Researcher Chatbot Work?

The retrieval process of a LangChain web researcher is a multi-step approach to gathering and presenting relevant information from Google search results. Here’s how it operates:

Step 1: Generate Comprehensive Queries

At the core of the LangChain Web Researcher is a single call to a large language model (LLM). This call generates multiple search queries, ensuring a wide range of query terms that are key to obtaining comprehensive results.

Step 2: Execute Queries in Parallel

Once the queries are generated, they are executed in parallel. Running multiple search queries simultaneously accelerates data collection and saves valuable time.
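The parallel execution step can be sketched with Python's standard `concurrent.futures` module. The `run_search` function here is a self-contained placeholder, not the real search call (that comes later via the Google API wrapper):

```python
from concurrent.futures import ThreadPoolExecutor

def run_search(query):
    # Placeholder for a real search call; it just echoes the query
    # so the sketch runs without any API keys.
    return f"results for: {query}"

queries = ["llm agents overview", "llm agents planning", "llm agents memory"]

# Execute all queries concurrently; map preserves the original order.
with ThreadPoolExecutor(max_workers=len(queries)) as pool:
    results = list(pool.map(run_search, queries))

print(results)
```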

Step 3: Identify Top Links

As the queries return results, the system’s algorithms identify the top K links for each query. This prioritization ensures that the most promising and relevant sources of information are given precedence.
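The top-K selection amounts to ranking each query's result links by score and keeping the first K. A minimal sketch, with hypothetical URLs and scores standing in for real search-API rankings:

```python
# Hypothetical scored links for one query; in practice the scores
# come from the search engine's own ranking.
links = [
    ("https://a.example", 0.92),
    ("https://b.example", 0.55),
    ("https://c.example", 0.81),
]

K = 2  # keep the top 2 links per query
top_k = [url for url, score in sorted(links, key=lambda x: x[1], reverse=True)[:K]]
print(top_k)
```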

Step 4: Extract Content Swiftly

In parallel with the query execution, the LangChain Web Researcher selects webpages based on the identified links and rapidly scrapes their content.

Step 5: Index for Retrieval

The accumulated data from the web pages is then meticulously indexed into a dedicated vector store. This indexing process is crucial for efficient retrieval and comparison of information in subsequent steps.

Step 6: Match Queries to Relevant Documents

Lastly, the LangChain Web Researcher matches the original generated search queries with the most relevant documents stored in the vector store. This ensures that users are presented with accurate and contextually appropriate results.

Simple Usage Example

Vectorstore

A common method for handling unstructured data involves embedding it into vectors and storing these embeddings. When querying, the system embeds the query and retrieves the most similar vectors from the stored data. This approach simplifies data management, and a vector store handles the storage and retrieval of these embedded data.
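The embed-store-retrieve loop described above can be sketched in plain Python. The three-element vectors are toy embeddings (a real embedding model produces much larger vectors), and cosine similarity stands in for a vector store's search:

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy embeddings standing in for vectors produced by an embedding model.
store = {
    "LangChain is a framework for LLM apps": [0.9, 0.1, 0.0],
    "Chroma is an in-memory vector database": [0.1, 0.9, 0.1],
}

query_vector = [0.8, 0.2, 0.0]  # pretend embedding of "What is LangChain?"

# Retrieve the stored text whose vector is most similar to the query.
best = max(store, key=lambda text: cosine(query_vector, store[text]))
print(best)
```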

We use chromadb, an in-memory vector database.

from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings


vectorstore = Chroma(embedding_function=OpenAIEmbeddings(), persist_directory="./chroma_db_oai")

LLM

To initiate the language model, we use OpenAI’s GPT-3.5 Turbo, designed for natural language processing. We set the model name to “gpt-3.5-turbo-16k” with a 16,000 token limit. The temperature parameter is set to 0 for deterministic responses, with streaming enabled for real-time processing.

Make sure to replace "your_api_key" with your OpenAI API key.

from langchain.chat_models.openai import ChatOpenAI


llm = ChatOpenAI(model_name="gpt-3.5-turbo-16k", temperature=0, streaming=True, openai_api_key="your_api_key")

Memory

In many LLM applications, a conversational interface is essential. One crucial aspect of effective conversation is the ability to reference information introduced earlier in the discussion. At its core, a conversational system should be able to access a certain window of past messages directly.

LangChain provides a lot of utilities to handle the memory for LLMs based on different use cases.
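The core idea behind a message window can be illustrated with a fixed-size buffer. This is a hypothetical stand-in to show the concept, not LangChain's actual memory API (which is used below):

```python
from collections import deque

# A minimal fixed-size message window: once full, appending a new
# message evicts the oldest one.
window = deque(maxlen=4)

for msg in ["Hi", "Hello!", "What is LangChain?", "A framework for LLM apps.", "Thanks"]:
    window.append(msg)

print(list(window))  # the oldest message ("Hi") has been evicted
```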

from langchain.memory import ConversationSummaryBufferMemory

# memory for the retriever
memory = ConversationSummaryBufferMemory(llm=llm, input_key='question', output_key='answer', return_messages=True)

Search

We use the Google Search API to programmatically retrieve information.

Set up the proper API keys and environment variables by creating the GOOGLE_API_KEY in the Google Cloud credential console and a GOOGLE_CSE_ID using the Programmable Search Engine.

Next, follow the instructions described here.

import os
from langchain.utilities import GoogleSearchAPIWrapper

os.environ["GOOGLE_CSE_ID"] = "your_cse_id"
os.environ["GOOGLE_API_KEY"] = "your_google_api_key"
search = GoogleSearchAPIWrapper()

Initialize

We integrate WebResearchRetriever from LangChain as a knowledge source for the LLM.

from langchain.retrievers.web_research import WebResearchRetriever


web_research_retriever = WebResearchRetriever.from_llm(
    vectorstore=vectorstore,
    llm=llm,
    search=search,
)

Run with Citations

RetrievalQAWithSourcesChain retrieves documents and provides citations.

from langchain.chains import RetrievalQAWithSourcesChain


user_input = "How do LLM Powered Autonomous Agents work?"
qa_chain = RetrievalQAWithSourcesChain.from_chain_type(llm, retriever=web_research_retriever)
result = qa_chain({"question": user_input})

# The result contains both the answer and the source URLs used to generate it
print(result["answer"])
print(result["sources"])

Complete Implementation Using Streamlit

A complete web application implementing the web research chatbot can be found here.

Web Researcher Chatbot Use Cases

  1. Accurate Information Retrieval: Ensure precise and up-to-date web information to reduce errors.
  2. Citation-Backed Answers: Provide answers with source citations for transparency and credibility.
  3. Stock Analysis Tool: Analyze financial data and market trends to aid informed investment decisions.
  4. Automated Content Aggregation: Save time by collecting and summarizing relevant web content.
  5. Legal and Compliance Research: Assist in legal research and compliance with changing laws.
  6. Healthcare Insights: Keep healthcare professionals updated with the latest medical information.
  7. Educational Assistant: Support students and educators with research and study materials.

These use cases showcase the versatility and time-saving benefits of a GPT-powered web researcher across various fields.

For more such insights and updates, stay tuned to the Simform Engineering blog.

Follow us: Twitter | LinkedIn
