Querying a Wikipedia page with Langchain and Streamlit

Shashank Vats
4 min read · May 1, 2023



Introduction

Wikipedia is one of the most extensive knowledge bases available online, with millions of articles on almost every topic imaginable. However, finding the right information on Wikipedia can be a daunting task, especially when searching for obscure or specific topics. Traditional keyword-based search engines often yield unsatisfactory results, leaving researchers and students frustrated and overwhelmed. But what if there was a way to streamline the process and make Wikipedia research more efficient? Enter LLMs — the latest breakthrough in natural language processing technology that is revolutionizing the way we search for information.

LLMs, or Large Language Models, are deep learning models trained on vast amounts of natural language data to understand and generate human-like text. They are trained largely through self-supervised learning, which enables them to pick up patterns in a language without being explicitly programmed for each task.

Due to their huge impact, new developer tools are emerging everywhere under the umbrella term LLMOps. LangChain is one such tool.

What is Langchain?

LangChain is a powerful open-source framework for developing applications powered by language models. It connects to the AI models you want to use, such as OpenAI or Hugging Face, and links them with outside sources, such as Google Drive, Notion, Wikipedia, or even your Apify Actors. That means you can chain commands together so the AI model can know what it needs to do to produce the answers or perform the tasks you require.
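
To give a feel for what this looks like in code, here is a minimal sketch of calling a chat model through LangChain (it uses the same langchain imports as the walkthrough below and assumes an OPENAI_API_KEY is available in your environment):

from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage, SystemMessage

# minimal sketch: one system prompt plus one user question
chat = ChatOpenAI(temperature=0)  # picks up OPENAI_API_KEY from the environment
messages = [
    SystemMessage(content="You answer questions using the context you are given."),
    HumanMessage(content="What is LangChain?"),
]
print(chat(messages).content)  # chat(messages) returns an AIMessage; .content is its text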

Code Walkthrough

The code is written in Python and uses the LangChain framework to work with the LLM and the Streamlit framework to build an interactive web application.

Importing Libraries

We’ll begin with importing the required libraries for our chatbot.

import requests
import wikipedia
from bs4 import BeautifulSoup

import os
import time
import pickle
import streamlit as st
from datetime import datetime
from streamlit_chat import message

from langchain.vectorstores import FAISS
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.chat_models import ChatOpenAI
from langchain.schema import AIMessage, HumanMessage, SystemMessage

from wiki_content import get_wiki

Scraping Wikipedia

The first step in building our chatbot is to access Wikipedia articles and extract content. The get_wiki function takes a search term and returns the full-page content and a summary of the Wikipedia article. The wikipedia.summary method retrieves the summary, while the requests module is used to access the article's URL. The BeautifulSoup module is used to parse the HTML content of the page, and the content_div.find_all('p') line extracts the text from the paragraphs on the page.

def get_wiki(search):
    # set language to English (default is auto-detect)
    lang = "en"

    # fetch a short summary from Wikipedia
    summary = wikipedia.summary(search, sentences=5)

    # scrape the Wikipedia page for the requested query

    # create URL based on user input and language
    url = f"https://{lang}.wikipedia.org/wiki/{search}"

    # send GET request to URL and parse HTML content
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    # extract main content of page
    content_div = soup.find(id="mw-content-text")

    # extract all paragraphs of content
    paras = content_div.find_all('p')

    # concatenate paragraphs into full page content
    full_page_content = ""
    for para in paras:
        full_page_content += para.text

    # return the full page content and the summary
    return full_page_content, summary
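
For example, a quick (hypothetical) call with "Alan Turing" as the search term would look like this:

# hypothetical usage: "Alan Turing" is just an example search term
full_page_content, summary = get_wiki("Alan Turing")
print(summary)                 # the 5-sentence summary
print(len(full_page_content))  # length of the scraped article text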

Setting Up User Interface

Next, we set up the user interface using Streamlit. We start by creating a title:

st.markdown("<h1 style='text-align: center; color: Red;'>Chat-Wiki</h1>", unsafe_allow_html=True)

This creates a large, centered title for the chatbot.

Next, we create a text input box where the user can enter their OpenAI API key, which is required to use the bot:

buff, col, buff2 = st.columns([1,3,1])
openai_key = col.text_input('OpenAI Key:')
os.environ["OPENAI_API_KEY"] = openai_key

Once the user enters their OpenAI key, we initialize the chat model and ask them to enter a search query, which is used to scrape the corresponding Wikipedia page. The get_wiki() function returns the scraped page content and a summary of the searched query. If it returns content, the Q&A field is activated and the user can ask questions.

if len(openai_key):
    chat = ChatOpenAI(temperature=0, openai_api_key=openai_key)
    search = st.text_input("What's on your mind?")
    if len(search):
        wiki_content, summary = get_wiki(search)
        if len(wiki_content):
            st.write(summary)
            user_query = st.text_input("You: ", "", key="input")
            send_button = st.button("Send")

Now, we build an index over the scraped content using FAISS:

def build_index(wiki_content):
    # split the page content into overlapping chunks
    text_splitter = CharacterTextSplitter(
        separator="\n",
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len,
    )
    texts = text_splitter.split_text(wiki_content)

    # embed each chunk and build a FAISS index over the embeddings
    embeddings = OpenAIEmbeddings()
    docsearch = FAISS.from_texts(texts, embeddings)

    # persist the index for later reuse
    with open("./embeddings.pkl", 'wb') as f:
        pickle.dump(docsearch, f)

    return embeddings, docsearch

embeddings, docsearch = build_index(wiki_content)
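
Because the index is pickled to ./embeddings.pkl, one optional variation (not part of the original code, and the helper name load_or_build_index is my own) is to reload a previously saved index instead of re-embedding the same page on every run:

# optional sketch: reuse a previously pickled index if one exists
# note: the pickle always holds whichever page was indexed last
def load_or_build_index(wiki_content, path="./embeddings.pkl"):
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    _, docsearch = build_index(wiki_content)
    return docsearch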

After the index is created, we can answer the user's queries.

def get_bot_response(user_query, faiss_index):
    # retrieve the chunks most similar to the user's question
    docs = faiss_index.similarity_search(user_query, k=6)

    # prepend the question to the retrieved context
    main_content = user_query + "\n\n"
    for doc in docs:
        main_content += doc.page_content + "\n\n"

    # send question + context to the chat model, but keep only the plain
    # question and the answer in the conversation history
    messages.append(HumanMessage(content=main_content))
    ai_response = chat(messages).content
    messages.pop()
    messages.append(HumanMessage(content=user_query))
    messages.append(AIMessage(content=ai_response))

    return ai_response
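
The messages list used above isn't shown in the snippet; one plausible way to wire everything together (my own sketch of the remaining glue code, not necessarily the exact code in the repository) is to seed it with a SystemMessage and render each turn with the message widget from streamlit_chat:

# assumed wiring: seed the conversation and hook up the Send button
messages = [
    SystemMessage(content="You are a helpful assistant answering questions about the scraped Wikipedia page.")
]

if send_button and len(user_query):
    bot_response = get_bot_response(user_query, docsearch)
    message(user_query, is_user=True)  # streamlit_chat widget for the user's turn
    message(bot_response)              # and for the bot's reply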

And there you have it! Your very own friendly bot that answers your queries about Wikipedia articles.

You can check my GitHub repository for the full implementation.
