LLMs

Chat With a Website

Using Langchain + OpenAI + Chroma to chat with a blog

Benedict Neo
bitgrit Data Science Publication


https://bneo.xyz

I’ve been writing every day on my blog for a year now.

And I’ve stolen and captured a lot of good ideas.

Since we have these magical models now that can ingest a bunch of text, understand it, and generate it back to us (one token at a time, picking whichever has the highest softmax probability),

I decided to use LLMs to read my blog posts so that I can ask questions about them.

This article is a short tutorial about this use case, and you can apply it to any other blog or website.

Without further ado,

you can find the code on Deepnote and GitHub.

#1 Load

First we have to scrape the text.

We use requests to get the text, and Beautiful Soup to gather the content.

For my blog, the post links live in <a> tags, so I use find_all() to filter for them.

Your website might have a different HTML structure, so you might need to spend some time getting BeautifulSoup to extract the right elements. Pro tip: use ChatGPT to help you here.

Then, we use a list comprehension to gather all the links that point to my posts.
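
Roughly, the scraping looks like this (the "/posts/" path filter below is an assumption for illustration; inspect your own site's URLs):

```python
import requests
from bs4 import BeautifulSoup

# Fetch the blog's index page (swap in your own site)
index_url = "https://bneo.xyz"
html = requests.get(index_url).text
soup = BeautifulSoup(html, "html.parser")

# On this blog the post links live in <a> tags
links = soup.find_all("a")

# Keep only the links that point to posts
# (the "/posts/" prefix is an assumption; check your own URLs)
post_urls = [
    index_url + a["href"]
    for a in links
    if a.get("href", "").startswith("/posts/")
]
```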

Let’s give them to LangChain.

The LangChain community has a WebBaseLoader that reads web URLs using urllib and parses them into text with BeautifulSoup.

We customize the parsing using the bs_kwargs parameter (which is passed through to BeautifulSoup), giving it a SoupStrainer that filters for <main> tags, which contain the title and content of my blog posts.

Here we’re creating a Document, an object with some page_content (str) and metadata (dict).
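
Putting that together, the loader looks something like this:

```python
import bs4
from langchain_community.document_loaders import WebBaseLoader

# Only parse the <main> tag of each page, which holds the post title and body
loader = WebBaseLoader(
    web_paths=post_urls,
    bs_kwargs={"parse_only": bs4.SoupStrainer("main")},
)

docs = loader.load()  # a list of Documents (page_content + metadata)
```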

Let’s look at the document content using page_content

All my blog posts combined have over 900k characters in total.
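
A quick way to peek at one document and count the characters across all of them:

```python
# Preview the first document's text
print(docs[0].page_content[:500])

# Total characters across all posts
print(sum(len(doc.page_content) for doc in docs))
```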

Most LLMs do not have a context window that large (yet), so we’ll have to split the text into chunks (think of them as smaller, manageable units of information).

#2 Split

Here we use RecursiveCharacterTextSplitter, which recursively splits a document using a common separator like new lines (\n) until each chunk is the appropriate size.

Our chunk size is 1000, which means each chunk is at most 1000 characters long. Let’s check it ourselves.

We set 200 characters of overlap between chunks, which helps avoid losing important context when a statement gets split across two chunks.

We also set add_start_index=True to record where each chunk starts within the original document.

Each split also has metadata, which tells us which blog post it came from and the start index of the chunk.
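
In code, the splitting step looks like this:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,       # at most ~1000 characters per chunk
    chunk_overlap=200,     # 200 characters shared between neighbouring chunks
    add_start_index=True,  # record where each chunk starts in the source document
)
all_splits = text_splitter.split_documents(docs)

print(len(all_splits))                  # number of chunks
print(len(all_splits[0].page_content))  # should be <= 1000
print(all_splits[0].metadata)           # source URL + start_index
```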

You can also experiment with context-based splitting, e.g. for markdown files, creating chunks within specific header groups using MarkdownHeaderTextSplitter.

Now it’s time to store these chunks.

#3 Store

We have 1376 text chunks in total and now we want to search over them.

The common process is as follows:

  1. Embed the contents of each document split (split embeddings)
  2. Insert them into a vector database
  3. Take a new query and embed it (query embedding)
  4. Perform a similarity search to identify the split embeddings closest to our query embedding, using cosine similarity

We embed and store all of our chunks using Chroma and the OpenAIEmbeddings model.
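
Something like this (you'll need an OPENAI_API_KEY in your environment):

```python
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

# Embed every chunk with OpenAI's embedding model and index them in Chroma
vectorstore = Chroma.from_documents(
    documents=all_splits,
    embedding=OpenAIEmbeddings(),
)
```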

Feel free to try out other vector stores.

Now it’s time to implement the process we talked about earlier.

#4 Retrieve

LangChain provides us with the Retriever, an interface that wraps an index to return relevant Documents given a query.

Here we’re using the most common retriever: VectorStoreRetriever.

Using .as_retriever() we can easily turn a VectorStore into a Retriever.

You can invoke the retriever with a question.

It’ll embed the question.

And retrieve the most relevant documents for it.
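
In code, that looks roughly like this (k=6 is an arbitrary choice for illustration; tune it to taste):

```python
# Turn the vector store into a retriever that returns the top-k most similar chunks
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 6})

retrieved_docs = retriever.invoke("What does the author think about communities?")
print(len(retrieved_docs))
print(retrieved_docs[0].page_content)
```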

Now we don’t only want the documents, we want an actual response.

So let’s implement the prompt and LLM part.

#5 Generate

First we define the prompt. I like Paul Graham’s writing so I asked it to write in his style.
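
The exact wording is in the notebook; an illustrative prompt along the same lines:

```python
from langchain_core.prompts import ChatPromptTemplate

# Illustrative prompt; the notebook has the actual one
template = """You are answering questions about my blog posts.
Use the following pieces of retrieved context to answer the question.
If you don't know the answer, just say that you don't know.
Write the answer in the style of Paul Graham's essays.

Question: {question}

Context: {context}

Answer:"""

prompt = ChatPromptTemplate.from_template(template)
```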

Now we define a chain using the LCEL Runnable protocol that does the following:

  1. Takes a question
  2. Retrieves relevant documents
  3. Constructs our prompt
  4. Passes that to the model ChatOpenAI
  5. Parses the output using StrOutputParser

The | is LCEL’s pipe operator: it passes the output of the previous step as the input of the next. So func1 | func2 means the output of func1 is passed to func2.
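
Putting the chain together (the model name here is an assumption; use whichever ChatOpenAI model you prefer):

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo")  # model choice is an assumption

def format_docs(docs):
    # Join the retrieved chunks into one context string
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

print(rag_chain.invoke("What are some useful macOS apps?"))
```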

We can call it using .invoke like before, or we could also stream it.
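
Streaming looks like this:

```python
# Print tokens as they are generated instead of waiting for the full answer
for chunk in rag_chain.stream("What's your advice for people in their 20s?"):
    print(chunk, end="", flush=True)
```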

Now, what if I want to know which blog posts the LLM got its answers from?

Let’s add sources.

#5.1 Adding sources

We’ll need to import RunnableParallel from LangChain.

All it does is run multiple tasks in parallel and then make sure the outputs are processed and formatted correctly so they can be combined into a single output.
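
Following the pattern from the LangChain docs, the chain with sources looks roughly like this:

```python
from langchain_core.runnables import RunnableParallel, RunnablePassthrough

# Same generation chain as before, but it now reads the retrieved
# documents out of the "context" key instead of receiving them directly
rag_chain_from_docs = (
    RunnablePassthrough.assign(context=(lambda x: format_docs(x["context"])))
    | prompt
    | llm
    | StrOutputParser()
)

# Run retrieval and the question passthrough in parallel, then attach the
# generated answer while keeping the retrieved documents in the output
rag_chain_with_source = RunnableParallel(
    {"context": retriever, "question": RunnablePassthrough()}
).assign(answer=rag_chain_from_docs)

result = rag_chain_with_source.invoke("What are some useful macOS apps?")
# result is a dict with "context" (the source Documents), "question", and "answer"
```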

This output isn’t very friendly, so I wrote a helper function to format it into markdown.
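
The notebook has the real helper; a hypothetical sketch of such a formatter:

```python
def format_output(result):
    # Hypothetical helper: render the answer plus its sources as markdown
    sources = {doc.metadata.get("source", "unknown") for doc in result["context"]}
    md = f"### Question\n{result['question']}\n\n"
    md += f"### Answer\n{result['answer']}\n\n"
    md += "### Sources\n" + "\n".join(f"- {src}" for src in sorted(sources))
    return md
```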

Now let’s test it out.

I think it looks great!

I wrote another helper function to ask questions easily.
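
Again, the names here are hypothetical; something along these lines:

```python
from IPython.display import Markdown, display

def ask(question):
    # Hypothetical convenience wrapper: run the chain and render the result
    result = rag_chain_with_source.invoke(question)
    display(Markdown(format_output(result)))

ask("What communities should I join?")
```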

Time to test it on a few fun questions.

Question time!

Side note: I’m actually impressed by these answers, and it was fun to test it out!

Communities

Boston

Useful macOS apps

Advice for 20s

Getting rich

Killing (not in my blog)

That’s all!

For more examples, please check out the notebook!

If you have any ideas or suggestions about this project, feel free to leave a comment and reach out!

Resources

I referenced these pages from LangChain for this article.

And technically, this isn’t chatting, so if you want chatting (with chat history and all that jazz) and even want to turn it into an app, check out this video!

Thanks for reading!

Be sure to follow the bitgrit Data Science Publication to keep updated!

Want to discuss the latest developments in Data Science and AI with other data scientists? Join our discord server!

Follow Bitgrit below to stay updated on workshops and upcoming competitions!

Discord | Website | Twitter | LinkedIn | Instagram | Facebook | YouTube
