Building a knowledge base — Part 1: extracting data by web scraping

Scraping Langchain’s documentation to use it as part of a Large Language Model knowledge base

Mostafa Ibrahim
CodeContent


Building a knowledge base with LLMs is one of the hottest topics right now, with a wide range of useful applications. ChatGPT is only trained on data up to September 2021 and is not trained on private data, so it can’t answer questions about or summarise your private documents. To many companies, the idea of having a ChatGPT equivalent over their private data is quite appealing, since ChatGPT can perform a wide variety of NLP tasks such as question answering, code completion, text summarisation, text generation, and much more. Over the next few weeks, I will be documenting my journey of attempting to build a knowledge base using Langchain.

Fittingly, the knowledge base that I will be experimenting with is built from Langchain’s own documentation. I think a question-answering Large Language Model over Langchain’s documentation could be genuinely useful, helping developers build even more LLM-based applications. The first part of this series covers scraping the data from Langchain’s docs that the knowledge base will be built from.

To acquire this data, we will go through several steps.
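Before getting into those steps, here is a minimal sketch of the kind of scraping code involved in pulling text from a single docs page. This is only an illustration, assuming the requests and BeautifulSoup libraries; the URL is a placeholder for whichever Langchain docs page you want to scrape, and the actual pipeline used later in the series may differ.

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL: swap in whichever Langchain docs page you want to scrape.
DOCS_URL = "https://python.langchain.com/docs/get_started/introduction"

def fetch_page_text(url: str) -> str:
    """Download a docs page and return its visible text content."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Drop non-content elements so only the readable documentation remains.
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)

if __name__ == "__main__":
    text = fetch_page_text(DOCS_URL)
    print(text[:500])  # preview the first 500 characters of extracted text
```

Extending this to the full documentation would mean collecting the set of page URLs first (for example from a sitemap) and running the same extraction over each one, then saving the cleaned text for the knowledge base.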
