Summarize Webpages in Ten Lines of Code with Unstructured + LangChain

Benjamin Torres
3 min read · Jul 24, 2023


Have you ever had to read a pile of documents before joining a meeting, or just to get up to speed on a topic? Getting summaries quickly is one of the tasks you can accomplish with very little effort thanks to our library.

In this post, we will show you how easy it is to take the content of different web pages and produce a summary of each source using unstructured, langchain, and OpenAI.

All the code below can be found in the following Colab notebook.

Getting info ready

First of all, you’ll need a way to extract or download the content of a web page; for this we’ll use the UnstructuredURLLoader class from langchain. It returns a loader, and after calling .load() you get elements that you can filter to keep only the useful information, getting rid of the JS code and irrelevant content from the HTML. So, we define a function generate_document:

from langchain.document_loaders import UnstructuredURLLoader
from langchain.docstore.document import Document
from unstructured.cleaners.core import remove_punctuation,clean,clean_extra_whitespace
from langchain import OpenAI
from langchain.chains.summarize import load_summarize_chain

def generate_document(url):
    "Given a URL, return a langchain Document for further processing"
    loader = UnstructuredURLLoader(urls=[url],
                                   mode="elements",
                                   post_processors=[clean, remove_punctuation, clean_extra_whitespace])
    elements = loader.load()
    selected_elements = [e for e in elements if e.metadata["category"] == "NarrativeText"]
    full_clean = " ".join([e.page_content for e in selected_elements])
    return Document(page_content=full_clean, metadata={"source": url})

We keep only the “NarrativeText” elements, and make use of the cleaning bricks to delete strange characters and content that isn’t useful. The last part of the function creates a Document object from langchain to store all the content obtained.
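To see what the filtering step does in isolation, here is a minimal, self-contained sketch. The FakeElement class and the sample texts are stand-ins for illustration only; the real elements come from loader.load() and carry the same page_content and metadata["category"] fields:

```python
from typing import NamedTuple

# Stand-in for the elements returned by UnstructuredURLLoader.load();
# real elements expose page_content and a metadata dict with a "category" key.
class FakeElement(NamedTuple):
    page_content: str
    metadata: dict

elements = [
    FakeElement("Site navigation", {"category": "UncategorizedText"}),
    FakeElement("LangChain makes LLM apps easy.", {"category": "NarrativeText"}),
    FakeElement("Subscribe to our newsletter!", {"category": "Title"}),
    FakeElement("Unstructured cleans raw documents.", {"category": "NarrativeText"}),
]

# Keep only the narrative text, exactly as generate_document does
selected = [e for e in elements if e.metadata["category"] == "NarrativeText"]
full_clean = " ".join(e.page_content for e in selected)
print(full_clean)
```

Navigation links, titles, and other page chrome are dropped; only the two narrative sentences survive into the joined string.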

Creating the summarization pipeline

The next step is to create a pipeline that ingests the documents, splits them into pieces to feed a language model, calls the OpenAI API, gets the result, and stores it. Sounds like a lot of work? Absolutely not; this is just a small function thanks to langchain:

def summarize_document(url, model_name):
    "Given a URL, return the summary from an OpenAI model"
    llm = OpenAI(model_name=model_name, temperature=0, openai_api_key=openai_key)
    chain = load_summarize_chain(llm, chain_type="stuff")
    tmp_doc = generate_document(url)
    summary = chain.run([tmp_doc])
    return clean_extra_whitespace(summary)

Essentially, we create an llm object for calling the API (the post uses OpenAI’s ‘ada’ model), feed it the document we generated, and obtain the result from the model.

And…that’s all! That function returns the summary for every URL we pass it.
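Under the hood, the “stuff” chain type simply concatenates all the documents’ text into a single prompt before making one model call. A rough pure-Python sketch of that idea (the prompt wording below is illustrative, not LangChain’s actual template):

```python
def stuff_prompt(docs, instruction="Write a concise summary of the following:"):
    """Mimic the 'stuff' strategy: put every document into one prompt."""
    body = "\n\n".join(d["page_content"] for d in docs)
    return f"{instruction}\n\n{body}\n\nCONCISE SUMMARY:"

docs = [
    {"page_content": "First page text."},
    {"page_content": "Second page text."},
]
prompt = stuff_prompt(docs)
print(prompt)
```

This is why “stuff” is the cheapest chain type for short inputs, and also why it fails once the combined text exceeds the model’s context window; for longer inputs, langchain offers other chain types such as “map_reduce”.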

One advantage of this approach is that you send OpenAI only a small number of tokens, saving on billing compared with sending the entire HTML content of the page (and sometimes sending everything isn’t even possible, since current models limit the number of tokens you can send). Also worth knowing: unstructured offers many partition bricks to get your data ready for this kind of task, so you can leverage the full potential of these LLMs with little effort. PDFs, DOCXs, emails…you name it!

Tips

  • You will probably want a mechanism to store previously summarized information so you don’t reprocess URLs. The Colab notebook uses cachier for this purpose, but it works better locally, where the cache can persist on your computer.
  • Other LLM providers are available; check the langchain docs for more information.
