AI engineering
Chat with multiple websites
Ask questions on various blogs/articles in one place
Say you have a list of articles on writing by Paul Graham that you want to chat with using LLMs.
You want to extract quotes, summarize, take notes, ask questions, etc.
How can you do that?
Maybe there are tools already (please comment below if there are!), but I decided to code something for this.
After spending a few hours, I finished coding it (most of the time was spent on the prompt); it’s very barebones and fits all the text into one API call, but the point of this article is more about the idea.
Let’s dive right in!
The code is on Deepnote.
Extracting text
First, we need a tool to extract text from an article.
I came across the Trafilatura Python Package from a Google search
From the website:
Trafilatura is a Python package and command-line tool designed to gather text on the Web. It includes discovery, extraction and text processing components. Its main applications are web crawling, downloads, scraping, and extraction of main texts, metadata and comments. It aims at staying handy and modular: no database is required, the output can be converted to commonly used formats
Here’s how I used it to fetch text from my blog.
The fetch_url
function is used to download the web page content specified by the URL.
And extract
to get the text only
note: The extraction targets the main text part of a page. To extract all text content in a html2txt manner, use
html2text()
But I also needed metadata like title, author, and URL, so I used a different function bare_extraction
to achieve this.
Processing 21 articles
Now, all you do is loop over each link using that function.
I also have it show the tokens to verify that it has extracted text and to ensure it doesn’t exceed the context length of 128k for GPT-4o.
Here’s what the output looks like.
We have 78k tokens in total!
Now that we have the content, we structure it in a way that’s clear for GPT4
Prompt
I spent a solid hour crafting the best prompt to achieve what I wanted, which I ended up with.
You have been provided with a set of articles. Please read the articles and answer the following question based on the content of the articles.
**Question/Instruction:** {question}
**Articles:**
{all_article_texts}
**Instructions:**
1. Answer this question **ONLY** based on the articles provided. Do not include any other content aside from the provided material
2. If the question does not relate to the theme of the articles, respond saying ‘I cannot answer this question based on the provided articles.’
3. When writing the essay, reference specific points from the articles to back up your claims.
4. Make it engaging and informative, ensuring that the essay is well-structured and coherent.
5. Make sure to cite your sources with [1], [2], in the essay, and list them at the end using the provided titles and URLs. in the format 1. [Title](URL)
6. Respond in markdown format.
OpenAI call
We chuck this prompt in the API call.
And use the markdown function to display the output.
Examples
What are the articles about?
50 reasons why someone should write
What should I write?
Ten good quotes on writing
Unconventional writing ideas
How to find my voice
How writing can stop climate change?
And to check that it’s not just answering any questions,
That’s it!
Next steps
The results are satisfactory. I can reference ideas from different articles, and I like how short and actionable the writing is.
My overall vision for an app is this
There’s exa.ai which helps you find good content, and Perplexity that searches content based on your query, and provides you with answers. And there are AI writing tools for research that helps you search papers, chat with papers, and write and cite papers for you, and you can add your own knowledge too.
What I want is something that exist in that mix of tools. Let’s say I’m diving into the rabbit hole of climate change, specifically in how AI can be used to transform the energy network, and I found 10 great articles talking about that.
It’s either I save these articles in another platform, and exa.ai can source even more related articles, I select the ones I want, and I can start a chat. And ideally this chat will continue to help me discover more related sources, and help me build a map of knowledge around that topic. And it should be something I can come back to and revisit.
A few improvements that came to mind:
- figure out how to extract from all platforms, i.e., you cannot extract hacker news comments
- chunk and use RAG for possibly better information retrieval (LLMs with larger context windows, i.e., Gemini 1.5 Pro with 1 million contexts, might make RAG obsolete, but who knows)
- make it chat-based so you can continue a conversation instead of starting a new API call for each question.
Reference this tutorial below for RAG and chat.
Let me know if you’re working on anything similar! Or have any suggestions and ideas!
Thanks for reading!
Be sure to follow the bitgrit Data Science Publication to keep updated!
Want to discuss the latest developments in Data Science and AI with other data scientists? Join our discord server!
Follow Bitgrit below to stay updated on workshops and upcoming competitions!
Discord | Website | Twitter | LinkedIn | Instagram | Facebook | YouTube