Streamlining Data Integration with LLMs and Prompt Engineering
In short: we are going to build a Scraped Data Consolidation System (SDCS), in which raw web-scraped data from multiple sources is fed to a Large Language Model (LLM) and intelligent, structured output is extracted from that raw data using prompt engineering.
Here's what we'll cover in this article:
- The need for such a system
- Understanding prompt engineering
- Practical example
- Cost feasibility
- High-level system design
- Other use cases of a Large Language Model
Why do we need such a system?
Extracting website content directly yields unstructured data full of unnecessary material, and keeping such scrapers working is a tedious process that requires constant maintenance. This is where SDCS comes into the picture: it gives us advanced NLP capability to make sense of the content and extract clean, useful data in a structured manner.
None of this is possible with HTML DOM and CSS rule-based scraping.
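To make this concrete, here is a minimal sketch of the rule-based approach being contrasted, assuming the requests and BeautifulSoup libraries (neither is named above). Naive text extraction dumps navigation and boilerplate along with the article, and any CSS selector is tied to one site's markup:

```python
# A minimal sketch of traditional rule-based scraping, assuming the
# requests and BeautifulSoup libraries (neither is prescribed above).
import requests
from bs4 import BeautifulSoup

url = "https://example.com/some-article"  # hypothetical URL
html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# Naive extraction: dumps menus, ads, and footers along with the article.
raw_text = soup.get_text(separator=" ", strip=True)

# Rule-based alternative: a CSS selector tied to one site's markup,
# which breaks silently the moment the site changes its class names.
article = soup.select_one("div.article-body")
print(article.get_text(strip=True) if article else raw_text[:500])
```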
What is Prompt Engineering?
Prompt engineering refers to the process of designing and formulating effective prompts for generating desired outputs from language models like GPT (Generative Pre-trained Transformer).
It involves crafting input instructions or queries that guide the model towards generating specific, high-quality responses. In the context of GPT models, prompt engineering is particularly important because these models are general-purpose rather than fine-tuned on a specific task or dataset; a well-crafted prompt is often the main way to steer them towards the task at hand.
Prompt engineering is an iterative process that involves experimentation, testing, and refining prompts to achieve the desired outputs.
Now that we've grasped the technical terms, let's explore a real-world example. We'll use data scraped from the top websites returned when searching for ‘best practices for writing ChatGPT prompts’.
Practical Example
To understand this better, let's consider a use case. If we Google “best practices for writing ChatGPT prompts”, we get many similar articles. Say we gather 5 such articles, write their URLs, titles, and content in a sensible JSON format, and call it input-data. Content is a string holding an article's entire text. Each article contains multiple best practices, and some of these may overlap with the best practices mentioned in other articles.
It would be helpful to have a single list that combines the best practices from all articles without any duplicates. It would be even more useful if this list were ordered by the number of articles discussing each practice.
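For illustration, here is a minimal sketch of what input-data might look like; the URLs, titles, and text are hypothetical placeholders, not the actual articles gathered for this example:

```python
# Hypothetical shape of input-data; the URLs, titles, and text are
# placeholders, not the actual articles gathered for this example.
input_data = [
    {
        "url": "https://example.com/chatgpt-prompt-tips",
        "title": "10 Tips for Writing Better ChatGPT Prompts",
        "content": "Entire article text as one string... Be specific. Give context. ...",
    },
    {
        "url": "https://example.org/prompting-best-practices",
        "title": "Prompting Best Practices",
        "content": "Entire article text as one string... Assign a role. Iterate. ...",
    },
    # ...3 more articles in the same shape
]
```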
Now consider the following two prompts, which are passed to an LLM along with the scraped data (using the ChatGPT APIs in this case). The first prompt focuses on extracting meaningful data from the raw DOM content of a webpage, while the second intelligently consolidates the data extracted by the first.
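Here is a minimal sketch of how the two-prompt flow could look with the official openai Python client; the prompt wording and model choice are illustrative assumptions, not the author's exact prompts, and input_data is the structure sketched above:

```python
# Sketch of the two-step prompt flow; the prompt wording is illustrative,
# not the author's exact prompts. Uses the official openai Python client.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

EXTRACT_PROMPT = (
    "You are given the raw text of a webpage. Extract only the best "
    "practices for writing ChatGPT prompts that it describes, as a JSON "
    "array of short strings.\n\nPage text:\n{page}"
)

CONSOLIDATE_PROMPT = (
    "You are given lists of best practices extracted from several "
    "articles. Merge them into one de-duplicated JSON array, ordered by "
    "how many articles mention each practice (most first).\n\nLists:\n{lists}"
)

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep structured extraction as deterministic as possible
    )
    return response.choices[0].message.content

# Prompt 1 runs once per article; prompt 2 consolidates across articles.
extracted = [ask(EXTRACT_PROMPT.format(page=a["content"])) for a in input_data]
consolidated = ask(CONSOLIDATE_PROMPT.format(lists="\n".join(extracted)))
print(consolidated)
```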
Once such a system is set up, the data returned by the LLM is well structured and free of duplicates. Additional cleaning, such as grammar and spelling fixes, can easily be made part of this process simply by indicating it in the prompts.
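Concretely, the consolidated result might look something like this; the practices and counts below are hypothetical illustrations, not actual model output:

```python
# Hypothetical consolidated output; ordering reflects how many of the
# 5 articles mention each practice. Not actual model output.
expected_output = [
    {"best_practice": "Be specific about the task and output format", "mentioned_in": 5},
    {"best_practice": "Provide relevant context or examples", "mentioned_in": 4},
    {"best_practice": "Assign the model a role or persona", "mentioned_in": 3},
    {"best_practice": "Iterate and refine the prompt", "mentioned_in": 2},
]
```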
Is using ChatGPT APIs for this purpose economically feasible?
The ChatGPT API provides access to multiple NLP models, each with different capabilities and price points. Prices are per 1,000 tokens; you can think of tokens as pieces of words, where 1,000 tokens is about 750 words.
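To count tokens for a given string, OpenAI's tiktoken library can be used (an assumption; no tokenizer is named above):

```python
# Counting tokens with OpenAI's tiktoken library (an assumption;
# no tokenizer is named above).
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
text = "Best practices for writing ChatGPT prompts"
print(len(encoding.encode(text)))  # number of tokens this string consumes
```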
Check current pricing here — https://openai.com/pricing
Using the current pricing for GPT-3.5-turbo, and assuming a 750-word input prompt and a 750-word output from the model:
If we take the text (including markup tags) of a typical webpage to be between 750 and 1,000 words, extracting accurate, cleaned data from a single webpage would cost under 0.5 INR.
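As a back-of-the-envelope check, with per-token rates and an exchange rate assumed from around the time of writing (check the pricing page for current numbers):

```python
# Back-of-the-envelope cost per page; all rates below are assumptions
# from around the time of writing, not current prices.
INPUT_RATE_USD = 0.0015   # assumed $ per 1K input tokens (gpt-3.5-turbo)
OUTPUT_RATE_USD = 0.002   # assumed $ per 1K output tokens
USD_TO_INR = 83           # assumed exchange rate

words_in = words_out = 750
tokens_in = words_in / 750 * 1000    # ~1,000 tokens per 750 words
tokens_out = words_out / 750 * 1000

cost_usd = (tokens_in / 1000) * INPUT_RATE_USD + (tokens_out / 1000) * OUTPUT_RATE_USD
print(f"~{cost_usd * USD_TO_INR:.2f} INR per page")  # ~0.29 INR
```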
Though this appears cheap, the cost can become significant when dealing with thousands of pages.
Also, self-deployment and training of a GPT-3-scale (175B-parameter) model of our own is not economically feasible.
So, what’s the solution?
The solution is to self-host the GPT-2 Large (774M parameters) or GPT-2 XL (1.5B parameters) language models, which are open source and easy to fine-tune if needed. These models can also easily be tested beforehand on a platform like Google Colab, even on the free tier.
They are capable enough to extract and consolidate data for our use case, and they accept the same style of prompt engineering as GPT-3.
Here's how a complete in-house solution could look:
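Below is a minimal sketch of the self-hosted path, assuming the Hugging Face transformers library and the gpt2-large checkpoint (gpt2-xl is the 1.5B variant); the prompt wording is illustrative:

```python
# Self-hosted extraction with GPT-2 Large via Hugging Face transformers;
# the library choice and prompt wording are assumptions, not a spec.
from transformers import pipeline

# "gpt2-large" (774M) and "gpt2-xl" (1.5B) are the Hugging Face model ids.
generator = pipeline("text-generation", model="gpt2-large")

page_text = "...raw scraped article text..."  # output of the scraping step
prompt = (
    "Extract the best practices for writing ChatGPT prompts from the "
    f"following article as a bulleted list.\n\nArticle:\n{page_text}\n\n"
    "Best practices:\n-"
)

result = generator(prompt, max_new_tokens=200, do_sample=False)
print(result[0]["generated_text"][len(prompt):])  # keep only the completion
```

In practice, a base GPT-2 model may need few-shot examples in the prompt, or the fine-tuning mentioned above, before it follows such instructions reliably.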
Conclusion
This article describes a way to utilize AI and NLP advancements to improve our traditional processes. There are many other ways in which an in-house LLM can help an organisation, for example: customer support and chatbots, content generation, data analysis and insights, personalisation and recommendations, information retrieval, language translation, research and development, etc. All of this can be accessed from the model in an automated manner using prompt engineering.