Streamlining Data Integration with LLMs and Prompt Engineering

Harsh Adwani
4 min read · Sep 24, 2023


In short: we are going to build a Scraped Data Consolidation System (SDCS), in which raw web-scraped data from multiple sources is fed to a Large Language Model (LLM), and prompt engineering is used to extract intelligent, structured output from that raw data.


Here’s what we’ll cover in this article:

  • The need for such a system
  • Understanding prompt engineering
  • A practical example
  • Cost feasibility
  • High-level system design
  • Other use cases of a Large Language Model

Why do we need such a system?

Scraping website content directly yields unstructured data cluttered with markup and other unnecessary content.

It’s also a tedious process that requires constant maintenance as sites change. This is where SDCS comes into the picture: it uses the advanced NLP capability of an LLM to make sense of the context and extract clean, useful data in a structured manner.

None of this is possible with HTML DOM and CSS rule-based scraping alone.

What is Prompt Engineering?

Prompt engineering refers to the process of designing and formulating effective prompts for generating desired outputs from language models like GPT (Generative Pre-trained Transformer).

It involves crafting input instructions or queries that guide the model towards generating specific and high-quality responses. In the context of GPT models, prompt engineering is particularly important because these models are general-purpose: they are typically not fine-tuned for your specific task, so the prompt is the main lever for steering their output.

Prompt engineering is an iterative process that involves experimentation, testing, and refining prompts to achieve the desired outputs.

Now that we’ve covered the terminology, let’s walk through a real-world example using data scraped from the top search results for “best practices for writing ChatGPT prompts”.

Practical Example

If we Google “best practices for writing ChatGPT prompts”, we get many similar articles. Say we gather five such articles and record their URLs, titles, and content in a sensible JSON format; call it input-data. Content is a single string holding an article’s entire text. Each article lists multiple best practices, and some of these overlap with practices mentioned in other articles.

It would be helpful to have a single list that combines the best practices from all articles without duplicates. Better still, this list could be ordered by the number of articles that mention each practice.

Input
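As a concrete illustration, input-data could look like the following sketch; the URL, title, and text are placeholders, not the actual scraped articles.

```python
# Illustrative shape of input-data: one object per scraped article.
input_data = [
    {
        "url": "https://example.com/chatgpt-prompt-tips",  # placeholder URL
        "title": "10 Best Practices for ChatGPT Prompts",  # placeholder title
        "content": "The full article text as a single string ...",
    },
    # ... four more articles in the same shape
]
```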

Now consider the two following prompts, which are passed to an LLM along with the scraped data (via the ChatGPT APIs in this case). The first prompt focuses on extracting meaningful data from the raw DOM content of a webpage, while the second intelligently consolidates the data extracted by the first.

Prompts
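For illustration, the two prompts and the surrounding API calls could look roughly like the sketch below, using the openai Python package as it existed at the time of writing. The exact prompt wording and the ask() helper are assumptions, not the original prompts.

```python
# Sketch of the two-step prompting flow; the prompt wording is illustrative.
import openai

openai.api_key = "YOUR_API_KEY"  # assumption: in practice, load this from config/env

EXTRACT_PROMPT = (
    "You are a data-extraction assistant. From the raw webpage text below, "
    "extract only the best practices for writing ChatGPT prompts, as a JSON "
    "list of short strings. Ignore navigation, ads and other boilerplate.\n\n"
)

CONSOLIDATE_PROMPT = (
    "You are given several JSON lists of best practices, one per article. "
    "Merge them into a single de-duplicated list, ordered by how many "
    "articles mention each practice. Return JSON only.\n\n"
)

def ask(prompt: str, payload: str) -> str:
    # temperature=0 keeps the output deterministic and easier to parse
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt + payload}],
        temperature=0,
    )
    return response["choices"][0]["message"]["content"]

# input_data is the list of article objects from the previous snippet
per_article = [ask(EXTRACT_PROMPT, article["content"]) for article in input_data]
consolidated = ask(CONSOLIDATE_PROMPT, "\n".join(per_article))
print(consolidated)
```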

The following image shows the expected output from the LLM once such a system is set up.

Output
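A consolidated result could plausibly look like this; the practices and counts below are made up purely for the example.

```python
# Hypothetical consolidated output: de-duplicated and ordered by article count.
consolidated_best_practices = [
    {"best_practice": "Be specific about the task and desired output format", "mentioned_in": 5},
    {"best_practice": "Provide context or examples within the prompt",        "mentioned_in": 4},
    {"best_practice": "Iterate and refine the prompt based on the output",    "mentioned_in": 3},
    # ...
]
```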

The data is well structured and free of duplicates. Additional cleaning, such as grammar and spelling fixes, can easily be made part of this process simply by asking for it in the prompts.

Is using ChatGPT APIs for this purpose economically feasible?

OpenAI provides API access to multiple NLP models, each with different capabilities and price points. Prices are per 1,000 tokens. You can think of tokens as pieces of words, where 1,000 tokens is about 750 words.

Check current pricing here — https://openai.com/pricing

Using the current pricing for GPT-3.5-turbo, and assuming a 750-word input prompt and a 750-word output from the model:

Cost calculation
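As a back-of-the-envelope sketch, using the GPT-3.5-turbo rates at the time of writing (roughly $0.0015 per 1K input tokens and $0.002 per 1K output tokens) and an exchange rate of about 83 INR per USD, both of which will drift:

```python
# Rough per-page cost; the rates and exchange rate are point-in-time assumptions.
INPUT_RATE = 0.0015   # USD per 1K input tokens
OUTPUT_RATE = 0.0020  # USD per 1K output tokens
USD_TO_INR = 83       # approximate conversion

tokens_in = tokens_out = 1000  # 750 words is roughly 1,000 tokens
cost_usd = (tokens_in / 1000) * INPUT_RATE + (tokens_out / 1000) * OUTPUT_RATE
print(cost_usd, cost_usd * USD_TO_INR)  # ~$0.0035, i.e. ~0.29 INR per page
```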

If we take the text of a typical webpage (including markup tags) to be between 750 and 1,000 words, extracting accurate, cleaned data from one page would cost under 0.5 INR.

Though this appears cheap, the total adds up quickly when dealing with thousands of pages.

Self-deploying and training our own GPT-3-scale model (175B parameters) is also not economically feasible.

So, what’s the solution?

The solution is to self-host the GPT-2 Large (774M parameters) or GPT-2 XL (1.5B parameters) models, which are open source and easy to fine-tune if needed. They can also be tested beforehand on a platform like Google Colab, even on the free tier.

These models are capable enough to extract and consolidate data for our use case, and the same prompt-engineering ideas apply, though, being smaller completion-only models, they typically need more careful prompt design or fine-tuning than GPT-3.
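As a starting point, both models can be loaded with the Hugging Face transformers library. The snippet below is a minimal sketch: GPT-2 is a plain completion model, so the prompt is phrased as a completion rather than an instruction, and real use would likely need fine-tuning.

```python
# Minimal sketch: running GPT-2 Large locally via Hugging Face transformers.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2-large")  # or "gpt2-xl" (1.5B)

article_text = "The full scraped article text goes here ..."  # placeholder input
prompt = article_text + "\n\nThe best practices mentioned above are:\n1."

# Greedy decoding keeps the completion deterministic for easier post-processing.
result = generator(prompt, max_new_tokens=200, do_sample=False)
print(result[0]["generated_text"])
```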

Here’s how a complete in-house solution could look:

In-house SDCS architecture (source: self)
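In code, the flow such a design implies could be glued together roughly as below; every helper here is a hypothetical placeholder, not a prescribed component.

```python
# End-to-end sketch of the in-house SDCS flow; all helpers are stand-ins.
def scrape(url: str) -> str:
    return "raw page text for " + url            # stand-in for a real scraper

def extract_with_llm(page: str) -> str:
    return page                                  # stand-in for the extraction prompt

def consolidate_with_llm(per_page: list) -> str:
    return "\n".join(per_page)                   # stand-in for the consolidation prompt

def run_sdcs(urls: list) -> str:
    raw_pages = [scrape(u) for u in urls]                 # 1. scrape raw DOM text
    extracted = [extract_with_llm(p) for p in raw_pages]  # 2. per-page extraction
    return consolidate_with_llm(extracted)                # 3. merge and de-duplicate

print(run_sdcs(["https://example.com/a", "https://example.com/b"]))
```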

Conclusion

This article describes one way to use AI and NLP advancements to improve traditional processes. There are many other ways an in-house LLM can help an organisation, for example: customer support and chatbots, content generation, data analysis and insights, personalisation and recommendations, information retrieval, language translation, research and development, and more. All of these can be driven from the model in an automated manner using prompt engineering.
