Extract structured data with OpenAI API: Postman collection

Natalia Demianenko
Nov 22, 2023


Introduction

In today’s web-driven world, data plays a crucial role in decision-making and information analysis. In this article, we’ll explore how to use Postman in conjunction with the OpenAI API to extract data from a web page and transform it into a convenient format.

I discussed the general approach to scraping data with AI in a previous article.

Process Schema for Scraping in Postman

I’ve prepared a simple schema of the scraping process in Postman; the dashed blocks are optional.

So, to extract structured data from a web page, we need to:

  • Retrieve the HTML page via URL
  • Transform the HTML, retaining only the text (useful data may also be present in attributes and the page’s head, but for token-limit reasons we’ll stick to this approach for now)
  • Save the text in a variable
  • Utilize the OpenAI API to find data within the obtained text and represent it in JSON format
  • Visualize the data in table form

Let’s begin!

Collection Creation

Firstly, let’s create a collection and define the necessary variables.

This includes the page URL and the fields we want to gather. Let’s add them in the corresponding tab. Additionally, we’ll need the OpenAI API token, which I’ve stored in an environment variable.

As a demo, I’m going to scrape reviews from a book overview page.
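For illustration, the variables might look something like this (both values below are hypothetical; fields is a plain comma-separated list of the fields to extract):

url = https://example.com/books/1
fields = author, rating, text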

Retrieving Page Text

Let’s create a request that simply performs a GET on the URL from the collection’s variables. To extract text from this page, we’ll transform the HTML with regular expressions in a script on the Tests tab. We’ll save the resulting string in a collection variable for further use in the prompt.

const html_string = pm.response.text();

// Capture the inner HTML of the <body> element (capture group 1)
const body_inner_html = html_string.match(/<body[^>]*>([\s\S]*?)<\/body>/i);

// Drop <script> and <svg> blocks, then strip the remaining tags
const body_string = body_inner_html[1]
    .replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, '')
    .replace(/<svg\b[^<]*(?:(?!<\/svg>)<[^<]*)*<\/svg>/gi, '');
const clean_text = body_string.replace(/<[^>]*>/g, '');

pm.collectionVariables.set("page_text", clean_text);
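A quick sanity check in the Postman console helps verify that the cleaned text looks sensible and is short enough to fit the model’s context window:

console.log(`Extracted ${clean_text.length} characters`);
console.log(clean_text.slice(0, 200)); // preview the start of the text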

Fetching Data with OpenAI API

Let’s create a chat completions request. For authentication, we’ll select Bearer Token with the token stored in a variable (in my case, an environment variable). In the Pre-request Script tab, we’ll create a prompt in the following manner:

const page_text = pm.collectionVariables.get("page_text");
const fields = pm.collectionVariables.get("fields");

// Ask the model to return all reviews as JSON with the requested fields
const prompt = `You are the data manager. You need to get all feedbacks from the html in the following json format:
{results: [{${fields}}]}.
HTML is: ${page_text}
`;

// JSON.stringify escapes quotes and newlines so {{prompt}} can be
// substituted directly into the raw JSON request body
pm.collectionVariables.set("prompt", JSON.stringify(prompt));

Here, page_text is the text extracted from the page, which we saved in a collection variable in the previous step, and fields is the list of fields we want to gather, also defined as a collection variable.
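With the hypothetical fields value from above, the rendered prompt would look something like this:

You are the data manager. You need to get all feedbacks from the html in the following json format:
{results: [{author, rating, text}]}.
HTML is: <the cleaned page text>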

Now, let’s structure our request body:

{
    "model": "gpt-3.5-turbo-1106",
    "response_format": {
        "type": "json_object"
    },
    "messages": [
        {
            "role": "user",
            "content": {{prompt}}
        }
    ]
}

Here, prompt is the configured prompt, which we saved in the collection variable in the previous step. Since it was stored with JSON.stringify, the substituted {{prompt}} value already comes with surrounding quotes and escaped newlines, so the body remains valid JSON.

We can save the result in a variable, for instance to send it on to our own API later. Let’s do this in the Tests tab:

// Parse the model's JSON answer and keep only the results array
const data = JSON.parse(pm.response.json().choices[0].message.content).results;
pm.collectionVariables.set("results", JSON.stringify(data));
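As a sketch of that “further sending”, a follow-up script could forward the saved results with pm.sendRequest (the endpoint URL below is a placeholder):

const results = pm.collectionVariables.get("results");

pm.sendRequest({
    url: 'https://example.com/api/reviews', // placeholder endpoint
    method: 'POST',
    header: { 'Content-Type': 'application/json' },
    body: { mode: 'raw', raw: results }
}, (err, res) => {
    if (err) {
        console.error(err);
    } else {
        console.log('Forwarded results, status:', res.code);
    }
});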

Data Visualization

If we want to visualize the data in the form of a table, we need to add the following script in the Tests tab:

let data = JSON.parse(pm.response.json().choices[0].message.content).results;
let fields = pm.collectionVariables.get("fields").split(',');
let template = `
<style type="text/css">
.tftable {font-size:14px;color:#333333;width:100%;border-width: 1px;border-color: #87ceeb;border-collapse: collapse;}
.tftable th {font-size:18px;background-color:#87ceeb;border-width: 1px;padding: 8px;border-style: solid;border-color: #87ceeb;text-align:left;}
.tftable tr {background-color:#ffffff;}
.tftable td {font-size:14px;border-width: 1px;padding: 8px;border-style: solid;border-color: #87ceeb;}
.tftable tr:hover {background-color:#e0ffff;}
</style>
<table class="tftable" border="1">
<tr>
{{#each fields }}
<th>{{this}}</th>
{{/each}}
</tr>

{{#each data }}
<tr>
{{#each this}}
<td>{{this}}</td>
{{/each}}
</tr>
{{/each}}
</table>
`;
function constructVisualizerPayload() {
    return { data, fields };
}

pm.visualizer.set(template, constructVisualizerPayload());

However, it’s important to note that this view will be available only when running the request, not the entire collection.

Our collection is ready! Run it and get the necessary data in the required format within seconds! No need to isolate individual elements, no human involvement in configuring the scraper: simply state what you want to gather and get the result. For more precise results, specify in the prompt whether the reviews should be shortened or kept as-is; by default, the model decides that on its own.

You can find a video of the result in my post on LinkedIn.

The limitations of this method are that the site must be static and that the data volume must not exceed the acceptable limit for an OpenAI API request. And of course, this is the simplest possible solution, meant to showcase the power of AI scraping. To get around these limitations, my next step is to develop a simple extension that performs the scraping task. Subscribe to stay updated on the progress!
