Unlock the Power of Structured Output with GPT-4o

Vincent
4 min read · Sep 21, 2024

Revolutionizing the Integration of Generative Models into Smart Applications

With the release of GPT-4o, OpenAI introduces a feature that is becoming increasingly common in large language models (LLMs): structured outputs.

In this article, we'll explore an innovative approach to web scraping that combines traditional data extraction techniques with the power of artificial intelligence. By leveraging OpenAI's "Structured Outputs" feature, which constrains the model's response to a predefined data schema, we can transform unstructured content (like raw HTML) into well-defined, usable data models. We'll demonstrate this process through a Python example, using our blog as the data source.

What Are Structured Outputs?

Structured outputs offer a new way of interfacing with generative models like GPT, where the model produces pre-organized, structured information based on predefined formats. Rather than relying on unstructured, free-form text, this feature ensures the output conforms to specific data schemas, enabling seamless integration into automated systems, databases, or complex workflows.

Free-Form Text vs. Structured Output

In typical interactions with generative models, user queries such as questions or instructions result in free-form text responses. While these text outputs can be informative, they often require additional manual or algorithmic post-processing to extract and structure the relevant information (e.g., transforming it into JSON, tables, etc.).

Structured outputs eliminate this intermediate step by directly formatting the model’s response according to a well-defined data schema. This means the model provides structured responses — such as objects or tables (like JSON) — ready for immediate use in applications, with no need for labor-intensive post-processing.
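To make the contrast concrete, here is a minimal, purely illustrative sketch: the same answer returned as free-form text (which you would still have to parse) versus as a structured object that already matches a schema. The Article model and its values are invented for the example.

from pydantic import BaseModel

# Free-form answer: a string that still needs to be parsed
free_form = "The latest post is 'Hello RAG', published at https://example.com/hello-rag."

# Structured answer: the response is already an instance of a schema
class Article(BaseModel):
    title: str
    url: str

structured = Article(title="Hello RAG", url="https://example.com/hello-rag")
print(structured.model_dump())  # {'title': 'Hello RAG', 'url': 'https://example.com/hello-rag'}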

Advantages for Smart Application Development

Schema Adherence
When working with generative models via APIs, it’s crucial that responses adhere to a specific format. For example, if the model generates product information or web scraping results, structured outputs ensure that the data follows a schema (such as a Product object with fields like name, price, url, etc.).
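As a sketch, such a Product schema could be declared with Pydantic as follows (the field types here are an assumption for illustration; the article's actual model comes later):

from pydantic import BaseModel

class Product(BaseModel):
    name: str
    price: float
    url: str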

Process Automation
Structured outputs enable automation of processes that would otherwise require human intervention. For instance, if you scrape a webpage for structured data like titles, links, or images, structured outputs allow you to directly integrate the results into a database or system without needing to manually reformat or extract information.
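As an illustration (not part of the workflow shown later), validated results could be written straight into a SQLite table; the data below is invented:

import sqlite3

# Hypothetical scraped results, already validated against a schema
posts = [
    {"title": "Hello RAG", "url": "https://example.com/hello-rag", "image": "https://example.com/cover.png"},
]

conn = sqlite3.connect("scraping.db")
conn.execute("CREATE TABLE IF NOT EXISTS blog_posts (title TEXT, url TEXT, image TEXT)")
conn.executemany(
    "INSERT INTO blog_posts (title, url, image) VALUES (:title, :url, :image)",
    posts,
)
conn.commit()
conn.close()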

Simplified Integration
In automated workflows or SaaS applications, structured data can be easily integrated into systems for analysis, reporting, or content management, making automatic processing much simpler and more efficient.

Hands-On Example

Preparing the Environment

To get started, let’s ensure we have the necessary libraries installed for data extraction and analysis.

pip install requests beautifulsoup4 openai pydantic tiktoken

Key libraries we’ll use include:

  • requests and BeautifulSoup for fetching, extracting, and cleaning the HTML,
  • OpenAI to interact with the GPT API and take advantage of structured outputs,
  • Pydantic for managing the data models,
  • tiktoken for counting tokens before calling the API.

Extracting HTML Content

The first step is to extract the content from the target webpage. In this example, we’ll scrape our blog’s article list.

import requests
from bs4 import BeautifulSoup

SOURCE_URL = "https://www.eurelis.com/blog/"
response = requests.get(SOURCE_URL)
soup = BeautifulSoup(response.text, 'html.parser')

# Clean the HTML content
for data in soup(['style', 'script', 'link', 'meta', 'header']):
    data.decompose()

html_content = str(soup)

At this point, we’ve retrieved the raw HTML content from the page and removed irrelevant elements like scripts, styles, and metadata.

Preprocessing and Tokenization for GPT-4o

The GPT-4o model has a 128,000-token context window, so we need to ensure the page's content doesn't exceed this limit before sending it to the API.

import tiktoken

OPENAI_MODEL = "gpt-4o-2024-08-06"
encoding = tiktoken.encoding_for_model(OPENAI_MODEL)

# Check the token count
if len(encoding.encode(html_content)) > 128000:
    raise ValueError("The content exceeds the token limit.")
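If a page is too long, an alternative to raising an error (not part of the original script) is to truncate the token sequence, keeping some headroom for the prompt and the model's response:

MAX_INPUT_TOKENS = 100_000  # assumed budget, leaving headroom below the 128k context window

tokens = encoding.encode(html_content)
if len(tokens) > MAX_INPUT_TOKENS:
    # Keep the beginning of the page and decode it back to text
    html_content = encoding.decode(tokens[:MAX_INPUT_TOKENS])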

Interacting with OpenAI and Extracting Data

Now that we have cleaned and prepared the HTML content, we’ll proceed to extract relevant information using OpenAI’s structured outputs.

We’ll define a data model, such as BlogPost, which will hold the title, URL, and an image.

from pydantic import BaseModel
from typing import List

class BlogPost(BaseModel):
    title: str
    url: str
    image: str

class Data(BaseModel):
    blog_posts: List[BlogPost]

Next, we send a request to the OpenAI API to extract information based on this data schema.

import os

from openai import OpenAI

# The API key is assumed to be set in the OPENAI_API_KEY environment variable
openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
SYS_PROMPT = "Analyze the following HTML page and extract key information, then organize it according to the expected data schema."

completion = openai_client.beta.chat.completions.parse(
    model=OPENAI_MODEL,
    messages=[
        {"role": "system", "content": SYS_PROMPT},
        {"role": "user", "content": "The HTML content is: " + html_content},
    ],
    response_format=Data,
)

# Result
print(completion.choices[0].message.parsed)

The Result

GPT-4o processes the HTML page and returns the data structured as a list of BlogPost objects. Each item contains a title, URL, and corresponding image, making it easy to reuse the data for tasks like content analysis or marketing automation.
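Since the parsed result is a Pydantic Data instance, it can be reused directly; here is a small usage sketch (the output file name is arbitrary):

data = completion.choices[0].message.parsed

for post in data.blog_posts:
    print(f"{post.title} -> {post.url}")

# Serialize the whole result, e.g. to feed another system
with open("blog_posts.json", "w", encoding="utf-8") as f:
    f.write(data.model_dump_json(indent=2))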

Conclusion

Integrating OpenAI’s structured outputs into web scraping workflows significantly enhances the accuracy and structure of extracted data. This technique allows you to transform unstructured content into immediately usable data, while minimizing the manual effort traditionally associated with web scraping.

OpenAI’s API offers a powerful tool for complex web scraping tasks, particularly when it comes to extracting and organizing large volumes of unstructured information.

Structured outputs are a major leap forward in the use of generative models, especially in scenarios where data must be ready for immediate use. Whether for web scraping, information extraction, or process automation, they streamline workflows while ensuring data accuracy and reusability.

Learn more about structured outputs: https://platform.openai.com/docs/guides/structured-outputs/introduction

Thanks for reading! You can follow me on LinkedIn or X. More about Eurelis on LinkedIn or X.
