Leveraging ChatGPT for HTML parsing: a game-changer rendering regular expressions obsolete?

Alexandr Dzhumurat
5 min read · Jul 11, 2023


Spoiler: can’t wait to see the code? Here’s something for you.

import os

import backoff
import openai  # for OpenAI API calls
import pandas as pd
import requests
from bs4 import BeautifulSoup


openai.api_key = os.getenv("OPENAI_API_KEY")


def prompt_generation(raw_html: str) -> str:
    prompt = f"""
    I have an HTML code snippet. The snippet contains an activity description.
    Extract the event description from the HTML in JSON format.
    The resulting JSON should contain the fields: event name, link to event, location, description, time start, time end

    HTML: {raw_html}

    JSON:
    """
    return prompt


@backoff.on_exception(backoff.expo, openai.error.Timeout)
def gpt_query(gpt_prompt: str, verbose: bool = False):
    gpt_messages = [
        {"role": "system", "content": "You are a HTML parser."},
        {"role": "user", "content": gpt_prompt},
    ]
    openai_params = {
        'model': 'gpt-3.5-turbo',
        'max_tokens': 1000,
        'temperature': 0.0,
        'top_p': 0.5,
        'frequency_penalty': 0.5,
        'messages': gpt_messages,
    }
    response = openai.ChatCompletion.create(**openai_params)
    gpt_response_raw = response.get("choices")[0].get("message").get("content").replace('\n', '')
    # validate_gpt_responce parses and validates the raw model output (see the gist linked at the end)
    res = {'gpt_resp': validate_gpt_responce(gpt_response_raw)}
    res.update(response.get("usage").to_dict())
    return res


def get_iterable_from_url(url: str) -> pd.DataFrame:
    resp = requests.get(url)
    dummy_scraper = BeautifulSoup(markup=resp.content, features="html.parser")
    if 'eater.com/' in url:
        scraper_params = {'name': 'section', 'class_': 'c-mapstack__card'}
    elif 'everout.com/' in url:
        scraper_params = {'name': 'div', 'class_': 'event-schedules'}
    else:
        raise RuntimeError('Valid scraper not found')
    page_blocks = dummy_scraper.find_all(**scraper_params)
    page_description = []
    for block in page_blocks:
        gpt_resp = gpt_query(prompt_generation(block.text))
        gpt_resp['gpt_resp'].update({'source_url': url, 'raw_html': str(block)})
        page_description.append(gpt_resp['gpt_resp'])
    result_df = pd.json_normalize(page_description)
    return result_df

Now let’s see what’s going on here.

Let’s say we are building a restaurant recommendation service; call it “FindMeWhereToEat”. One of the crucial tasks is collecting the content needed to make recommendations. A common approach is to build parsers specific to individual restaurant websites like Yelp, but this method has a drawback: it requires frequent refinement whenever a site’s layout changes. As we expand the service to more sources, the time and effort spent maintaining parsers grow as well, posing a scalability challenge.

In this article, I will guide you through building a more versatile parser using a large language model (LLM). This approach minimizes scaling and support costs, allowing for a more efficient and adaptable system. Let’s dive into the details and explore how it can enhance our restaurant recommendation service.

Task decomposition

We want to build a system that will extract information about the content (restaurants) from HTML pages in a structured form (JSON): name, description, opening hours, etc.

In NLP terms, extracting structured information from HTML code is a text summarization task: we want to pull the meaningful part out of a large text. To train a summarization model we need to build a dataset in the form of a mapping full_text: summarized_txt.

To train the summarization model we need to

  • collect a list of sites from which we will extract the training samples (restaurant data)
  • see what resources are included in the output and collect Blacklist — uninformative (for our task) sites (e.g. Reddit)
  • collect training sample of raw_html -> json
  • train the model and check the quality of summarization
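To make the target of the last two steps concrete, here is a sketch of what a single raw_html -> json training sample could look like. The field values are hypothetical, modeled on the extraction result shown later in the article:

```python
# One (hypothetical) training sample for the summarization model:
# the raw HTML block is the input, the structured JSON is the target.
training_sample = {
    "raw_html": '<div class="event"><h2>Le Rock</h2><p>45 Rockefeller Plaza, New York, NY 10111</p></div>',
    "target_json": {
        "event name": "Le Rock",
        "link to event": "",
        "location": "45 Rockefeller Plaza, New York, NY 10111",
        "description": "",
        "time start": "",
        "time end": "",
    },
}

print(training_sample["target_json"]["event name"])
```

A large collection of such pairs, gathered across many differently structured sites, is exactly what the fine-tuning step will need.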

In this article, we will consider the first two points — how to implement them as fast as possible.

Data sources

First, we need to get a list of pages where we will train and validate our model.

Collecting these sites can be automated:

  • ask ChatGPT, Google Bard, or Bing for something like “Help me find a list of sites that list the best restaurants in New York City”
  • ask ChatGPT to generate 15–20 search queries that are semantically similar to the query from the first paragraph and extract links from the search results using the GoogleSerpAPI library — this approach will increase recall, i.e. it will add sites that ChatGPT “doesn’t know” about.

The collected list of sites should be reviewed, some of them are most likely not suitable for parsing or do not contain relevant information.
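Part of that review can itself be automated with the blacklist from the task decomposition above. A minimal sketch, assuming a hand-maintained set of uninformative domains (the domain names and URLs here are illustrative):

```python
from urllib.parse import urlparse

# Hypothetical blacklist of domains that are uninformative for our task
BLACKLIST = {"reddit.com", "quora.com"}


def filter_candidate_urls(urls):
    """Keep only URLs whose domain is not blacklisted."""
    kept = []
    for url in urls:
        # Normalize the domain: strip the leading "www." before lookup
        domain = urlparse(url).netloc.removeprefix("www.")
        if domain not in BLACKLIST:
            kept.append(url)
    return kept


candidates = [
    "https://www.reddit.com/r/FoodNYC/some-thread",
    "https://ny.eater.com/maps/best-romantic-restaurants-date-night-nyc",
]
print(filter_candidate_urls(candidates))
```

The remaining URLs still need a manual pass, but the obvious noise is gone before anyone looks at the list.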

Training dataset

Training data for the summarization model should be a dataset of raw_html -> json (structured information) pairs.

What is the best way to collect such data? Ideally we would implement a dedicated data scraper, but let’s take advantage of ChatGPT instead!

The basic trick is that web pages are usually quite well structured in the form of blocks: in the data sources above, each restaurant or event is rendered as its own visually distinct block of content.


To extract information from a block, the main thing is to isolate the blocks themselves and then “feed” them to ChatGPT to structure the information, according to the following algorithm:

  • divide the page into top-level blocks for parsing (as in the picture above)
  • create a prompt for GPT to extract information from the block in JSON format
  • run a script to get the activity description in JSON using the created prompt
  • review cases where parsing did not work, and refine the script from point 3 accordingly
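Step 4 is where the helper `validate_gpt_responce`, called in the code above but not shown in the article, earns its keep: the model occasionally returns malformed output, and a bad completion should not break the whole scrape. A minimal sketch of such a validator, assuming the model is asked to return a plain JSON object:

```python
import json


def validate_gpt_responce(gpt_responce_raw: str) -> dict:
    """Parse the model output as JSON; on failure, return a marker record
    so the failed case can be reviewed later instead of crashing the run."""
    try:
        parsed = json.loads(gpt_responce_raw)
    except json.JSONDecodeError:
        return {"parse_error": True, "raw_response": gpt_responce_raw}
    if not isinstance(parsed, dict):
        return {"parse_error": True, "raw_response": gpt_responce_raw}
    return parsed


print(validate_gpt_responce('{"event name": "Le Rock"}'))
```

Records carrying `parse_error` can then be filtered out of the resulting DataFrame and fed back into prompt refinement.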

An example of an HTML snippet containing the required information:

<div class="c-mapstack__desktop-social">
<div class="c-social-buttons c-social-buttons--popover" data-cdata='{"entry_id":22663434,"services":["twitter","facebook","pocket","flipboard","email"],"base_url":"https://ny.eater.com/maps/best-romantic-restaurants-date-night-nyc"}' data-cid="site/social_buttons_list/popover-1689052966_9593_4783">
<h2 class="sr-only" id="heading-label--smgbvsiz">Share this story</h2>
<ul aria-labelledby="heading-label--smgbvsiz">
<li>
<a class="c-social-buttons__item c-social-buttons__facebook" data-analytics-social="facebook" href="https://www.facebook.com/sharer/sharer.php?text=The+Most+Romantic+Restaurants+in+NYC&amp;u=https%3A%2F%2Fny.eater.com%2Fmaps%2Fbest-romantic-restaurants-date-night-nyc">
<svg class="p-svg-icon c-social-buttons__svg" role="img"><use xlink:href="#icon-facebook"></use></svg>
<span class="sr-only">Share this on Facebook</span>
</a>
</li>
<li>
<div class="c-entry-content c-mapstack__content">
<p id="YnSnkk">When looking for a romantic restaurant or bar, words like “cozy,” “intimate,” and “low-lit” probably come to mind. But we’d argue that the food and bottles of colorful pet-nat are just as crucial to a memorable swoony evening as the decor. Below, we’ve rounded up a few of our favorite spots that are more than just a pretty indoor dining room — although we took that into account, too — for those special occasion dates.</p>
<p id="UsiWW1"><small><em>Health experts consider dining out to be a high-risk activity for the unvaccinated; it may pose a risk for the vaccinated, especially in areas with substantial COVID transmission.</em></small></p>
<p id="Agxajy"></p>
<p id="LfsNef"></p>
<a class="c-mapstack__content-read-more" data-analytics-link="read-more" data-read-more="now">Read More</a>
<div class="c-mapstack__disclaimer">
If you buy something or book a reservation from an Eater link, Vox Media may earn a commission. See our <a href="https://eater.com/pages/eater-ethics-statement">ethics policy</a>.
</div>

Prompt for ChatGPT (you can see the same in the Python code)

You are a HTML parser.
I have an HTML code snippet. The snippet contains an activity description.
Extract the event description from the HTML in JSON format.
The resulting JSON should contain the fields: event name, link to event, location, description, time start, time end

HTML: {raw_html}

JSON:

Summarization result

{'event name': 'Le Rock',
'link to event': 'https://www.google.com/maps/place/45+Rockefeller+Plaza,+New+York,+NY+10111',
'location': '45 Rockefeller Plaza, New York, NY 10111',
'description': 'Even if you’re not a banker or SNL cast member, getting all dolled up for a date at this Art Deco Rockefeller Center restaurant is fun. Given how hard reservations are at the hotspot from the team behind Frenchette, you’ll impress your date by taking them there at all, let alone slurping on seafood and a luxe dessert tower, while the wine flows.',
'time start': '',
'time end': ''}

That is, ChatGPT handles HTML parsing efficiently!

This firstly reduces parsing costs, and secondly gives hope that a universal text summarization model can be trained, provided we build a large dataset of varied HTML samples paired with the structured JSON data extracted from them.

Fitting an LLM for this task will be the subject of the next article in this series.

Stay tuned!

Bonus: https://gist.github.com/aleksandr-dzhumurat/3b4fdcdca8a0871ebd0c7a2338db93df
