Use Beautiful Soup 4 to build an HTML table of contents automatically
Beautiful Soup 4 is a well-known package for navigating HTML and XML data structures. Its simplicity can save us a lot of time. Here, we will use Beautiful Soup 4 to build an HTML table of contents in a few lines of code.
This will help us quickly get a rough idea of what is covered in an article, and we could even reuse this code to go further and validate our hn tag structure from an SEO point of view. Let’s begin.
If you don’t already have Beautiful Soup 4 installed, you can find the installation instructions here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup
The code below also uses the requests package to download the page.
Initializing Beautiful Soup 4
First, we need some input data. Beautiful Soup 4 is made for HTML and XML parsing, so let’s get an HTML page: https://en.wikipedia.org/wiki/Machine_learning
We will begin with the basics by importing the required packages and fetching the page.
import re  # We are going to need regular expressions later on
import requests
from bs4 import BeautifulSoup

wiki_page = requests.get("https://en.wikipedia.org/wiki/Machine_learning")
Now that we have the page, we need to initialize Beautiful Soup with this line of code.
soup = BeautifulSoup(wiki_page.content, 'html.parser')
Finding the data in the HTML structure
Once the page is loaded and parsed by Beautiful Soup, we can begin looking for the information we need. We want to make a table of contents, so we need all the titles, including the main title.
The main title can be found by looking for the title tag. Beautiful Soup 4 makes this straightforward: we can find the title tag using the find method. After this, we will store the tag name and its text in a list for later use.
title_node = soup.find("title")
tile_data = [title_node.name, title_node.get_text()]
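As a minimal sketch of the same lookup on a small inline document (the `sample_html` string and the `get_title_data` helper are illustrative assumptions, not part of the original code): `find` returns `None` when no matching tag exists, so a guard can be useful on pages that have no title tag.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used here instead of a live page
sample_html = (
    "<html><head><title>Machine learning - Wikipedia</title></head>"
    "<body></body></html>"
)


def get_title_data(html):
    """Return [tag_name, text] for the <title> tag, or None if it is missing."""
    soup = BeautifulSoup(html, "html.parser")
    title_node = soup.find("title")
    if title_node is None:  # find() returns None when nothing matches
        return None
    return [title_node.name, title_node.get_text()]


print(get_title_data(sample_html))
print(get_title_data("<html><body><p>No title here</p></body></html>"))
```

The first call returns the tag name together with its text; the second returns `None` instead of raising an `AttributeError`.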
What remains is to find all the other titles in the page. If the page is well structured, every title will be in an hn tag. What we call hn tags are the heading tags h1, h2, h3 and so on. HTML defines six heading levels, and pages rarely go deeper than the fourth, so we can keep only the tags whose name has a single digit after the “h”. Regular expressions are a convenient way to do this, and they can be used with Beautiful Soup as well.
hns = soup.find_all(re.compile("h[0-9]{1}"))
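To see what the regular expression selects, here is a small sketch on an inline fragment (the `sample_html` string is an assumption made for illustration). It uses a slightly stricter pattern, `^h[1-6]$`, as one possible variant: Beautiful Soup applies a compiled pattern to tag names with its `search()` method, so anchoring the pattern limits matches to the six standard heading tags.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical fragment mixing headings with other tags
sample_html = """
<h1>Machine learning</h1>
<p>Intro paragraph</p>
<h2>History</h2>
<h3>Early work</h3>
<header>Not a heading tag</header>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Anchored variant of the pattern: exactly "h" followed by one digit from 1 to 6
heading_pattern = re.compile(r"^h[1-6]$")

# find_all returns the matching tags in document order
heading_names = [tag.name for tag in soup.find_all(heading_pattern)]
print(heading_names)
```

The paragraph and the `header` tag are skipped, and the headings come back in the order they appear in the page, which is what a table of contents needs.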
We now have our hn tags. We need to clean them up a bit and store them in a structured way. For every hn tag, we will keep only the first text element, because the headings in our data contain “edit” links that we need to remove.
hn_structure = []
for hn in hns:
    hn_text_content = [x for x in hn.stripped_strings if x is not None]
    if len(hn_text_content) > 0:
        hn_structure.append([hn.name, hn_text_content[0]])
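The effect of keeping only the first stripped string can be checked on a hand-written heading. The markup below is a hypothetical imitation of a heading followed by an “edit” link, not a copy of the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a heading with an "edit" link, and an empty heading
sample_html = (
    '<h2><span class="mw-headline">History</span>'
    '<span class="mw-editsection">[<a href="#">edit</a>]</span></h2>'
    "<h3>   </h3>"
)

soup = BeautifulSoup(sample_html, "html.parser")

hn_structure = []
for hn in soup.find_all(["h2", "h3"]):
    # stripped_strings yields each text fragment with surrounding whitespace
    # removed, and skips fragments that contain only whitespace
    texts = list(hn.stripped_strings)
    if texts:  # headings without any text are ignored
        hn_structure.append([hn.name, texts[0]])

print(hn_structure)
```

Only the heading text is kept: the bracket and “edit” fragments come after the first string, and the whitespace-only h3 produces no entry at all.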
Lastly, we want to convert this data into HTML. A simple loop will get the job done.
tag_template = "<{tag_name}>{content}</{tag_name}>\n"
html_output = "<html>\n"
html_output += tag_template.format(tag_name = tile_data[0], content = tile_data[1])
for hn in hn_structure:
    html_output += tag_template.format(tag_name = hn[0], content = hn[1])
html_output += "</html>"
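One detail worth considering: `get_text()` and `stripped_strings` return decoded text, so a title containing characters such as `&` or `<` would be written into the output unescaped. A possible variant, sketched here with a hypothetical `build_toc_html` helper and made-up sample data, escapes the text with the standard library's `html.escape` before formatting:

```python
import html

tag_template = "<{tag_name}>{content}</{tag_name}>\n"


def build_toc_html(title_data, hn_structure):
    """Build the output string, escaping text so it cannot break the markup."""
    parts = ["<html>\n"]
    for tag_name, content in [title_data] + list(hn_structure):
        parts.append(
            tag_template.format(tag_name=tag_name, content=html.escape(content))
        )
    parts.append("</html>")
    return "".join(parts)


# Hypothetical data in the same [tag_name, text] shape used above
example = build_toc_html(["title", "R&D notes"], [["h1", "R&D"], ["h2", "a < b"]])
print(example)
```

Collecting the pieces in a list and joining them once is a small design choice; repeated `+=` on a string works just as well for a page of this size.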
That’s it: the generated HTML is in the html_output variable. We have shown how easy it is to get data out of an HTML or XML structure with Beautiful Soup 4. From this starting point, you can either add some styling to your HTML table of contents to make it visually appealing, or go the SEO way and use your hn tag data to validate the hn structure.
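As a rough sketch of the SEO direction, the function below flags headings that jump more than one level deeper than the previous heading (for example an h4 directly after an h2). The rule itself, the `find_skipped_levels` name, and the sample data are assumptions chosen for illustration; a real audit may apply different checks.

```python
def find_skipped_levels(hn_structure):
    """Return (previous_tag, current_tag, text) for each heading that skips a level."""
    problems = []
    previous_level = 0  # 0 means "no heading seen yet"
    for tag_name, text in hn_structure:
        level = int(tag_name[1:])  # "h3" -> 3
        if level > previous_level + 1:
            previous_tag = "h{}".format(previous_level) if previous_level else None
            problems.append((previous_tag, tag_name, text))
        previous_level = level
    return problems


# Hypothetical data in the same [tag_name, text] shape as hn_structure
sample = [
    ["h1", "Machine learning"],
    ["h2", "History"],
    ["h4", "Early work"],  # skips h3
    ["h2", "Approaches"],
]
print(find_skipped_levels(sample))
```

Moving back up the hierarchy, as with the last h2, is not reported; only jumps that skip a level on the way down are.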
If you enjoyed the article or found it useful, it would be kind of you to support me by following me here (Jonathan Mondaut). More articles are coming very soon!
Below is the full code for your use:
import re  # We are going to need regular expressions later on
import requests
from bs4 import BeautifulSoup

wiki_page = requests.get("https://en.wikipedia.org/wiki/Machine_learning")
soup = BeautifulSoup(wiki_page.content, 'html.parser')

title_node = soup.find("title")
tile_data = [title_node.name, title_node.get_text()]

hns = soup.find_all(re.compile("h[0-9]{1}"))
hn_structure = []
for hn in hns:
    hn_text_content = [x for x in hn.stripped_strings if x is not None]
    if len(hn_text_content) > 0:
        hn_structure.append([hn.name, hn_text_content[0]])

tag_template = "<{tag_name}>{content}</{tag_name}>\n"
html_output = "<html>\n"
html_output += tag_template.format(tag_name = tile_data[0], content = tile_data[1])
for hn in hn_structure:
    html_output += tag_template.format(tag_name = hn[0], content = hn[1])
html_output += "</html>"