Use Beautiful Soup 4 to build an HTML table of contents automatically

Initializing Beautiful Soup 4

First, we need some input data. Beautiful Soup 4 is made for parsing HTML and XML, so let’s fetch an HTML page: https://en.wikipedia.org/wiki/Machine_learning

Wikipedia’s Machine learning page
import re # We are going to need to use regular expressions later on
import requests
from bs4 import BeautifulSoup
wiki_page = requests.get("https://en.wikipedia.org/wiki/Machine_learning")
soup = BeautifulSoup(wiki_page.content, 'html.parser')
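
Beautiful Soup accepts any HTML string or bytes, so the parsing step can also be tried without a network request. Here is a minimal sketch, assuming a small hand-written document (`sample_html` and its contents are made up for illustration and are not taken from Wikipedia):

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, written by hand for this example
sample_html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1>Main heading</h1>
    <p>Some text.</p>
  </body>
</html>
"""

# Same parser as for the Wikipedia page, applied to a plain string
sample_soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first matching tag, or None when nothing matches
sample_title = sample_soup.find("title")
print(sample_title.name, sample_title.get_text())
```

Working on a small string like this makes it easier to check what each call returns before running it on a full page.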

Finding the data in the HTML structure

Once the page has been loaded and parsed by Beautiful Soup, we can start looking for the information we need. We want to build a table of contents, so we need every heading on the page, as well as the page title.
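
To get a feel for what a heading lookup returns, it can help to run one on a tiny document first. The sketch below assumes a hand-written HTML string (`sample_html` is hypothetical) and uses a regular expression anchored with `^` and `$`, because Beautiful Soup searches for the pattern anywhere in each tag name:

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup, written by hand for this example
sample_html = "<h1>Guide</h1><p>Intro</p><h2>Setup</h2><h3>Install</h3><h2>Usage</h2>"
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Anchored pattern: only the tag names h1 to h6 match
heading_pattern = re.compile(r"^h[1-6]$")

# find_all() returns the matching tags in document order
sample_headings = [
    (tag.name, tag.get_text(strip=True))
    for tag in sample_soup.find_all(heading_pattern)
]

for name, text in sample_headings:
    print(name, text)
```

The paragraph is skipped and the headings come back in the order they appear in the document, which is the order a table of contents needs.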

# The <title> tag holds the main title of the page
title_node = soup.find("title")
title_data = [title_node.name, title_node.get_text()]

# Match heading tags h1 to h6; Beautiful Soup applies the regex to tag names
hns = soup.find_all(re.compile("^h[1-6]$"))
hn_structure = []
for hn in hns:
    # Keep the first non-empty string of each heading and skip empty headings
    hn_text_content = list(hn.stripped_strings)
    if len(hn_text_content) > 0:
        hn_structure.append([hn.name, hn_text_content[0]])

# Write the title and every heading back out with its original tag name
tag_template = "<{tag_name}>{content}</{tag_name}>\n"
html_output = "<html>\n"
html_output += tag_template.format(tag_name=title_data[0], content=title_data[1])
for hn in hn_structure:
    html_output += tag_template.format(tag_name=hn[0], content=hn[1])
html_output += "</html>"
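
The heading text is written into the template as-is, so a heading containing characters such as `&` or `<` could produce malformed HTML. One possible variation, sketched here with the standard library's `html.escape`, escapes each piece of text first. The `build_toc` function name and the sample data are hypothetical, chosen only for this illustration:

```python
from html import escape


def build_toc(title, headings):
    """Return an HTML string listing the page title and its headings.

    `headings` is a list of (tag_name, text) pairs such as ("h2", "History").
    """
    lines = ["<html>", f"<title>{escape(title)}</title>"]
    for tag_name, text in headings:
        # Escape the text so that &, < and > cannot break the markup
        lines.append(f"<{tag_name}>{escape(text)}</{tag_name}>")
    lines.append("</html>")
    return "\n".join(lines)


# Made-up data containing characters that need escaping
toc = build_toc("Q&A", [("h1", "Q&A"), ("h2", "Pros <and> cons")])
print(toc)
```

The function takes plain Python data rather than Beautiful Soup objects, so it can be fed from a list of heading names and texts like the one collected from the parsed page.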

Jonathan Mondaut

CTO in an international digital marketing agency specialized in international SEO.