HTML Parser — Developer Tools

AppSeed.us
6 min read · Jul 29, 2019

Hello Coders,

In this article, I will present an HTML parser our team uses to speed up the integration of new layouts into legacy products coded in different technologies (Python, PHP, JavaScript). I must say from the beginning that the tool is not open-source, given the work involved in building it (about 8 months) and its potential commercial value. In the tool's public repository, we publish free HTML themes that prove the concept and that other developers can use for free.

Btw, my name is Sm0ke and I write a lot on Dev.to.

Now, back to our topic: why bother developing such a tool?

HTML Parser — Tool Screenshot.

Motivation & usage

In web development, things change fast. A few years ago, PHP, Bootstrap, jQuery, and MVC were the mainstream choices, but now the picture looks different:

  • The JavaScript ecosystem tends to dominate the whole picture
  • Python is the new “PHP”
  • jQuery is almost dead; Vue, React, and Angular are melting the charts
  • GraphQL is the rising star for APIs
  • GatsbyJS is digging a nice grave for WordPress
  • Backends are headless, decoupled from the frontends
  • JAMstack, what a beautiful concept
  • MVC, sorry old fellow, you’re dead
  • New players (Bulma CSS, Tailwind) are rising in the CSS framework field

In this crazy world, we decided to stay relevant in the market as a technology company by using tools for almost everything.

HTML parser features

The goal of this tool is to help us translate flat HTML into production-ready templates for various template engines like Blade (Laravel), Jinja2 (Python), Pug (JavaScript), and Mustache. To make this happen, a few problems must be solved efficiently from a technical point of view.

  • Parse the HTML files using a mature, state-of-the-art library whose helpers support our R&D and minimize our development effort. After scanning the market, the winner was the BeautifulSoup library, written in Python.
  • Traverse, edit, and process the HTML tree efficiently. For this, we built an interactive console that lets us move back and forth on the tree, edit the elements and their properties, and export the information in various formats.
  • Translation should be available for any element in the tree, into any relevant format used in production (Pug, Jinja2 ..)

With this short list of requirements, we started writing the code.
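To make the second requirement a bit more concrete, here is a minimal sketch of walking a parsed tree with BeautifulSoup and printing an indented outline. This is not the code of our interactive console; the `outline` helper and the sample markup are hypothetical and only illustrate the idea.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

def outline(node, depth=0):
    """Return one indented line per tag found under `node` (hypothetical helper)."""
    lines = []
    for child in node.children:
        if isinstance(child, Tag):  # skip text nodes and comments
            lines.append("  " * depth + child.name)
            lines.extend(outline(child, depth + 1))
    return lines

# Sample markup, assumed for the demo
sample = "<nav id='nav'><ul><li><a href='#first'>First</a></li></ul></nav>"
soup = BeautifulSoup(sample, "html.parser")
print("\n".join(outline(soup)))
```

A recursive walk like this is usually enough to list the components of a page before deciding which ones to edit or export.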

HTML parser implementation

Importing the library and injecting the HTML content into BeautifulSoup (BS) was the easy part. A few lines of code were enough:

$ pip install beautifulsoup4

from bs4 import BeautifulSoup as bs

def read_file(path):
    # read_file returns the file content as a string
    with open(path, encoding='utf-8') as f:
        return f.read()
html_content = read_file('index.html')
soup = bs(html_content, 'html.parser')
# At this point, we can interact with the HTML
# elements stored in memory using all helpers offered by the BS library
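For readers new to BeautifulSoup, here is a short sketch of a few of those helpers in action. The inline markup is an assumption made for the demo (it replaces `index.html`), and the selection of helpers is just a sample of what the library offers.

```python
from bs4 import BeautifulSoup

# Inline sample markup, used here instead of reading index.html
html_content = """
<html><head><title>Demo</title></head>
<body>
  <nav id="nav"><a class="active" href="#first">First</a><a href="#second">Second</a></nav>
  <img src="images/logo.png">
</body></html>
"""
soup = BeautifulSoup(html_content, "html.parser")

title = soup.title.string                        # text of the <title> node
hrefs = [a["href"] for a in soup.find_all("a")]  # every anchor target
active = soup.select_one("nav#nav a.active")     # lookup by CSS selector
logo = soup.find("img")["src"]                   # attribute access by key

print(title, hrefs, active.get_text(), logo)
```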

Update the header

At some point, we need to change or remove the hardcoded strings present in the HTML header. To edit the header (title and meta description), we use a simple code snippet:

header = soup.find('head')  # returns the whole HEAD node
# To change the title, use:
header.title.string.replace_with('New title - set by HTML Parser')
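The meta description can be handled in a similar way. The sketch below is an assumption about how this could look, not the exact code from our tool; the sample markup and the new description text are made up for the demo.

```python
from bs4 import BeautifulSoup

# Sample markup, assumed for the demo
html_content = """<html><head>
<title>Old title</title>
<meta name="description" content="Old description">
</head><body></body></html>"""
soup = BeautifulSoup(html_content, "html.parser")

header = soup.find("head")
meta = header.find("meta", attrs={"name": "description"})
if meta is None:
    # Create the tag when the template does not ship one
    meta = soup.new_tag("meta")
    meta["name"] = "description"
    header.append(meta)
meta["content"] = "New description - set by HTML Parser"

print(meta)
```

The filter is passed through `attrs` because `name` is also the keyword BeautifulSoup uses for the tag name.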

Change the path for JS Scripts

To have full control over the HTML and future translations, we decided to “normalize” the HTML before starting to export the components. Normalization means moving the assets from an arbitrary structure into standard directories. Imagine that we have this structure before normalization:

<ROOT>
|---- index.html
|---- app.css
|---- js/app.js
|---- images/logo.png
|---- top-cover.jpg

We can easily see that index.html loads images, JS, and CSS assets from different locations. If we plan to integrate this design into an application managed by Gulp, Webpack, Parcel, or any other tool, we need to move all the files manually and update index.html accordingly. This work can easily push a developer to depression. I did it many times before using tools, and I can assure you, it is no fun. The structure after the automatic normalization:

<ROOT>
|---- index.html
|---- assets/css/app.css
|---- assets/js/app.js
|---- assets/images/logo.png
|---- assets/images/top-cover.jpg
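The file-moving half of the normalization could be sketched as follows. This is a simplified illustration, not our tool's code: the `TARGET_DIRS` mapping and the `normalize_assets` helper are hypothetical, and the demo runs on a throwaway copy of the "before" structure.

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical mapping from file extension to target directory
TARGET_DIRS = {
    ".css": "assets/css",
    ".js": "assets/js",
    ".png": "assets/images",
    ".jpg": "assets/images",
}

def normalize_assets(root):
    """Move every known asset under `root` into its standard directory."""
    root = Path(root)
    moved = {}
    # Materialize the listing first, because files are moved during the loop
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        target_dir = TARGET_DIRS.get(path.suffix.lower())
        if target_dir is None:
            continue  # index.html and unknown file types stay in place
        target = root / target_dir / path.name
        if path == target:
            continue  # already normalized
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(path), str(target))
        moved[path.relative_to(root).as_posix()] = target.relative_to(root).as_posix()
    return moved

# Demo on a temporary directory that mimics the structure before normalization
demo_root = Path(tempfile.mkdtemp())
for name in ["index.html", "app.css", "js/app.js", "images/logo.png", "top-cover.jpg"]:
    file_path = demo_root / name
    file_path.parent.mkdir(parents=True, exist_ok=True)
    file_path.write_text("")

moved = normalize_assets(demo_root)
print(moved)
```

The returned mapping (old path to new path) is what the HTML update step needs. Note that this sketch does not handle two assets that share the same file name.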

The HTML sample for JavaScript files:

...
<script type='text/javascript' src='js/bootstrap.js'></script>
<script type='text/javascript' src='js/custom.js'></script>
...

The Python code snippet:


# Select the script nodes that are direct children of BODY
for script in soup.body.find_all('script', recursive=False):

    # Skip inline scripts, which have no src attribute
    if not script.has_attr('src'):
        continue

    # Print the path
    print(' JS source = ' + script['src'])

    # Update (normalize) the path
    js_path = script['src']
    js_file = js_path.split('/')[-1]  # select the last segment
    script['src'] = '/assets/js/' + js_file

The Python code to process image paths:

for img in soup.body.find_all('img'):

    # Print the path
    print(' IMG src = ' + img['src'])

    # Update (normalize) the path
    img_path = img['src']
    img_file = img_path.split('/')[-1]  # select the last segment
    img['src'] = '/assets/images/' + img_file
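The same pattern extends to stylesheets. The sketch below is an assumption on my side about how the CSS links could be processed; the sample markup is made up, and only `link` nodes that point to stylesheets are touched.

```python
from bs4 import BeautifulSoup

# Sample markup, assumed for the demo
html_content = """<html><head>
<link rel="stylesheet" href="app.css">
<link rel="stylesheet" href="vendor/css/theme.css">
<link rel="icon" href="favicon.ico">
</head><body></body></html>"""
soup = BeautifulSoup(html_content, "html.parser")

for link in soup.find_all("link", href=True):
    # rel is a multi-valued attribute, so BeautifulSoup returns a list
    if "stylesheet" not in link.get("rel", []):
        continue
    css_file = link["href"].split("/")[-1]  # select the last segment
    link["href"] = "/assets/css/" + css_file

print([link["href"] for link in soup.find_all("link")])
```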

In a similar way, we can edit the properties of anchors, p, and span elements over and over, until the HTML tree is ready to be exported. All of the above changes are made in memory; to make them persistent across editing sessions, we need to save them to a (new) HTML file:

processed_html = soup.prettify(formatter="html")
# The with block closes the file, even if the write fails
with open('index2.html', 'w', encoding='utf-8') as f:
    f.write(processed_html)

Real-life sample

The sample is a simple navigation bar, extracted from the Stellar theme by HTML5 UP.

Pug version

nav#nav
  ul
    li
      a.active.newclass(href='https://appseed.us/html-parser').
        Introduction
    li
      a(href='#first').
        First Section
    li
      a(href='#second').
        Second Section
    li
      a(href='#cta').
        Get Started
PHP version

<nav id="nav">
  <ul>
    <li>
      <a class="active newclass" href="https://appseed.us/html-parser">
        <?php echo $var_1?>
      </a>
    </li>
    <li>
      <a href="#first">
        <?php echo $var_2?>
      </a>
    </li>
    <li>
      <a href="#second">
        <?php echo $var_3?>
      </a>
    </li>
    <li>
      <a href="#cta">
        <?php echo $var_4?>
      </a>
    </li>
  </ul>
</nav>
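One possible way to get from flat HTML to a template like this is to swap every hardcoded text node for a numbered variable and keep the original strings in a dictionary. The sketch below is an assumption, not our tool's code; the `var_N` naming simply mirrors the sample above.

```python
from bs4 import BeautifulSoup

# Sample markup, assumed for the demo
html = ('<nav id="nav"><ul>'
        '<li><a href="#first">First Section</a></li>'
        '<li><a href="#second">Second Section</a></li>'
        '</ul></nav>')
soup = BeautifulSoup(html, "html.parser")

variables = {}
for node in soup.find_all(string=True):
    text = node.strip()
    if not text:
        continue  # skip whitespace-only nodes
    name = f"var_{len(variables) + 1}"
    variables[name] = text
    node.replace_with(f"<?php echo ${name}?>")

# formatter=None writes the PHP tags verbatim instead of escaping "<" and ">"
php_template = soup.decode(formatter=None)
print(php_template)
print(variables)
```

The `variables` dictionary holds the values a PHP controller would need to pass back into the template.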

Open-Source projects built with this HTML parser

HTML Parser — tool screenshots

HTML Parser — Select HTML theme to process
HTML Parser — Select target component
HTML Parser — visualize the component

Thank you for reading! Feel free to AMA in the comments.

Sm0ke — Founder of AppSeed
