HTML Parser — Developer Tools

Jul 29, 2019

Hello Coders,

In this article, I will present an HTML parser that our team uses to speed up the integration of new layouts into legacy products coded in different technologies (Python, PHP, JavaScript). I must say from the beginning that the tool is not open-source, given the work involved in building it (~8 months) and its potential commercial value. In the tool's public repository, we publish free HTML themes that prove the concept and that other developers can use for free.

Btw, my name is Sm0ke and I write a lot on Dev.to.

Now, back to our topic: why bother developing such a tool?

Motivation & usage

In web development, things change fast. A few years ago, PHP / Bootstrap / jQuery and MVC were the mainstream pattern, but now the picture looks different:

The JavaScript ecosystem tends to dominate the whole picture
Python is the new “PHP”
jQuery is almost dead; Vue, React and Angular are topping the charts
GraphQL is the rising star for APIs
GatsbyJS is digging a nice grave for WordPress
Backends are headless, decoupled from the frontends
JAMstack, what a beautiful concept.
MVC, sorry old fellow, you’re dead.
New players (Bulma CSS, Tailwind) are rising in the CSS framework field

In this crazy world, we decided to stay relevant in the market as a technology company by using tools for almost everything:

To start a new project, we use boilerplate code
To add new modules, we generate the code
To integrate a new layout, we use the HTML parser presented here
For deployment, we use scripts and modern platforms like Heroku and Now

HTML parser features

The goal of this tool is to help us translate flat HTML into production-ready templates for various template engines like Blade (Laravel), Jinja2 (Python), Pug (JavaScript), and Mustache. To make this happen, a few problems must be solved well from a technical point of view:

Parse the HTML files using a state-of-the-art library that ships with helpers and is mature enough to support our R&D and minimize our development effort. After scanning the market, we picked the BeautifulSoup library, written in Python.
Traverse, edit, and process the HTML tree efficiently. For this, we built an interactive console that lets us move back and forth on the tree, edit the elements and their properties, and export the information in various formats.
Translate any element in the tree into any relevant format used in production (Pug, Jinja2, ...).
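To make the "traverse the tree" requirement more concrete, here is a minimal sketch of walking a parsed document with BeautifulSoup and printing an indented outline of its tags. The `outline` helper and the sample markup are illustrative assumptions, not part of the actual tool or its interactive console.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample markup, used only for this illustration.
SAMPLE_HTML = """
<html><head><title>Demo</title></head>
<body><nav id="nav"><ul><li><a href="#first">First</a></li></ul></nav></body>
</html>
"""

def outline(node, depth=0):
    """Return one indented line per tag, in document order."""
    lines = []
    for child in node.children:
        if isinstance(child, Tag):  # skip text nodes and comments
            label = child.name
            if child.get('id'):
                label += '#' + child['id']
            lines.append('  ' * depth + label)
            lines.extend(outline(child, depth + 1))
    return lines

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
tree_lines = outline(soup)
print('\n'.join(tree_lines))
```

A real console would add navigation and editing commands on top of a walk like this one; the sketch only covers the read-only traversal.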

With this short list of requirements, we started writing the code.

HTML parser implementation

Importing the library and injecting the HTML content into BS was the easy part. Only a few lines of code were enough:

$ pip install beautifulsoup4

from bs4 import BeautifulSoup as bs

def read_file(path):
    # minimal version of the helper: return the file content as a string
    with open(path, encoding='utf-8') as f:
        return f.read()

html_content = read_file('index.html')
soup = bs(html_content, 'html.parser')
# At this point, we can interact with the HTML elements stored in memory using the BS helpers

Update the header

At some point, we need to change or remove the hardcoded strings present in the HTML header. To edit the HTML header (title and meta description), we use a simple code snippet:

header = soup.find('head')  # returns the whole HEAD node

# To change the title, use:
header.title.string.replace_with('New title - set by HTML Parser')
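The text above mentions the meta description as well as the title. The following is a small sketch of how the description could be updated with the same BeautifulSoup helpers; the sample markup and the new description string are assumptions made for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this illustration.
SAMPLE_HTML = """
<html><head>
<title>Old title</title>
<meta name="description" content="Old description">
</head><body></body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
header = soup.find('head')

# Look up <meta name="description">. The attrs dict is required here because
# "name" is also the first positional parameter of find().
meta = header.find('meta', attrs={'name': 'description'})
if meta is None:
    # The page has no description yet: create the tag and attach it.
    meta = soup.new_tag('meta')
    meta['name'] = 'description'
    header.append(meta)

meta['content'] = 'New description - set by HTML Parser'
print(meta)
```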

Change the path for JS Scripts

To have full control over the HTML and future translations, we decided to “normalize” the HTML before starting to export the components. Normalization means moving the assets from an arbitrary structure into standard directories. Imagine that we have this structure before normalization:

<ROOT>
  |---- index.html
  |---- app.css
  |---- js/app.js
  |---- images/logo.png
  |---- top-cover.jpg

We can easily see that index.html loads images, JS, and CSS assets from different locations. If we plan to integrate this design into an application managed by Gulp, Webpack, Parcel, or any other tool, we need to move all the files manually and update index.html accordingly. This work can easily push a developer into depression. I did it many times before using tools, and I can assure you, it is no fun. The structure after the automatic normalization:

<ROOT>
  |---- index.html
  |---- assets/css/app.css
  |---- assets/js/app.js
  |---- assets/images/logo.png
  |---- assets/images/top-cover.jpg
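The file-moving half of the normalization can be sketched with the standard library alone. The extension-to-directory mapping and the `normalize_assets` helper below are assumptions made for illustration; the real tool may decide targets differently.

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical mapping from file extension to target directory.
ASSET_DIRS = {
    '.css': 'assets/css',
    '.js': 'assets/js',
    '.png': 'assets/images',
    '.jpg': 'assets/images',
}

def normalize_assets(root):
    """Move known asset files under root into the assets/ layout.

    Returns a dict mapping old relative paths to new relative paths.
    Files sharing a name would overwrite each other; a real tool must handle that.
    """
    root = Path(root)
    moved = {}
    # sorted() snapshots the file list, so moving files during the loop is safe.
    for path in sorted(p for p in root.rglob('*') if p.is_file()):
        target_dir = ASSET_DIRS.get(path.suffix.lower())
        if target_dir is None:
            continue  # index.html and unknown file types stay in place
        target = root / target_dir / path.name
        if path == target:
            continue  # already normalized
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(path), str(target))
        moved[path.relative_to(root).as_posix()] = target.relative_to(root).as_posix()
    return moved

# Usage: rebuild the "before" structure in a temporary directory and normalize it.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ['index.html', 'app.css', 'js/app.js', 'images/logo.png', 'top-cover.jpg']:
        file_path = root / name
        file_path.parent.mkdir(parents=True, exist_ok=True)
        file_path.write_text('')
    moved = normalize_assets(root)
    remaining = sorted(p.relative_to(root).as_posix() for p in root.rglob('*') if p.is_file())
    print(moved)
    print(remaining)
```

The returned mapping is what a later step would need in order to rewrite the paths inside index.html. The sketch leaves the now-empty source directories in place.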

The HTML sample for JavaScript files:

...
<script type='text/javascript' src='js/bootstrap.js'></script>
<script type='text/javascript' src='js/custom.js'></script>
...

The Python code snippet:


# select the script nodes that are direct children of <body>
# and have a src attribute (inline scripts are skipped)
for script in soup.body.find_all('script', src=True, recursive=False):

    # Print the path
    print(' JS source = ' + script['src'])

    # Update (normalize) the path
    js_path = script['src']
    js_file = js_path.split('/')[-1]  # select the last segment
    script['src'] = '/assets/js/' + js_file

The Python code to process image paths:

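# select the images that have a src attribute
for img in soup.body.find_all('img', src=True):

    # Print the path
    print(' IMG src = ' + img['src'])

    # Update (normalize) the path to match the assets/images directory
    img_path = img['src']
    img_file = img_path.split('/')[-1]  # select the last segment
    img['src'] = '/assets/images/' + img_file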
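The same pattern should extend to stylesheets, which the normalized structure places under assets/css. Here is a sketch under that assumption; the sample markup is made up for the example, and only `<link rel="stylesheet">` tags are rewritten.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this illustration.
SAMPLE_HTML = """
<html><head>
<link rel="stylesheet" href="app.css">
<link rel="stylesheet" href="css/theme.css">
<link rel="icon" href="favicon.ico">
</head><body></body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# Only stylesheet links are rewritten; other <link> tags (icons, etc.) are left alone.
for link in soup.head.find_all('link', rel='stylesheet', href=True):
    css_file = link['href'].split('/')[-1]  # select the last segment
    link['href'] = '/assets/css/' + css_file

css_paths = [link['href'] for link in soup.head.find_all('link', rel='stylesheet')]
icon_path = soup.head.find('link', rel='icon')['href']
print(css_paths)
```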

In a similar way, we can edit the properties of anchor, p, and span elements over and over, until the HTML tree is ready to be exported. All the above changes are made in memory; to make them persistent across editing sessions, we need to save them to a (new) HTML file:

processed_html = soup.prettify(formatter="html")

# the with block closes the file even if the write fails
with open('index2.html', 'w', encoding='utf-8') as f:
    f.write(processed_html)

Real-life sample

The sample is a simple navigation bar, extracted from the Stellar theme by HTML5 UP.

Index file: original version and normalized version
JSON descriptor: generated by the HTML parser tool, it encapsulates the assets and resources used by the HTML files
Navigation component:
HTML version
Pug version
Jinja2 version
PHP version
JSON descriptor
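The exact format of the JSON descriptor is not shown in this article. As a rough sketch of the idea, the assets referenced by a page can be collected with BeautifulSoup and serialized with the standard `json` module. The key names and the file names in the sample markup are assumptions made for the example, not the tool's real schema.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this illustration.
SAMPLE_HTML = """
<html><head>
<title>Stellar</title>
<link rel="stylesheet" href="/assets/css/main.css">
</head><body>
<img src="/assets/images/logo.png">
<script src="/assets/js/main.js"></script>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# Hypothetical descriptor layout: the keys below are illustrative only.
descriptor = {
    'title': str(soup.title.string) if soup.title and soup.title.string else None,
    'css': [tag['href'] for tag in soup.find_all('link', rel='stylesheet', href=True)],
    'js': [tag['src'] for tag in soup.find_all('script', src=True)],
    'images': [tag['src'] for tag in soup.find_all('img', src=True)],
}

descriptor_json = json.dumps(descriptor, indent=2)
print(descriptor_json)
```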

Pug version

nav#nav
  ul
    li
      a.active.newclass(href='https://appseed.us/html-parser').

        Introduction

    li
      a(href='#first').

        First Section

    li
      a(href='#second').

        Second Section

    li
      a(href='#cta').

        Get Started

PHP version

<nav id="nav">
 <ul>
  <li>
   <a class="active newclass" href="https://appseed.us/html-parser">
    <?php echo $var_1?>
   </a>
  </li>
  <li>
   <a href="#first">
    <?php echo $var_2?>
   </a>
  </li>
  <li>
   <a href="#second">
    <?php echo $var_3?>
   </a>
  </li>
  <li>
   <a href="#cta">
    <?php echo $var_4?>
   </a>
  </li>
 </ul>
</nav>
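The PHP output above replaces each link label with a numbered variable. A similar substitution for Jinja2 can be sketched as follows. This is an assumption about how such a translation might work, not the tool's actual exporter; the `var_N` naming simply mirrors the PHP sample, and the shortened navigation markup is made up for the example.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the navigation bar.
NAV_HTML = """
<nav id="nav">
 <ul>
  <li><a class="active" href="#intro">Introduction</a></li>
  <li><a href="#first">First Section</a></li>
 </ul>
</nav>
"""

soup = BeautifulSoup(NAV_HTML, 'html.parser')

context = {}  # variable name -> original text, to pass to the template at render time
for index, anchor in enumerate(soup.find_all('a'), start=1):
    if anchor.string is None:
        continue  # skip anchors with nested markup or no text
    var_name = 'var_%d' % index
    context[var_name] = anchor.string.strip()
    # Swap the hardcoded label for a Jinja2 expression.
    anchor.string.replace_with('{{ ' + var_name + ' }}')

jinja_template = str(soup)
print(jinja_template)
print(context)
```

Keeping the original labels in `context` means the generated template can be rendered back to the starting HTML, which is a handy sanity check for an exporter.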