Published in teleporthq.io

Understanding the Web: Parsing Web Pages Semantically

Using Machine Learning to Parse Web Pages Into Semantic Sections

Finding structure in chaos

  • nav
  • header
  • content
  • footer
  • control
  • form

Defining heuristics

navScore = 0.5 * containsTags(<UL>, <OL>, <LI>)
         + 0.3 * containsTags(<A>)
         + 0.2 * containsKeywords('nav', 'navigation', 'menu')

Our Chrome extension in action. Parsing our homepage, teleporthq.io
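
To make the navScore heuristic concrete, here is a minimal Python sketch of it. The weights and keyword list come from the formula above; everything else is an assumption made for illustration: that containsTags and containsKeywords return 0 or 1, that they operate on a raw HTML string, and the function names contains_tags, contains_keywords and nav_score themselves. The extension may well compute these on DOM nodes instead.

```python
import re

# Weights copied from the navScore formula in the text.
W_LIST_TAGS, W_LINK_TAGS, W_KEYWORDS = 0.5, 0.3, 0.2


def contains_tags(html: str, *tags: str) -> int:
    """Return 1 if any of the given tags opens in the fragment, else 0 (assumed semantics)."""
    pattern = r"<\s*(?:%s)\b" % "|".join(re.escape(tag) for tag in tags)
    return int(re.search(pattern, html, flags=re.IGNORECASE) is not None)


def contains_keywords(html: str, *keywords: str) -> int:
    """Return 1 if any keyword occurs anywhere in the fragment, else 0 (assumed semantics)."""
    lowered = html.lower()
    return int(any(keyword in lowered for keyword in keywords))


def nav_score(html: str) -> float:
    """Weighted sum of the three indicators; higher means more nav-like."""
    return (
        W_LIST_TAGS * contains_tags(html, "ul", "ol", "li")
        + W_LINK_TAGS * contains_tags(html, "a")
        + W_KEYWORDS * contains_keywords(html, "nav", "navigation", "menu")
    )


menu_fragment = '<ul class="menu"><li><a href="/">Home</a></li></ul>'
text_fragment = "<p>Hello world</p>"
print(nav_score(menu_fragment), nav_score(text_fragment))
```

With binary indicators the score stays between 0 and 1, so a fragment that contains list tags, links and a matching keyword gets the maximum score, while a plain paragraph gets none.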

Gathering data

<index, url, features, label>
A few of the computed features. Now we can start talking ML!
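
The <index, url, features, label> record could be represented roughly as follows. This is a sketch under assumptions: the class name SectionSample and the to_row helper are hypothetical, index is treated as a plain integer identifier because its exact meaning is not spelled out, and features is assumed to be a flat list of numbers.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class SectionSample:
    """One labelled page section, mirroring <index, url, features, label>."""

    index: int             # assumed: integer identifier of the sample
    url: str               # page the section was taken from
    features: List[float]  # assumed: flat numeric feature vector
    label: str             # section type, e.g. "nav" or "footer"

    def to_row(self) -> List[Union[int, str, float]]:
        """Flatten the sample into one table row: index, url, features..., label."""
        return [self.index, self.url, *self.features, self.label]


sample = SectionSample(index=0, url="https://teleporthq.io", features=[0.5, 1.0], label="nav")
print(sample.to_row())
```

Flattening each sample into a single row keeps the data set easy to store as CSV and to load into whichever classifier is trained next.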

Training a section classifier

Support vector machine (SVM)

Neural Network (NN)

195 input → 2x 512 fc → 2x 256 fc → 128 fc → 7 softmax
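
The architecture line can be read as a stack of fully connected layers. The sketch below spells that out in plain Python, without a deep learning framework, so the shapes are easy to follow. Only the layer widths and the softmax output come from the text; the ReLU hidden activations, the random initialisation and all function names are assumptions made for illustration, and no training is shown.

```python
import math
import random

# Layer widths read off the architecture line: 195 inputs, 2x512, 2x256, 128, 7 outputs.
LAYER_SIZES = [195, 512, 512, 256, 256, 128, 7]


def count_parameters(sizes):
    """Number of weights plus biases in a fully connected stack."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


def init_layers(sizes, seed=0):
    """Random weights and zero biases per layer (initialisation scheme is an assumption)."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        scale = math.sqrt(2.0 / n_in)
        weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def forward(layers, features):
    """Run one feature vector through the stack: ReLU on hidden layers, softmax at the end."""
    activations = list(features)
    last = len(layers) - 1
    for depth, (weights, biases) in enumerate(layers):
        pre = [sum(w * a for w, a in zip(row, activations)) + b
               for row, b in zip(weights, biases)]
        activations = softmax(pre) if depth == last else [max(0.0, v) for v in pre]
    return activations


probabilities = forward(init_layers(LAYER_SIZES), [0.1] * 195)
print(count_parameters(LAYER_SIZES), len(probabilities))
```

Counting weights and biases layer by layer gives 593,927 trainable parameters for these widths, which is a useful sanity check when rebuilding the network in a framework of choice.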

Extreme Gradient Boosting (XGBoost)

Results and comparison

Confusion matrices for all methods on the full features in the data set.
Confusion matrices for the low-dimensional data set. The SVM classifier failed to learn anything meaningful.
Mean accuracy for every model.
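
For readers unfamiliar with the two metrics, the following sketch shows how a confusion matrix and an accuracy figure are typically computed from true and predicted labels. The labels in the usage example are toy values invented for illustration, not results from the article, and the function names are hypothetical. How the mean accuracy was averaged is not stated here; the sketch assumes a simple mean over several evaluation runs.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows are true classes, columns are predicted classes."""
    position = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for truth, guess in zip(true_labels, predicted_labels):
        matrix[position[truth]][position[guess]] += 1
    return matrix


def accuracy(matrix):
    """Share of samples on the diagonal, i.e. classified correctly."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0


def mean_accuracy(matrices):
    """Average accuracy over several runs (assumed averaging scheme)."""
    scores = [accuracy(matrix) for matrix in matrices]
    return sum(scores) / len(scores) if scores else 0.0


# Toy labels, invented purely to exercise the functions.
classes = ["nav", "header", "content"]
y_true = ["nav", "nav", "header", "content", "content"]
y_pred = ["nav", "header", "header", "content", "nav"]
matrix = confusion_matrix(y_true, y_pred, classes)
print(matrix, accuracy(matrix))
```

Off-diagonal cells show which section types get mistaken for one another, which a single accuracy number hides.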

Conclusion and future work

Acknowledgements

teleporthq.io is a collaboration platform for designers and developers with design-to-code real-time capabilities

Raul Incze

Fighting to bring machine learning to as many products and businesses as possible, automating processes and improving living experience.