A Benchmark Comparison Of Content Extraction From HTML Pages
Content extraction is the task of separating boilerplate, such as comments, navigation bars, social media links, and ads, from the main body of text of an article formatted as HTML. The main content typically accounts for only a small portion of a page’s source code (highlighted in red in the image below). Extraction is usually the first step of any data analysis based on HTML data, as errors made early on tend to propagate downstream and affect the rest of the study. High-quality extraction is therefore crucial to the success of data projects. However, extraction can be surprisingly tricky because HTML can be used in creative ways to achieve the same visual effect.
This post outlines how Skim Technologies’ content extractor works and compares our model to two established packages. Our model is more accurate than both competitors without making use of any external styling information.
To date, two main approaches to content extraction have been proposed. Rule-based methods typically take a DOM tree as input and use rules to decide which elements are boilerplate. Traditional ad blockers are the simplest instantiation of such a system: DOM elements that link to blacklisted domains are marked as boilerplate. Extending this idea, a larger set of simple rules can be used to assign scores to DOM elements based on how likely they are to be content. One formerly popular algorithm of this kind is Readability, which used rules like these:
- Add 1 point for each comma within this paragraph
- For every 100 characters in this paragraph, add 1 point up to 3 points
- Add 5 points if the element’s tag name is div
- Subtract 5 points if the element’s tag name is ol/ul
- Set score to zero if paragraph has length of less than 25 characters
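The rules above can be sketched as a single scoring function. This is an illustrative toy, not Readability’s actual implementation; the function name and signature are invented:

```python
def score_paragraph(text, tag_name):
    """Assign a content score to one DOM element using hand-tuned rules."""
    if len(text) < 25:              # too short to be real content
        return 0
    score = 0
    score += text.count(",")                # +1 point per comma
    score += min(len(text) // 100, 3)       # +1 per 100 characters, capped at 3
    if tag_name == "div":
        score += 5
    elif tag_name in ("ol", "ul"):
        score -= 5
    return score
```

Every threshold and point value here is a magic number a human had to pick by hand, which is exactly the maintenance burden discussed below.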
This system is effectively a linear classifier with ad-hoc feature weights. A human has to decide how many points to add or subtract for each rule. Unless explicitly programmed, it is tricky to include feature combinations such as “add 1 point if the element is a div and it contains more than 25 characters”. Further, keeping track of all interactions between rules becomes nearly impossible as the number of rules grows. From a practical perspective, it is hard for developers to extend a rule-based system, as its complexity increases dramatically with the number of rules.
To mitigate the shortcomings of rule-based methods, recent work has made use of Machine Learning (ML). The main advantage over rules is that a decision function can be learned automatically by example. Rather than hand-crafting rules, developers provide a learning algorithm with input pages where each part of the page is marked as either content or boilerplate. Long term, such a system is easier to maintain and extend.
Two main approaches have emerged in ML-based content extraction. The first one is image-based, where a page is rendered and computer vision techniques are used to analyse the visual layout of the page. One such prominent system is Diffbot. However, Diffbot is proprietary, so few details about how it works are publicly available.
The second family of ML content extractors operates on DOM trees, similar to rule-based models. However, supervised machine learning is used to learn how many points should be assigned to each element. This allows features to be combined in more complex ways than a human can ever write by hand. A prominent early system based on this approach is Dragnet. Skim Technologies’ extractor is also based on the DOM tree. The advantage over visual methods is that the system is less reliant on external CSS/JS files. This means faster loading times, as there is no need to fetch third-party resources over the network (other than the page’s HTML code). More importantly, external content delivery networks (CDNs) tend to go offline. The odds of a page being rendered the same way therefore decrease considerably over time.
Four main steps are involved in training our content extractor. First, we acquire a sample of web pages to use as input to the ML model. Next, we break down HTML pages into “blocks”, which are the smallest unit our model operates on. For each block in the data set, we ask a human to provide a boilerplate/content tag. Finally, we use domain knowledge to extract informative features for each block, which are fed into a ML algorithm along with human tags. More details about each step are given below.
DOM simplification is vital because the average DOM tree is very fragmented: a single node may contain as little as one character. We need a way of mapping DOM elements, which are designed to make sense to a browser, to units that are useful to humans (sentences, paragraphs, etc). For example, consider the following naive way of implementing a drop cap with CSS:
<div><span style="font-size: 200%;">H</span><span>ello</span></div>
In this case, the DOM tree has three elements (a parent div and two child spans) but only a single logical block: the word “Hello”. We therefore simplify the tree by, among other things, merging or removing insignificant children. The resultant data structure is a “block tree”. Like a DOM tree, it is hierarchical, but it is wider and shallower and has fewer leaves. The leaves are semantically more meaningful and usually map to paragraphs. The leaves are what we process downstream.
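The merging step can be sketched with a toy tree structure. This is an illustration only, not our block-tree builder: a node is a `(tag, children)` pair, a string is a text leaf, and the set of inline tags is an assumption:

```python
INLINE_TAGS = {"span", "a", "b", "i", "em", "strong"}  # assumed inline set

def to_blocks(node):
    """Collapse a DOM subtree into a flat list of text blocks.

    Subtrees made up only of inline elements and text are merged into
    one block, so a drop-cap div with two spans yields a single word.
    """
    if isinstance(node, str):
        return [node]
    tag, children = node
    parts = [to_blocks(child) for child in children]
    if all(isinstance(c, str) or c[0] in INLINE_TAGS for c in children):
        # every child is inline: fuse them into one logical block
        return ["".join(p for part in parts for p in part)]
    # otherwise keep the children as separate blocks
    return [block for part in parts for block in part]

print(to_blocks(("div", [("span", ["H"]), ("span", ["ello"])])))  # ["Hello"]
```

The real simplification handles many more cases (whitespace, comments, scripts), but the principle is the same: fuse inline runs, keep block-level boundaries.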
We gather training data by labelling each block in a given page as either content or not content. Four annotators labelled a total of ~400k blocks in ~2100 pages. Two thirds of those pages are the original Dragnet data, re-labelled to correct occasional errors. The remaining pages are a random sample of real-world pages that were processed by our system in December 2016. We built a simple web-based tool to help us gather tags for each block. We bootstrapped the process by first labelling a small sample and training a prototype model, which we then used to pre-tag blocks and save annotator time.
To achieve a good trade-off between speed and accuracy in production, we extract a relatively small set of shallow features and classify blocks independently of one another. Our feature extraction pipeline generates over 40 features per block. These are inspired by Dragnet and include:
- Shallow textual features, such as number of words, text/link/keyword density, average word/sentence length, readability score
- Structural features, such as distance to the title
- Our patent-pending styling and group features
We take context into account by also including additional features from adjacent blocks. All features we currently extract are language-independent, so our extractor works equally well with pages written in any human language.
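A couple of the shallow textual features above can be sketched as follows. The names and exact definitions are illustrative, not our production feature set; `link_text` is assumed to be the portion of a block’s text that sits inside anchor tags:

```python
def block_features(text, link_text):
    """Compute a few simple per-block features from a block's text."""
    words = text.split()
    n_words = len(words)
    return {
        "n_words": n_words,
        "avg_word_len": sum(len(w) for w in words) / n_words if n_words else 0.0,
        # link density: share of characters that are hyperlinked;
        # boilerplate (nav bars, related-link lists) tends to score high
        "link_density": len(link_text) / len(text) if text else 0.0,
    }

print(block_features("Read our related articles here", "related articles here"))
```

Note that all of these are computable from the block’s text alone, which is what makes them language-independent.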
Our final step is binary classification (content vs boilerplate) with gradient-boosted decision trees.
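As a minimal sketch of this final step, here is a gradient-boosted classifier on a toy feature matrix using scikit-learn. The feature values, hyperparameters, and library choice are assumptions for illustration; they are not our production setup:

```python
from sklearn.ensemble import GradientBoostingClassifier

# toy per-block features: [n_words, link_density] (invented values)
X = [[120, 0.02], [150, 0.05], [90, 0.01], [8, 0.90], [5, 0.80], [12, 0.95]]
y = [1, 1, 1, 0, 0, 0]  # 1 = content, 0 = boilerplate

clf = GradientBoostingClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)

# a long block with few links vs a short, heavily-linked one
print(clf.predict([[100, 0.03], [6, 0.85]]))
```

Unlike the hand-weighted rules above, the trees learn feature combinations and thresholds directly from the labelled blocks.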
To assess our model’s performance, we compare it to Dragnet and Diffbot. We made every effort to ensure a fair comparison. We first split our data set 80–20. Both our model and Dragnet are trained on the 80%, and all three models are evaluated on the remaining 20%. The final test set includes 427 pages, of which 275 are from the Dragnet data and 153 are from our data. Details about Diffbot’s training data are not publicly available.
We use Dragnet’s evaluation process, where content extraction is evaluated similarly to an information extraction system. Given an HTML page as input, the task is to output the main body of the article as plain text. This evaluation considers how good a model is at retrieving the main body and is independent of how pages are broken down into blocks. Note this post focuses exclusively on boilerplate removal and disregards other features of both Diffbot and our extractor, such as title or author extraction. These will be considered in a separate post.
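This token-level scoring can be sketched as bag-of-token overlap between the extracted text and a gold-standard reference. The function below is an illustration of the metric, and its whitespace tokenisation is an assumption (the exact tokeniser may differ):

```python
from collections import Counter

def token_prf(extracted, gold):
    """Return (precision, recall, F1) over whitespace-separated tokens."""
    ext, ref = Counter(extracted.split()), Counter(gold.split())
    overlap = sum((ext & ref).values())  # per-token multiset intersection
    p = overlap / sum(ext.values()) if ext else 0.0
    r = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# extractor kept the article but also leaked some nav text:
print(token_prf("the main article text plus a nav link", "the main article text"))
```

Precision penalises boilerplate leaking into the output; recall penalises dropped content.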
Token-level precision, recall and F1 for all three models are:
- Dragnet: .847 / .921 / .876
- Diffbot: .917 / .943 / .921
- Skim Tech: .956 / .925 / .926
Skim Technologies’ content extractor emerges as the winner, outperforming Dragnet by a wide margin and beating Diffbot by half a percentage point of F1. These results suggest that careful data preprocessing and feature engineering yield considerable improvements in extraction accuracy, even without access to external styling information. These in turn improve the performance of downstream models such as text summarisers.
Qualitatively, both Skim Tech and Diffbot appear to perform well on pages with a traditional layout, where the main body is in the page centre and boilerplate is at the edges. Such pages are typically extracted very well, with only occasional false positives or false negatives. Skim Tech performs better on dynamically modified pages, for example ones where a paywall is generated with JS. This is likely because we do not currently execute JS code in pages.
Dragnet performs better than reported by its authors, possibly because the original paper used a logistic regression classifier, whereas the latest release uses a more powerful model (extremely randomised trees).
Our Diffbot comparison is not under identical conditions because the test data does not include JS/CSS files. Pages may render differently to how they used to when the data was first gathered. Manual inspection of 20 Dragnet pages (from 2012) and 20 of our pages (from 2016) shows that half render as they would have originally. Diffbot’s model uses computer vision, so it is possible that differences in page rendering may lower its accuracy. However, we view over-reliance on styling and visual information as a limitation. It should be possible to extract content from pages in the absence of external CSS, because such information goes missing over time.
In terms of speed, Dragnet is the fastest, processing tens of documents a second (when documents are on disk). Our extractor averages about one document per second. However, that figure includes a number of other models besides content extraction. Also, our code is currently optimised for accuracy first, developer productivity second, and speed last. Most of our feature extraction pipeline is written in pure Python and is therefore much slower than it could be (compared to Dragnet, which is almost exclusively C/C++/Cython).
We cannot present definitive speed figures for Diffbot because we can only access their code through an HTTP API. Factors such as network latency, API outages, request caching and rate limiting significantly affected our perceived throughput. Our subjective experience is that when we were not being rate-limited, Diffbot’s speed was in the same ballpark as our system’s.
While we are happy with our model’s current accuracy, we are also excited about what the future could look like as we turn our attention to improving its speed. We plan to reimplement critical parts of our extractor in Cython to make it blazing fast. Along the way, we will also be adding new features and more training data to increase accuracy.
We intend to extend this comparison to more extractors, such as Embedly, Boilerpipe or Goose, and more web pages. We would also like to do a speed comparison under identical conditions. Lastly, both Skim Technologies’ and Diffbot’s APIs offer more ways of structuring a page that go beyond content extraction. We recently evaluated a number of text summarisation APIs. We would also like to compare other models, such as author, date and title extraction.
If you have any questions about this evaluation, or would like to test out our APIs, contact us at email@example.com