Finding the creation or modification date of web pages

Metadata extraction is useful for different kinds of purposes ranging from knowledge extraction and business intelligence to classification and refined visualizations. It is often necessary to fully parse the document or apply robust scraping patterns, there are for example web pages for which neither the URL nor the server response provide a reliable way to date the document, that is find when it was first published and/or last modified.

There is an existing codebase to extract or derive metadata from web pages. I took inspiration from goose, newspaper and articleDateExtractor (all for Python) as well as metascraper, a Javascript suite focusing on metadata. With the exception of the latter, I could not find any functional and actively maintained module, especially for Python.

htmldate provides a simple and convenient way to extract the creation or modification date of web pages, within Python or on the command-line. Based on HTML parsing and scraping functions:

  1. Starting from the header of the page, it uses common patterns to identify date fields.
  2. If this is not successful, it scans the whole document looking for structural markers.
  3. If no date cue could be found, it finally runs a series of heuristics on the content (text and markup).

Installation

From package repository:

pip install htmldate

Direct installation of the latest version over pip is possible:

pip install git+https://github.com/adbar/htmldate.git

Within Python

All the functions of the module are currently bundled in htmldate.

In case the web page features clear metadata in the header, the extraction is straightforward:

>>> import requests
import htmldate
>>> r = requests.get('https://www.theguardian.com/politics/2016/feb/17/merkel-eu-uk-germany-national-interest-cameron-justified')
>>> htmldate.find_date(r.text)
>>> 2016–02–17

A more advanced analysis of the document structure is sometimes needed:

>>> r = requests.get('http://blog.python.org/2016/12/python-360-is-now-available.html')
>>> htmldate.find_date(r.text)
>>> # DEBUG analyzing: <h2 class="date-header"><span>Friday, December 23, 2016</span></h2>
>>> # DEBUG result: 2016–12–23
>>> 2016–12–23

In the worst case, the module resorts to a guess based on an extensive search, which can be deactivated:

>>> r = requests.get('https://creativecommons.org/about/')
>>> htmldate.find_date(r.text)
>>> 2017–08–11

There are however pages for which no date can be found, ever:

>>> r = requests.get('https://example.com')
>>> htmldate.find_date(r.text)
>>>

Command-line

A command-line interface is included:

$ wget -qO- “http://blog.python.org/2016/12/python-360-is-now-available.html" | htmldate
$ 2016–12–23

Usage instructions:

$ htmldate --help
htmldate [-h] [-v] [-s]
optional arguments:
-h, --help show this help message and exit
-v, --verbose increase output verbosity
-s, --safe safe mode: markup search only

Further information

Documentation

For more information see the readme page on GitHub or the Python module repository.

Feedback and pull requests are welcome!

Kudos to…

Last, if the date is really nowhere to be found, it might be worth considering carbon dating the web page, however this is computationally expensive.