Finding the creation or modification date of web pages
Metadata extraction serves many purposes, ranging from knowledge extraction and business intelligence to classification and refined visualizations. It is often necessary to fully parse the document or apply robust scraping patterns: for some web pages, neither the URL nor the server response provides a reliable way to date the document, that is, to find when it was first published and/or last modified.
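To make the limits of the URL and the server response concrete, here is a minimal sketch of the two cheap checks one might try before parsing the page. The helper names date_from_url and date_from_last_modified are hypothetical and not part of htmldate. The sketch assumes dates appear in the URL path as /YYYY/MM/DD/ or /YYYY/mon/DD/ and that the server sends a standard Last-Modified header.

```python
import re
from datetime import date
from email.utils import parsedate_to_datetime

# Hypothetical helpers for illustration only; they are not part of htmldate.
MONTHS = {name: number for number, name in enumerate(
    "jan feb mar apr may jun jul aug sep oct nov dec".split(), start=1)}

# Matches /YYYY/MM/DD/ or /YYYY/mon/DD/ somewhere in the URL path.
URL_DATE = re.compile(
    r"/((?:19|20)\d{2})/(\d{1,2}|[a-z]{3})/(\d{1,2})/", re.IGNORECASE)


def date_from_url(url):
    """Return a full date embedded in the URL path, or None."""
    match = URL_DATE.search(url)
    if match is None:
        return None
    year, month, day = match.groups()
    month_number = int(month) if month.isdigit() else MONTHS.get(month.lower())
    if month_number is None:
        return None
    try:
        return date(int(year), month_number, int(day))
    except ValueError:  # e.g. month 13 or day 32
        return None


def date_from_last_modified(header_value):
    """Parse an HTTP Last-Modified header value, or return None."""
    if not header_value:
        return None
    try:
        return parsedate_to_datetime(header_value).date()
    except (TypeError, ValueError):  # malformed header
        return None
```

With the URLs used in this document, the Guardian address yields a full date, whereas the Python blog address only carries a year and a month, so the sketch returns None for it. A Last-Modified value often reflects when the server regenerated the page rather than when the content was published, which is why neither cue is reliable on its own.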
htmldate provides a simple and convenient way to extract the creation or modification date of web pages, from within Python or on the command line. It relies on HTML parsing and scraping functions and proceeds in three steps:
- Starting from the header of the page, it uses common patterns to identify date fields.
- If this is not successful, it scans the whole document looking for structural markers.
- If no date cue could be found, it finally runs a series of heuristics on the content (text and markup).
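The three steps above can be sketched with the standard library alone. This is a simplified illustration of the cascade, not htmldate's actual implementation: the list of meta keys is an assumption and far from complete, only ISO-style dates (YYYY-MM-DD) are recognized, and the names DateCueParser and sketch_find_date are hypothetical.

```python
import re
from html.parser import HTMLParser

# Illustrative, incomplete list of <meta> keys that often carry a date.
META_KEYS = {"article:published_time", "article:modified_time",
             "og:updated_time", "date", "dc.date", "datepublished"}

# ISO-style date not surrounded by other digits.
ISO_DATE = re.compile(
    r"(?<!\d)(?:19|20)\d{2}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])(?!\d)")


class DateCueParser(HTMLParser):
    """Collect date candidates from meta tags, <time> elements and text."""

    def __init__(self):
        super().__init__()
        self.meta_dates = []    # step 1: header metadata
        self.markup_dates = []  # step 2: structural markers
        self.text_chunks = []   # step 3: raw text for heuristics

    def handle_starttag(self, tag, attrs):
        attributes = {key: (value or "") for key, value in attrs}
        if tag == "meta":
            key = (attributes.get("property") or attributes.get("name")
                   or attributes.get("itemprop") or "").lower()
            if key in META_KEYS:
                self.meta_dates.append(attributes.get("content", ""))
        elif tag == "time" and "datetime" in attributes:
            self.markup_dates.append(attributes["datetime"])

    def handle_data(self, data):
        self.text_chunks.append(data)


def first_iso_date(candidates):
    """Return the first ISO-style date found in the candidate strings."""
    for candidate in candidates:
        match = ISO_DATE.search(candidate)
        if match:
            return match.group(0)
    return None


def sketch_find_date(html):
    """Try metadata first, then markup, then the text; None if all fail."""
    parser = DateCueParser()
    parser.feed(html)
    parser.close()
    return (first_iso_date(parser.meta_dates)
            or first_iso_date(parser.markup_dates)
            or first_iso_date(parser.text_chunks))
```

The ordering reflects how much each cue can be trusted: explicit metadata is preferred over markup, and a date found in running text is only used as a last resort.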
Installation from the package repository:
pip install htmldate
The latest version can also be installed directly from the GitHub repository with pip:
pip install git+https://github.com/adbar/htmldate.git
All the functions of the module are currently bundled in htmldate.
In case the web page features clear metadata in the header, the extraction is straightforward:
>>> import requests
>>> import htmldate
>>> r = requests.get('https://www.theguardian.com/politics/2016/feb/17/merkel-eu-uk-germany-national-interest-cameron-justified')
>>> htmldate.find_date(r.text)
A more advanced analysis of the document structure is sometimes needed:
>>> r = requests.get('http://blog.python.org/2016/12/python-360-is-now-available.html')
>>> htmldate.find_date(r.text)
>>> # DEBUG analyzing: <h2 class="date-header"><span>Friday, December 23, 2016</span></h2>
>>> # DEBUG result: 2016-12-23
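The DEBUG output above shows a human-readable date string being converted to an ISO date. A minimal sketch of that normalization step follows. It assumes English month and weekday names under the default C locale, and a short, hypothetical list of patterns; normalize_date is not an htmldate function.

```python
from datetime import datetime

# Hypothetical pattern list; real pages use many more formats.
PATTERNS = ("%A, %B %d, %Y", "%B %d, %Y", "%d %B %Y")


def normalize_date(text, patterns=PATTERNS):
    """Return the date in YYYY-MM-DD form, or None if no pattern fits."""
    cleaned = " ".join(text.split())  # collapse stray whitespace
    for pattern in patterns:
        try:
            return datetime.strptime(cleaned, pattern).date().isoformat()
        except ValueError:
            continue  # try the next pattern
    return None


print(normalize_date("Friday, December 23, 2016"))  # 2016-12-23
```

Trying a fixed list of patterns keeps the step predictable: a string that matches none of them returns None instead of a guessed date.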
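In the worst case, the module resorts to a guess based on an extensive search, which can be deactivated:
>>> r = requests.get('https://creativecommons.org/about/')
>>> htmldate.find_date(r.text)
>>> htmldate.find_date(r.text, extensive_search=False)  # extensive search deactivated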
There are however pages for which no date can be found, ever:
>>> r = requests.get('https://example.com')
>>> htmldate.find_date(r.text)
A command-line interface is included:
$ wget -qO- "http://blog.python.org/2016/12/python-360-is-now-available.html" | htmldate
$ htmldate --help
htmldate [-h] [-v] [-s]
-h, --help show this help message and exit
-v, --verbose increase output verbosity
-s, --safe safe mode: markup search only
Feedback and pull requests are welcome!
Finally, if the date is really nowhere to be found, it might be worth considering carbon dating the web page; however, this is computationally expensive.