Guide to Handling Internet Archive’s CDX Server API Response
Accessing a digital archive of web pages
For those who don’t know already, the Wayback Machine is a comprehensive digital archive of web pages on the Internet — having already collected more than 286 billion web pages since 2001.
Wayback uses the CDX Server API to look up captures. It allows for complex querying, filtering and analysis of Wayback capture data. I recently used this API to visualize a website over time using d3's radial treemap. You can see the result here.
Since the most difficult part of my project was figuring out how to turn the response of the CDX API into an HTML page that I could crawl, this article is a guide to making a request to and handling the response from the API.
Making the Request
Making the request is simply a GET request to http://web.archive.org/cdx/search/cdx. The website is specified via the url query string argument, and it's the only required one. (To see what other query string arguments can be specified, check the docs.)
To get all archived versions, or captures, of reddit.com, you would make a GET request to http://web.archive.org/cdx/search/cdx?url=reddit.com.
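If you would rather not assemble the query string by hand, the standard library can do it for you. Here's a minimal sketch that only builds the URL and makes no request. The helper name build_cdx_url is my own invention, and limit is an optional argument I'm assuming from the CDX docs, so double-check it there before relying on it.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_cdx_url(url, **params):
    """Return a CDX query URL for `url` (hypothetical helper, not part of any library)."""
    query = {"url": url}   # the only required argument
    query.update(params)   # any optional arguments, e.g. limit (assumed from the docs)
    return CDX_ENDPOINT + "?" + urlencode(query)


print(build_cdx_url("reddit.com"))
print(build_cdx_url("reddit.com", limit=5))
```

Using urlencode also takes care of escaping if the site you pass in contains characters that are not safe in a query string.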
Sweet! Let’s revisit the objective — get all the captures of a web page from the CDX API so I can send my crawler through the archived HTML. Okay, what to do from here?
Handling the Response
The Content-Type of the response is text/plain, so we'll do the parsing ourselves. As it says in the documentation, the columns, split by spaces, correspond to the following, in order: urlkey, timestamp, original, mimetype, statuscode, digest, length.
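To make that column order concrete, here's a small sketch that turns one line of the response into a dictionary keyed by column name (urlkey, timestamp, original, mimetype, statuscode, digest, length). The sample line is made up for illustration and is not real API output, and parse_cdx_line is just a name I picked.

```python
# Column names in the order the CDX API returns them.
FIELDS = ["urlkey", "timestamp", "original", "mimetype",
          "statuscode", "digest", "length"]


def parse_cdx_line(line):
    """Split one space-separated CDX line into a dict, or return None if malformed."""
    parts = line.split(" ")
    if len(parts) != len(FIELDS):
        return None
    return dict(zip(FIELDS, parts))


# Made-up sample line for illustration only; not real API output.
sample = "com,example)/ 20200101000000 http://example.com/ text/html 200 EXAMPLEDIGEST 1234"
capture = parse_cdx_line(sample)
print(capture["timestamp"], capture["statuscode"])
```

Note that every value comes back as a string, including statuscode and length.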
I created a data model object to represent one capture.
# snapshot.py
class Snapshot(dict):
    """One capture returned by the CDX API, stored as a plain dict."""

    def __init__(self, urlkey=None, timestamp=None, original=None, mimetype=None, statuscode=None, digest=None, length=None):
        self['urlkey'] = urlkey
        self['timestamp'] = timestamp
        self['original'] = original
        self['mimetype'] = mimetype
        self['statuscode'] = statuscode
        self['digest'] = digest
        self['length'] = length
        # Not part of the CDX response: URL of the archived HTML page
        self['snapshot_url'] = 'http://web.archive.org/web/%s/%s/' % (timestamp, original)
All I'm doing is setting keys on a dictionary object. Notice that I added a key that wasn't provided in the CDX response to my Snapshot model: snapshot_url.
The HTML page for any given capture lives at a URL that is formatted from the timestamp and original URL of the capture. The format is http://web.archive.org/web/<timestamp>/<original>/.
If you would prefer to get the original page back (without the Wayback Machine rewriting URLs on the page to point into the Archive), you should add id_ right after the timestamp: http://web.archive.org/web/<timestamp>id_/<original>/.
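Here's a rough sketch of building both kinds of URL from a capture's timestamp and original values. The id_ modifier after the timestamp is how I understand the Wayback Machine serves the unmodified page, so treat it as an assumption and try it on a capture you know. The function name and the sample timestamp are made up.

```python
WAYBACK_PREFIX = "http://web.archive.org/web"


def snapshot_url(timestamp, original, raw=False):
    """Build a capture URL; raw=True appends the `id_` modifier to the timestamp."""
    modifier = "id_" if raw else ""   # assumed modifier for the unrewritten page
    return "%s/%s%s/%s" % (WAYBACK_PREFIX, timestamp, modifier, original)


# Made-up timestamp, for illustration only.
print(snapshot_url("20200101000000", "http://example.com/"))
print(snapshot_url("20200101000000", "http://example.com/", raw=True))
```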
Sweet! Now that we have the data model (namely, the Snapshot class) set up, we can make the request to the CDX API and create instances of the Snapshot class with the response.
Since we get back plain text, we'll first split by \n, or new line, to get a list of capture strings. For each capture, we'll split by a space (' ') to get the individual columns as a list. After this is done, we'll have a list of lists, representing the captures returned from the API.
# reddit.py
from snapshot import Snapshot
import requests as req

res = req.get("http://web.archive.org/cdx/search/cdx?url=reddit.com")
# One capture per line
snapshots = res.text.split('\n')
for snapshot in snapshots:
    # One column per space-separated item
    snapshot_items = snapshot.split(' ')
Next, we’ll instantiate a new Snapshot object for each snapshot returned by the API.
# reddit.py
from snapshot import Snapshot
import requests as req

res = req.get("http://web.archive.org/cdx/search/cdx?url=reddit.com")
snapshots = res.text.split('\n')
snapshot_list = []
for snapshot in snapshots:
    snapshot_items = snapshot.split(' ')
    # Skip blank or malformed lines that don't have all 7 columns
    if len(snapshot_items) == 7:
        snap = Snapshot(snapshot_items[0], snapshot_items[1], snapshot_items[2], snapshot_items[3], snapshot_items[4], snapshot_items[5], snapshot_items[6])
        snapshot_list.append(snap)
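The snapshot_list variable now contains Snapshot objects that you can use! Make sure you check that statuscode is 200 before making the request to that specific snapshot_url. Since we parsed the response as text, the value is the string '200', not a number.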
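One way to do that check is to filter the list before crawling. This is only a sketch: the captures below are made-up dictionaries standing in for Snapshot objects (which are dictionaries too), and successful_snapshots is a name I chose.

```python
def successful_snapshots(snapshots):
    """Keep only captures whose status code is 200.

    The CDX response is text, so statuscode is compared as the string '200'.
    """
    return [snap for snap in snapshots if snap.get("statuscode") == "200"]


# Made-up captures standing in for Snapshot objects (Snapshot subclasses dict).
captures = [
    {"timestamp": "20200101000000", "statuscode": "200"},
    {"timestamp": "20200102000000", "statuscode": "301"},
    {"timestamp": "20200103000000", "statuscode": "404"},
]
for snap in successful_snapshots(captures):
    print(snap["timestamp"])
```

From there you can send your crawler to the snapshot_url of each capture that survives the filter.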
I hope this article helped! Let me know if you have any questions!