The Wayback Machine Scraper

Abhi Kumbar
Analytics Vidhya
4 min read · Jan 18, 2020


Why the Wayback Machine Scraper?

Web scraping for data collection is a common practice, and I wanted to scrape some news websites to collect a few data elements for each article: the news title, summary, and URL. The idea was to compile a news dataset to train topic models such as LDA, NMF, and SVD. More on the topic model implementation in upcoming posts.

I could have gone the API route and used one of the news APIs to collect the same data points, but most of them either charge a fee for a higher daily request allowance or require stitching together several different APIs to cover multiple news sources. The Wayback Machine solves both of these problems: its API is free to use with a generous daily request allowance, and because the Wayback Machine archives most news websites, you can gather the same information from different news sources through a single API.

Depending on the intensity of the requests, scraping can overload servers and put you at risk of getting blocked. The Wayback Machine API minimizes this risk because we are not targeting individual news websites. We still have to follow the Wayback Machine API rules, but we don't run the risk of overloading an individual news website's server.
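For example, one simple way to stay polite is to pause for a random interval between requests. The sketch below is a minimal illustration; the 1-3 second pause is an arbitrary choice, not official Wayback Machine guidance.

from random import randint
from time import sleep

import requests as rq

def polite_get(url):
    # Fetch the page, then pause 1-3 seconds so the next call is not immediate.
    response = rq.get(url)
    sleep(randint(1, 3))
    return response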

Scraper Deep Dive

The scraper is built in Python and uses a few popular Python packages for web scraping. Here are the packages used in our case:

import sys
import requests as rq
from bs4 import BeautifulSoup as bs
from time import sleep
from time import time
from random import randint
from warnings import warn
import json
import pandas as pd

BeautifulSoup is a popular Python package for HTML and XML parsing. We use it to traverse the different HTML tags and pull out the necessary data: in this case, the news title, summary, and URL link to each article.
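As a toy illustration (the markup below is invented, not taken from a real capture), this is the kind of tag traversal BeautifulSoup makes easy:

from bs4 import BeautifulSoup as bs

# A toy HTML snippet shaped like a news listing entry.
html = '''
<article>
  <h2><a href="https://example.com/story">Example headline</a></h2>
  <p>Short summary of the story.</p>
</article>
'''

soup = bs(html, 'html.parser')
article = soup.find('article')
print(article.h2.a.text)        # Example headline
print(article.h2.a['href'])     # https://example.com/story
print(article.p.text)           # Short summary of the story.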

The entire scraping process is as follows:

1. Compile a list of URLs using the Wayback CDX Server API.

The Wayback CDX Server API is an HTTP front end to the Wayback Machine's capture index and allows for in-depth querying of the data. Here's an example query that returns indexing data for each 'capture' of the NBC News politics section.

'http://web.archive.org/cdx/search/cdx?url=nbcnews.com/politics&collapse=digest&from=20190401&to=20190430&output=json'

In the above query, url= is a required parameter; here it points at the NBC News politics section. The collapse=digest parameter skips consecutive captures whose content is identical, and from/to bound the date range. The Wayback CDX Server responds to GET queries such as this one and returns the result as a JSON array: the first element is the list of column names, followed by one row per 'capture' of nbcnews.com/politics, as shown below.

[['urlkey', 'timestamp', 'original', 'mimetype', 'statuscode', 'digest', 'length'],
 ['com,nbcnews)/politics',
  '20190401012911',
  'https://www.nbcnews.com/politics',
  'text/html',
  '200',
  'FVZYAKIUIFOQY5NCP7AI4LJB4JNLYQOF',
  '38471'],
 ...]

Once we have the CDX columns shown above, we use the 'timestamp' and 'original' columns to put together the final Wayback Machine URL, which we then use to open a particular HTML capture and scrape the required data points.

# NBC News Wayback Machine archive URLs
url = 'http://web.archive.org/cdx/search/cdx?url=nbcnews.com/politics&collapse=digest&from=20190401&to=20190430&output=json'
urls = rq.get(url).text
parse_url = json.loads(urls)  # parse the JSON response

# Extract the timestamp and original columns and compile a URL list.
url_list = []
for i in range(1, len(parse_url)):  # skip the header row
    orig_url = parse_url[i][2]
    tstamp = parse_url[i][1]
    waylink = tstamp + '/' + orig_url
    url_list.append(waylink)

# Compile the final URL pattern.
for url in url_list:
    final_url = 'https://web.archive.org/web/' + url

2. Parse each HTML page using BeautifulSoup.

Once we have compiled the list of final URLs, we use the html.parser backend to parse each HTML page fetched from final_url.

# Open the page
req = rq.get(final_url).text
# Parse the HTML using BeautifulSoup and store it in soup
soup = bs(req, 'html.parser')
soup  # display the parsed HTML (when running in a notebook)

soup stores the HTML of final_url, and we can traverse its tags to gather the required data. To know which tags to search for, you need to inspect the HTML of the Wayback Machine captures (for example, with the browser's developer tools). There are many Medium articles that show how to inspect HTML pages to pick out particular tags for scraping.
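A quick way to get a first look without leaving the notebook is to print a slice of the parsed markup or list the tags it contains; the snippet below continues from the soup object created above.

# Preview part of the parsed markup to spot useful tags.
print(soup.prettify()[:1500])

# Or list the distinct tag names present on the page.
print(sorted({tag.name for tag in soup.find_all(True)}))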

# Get the list of article tags that contain the news titles
articles = soup.find_all('article')
for article in articles:
    try:
        # title and link
        h2_tags = article.find_all('h2')
        if len(h2_tags) > 1 and h2_tags[1].a is not None:
            # get the news title
            title = h2_tags[1].a.text
            # get the link to the individual news article
            link = h2_tags[1].a['href']
        else:
            title = 'N/A'
            link = 'N/A'
    except Exception as e:
        warn('Skipping an article: {}'.format(e))
        continue

The article tags give us the list of news articles along with their titles and links to each article. We then use each individual article link to scrape the article summary.

# Fetch and parse the individual news article
req = rq.get(link).text
soup = bs(req, 'html.parser')
# The class name below was found by inspecting the captures and may change between captures.
article = soup.find('div', attrs={'class': 'article container___2EGEI'})
article.div.text  # news summary

3. Export the data as a CSV.

Once we have gathered all the scraped data points, we export the data as a CSV using the pandas to_csv function.

import pandas as pd

nbc_df = pd.DataFrame({'title': news_title,
                       'summary': news_summary,
                       'source': news_source,
                       'article_link': news_link})
nbc_df.to_csv('nbc_articles.csv', index=False)

Here we have only scraped the news title, summary, source, and article links. One can do the same for images, image captions, and article authors with some more HTML inspecting and parsing.
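As a rough illustration of that pattern, the sketch below pulls an image, a caption, and an author from an already-parsed article page. The img and figcaption tags and the 'byline' class are placeholders to replace with whatever the inspected captures actually use.

def extract_extras(article_soup):
    # Lead image and its URL, if the capture has one.
    image = article_soup.find('img')
    image_url = image['src'] if image is not None and image.has_attr('src') else 'N/A'

    # Image caption, if present.
    caption = article_soup.find('figcaption')
    caption_text = caption.text.strip() if caption is not None else 'N/A'

    # 'byline' is a hypothetical class name; inspect the capture for the real one.
    author = article_soup.find(attrs={'class': 'byline'})
    author_text = author.text.strip() if author is not None else 'N/A'

    return image_url, caption_text, author_text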

Here's the entire flow pieced together from the individual sections described above.
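A few details in the sketch below are assumptions made to bridge the snippets: the list initialization, the placeholder source label, and the handling of relative archive links. The summary container class is specific to the captures inspected for this post.

import json
from random import randint
from time import sleep
from warnings import warn

import pandas as pd
import requests as rq
from bs4 import BeautifulSoup as bs

# 1. Compile the list of capture URLs via the Wayback CDX Server API.
cdx_url = ('http://web.archive.org/cdx/search/cdx?url=nbcnews.com/politics'
           '&collapse=digest&from=20190401&to=20190430&output=json')
captures = json.loads(rq.get(cdx_url).text)

url_list = []
for row in captures[1:]:  # skip the header row
    tstamp, orig_url = row[1], row[2]
    url_list.append(tstamp + '/' + orig_url)

news_title, news_summary, news_source, news_link = [], [], [], []

# 2. Parse each capture and pull the title, link, and summary.
for url in url_list:
    final_url = 'https://web.archive.org/web/' + url
    soup = bs(rq.get(final_url).text, 'html.parser')

    for article in soup.find_all('article'):
        try:
            h2_tags = article.find_all('h2')
            if len(h2_tags) > 1 and h2_tags[1].a is not None:
                title = h2_tags[1].a.text
                link = h2_tags[1].a['href']
            else:
                title, link = 'N/A', 'N/A'

            summary = 'N/A'
            if link != 'N/A':
                if link.startswith('/'):
                    # Archived pages often rewrite links as paths relative to web.archive.org.
                    link = 'https://web.archive.org' + link
                article_soup = bs(rq.get(link).text, 'html.parser')
                # Class name observed on the inspected captures; it may differ elsewhere.
                container = article_soup.find('div', attrs={'class': 'article container___2EGEI'})
                if container is not None and container.div is not None:
                    summary = container.div.text

            news_title.append(title)
            news_summary.append(summary)
            news_source.append('nbcnews.com/politics')  # placeholder source label
            news_link.append(link)
        except Exception as e:
            warn('Skipping an article: {}'.format(e))

        sleep(randint(1, 3))  # pause between article requests

# 3. Export the scraped data as a CSV.
nbc_df = pd.DataFrame({'title': news_title,
                       'summary': news_summary,
                       'source': news_source,
                       'article_link': news_link})
nbc_df.to_csv('nbc_articles.csv', index=False)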

Happy scraping!
