Scraping Wordpress — Archiving Journalism Step by Step

Using Node.JS to build a web scraper from scratch

Nik Rahmel
7 min read · Sep 30, 2018

My time at university was defined in very large part by my involvement in Student Media: the student newspaper, the student TV station, the radio station. I still work with the Student Radio Association (see you at the awards in November?) and the National Student Television Association to this day.

Last week, my student paper announced that they will delete all posts from their website from 2015 and older, so I thought I'd see if I could create my own little backup. Obviously, having graduated some years ago, I no longer have a valid login, so I had to figure out my own way to scrape the data.

The Headlines

Before I get to the technicalities, here is what some basic analysis of the scraped data shows:

How many posts will be deleted?
Analysing the accessible posts, around 58% of everything published is due to be deleted (see the quick check after the list below).
This is the number of posts published per year:
2014: 1883
2015: 1903
2016: 1388
2017: 900
2018: 471
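A quick check against those numbers: 1883 + 1903 = 3786 posts from 2014 and 2015, out of 6545 accessible posts overall, which is roughly 58%.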

Interestingly, the website was set up around 2011/2012, and there was a migration from wordpress.com to a third-party provider around 2014. This analysis and backup are based only on the current website.

Note that during my time it was standard practice to also publish all posts from the print edition online. I do not know whether that is still the case, but it would explain the higher numbers for 2014/2015.

How much data will be freed?
For a lot of the older posts, the related images have already been removed or are otherwise inaccessible, so the majority of images are from more recent years.
Removing everything from 2015 and earlier will result in around 13% less space taken up by locally hosted media (see the check after the list below):
2014: 7.08 MB
2015: 740.75 MB
2016: 3.39 GB
2017: 1.2 GB
2018: 542.67 MB
Total Size: 5.84 GB
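The same kind of check here: 7.08 MB + 740.75 MB is roughly 748 MB out of 5.84 GB, i.e. around 13% of the locally hosted media.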

So, all in all, this will remove less than 1 GB of publicly accessible data. Obviously, I do not know how much data there is that isn't exposed in the post content or part of other pages in the system.

The cost
All my analysed data takes up around 4.8 GB in zipped form. I have uploaded a copy to an S3 bucket, which is going to cost me around £1.02/year at current exchange rates.
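For context, assuming S3 Standard pricing of roughly $0.023 per GB-month, 4.8 GB works out to about $1.32 a year, which at an exchange rate of around $1.30 to the pound is just about £1.02.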

I am pretty sure the storage plan they have for hosting the site allows for 20 GB, so I am a bit surprised that less than 5 GB of content seems to be accessible while the limit is apparently being hit. I don't know whether the posts themselves count towards the allowance, but my raw JSON file containing the post content and basic metadata weighs in at around 50 MB for the whole site.

I know Student Unions don't have much money, nor the resources to support custom-built solutions, but I am still disappointed that it has come to this. I am very interested to see what impact tomorrow's planned deletion actually has, but the communication of the issue at the heart of it has not been very thorough.

The Technicals

I am only interested in the actual post data and the media content related to the posts, not the whole rendered output. I am also not a fan of HTML scraping; it always seems to end up messy even on a well-structured site, and with all the changes new editorial teams have made over the years, I have no idea how structured the site still is.

Searching for Wordpress Backup or Wordpress Scraping didn't really yield anything useful without access to the admin interface or the database, so I took it as a bit of a challenge to write some code.

RSS Lives

I had a hunch about how I could get standardised post data: good ol' trusty RSS. I still use RSS daily (Feedly is my second most used website), and trying out /feed on the site gave me a nice bit of XML. It contained 10 posts. Great. At least they seemed to be full versions and not abridged. Yay!

I was expecting there to be thousands, though, and there were no pagination indicators as per RFC 5005. Back to the drawing board? No! /feed/?paged=2 showed me the second page. Thanks to user2611 on the Wordpress StackExchange for that tip!

The feed

Through some trial and error, I found out that there are 655 pages of RSS to paginate through, resulting in around 6,500 posts.
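If you want to automate that trial and error, a minimal sketch along the following lines would work, assuming WordPress answers an out-of-range ?paged= value with a 404 (this is not the exact code I used, just an illustration):

const request = require('request-promise');
const HOST = 'http://example.com';

// Walk forward until the next feed page no longer exists.
// Assumption: WordPress returns a 404 for an out-of-range ?paged= value.
async function findLastPage() {
  let page = 1;
  while (true) {
    try {
      await request(`${HOST}/feed/?paged=${page + 1}`, { method: 'HEAD' });
      page++;
    } catch (error) {
      if (error.statusCode === 404) return page;
      throw error; // anything other than a 404 is unexpected
    }
  }
}

findLastPage().then(last => console.log(`Last feed page: ${last}`));

A doubling-then-binary search would get there in far fewer requests, but for a few hundred pages the linear walk is good enough.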

With this knowledge it’s fairly trivial to get the content of the complete feed:

const request = require('request-promise');
const bluebird = require('bluebird');
const fs = require('fs');

const PAGES = 655;
const HOST = 'http://example.com';

// Fetch a single page of the RSS feed
async function getPage(i) {
  return request(`${HOST}/feed/?paged=${i}`);
}

// Fetch all pages, at most 5 at a time; feed pages are numbered 1..655
const pages = bluebird.map(Array(PAGES), (v, i) => {
  console.log(`Getting Page ${i + 1}`);
  return getPage(i + 1);
}, {concurrency: 5});

pages.then((res) => {
  fs.writeFileSync('rss.json', JSON.stringify(res));
});

This gives me a JSON file containing an array of strings, each representing one page of the RSS feed.

The posts

In the RSS pages there is a lot of data I don’t need — XML is quite verbose, and a lot of the metadata is not related to the posts. I used a nice little library called rss-parser to extract the actual posts from the pages:

const _ = require('lodash');
const fs = require('fs');
const Parser = require('rss-parser');

const parser = new Parser();
const rsss = require('./rss.json');

// Parse each RSS page and keep just the post items
const itemsP = rsss.map(async (rss) => {
  const feed = await parser.parseString(rss);
  return feed.items;
});

Promise.all(itemsP).then((itemsA) => {
  const items = _.flatten(itemsA); // without this, we'd have an array of pages containing an array of posts
  console.log(`Saving ${items.length} posts`);
  fs.writeFileSync('items.json', JSON.stringify(items));
});

This gives me a JSON file containing an array of post objects, including relevant metadata for content, creator, title, link, date, categories and guid.

To make them more easily browsable, I've split them up into one file each, using the guid, padded to a common length, as the filename:

const fs = require('fs');
const items = require('./items');

// Left-pad a number with zeroes to a fixed length
function pad(num, size) {
  let s = num.toString();
  while (s.length < size) s = `0${s}`;
  return s;
}

if (!fs.existsSync('./items')) fs.mkdirSync('./items'); // make sure the target directory exists

items.forEach(item => {
  const id = item.guid.split('=')[1]; // extract the ID from the full URL
  const idNumber = parseInt(id, 10);
  const idFixed = pad(idNumber, 8);
  fs.writeFileSync(`./items/${idFixed}.json`, JSON.stringify(item, null, 2));
});
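As an aside, on Node 8 and newer the pad helper isn't strictly needed; the built-in String.prototype.padStart does the same thing:

const idFixed = String(idNumber).padStart(8, '0'); // same result as pad(idNumber, 8)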

Let’s analyse some of this — how many posts are there per year?
I used the infamous moment.js to find the year for each item with the following code:

const posts = require('./items');
const moment = require('moment');

const years = {};

// Extract the publication year from a post's ISO date
function getYear(post) {
  if (!post.isoDate) throw new Error('No Date found');
  const postDate = moment(post.isoDate);
  return postDate.year();
}

// Count the posts per year
posts.forEach(post => {
  const year = getYear(post);
  if (!years[year]) years[year] = 0;
  years[year]++;
});

Object.keys(years).forEach(year => {
  console.log(`${year}: ${years[year]}`);
});

And these are the stats we find, in case you skipped the intro:

2014: 1883
2015: 1903
2016: 1388
2017: 900
2018: 471

The media

Photos and other imagery make up a large part of the posts, so I wanted to find all the media links the posts mention.

Let's start with all URLs mentioned in the post content. I used another library, get-urls, that can comb through any string and identify the URLs in it:

const posts = require('./items.json');
const fs = require('fs');
const _ = require('lodash');
const getUrls = require('get-urls');

const allUrls = new Set();

posts.forEach((post) => {
  // A post's text can live in several fields depending on the feed
  const values = [post['content:encoded'], post.content, post.contentSnippet];
  const strings = _.compact(values).join(' ');
  const urls = getUrls(strings);
  urls.forEach(url => {
    allUrls.add(url);
  });
});

fs.writeFileSync('allUrls.json', JSON.stringify([...allUrls], null, 2));

I also noticed that quite a few of them didn't resolve to valid files and the host returned a 404 status code, so I wanted to filter those out.
Since I have to make requests for this anyway, I thought it might be nice to figure out how much storage these files actually take up, so for each valid URL I also recorded the reported file size:

const allUrls = require('./allUrls');
const _ = require('lodash');
const bluebird = require('bluebird');
const fs = require('fs');
const request = require('request-promise');

// Only keep media hosted by the site itself or its CDN bucket
const storedUrls = allUrls.filter((url) => {
  const isCorrectHost = url.includes('example1.com') || url.includes('cdnbucket.com');
  const isUpload = url.includes('/wp-content/uploads');
  return isCorrectHost && isUpload;
});

let counter = 0; // To store and display progress

const urlsAndSizes = bluebird.map(storedUrls, async (url) => {
  let res;
  try {
    // A HEAD request is enough to read the Content-Length header
    res = await request(url, {
      method: 'HEAD',
      resolveWithFullResponse: true
    });
  } catch (error) {
    if (error.statusCode === 404) return; // dead link, drop it
    throw error; // anything else is unexpected
  }
  counter++;
  const percentage = ((counter / storedUrls.length) * 100).toFixed(2);
  console.log(`Finished ${percentage}%`);
  return {
    url,
    size: res.headers['content-length']
  };
}, {concurrency: 5});

urlsAndSizes.then((urls) => {
  const validUrlsAndSizes = _.compact(urls); // remove the 404s
  const fileContent = JSON.stringify(validUrlsAndSizes, null, 2);
  fs.writeFileSync('validUrlsAndSizes.json', fileContent);
});

This gives me a JSON file containing an array of objects with URLs and their reported file sizes in bytes. Let's analyse the data!

Again, I am interested in a storage breakdown by year: does removing content from 2015 and before actually result in substantial savings?
Similar to the post analysis above, we can make use of the previously extracted media URLs:

const filesize = require('filesize');
const objects = require('./validUrlsAndSizes');

const years = {};
let totalSize = 0;

// The upload path contains the year, e.g. /wp-content/uploads/2016/...
function getYear(url) {
  const split = url.split('uploads/');
  if (split.length !== 2) throw new Error('Invalid URL');
  return split[1].slice(0, 4); // the year in the upload path
}

// Sum the reported sizes per year and in total
objects.forEach(medium => {
  const year = getYear(medium.url);
  if (!years[year]) years[year] = 0;
  totalSize += parseInt(medium.size, 10);
  years[year] += parseInt(medium.size, 10);
});

Object.keys(years).forEach(year => {
  console.log(`${year}: ${filesize(years[year])}`);
});
console.log(`Total Size: ${filesize(totalSize)}`);

This is what we get:

2014: 7.08 MB
2015: 740.75 MB
2016: 3.39 GB
2017: 1.2 GB
2018: 542.67 MB
Total Size: 5.84 GB

Archiving

I have downloaded all the media files from my validUrlsAndSizes.json file, preserving their relative paths, using wget:

wget --no-host-directories --force-directories --input-file=urls.txt
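wget expects a plain list of URLs, one per line, and urls.txt isn't produced by any of the scripts above; a minimal sketch to generate it from validUrlsAndSizes.json could look like this:

const fs = require('fs');
const urlsAndSizes = require('./validUrlsAndSizes');

// One URL per line, as expected by wget --input-file
fs.writeFileSync('urls.txt', urlsAndSizes.map(({ url }) => url).join('\n'));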

Nik Rahmel

Software Engineer. Media. TV. Journalism. Photography.