Basic Metascraper for Lists of URLs

Hubert
Aug 28, 2018


Step 1: Get npm if you don’t have it.

Step 2: In a new project directory (I called mine metascraper), run the following commands:
npm init
npm i metascraper
npm i got
Newer got releases are ESM-only, so if require('got') fails in Step 5, install got@11 instead.

Step 3: Edit the scripts section in the package.json file to look like this:

"scripts": {
  "start": "node index.js"
},

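Step 4: The metascraper documentation lists the rule packages you can install. You must install each rule before you can use it. These are the six rules used in Step 5:

npm i metascraper-author
npm i metascraper-date
npm i metascraper-description
npm i metascraper-publisher
npm i metascraper-title
npm i metascraper-url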

Step 5: Make an index.js file in the project directory, copy this into it, and change the rules and URLs as needed:

const metascraper = require('metascraper')([
  // Rules to use. Each one must be installed before it can be required.
  require('metascraper-author')(),
  require('metascraper-date')(),
  require('metascraper-description')(),
  require('metascraper-publisher')(),
  require('metascraper-title')(),
  require('metascraper-url')()
])
const got = require('got')

// URLs to be scraped for metadata
const urls = [
  "http://my-url.com/my-article",
  "http://my-url.com/my-article",
  "http://my-url.com/my-article",
  "http://my-url.com/my-article",
  "http://my-url.com/my-article",
  "http://my-url.com/my-article"
]
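
async function printFiles () {
  for (const site of urls) {
    try {
      // Fetch the page: body is the HTML, url is the final address after redirects
      const { body: html, url } = await got(site)
      const metadata = await metascraper({ html, url })
      console.log(metadata)
    } catch (e) {
      // Log the failure and move on to the next URL
      console.log(e)
    }
  }
}

printFiles()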
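
One thing to watch in a loop like this is a single failed request spoiling the whole run. Here is a minimal sketch of the same one-at-a-time pattern that records failures and keeps going. The names scrapeAll and fakeScrape are my own and are not part of metascraper or got. fakeScrape is only a stand-in so the sketch runs without a network connection, and the Set that drops repeated URLs is an extra I added.

```javascript
// Hypothetical helper: scrape each URL in order without letting one failure stop the run.
// `scrapeOne` is any async function that takes a URL and resolves to a metadata object.
async function scrapeAll (urls, scrapeOne) {
  const results = []
  // A Set drops repeated URLs while keeping first-seen order
  for (const site of new Set(urls)) {
    try {
      results.push({ site, metadata: await scrapeOne(site) })
    } catch (e) {
      // Record the error message next to the URL instead of throwing
      results.push({ site, error: e.message })
    }
  }
  return results
}

// Stand-in for the got + metascraper call, so this sketch runs offline
async function fakeScrape (site) {
  if (site.includes('broken')) throw new Error('request failed')
  return { title: 'Example', url: site }
}

scrapeAll(
  ['http://my-url.com/a', 'http://my-url.com/broken', 'http://my-url.com/a'],
  fakeScrape
).then((results) => console.log(results))
```

In index.js you would pass a function that does the real work in place of fakeScrape, for example one that awaits got(site) and then returns metascraper({ html, url }).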

Step 6: Start the script by running npm run start. The console will print one metadata object per URL, looking something like this:

{
  date: '2018-08-16T11:00:00.225Z',
  description: 'About 8,300 Saudi post-secondary students living in Canada were left shocked and scrambling earlier this month when Saudi Arabia abruptly ordered them to withdraw from their studies and leave the country by Aug. 31. Here, one of those students shares his story of how the diplomatic feud is devastating his dreams after years of hard work.',
  publisher: 'CBC',
  title: 'POINT OF VIEW | As a Saudi student being forced to leave Canada, I’m going through the 5 stages of grief | CBC News',
  url: 'https://www.cbc.ca/news/canada/calgary/saudi-arabia-student-leave-canada-riyadh-1.4786874'
}
{
  date: '2018-08-13T21:45:00.000Z',
  description: 'Prime Minister Justin Trudeau says his government wants to improve its relationship with Saudi Arabia, but will not sacrifice Canada’s position on human rights.',
  publisher: 'Global News',
  title: 'Canada will ‘engage’ with Saudi Arabia but won’t change position on human rights: Trudeau',
  url: 'https://globalnews.ca/news/4385506/canada-saudi-arabia-human-rights-position/'
}
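
Notice that neither object above shows an author, even though the author rule is loaded. Not every page exposes every field, so it can help to check what came back. This is a small sketch of such a check. The helper name missingFields is hypothetical, the field names are my assumption based on the six rule names, and the sample object is a shortened placeholder version of the output above.

```javascript
// Hypothetical helper: list the expected fields that are absent or empty in a metadata object
function missingFields (metadata, expected) {
  return expected.filter((field) => metadata[field] == null || metadata[field] === '')
}

// Field names assumed to match the six rules loaded in index.js
const expected = ['author', 'date', 'description', 'publisher', 'title', 'url']

// Shortened placeholder modeled on the sample output (no author field)
const sample = {
  date: '2018-08-13T21:45:00.000Z',
  description: 'shortened for this example',
  publisher: 'Global News',
  title: 'shortened for this example',
  url: 'https://globalnews.ca/news/4385506/canada-saudi-arabia-human-rights-position/'
}

console.log(missingFields(sample, expected)) // prints [ 'author' ]
```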

You can then send it to any Josh that you know who will make good use of the information.
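
If you want to send the results as a file, keep in mind that console.log output is not strict JSON (single quotes, unquoted keys). Here is a rough sketch of saving the objects as real JSON using only Node's built-in modules. The saveResults helper, the placeholder data, and the metadata.json file name are all my own choices for illustration.

```javascript
const fs = require('fs')
const os = require('os')
const path = require('path')

// Hypothetical helper: write an array of metadata objects to a file as pretty-printed JSON
function saveResults (results, filePath) {
  fs.writeFileSync(filePath, JSON.stringify(results, null, 2))
  return filePath
}

// Placeholder data standing in for the objects metascraper returns
const results = [
  { title: 'Example article', publisher: 'Example Publisher', url: 'http://my-url.com/my-article' }
]

// The temp directory and file name are arbitrary choices for this sketch
const outFile = saveResults(results, path.join(os.tmpdir(), 'metadata.json'))
console.log('Saved to ' + outFile)
```

In the real script you would push each metadata object into an array inside the loop and call saveResults once at the end.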

Happy scraping!
