Saving web pages with Vue and Node via Newspaper3k

While the Under Cloud has an extension for Google Chrome that allows us to save a selection of text from a web page, what’s been lacking is the option to automate the saving of the entire page. Obvious though it is, saving a web page is no trivial task, and it’s something I’ve been 7 parts preparing for, 2 parts avoiding, and 1 part dreading for ages!

Yet, here we are, at last, and the Under Cloud now supports the saving of web pages via Newspaper3k, a versatile package written in Python. I'm stretching the definition of "now", since I'm still running tests in the staging environment, but it's almost complete and should be in production within the week.

The documentation for Newspaper is sparse, and code samples were (are) few and far between. Worse, I had no idea how I would make Python talk to Node. An API is the obvious choice here, but I had no understanding of Python, the types of data it supported, or how I would get that data out of it.

I'm writing this from the perspective of someone on the other side of the learning curve, having walked the long route to get here. Given the time constraints I was up against, I would have preferred a path less cluttered with obstacles, so this article is from present me, for the attention of past me.

Alternatives to Newspaper3k

Newspaper3k versus BeautifulSoup

  1. Newspaper appears to be focused on general-purpose page scraping;
  2. while BeautifulSoup, with its wealth of options for parsing the DOM, is geared more towards data science.

You need to know the specific parts of a web page to get the most from BeautifulSoup, such as the selector for a headline or the element wrapping an article's body. I could be wrong, so I look forward to someone stepping in with more information!

Scraping a web page with Newspaper3k

Before we go any further, a few assumptions and caveats:

  • you have an understanding of both Vue and Node;
  • and don't need me to go through the whole process of installing and configuring either;
  • or instantiating a new project;
  • you have Python installed, along with the Newspaper3k package (see the note after this list);
  • I'll be providing concise examples of the code, rather than the complete versions.
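
A quick note on the Newspaper3k point: the package lives on PyPI as newspaper3k, so pip3 install newspaper3k (or your preferred equivalent) will pull it in, although, as the code below shows, it's imported as plain newspaper.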

As an aside, I don’t like scraping as a description of what we’re doing here, given the horrible connotations attached to it. Please don’t use this article to create nefarious garbage for the purposes of plagiarising the work of others.

Python

import os
import sys
import json
from datetime import datetime
from newspaper import Article

# Here, the `url` value should be something like: https://www.bbc.co.uk/sport/football/53944598
url = sys.argv[1]

template_for_exceptions = "An exception of type {0} occurred. Arguments:\n{1!r}"

def get_web_page(url):
    try:
        if url and len(url) > 0:
            article = Article(url, keep_article_html=True)
            article.download()
            article.parse()

            dataForBookmarkAsJSON = json.dumps({
                'publicationDate': article.publish_date if article.publish_date is None else article.publish_date.strftime("%Y-%m-%d %H:%M:%S"),
                'title': article.title,
                'note': article.article_html,
                'authors': article.authors
            })

            try:
                sys.stdout.write(dataForBookmarkAsJSON)
                sys.stdout.flush()
                os._exit(0)
            except Exception as ex:
                message_for_exception = template_for_exceptions.format(type(ex).__name__, ex.args)
                print(message_for_exception)
                sys.exit(1)
    except Exception as ex:
        message_for_exception = template_for_exceptions.format(type(ex).__name__, ex.args)
        print(message_for_exception)
        sys.exit(1)

if __name__ == '__main__':
    get_web_page(url)

A few things to point out here, such as the article.publish_date value, which is either a date that I format as a string, or null, which I handle when populating the JSON object. Yes, I could have done that upstream in Node, but I took the moment to learn a few things about (and in) Python.
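
Before wiring the script up to Node, it's worth running it directly (for example, python3 get_web_page.py followed by a URL, assuming python3 is how Python is invoked on your system) to confirm that a JSON string lands on stdout.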

Vue

getWebPage () {
  this.$axios.get(`/newspaper`, {
    params: {
      // Params.
    }
  }).then(function(response) {
    // Handle the response.
  }).catch(function(error) {
    // Handle the error.
  })
}
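
The params are deliberately elided above. Given that the Node controller further down reads params.url and params.userID, a filled-in sketch might look like the following, where urlForPage and userID are hypothetical pieces of component state:

getWebPage () {
  this.$axios.get(`/newspaper`, {
    params: {
      // Hypothetical component state: the page to save, and who's saving it.
      url: this.urlForPage,
      userID: this.userID
    }
  }).then(function(response) {
    // Handle the response.
  }).catch(function(error) {
    // Handle the error.
  })
}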

Node

router.get('/newspaper', async (req, res) => {
  // The query string carries the data for the controller, i.e. the url and userID.
  const getNewspaper = await controllerNewspaper.getWebPage(req.query)
  res.json(getNewspaper)
})

… and in the controller, I have:

services.getWebPage = async (params) => {
  let { spawn } = require('child_process')

  // Spawn the Python script as a child process, passing the URL as an argument.
  let processForPython = spawn(process.env.PYTHON_VERSION, [
    `${process.env.PYTHON_PATH}/get_web_page.py`,
    params.url
  ], {
    maxBuffer: 10240000
  })

  let dataForBookmarkStream = []

  return new Promise((resolve, reject) => {
    processForPython.stdout.on('data', (response) => {
      dataForBookmarkStream.push(response)
    })

    processForPython.stderr.on('data', (error) => {
      reject({
        error: `An error occurred while attempting to parse the web page: ${error.toString()}`
      })
    })

    processForPython.on('exit', (code) => {
      switch (code) {
        case 0:
          if (dataForBookmarkStream.length > 0) {
            let dataForBookmark

            try {
              try {
                // Reassemble the buffered chunks byte for byte before parsing.
                dataForBookmark = JSON.parse(Buffer.concat(dataForBookmarkStream).toString())
              } catch (exception) {
                return reject({
                  error: "JSON object supplied by Newspaper is invalid."
                })
              }

              if (typeof dataForBookmark === 'object') {
                const paramsForBookmark = new URLSearchParams()
                paramsForBookmark.append('userID', params.userID)
                // Additional parameters, using dataForBookmark...

                instanceOfAxios.post('/assets', paramsForBookmark)
                  .then(function (response) {
                    resolve(response)
                  })
                  .catch(function (error) {
                    reject(error)
                  })
              }
            } catch (exception) {
              reject({
                error: "An error occurred while attempting to save the web page."
              })
            }
          } else {
            reject()
          }
          break
        case 1:
          reject({
            error: "Web page couldn't be saved."
          })
          break
      }
    })
  }).catch(error => {
    return {
      error: "Web page couldn't be saved."
    }
  })
}

Yeah, it’s a lot to take in, so let’s look at some specifics…

First, figure out which Python you're running and create an equivalent environment variable for process.env.PYTHON_VERSION; spawn uses this value as the command it invokes, so something like python3 is typical.

Second, figure out the path to the directory containing get_web_page.py and create an equivalent environment variable for process.env.PYTHON_PATH.

Then, feel free to tweak maxBuffer to fit, although, strictly speaking, maxBuffer is documented for exec and execFile rather than spawn, so it may well be ignored here. As an aside, I did attempt a version of the code using maxBuffer alone, but some web pages were too big, at which point the JSON object failed to parse and then everything went to crap.

Once the Python script is called, it begins to stream the JSON object to processForPython.stdout.on('data'), which I’m grabbing in chunks via the dataForBookmarkStream variable.
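
Those chunks are Buffers, and reassembling them is where it's easy to slip up. Here's a minimal, self-contained illustration (the JSON fragment is made up) of why the chunks are concatenated with Buffer.concat rather than joined as an array:

// Two chunks, as they might arrive from stdout when a page is large.
const chunkOne = Buffer.from('{"title": "An exam')
const chunkTwo = Buffer.from('ple"}')

// Array.prototype.join defaults to a comma separator, which splices a stray
// comma into the middle of the JSON and breaks JSON.parse:
console.log([chunkOne, chunkTwo].join())
// {"title": "An exam,ple"}

// Buffer.concat reassembles the chunks byte for byte:
console.log(Buffer.concat([chunkOne, chunkTwo]).toString())
// {"title": "An example"}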

Assuming the process was a success, we hit the switch block in processForPython.on('exit') and, when the exit code is 0, convert the buffered chunks in dataForBookmarkStream into something useful, using:

dataForBookmark = JSON.parse(Buffer.concat(dataForBookmarkStream).toString())

… before sending the data via the API to somewhere else in the application.

Do we have some Node and Python people shaking their collective heads, wearing an avuncular expression with a hint of disappointment? If so, share, and let's learn what could be improved!

Owner of Octane, helping small businesses with big business problems, and the man behind the Under Cloud, the ultimate digital research assistant.