Save screenshot of websites with Puppeteer, Cloudinary and Heroku with Node.js
My partner works in the financial industry, and one day he asked me:
“We have newspaper archives for printed newspapers. But nowadays
news comes from websites like cnn.com. How are they archived?”
As a techie, I immediately thought of an API. There are plenty of good feed providers, yet they did not suit his needs because he also follows Asian markets, and the API providers often cover only US news, or other markets from a US point of view. He wanted snapshots of a few news sites, scheduled to run at certain times every day. I could not find a good API within an affordable budget, so I rolled up my sleeves.
This article has three parts, and I use only JavaScript (Node.js) in this experiment. You can jump to the part that interests you, or go through it step by step with me:
Part 1: use Puppeteer to take screenshots of websites and save them locally.
Part 2: save files to Cloudinary.
Part 3: schedule the Puppeteer and Cloudinary work on Heroku.
I had used PhantomJS and Selenium before, and I knew they could take snapshots properly. Comparing PhantomJS and Puppeteer a bit, Puppeteer looked like the better candidate to me because it sits in the Node.js ecosystem. I was also sold on how easily a large variety of options can be applied when launching the headless Chrome browser. More information about Puppeteer can be found in its git repo: https://github.com/GoogleChrome/puppeteer
Part 1: image capture of websites with Puppeteer
1. Create a folder called MyNewsShotProject, then add an empty file called snapshot.js:
--MyNewsShotProject\
|--snapshot.js
2. Initialize a project with yarn init:
$ yarn init (or npm init)
When prompted, enter snapshot.js as the entry point.
After yarn init finishes, you should see a package.json file in the project directory:
--MyNewsShotProject\
|--package.json
|--snapshot.js
3. Open package.json and add a start script as below:
{
  "name": "MyNewsShotProject",
  "version": "1.0.0",
  "description": "A demo of using Puppeteer and Cloudinary to take screenshots of sites",
  "main": "snapshot.js",
  "author": "Vivian Chan",
  "license": "MIT",
  "scripts": {
    "start": "node snapshot.js"
  }
}
4. Install puppeteer:
$ yarn add puppeteer
$ yarn install
5. Then edit snapshot.js. We will add a function, doScreenCapture, that captures a single site and saves the screenshot to a local png file:
// snapshot.js
const puppeteer = require('puppeteer');

async function doScreenCapture(url, site_name) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, {waitUntil: 'domcontentloaded'});
  await page.screenshot({
    fullPage: true,
    path: `${site_name}.png`
  });
  await browser.close();
}
The method calls to Puppeteer are nicely self-explanatory:
- require('puppeteer'): imports the puppeteer module we need for the screen capture.
- puppeteer.launch: launches the browser.
- browser.newPage: creates a new page in the browser (think of it as opening a new tab in your Chrome browser).
- page.goto: opens the site url in the page just created (think of it as opening the url in that new tab). Puppeteer starts navigating to the site, but how do we know when the site has finished opening? Here we use the waitUntil parameter to tell Puppeteer when page.goto should be considered finished. waitUntil: 'domcontentloaded' means we wait until the DOMContentLoaded event is fired by the site we are going to. The other values for waitUntil ('load', 'networkidle0' and 'networkidle2') are described in the Puppeteer documentation.
- page.screenshot: takes a screenshot of the site and saves the image to a local path. Here I specify that I want the full length of the page; other options, such as a width, height, x-position and y-position, are also available. I also use string interpolation so that the saved image is named after my site, with a png file extension.
- browser.close: closes the browser.
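To make the page.goto and page.screenshot options a little more concrete, here is a minimal sketch of a helper that builds both option objects. The helper name buildCaptureOptions and its parameters are my own invention for illustration; waitUntil, fullPage, path and clip are the Puppeteer option names. As far as I know, Puppeteer refuses clip together with fullPage: true, so the sketch picks one or the other.

```javascript
// Hypothetical helper (not part of Puppeteer): builds the option objects
// that would be passed to page.goto() and page.screenshot().
const WAIT_EVENTS = ['load', 'domcontentloaded', 'networkidle0', 'networkidle2'];

function buildCaptureOptions(siteName, { waitUntil = 'domcontentloaded', clip = null } = {}) {
  if (!WAIT_EVENTS.includes(waitUntil)) {
    throw new Error(`Unsupported waitUntil value: ${waitUntil}`);
  }
  const screenshot = { path: `${siteName}.png` };
  if (clip) {
    // Capture only a region: { x, y, width, height }.
    screenshot.clip = clip;
  } else {
    // No region given, so capture the full scrollable page.
    screenshot.fullPage = true;
  }
  return { goto: { waitUntil }, screenshot };
}

// Usage inside doScreenCapture (assumes `page` and `url` exist):
//   const opts = buildCaptureOptions('reuters', { waitUntil: 'networkidle2' });
//   await page.goto(url, opts.goto);
//   await page.screenshot(opts.screenshot);
```

Waiting for 'networkidle2' instead of 'domcontentloaded' tends to give images more time to load before the shot is taken, at the cost of a slower run.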
6. As our goal is to capture multiple sites daily, we define an array holding each site's name and url. Then a simple for loop calls doScreenCapture for each site:
const puppeteer = require('puppeteer');

async function doScreenCapture(url, site_name) {
  // details are skipped here. Refer to the previous step
}

const news_sites = [
  {
    name: 'reuters',
    url: 'https://www.reuters.com/'
  }, {
    name: 'reuters_china',
    url: 'https://cn.reuters.com/'
  }, {
    name: 'reuters_japan',
    url: 'https://jp.reuters.com/'
  }, {
    name: 'reuters_germany',
    url: 'https://de.reuters.com/'
  }, {
    name: 'reuters_ara',
    url: 'https://ara.reuters.com/'
  }
];

for (var i = 0; i < news_sites.length; i++) {
  doScreenCapture(news_sites[i]['url'], news_sites[i]['name']);
}
7. Run the program.
$ yarn start (or npm start)
8. It should produce png files named after the entries in news_sites:
--MyNewsShotProject\
|--node_modules\
|--package.json
|--snapshot.js
|--reuters.png
|--reuters_ara.png
|--reuters_china.png
|--reuters_germany.png
|--reuters_japan.png
If you open one of the screenshot files, it will probably look like (reuters.png as of writing):
After Part 1, snapshot.js should look like this:
const puppeteer = require('puppeteer');

async function doScreenCapture(url, site_name) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, {waitUntil: 'domcontentloaded'});
  await page.screenshot({
    fullPage: true,
    path: `${site_name}.png`
  });
  await browser.close();
}

const news_sites = [
  {
    name: 'reuters',
    url: 'https://www.reuters.com/'
  }, {
    name: 'reuters_china',
    url: 'https://cn.reuters.com/'
  }, {
    name: 'reuters_japan',
    url: 'https://jp.reuters.com/'
  }, {
    name: 'reuters_germany',
    url: 'https://de.reuters.com/'
  }, {
    name: 'reuters_ara',
    url: 'https://ara.reuters.com/'
  }
];

for (let i = 0; i < news_sites.length; i++) {
  // doScreenCapture is async, so a failure arrives as a promise rejection.
  // A try/catch around the un-awaited call would not see it; .catch() does.
  doScreenCapture(news_sites[i]['url'], news_sites[i]['name'])
    .catch((e) => {
      console.error(`Error in capturing site for ${news_sites[i]['name']}`, e);
    });
}
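The loop in snapshot.js starts every capture at once, and each call launches its own headless Chrome. If that turns out to be too heavy for the machine, one option is to process the sites one after another. The sketch below is only an illustration under that assumption: captureSequentially is a hypothetical name, and its capture argument stands in for doScreenCapture.

```javascript
// Hypothetical helper: runs an async capture function over the sites one at a
// time and records which ones failed instead of stopping at the first error.
async function captureSequentially(sites, capture) {
  const failed = [];
  for (const site of sites) {
    try {
      // Wait for this capture to finish before starting the next one.
      await capture(site.url, site.name);
    } catch (e) {
      console.error(`Error in capturing site for ${site.name}`, e);
      failed.push(site.name);
    }
  }
  return failed;
}

// Usage with the function from this article:
//   captureSequentially(news_sites, doScreenCapture);
```

The trade-off is run time: five sites in sequence take roughly the sum of the individual capture times, while the parallel loop takes about as long as the slowest site.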
Now we can take screenshots of sites and save the images locally. Next, I will go through how to save them to Cloudinary.
Part 2: save files to Cloudinary
1. Cloudinary provides a video and image management service. It enables users to upload, store, manage, manipulate and deliver images and videos for websites and apps. Cloudinary offers a free account, which we can apply for at https://cloudinary.com/users/register/free. It also has a Node.js SDK with documentation we can follow.
2. After opening a free account and logging in, you can see the panel. If you click reveal next to API Secret, you should see something like the following (I replaced the real details with fake ones for the demonstration):
3. In the project folder, install cloudinary and dotenv:
$ yarn add cloudinary dotenv
Then, in package.json, change the start script to load the env config:
"scripts": {
  "start": "node --require dotenv/config snapshot.js"
},
The cloudinary package is needed to call the Cloudinary API, which saves files from the local machine to your Cloudinary space. The dotenv package helps us keep environment variables in a tidy way.
4. Then create a file called .env under the project folder:
--MyNewsShotProject\
|--.env
|--package.json
|--snapshot.js
Then copy the cloud name, API key and API secret into the .env file:
CLOUD_NAME='<your cloud name in step 2>'
API_KEY='<your api key in step 2>'
API_SECRET='<your api secret in step 2>'
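A forgotten or misspelled variable in .env otherwise only shows up later as a failed upload, so it can help to check the three values at start-up. The following is a minimal sketch of such a check; readCloudinaryEnv is a hypothetical helper, not part of dotenv or the Cloudinary SDK, and the object it returns uses the same keys that are passed to cloudinary.config in step 6.

```javascript
// Hypothetical helper: returns the Cloudinary settings from an environment
// object, or throws a readable error listing every variable that is missing.
function readCloudinaryEnv(env = process.env) {
  const required = ['CLOUD_NAME', 'API_KEY', 'API_SECRET'];
  const missing = required.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(', ')}`);
  }
  return {
    cloud_name: env.CLOUD_NAME,
    api_key: env.API_KEY,
    api_secret: env.API_SECRET
  };
}

// Usage: cloudinary.config(readCloudinaryEnv());
```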
5. In Part 1, our doScreenCapture function just saved the image captured by the headless browser locally; in this part we need it to save to Cloudinary. We adopted the async API of Puppeteer in Part 1. Unfortunately, an async API for Cloudinary was not really available at the time of writing (a request for it is being tracked), so we will call the Cloudinary Node.js API the traditional JS Promise way.
6. In snapshot.js, import Cloudinary and define the Cloudinary config with the cloud name, API key and API secret:
// snapshot.js
const puppeteer = require('puppeteer');
const cloudinary = require('cloudinary');

cloudinary.config({
  cloud_name: process.env.CLOUD_NAME,
  api_key: process.env.API_KEY,
  api_secret: process.env.API_SECRET
});

async function doScreenCapture(url, site_name) {
  // details are skipped here. Refer to the previous step
}

// remaining code for news_sites and the loop over doScreenCapture is skipped
7. In the doScreenCapture function, I want to save every captured image in a newsshot folder, and each file name should have the form year_month_day_hour_minute_siteName.
async function doScreenCapture(url, site_name) {
  const d = new Date();
  // Keep the template literal on one line: a line break inside the backticks
  // would end up in the public_id.
  const current_time = `${d.getFullYear()}_${d.getMonth()+1}_${d.getDate()}_${d.getHours()}_${d.getMinutes()}`;
  const cloudinary_options = {
    public_id: `newsshot/${current_time}_${site_name}`
  };

  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, {waitUntil: 'domcontentloaded'});
  await page.screenshot({
    fullPage: true,
    path: `${site_name}.png`
  });
  await browser.close();
}
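The timestamp in step 7 is built from raw getMonth() and getDate() values, so March 5th at 09:07 comes out as 3_5_9_7. If you prefer names that sort chronologically in the Cloudinary media library, one option is to zero-pad each field. The sketch below shows this; formatTimestamp and buildPublicId are hypothetical helper names, and the newsshot folder is the one used in this article.

```javascript
// Hypothetical helper: formats a Date as yyyy_MM_dd_hh_mm in local time,
// zero-padding every field so that the names sort chronologically.
function formatTimestamp(d) {
  const pad = (n) => String(n).padStart(2, '0');
  return [
    d.getFullYear(),
    pad(d.getMonth() + 1), // getMonth() is zero-based
    pad(d.getDate()),
    pad(d.getHours()),
    pad(d.getMinutes())
  ].join('_');
}

// Hypothetical helper: builds the Cloudinary public_id for one site.
function buildPublicId(siteName, d = new Date()) {
  return `newsshot/${formatTimestamp(d)}_${siteName}`;
}

// Usage: const cloudinary_options = { public_id: buildPublicId(site_name) };
```

Note that both versions use the server's local time zone, which may differ between your machine and the machine that runs the scheduled job.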
8. Then I need to change the page.screenshot call a bit so that it no longer saves a local file at the path `${site_name}.png`. Instead, I want it to return the screenshot result if successful, or false otherwise. Note that the screenshot result is a binary buffer holding the image.
That is, change the code below from:
await page.screenshot({
  fullPage: true,
  path: `${site_name}.png`
});
to:
let shotResult = await page.screenshot({
  fullPage: true
}).then((result) => {
  console.log(`${site_name} got some results.`);
  return result;
}).catch(e => {
  console.error(`[${site_name}] Error in snapshotting news`, e);
  return false;
});
9. Then I use shotResult to upload the result to Cloudinary. The upload task has to take the form of a Promise (because async is not supported yet in Cloudinary), so I create a cloudinaryPromise function that returns a promise. We use upload_stream here because we no longer have a local file; the screenshot result (a buffer) is written to the stream that upload_stream returns. The upload_stream method takes two parameters: the first is the Cloudinary options we set up in step 7, and the second is a callback function that resolves or rejects the promise based on the result.
async function doScreenCapture(url, site_name) {
  // Step 7: create cloudinary_options
  const d = new Date();
  const current_time = `${d.getFullYear()}_${d.getMonth()+1}_${d.getDate()}_${d.getHours()}_${d.getMinutes()}`;
  const cloudinary_options = {
    public_id: `newsshot/${current_time}_${site_name}`
  };

  // Part 1: make use of puppeteer
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, {waitUntil: 'domcontentloaded'});

  // Step 8: return shotResult
  let shotResult = await page.screenshot({
    fullPage: true
  }).then((result) => {
    console.log(`${site_name} got some results.`);
    return result;
  }).catch(e => {
    console.error(`[${site_name}] Error in snapshotting news`, e);
    return false;
  });

  // Close the browser before returning; a statement placed after the
  // return would never run.
  await browser.close();

  // This step (Step 9): return cloudinaryPromise if the screen
  // capture is successful, or else return null
  if (shotResult) {
    return cloudinaryPromise(shotResult, cloudinary_options);
  } else {
    return null;
  }
}

function cloudinaryPromise(shotResult, cloudinary_options) {
  return new Promise(function(res, rej) {
    cloudinary.v2.uploader.upload_stream(cloudinary_options,
      function (error, cloudinary_result) {
        if (error) {
          console.error('Upload to cloudinary failed: ', error);
          rej(error);
          return; // do not fall through to res()
        }
        res(cloudinary_result);
      }
    ).end(shotResult);
  });
}
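The callback-to-Promise wrapping in cloudinaryPromise can be tried without a Cloudinary account by swapping in a stand-in uploader. In the sketch below, uploadBuffer and fakeUploadStream are hypothetical names of mine; the stub only mimics the shape used in this article, namely a function that takes (options, callback) and returns an object with an end(buffer) method. It is not how Cloudinary itself is implemented.

```javascript
// Generic version of cloudinaryPromise: `uploadStream` is any function with
// the same (options, callback) shape as the upload_stream call in step 9.
function uploadBuffer(uploadStream, buffer, options) {
  return new Promise((resolve, reject) => {
    const stream = uploadStream(options, (error, result) => {
      if (error) {
        reject(error);
        return; // do not fall through to resolve()
      }
      resolve(result);
    });
    stream.end(buffer);
  });
}

// Stand-in for the real uploader (an assumption, for demonstration only):
// it "uploads" by reporting the byte count, and fails on an empty buffer.
function fakeUploadStream(options, callback) {
  return {
    end(buffer) {
      setImmediate(() => {
        if (!buffer || buffer.length === 0) {
          callback(new Error('empty upload'));
        } else {
          callback(null, { public_id: options.public_id, bytes: buffer.length });
        }
      });
    }
  };
}

// Usage with the real SDK (assumes `cloudinary` is configured as in step 6):
//   uploadBuffer(
//     (opts, cb) => cloudinary.v2.uploader.upload_stream(opts, cb),
//     shotResult, cloudinary_options);
```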
10. In the end, I cannot use the simple for loop over doScreenCapture as before. There is now one promise for each site, so I collect the outcomes for all sites in an array, and when all of them have finished I exit the process.
That is, replace the following for loop:
for (var i = 0; i < news_sites.length; i++) {
  doScreenCapture(news_sites[i]['url'], news_sites[i]['name']);
}
with:
async function doSnapshots(news_sites) {
  let cloudinary_promises = [];
  for (let i = 0; i < news_sites.length; i++) {
    try {
      // await also waits for the upload, so sites are handled one at a time
      let cloudinary_snapshot = await doScreenCapture(
        news_sites[i]['url'], news_sites[i]['name']);
      if (cloudinary_snapshot) {
        cloudinary_promises.push(cloudinary_snapshot);
      }
    } catch(e) {
      console.error(`[${news_sites[i]['name'] || 'Unknown site'}] Error in snapshotting news`, e);
    }
  }

  Promise.all(cloudinary_promises).then(function(val) {
    process.exit();
  });
}

doSnapshots(news_sites);
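The goal stated in step 10 is to exit once every upload has finished, whether it resolved or rejected. Promise.all gives up at the first rejection, so if you ever collect un-awaited promises, Promise.allSettled (available in Node.js 12.9 and later) is one alternative that waits for all of them. A minimal sketch, in which settleAll is a hypothetical helper name:

```javascript
// Hypothetical helper: waits for every promise to settle and splits the
// outcomes into successful values and failure reasons.
async function settleAll(promises) {
  const outcomes = await Promise.allSettled(promises);
  return {
    uploaded: outcomes.filter((o) => o.status === 'fulfilled').map((o) => o.value),
    failed: outcomes.filter((o) => o.status === 'rejected').map((o) => o.reason)
  };
}

// Usage at the end of doSnapshots (`promises` is the array collected in the loop):
//   const { failed } = await settleAll(promises);
//   process.exit(failed.length > 0 ? 1 : 0);
```

Exiting with a non-zero code when something failed makes the failure visible in the logs of whatever runs the job.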
11. Run yarn start, and you should see a result like the one below:
The url and secure_url fields are the Cloudinary urls of the screenshot image. If you copy the secure_url and open it in a browser, you will see the image. You can also see that the url ends with newsshot/ followed by the timestamp and site name, with a .png extension:
After Part 2, snapshot.js should look like this:
const puppeteer = require('puppeteer');
const cloudinary = require('cloudinary');

cloudinary.config({
  cloud_name: process.env.CLOUD_NAME,
  api_key: process.env.API_KEY,
  api_secret: process.env.API_SECRET
});

async function doScreenCapture(url, site_name) {
  const d = new Date();
  const current_time = `${d.getFullYear()}_${d.getMonth()+1}_${d.getDate()}_${d.getHours()}_${d.getMinutes()}`;
  const cloudinary_options = {
    public_id: `newsshot/${current_time}_${site_name}`
  };

  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, {waitUntil: 'domcontentloaded'});

  let shotResult = await page.screenshot({fullPage: true})
    .then((result) => {
      console.log(`${site_name} got some results.`);
      return result;
    }).catch(e => {
      console.error(`[${site_name}] Error in snapshotting news`, e);
      return false;
    });

  // Close the browser before returning; a statement after the return never runs.
  await browser.close();

  if (shotResult) {
    return cloudinaryPromise(shotResult, cloudinary_options);
  } else {
    return null;
  }
}

const news_sites = [
  {
    name: 'reuters',
    url: 'https://www.reuters.com/'
  }, {
    name: 'reuters_china',
    url: 'https://cn.reuters.com/'
  }, {
    name: 'reuters_japan',
    url: 'https://jp.reuters.com/'
  }, {
    name: 'reuters_germany',
    url: 'https://de.reuters.com/'
  }, {
    name: 'reuters_ara',
    url: 'https://ara.reuters.com/'
  }
];

function cloudinaryPromise(shotResult, cloudinary_options) {
  return new Promise(function(res, rej) {
    cloudinary.v2.uploader.upload_stream(cloudinary_options,
      function (error, cloudinary_result) {
        if (error) {
          console.error('Upload to cloudinary failed: ', error);
          rej(error);
          return; // do not fall through to res()
        }
        console.log(cloudinary_result);
        res(cloudinary_result);
      }
    ).end(shotResult);
  });
}

async function doSnapshots(news_sites) {
  let cloudinary_promises = [];
  for (let i = 0; i < news_sites.length; i++) {
    try {
      let cloudinary_snapshot = await doScreenCapture(
        news_sites[i]['url'], news_sites[i]['name']);
      if (cloudinary_snapshot) {
        cloudinary_promises.push(cloudinary_snapshot);
      }
    } catch(e) {
      console.error(`[${news_sites[i]['name'] || 'Unknown site'}] Error in snapshotting news`, e);
    }
  }

  Promise.all(cloudinary_promises).then(function(val) {
    process.exit();
  });
}

doSnapshots(news_sites);
Part 3: schedule work on Heroku
1. Create a git repo on github.com and call it, say, NewsShotProject.
2. Set up a git repo in the project folder:
$ git init
Add a .gitignore file under the project directory and add the following to it:
# See https://help.github.com/ignore-files/ for more about ignoring files.

# dependencies
/node_modules

# env files
.env
3. Push the changes to your github repo:
$ git add .
$ git commit -m "Snapshot with Puppeteer and Cloudinary"
$ git remote add origin git@github.com:<your git account>/NewsShotProject.git
$ git push -u origin master
4. Create a new Heroku app:
5. In the Deploy tab, connect your github repo, then deploy the code.
6. Then in the Settings tab, configure the environment variables that you have in your .env file. Because we do not check in .env (it is in .gitignore), we set the values securely on the Heroku Settings tab instead (which is good).
CLOUD_NAME=<your cloudinary cloud name>
API_KEY=<your cloudinary api key>
API_SECRET=<your cloudinary api secret>
7. Then go to the Resources tab and add Heroku Scheduler.
After the scheduler is provisioned, click on “Heroku Scheduler”.
8. Now we need to set up what the scheduler should do. Click on Add new job:
Then configure yarn start to run daily, and click Save:
You are all done. Every day, your Cloudinary account will receive screenshots of the news sites you specified.
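One thing to watch for on Heroku: as far as I understand from Puppeteer's troubleshooting guide, headless Chrome usually needs the no-sandbox flags inside a dyno, plus a buildpack that installs Chrome's system libraries. The sketch below shows one way to switch the launch options by environment. The helper name launchOptionsFor is hypothetical, and detecting Heroku through the DYNO environment variable is an assumption of this sketch.

```javascript
// Hypothetical helper: picks puppeteer.launch() options for the environment.
// Assumes a Heroku dyno can be recognised by the DYNO environment variable.
function launchOptionsFor(env = process.env) {
  if (env.DYNO) {
    // Chrome's sandbox typically cannot start inside the dyno container.
    return { args: ['--no-sandbox', '--disable-setuid-sandbox'] };
  }
  return {}; // local machine: keep Puppeteer's defaults
}

// Usage: const browser = await puppeteer.launch(launchOptionsFor());
```

Running without the sandbox is less isolated, so it is best kept to sites you trust.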
The full code of this experiment can be found on my github (https://github.com/viviancpy/NewsShotProject).
Happy coding.