Phantom of the Output — Webpage Screenshots with PhantomJS

My adventures with an abandoned browser.

I needed a ton of webpage screenshots to enhance an AI model. And by a ton, I mean 500. My first thought was to open up my browser, head to each webpage, and take a screenshot using a keyboard shortcut. But doing this 500 times would’ve taken a long time, and each screenshot needed to have the same width and height. So I asked myself: what’s the best move? Here are my lessons from using the first alternative I found: PhantomJS.

PhantomJS: An Invisible Browser

It turns out that you can visit websites without seeing them, through what’s called a headless browser. You work with this invisible browser through scripts, which you can write in JavaScript.

PhantomJS was initially released in 2011 as an open source project. While it was written in C++, it provides a JavaScript API for automated navigation, screenshots, and more. It has been used in the past by companies like Yahoo!, Twitter, LinkedIn, Netflix, and Time Warner Cable. PhantomJS has seen its rise, but also its fall. The last stable release was in 2016, and development was later suspended due to a lack of active contribution. For this reason, many people have moved on from the browser. There are now better options out there like Headless Chrome, but out of pure curiosity, I decided to try out PhantomJS for myself.

Since PhantomJS uses WebKit, a real layout and rendering engine, and can render anything on a webpage, it can capture a webpage as a screenshot. Here’s the simplest example of taking one screenshot of the Google homepage. You give it the URL and it creates an image for you:

var page = require('webpage').create();
page.open('https://www.google.com/', function() {
    page.render('google.png');
    phantom.exit();
});

Since I wanted to generate a bunch of webpage screenshots in one go, all with specific dimensions, I took this concept a step further.

Snappin’ Screenshots

I needed an efficient script that could run through an array of URLs, apply the dimensions of the image to be produced, and output images in an organized fashion. The automatic screenshot generator I used is a modified script based on Mario Ranftl’s work (thanks Mario). The first step is to save this code into a .js file you can call project.js:

var PAGE_WIDTH = 1280;
var PAGE_HEIGHT = 900;
// number used in the first file name; raise it when resuming a batch
var pageNumber = 1;
var URLS = [
    "https://www.google.com/",
    "https://www.apple.com/",
    "https://www.bk.com/"
];
// phantomjs page object, helper flag, and index of the current URL
var page = require('webpage').create(),
    loadInProgress = false,
    pageIndex = 0;
// set clip and viewport based on PAGE_WIDTH and PAGE_HEIGHT constants
if (PAGE_WIDTH > 0 && PAGE_HEIGHT > 0) {
    page.viewportSize = {
        width: PAGE_WIDTH,
        height: PAGE_HEIGHT
    };
    page.clipRect = {
        top: 0,
        left: 0,
        width: PAGE_WIDTH,
        height: PAGE_HEIGHT
    };
}
// page handlers
page.onLoadStarted = function() {
    loadInProgress = true;
    console.log('page ' + (pageIndex + pageNumber) + ' load started');
};
page.onLoadFinished = function() {
    loadInProgress = false;
    // render whatever has loaded, then move on to the next URL
    page.render("imagesBatch/output" + (pageIndex + pageNumber) + "_" + PAGE_WIDTH + "x" + PAGE_HEIGHT + ".png");
    console.log('page ' + (pageIndex + pageNumber) + ' load finished');
    pageIndex++;
};
// try to load/process a new page every 500ms
setInterval(function() {
    if (!loadInProgress && pageIndex < URLS.length) {
        console.log("image " + (pageIndex + pageNumber));
        page.open(URLS[pageIndex]);
    }
    if (pageIndex === URLS.length) {
        console.log("image render complete!");
        phantom.exit();
    }
}, 500);
console.log('Number of URLS: ' + URLS.length);

We’ll refer to this script whenever we run a batch of URLs. The URLS array holds your list of URLs (in the script above there are 3). PAGE_WIDTH and PAGE_HEIGHT set the width and height, in pixels, of each screenshot. The page number, which starts at 1, is added to the index of the current URL, so the count goes up by one every time a new screenshot is produced. It keeps track of the images produced and also determines the name of each image file, for example output1_1280x900.png. When we run this script, images will appear in a folder called imagesBatch (the folder name is defined in the script). The path is relative, so the folder should appear in the directory you run the script from, which in the steps below is the same location as the script file, project.js. To try it out for yourself, here are the steps:

  1. Download PhantomJS to your computer.
  2. Create a blank file called project.js and save it to your Desktop.
  3. Copy the big block of code above and paste it into project.js. Make sure to save the file.
  4. Open up your terminal. Type the command cd Desktop to navigate to the location of your project.js. Next, enter phantomjs project.js to run the script using PhantomJS.

While the script is running, you’ll see screenshots appear in a newly created folder called imagesBatch on your Desktop. Once the script has gone through the array of URLs, the program will terminate.
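The file names are worth understanding before you start re-running batches. Here is a small sketch of how each name is put together. The helper outputPath is a hypothetical name used only for illustration; it is not part of PhantomJS, and it simply mirrors the string that project.js passes to page.render.

```javascript
// Hypothetical helper (not part of PhantomJS): builds the same path that
// page.onLoadFinished passes to page.render() in project.js.
function outputPath(pageIndex, pageNumber, width, height) {
    return "imagesBatch/output" + (pageIndex + pageNumber) +
        "_" + width + "x" + height + ".png";
}

// With the defaults in project.js, the three URLs map to these files.
var names = [0, 1, 2].map(function (i) {
    return outputPath(i, 1, 1280, 900);
});
console.log(names.join("\n"));
```

Changing pageNumber shifts every name by the same amount, which is why it matters when a batch is resumed.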

Twists in the Road

As this was my first attempt at working with PhantomJS, not everything went as planned. While one screenshot might come out beautifully, another might come out blank. This may have to do with how much time each page gets to load before it is rendered, but changing the timing doesn’t always solve the issue. I played around with different values for the interval in the script and found that half a second worked decently. In theory, the script should be able to handle large lists of URLs, but in practice, it could only handle about 10 at a time before freezing up.
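One thing that may help with blank screenshots is to check whether the page actually loaded and to pause briefly before rendering. The sketch below is a different technique from the setInterval polling in project.js: it chains one page after another through the callback of page.open. The names captureAll, jobs, delayMs, done, and wait are hypothetical, and the approach is an assumption on my part rather than a tested fix. It relies only on page.open(url, callback), whose callback receives 'success' or 'fail', and on page.render(path).

```javascript
// Sketch only: captureAll and its parameters are hypothetical names.
// `page` is expected to behave like a PhantomJS webpage object, i.e. to
// offer open(url, callback) and render(path). `jobs` is an array of
// { url: ..., path: ... } objects. `wait` is optional and defaults to a
// thin wrapper around setTimeout.
function captureAll(page, jobs, delayMs, done, wait) {
    wait = wait || function (fn, ms) { setTimeout(fn, ms); };
    var failed = [];

    function next(i) {
        if (i >= jobs.length) {
            // Every job has been attempted; report the URLs that failed.
            done(failed);
            return;
        }
        page.open(jobs[i].url, function (status) {
            if (status !== "success") {
                // Record the failure instead of rendering a blank image.
                failed.push(jobs[i].url);
                next(i + 1);
                return;
            }
            // Give late-loading content a moment before taking the shot.
            wait(function () {
                page.render(jobs[i].path);
                next(i + 1);
            }, delayMs);
        });
    }

    next(0);
}
```

Inside PhantomJS, this could be called with require('webpage').create() as page, a delay such as 1000, and a done callback that logs the failed URLs and then calls phantom.exit(). The list of failed URLs tells you what to retry.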

When I had a lengthy list and the script halted early, I would check which screenshots had been taken, delete those URLs from the list, and set the page number variable to the number that comes after the last screenshot taken. Not doing this would result in the next batch of image files replacing the current ones, since the page number defines the names of the image files. There were also times when URLs in the array were skipped over, so I had to make sure that any skipped URLs were captured the next time I ran the script. If I had a list of 30 URLs, I would expect to have to run the script about four times in order to get all the output I needed.
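This bookkeeping could probably be reduced by giving every URL a fixed number based on its position in the list and skipping any URL whose image already exists. The sketch below shows that idea; it is a different approach from the delete-and-renumber routine described above. The names pendingJobs and exists are hypothetical. The exists parameter is any function that takes a path and returns true if the file is already on disk; I would expect the exists function of PhantomJS's fs module to fit, but treat that as an assumption.

```javascript
// Hypothetical helper: pairs every URL with a fixed output number (its
// position in the list, starting at 1) and keeps only the URLs whose image
// does not exist yet. `exists` is supplied by the caller.
function pendingJobs(urls, width, height, exists) {
    var jobs = [];
    for (var i = 0; i < urls.length; i++) {
        var path = "imagesBatch/output" + (i + 1) +
            "_" + width + "x" + height + ".png";
        if (!exists(path)) {
            jobs.push({ url: urls[i], path: path });
        }
    }
    return jobs;
}
```

Because each URL always maps to the same file name, a re-run cannot overwrite an earlier screenshot with a different page, and URLs that were skipped show up again automatically.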

Using PhantomJS for smaller-scale, sandbox work makes sense, but I would not use it professionally. It’s no longer maintained, so as the Web evolves, it will become less and less reliable. There are now great alternatives like Headless Chrome, which lets you run Chrome behind the scenes and control it with JavaScript.


Using a script with PhantomJS was a step up from trying to get all of those webpage screenshots individually. Despite its challenges, the adventure revealed to me a brand new way to work with browsers. It just goes to show some amazing ways that you can interact with the Web.