Web Scraping in 2017: Advanced Headless Chrome Tips & Tricks

Stop doing curls

Martin Tapia
Aug 28, 2017 · 7 min read

Now that PhantomJS’ development has stopped, Headless Chrome is in the spotlight — and people love it, including us. At Phantombuster, scraping is a huge part of what we do, and we use Headless Chrome extensively.

In this blog post, I will give you a quick tour of the Headless Chrome ecosystem and show you what we’ve learned after having scraped millions of pages.

TL;DR:

  • Web scraping with Headless Chrome is easy, even more so when you’re aware of these tips & tricks;
  • Headless browser visitors can be detected but nobody does it.

A quick recap of Headless Chrome

Headless Chrome means that we can now harness the speed and power of Chrome for all our scraping and automation needs, with the features that come bundled with the most used browser in the world: support for all websites, a fast and modern JS engine and the great DevTools API. Awesome! 👌

Which tool should I use to control Headless Chrome?

Disclaimer: the last one is ours!

I don’t know if you’ve heard, but there are a lot of NodeJS libraries for driving Chrome’s new --headless mode. Each one has its own specialty, and we’ve just added our own to the mix, NickJS. How could we feel at ease claiming we’re scraping experts without having our very own scraping library? 😉

There is even a C++ API, and the community is releasing libraries in other languages, like this one in Go. That said, we recommend using a NodeJS tool as it’s the same language as what’s interpreted in the pages (you’ll see below how that can come in handy).

Scraping? Isn’t it illegal?

Anyway, this is a technical article, so we won’t delve into the legality question of particular scraping practices. In any case, you should always strive to respect the target website’s ToS. We’re not responsible for any damages caused by what you’ll learn in this article 😉

Cool stuff we’ve learned thus far

Put the cookies back in the cookie jar 🍪

But sometimes login forms are so hardened that restoring a previously saved session cookie is the only solution to get in. Some sites will send emails or text messages with codes when they feel something is off. We don’t have time for that. Just open the page with your session cookie already set.

Bypassing the LinkedIn login form by setting a cookie

A famous example of that is LinkedIn. Setting the li_at cookie will guarantee your scraper bot access to their social network (please note: we encourage you to respect your target website’s ToS).
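As a minimal sketch of the idea (not NickJS’s own API): set the saved cookie first, then navigate, so the very first request already carries the session. The `page` object is assumed to be a Puppeteer-style page exposing `setCookie()` and `goto()`; the helper names `buildSessionCookie` and `openWithSessionCookie` are made up for this illustration.

```javascript
// Sketch: open a page with a previously saved session cookie already set.
// ASSUMPTION: `page` is a Puppeteer-style page exposing setCookie() and goto().
// buildSessionCookie and openWithSessionCookie are hypothetical helper names.

// Build the cookie object; the value comes from a session you logged into by hand.
function buildSessionCookie(name, value, domain) {
  if (!value) {
    throw new Error(`Missing value for session cookie "${name}"`);
  }
  return { name, value, domain, path: '/' };
}

// Set the cookie before navigating so the first request is already authenticated.
async function openWithSessionCookie(page, url, cookie) {
  await page.setCookie(cookie);
  await page.goto(url);
  return page;
}
```

For LinkedIn this could be called as `openWithSessionCookie(page, 'https://www.linkedin.com/', buildSessionCookie('li_at', process.env.LI_AT, 'www.linkedin.com'))`, where the `LI_AT` environment variable is an assumption about where you keep the value copied from your browser’s DevTools.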

We believe websites like LinkedIn can’t afford to block a real-looking browser with a valid session cookie. It’s too risky for them as false positives would trigger too many support requests from angry users!

jQuery will never let you down

A lot of sites already come with jQuery so you just have to evaluate a few lines in the page to get your data. If that’s not the case, it’s easy to inject it:

Scraping the Hacker News homepage with jQuery (yes, we know they have an API)
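Here is a rough sketch of that inject-then-evaluate flow, again assuming a Puppeteer-style `page` (`evaluate()` and `addScriptTag()`) rather than NickJS. The function names and the jQuery version in the URL are illustrative choices, and any selector you pass in is an assumption about the target site’s markup.

```javascript
// Sketch: make sure jQuery is available in the page, then scrape with it.
// ASSUMPTION: `page` is a Puppeteer-style page exposing evaluate() and addScriptTag().
const JQUERY_URL = 'https://code.jquery.com/jquery-3.2.1.min.js';

// Inject jQuery only when the page does not ship it already.
// Resolves to true when an injection was needed.
async function ensureJquery(page) {
  const present = await page.evaluate(() => typeof window.jQuery === 'function');
  if (!present) {
    await page.addScriptTag({ url: JQUERY_URL });
  }
  return !present;
}

// Collect the text and href of every element matching `selector`.
async function scrapeLinks(page, selector) {
  await ensureJquery(page);
  // This callback runs inside the page: `$` is the page's jQuery, not a Node module.
  return page.evaluate(
    (sel) => $(sel).map((i, el) => ({ title: $(el).text(), url: $(el).attr('href') })).get(),
    selector
  );
}
```

On the Hacker News homepage you would call something like `scrapeLinks(page, '.titleline > a')`; that selector is a guess at their current markup and will need adjusting whenever it changes.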

What do India, Russia and Pakistan have in common?

Screenshot from anti-captcha.com (they’re not kidding 😀)

The answer is CAPTCHA solving services*. You can buy solved CAPTCHAs by the thousand for a few dollars, and each one generally takes less than 30 seconds. Keep in mind that it’s usually more expensive during their nighttime as there are fewer humans available.

A simple Google search will give you multiple choices of APIs for solving any type of CAPTCHA, including the latest reCAPTCHAs from Google ($2 per 1000).

Hooking your scraper code to these services is as easy as making an HTTP request. Congratulations, your bot is now a human!
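To give an idea of what that HTTP hookup tends to look like, here is a sketch of a submit-then-poll loop. Everything service-specific is hypothetical: the `captcha-solver.example` base URL, the `/submit` and `/result` paths, and the `taskId`, `status` and `token` fields are placeholders, so check your provider’s documentation for the real ones. The `postJson(url, body)` transport is passed in by the caller and is assumed to POST JSON and resolve to the parsed reply.

```javascript
// Sketch of a submit-then-poll client for a CAPTCHA solving service.
// HYPOTHETICAL: the base URL, the /submit and /result paths and every field name.
// `postJson(url, body)` is supplied by the caller and resolves to the parsed JSON reply.
async function solveCaptcha(postJson, job, opts = {}) {
  const { baseUrl = 'https://captcha-solver.example', pollMs = 5000, maxPolls = 24 } = opts;

  // 1. Submit the job (for example the page URL and the site key).
  const { taskId } = await postJson(`${baseUrl}/submit`, job);

  // 2. Poll until a human on the other side has solved it, or give up.
  for (let attempt = 0; attempt < maxPolls; attempt++) {
    await new Promise((resolve) => setTimeout(resolve, pollMs));
    const reply = await postJson(`${baseUrl}/result`, { taskId });
    if (reply.status === 'ready') return reply.token;
    if (reply.status === 'error') throw new Error(`Solver failed: ${reply.message}`);
  }
  throw new Error('Timed out waiting for the CAPTCHA solution');
}
```

The defaults (24 polls, 5 seconds apart) cap the wait at about two minutes, which leaves headroom over the roughly 30 seconds quoted above.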

On our platform, we make it easy for our users to solve CAPTCHAs should they require it. Our buster library can make calls to multiple solving services:

Handling a CAPTCHA problem like it’s nothing

*It’s a joke. I have to say it otherwise I receive emails…

Wait for DOM elements, not seconds

But that’s not how it should be done. Our 3 steps theory applies to any scraping scenario: you should wait for the specific DOM elements you want to manipulate next. It’s faster and clearer, and you’ll get more accurate errors if something goes wrong.
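A small sketch of what waiting on an element can look like, assuming a Puppeteer-style `page` with `waitForSelector()` and `click()`. The helper name `clickWhenReady` and the 10-second default are arbitrary choices for the example.

```javascript
// Sketch: wait for the element you need next instead of sleeping a fixed time.
// ASSUMPTION: `page` is a Puppeteer-style page exposing waitForSelector() and click().
async function clickWhenReady(page, selector, timeoutMs = 10000) {
  try {
    await page.waitForSelector(selector, { timeout: timeoutMs });
  } catch (err) {
    // Name the selector so the failure points at the exact step that broke.
    throw new Error(`"${selector}" did not appear within ${timeoutMs} ms: ${err.message}`);
  }
  await page.click(selector);
}
```

The wait returns as soon as the element exists, so a fast page costs no extra time, and a failure names the selector that never showed up instead of surfacing later as a vague error.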

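It’s true that in some cases it might be necessary to fake human delays. A simple await Promise.delay(2000 + Math.random() * 3000) will do the trick: it pauses for a random 2 to 5 seconds (Promise.delay comes from the Bluebird promise library, not from native promises).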
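If you are not using Bluebird, the same pause can be sketched with native promises. The helper names below are made up for the example, and the random source is injectable only to make the timing easy to test.

```javascript
// Sketch: a human-looking random pause using only native promises.

// Resolve after `ms` milliseconds.
function delay(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Pick a duration between minMs and minMs + spreadMs (2 to 5 seconds by default).
function humanPauseMs(minMs = 2000, spreadMs = 3000, random = Math.random) {
  return Math.round(minMs + random() * spreadMs);
}

// Pause for a randomly chosen duration.
async function humanPause(minMs, spreadMs) {
  await delay(humanPauseMs(minMs, spreadMs));
}
```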

MongoDB 👍

JSON-LD & microdata exploitation

Kidding aside, some sites will be easier than others. Let’s take Macys.com as an example. All of their product pages come with the product’s data in JSON-LD form directly present in the DOM. Seriously, go to any of their product pages and run: JSON.parse(document.querySelector("#productSEOData").innerText)
You’ll get a nice object ready to be inserted into MongoDB. No real scraping necessary!
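The same trick generalizes to other sites. As a sketch, the helper below collects every JSON-LD block on a page using the standard `script[type="application/ld+json"]` selector rather than a site-specific id. The function name `readJsonLd` is made up, and it assumes it runs where `document` exists (the DevTools console, or a function evaluated in the page by your headless library).

```javascript
// Sketch: collect every JSON-LD block found in a document.
// Defaults to the page's own `document`; pass another Document-like object to test it.
function readJsonLd(doc = document) {
  const nodes = Array.from(doc.querySelectorAll('script[type="application/ld+json"]'));
  const items = [];
  for (const node of nodes) {
    try {
      items.push(JSON.parse(node.textContent));
    } catch (err) {
      // Skip malformed blocks rather than failing the whole page.
    }
  }
  return items;
}
```

Because the function references nothing outside its own body, it can be pasted into the console as is, or handed to a Puppeteer-style `page.evaluate(readJsonLd)`.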

Intercepting network requests

Because we’re using the DevTools API, the code we write has the equivalent power of a human using Chrome’s DevTools. That means your bot can intercept, examine and even modify or abort any network request.

We tested this by downloading a PDF CV export from LinkedIn. Clicking the “Save to PDF” button from a profile triggers an XHR in which the response content is a PDF file. Here’s one way of intercepting the file and writing it to disk:

Here “tab” is a NickJS tab instance from which we get the Chrome Remote Interface API.
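A sketch of that interception, written against the DevTools protocol’s Network domain as exposed by a chrome-remote-interface client. It assumes `Network.enable()` has already been called and that you pass in the `Network` domain object yourself; how you get it from your library of choice is left out. The function name and the URL pattern you match on are illustrative.

```javascript
// Sketch: capture the body of the first response whose URL matches `urlPattern`.
// ASSUMPTION: `Network` is the Network domain of a chrome-remote-interface client
// and Network.enable() has already been called.
function captureResponseBody(Network, urlPattern) {
  return new Promise((resolve, reject) => {
    let wantedId = null;

    // Remember the request id of the first matching response.
    Network.responseReceived(({ requestId, response }) => {
      if (wantedId === null && urlPattern.test(response.url)) {
        wantedId = requestId;
      }
    });

    // The body can only be fetched once the response has fully loaded.
    Network.loadingFinished(async ({ requestId }) => {
      if (requestId !== wantedId) return;
      try {
        const { body, base64Encoded } = await Network.getResponseBody({ requestId });
        resolve(Buffer.from(body, base64Encoded ? 'base64' : 'utf8'));
      } catch (err) {
        reject(err);
      }
    });
  });
}
```

Start the capture before clicking the button, then write the result to disk with `fs.writeFileSync('cv.pdf', await pending)`. Binary responses such as PDFs come back base64-encoded, which is why the `base64Encoded` flag decides how the buffer is built.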

By the way, the DevTools protocol is evolving rapidly. There’s now a way to set how and where the incoming files are downloaded with Page.setDownloadBehavior(). We have yet to test it but it looks promising!

Ad-blocking

Example of an extremely aggressive request filter. The blacklist further blocks requests that passed the whitelist.

In the same vein, we can speed up our scraping by blocking unnecessary requests. Analytics, ads and images are typical targets. However, you have to keep in mind that it will make your bot less human-like (for example LinkedIn will not serve their pages properly if you block all images — we’re not sure if it’s deliberate or not).

In NickJS, we let the user specify a whitelist and a blacklist populated with regular expressions or strings. The whitelist is particularly powerful but can easily break your target website if you’re not careful.
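To make the whitelist/blacklist interplay concrete, here is a sketch of such a filter as a pure function. The semantics are our reading of the description (a non-empty whitelist must match, then the blacklist can still veto), not a copy of NickJS’s implementation, and the function names are made up.

```javascript
// Sketch of a whitelist/blacklist request filter.
// ASSUMPTION: semantics are inferred from the description, not taken from NickJS.

// A rule is either a RegExp or a plain substring.
function matches(url, rule) {
  if (rule instanceof RegExp) {
    rule.lastIndex = 0; // keep /g and /y regexes from carrying state between calls
    return rule.test(url);
  }
  return url.includes(rule);
}

function shouldAllow(url, whitelist = [], blacklist = []) {
  // An empty whitelist lets everything through to the blacklist check.
  const passedWhitelist = whitelist.length === 0 || whitelist.some((rule) => matches(url, rule));
  if (!passedWhitelist) return false;
  // The blacklist further blocks requests that passed the whitelist.
  return !blacklist.some((rule) => matches(url, rule));
}
```

With Puppeteer, for instance, this could be wired up by calling `page.setRequestInterception(true)` and then, in a `page.on('request', ...)` handler, choosing between `request.continue()` and `request.abort()` based on `shouldAllow(request.url(), whitelist, blacklist)`.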

The DevTools protocol also has Network.setBlockedURLs() which takes an array of strings with wildcards as input.
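A sketch of how that call might be used, assuming a chrome-remote-interface style `Network` domain object. The `urls` parameter name follows the protocol documentation as we understand it, so double-check it against the protocol version you target. The wrapper name and the example patterns are illustrative.

```javascript
// Sketch: block URL patterns at the protocol level.
// ASSUMPTION: `Network` is the Network domain of a chrome-remote-interface client,
// and setBlockedURLs takes its patterns in a `urls` array.
async function blockUrls(Network, patterns) {
  // Keep only non-empty strings; this call takes wildcard strings, not RegExps.
  const urls = patterns.filter((p) => typeof p === 'string' && p.length > 0);
  await Network.enable();
  await Network.setBlockedURLs({ urls });
  return urls;
}
```

A call such as `blockUrls(Network, ['*google-analytics.com*', '*.png', '*.jpg'])` would cut analytics and most images, with the LinkedIn caveat above in mind.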

What’s more, new versions of Chrome will come with Google’s own built-in “ad-blocker” — it’s more like an ad “filter” really. The protocol already has an endpoint called Page.setAdBlockingEnabled() for it (we haven’t tested it yet).

That’s it for our tips & tricks! 🙏

Headless Chrome detection

This is basically a big cat-and-mouse game between angry sys-admins and ingenious bot makers… But you know what? We’ve never seen these methods implemented in the wild. Yes, it is technically possible to detect automated visitors. But who’s going to be willing to face potential false positives? This is especially risky for large-audience websites.

If you know of websites that have these detection features running in production, we’d love to hear from you! 😃

Closing thoughts

By the way, this Franciskim.co “I Don’t Need No Stinking API” article inspired our post. Thanks! Also, check out this post for detailed instructions on how to get started with Puppeteer.

In the next article, I’ll write about “bot mitigation” companies like Distil Networks and the wonderful world of HTTP proxies and IP address allocations.

What next? Check out our library at NickJS.org and our scraping & automation platform at Phantombuster.com. You might also be interested in our theory of the 3 scraping steps.
