Web Scraping in 2017: Advanced Headless Chrome Tips & Tricks
Now that PhantomJS’ development has stopped, Headless Chrome is in the spotlight — and people love it, including us. At Phantombuster, scraping is a huge part of what we do, and we use Headless Chrome extensively.
In this blog post, I will give you a quick tour of the Headless Chrome ecosystem and show you what we’ve learned after having scraped millions of pages.
- There are a lot of libraries for controlling Chrome, pick the one you like;
- Web scraping with Headless Chrome is easy, even more so when you’re aware of these tips & tricks;
- Headless browser visitors can be detected but nobody does it.
A quick recap of Headless Chrome
Headless Chrome is basically PhantomJS, but made by Google’s Chrome team. They say they are committed to the project and will maintain it for the foreseeable future.
It means that we can now harvest the speed and power of Chrome for all our scraping and automation needs, with the features that come bundled with the most used browser in the world: support of all websites, fast and modern JS engine and the great DevTools API. Awesome! 👌
Which tool should I use to control Headless Chrome?
I don’t know if you’ve heard, but there are a lot of NodeJS libraries for exploiting Chrome’s new
--headless mode. Each one has its specificity, and we’ve just added our own to the mix, NickJS. How could we feel at ease claiming we’re scraping experts without having our very own scraping library 😉
There is even a C++ API and the community is releasing libraries in other languages, like this one in Go. That said, we recommend using a NodeJS tool as it’s the same language as what’s interpreted in the pages (you’ll see below how that can come handy).
Scraping? Isn’t it Illegal?
We don’t really want to stir this endless debate, but less than two weeks ago LinkedIn was ordered by a US District Judge to allow scraping of public profiles. As of now, it’s only a preliminary injunction and the lawsuit still has to take place. LinkedIn will fight this for sure, but you can rest assured that we’ll monitor the situation with close attention. It’s interesting because we talk a lot about LinkedIn in this article.
Anyway, this is a technical article, so we won’t delve into the legality question of particular scraping practices. In any case, you should always strive to respect the target website’s ToS. We’re not responsible for any damages caused by what you’ll learn in this article 😉
Cool stuff we’ve learned thus far
Listed below are some tips & tricks used almost daily at Phantombuster. The code examples are using our own scraping library but they’re easy to rewrite for any other Headless Chrome tool. We’re really more interested in sharing the concepts here.
Put the cookies back in the cookie jar 🍪
Scraping with a full-featured browser gives you peace of mind. No need to worry about CORS, sessions, cookies, CSRF and other modern web stuff. Just simulate a human and you’re in.
But sometimes login forms are so hardened that restoring a previously saved session cookie is the only solution to get in. Some sites will send emails or text messages with codes when they feel something is off. We don’t have time for that. Just open the page with your session cookie already set.
A famous example of that is LinkedIn. Setting the
li_at cookie will guarantee your scraper bot access to their social network (please note: we encourage you to respect your target website ToS).
We believe websites like LinkedIn can’t afford to block a real-looking browser with a valid session cookie. It’s too risky for them as false-positives would trigger too many support requests from angry users!
jQuery will never let you down
If there’s one important thing we’ve learned, it’s this one. Extracting data from a page with jQuery is very easy. In retrospect, it’s obvious. Websites give you a highly structured, queryable tree of data-containing elements (it’s called the DOM) — and jQuery is a very efficient DOM query library. So why not use it to scrape? This “trick” has never failed us.
A lot of sites already come with jQuery so you just have to evaluate a few lines in the page to get your data. If that’s not the case, it’s easy to inject it:
What do India, Russia and Pakistan have in common?
The answer is CAPTCHA solving services*. You can buy them by the thousands for a few dollars and it generally takes less than 30 seconds per CAPTCHA. Keep in mind that it’s usually more expensive during their night time as there are fewer humans available.
A simple Google search will give you multiple choices of APIs for solving any type of CAPTCHA, including the latest reCAPTCHAs from Google ($2 per 1000).
Hooking your scraper code to these services is as easy as making an HTTP request. Congratulations, your bot is now a human!
*It’s a joke. I have to say it otherwise I receive emails…
Wait for DOM elements, not seconds
We often see scraping beginners make their bot wait for 5 or 10 seconds after having opened a page or clicked a button — they want to be sure the action they did has had time to have an effect.
But that’s not how it should be done. Our 3 steps theory applies to any scraping scenario: you should wait for the specific DOM elements you want to manipulate next. It’s faster, clearer and you’ll get more accurate errors if something goes wrong:
It’s true that in some cases it might be necessary to fake human delays. A simple
await Promise.delay(2000 + Math.random() * 3000) will do the trick.
We’ve found MongoDB to be a good fit for most of our scraping jobs. It has a great JS API and the Mongoose ORM is handy. Considering that when you’re using Headless Chrome you’re already in a NodeJS environment, why do without it?
JSON-LD & microdata exploitation
Sometimes scraping is not about making sense of the DOM but more about finding the right “export” button. Remembering this has saved us a lot of time on multiple occasions.
Kidding aside, some sites will be easier than others. Let’s take Macys.com as an example. All of their product pages come with the product’s data in JSON-LD form directly present in the DOM. Seriously, go to any of their product page and run:
JSON.parse(document.querySelector("#productSEOData").innerText)You’ll get a nice object ready to be inserted into MongoDB. No real scraping necessary!
Intercepting network requests
Because we’re using the DevTools API, the code we write has the equivalent power of a human using Chrome’s DevTools. That means your bot can intercept, examine and even modify or abort any network request.
We tested this by downloading a PDF CV export from LinkedIn. Clicking the “Save to PDF” button from a profile triggers an XHR in which the response content is a PDF file. Here’s one way of intercepting the file and writing it to disk:
By the way, the DevTools protocol is evolving rapidly. There’s now a way to set how and where the incoming files are downloaded with
Page.setDownloadBehavior(). We have yet to test it but it looks promising!
In the same vein, we can speed up our scraping by blocking unnecessary requests. Analytics, ads and images are typical targets. However, you have to keep in mind that it will make your bot less human-like (for example LinkedIn will not serve their pages properly if you block all images — we’re not sure if it’s deliberate or not).
In NickJS, we let the user specify a whitelist and a blacklist populated with regular expressions or strings. The whitelist is particularly powerful but can easily break your target website if you’re not careful.
The DevTools protocol also has
Network.setBlockedURLs() which takes an array of strings with wildcards as input.
What’s more, new versions of Chrome will come with Google’s own built-in “ad-blocker” — it’s more like an ad “filter” really. The protocol already has an endpoint called
Page.setAdBlockingEnabled() for it (we haven’t tested it yet).
That’s it for our tips & tricks! 🙏
Headless Chrome detection
Recently an article was published listing multiple ways to detect Headless Chrome visitors. It was also possible with PhantomJS. Both of these describe techniques that range from basic
User-Agent string comparison to more complex stuff like triggering errors and inspecting the stack trace.
This is basically a big cat-and-mouse game between angry sys-admins and ingenious bot makers… But you know what? We’ve never seen these methods implemented in the wild. Yes, it is technically possible to detect automated visitors. But who’s going to be willing to face potential false-positives? This is especially risky for large audience websites.
If you know of websites that have these detection features running in production, we’d love to hear from you! 😃
We believe scraping has never been so easy. With the latest tools at our disposal and the know-how, it can even become a pleasant and fun activity for us developers.
By the way, this Franciskim.co “I Don’t Need No Stinking API” article inspired our post. Thanks! Also, check out this post for detailed instructions on how to get started with Puppeteer.
In the next article, I’ll write about “bot mitigation” companies like Distill Networks and the wonderful world of HTTP proxies and IP address allocations.