How to Use Headless Chrome and Puppeteer with an Authenticated Proxy Server for Web Scraping

The addition of headless mode to Google Chromium, together with Google’s release earlier this year of Puppeteer, a Node.js API for controlling it, has made it very easy for developers to automate web operations such as filling out forms and taking screenshots of web pages. You can use the --proxy-server command-line option to make Chromium use a custom proxy server:

chrome --proxy-server=http://proxy.example.com:8080

Note that chrome has to be an alias for your Chromium executable. Because Chrome does not support the --proxy-server option in non-headless (headful) mode, you must use Chromium instead of Chrome.

If the proxy server requires authentication, the browser will display a prompt asking for a username and password.

When you start Chromium in headless mode, though, you won’t see this prompt, since the browser has no windows. Chromium has no command-line option for passing proxy credentials, and neither Puppeteer’s API nor the underlying Chrome DevTools Protocol (CDP) provides a mechanism to pass them to the browser programmatically. Forcing headless Chromium to use a specific proxy username and password turned out to be far from simple.

Embedding the credentials in the proxy URL, as in the command below, did not work either:

chrome --proxy-server=http://John_Doe:123@Pass!@proxy.example.com:8080

To get around this limitation, you can set up an open local proxy server that forwards traffic to the upstream authenticated proxy, and then point Chromium at the local proxy. Squid and its cache_peer configuration option can be used to build such a proxy chain. The following is an example of a Squid configuration file (squid.conf):

http_port 3128
cache_peer proxy.example.com parent 8080 0 \
  no-query \
  login=John_Doe:123@Pass! \
  connect-fail-limit=99999999 \
  proxy-only \
  name=my_peer
cache_peer_access my_peer allow all

Start Squid with this configuration file, for example by running squid -f squid.conf.

Now that the proxy is running locally on port 3128, Chromium should be able to utilize it:

chrome --proxy-server=http://localhost:3128

This technique becomes laborious if you want to launch the proxy directly from your code or switch proxies on the fly. In that situation, you need to either change the Squid configuration dynamically or run a separate Squid instance for each proxy.
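
To give a sense of what this involves, here is a minimal sketch that generates one Squid configuration per upstream proxy. The helper renderSquidConf, the second proxy, and the peer names are hypothetical and only for illustration. The sketch reuses the directives from the example configuration above, does not escape special characters in credentials, and leaves out writing the files and starting the Squid processes.

```javascript
// Hypothetical helper: renders a squid.conf for one upstream proxy.
// It uses only the directives from the example configuration above.
function renderSquidConf({ localPort, host, port, username, password, peerName }) {
    return [
        `http_port ${localPort}`,
        `cache_peer ${host} parent ${port} 0 \\`,
        '  no-query \\',
        `  login=${username}:${password} \\`,
        '  connect-fail-limit=99999999 \\',
        '  proxy-only \\',
        `  name=${peerName}`,
        `cache_peer_access ${peerName} allow all`,
        '',
    ].join('\n');
}

// One configuration (and one local port) per upstream proxy.
// The second entry is made up for this example.
const upstreams = [
    { host: 'proxy.example.com', port: 8080, username: 'John_Doe', password: '123@Pass!' },
    { host: 'proxy2.example.com', port: 8080, username: 'Jane_Doe', password: 'secret' },
];

const configs = upstreams.map((upstream, index) => renderSquidConf({
    ...upstream,
    localPort: 3128 + index,
    peerName: `peer_${index}`,
}));

console.log(configs[0]);
```

Each generated configuration would still need its own Squid process and its own local port, which is the overhead described above.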

In our experience, this setup was fragile: Squid processes might hang or not start at all, each platform behaved differently, and so on. To address this, we created proxy-chain, a new NPM package that we released as open source on GitHub. With it, you can quickly “anonymize” an authenticated proxy and then use Puppeteer to launch headless Chromium with the following Node.js code:

const puppeteer = require('puppeteer');
const proxyChain = require('proxy-chain');

(async () => {
    const oldProxyUrl = 'http://John_Doe:123@Pass!@proxy.example.com:8080';
    const newProxyUrl = await proxyChain.anonymizeProxy(oldProxyUrl);

    // Prints something like "http://127.0.0.1:45678"
    console.log(newProxyUrl);

    const browser = await puppeteer.launch({
        args: [`--proxy-server=${newProxyUrl}`],
    });

    // Do your magic here...
    const page = await browser.newPage();
    await page.goto('https://www.example.com');
    await page.screenshot({ path: 'example.png' });
    await browser.close();

    // Clean up, forcibly close all pending connections
    await proxyChain.closeAnonymizedProxy(newProxyUrl, true);
})();
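
The example password contains an @ character, which is also the separator between credentials and host in a URL. A cautious way to assemble such a proxy URL is to percent-encode the username and password first. The sketch below uses only built-in JavaScript functions. The helper buildProxyUrl is hypothetical, and whether a given proxy-chain version decodes percent-encoded credentials is an assumption worth checking against its documentation.

```javascript
// Hypothetical helper: assembles a proxy URL with percent-encoded credentials.
function buildProxyUrl({ host, port, username, password }) {
    const user = encodeURIComponent(username);
    const pass = encodeURIComponent(password);
    return `http://${user}:${pass}@${host}:${port}`;
}

const upstreamProxyUrl = buildProxyUrl({
    host: 'proxy.example.com',
    port: 8080,
    username: 'John_Doe',
    password: '123@Pass!',
});

// The "@" inside the password is encoded as %40, so only one "@" remains
// to separate the credentials from the host.
console.log(upstreamProxyUrl);

// Round trip: the built-in URL parser recovers the original values.
const parsed = new URL(upstreamProxyUrl);
console.log(decodeURIComponent(parsed.username), decodeURIComponent(parsed.password));
```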

The proxy-chain package supports both standard HTTP proxy forwarding and HTTP CONNECT tunneling, so it can handle protocols such as HTTPS and FTP. We plan to use many more of the package’s features in our upcoming projects.
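
The difference between the two modes shows up in the first line a client sends to the proxy. The sketch below illustrates it for plain HTTP and for HTTPS targets only. It is not taken from the proxy-chain source, and the helper proxyRequestLine is hypothetical.

```javascript
// Hypothetical helper: returns the first line a client sends to an HTTP proxy
// for the given target URL.
function proxyRequestLine(targetUrl, method = 'GET') {
    const url = new URL(targetUrl);

    if (url.protocol === 'http:') {
        // Plain forwarding: the request line carries the absolute URL.
        return `${method} ${url.href} HTTP/1.1`;
    }

    if (url.protocol === 'https:') {
        // Tunneling: the client asks the proxy to open a raw TCP connection
        // to host:port and then performs the TLS handshake through it.
        const port = url.port || 443;
        return `CONNECT ${url.hostname}:${port} HTTP/1.1`;
    }

    throw new Error(`Protocol not covered by this sketch: ${url.protocol}`);
}

console.log(proxyRequestLine('http://www.example.com/'));
console.log(proxyRequestLine('https://www.example.com/'));
```

With forwarding, the proxy sees the full request. With a CONNECT tunnel, it only learns the target host and port and then relays bytes in both directions.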

If you need a proxy for web scraping, check out Scraping Intelligence Proxy, an HTTP proxy service that gives you access to both datacenter and residential IP addresses, as well as smart IP address rotation.

Alternatively, you can pass the proxy credentials with Puppeteer’s page.authenticate() method, as in the sample code below:

const puppeteer = require('puppeteer');

(async () => {
    const proxyUrl = 'http://proxy.example.com:8080';
    const username = 'John_Doe';
    const password = '123@Pass!';

    const browser = await puppeteer.launch({
        args: [`--proxy-server=${proxyUrl}`],
        headless: false,
    });

    const page = await browser.newPage();
    await page.authenticate({ username, password });
    await page.goto('https://www.example.com');
    await browser.close();
})();
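
For context, a proxy that requires authentication typically rejects a request without credentials with status 407 (Proxy Authentication Required), and the client then retries with a Proxy-Authorization header. The sketch below shows how that header value is formed, assuming the proxy uses the Basic scheme, which the article does not specify. The helper basicProxyAuthorization is hypothetical.

```javascript
// Hypothetical helper: builds the value of a Proxy-Authorization header
// for the Basic scheme, i.e. "Basic " + base64("username:password").
function basicProxyAuthorization(username, password) {
    const token = Buffer.from(`${username}:${password}`, 'utf8').toString('base64');
    return `Basic ${token}`;
}

const headerValue = basicProxyAuthorization('John_Doe', '123@Pass!');
console.log(`Proxy-Authorization: ${headerValue}`);

// Decoding the token gives back "username:password", which shows that Basic
// authentication encodes the credentials but does not encrypt them.
const decoded = Buffer.from(headerValue.slice('Basic '.length), 'base64').toString('utf8');
console.log(decoded);
```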

Get in touch with us for any web scraping services.

Originally published at https://www.websitescraper.com.

Scraping Intelligence provides all types of website scraper software, web scraping services, data extraction services, web data mining services, and web data scrapers.
