Getting started with Bright Data’s Web Scraping Browser

Gidon Lev Eli
7 min read · Mar 21, 2023


Bright Data web scraping browser

A web scraping browser, also known as a headless browser, is a tool that simulates a web browser environment without a graphical user interface (GUI). It lets developers automate web page interaction and data extraction, much like a regular web scraper, but with more control and flexibility.

The benefits of using a web scraping browser over regular scraping libraries include:

  1. JavaScript rendering: Many modern websites rely on JavaScript to load dynamic content. Regular scraping libraries may not be able to handle this content, but web scraping browsers can fully render the page, execute JavaScript, and extract all the data.
  2. Human-like browsing behavior: Web scraping browsers can simulate user interactions such as mouse clicks, scrolling, and form submissions, which can help avoid detection and prevent the website from blocking the scraper.
  3. High-level API: Web scraping browsers often provide a high-level API that makes it easier to interact with web pages and extract data.

Moving on from stopgap solutions for scraping browsers

Whether you are a novice or an expert web scraper, you’ve probably heard of, or used, the following headless browsers for web scraping:

  1. Puppeteer: a Node.js library that provides a high-level API to control headless Chrome or Chromium browsers. It can be used to automate web page interaction, generate screenshots, and extract data from web pages.
  2. Selenium: a popular open-source tool for automating web browsers. It supports many programming languages and can be used with a variety of web browsers, including Chrome, Firefox, and Edge.
  3. Playwright: another Node.js library that provides a high-level API to automate web browsers. It supports Chrome, Firefox, and Safari, and can be used to simulate user interactions, generate screenshots, and extract data from web pages.

However, none of these tools was specifically designed for web scraping. They are all browser automation tools built primarily for website testing. Because of that, they are usually detected by bot-protection software (like Akamai and DataDome), making it difficult to scrape at large scale. Python and Node.js programmers adopted them mainly because they are open-source and easily accessible. The same goes for other headless browsers used for web scraping, such as PhantomJS.

how to use a browser to scrape data

Enter the first true, dedicated headful scraping browser

Scraping Browser by Bright Data is the first browser with built-in proxy and unblocking capabilities, designed to let users focus on their web scraping while Bright Data takes care of the full proxy and unblocking infrastructure for them.

Users don’t need to learn any new languages. They can easily access and navigate target websites via their favorite libraries such as Puppeteer or Playwright. In addition, they can interact with the website’s HTML code to extract the data they need.

Behind the scenes, the Scraping Browser solution incorporates Bright Data’s superior proxy infrastructure along with its Web Unlocker capabilities for retrieving dynamic content.

The main benefits of the Bright Data Scraping Browser are:

  1. Increasing data acquisition success rates — Bypassing the toughest website blocks using AI-driven technology, even on premium domains that use the most advanced bot-detection software
  2. Automatic CAPTCHA solving — No need to pay for additional service.
  3. Boosting developer productivity — Seamlessly integrating with Puppeteer, Selenium, or Playwright. That means you don’t need to learn anything new to use it.
  4. Cutting overhead infrastructure costs — Highly scalable web scraping using unlimited scraping browser sessions simultaneously.

What makes this possible?

The Bright Data Scraping Browser is what you might call a ‘headful’ browser. Unlike the aforementioned headless browsers, it has a graphical user interface and can be controlled through the Puppeteer, Selenium, or Playwright APIs. GUI browsers are far more effective at dealing with advanced bot-detection software, and their fingerprints appear less suspicious.

Scraping Browser also has built-in website unlocking functions that handle blocks automatically. Besides Auto CAPTCHA-solving, it offers the following features:

  1. Managing user-agents
  2. Rendering JS
  3. Handling cookies
  4. Setting referral headers
  5. Auto rotating IPs and retries
  6. Browser fingerprinting
  7. Data validation

Because Scraping Browsers are hosted on Bright Data’s servers, they are ideal for auto-scaling web data scraping projects: users can open as many scraping browser sessions as they require without paying for an expensive in-house infrastructure.

Scraping browser quick start guide:

  • If you haven’t yet signed up for Bright Data, signing up is free — when you enter your payment method, you’ll receive a $5 credit to get started**
  • Creating a scraping browser zone — on your dashboard, navigate to ‘My Proxies’ page, and under ‘Scraping Browser’ click ‘Get started’.
  • On the ‘Create a new proxy’ page, choose and input a name for your newly created Scraping Browser proxy zone.
    Important! Please select a meaningful name for the zone as it cannot be changed once created.
  • Click ‘Add proxy’ to create and save your zone.
  • To create your first scraping browser session in Node.js or Python, go to your proxy zone’s ‘Access parameters’ tab. There you’ll find your API credentials: your Username (Customer_ID), Zone name (appended to the username), and Password. You will use them in the following integrations.
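Put together, those three access parameters form the username:password pair that the integration scripts below embed in the connection URL. As a minimal sketch (the endpoint host and port are taken from the sample code that follows; the helper name is my own):

```python
def build_ws_endpoint(customer_id, zone, password):
    """Compose the wss:// endpoint used by the Puppeteer/Playwright samples.

    The username is built as brd-customer-<CUSTOMER_ID>-zone-<ZONE_NAME>,
    with the zone password after a colon.
    """
    auth = f'brd-customer-{customer_id}-zone-{zone}:{password}'
    return f'wss://{auth}@zproxy.lum-superproxy.io:9222'

print(build_ws_endpoint('hl_12345', 'scraping_browser', 'mypassword'))
# wss://brd-customer-hl_12345-zone-scraping_browser:mypassword@zproxy.lum-superproxy.io:9222
```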

Scraping Browser Integration and Sample Code

Node.js with Puppeteer

Install Puppeteer-core (lightweight package without its own browser distribution)

npm i puppeteer-core

In the example script below, simply add your credentials, zone, and target URL instead of the placeholders:

const puppeteer = require('puppeteer-core');

// Replace with your own zone credentials (USER:PASS)
const auth = 'brd-customer-<CUSTOMER_ID>-zone-<ZONE_NAME>:<ZONE_PASSWORD>';

async function run() {
  let browser;
  try {
    browser = await puppeteer.connect({
      browserWSEndpoint: `wss://${auth}@zproxy.lum-superproxy.io:9222`,
    });
    const page = await browser.newPage();
    page.setDefaultNavigationTimeout(2 * 60 * 1000);
    await page.goto('http://lumtest.com/myip.json');
    const html = await page.evaluate(() => document.documentElement.outerHTML);
    console.log(html);
  } catch (e) {
    console.error('run failed', e);
  } finally {
    await browser?.close();
  }
}

if (require.main === module) run();

Run the script:

node script.js

Node.js with Selenium

Install Selenium webdriver

npm i selenium-webdriver

Again, add your proxy credentials, zone, and target URL instead of the placeholders in the example script:

const fs = require('fs/promises');
const { Builder, Browser } = require('selenium-webdriver');

// Replace with your own zone credentials (USER:PASS)
const AUTH = 'USER:PASS';
const SBR_WEBDRIVER = `https://${AUTH}@zproxy.lum-superproxy.io:9515`;

async function main() {
  const driver = await new Builder()
    .forBrowser(Browser.CHROME)
    .usingServer(SBR_WEBDRIVER)
    .build();
  try {
    console.log('Connected! Navigating…');
    await driver.get('https://example.com');
    console.log('Taking page screenshot to file page.png');
    const screenshot = await driver.takeScreenshot();
    await fs.writeFile('./page.png', Buffer.from(screenshot, 'base64'));
    console.log('Navigated! Scraping page content…');
    const html = await driver.getPageSource();
    console.log(html);
  } finally {
    await driver.quit();
  }
}

if (require.main === module) {
  main().catch((err) => {
    console.error(err.stack || err);
    process.exit(1);
  });
}

Run the script:

node script.js

Python with Playwright

Install Playwright

pip3 install playwright

In the example script below, simply add your credentials, zone, and target URL instead of the placeholders:

import asyncio

from playwright.async_api import async_playwright

# Replace with your own zone credentials (USER:PASS)
auth = 'brd-customer-<CUSTOMER_ID>-zone-<ZONE_NAME>:<ZONE_PASSWORD>'
browser_url = f'wss://{auth}@zproxy.lum-superproxy.io:9222'

async def main():
    async with async_playwright() as pw:
        print('connecting')
        browser = await pw.chromium.connect_over_cdp(browser_url)
        print('connected')
        page = await browser.new_page()
        print('goto')
        await page.goto('http://lumtest.com/myip.json', timeout=120000)
        print('done, evaluating')
        print(await page.evaluate('() => document.documentElement.outerHTML'))
        await browser.close()

asyncio.run(main())

Run the script

python scrape.py

For all coding examples and variations please head to the scraping browser’s documentation page.

More Scraping Browser features

Blocking requests

It is possible to block some endpoints to save bandwidth.
Example:

// connect to a remote browser...
const blockedUrls = ['*doubleclick.net*'];
const page = await browser.newPage();
const client = await page.target().createCDPSession();
await client.send('Network.enable');
await client.send('Network.setBlockedURLs', { urls: blockedUrls });
await page.goto('https://washingtonpost.com');
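The entries passed to Network.setBlockedURLs are wildcard patterns, where '*' matches any run of characters, so '*doubleclick.net*' blocks any URL containing that domain. As a rough sketch of how such patterns match (Python's fnmatch approximates this; CDP's actual matcher supports only '*' and '?'):

```python
from fnmatch import fnmatch

def is_blocked(url, blocked_patterns):
    """Return True if the URL matches any wildcard block pattern.

    Approximates CDP wildcard matching: '*' matches any run of
    characters anywhere in the URL.
    """
    return any(fnmatch(url, pattern) for pattern in blocked_patterns)

blocked = ['*doubleclick.net*']
print(is_blocked('https://ad.doubleclick.net/ddm/x', blocked))  # True
print(is_blocked('https://washingtonpost.com/', blocked))       # False
```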

Country Targeting

When using the scraping browser, the same country-targeting parameter is available to use as in other Bright Data proxy types.

When sending your request, add the -country flag after your scraping browser zone’s name, followed by the 2-letter ISO code for that country.

See how in the example below -country-us was added to the request, so it will originate from the US.

curl --proxy zproxy.lum-superproxy.io:22225 --proxy-user brd-customer-<CUSTOMER_ID>-zone-<ZONE_NAME>-country-us:<ZONE_PASSWORD> "http://target.site"
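The country flag is simply appended to the zone username before the colon-separated password. A minimal sketch of building that --proxy-user value (the helper name is my own; the username format comes from the curl example above):

```python
def proxy_user(customer_id, zone, country=None):
    """Build the --proxy-user value for a Scraping Browser zone.

    Appending -country-<iso> targets a specific country; per the article,
    'eu' targets the whole EU region.
    """
    user = f'brd-customer-{customer_id}-zone-{zone}'
    if country:
        user += f'-country-{country.lower()}'
    return user

print(proxy_user('abc123', 'scraping_browser', 'us'))
# brd-customer-abc123-zone-scraping_browser-country-us
```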

EU region targeting

Users can target the entire European Union region in the same manner as “Country” above by adding “eu” after “country” in the request: -country-eu

Requests sent using -country-eu will use IPs from one of the countries below, which are automatically included in the "eu" region:

AL, AZ, KG, BA, UZ, BI, XK, SM, DE, AT, CH, UK, GB, IE, IM, FR, ES, NL, IT, PT, BE, AD, MT, MC, MA, LU, TN, DZ, GI, LI, SE, DK, FI, NO, AX, IS, GG, JE, EU, GL, VA, FX, FO

Important! The allocation of a country within the EU is random by default.

Your next steps

If you are a Puppeteer, Selenium, or Playwright expert, you should quickly notice the difference when using the Bright Data Scraping Browser. It should also be fairly easy (and cheap) to run a small A/B test comparing success rates, speed, and cost when scraping the same target web pages with and without the scraping browser.

** Note that the rate for a “no commitment” plan is $20/GB and can go as low as $15/GB on the enterprise plan, which is in line with most web scraping API or residential proxy service packages.



Gidon Lev Eli

Marketer by day, music producer by night. Over 15 years of experience in digital marketing, storytelling, and song production.