The Data Scraper’s Toolkit: Essential Tools and Strategies

Danang Firmanto
16 min read · May 12, 2024


1. What is Web Scraping?

Web scraping is a technique used to extract data from websites. This process involves using automated tools to gather specific information such as price lists, product details, email addresses, and even images. The gathered data can then be used for various purposes, such as market research, competitive analysis, or updating online databases.


The process typically begins with a scraper program sending a request to a web page. It then parses the HTML content of the page to retrieve the specific data elements. The flexibility of web scraping allows it to be tailored to capture nearly any type of information visible on a website, making it an invaluable tool for data-driven decision-making.

2. Benefits of Web Scraping

Price Monitoring:

The first benefit of web scraping is monitoring product prices in the market. For example, if you own a business that sells a particular type of food, you must always be aware of the price range for similar items being sold. Web scraping allows you to easily track prices. Once you know what your competitors are charging, setting your own product prices becomes much simpler.

Gathering Information from Other Companies:

When you’re looking to partner with another company, it’s essential to know more about them. Web scraping can help you collect extensive data on potential partners. This information allows you to determine if the company is reliable and a good fit for collaboration. Thus, this process plays a crucial role in making sound business decisions.

Market Research:

Market research is critical for any business. It reveals what users want and serves as a foundation for building effective marketing strategies. The most accurate information is essential for market research, and web scraping can help achieve this. By using web scraping, you can discover the latest trends favored by consumers. This data can then be analyzed to guide the development of products that cater to your target market.

Monitoring News and Content:

One of the easiest ways to build your brand is by inviting media to your new product launch. The media will cover your event and review your products, creating valuable buzz for your business. Monitoring media coverage can be done effortlessly through web scraping, allowing you to see what is being reported about your products and business.

Generating Leads:

A vital strategy for gaining new leads is to collect as much contact information as possible from potential customers. Web scraping is a highly effective method for obtaining contact details of potential clients, helping you target new customers.


3. Web Scraping Techniques

There are generally two methods available for web scraping:

Manual: This approach requires you to copy and paste data directly from a web page manually. Although straightforward, it can be extremely time-consuming and tedious when working with large datasets.

Automatic: This method employs coding, software applications, or browser extensions. Automation has gained popularity due to its ability to scrape data swiftly. While methods vary based on the specific tools or software being used, all web scraping bots follow three fundamental principles:

  1. Request:
    The process begins by sending an HTTP request to the target website using the GET method. The program accesses the desired web page to fetch information. This step ensures that the bot establishes a connection and identifies the webpage for data extraction.
  2. Parse:
    After receiving a response from the website, the program initiates the parsing process. Parsing involves extracting specific data points from the webpage by leveraging data scraping techniques. The program identifies and isolates relevant information based on HTML markup or another structural format.
  3. Display:
    Once the desired data has been collected and identified through parsing, it’s converted into a readable report or display. The data can be presented in various formats, such as tables, graphs, or other structures that align with previously defined specifications or needs. The ultimate aim is to present the information in an easily understandable format that can be further analyzed or used for strategic decision-making.
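
To make these three principles concrete, here is a minimal sketch in Node.js (the language used for the examples later in this article). It assumes Node 18+ for the built-in fetch, uses example.com as a placeholder target, and pulls out only the page title with a regular expression; a real scraper would normally use a proper HTML parser.

const url = 'https://example.com'; // placeholder target

(async () => {
  // 1. Request: send an HTTP GET request and read the raw HTML.
  const response = await fetch(url);
  const html = await response.text();

  // 2. Parse: extract a specific data point, here the <title> element.
  const match = html.match(/<title>(.*?)<\/title>/i);
  const title = match ? match[1].trim() : '(no title found)';

  // 3. Display: present the result in a readable form.
  console.log(`Page: ${url}`);
  console.log(`Title: ${title}`);
})();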

4. How to Do Web Scraping

  1. Select the Target Website
    Begin by identifying the website you want to scrape. For instance, if you’re aiming to analyze customer book reviews, websites like Amazon, Goodreads, or LibraryThing are good options to consider.
  2. Inspect the Page
    Before diving into the code, it’s crucial to identify the data you need to scrape. Right-click on the page and choose “Inspect Element” or “View Page Source” to view the website’s underlying HTML code. This will give you an idea of how the data is structured.
  3. Identify the Desired Data
    If you’re focused on book reviews on Amazon, locate where the reviews are in the page’s HTML code. Most browsers highlight the selected on-page content together with its corresponding HTML in the inspector. The goal here is to identify unique tags that will help you isolate the relevant data.
  4. Write the Code
    Once you’ve pinpointed the relevant tags, incorporate them into your scraping software of choice. Python is commonly used for this purpose, thanks to its powerful libraries that simplify the scraping process. Make sure to specify the exact data you want to analyze and store, like book titles, author names, and ratings.
  5. Code Execution
    After writing the code, the next step is to run it. The scraper will request the target pages, extract the specified data, and process it.
  6. Saving Data
    After extracting, analyzing, and collecting the relevant data, you need to save it. You can instruct your script to do so by adding a few extra lines to your code. Almost any format will do, though CSV or another spreadsheet-friendly format is the most common choice. You can also run the output through a regular-expression cleaning step to produce a tidier dataset that is easier to read, as sketched below.
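
A minimal sketch of that final step, using Node.js (the language the rest of this article uses): regular expressions clean up stray whitespace in each row before the rows are written to a CSV file. The rows and the file name are placeholders.

const fs = require('fs');

// Placeholder rows as they might come back from a book-review scraper.
const rows = [
  '  The Pragmatic Programmer ,  Hunt & Thomas ,  5 stars ',
  'Clean Code ,  Robert C. Martin ,  4 stars',
];

// Clean each row with regular expressions: collapse runs of whitespace
// and trim the spaces around the comma separators.
const cleaned = rows.map(row =>
  row.replace(/\s+/g, ' ').replace(/\s*,\s*/g, ',').trim()
);

// Save the cleaned rows as a simple CSV file.
fs.writeFileSync('reviews.csv', cleaned.join('\n'), 'utf8');
console.log(`Saved ${cleaned.length} rows to reviews.csv`);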

5. Web Scraping Tools

Web scraping tools are essential software applications designed for the automated extraction of data from websites. They streamline the process of gathering large amounts of information from the internet, making it accessible and usable for a variety of purposes such as market research, sentiment analysis, competitive analysis, and academic research.

For server-side scraping, developers often use Node.js due to its efficiency and speed. Libraries like Playwright allow for control over headless browsers, automating the interaction with web pages as if a real user were navigating them. This can include anything from logging into a website to capturing dynamic AJAX content that only loads upon user interaction. Another popular library, Cheerio, provides a simplified way to parse HTML, making it easy to select and manipulate data similar to jQuery but with the added speed and efficiency suited for server tasks.
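
As a quick illustration of the Cheerio approach (no headless browser involved), the sketch below fetches a page and queries it with jQuery-like selectors. It assumes Node 18+ for fetch, Cheerio installed via npm install cheerio, and a placeholder URL and selectors (.product, .name, .price) that would need to match the real markup.

const cheerio = require('cheerio');

(async () => {
  // Fetch the static HTML; Cheerio does not execute any JavaScript on the page.
  const response = await fetch('https://example.com/products'); // placeholder URL
  const html = await response.text();

  // Load the HTML and query it with jQuery-like selectors.
  const $ = cheerio.load(html);
  $('.product').each((i, el) => {
    const name = $(el).find('.name').text().trim();
    const price = $(el).find('.price').text().trim();
    console.log(`${name}: ${price}`);
  });
})();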

One of Playwright’s key features is its use of browser contexts, which simulate separate, independent browsing sessions. This means it can handle multiple pages or scenarios simultaneously, making it suitable for scraping large datasets or monitoring several web pages concurrently. By customizing these browser contexts to match different network conditions or devices, a scraper can sidestep some basic anti-bot measures and gather more representative data.
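
Here is a small sketch of browser contexts in practice: two isolated sessions from one browser instance, one of them emulating a phone through Playwright’s built-in device descriptors. The URL is a placeholder, and the exact device names available depend on your Playwright version.

const playwright = require('playwright');

(async () => {
  const browser = await playwright.chromium.launch();

  // Two isolated contexts: separate cookies, storage, and cache.
  const desktop = await browser.newContext({
    userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)', // custom user agent
    viewport: { width: 1366, height: 768 },
  });
  const mobile = await browser.newContext({
    ...playwright.devices['iPhone 13'], // device descriptor; availability varies by version
  });

  // Pages in different contexts can be worked on concurrently.
  const [desktopPage, mobilePage] = await Promise.all([desktop.newPage(), mobile.newPage()]);
  await Promise.all([
    desktopPage.goto('https://example.com'), // placeholder URL
    mobilePage.goto('https://example.com'),
  ]);

  console.log(await desktopPage.title(), '|', await mobilePage.title());
  await browser.close();
})();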

Playwright is particularly adept at navigating the intricacies of modern web technologies, which often rely heavily on JavaScript for client-side rendering. It launches a full browser session to replicate the way a real user interacts with content, ensuring that data loads fully before performing scraping tasks. This feature is crucial for accurately capturing dynamic content that traditional scraping methods often miss.
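
In code, “letting the data load fully” usually means waiting for network activity to settle or for a specific element to appear before reading anything. A sketch with a placeholder URL and selector:

const playwright = require('playwright');

(async () => {
  const browser = await playwright.chromium.launch();
  const page = await browser.newPage();

  // Wait until network activity quiets down so client-side rendering can finish.
  await page.goto('https://example.com/catalog', { waitUntil: 'networkidle' }); // placeholder URL

  // Also wait for the element that the page's JavaScript renders.
  await page.waitForSelector('.product-card'); // placeholder selector

  const count = await page.$$eval('.product-card', els => els.length);
  console.log(`Rendered ${count} product cards`);

  await browser.close();
})();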

Moreover, JavaScript web scraping tools can automate the collection of data at scheduled intervals or in response to specific triggers, ensuring timely updates and the ability to handle large volumes of data efficiently. The data captured can be exported in various formats like JSON or CSV, or even directly fed into databases and analytics pipelines, facilitating easy integration into data-driven applications and processes.

Automation is another area where Playwright shines. With its foundation in Node.js, the framework enables automated scraping tasks to be scheduled at regular intervals or triggered by specific events. This level of automation keeps datasets current and reduces manual intervention, streamlining data collection for businesses and researchers alike.
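
A minimal scheduling sketch, using nothing more than Node’s built-in setInterval; for production use, a cron-style scheduler or an external job runner is usually more robust. The interval and the scraping logic are placeholders.

const ONE_HOUR = 60 * 60 * 1000;

async function scrapeOnce() {
  // Placeholder for the Playwright scraping logic described above.
  console.log(`[${new Date().toISOString()}] running scheduled scrape...`);
}

// Run immediately, then repeat every hour to keep the dataset current.
scrapeOnce();
setInterval(scrapeOnce, ONE_HOUR);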

6. HTML — What is HTML?

In web scraping, HTML is fundamental because it forms the framework of the web pages from which data is extracted. Each web page’s HTML code reveals the structure and organization of its content, guiding scrapers in navigating and locating specific data points. The HTML framework consists of various tags such as <p>, <a>, <img>, <table>, and <div>, representing components such as paragraphs, links, images, tables, and generic containers.

The hierarchical structure of HTML helps scrapers understand how content is organized. Tags contain attributes like id and class, which act as unique identifiers for the elements, allowing scrapers to pinpoint specific information quickly. For instance, a scraper might look for a <table> with a specific class attribute to extract tabular data, or it might target links within a <div> with a unique id.

By parsing the HTML with specialized scraping tools and libraries, a developer can instruct the scraper to locate the exact tags or patterns needed. For example, if collecting product reviews from an e-commerce site, the scraper would identify the HTML tags and attributes that encapsulate each review, including user ratings, comments, and product names.

Dynamic web pages that use JavaScript to load content asynchronously can present a challenge. However, scrapers equipped with browser automation tools can simulate a real browsing session, allowing the page to render fully before extracting the desired data.

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Simple HTML Example</title>
</head>
<body>
<h1>Hello, World!</h1>
<p>Welcome to a basic HTML page.</p>
<a href="https://www.example.com">Visit Example</a>
</body>
</html>

Explanation:

  1. <!DOCTYPE html>: This declaration is not typically used directly in web scraping but informs the parser that the HTML5 standard is expected.
  2. <html lang="en">: The root element of the page, specifying the language as English. While the language attribute (lang) might not often be targeted directly, it could be useful for conditional scraping based on language specifics.
  3. <head>:
  • <meta charset="UTF-8">: This tag sets the character encoding to UTF-8. It’s crucial for correctly interpreting the text, especially when scraping non-English content to avoid encoding issues.
  • <title>Simple HTML Example</title>: The title of the document is a common target in web scraping, often used to quickly understand the context or to categorize the page among search results.

4. <body>:

  • <h1>Hello, World!</h1>: Headings are prime targets in scraping as they often contain key information or summaries. This h1 could be used to identify the main topic or as a part of a dataset to understand page structures across a website.
  • <p>Welcome to a basic HTML page.</p>: Paragraph tags often hold the main body of text on a page. Scraping this data could be useful for extracting descriptions, details, or relevant textual content.
  • <a href="https://www.example.com">Visit Example</a>: Hyperlinks are crucial for web navigation during scraping. The href attribute provides the URL, which can be used to follow links, gather resources, or scrape linked pages. This is essential for deep web scraping, where recursive techniques retrieve data across linked pages.
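
To tie this back to scraping, here is a small sketch that parses the example page above with Cheerio (assuming npm install cheerio) and extracts exactly the elements just described:

const cheerio = require('cheerio');

// The same markup as the example above, inlined as a string.
const html = `
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Simple HTML Example</title>
</head>
<body>
<h1>Hello, World!</h1>
<p>Welcome to a basic HTML page.</p>
<a href="https://www.example.com">Visit Example</a>
</body>
</html>`;

const $ = cheerio.load(html);

console.log($('title').text());   // "Simple HTML Example"
console.log($('h1').text());      // "Hello, World!"
console.log($('p').text());       // "Welcome to a basic HTML page."
console.log($('a').attr('href')); // "https://www.example.com"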

7. HTTP — Core Concepts

HTTP, or HyperText Transfer Protocol, serves as the foundational protocol for data communication on the World Wide Web, linking web browsers with servers. When it comes to web scraping, understanding the intricacies of HTTP is vital for effectively accessing and retrieving data from websites.

Fundamental Aspects of HTTP Relevant to Web Scraping:

Requests and Responses: At the heart of HTTP are the request and response processes. Web scrapers simulate browser requests to retrieve web pages. Each request can utilize different methods such as GET, commonly used to fetch a page’s content, and POST, often used to submit form data or login to a site.

Headers: Headers in HTTP requests and responses carry crucial metadata. For web scraping, headers like User-Agent help disguise a scraper as a legitimate browser, aiding in bypassing basic anti-scraping checks. Cookies within headers manage session states, ensuring that scrapers can maintain continuity, such as staying logged in across multiple pages.

Status Codes: These codes inform the client about the status of the request. For example, a 200 status code indicates success, while a 404 suggests that the requested resource isn’t available. Understanding these codes helps scrapers handle errors and redirects effectively.
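
The three points above (requests, headers, and status codes) come together even in a very small script. A sketch using Node 18+’s built-in fetch, with a placeholder URL and a browser-like User-Agent string:

(async () => {
  const response = await fetch('https://example.com', { // placeholder URL
    method: 'GET',
    headers: {
      // A browser-like User-Agent helps some sites treat the request normally.
      'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
      'Accept-Language': 'en-US,en;q=0.9',
    },
  });

  // React to the status code before trying to parse anything.
  if (response.status === 200) {
    const html = await response.text();
    console.log(`OK, received ${html.length} characters of HTML`);
  } else if (response.status === 404) {
    console.log('Page not found, skipping');
  } else {
    console.log(`Unexpected status: ${response.status}`);
  }
})();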

Rate Limiting: Many websites implement rate limiting to control access frequency. Web scrapers must manage their request rates to avoid triggering these limits, which could lead to blocked access or legal issues.
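
One simple way to respect rate limits is to space requests out. A sketch with placeholder URLs and a fixed two-second delay; real scrapers often also back off when they receive a 429 (Too Many Requests) response:

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

(async () => {
  const urls = [
    'https://example.com/page/1', // placeholder URLs
    'https://example.com/page/2',
    'https://example.com/page/3',
  ];

  for (const url of urls) {
    const response = await fetch(url);
    console.log(url, response.status);
    await sleep(2000); // politeness delay before the next request
  }
})();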

Secure Communication: HTTP requests can be secured with HTTPS, which encrypts the data exchanged between the browser and the server. This is crucial for maintaining privacy and security, especially when scraping sensitive data.

APIs: Websites that offer data through APIs provide a structured and often more reliable way of accessing data than parsing HTML from web pages. APIs typically respond with data in formats like JSON or XML, which are straightforward for scrapers to process.
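
When a site exposes an API, the scraper can skip HTML parsing entirely. A sketch against a hypothetical JSON endpoint; the URL and response shape are assumptions, and real APIs document their own endpoints and authentication:

(async () => {
  const response = await fetch('https://api.example.com/products?page=1'); // hypothetical endpoint

  if (!response.ok) {
    throw new Error(`API request failed with status ${response.status}`);
  }

  // JSON arrives already structured, so no HTML parsing is needed.
  const data = await response.json();
  console.log(`Received ${Array.isArray(data) ? data.length : 1} record(s)`);
})();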

8. Scrape Pricing and Product Information Using Playwright in JavaScript

Before diving into creating a web scraping script using Playwright in JavaScript, it’s essential to set up the necessary development environment. This preparation ensures that all the tools and libraries needed for efficient and effective web scraping are in place.

Setting Up Your Development Environment

1. Installing Visual Studio Code (VS Code):

  • Visual Studio Code is a lightweight but powerful source code editor from Microsoft. It supports JavaScript and Node.js natively and offers a wide range of extensions that enhance functionality, including debugging, intelligent code completion (IntelliSense), and easy navigation.
  • To begin, download and install Visual Studio Code from the official Visual Studio Code website. Follow the installation instructions suitable for your operating system (Windows, macOS, or Linux).

2. Installing Node.js and npm:

  • Node.js is a runtime environment that allows you to run JavaScript on the server side. npm (node package manager) is included with Node.js and helps manage dependencies for Node.js applications.
  • Download Node.js from the official Node.js website. Installing Node.js with the default options will also install npm, setting you up to handle various libraries including Playwright.
  • After installation, you can verify the installation by opening a terminal or command prompt and typing node -v and npm -v, which will display the current versions of Node.js and npm installed on your system.

3. Setting up a Node.js Project:

  • Open Visual Studio Code and create a new project folder, or navigate to an existing one.
  • Open a terminal in VS Code (or use your operating system’s terminal), navigate to your project directory, and initialize a new Node.js project by running npm init. This command creates a package.json file in your project directory, which will keep track of all your dependencies and project metadata.

4. Installing Playwright:

  • With your Node.js environment ready, install Playwright by running npm install playwright. This command downloads Playwright and its dependencies, adding them to your project’s node_modules directory. It also updates your package.json to include Playwright as a dependency.
  • Playwright installation includes browser binaries for Chromium, Firefox, and WebKit, allowing your scripts to simulate a wide range of browsing environments.
const playwright = require('playwright');
const fs = require('fs');
const path = require('path');
  1. const playwright = require('playwright');:
  • This line imports the Playwright library into the current file. Playwright is a popular tool that allows for browser automation, including web scraping and testing.
  • With this import, you can use Playwright’s API to interact with web pages programmatically in different browsers (Chromium, Firefox, WebKit).

2. const fs = require('fs');:

  • This imports Node.js’s fs (file system) module, which provides functions to interact with the local file system.
  • It allows the script to read from and write to files, making it useful for logging, data storage, or working with existing files.

3. const path = require('path');:

  • This imports Node.js’s path module, which helps manipulate and handle file paths in a consistent, cross-platform manner.
  • It provides utilities to work with file and directory paths, ensuring that the correct syntax is used regardless of the operating system.

Putting It Together:

  • With these imports, the script is setting up the foundation for a Playwright-based web scraping or browser automation project.
  • playwright will control the browser, allowing interactions like navigating to URLs, clicking on elements, and capturing data.
  • fs will help manage the storage of any data collected during the scraping process, perhaps saving it to JSON or CSV files.
  • path will help organize file paths properly so that data can be saved or read efficiently, regardless of the operating system.

Example Use Case:

  1. Use Playwright to navigate to a webpage and scrape information.
  2. Store the scraped data in a structured format using fs.
  3. Organize output files effectively using path.
(async () => {
const browser = await playwright.chromium.launch();
const page = await browser.newPage();
  1. (async () => {...})():
  • This structure is an Immediately Invoked Function Expression (IIFE) that allows code to be executed immediately after it is defined.
  • The async keyword indicates that the function contains asynchronous operations, enabling the use of await inside the function.
  • By executing this function immediately, you can handle asynchronous tasks cleanly and keep the global scope uncluttered.

2. const browser = await playwright.chromium.launch();:

  • This line initializes a browser instance using Playwright’s chromium engine.
  • playwright.chromium refers to Playwright's Chromium browser automation engine. Similar options include firefox and webkit.
  • launch() is an asynchronous method that starts a new headless (default) browser session.
  • The await keyword ensures that the function pauses execution until the browser is fully launched, then assigns the launched browser instance to the browser variable.

3. const page = await browser.newPage();:

  • This creates a new page (or tab) within the launched browser instance.
  • The new page will operate independently, meaning actions on one page won’t affect other open pages in the same browser.
  • The await keyword again ensures that execution pauses until the new page is ready, assigning it to the page variable.
 
await page.goto('https://www.unitedbike.com/bikes');
  • await: This keyword pauses the function execution until the promise resolves. In this case, it waits until the page is fully loaded before proceeding to the next line.
  • page.goto(url):
  • goto is a method provided by Playwright that instructs the browser page to navigate to the specified url.
  • url is the web address to visit, given here as 'https://www.unitedbike.com/bikes'.
  • When this line is executed, the Playwright page instance will navigate to the given URL just as a real browser would, fully rendering the page, including any JavaScript-based dynamic content.
  const productInfoElements = await page.$$eval('.product-information', elements => elements.map(el => {
const caption = el.querySelector('.caption').textContent.trim();
const price = el.querySelector('.price').textContent.trim();
return `${caption},${price}`;
}));

This code snippet is part of a larger web scraping script, which aims to extract specific product information using the Playwright page object. Here's an explanation of each part:

  1. const productInfoElements = await page.$$eval(...):
  • const: This keyword declares the variable productInfoElements, which will hold the data extracted from the page.
  • await: The function pauses execution until the operation completes, ensuring the variable is only assigned once the data is ready.
  • page.$$eval:
  • $$eval is a Playwright method that evaluates a function against all matching elements on the page.
  • The first argument, '.product-information', is a CSS selector that targets all HTML elements with the class .product-information.
  • The second argument is a function to execute against the selected elements, receiving them as the parameter elements.

2. elements.map(el => {...}):

  • This map function iterates over each HTML element in the elements array, processing each one to extract the necessary information.

3. const caption = el.querySelector('.caption').textContent.trim();:

  • The code searches within each .product-information element for a child element with the class .caption.
  • textContent retrieves the text content of that element, and trim() removes leading and trailing whitespace.
  • The result is stored in the caption variable.

4. const price = el.querySelector('.price').textContent.trim();:

  • Similar to the caption, this line searches for a child element with the class .price.
  • textContent extracts the inner text of the .price element, which typically represents the price of the product.
  • trim() ensures that no extra whitespace is included.

5. return `${caption},${price}`;:

  • Each iteration returns a formatted string that concatenates the caption and price values separated by a comma.
const outputFilePath = path.join(__dirname, 'product_information.csv');

fs.writeFileSync(outputFilePath, productInfoElements.join('\n'), 'utf8');

console.log(`The data has been exported to a file: ${outputFilePath}`);
await browser.close();
  1. const outputFilePath = path.join(__dirname, 'product_information.csv');:
  • path.join:
  • join is a method from the path module that combines multiple path segments into one cohesive path string.
  • It makes the path cross-platform by automatically applying the appropriate path separator (e.g., backslashes on Windows, slashes on Linux/macOS).
  • __dirname:
  • This special variable holds the absolute path to the directory where the currently running script resides.
  • 'product_information.csv':
  • This is the name of the CSV file where the data will be saved.
  • Combined, path.join creates a full path to a file named product_information.csv within the directory of the running script.

2. fs.writeFileSync(outputFilePath, productInfoElements.join('\n'), 'utf8');:

fs.writeFileSync:

  • This synchronous method from the fs module writes data directly to the specified file.
  • It takes three main arguments:

outputFilePath: The path to the file where the data will be written.

productInfoElements.join('\n')

  • productInfoElements is expected to be an array of strings (likely containing product data).
  • join('\n') concatenates these strings into a single string, separated by newline characters (\n), making it suitable for writing as CSV data.

'utf8'

  • Specifies the encoding format for the written file as UTF-8, ensuring proper text representation.

3. console.log(`The data has been exported to a file: ${outputFilePath}`);:

  • Outputs a message to the console, confirming that the data has been exported and providing the path to the generated file.

4. await browser.close();:

  • Closes the Playwright browser instance that was opened at the start of the script.
  • This releases system resources and ensures a clean exit after all web scraping tasks are complete.

Now run the code in the terminal:

$ node scrap.js

The output will look like this:

VITESSA 2.00


Rp 9.820.000,Rp 9.820.000
VITESSA 1.00


Rp 7.960.000,Rp 7.960.000
STYGMA LITE


Rp 14.060.000,Rp 14.060.000
STYGMA


Rp 18.640.000,Rp 18.640.000
STERLING R2 DISC


Rp 33.260.000,Rp 33.260.000
STERLING R1 DISC


Rp 26.200.000,Rp 26.200.000
STERLING PRO DISC


Rp 75.000.000,Rp 75.000.000
OXYDE PRO


Rp 67.725.000,Rp 67.725.000
OXYDE ONE


Rp 20.370.000,Rp 20.370.000
KYROSS 2.1


Rp 18.140.000,Rp 18.140.000
KYROSS 2.00+


Rp 17.630.000,Rp 17.630.000
KYROSS 1.1


Rp 12.850.000,Rp 12.850.000
KYROSS 1.00


Rp 12.850.000,Rp 12.850.000
GAVRIIL


Rp 16.620.000,Rp 16.620.000
E-GAVRIIL


Rp 48.280.000,Rp 48.280.000
EPSILON T6


Rp 45.350.000,Rp 45.350.000

Here is the whole code:

const playwright = require('playwright');
const fs = require('fs');
const path = require('path');

(async () => {
  // Launch a headless Chromium browser and open a new page.
  const browser = await playwright.chromium.launch();
  const page = await browser.newPage();

  // Navigate to the product listing page.
  await page.goto('https://www.unitedbike.com/bikes');

  // Extract the caption and price from every product card.
  const productInfoElements = await page.$$eval('.product-information', elements => elements.map(el => {
    const caption = el.querySelector('.caption').textContent.trim();
    const price = el.querySelector('.price').textContent.trim();
    return `${caption},${price}`;
  }));

  // Write the rows to a CSV file next to this script.
  const outputFilePath = path.join(__dirname, 'product_information.csv');
  fs.writeFileSync(outputFilePath, productInfoElements.join('\n'), 'utf8');

  console.log(`The data has been exported to a file: ${outputFilePath}`);
  await browser.close();
})();

That wraps up this introduction to web scraping with Playwright in JavaScript. If you want to delve deeper into web scraping and Playwright, there are plenty of resources online, starting with the official Playwright documentation.

Thank you! :3
