Web Scraping with Python — The Ultimate Guide

Prerna Mittal
The ACM Manipal Blog
12 min read · Jan 15, 2023

What’s web scraping, and what is it used for?

Let’s say you have to create a presentation on the history of ice cream. To gather material for it, you would probably visit a website and copy the information you need. It’s just a simple copy-paste!
Now, let’s say you plan on doing social media sentiment analysis to track the latest music trends. That involves extracting data from your social media site of choice. However, we can’t simply copy-paste anymore, as the amount of data is enormous! In this scenario, we would use something known as web scraping.

Like the name says, it’s basically “scraping” data from the web. We can also say that it’s a method to fetch or extract data from websites. It presents large amounts of data in a structured manner, which is helpful for analysis and many applications.

Web scraping consists of two parts: the crawler and the scraper. The scraper “scrapes” the data from the web page; its design varies with the project’s requirements so that it can extract the data efficiently. The crawler is an algorithm that browses the web, following links across the internet to find the pages that hold the required data.

The process of web scraping follows these steps:

  1. A GET request is sent to the targeted website using the HTTP protocol.
  2. The web server processes the request and checks whether it is legitimate. If it is, the scraper is allowed to read and extract the website’s data.
  3. The scraper locates the targeted elements and saves them in a structured format.
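
As a rough sketch of this flow, using the requests library and example.com purely as placeholders (the parsing step here uses Beautiful Soup, which is covered later in this article):

import requests
from bs4 import BeautifulSoup

# Step 1: send a GET request to the target website.
response = requests.get("http://www.example.com")

# Step 2: the server's response contains the page's HTML.
html = response.text

# Step 3: locate the targeted elements and keep them in a structured form.
soup = BeautifulSoup(html, "html.parser")
headings = [h.text for h in soup.find_all("h1")]
print(headings)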

Commercial Advantages

Web Scraping has multiple applications and can be advantageous in the following ways:

  1. Tracking Competitors: To keep up with the fast-paced digital space, one must analyze one’s competitors and gather the necessary business intel. Web scraping helps by providing data such as public reactions to competitors’ product launches, their budget insights, and so on.
  2. Value Optimization: Putting the correct price on one’s product is critical to a successful business. Hence, it is essential to know the prices set by your competitors so you can price your product appropriately and stay competitive.
  3. Job Listing: Web scraping can be used to get data on job openings which can be used to track market trends and help companies find potential candidates.
  4. Sentiment Analysis: Web scraping can collect data on customer reviews and feedback, which can help businesses understand customer sentiment and identify areas for improvement.
  5. News Monitoring: Web scraping can collect news articles from different websites, which can help businesses stay up to date on industry news and track mentions of their brand.

Now let’s have a look at how we can scrape data using these Python libraries:

  1. Selenium
  2. BeautifulSoup
  3. Scrapy

Selenium

Advantages

  • Selenium is open-source software.
  • Selenium supports various programming languages for writing programs (test scripts).
  • Selenium supports various operating systems (Windows, Linux, macOS, etc.).
  • Selenium supports various browsers (Mozilla Firefox, Google Chrome, IE, Opera, Safari, etc.).
  • Selenium supports parallel test execution and uses fewer hardware resources.
  • Selenium test cases execute faster than in other tools like UFT, RFT, TestComplete, and SilkTest, and they can run while the browser window is minimized.

Disadvantages

  • It supports web-based applications only.
  • It is harder to use, and creating test cases takes more time.
  • The test environment is more difficult to set up compared to vendor tools like UFT, RFT, SilkTest, etc.
  • Limited support for image testing.
  • New features may not work properly.
  • No built-in integration with test-management tools.
  • No built-in reporting facility, and reliable technical support is unavailable.

Types of Selenium Modules

  1. Selenium IDE

Selenium IDE is one of the plugins in the Selenium suite. It is the easiest tool to use, as this integrated development environment (IDE) doesn’t require any particular setup to get started. The latest versions are compatible with both Firefox and Chrome.

It allows one to record user actions through a graphical user interface. It doesn’t require learning a test scripting language, as it can export the recorded test cases into different programming languages. Hence it’s very flexible and easy to use.

2. Selenium Grid

Selenium Grid enables testing across multiple platforms and multiple machines at once; it is a good example of a distributed system.

Selenium Grid supports almost every framework, so you don’t have to worry about the programming language you are using for testing. Tests written in the local environment are run in parallel on remote browsers on different machines. A server known as the ‘hub’ acts as a command center and routes test commands, in JSON format, to the nodes.

A ‘node’ is a machine on which the tests are executed. The hub and the nodes are a Selenium Grid’s two most essential components: there can be only one hub in a particular grid, but there can be many nodes.
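
As a rough idea of how this looks in code, a script can point at a Grid hub through a remote driver; the hub address below is just an assumed local default:

from selenium import webdriver

# Assumes a Selenium Grid hub is running locally on the default port.
options = webdriver.ChromeOptions()
driver = webdriver.Remote(command_executor="http://localhost:4444", options=options)
driver.get("http://www.example.com")
print(driver.title)
driver.quit()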

3. Selenium Webdriver

Selenium WebDriver, also known as Selenium 2.0, is an open-source tool for web automation. This version also supports cross-browser testing. Test scripts can be written in any supported programming language, a capability that Selenium IDE lacks.

Selenium WebDriver is compatible with almost every browser, so its multi-browser testing capability is high. It communicates directly with the browser without any intermediary and can work with dynamic, AJAX-based UI elements. It can also simulate user interactions such as drag-and-drop. Client libraries provide bindings for multiple programming languages, and commands are exchanged with the driver as JSON.

For each browser, there is a browser driver; for example, for the Chrome browser we have ChromeDriver. These drivers are used to communicate with the respective browser: when a driver receives a command, it executes it in that browser, and the response is sent back over HTTP.

Get Started with Selenium:

Selenium is a popular tool for web scraping because it can automate interaction with websites. Here’s a basic guide to using Selenium for web scraping:

Install Selenium: You’ll need to have Selenium installed on your system. You can do this using pip:

pip install selenium

Start a web driver: Selenium requires a web driver to interact with the browser. You can use the WebDriver for the browser of your choice. For example, you can start the Firefox driver like this:

from selenium import webdriver
driver = webdriver.Firefox()

Navigate to the website: Once you have a web driver running, you can use it to navigate to the website you want to scrape. For example,

driver.get("http://www.example.com")

Inspect the page: Before extracting data from a website, you need to understand its structure. You can inspect the page by right-clicking on an element and selecting “Inspect” or “Inspect Element”. This will open the browser’s developer tools, showing you the page’s HTML source code.

Locate the data: Once you understand the page’s structure, you can use Selenium’s find_element method together with the By class to locate the data you want to extract. For example, you can use By.CSS_SELECTOR to find an element by a CSS selector or By.XPATH to locate an element by its XPath (the older find_element_by_* helpers have been removed in recent Selenium releases).

Extract the data: Once you have located the data, you can use Selenium’s text property to extract the element’s text content. For example,

from selenium.webdriver.common.by import By

element = driver.find_element(By.CSS_SELECTOR, ".some-class")
data = element.text

Quit the driver: Once you are done scraping, you should quit the driver to close the browser and release its resources.

driver.quit()

Keep in mind that this is just a basic guide and there are more complex scraping scenarios where you will need to handle dynamic pages, cookies, and more. But with this guide, you should be able to get started with web scraping using Selenium.
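
Putting these steps together, a minimal sketch might look like the following; the explicit wait is one common way to cope with dynamic pages (example.com and the h1 tag are just placeholders):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
try:
    driver.get("http://www.example.com")
    # Wait up to 10 seconds for the element to be present before reading it.
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, "h1"))
    )
    print(element.text)
finally:
    driver.quit()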

BeautifulSoup

Advantages:

  • Easy for beginners to learn and master web scraping.
  • It has good community support to help figure out issues.
  • It has good comprehensive documentation.
  • It gives you the freedom to experiment and build your own scraping logic from scratch.
  • It can be considered if one wants to create one-time scraping scripts that won’t be maintained in the long run.

Disadvantages:

  • It depends on external Python libraries (a parser, plus a library such as requests to fetch pages).
  • It only works on the contents of the single page you feed it; there is no built-in crawling.
  • It is slower than Scrapy, which makes it less preferable for more extensive projects.
  • Requires Python knowledge
  • Minimal proxy support makes it hard to extract large amounts of data from the same server without getting your IP banned or blocked.

Get Started with BeautifulSoup:

Beautiful Soup is a Python library for parsing HTML and XML documents. It can be used for web scraping, allowing you to extract specific elements from a web page and navigate through the document tree. Here’s a basic guide for using Beautiful Soup for web scraping:

Installing Beautiful Soup: If you’re using a recent version of Debian or Ubuntu Linux, you can install Beautiful Soup with the system package manager:

apt-get install python3-bs4

Beautiful Soup 4 is published through PyPI, so if you can’t install it with the system package manager, you can install it with pip. The package name is beautifulsoup4. Make sure you use the right version of pip for your Python version (it may be named pip3).

pip install beautifulsoup4

Installing a parser: Beautiful Soup supports the HTML parser included in Python’s standard library and several third-party Python parsers. One is the lxml parser. Depending on your setup, you might install lxml with one of these commands:

apt-get install python3-lxml
pip install lxml

Another alternative is the pure-Python html5lib parser, which parses HTML like a web browser does. Depending on your setup, you might install html5lib with one of these commands:

apt-get install python3-html5lib
pip install html5lib

The Beautiful Soup documentation includes a table comparing these parsers; in short, html.parser needs no extra installation, lxml is the fastest, and html5lib is the most lenient (it parses pages the way a browser does) but also the slowest.
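
Whichever parser you install, you choose it by name when constructing the soup object; for instance (the HTML string below is just a stand-in):

from bs4 import BeautifulSoup

html = "<p class='intro'>Hello</p>"           # any HTML string
soup = BeautifulSoup(html, "lxml")            # use lxml if it is installed
# soup = BeautifulSoup(html, "html5lib")      # or the html5lib parser
# soup = BeautifulSoup(html, "html.parser")   # standard-library fallback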

Import the library: You’ll need to import the library in your script. You can do this with the following statement:

from bs4 import BeautifulSoup

Retrieve the HTML: You can use the requests library to retrieve the HTML of the webpage you want to scrape.

import requests
url = "http://www.example.com"
response = requests.get(url)
html = response.text

Parse the HTML: Once you have the HTML, you can parse it using Beautiful Soup.

soup = BeautifulSoup(html, 'html.parser')

Inspect the page: Before extracting data from a website, you need to understand its structure. You can inspect the page by looking at the HTML source code.

Locate the data: Once you understand the page’s structure, you can use Beautiful Soup’s methods to locate the data you want to extract. For example, you can use the find or find_all methods to locate elements by their tag name, class name, or other attributes.

element = soup.find('p', class_='some-class')

Extract the data: Once you have located the data, you can use Beautiful Soup’s properties and methods to extract the text or other attributes of the element. For example:

data = element.text
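
To grab several elements at once, find_all returns a list you can loop over. A small sketch, with link extraction used purely as an example of what you might pull:

import requests
from bs4 import BeautifulSoup

response = requests.get("http://www.example.com")
soup = BeautifulSoup(response.text, "html.parser")

# find_all returns every matching element; here, every link on the page.
for link in soup.find_all("a"):
    print(link.get("href"), link.text)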

With this guide, you should be able to get started with web scraping using Beautiful Soup.

Scrapy

Advantages

  • Scrapy can extract data in CSV, XML, and JSON formats.
  • Scrapy provides AutoThrottle features that automatically adjust the tool to the ideal crawling speed.
  • Scrapy is asynchronous, so it can load several pages in parallel.
  • Large volumes of data can be extracted.
  • In terms of speed, Scrapy is fast.
  • Scrapy consumes little memory and CPU.

Disadvantages

  • Scrapy cannot render JavaScript on its own.
  • The installation process varies for different operating systems.
  • Scrapy only supports relatively recent Python versions (current releases require Python 3.7+).
  • Scrapy has a steeper learning curve than the other libraries, which can be challenging for beginners.

Creating a project

Before you start scraping, you will have to set up a new Scrapy project. Enter a directory where you’d like to store your code and run:

scrapy startproject tutorial

This will create a tutorial directory with the following contents:

tutorial/
    scrapy.cfg            # configuration file for the Scrapy project
    tutorial/             # the project's Python package; it contains the project modules
        __init__.py       # empty file that tells Python this directory is a package
        items.py          # where you'll define the data structure
        middlewares.py    # where you can handle requests and responses
        pipelines.py      # where you can process the data returned from the spider
        settings.py       # where you set project settings such as user-agent, robots.txt handling, etc.
        spiders/          # a package where you'll later put your spiders
            __init__.py

Our first Spider

Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites). They must subclass scrapy.Spider and define the initial requests to make, optionally how to follow links in the pages, and how to parse the downloaded page content to extract data.
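
A minimal spider, along the lines of the one in the official Scrapy tutorial (quotes.toscrape.com is the tutorial's demo site), looks like this:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            "https://quotes.toscrape.com/page/1/",
            "https://quotes.toscrape.com/page/2/",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Save each downloaded page to a local HTML file.
        page = response.url.split("/")[-2]
        filename = f"quotes-{page}.html"
        with open(filename, "wb") as f:
            f.write(response.body)
        self.log(f"Saved file {filename}")
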
As you can see, our Spider subclasses scrapy.Spider and defines some attributes and methods:

  • name: identifies the Spider. It must be unique within a project; you can’t set the same name for different Spiders.
  • start_requests(): must return an iterable of Requests (you can return a list of requests or write a generator function) from which the Spider will begin to crawl. Subsequent requests will be generated successively from these initial requests.
  • parse(): a method that will be called to handle the response downloaded for each of the requests made. The response parameter is an instance of TextResponse that holds the page content and has further helpful methods to handle it.
    The parse() method usually parses the response and extracts the scraped data as dicts. It also finds new URLs to follow and creates new requests (Request) from them.

Get Started with Scrapy:

Scrapy is a Python framework for large-scale web scraping. It is built on top of the Twisted library for asynchronous network programming. Here’s a basic guide for using Scrapy for web scraping:

Install Scrapy: You’ll need to have Scrapy installed on your system. You can do this using pip:

pip install scrapy

Create a new Scrapy project: Use the Scrapy command-line tool to create a new project.

scrapy startproject projectname

Define the spider: A spider is a class that defines how Scrapy should navigate a website and what data to extract. You can define your spider by creating a new Python file in your project’s spiders directory.

Define the start URL: In the spider file, you’ll need to define the start URL(s) that Scrapy should begin scraping.

Define the parsing logic: You’ll need to define the parsing logic to extract the data you want. Scrapy uses selectors (CSS or XPath expressions) to extract data from the HTML.
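
A hypothetical spider covering these steps with CSS selectors might look like the following; the div.quote page structure and quotes.toscrape.com are just an assumed example site:

import scrapy


class QuoteItemsSpider(scrapy.Spider):
    name = "quote_items"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Yield one item per quote block on the page.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the pagination link, if any, and parse the next page the same way.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)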

Start the spider: Once you’ve defined your spider, you can run it using the following command:

scrapy crawl spider_name

Store the data: Scrapy provides several ways to store the scraped data, such as exporting it in JSON, CSV, or XML format. You can also store it in a database.
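
For example, the crawl command can write the scraped items straight to a file; -O overwrites the file (older versions use -o, which appends), and the spider and file names here are placeholders:

scrapy crawl spider_name -O output.json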

Scrapy is a powerful web scraping tool that can handle large-scale projects. It also has built-in support for handling everyday web scraping tasks such as following links, managing cookies, and more.

Web scraping Python libraries compared: Beautiful Soup is the simplest choice for quick, one-off scripts; Selenium automates a real browser, which suits pages that rely on JavaScript; and Scrapy is a full framework built for large-scale crawling.

Now that you have a brief overview of three of the most useful web-scraping Python libraries, let’s bust some myths associated with web scraping.

Myth-Busters

  1. It’s illegal: As long as laws and regulations are not violated, web scraping is perfectly legal. It depends entirely on what data is collected, where it comes from, and how it will be used.
  2. It can only be done by those who know how to code: That is no longer the case. There are many tools with which even someone who doesn’t code can get the necessary data.
  3. It is only meant to extract text: Web scraping can also extract images, videos, and non-textual data.
  4. It’s easy: That’s not true. The complexity of a scraping job depends on the use case and the website’s structure; some sites require more advanced techniques.
  5. Web scraping is only for large-scale data collection: That’s not always the case. Web scraping can also be done for small-scale data collection, depending on the use case.

Legality

Web scraping is at times viewed as illegal or as “unethical hacking”; however, as long as all applicable laws are abided by, it is not unlawful. To ensure that you scrape legally and responsibly, here are some things to keep in mind:

  1. Limit the frequency and amount of data collected: do not scrape the website too frequently or in vast amounts.
  2. Don’t use scraped data for illegal or unethical purposes: do not use the collected data for unethical purposes or to harm the website or its users.
  3. Obtain permission when necessary: In some cases, it may be required to obtain explicit consent from the website owner before scraping their site.
  4. Check for a robots.txt file: Many websites have a robots.txt file that tells web crawlers which pages or sections of the website should not be accessed. Check for and abide by any instructions in the website’s robots.txt file (see the short sketch after this list).
  5. Respect privacy: When scraping personal data, comply with all applicable privacy laws and regulations.
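
For the robots.txt point above, Python's standard library can check whether a URL is allowed; a small sketch using example.com as a placeholder:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()

# True if the rules allow a generic crawler ("*") to fetch this page.
print(rp.can_fetch("*", "http://www.example.com/some-page"))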

Ready to scrape websites? Building web scrapers in Python, acquiring data, and drawing conclusions from large amounts of information is an exciting, if complicated, process. The applications of web scraping are boundless, and the work it does is phenomenal.

Authors: Kashin Mittal and Madhuria Rudra
Second-year students at Manipal Institute of Technology
