Dynamic Web Scraping — 6: Using Proxies

Irfan Ahmad · Geek Culture · Mar 5, 2023

I have written five articles about dynamic web scraping using Selenium, which can be found here. However, before we explore this topic further, it’s important to understand an essential aspect of web scraping: proxies.

In this article, we’ll cover the following topics:

  1. Why it’s necessary to understand proxies in web scraping
  2. What are proxies?
  3. The importance of proxies in web scraping
  4. How to obtain proxies
  5. How to use proxies in web scraping

Specifically, we will scrape free proxies from the “free-proxy-list” site, demonstrate how to test them, and store the working ones.


Why it’s necessary to understand proxies

As web scrapers, our scripts can be perceived as intruders by websites. When a website detects an abnormal number of visits from a single IP address, it becomes suspicious and may block that IP to protect its content and servers. This can leave the scraper unable to access the website, resulting in incomplete or inaccurate data.

This is where proxies come into play. Proxies allow a scraper to make requests to a website from multiple IP addresses, which can help prevent website blocking and improve the scraper’s anonymity. Therefore, having a good understanding of proxies and their usage is essential for successful web scraping.

Not every website enforces this kind of blocking, but as web scrapers we should be prepared to run into it.

So,

What are proxies?

Proxies are intermediary servers that act as a gateway between a user and the internet. When a user sends a request to a website, the request is first sent to the proxy server, which then forwards the request to the website. The website sees the request as coming from the proxy server rather than the user’s IP address. This can be useful for web scraping because it allows a scraper to make requests to a website without revealing its own IP address.
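To make the idea concrete, here is a minimal sketch of routing a single request through a proxy with the ‘requests’ library. The proxy address below is just a placeholder, and httpbin.org/ip is used only because it echoes back the IP address the target site sees:

import requests

# Placeholder proxy address -- replace it with a real "ip:port" from your own list.
proxy = "203.0.113.10:8080"

proxies = {
    "http": f"http://{proxy}",
    "https": f"http://{proxy}",
}

# httpbin.org/ip echoes the IP the request appears to come from,
# which should be the proxy's address rather than our own.
response = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=10)
print(response.json())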

The importance of proxies in web scraping

Proxies play a crucial role in web scraping. Web scraping involves sending a large number of requests to a website in a short amount of time, which can trigger website security mechanisms such as IP blocking. When a website blocks the IP address of a scraper, the scraper is unable to access the website and collect the data it needs.

Proxies can help prevent IP blocking by allowing scrapers to make requests to a website from multiple IP addresses. By rotating through a list of proxies, a scraper can make a high volume of requests without triggering IP blocking. Proxies can also improve a scraper’s anonymity, making it more difficult for websites to track the scraper’s activity.
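As a rough sketch of what rotation can look like in practice (the proxy addresses and target URL below are placeholders for illustration):

import itertools
import requests

# Cycle through a small, hypothetical pool of proxies so consecutive
# requests come from different IP addresses.
proxy_pool = itertools.cycle([
    "203.0.113.10:8080",   # placeholder addresses -- use your own tested proxies
    "203.0.113.11:3128",
    "203.0.113.12:80",
])

urls_to_scrape = ["https://httpbin.org/ip"] * 5  # stand-in for real target URLs

for url in urls_to_scrape:
    proxy = next(proxy_pool)
    proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
    try:
        response = requests.get(url, proxies=proxies, timeout=10)
        print(proxy, "->", response.status_code)
    except requests.RequestException:
        # A dead or slow proxy is simply skipped; the next request
        # will use the next proxy in the cycle.
        print(proxy, "failed, moving on")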

There are other techniques for avoiding IP blocking while web scraping as well. To learn more about them, make sure to hit the ‘Follow’ button so that you don’t miss any of our future articles.

Furthermore, some websites restrict access based on geographic location. By using proxies located in different regions, a scraper can bypass these restrictions and access the website from anywhere in the world.

In summary, proxies are essential for successful web scraping. They help prevent IP blocking, improve anonymity, and allow access to restricted websites. It is important for web scrapers to have a good understanding of proxies and their usage in order to avoid being detected and blocked by websites.

How to obtain proxies

There are various types of proxies that can be found on the internet. To obtain them, one can simply search for “free proxies”. After saving the available proxies, it is important to test them for functionality and select the ones that work best.

Let’s explore how to scrape free proxies from the “free-proxy-list”, conduct a demonstration of testing them, and store the working proxies.

  • I have created a repository on my GitHub where I’ll be sharing my web scraping scripts.

How to use proxies in web scraping

Start by obtaining proxies from a website.

import csv
import re
import requests


def extract_proxies(url, filename):
    """
    Extracts proxy IP addresses and ports from a given URL and writes them to a CSV file.

    Args:
        url (str): The URL to scrape for proxies.
        filename (str): The name of the CSV file to write the proxies to.
    """
    # Send a GET request to the URL
    response = requests.get(url)

    # Extract the proxies using a regular expression
    proxies = re.findall(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}:\d+', response.text)

    # Write the proxies to a CSV file
    with open(filename, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        for proxy in proxies:
            writer.writerow([proxy])


extract_proxies('https://free-proxy-list.net/', 'proxies.csv')

The above function scrapes all available free proxies from the ‘free-proxy-list’ proxy provider and saves them in a CSV file named ‘proxies.csv’. The next step is to test these proxies and retain only the ones that work properly.
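The full testing demo lives in the repository, but a minimal sketch of the idea could look like this: read the saved proxies, send a quick request through each one, and keep only the proxies that respond in time. The ‘working_proxies.csv’ filename and the httpbin.org test URL are assumptions made here for illustration.

import csv
import requests


def test_proxies(input_file, output_file, test_url="https://httpbin.org/ip"):
    """Keep only the proxies that successfully complete a request."""
    working = []
    with open(input_file, newline='') as csvfile:
        proxies = [row[0] for row in csv.reader(csvfile) if row]

    for proxy in proxies:
        proxy_config = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
        try:
            # A short timeout weeds out dead or very slow proxies quickly.
            response = requests.get(test_url, proxies=proxy_config, timeout=5)
            if response.ok:
                working.append(proxy)
        except requests.RequestException:
            continue  # unreachable proxy: discard it

    with open(output_file, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        for proxy in working:
            writer.writerow([proxy])

    return working


working_proxies = test_proxies('proxies.csv', 'working_proxies.csv')
print(f"{len(working_proxies)} proxies passed the test")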

In addition to testing the proxies, the demo also shows how to use proxies with the ‘requests’ library. You can find it on my GitHub page. Don’t forget to give this article a clap and star the repository to stay informed about future updates.

Wishing you the best on your learning journey! ❤️
