Web Scraping in Python

Pretending I’m a human (while web scraping)

Kevin
5 min read · Jun 4, 2022

How I passed the Turing test


Web scraping. It’s a super hot topic, and many successful startups are based entirely on selling access to data that they’ve collected on the internet.

The other day, I planned out a web scraping project. I was going to collect and analyze rent prices in Downtown Toronto, the place where I go to university. I wanted a good deal on apartments for the next school year.

I inspected the page to plan out what I would scrape and how. I installed Selenium and BeautifulSoup in Python and wrote a short script to grab the contents of the site. I ran it, and got:

“Your request was blocked by [anti-DDoS service]”

That sucks. I need the data; my rent prices depend on it! How can we bypass these security measures?

Web scraping is all about convincing a server that the request you’re sending it is from a human. Here are some ways to seem human, and how to implement them in Python.

A quick note on ethics: web scraping like this is legal for the most part. Bypassing these filters is ethically murky, but plenty of companies want data that sits behind them and hire for exactly this skill set. Make sure to check a site’s ToS and robots.txt before scraping, and try your best to comply, I guess?

Also, this article assumes you know the basics of sending and receiving requests with BeautifulSoup/Selenium.

Level 1: Headers

The first thing that servers look for is the request headers. These contain information about the request itself, like what browser the request originates from, the site that produced this request, etc.

Two important headers to change are “User-Agent” and “Referer.”

  • User-Agent specifies the browser/application that sent the request. Some Python libraries set this to “Python,” which is a dead giveaway that you’re a bot. Here’s a list of User-Agent strings that you can use instead.
  • Referer specifies the website you were redirected from. This can be a lifesaver sometimes. For example, setting Referer to “https://www.google.com” tells the server that you were redirected from Google to the page that you’re visiting.

In Python, this can be implemented as follows:

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0",
    "Referer": "https://www.google.com"
}

url = <site-url-here>
response = requests.get(url, headers=headers)

This gets me around many sites with basic security measures. Also, many sites return different pages for mobile and desktop users, so setting the right headers can ensure that you get the correct response from a server.
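For example, here’s a minimal sketch of how that plays out (the URL is a placeholder, and the mobile string is just one example of an iPhone Safari User-Agent; any string from a public User-Agent list works the same way):

import requests

# Desktop vs. mobile request to the same (placeholder) URL.
desktop_headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0"}
mobile_headers = {"User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 14_6 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Mobile/15E148 Safari/604.1"}

url = "https://example.com"  # placeholder
desktop_page = requests.get(url, headers=desktop_headers)
mobile_page = requests.get(url, headers=mobile_headers)

# Many sites serve different HTML to each; compare to see which layout you want to parse.
print(len(desktop_page.text), len(mobile_page.text))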

However, in this case, it did not work for me. On to the next step!

Another basic tip: rate limiting

Here’s the thing: if a server sees some machine sending 1,000 requests per second… you’re basically trying to DoS them. Don’t do that. At that point you literally are a bot.

Send requests at a steady, relatively slow pace; 1 request per second is a reasonable rate.

However, sites may anticipate this. When requests are being sent from a single source at a constant rate, you may get flagged as a bot. Thus, you can try to send requests at random intervals.

Even better, you can send these requests from various IP addresses using proxy servers.

Here’s how to implement the random intervals (not the proxy server part) in Python:

import time
import random

while <condition>:
    # wait a random 1.0 to 1.9 seconds between requests
    sleep_time = random.randint(10, 19) / 10
    time.sleep(sleep_time)
    requests.get(...)
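As for the proxy part that I skipped: the idea, roughly, is to route each request through a different IP. Here’s a minimal sketch (not the setup I used; the proxy addresses below are placeholders you’d swap for real proxies from a provider or a free proxy list):

import random
import requests

# Placeholder proxy addresses; replace with real ones.
proxy_pool = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
]

url = "https://example.com"  # placeholder

# Pick a different proxy for each request so traffic comes from several IPs.
proxy = random.choice(proxy_pool)
response = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)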

Unfortunately for my project, I was only sending a single request to test things out, so rate limiting was definitely not the issue. What else can I do?

Going Deeper: Automated Browsers

What if we pretend to be human…by being human?

Selenium is a browser automation library that’s commonly used for web scraping. What this means is that it will literally open a Chrome browser on your computer to access a page.

Benefits of this include the ability to click and interact with pages, and… a 99% chance of being recognized as a human (I don’t know if it’s actually 99%, but since you literally opened a browser window on your computer, you’re most likely going to be treated as one).

Here’s some basic code for that:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get(url)

If you run this in a Python file or IDE, you will see a Chrome browser open up and load the page you requested, and most sites will not be able to distinguish your automated browser from the actions of a human.
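To give a taste of the “click and interact” part, here’s a small sketch (the CSS selectors are hypothetical; they depend entirely on the site you’re scraping):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com")  # placeholder URL

# Hypothetical selectors: click a "load more" button, then read each listing.
driver.find_element(By.CSS_SELECTOR, "button.load-more").click()
for listing in driver.find_elements(By.CSS_SELECTOR, "div.listing"):
    print(listing.text)

driver.quit()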

After implementing this, I could successfully access the site I wanted to web scrape!

Fantastic! Why don’t we use Selenium for all web scraping applications, then?

Selenium is annoying. It takes effort to seem human.

It sucks when a Chrome browser pops up every few seconds while you’re trying to play League of Legends or work on a different project.

Also, Selenium doesn’t work on a lot of cloud computing services like Google Colab and Heroku unless you set it to run in headless mode (meaning the Chrome browser it opens is invisible), but that usually results in getting detected as a bot again!

Since I wanted to host my web scraping application on the cloud, and I hate random windows popping up on my screen in my spare time, I HAD to do better than this.

This is where we put all of these concepts together.

Using selenium-wire in Python, we can keep track of the requests that Selenium sends, and with my superior knowledge of headers, I can make sense of them.

If we print the headers of the request, we can quickly tell that the User-Agent header straight-up tells the server that our requests are being sent from a ‘HeadlessChrome’ browser. WHY DO YOU HAVE TO DO THIS TO ME, SELENIUM???
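Here’s roughly how that inspection looks with selenium-wire, which records every request the browser makes on driver.requests (a sketch; the URL is a placeholder):

from seleniumwire import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')

driver = webdriver.Chrome(options=chrome_options)
driver.get("https://example.com")  # placeholder URL

# selenium-wire captures each outgoing request; print the headers it sent.
for request in driver.requests:
    print(request.url)
    print(request.headers)  # look for "HeadlessChrome" in the User-Agent

driver.quit()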

However, by putting everything we’ve learned together, we can alter the headers to make the headless state of the chrome browser undetectable!

from seleniumwire import webdriver
# you can use selenium if you don't want to monitor the requests sent.

chrome_options = webdriver.ChromeOptions()
# set a headless driver
chrome_options.add_argument('--headless')
# set the user-agent back to Chrome
user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.50 Safari/537.36'
chrome_options.add_argument(f'user-agent={user_agent}')
url = ...
driver = webdriver.Chrome(options=chrome_options)
driver.set_window_size(1080, 800) # set the size of the window
driver.get(url)

Notice that I also set the driver’s window size. Bot detection systems may inspect the window properties of our “Chrome window,” checking whether it behaves like a real, visible window. Setting the width and height of the window like so is a simple way to get past that check.
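If you’re curious whether the disguise took, you can ask the browser what the page actually sees. This is just a sanity check I’m sketching here (run it right after driver.get(url) in the snippet above), not part of the original script:

# Values a bot-detection script could inspect from inside the page.
print(driver.execute_script("return navigator.userAgent"))  # should no longer say "HeadlessChrome"
print(driver.execute_script("return [window.innerWidth, window.innerHeight]"))  # reflects set_window_size
print(driver.execute_script("return navigator.webdriver"))  # still True for automated browsers; some checks look at this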

After implementing this, I am now able to web scrape the site I was looking at, without being blocked by their anti-DDoS service OR having random windows pop up on my screen! Yay!!!

Conclusion

There are still many other things that websites do to block undesired traffic. For example, Google serves CAPTCHAs when it detects unusual traffic, and some sites block the known IP ranges of cloud hosting services.

Fortunately, there are a lot of ways to bypass these too. In fact, here is the article I read that helped me make my headless Chrome driver undetectable!

Web scraping is a continuous effort to outplay a website’s bot-detection and DDoS-prevention systems while those systems evolve to detect non-human users better. It’s important to understand the methods used to detect unwanted traffic, and to learn how to bypass them.

If you have any interesting points to share, I would love to hear your comments down below. Please like the article D:
