Scrape a Password-Protected Website with Python (Scrapy)

Jeff Vincent
Jul 17, 2019


First things first, if you have to log in to a site to gain access to the content you wish to scrape, it warrants asking yourself if it’s a good idea in the first place. That said, legal and ethical use-case in hand, let’s get started.

For our purposes, we will be grabbing an entire page’s HTML and writing it to a local file, with which we can then do whatever we like.

Prerequisites:

You’ll need to install Scrapy, Python’s beloved web scraping framework.

Running pip3 install scrapy will do the trick. Installing within a virtualenv is recommended but certainly not required.

Process:

First, we’ll create a new Scrapy project, by running:

scrapy startproject <project-name>

where <project-name> is the name of your project ;).

Then, within the spiders directory, create the module you’ll use to define the behavior of your spider. I’ve called mine scraper.py.

Within scraper.py, we’ll first have to import Scrapy. Then, we’ll extend Scrapy’s Spider class with a class of our own. I’ve called mine PracticeSpider.

Note: setting name = 'practice' on the class is what lets us refer to the spider from the terminal later. We choose which spider to run by its name, in this case by running scrapy crawl practice.

Then, we’ll write three methods on the class. The first, start_requests, holds a list of URLs, which are iterated over and requested one by one.

Next, parse is called. Although it is passed explicitly as the callback here, it would still run even if it weren’t, because parse is Scrapy’s default callback for any request that doesn’t specify one. Within parse, we’ll handle the response from the login page we’re trying to get past.

We’ll pass it to Scrapy’s FormRequest.from_response() method along with login credentials (keep an eye peeled for type='hidden' inputs with a token that may also need to be included).

From there, we could continue to any other page that is protected by the password we have by yielding scrapy.Request(url=<the-protected-url>) from the callback that handles the login response.

For the sake of brevity here, we simply set the callback to self.write. This works because we are accessing the method on the instance of the class, rather than on the class itself, so Scrapy receives a bound method and calls it with the response.

Finally, we write the HTML that was returned after our login attempt to a local file, so we can verify success or failure.
