Published in Nerd For Tech

Scrape faster with Scrapy shell

Are you adding print statements and rerunning your scraper time and again to test your output? Do you wish you could set breakpoints? Do you keep Chrome open in the background, using jQuery to test selectors live on the website you are trying to scrape?

I know this feeling and I am happy to share a better way! This trick has saved me hours.

Use the shell

Python has an excellent interactive interpreter that I often use to check syntax or see how something works. Run python in your shell and you are in. Because it evaluates expressions and shows the result without needing print, it’s great for trying out small code snippets and quickly checking results.

Using the python command line interpreter to try out small snippets
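
As a small illustration of that workflow, these are the kind of lines one might type at the >>> prompt. The URL path is made up for the example. In the interactive interpreter, a bare expression such as the last line is evaluated and echoed back without print; run as a script, the same line produces no output.

```python
# Lines as they might be typed one by one at the >>> prompt.
# The URL path below is a made-up example.
url = "https://greycastle.se/some-post/"

# Take the last path segment, ignoring the trailing slash
slug = url.rstrip("/").split("/")[-1]

# At the prompt, a bare expression is evaluated and echoed: 'some-post'
slug
```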

Disclaimer: If you are building anything larger than a small script, I highly recommend having a test suite. By running the tests on every file change instead, you get validation of your functionality and syntax without having to copy/paste or rewrite code.

You can add breakpoints in Scrapy

First up, let’s talk about how to see what is going on inside a scraper without relying on print statements. The solution is the inspect_response function, which you import from scrapy.shell. It makes Scrapy pause execution and open an interactive prompt right where you place the call. This is an excellent way to break into the code and debug in-process.

Take, for example, this scraper I’ve set up for my own blog:

Example spider

After adding a call to inspect_response to this file, I can run the spider and inspect the response right away:

Breaking into a running scraper to inspect the response

The response object is available there, along with the running spider (as spider) and the request that produced the response (as request), and you can call methods on them in real time.

response.css('.entry-title a::text').getall()

You can use Ctrl+D (Ctrl+Z on Windows) to exit the shell and continue scraping, or quit() to abort.

You can also scrape in real time

Now for the best part. You don’t even need to have a scraper built. You can pass a URL straight to the Scrapy shell, which fetches the page and drops you into a prompt with the response already loaded:

scrapy shell https://greycastle.se

Running Scrapy shell to quickly inspect a page

Smart work is fast work

Moving from jQuery and print statements to debugging with the Scrapy shell, as in the example above, cut the time it took me to build scrapers by hours. Of course, the more complex your scraper becomes (multiple steps and so on), the harder it gets to use the shell, but I still think it can help you a lot to prove your thinking before finalising the code. When the code needs to change later, inspect_response is your best friend for quickly verifying the results.

Hope this helps, enjoy your scraping!

David Dikman


Full-stack developer and founder. Writing here and at https://greycastle.se. Currently open for contract work.