Scrape faster with Scrapy shell
Are you adding print statements and rerunning your scraper time and again to test your output? Do you wish you could set breakpoints? Do you keep Chrome open in the background, using jQuery to test selectors live on the website you are trying to scrape?
I know this feeling and I am happy to share a better way! This trick has saved me hours.
Use the shell
Python has an excellent command-line interpreter that I often use for checking syntax or how things work. Run python in your shell and you are in. I use it to test small code snippets, and because it automatically evaluates objects without printing, it’s great for quickly checking results.
Disclaimer: If you are building anything larger than a small script, I highly recommend having a test suite. By running it on every file update instead, you get validation of your functionality and syntax without having to copy/paste or rewrite code.
You can add breakpoints in Scrapy
First up, let’s talk about how we can validate what is going on inside our scraper without having to rely on print statements. The solution here is to use the inspect_response function from scrapy.shell. This makes Scrapy pause the Python execution and open an interactive prompt right where you place the call. This is an excellent way to break the code execution and debug in-process.
Take, for example, this scraper I’ve set up for my own blog:
Adding a call to inspect_response in this file, I can now run this and inspect the response right away:
The response object and any other local variables are all available there, and it is possible to call any methods on them in real time.
You can use Ctrl+D to exit the shell and continue scraping or quit() to abort.
You can also scrape in real time
Now for the best part. You don’t even have to have a scraper built. You can simply pass in your URL to Scrapy and run it as-is using the Scrapy shell:
scrapy shell https://greycastle.se
Smart work is fast work
Moving from jQuery and print statements to debugging with the Scrapy shell, as in the example above, cut the time it took me to build scrapers by hours. Of course, the more complex your scraper becomes, for example with multiple steps, the harder it may be to use the shell, but I still think it can help you a lot to prove your thinking before finalising the code. When the code needs to change, inspect_response is your best friend for quickly verifying the results.
Hope this helps, enjoy your scraping!