Improve your web scraping with only one command | Scrapy, Python
Have you always wanted to see what you could scrape from a website before testing it element by element?
You know, making that tedious process quicker…
This was a huge deal when I was working with BeautifulSoup and Selenium. It used to take me hours to check which elements were scrapable and which were not so easily accessible, like those inside an iframe.
But later, after I started working with Scrapy (yeah, I was afraid of it the first time 😅), I realized there was a solution, and it was simpler than I thought!
view(response)
It’s clear that Scrapy itself lets you do a lot of testing before putting your hands on the actual scraper (spider).
And that testing playground is the Scrapy shell, which, as most of you know, you can open by simply running this command in the terminal:
scrapy shell
And that alone is a big step up from BeautifulSoup and Selenium! 😎
But I’m here to talk about view(response). This one function of the interactive shell, the one I mentioned in the title, is a game changer:
fetch("https://example.com")
view(response)
This command will open a tab in your default browser. There, you can see the raw HTML document, rendered, with all the information it contains.
I mean, all the information that was received from the request made by the fetch function, and therefore every element that is available for you to scrape!!
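Under the hood, the idea is simple: dump the response body to a file and hand it to your browser. Here's a minimal stdlib-only sketch of that idea (the function name `view_body` is my own, not Scrapy's API):

```python
import tempfile
import webbrowser

def view_body(body: bytes, open_browser: bool = True) -> str:
    """Dump raw response bytes to a temp .html file and open it in the
    default browser -- roughly what Scrapy's view(response) does for you."""
    with tempfile.NamedTemporaryFile(suffix=".html", delete=False) as f:
        f.write(body)
        path = f.name
    if open_browser:
        # The browser renders exactly what the server sent, no JavaScript
        # executed after the fact, so what you see is what you can scrape.
        webbrowser.open("file://" + path)
    return path

# Inside the Scrapy shell you'd pass response.body to something like this.
```

The key point is that the file contains only what came back over the wire, which is exactly why the trick works as a scrapability check.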
For example, this is Elton John’s Wikipedia page:
Granted, Wikipedia is an easy case, since its pages are mostly plain server-rendered HTML.
But if you use the view(response) command in the shell for this website, you’ll see your browser open and show you this:
fetch("https://en.wikipedia.org/wiki/Elton_John")
view(response)
Which is almost the same as the previous page 😅, but now you know that everything you see there is scrapable with Scrapy!
This way, you just saved yourself one little step in the web scraping process. Believe me, I discovered this recently and now I just can’t stop using it. It’s really useful and saves you a lot of time with this awesome framework.
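This check also makes the iframe problem from earlier obvious: the iframe *tag* is in the response, but its *contents* are a separate document that was never downloaded. A tiny stdlib sketch (the sample HTML body is made up for illustration):

```python
from html.parser import HTMLParser

# A made-up response body: the page embeds an iframe.
BODY = """<html><body>
  <h1>Elton John</h1>
  <iframe src="https://example.com/embedded-player"></iframe>
</body></html>"""

class TagsAndText(HTMLParser):
    """Collect the tag names and the text actually present in the document."""
    def __init__(self):
        super().__init__()
        self.tags, self.text = [], []
    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())

p = TagsAndText()
p.feed(BODY)
print(p.tags)  # ['html', 'body', 'h1', 'iframe'] -- the tag is there...
print(p.text)  # ['Elton John'] -- ...but the embedded page's text is not
```

If something doesn't show up in the view(response) tab, it wasn't in the response, and you'll have to fetch it separately (or render it another way).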