Web Crawling with Scrapy
In data analytics, the most important resource is the data itself. Since web crawling means “programmatically going over a collection of web pages and extracting data”, it is a useful way to collect data when no official API is available.
In this article, we will go through the following topics:
- Setting up Scrapy
- Crawling data from web pages
- Dealing with infinite scrolling pages
Setting up Scrapy
Scrapy is a powerful Python library for web crawling. To install it, run the following in the command line:
pip install scrapy
Our goal
In this article, we will use Yummly as an example. Our goal is to download the ingredients of each recipe for later text-mining use (see the related Kaggle competition). Now it’s time to create our spiders :)
Create our first Spider
Create a Python file called crawler.py:

import scrapy

class RecipeSpider(scrapy.Spider):
    name = "recipe_spider"
    start_urls = ["https://www.yummly.com/recipes"]
Here we create a class that inherits from scrapy.Spider. (In the library, Spider already defines the machinery for following links and scraping data.) We need to give…