Web Crawling with Scrapy

Wendee 💜🍕
Published in Analytics Vidhya
6 min read · Jan 10, 2020
In data analytics, the most important resource is the data itself. Web crawling, defined as "programmatically going over a collection of web pages and extracting data", is a helpful technique for collecting data when no official API is available.

In this article, we will go through the following topics:

  1. Set up Scrapy
  2. Crawl data from web pages
  3. Deal with infinite scrolling pages

Set up Scrapy

Scrapy is a powerful Python framework for web crawling. In the command line, execute:

pip install scrapy

Our goal

In this article, we will use Yummly as an example. Our goal is to download the ingredients of each recipe for later text-mining use (see the related Kaggle competition). Now it’s time to create our spiders :)

Create our first Spider

Create a Python file called crawler.py:

import scrapy

class RecipeSpider(scrapy.Spider):
    name = "recipe_spider"
    start_urls = ["https://www.yummly.com/recipes"]

Here we create a class that inherits from scrapy.Spider (in the library, Spider already defines how to follow links and scrape data). We need to give…
