An Introduction to Web Scraping using Scrapy

Joyce Annie George · Published in Analytics Vidhya · May 8, 2020

Web scraping is the process of extracting relevant data from websites, and it is an important skill in the field of data science. There are several libraries available for scraping data. In this tutorial, we will extract some relevant data from the Internet using Scrapy, an open-source framework for extracting data from websites, processing it, and storing it in your preferred structure. Let us first look into the architecture of Scrapy. Scrapy has the following components.

  1. Scrapy Engine: The engine is responsible for controlling the data flow between all components of the system. It triggers events when certain actions occur.
  2. Scheduler: The scheduler enqueues the requests it receives from the engine and feeds them back when the engine asks for them.
  3. Downloader: The downloader is responsible for fetching web pages and feeding them to the engine.
  4. Spiders: Spiders are custom classes written by users to extract relevant information from the fetched web pages.
  5. Item Pipeline: After the spider scrapes data from the internet, the item pipeline is responsible for processing it (a minimal sketch follows this list).
  6. Downloader middlewares: Downloader middlewares are specific hooks that sit between the engine and the downloader. They process requests on their way to the downloader and responses on their way back to the engine.
  7. Spider middlewares: These components sit between the engine and the spiders. They process spider input (responses) and spider output (items and requests).
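To make the item pipeline concrete, here is a minimal, purely illustrative pipeline; this tutorial does not actually use one, and the class name is hypothetical. It drops any scraped item that arrives without a title:

from scrapy.exceptions import DropItem

class DropMissingTitlePipeline:
    # Illustrative only: discard any item that was scraped without a title
    def process_item(self, item, spider):
        if not item.get('title'):
            raise DropItem('missing title')
        return item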

If you want to read about the Scrapy architecture in depth, please take a look at the architecture overview.

Now let us start working with the framework. Make sure you have Python installed. First, we have to install Scrapy.

$ pip install scrapy

If you come across any issues while installing Scrapy, please look into the official documentation here.

Let us build a spider to extract movie data. We will scrape the IMDb website to create a movie dataset. This data can be used to build applications such as a movie search engine or a movie recommendation system. First of all, we have to create a Scrapy project.

$ scrapy startproject MovieSpider

This creates a Scrapy project with the following structure.

Project Structure
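The generated layout looks roughly like this (the exact files can vary slightly across Scrapy versions):

MovieSpider/
    scrapy.cfg            # deploy configuration file
    MovieSpider/          # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # downloader and spider middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # put your custom spiders here
            __init__.py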

As a beginner, you don’t have to know about all of these files. For now, remember that your custom spiders should be placed inside the spiders folder. Now let us create our spider. Notice that there are two nested folders named MovieSpider. Move into the inner one and generate a basic spider.

$ cd MovieSpider/MovieSpider
$ scrapy genspider imdb_bot imdb.com

This will create a new spider imdb_bot.py in your spiders folder. The structure of the spider is as shown below.

Structure of Spider Code
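With recent Scrapy versions, the generated skeleton looks roughly like this (minor details differ across versions):

import scrapy

class ImdbBotSpider(scrapy.Spider):
    name = 'imdb_bot'
    allowed_domains = ['imdb.com']
    start_urls = ['http://imdb.com/']

    def parse(self, response):
        pass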

The name of our spider is imdb_bot. The start_urls list specifies the URLs to be crawled, and I have modified it to point to this result page. You can run any search query on IMDb and update start_urls to point to its result page. Scrapy sends a request to each of these URLs, and the downloaded response is handed to the parse() method, which specifies what we want to do with it.
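Concretely, the edit is a single line inside the generated class; the URL below is only a placeholder, so substitute the results page of your own search:

    # Inside ImdbBotSpider (placeholder URL; use your own results page)
    start_urls = ['https://www.imdb.com/search/title/?title_type=feature']

Now, let us run our spider.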

$ scrapy crawl imdb_bot

Now, we have crawled a single page, but we didn’t implement our parse() function yet. Let us start working on that. We have to analyze the page to identify the information relevant for our application. We can extract it using CSS selectors or XPath expressions. CSS selectors are patterns that select elements by tag, class, or ID, while XPath is a query language for selecting nodes from an XML or HTML document. Locating elements using XPath gives a lot of flexibility, so we will use XPath expressions to select elements from the result page. If you are not familiar with XPath expressions, please have a look at this cheatsheet. Each movie title on the result page is a link pointing to the corresponding movie page, and we have to visit that page to extract information about the movie. For that, we have to inspect the title of the movie: right-click on the movie title and click Inspect to get the class name of the element.

The above image shows that the title is an h3 tag with the class name lister-item-header, and the anchor inside it links to the movie page. If you inspect other movie titles, you will see that all result titles share this class name. Instead of checking every title manually, you can use an XPath expression. Click on the Elements tab in the inspect window and press Ctrl+F; you can enter expressions in the search bar to select items on the page. The expression to get all movie links is //h3[@class="lister-item-header"]/a/@href. If you enter this expression in the search bar, it will match all the movie links on the page, as shown below.
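You can verify the same expression outside the browser with Scrapy’s interactive shell; a quick session might look like this (the URL is a placeholder for your own results page):

$ scrapy shell "https://www.imdb.com/search/title/?title_type=feature"
>>> # the XPath expression from DevTools, reused verbatim
>>> response.xpath('//h3[@class="lister-item-header"]/a/@href').getall()
>>> # the equivalent CSS selector
>>> response.css('h3.lister-item-header a::attr(href)').getall()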

This results page contains 50 titles. The extracted links are relative: if you click on the first one, you will notice that adding imdb.com to the beginning of it leads to the movie page. On the movie page, you have to identify the data you need to scrape, inspect those elements, and extract the data by writing XPath expressions. I will not explain each expression in detail; for each of the 50 movies, I extracted relevant data such as the title, summary, and genres. Once the relevant information is scraped, we yield it to return it as a dictionary to Scrapy.

imdb_bot.py
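A condensed sketch of the spider follows. The start_urls value and the movie-page selectors (summary_text, subtext) are assumptions based on IMDb’s markup at the time of writing; confirm them with the inspector before relying on them:

import scrapy

class ImdbBotSpider(scrapy.Spider):
    name = 'imdb_bot'
    allowed_domains = ['imdb.com']
    # Placeholder: point this at the results page of your own search
    start_urls = ['https://www.imdb.com/search/title/?title_type=feature']

    def parse(self, response):
        # Collect the relative link of every movie title on the results page
        for link in response.xpath('//h3[@class="lister-item-header"]/a/@href').getall():
            # response.follow resolves the relative link against the current page
            yield response.follow(link, callback=self.parse_movie)

    def parse_movie(self, response):
        # The selectors below are assumptions about IMDb's markup;
        # inspect the live movie page to confirm them.
        yield {
            'title': response.xpath('//h1/text()').get(),
            'summary': response.xpath('//span[@class="summary_text"]/text()').get(),
            'genres': response.xpath('//div[@class="subtext"]/a/text()').getall(),
            'url': response.url,
        }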

You can store the scraped output in your preferred format. If you want to store it as a JSON file, run the following command.

$ scrapy crawl imdb_bot -o movies.json
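Other feed formats work the same way; Scrapy infers the exporter from the file extension. For example, CSV or JSON Lines:

$ scrapy crawl imdb_bot -o movies.csv
$ scrapy crawl imdb_bot -o movies.jl

Note that -o appends to an existing file; newer versions of Scrapy also provide -O, which overwrites the file instead.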

The whole code for the project is available here.
