Web Scraping with Scrapy

Data Science Skill:

Web Scraping Using Scrapy in Python

Ahmad Mahmoud

--

At first, web scraping can seem like a completely alien concept: it is about gathering data from websites using code. It may look hard, but it is one of the most logical and easily accessible sources of data.

I started by learning Beautiful Soup, but to get better results I moved on to Scrapy.

In this tutorial I’ll go through an example of how to scrape a website to gather data, using the SpringCM blog.

Automating this process with a web scraper avoids manual data gathering, saves time, and also lets you keep all of the scraped data in one structured file.

Getting Started:

Scrapy is a tool for scraping websites efficiently. We will be using Spyder from Anaconda Navigator, so we will do the following (an equivalent command-line setup is sketched after the list):

  1. Create a new virtual environment in Anaconda Navigator.
  2. Go to the “Not installed” packages list.
  3. Install pandas and scrapy.
  4. Launch Spyder.
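If you prefer the Anaconda prompt over the Navigator GUI, a roughly equivalent setup looks like the sketch below (the environment name scraping_env is just an example):

    conda create -n scraping_env python=3
    conda activate scraping_env
    conda install -c conda-forge pandas scrapy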

Creating a project:

  1. Open your Anaconda prompt.
  2. Navigate to the directory where you want to create the project.
  3. To create a Scrapy project, use the following command (project names may only contain letters, numbers, and underscores):

scrapy startproject name_of_project

This will create a folder with the following structure:
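For a project named name_of_project, Scrapy generates roughly this layout:

    name_of_project/
        scrapy.cfg            # deploy configuration file
        name_of_project/      # the project's Python module
            __init__.py
            items.py          # item definitions
            middlewares.py    # spider and downloader middlewares
            pipelines.py      # item pipelines
            settings.py       # project settings
            spiders/          # folder where your spiders live
                __init__.py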

4. Create a file in the spiders folder to write your code in; the initial code in that file looks like the example below.
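A minimal spider skeleton along these lines (the class name and values are the ones we will use later in this tutorial) is:

    import scrapy

    class DataSpyder(scrapy.Spider):
        # unique name used to run the spider with "scrapy crawl"
        name = 'springcm'
        # list of URLs the spider starts crawling from
        start_urls = [
            'https://www.springcm.com/blog'
        ]

        def parse(self, response):
            # called with the response of each crawled start URL
            pass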

In the above code you can see name, start_urls and a parse function.

  • name: This is the name of the spider. Proper names will help you keep track of all the spiders you make. Names must be unique, because the name is what you use to run the spider with scrapy crawl name-of-spider.
  • start_urls: A list of URLs where the spider begins to crawl when no particular URLs are specified. The first pages downloaded will be those listed here, and subsequent requests are generated successively from the data contained in these start URLs.
  • parse(self, response): This function is called whenever a URL is crawled successfully; it is also called the callback function. The response returned as a result of crawling (the same object you see in the Scrapy shell) is passed to this function, and you write the extraction code inside it!

Inspecting the web page:

Plan of Attack:

SpringCM web page

As we can see, the website is divided into several blog posts, each with a title (on the left) and content (on the right). Thus we need to:

  1. Find the container that holds all these blog posts.
  2. Get the links to all the posts.
  3. From each link, extract the title and content.
  4. Save the data.

Inspecting the website:

As shown above, each post is found inside div.post-container.

To identify the HTML selectors more easily, you can use the SelectorGadget tool.

Clicking the icon indicated by the arrow at the top right lets you get the HTML reference (tag) of any object, or set of objects, on the website.

Here the title has a selector of #hs_cos_wrapper_name, and the content is found in p tags.

Let’s Start Coding:

Scrapy’s main piece of code is response.css(' '), where you specify the CSS selector of what you want to get. It returns a list of selectors wrapping the matching elements.

If you use response.css(' ').extract() you will get a list of strings containing the matching HTML content.
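You can try this in the Scrapy shell before writing the spider. A quick sketch using the selectors found above (assuming the page structure described earlier):

    scrapy shell 'https://www.springcm.com/blog'
    >>> response.css('div.post-container')              # list of selectors, one per post block
    >>> response.css('div.post-container').extract()    # list of HTML strings
    >>> response.css('.more-link::attr(href)').extract()  # list of post URLs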

As we said, we are going to scrape the SpringCM website. We are going to set the name of our spider to springcm, and the start_urls will be 'https://www.springcm.com/blog'.

In the parse function we will read the title and the content using response.css(' ').extract().

The code is as follows:

import scrapy
import pandas as pd

#table that will collect the scraped rows
table = []

#define the class that contains the scraping functions
class DataSpyder(scrapy.Spider):
    #the name of the spider
    name = 'springcm'
    #the urls that you want to scrape
    start_urls = [
        'https://www.springcm.com/blog'
    ]
    #NB: it is so important to set name and start_urls

    #parsing function
    def parse(self, response):
        #read all the post containers in html form
        contents = response.css('div.post-container')
        for content in contents:
            #each blog post's url
            url = content.css('.more-link::attr(href)').get()
            #call the getContent function, giving it the url to follow
            yield response.follow(url, callback=self.getContent)

    #NB: yield works like return, but the spider keeps running

    #this extracts the content I want
    def getContent(self, response):
        title = response.css('#hs_cos_wrapper_name::text').extract()
        content = response.css('p::text').extract()
        #put them in a table, convert it to a DataFrame and then to a csv file
        table.append((title, content))
        x = pd.DataFrame(table, columns=['Title', 'Content'])
        x.to_csv('springcm_blog.csv', sep=',')   #example output filename
        #yield the scraped item so Scrapy can also export it if needed
        yield {'Title': title, 'Content': content}
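To run the spider, navigate to the project folder in the Anaconda prompt and use the spider’s name; you can also let Scrapy export the yielded items directly (the output filename here is just an example):

    scrapy crawl springcm
    scrapy crawl springcm -o posts.csv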

Remember! Each web page has its own structure. You will have to study that structure a little to figure out how to get the desired elements.

For further reading, you can refer to the Official Scrapy Docs.
