Write a Scrapy Spider to Crawl an E-Commerce Website

In this blog we will learn how to crawl an e-commerce website and export the results in .csv format.

Here, we are going to crawl the website http://souq.com/, and the fields that we will scrape are Title, Category, OriginalPrice, CurrentPrice, Discounted, Savings, SoldBy, SellerRating, and Url.

To get started with a Scrapy spider, there are a few steps we have to follow:

- Install required modules

- Create a scrapy project

- Create spider in scrapy project

- Declare items

- Write Spider

- Execute Spider and get results

Install required modules

For this Scrapy spider we are going to use the following modules, which you can install using pip:

pip install scrapy
pip install bs4

If you are not familiar with pip, you can follow this link https://pip.pypa.io/en/stable/
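Once both packages are installed, you can quickly confirm that Scrapy is available by running:

scrapy version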

Create a scrapy project

Now that we are done with the installation, the next step is to create a Scrapy project. For that we need to run the following command:

scrapy startproject <project_name>

In our case we will use “souqCrawler” as the project name.

So we will run “scrapy startproject souqCrawler”

After this, the project structure will look something like the one given below:
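For a project named souqCrawler, the generated layout looks roughly like this (the exact files can vary slightly between Scrapy versions):

souqCrawler/
    scrapy.cfg
    souqCrawler/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py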

Now that the Scrapy project is created, the next step is to create a Scrapy spider. We can create multiple spiders under a single Scrapy project.

Create spider in scrapy project

Now we will create a Scrapy spider in the “souqCrawler” project that we just created.

There are two ways to create a Scrapy spider: one is using the command line, and the other is to create a Python file under the souqCrawler/spiders directory and define the spider structure in that file; it will then be treated as a spider for that Scrapy project.

Here we will create the spider using the command line. To do that, we have to pass a spider name (which can be user defined) and the start URL (the first URL the spider will open, though we can change this later as well).

To create the Scrapy spider, run the following command:

scrapy genspider <spider_name> <start_url>

So in our case we will write “scrapy genspider souqSpider http://uae.souq.com/ae-en/shop-all-categories/c/”

Make sure you run this command inside the Scrapy project directory.

Now we are all set with creating a project and spider. Next we will write the spider, but before that the items have to be declared. Items are declared in the items.py file, located at souqCrawler/items.py.

Here we are going to declare the items like this:
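Based on the fields listed at the beginning of this blog and the SouqcrawlerItem class used later in the spider, a minimal items.py looks roughly like this:

import scrapy

class SouqcrawlerItem(scrapy.Item):
    Title = scrapy.Field()
    Category = scrapy.Field()
    OriginalPrice = scrapy.Field()
    CurrentPrice = scrapy.Field()
    Discounted = scrapy.Field()
    Savings = scrapy.Field()
    SoldBy = scrapy.Field()
    SellerRating = scrapy.Field()
    Url = scrapy.Field()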

Now that the items are declared, our next step is to write the spider code.

Write Spider

Before writing the Scrapy spider, let’s learn a bit about how it works and what the main purpose of writing one is. The main purpose is to grab some data from a website and store it somewhere in the required format. When a Scrapy spider starts, it begins with the provided URLs (i.e. start_urls) and then, depending on the written program, it starts crawling. But to crawl, we first have to locate the elements on the webpage where crawling starts.

To locate those elements we have two options: CSS selectors and XPath.

I always prefer XPath, so in this blog we will use XPath to locate elements. If you would like to learn more about XPath, you can refer to this link https://www.w3schools.com/xml/xpath_intro.asp
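To make the two options concrete, here is a small self-contained sketch (the HTML snippet and class name are made up purely for illustration) showing that CSS and XPath can select the same value:

from scrapy.selector import Selector

html = '<div><a class="img-link" href="/product/1">View</a></div>'
sel = Selector(text=html)

# XPath and CSS selectors pick out the same href here
print(sel.xpath("//a[@class='img-link']/@href").extract())  # ['/product/1']
print(sel.css("a.img-link::attr(href)").extract())          # ['/product/1']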

Now we have to open the spider file that we created earlier. To open it, navigate to souqCrawler/spiders/souqSpider.py

When you open this file, the class will already be declared there; it will look like this:

import scrapy

class SouqspiderSpider(scrapy.Spider):
    name = "souqSpider"
    start_urls = ['http://uae.souq.com/ae-en/shop-all-categories/c/']

    def parse(self, response):
        pass

Now we have to write the rest of the code. Let’s declare the XPaths first. I prefer to use the Mozilla Firefox browser along with the Firebug + FirePath plugins, which help me locate XPaths easily.

You can install Firebug and FirePath by clicking the settings button at the top right of the browser and installing them from there.

import re
import scrapy
from bs4 import BeautifulSoup
# assuming the default project layout, the item class lives in souqCrawler/items.py
from souqCrawler.items import SouqcrawlerItem

class SouqspiderSpider(scrapy.Spider):
    name = "souqSpider"
    start_urls = ['http://uae.souq.com/ae-en/shop-all-categories/c/']

    def __init__(self):
        self.declare_xpath()

    # All the XPaths used by the spider are kept in one place
    def declare_xpath(self):
        self.getAllListXpath = "//div[@class='grouped-list']//a/@href"
        self.getAllProductsXpath = "//a[@class='img-link quickViewAction']/@href"
        self.pageTitleXpath = "//h1/text()"
        self.categoryXpath = "//div[@id='productTrackingParams']/@data-category-name"
        self.priceXpath = "//div[@id='productTrackingParams']/@data-price"
        self.discountedPriceXpath = "//span[@class='was']/text()"
        self.savingsXpath = "//span[@class='noWrap']/text()"
        self.soldByXpath = "//dt[text()='Sold by:']/following-sibling::dd[1]//a/text()"
        self.SellerRatingXpath = "//dt[text()='Sold by:']/following-sibling::dd[1]/span/span/small/text()"

    def parse(self, response):
        pass
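If you want to sanity-check any of these expressions before writing the crawling logic, Scrapy's interactive shell is handy; for example:

scrapy shell "http://uae.souq.com/ae-en/shop-all-categories/c/"
>>> response.xpath("//div[@class='grouped-list']//a/@href").extract()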

Here we have declared all the XPaths required by the spider. After the declarations we have the parse method; this method is called automatically first when the spider is executed, so this is where crawling starts.

    def parse(self, response):
        # follow every category link found on the start page
        for href in response.xpath(self.getAllListXpath):
            url = href.extract()
            yield scrapy.Request(url, callback=self.parse_item)

    def parse_item(self, response):
        # follow every product link found on a category page
        for href in response.xpath(self.getAllProductsXpath):
            url = href.extract()
            yield scrapy.Request(url, callback=self.parse_main_item)

    def parse_main_item(self, response):
        item = SouqcrawlerItem()

        Title = response.xpath(self.pageTitleXpath).extract()
        Title = self.cleanText(self.parseText(self.listToStr(Title)))

        Category = response.xpath(self.categoryXpath).extract()
        Category = self.cleanText(self.parseText(self.listToStr(Category)))

        CurrentPrice = response.xpath(self.priceXpath).extract()
        CurrentPrice = self.cleanText(self.parseText(self.listToStr(CurrentPrice)))

        OriginalPrice = response.xpath(self.discountedPriceXpath).extract()
        OriginalPrice = self.cleanText(self.parseText(self.listToStr(OriginalPrice)))
        OriginalPrice = OriginalPrice.replace("AED", "")
        # if there is no "was" price, the product is not discounted
        Discounted = 'False'
        if OriginalPrice == '':
            OriginalPrice = CurrentPrice
        else:
            Discounted = 'True'

        Savings = response.xpath(self.savingsXpath).extract()
        Savings = self.cleanText(self.parseText(self.listToStr(Savings)))
        Savings = Savings.replace("AED", "")

        SoldBy = response.xpath(self.soldByXpath).extract()
        SoldBy = self.cleanText(self.parseText(self.listToStr(SoldBy)))

        SellerRating = response.xpath(self.SellerRatingXpath).extract()
        SellerRating = self.cleanText(self.parseText(self.listToStr(SellerRating)))

        item['Title'] = Title
        item['Category'] = Category
        item['OriginalPrice'] = OriginalPrice
        item['CurrentPrice'] = CurrentPrice
        item['Discounted'] = Discounted
        item['Savings'] = Savings
        item['SoldBy'] = SoldBy
        item['SellerRating'] = SellerRating
        item['Url'] = response.url
        yield item

    # join a list of extracted strings into a single string
    def listToStr(self, MyList):
        dumm = ""
        for i in MyList:
            dumm = "{0}{1}".format(dumm, i)
        return dumm

    # strip HTML tags and collapse whitespace characters
    def parseText(self, text):
        soup = BeautifulSoup(text, 'html.parser')
        return re.sub(" +|\n|\r|\t|\0|\x0b|\xa0", ' ', soup.get_text()).strip()

    # remove leftover HTML and normalize whitespace and non-breaking spaces
    def cleanText(self, text):
        soup = BeautifulSoup(text, 'html.parser')
        text = soup.get_text()
        text = re.sub("( +|\n|\r|\t|\0|\x0b|\xa0|\xbb|\xab)+", ' ', text).strip()
        return text
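To see what the listToStr → parseText → cleanText chain actually does, here is a small standalone example with a made-up extracted value:

from bs4 import BeautifulSoup
import re

raw = ["\n  AED\xa01,299 ", " (Save 10%) "]  # the kind of list .extract() returns
joined = "".join(raw)  # equivalent to listToStr
text = BeautifulSoup(joined, 'html.parser').get_text()
clean = re.sub("( +|\n|\r|\t|\0|\x0b|\xa0)+", ' ', text).strip()
print(clean)  # AED 1,299 (Save 10%)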

Here you can see we have three chained methods: the “parse” method captures all the main category links, then passes those URLs to the “parse_item” method, and finally all product URLs are passed to the last method, “parse_main_item”. Once the whole spider is written, just run it with the command “scrapy crawl souqSpider -o dump.csv”

Here dump.csv is the name of the output file where the data will be stored after crawling.
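Scrapy's feed exports are not limited to CSV; the same spider can write JSON or XML simply by changing the output file extension, for example:

scrapy crawl souqSpider -o dump.json
scrapy crawl souqSpider -o dump.xml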

This blog was originally posted at http://pritpal.xyz/blog/posts/write-scrapy-spider-to-crawl-e-commerce-website-2

You can download the full source code from here https://github.com/pritsingh1701/souq