How to Create your own Search Engine with Python Language and Laravel Framework Step 2 of 4
Step 2 : Data Crawling Using Python Scrapy
In this part, I am going to write a web crawler that will scrape data from the Books to Scrape website. But before I get into the code, here’s a brief intro to Scrapy itself.
What is Scrapy?
From Wikipedia:
Scrapy (pronounced skray-pee) is a free and open source web crawling framework, written in Python. Originally designed for web scraping, it can also be used to extract data using APIs or as a general purpose web crawler. It is currently maintained by Scrapinghub Ltd., a web scraping development and services company.
Prerequisites
So, as in the previous part, to follow along with this tutorial you should have:
- Python 3.x.x
- The package manager pip
- A code editor (in this tutorial I will use Visual Studio Code; optional)
Installing Scrapy
Now before we start our first Scrapy project we need to install Scrapy, so open up your terminal and type:
pip install scrapy
Start Your First Project
In your terminal, type:
scrapy startproject <project_name>
where <project_name> is the project name; call it whatever you want. In this tutorial I will use “web_scrap”.
Scrapy Project Structure
Before we do the scraping process it’s good to know the structure of a Scrapy project.
After the startproject command is run, Scrapy creates the project skeleton automatically. In the newly created project you will not yet see the file “book_list.py”, because it is a file we create manually for scraping. And “book.json” is the result of the Scrapy process.
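The generated layout looks roughly like this (“book_list.py” is the spider we add ourselves, and “book.json” is produced by the crawl):

```
web_scrap/
├── scrapy.cfg
├── book.json              # output of the crawl (created later)
└── web_scrap/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── book_list.py   # created manually (our spider)
```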
Make Scrapy Script
Following the project structure above, we will create the Scrapy script inside the “spiders” directory. In this tutorial I create a script named “book_list.py”. If you have VS Code, just type:
code <project_name>/spiders/book_list.py
And fill your blank script with the following:
Introducing the Script
import scrapy
import json

filename = "book.json"  # file to store the scraped data
These lines are where the imports are declared and the output file is determined. In this tutorial I will export the output to a JSON file.
class IntroSpider(scrapy.Spider):
    name = "book_spider"  # name of the scraper
The “name” attribute sets the name of the spider; you can use another name instead. We use this class attribute to call or launch the spider from the command line. Bear in mind that in Scrapy every spider must have a unique name.
    start_urls = ['http://books.toscrape.com/catalogue/page-{x}.html'.format(x=x) for x in range(1, 50)]
The start_urls list specifies which pages we want to scrape. (Note that Scrapy only picks this list up automatically when it is named start_urls; each response is then handed to the parse method.)
The other piece is the parse method, which is responsible for parsing the DOM; it’s where we write the CSS selector expressions that extract the data.
    def parse(self, response):
        book_list = response.css('article.product_pod > h3 > a::attr(title)').extract()  # the titles
        link_list = response.css('article.product_pod > h3 > a::attr(href)').extract()  # the title links
        price_list = response.css('article.product_pod > div.product_price > p.price_color::text').extract()  # the prices
        image_link = response.css('article.product_pod > div.image_container > a > img::attr(src)').extract()  # the image links
To find the CSS selector you need, just visit the website and do the following:
- Open a Website
- Press “Ctrl+Shift+I” on your keyboard, or right-click and choose “Inspect Element” in your browser
- Finally, right-click the element you want to scrape and choose Copy > Copy selector. That gives you the CSS selector path, which you can use as-is or modify on your own
And the last thing: we write the output to the JSON file with this code:
        list_data = []
        i = 0
        for book_title in book_list:
            data = {
                'book_title': book_title,
                'price': price_list[i],
                'image-url': image_link[i],
                'url': link_list[i]
            }
            i += 1
            list_data.append(data)

        with open(filename, 'a+') as f:  # writing the data to the file
            for data in list_data:
                app_json = json.dumps(data)
                f.write(app_json + "\n")
The JSON output produced by this code looks like:
{ "book_title": "the title", "price": "the price", "image-url": "the image url", "url" : "the title url" }
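Since the next part builds an indexer on top of this file, here is a minimal sketch of reading the JSON-lines output back into Python (the load_books helper is my own, not part of the tutorial; it assumes one JSON object per line, as written above):

```python
import json
import tempfile

def load_books(path):
    """Read a JSON-lines file (one JSON object per line) into a list of dicts."""
    books = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:
                books.append(json.loads(line))
    return books

# Usage with a small stand-in file in place of book.json:
with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False, encoding='utf-8') as tmp:
    tmp.write('{"book_title": "A Light in the Attic", "price": "£51.77"}\n')
    tmp.write('{"book_title": "Tipping the Velvet", "price": "£53.74"}\n')
    path = tmp.name

books = load_books(path)
print(len(books))               # 2
print(books[0]['book_title'])   # A Light in the Attic
```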
Execute the Scraper
Before executing your scraper, you must be in the project directory, so from the command line or your terminal navigate to your project folder:
cd <project_name>
Now from your command line, you can launch the spider using the command
scrapy crawl <spider_name>
<spider_name> is the value of the “name” attribute from earlier, so in my case I will run:
scrapy crawl book_spider
Output
If it succeeds, you will have an output file named “book.json” containing the data-set we need to create our Search Engine.
Reference:
code : https://github.com/Andika7/searchbook
Proceed to the next part !
I hope you’ve found the second part of this tutorial helpful. We learned how to collect a data-set with Python Scrapy.
In the next part, we’re going to make Indexer and Query script with Python Programming Language
Please refer to this link for the part:
Part 3 : https://builtin.com/machine-learning/index-based-search-engine-python