How to Create your own Search Engine with Python Language and Laravel Framework Step 2 of 4

Andika Pratama
Published in Analytics Vidhya
5 min read · Jan 27, 2020

Step 2 : Data Crawling Using Python Scrapy


In this part, I am going to write a web crawler that will scrape data from the Books to Scrape website. But before I get into the code, here’s a brief intro to Scrapy itself.

What is Scrapy?

From Wikipedia:

Scrapy (pronounced skray-pee) is a free and open source web crawling framework, written in Python. Originally designed for web scraping, it can also be used to extract data using APIs or as a general purpose web crawler. It is currently maintained by Scrapinghub Ltd., a web scraping development and services company.

Prerequisites

As in the previous part, to follow along with this tutorial you should have:

  • Python 3.x
  • The package manager pip
  • A code editor (optional; in this tutorial I will use Visual Studio Code)
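If you want to confirm what is already installed, these commands print the versions (assuming python and pip are on your PATH; on some systems they are called python3 and pip3):

python --version
pip --version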

Installing Scrapy

Now before we start our first Scrapy project we need to install Scrapy, so open up your terminal and type:

pip install scrapy
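To verify the installation succeeded, you can ask Scrapy for its version:

scrapy version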

Start Your First Project

In your terminal, type:

scrapy startproject <project_name>

where <project_name> is the project name; call it whatever you want, but in this tutorial I will use “web_scrap”.
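So in my case the command is:

scrapy startproject web_scrap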

Scrapy Project Structure

Before we start the scraping process, it’s good to know the structure of a Scrapy project.

Structure Of Scrapy Project

As seen above, after the startproject command is run, Scrapy creates the project skeleton automatically. In the newly created project you will not yet see the file book_list.py, because it is a file we create manually for scraping, and book.json is the output of the Scrapy process.
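For reference, this is a sketch of the standard layout Scrapy generates (the tutorial abbreviates the path to the spiders directory, and details can vary slightly by Scrapy version):

web_scrap/
├── scrapy.cfg              # deploy configuration
├── book.json               # crawl output (appears after the spider runs)
└── web_scrap/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── book_list.py    # our spider, created manually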

Make the Scrapy Script

As shown in the project structure above, we will create the Scrapy script inside the “spiders” directory. In this tutorial I create a Scrapy script named “book_list.py”. If you have VS Code, just type:

code <project_name>/spiders/book_list.py

And fill the blank script with the following code:

book_list.py
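The embedded gist does not render here, so below is a sketch of the complete spider, assembled from the snippets explained in the next section. The start_requests wrapper that feeds the urls list into Scrapy is my assumption about the original wiring; everything else mirrors the snippets below.

import scrapy
import json

filename = "book.json"  # File to store the scraped data

class IntroSpider(scrapy.Spider):
    name = "book_spider"  # Name of the scraper

    # Catalogue pages 1-49
    urls = ['http://books.toscrape.com/catalogue/page-{x}.html'.format(x=x) for x in range(1, 50)]

    def start_requests(self):  # Assumed glue: sends each URL to parse()
        for url in self.urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        book_list = response.css('article.product_pod > h3 > a::attr(title)').extract()
        link_list = response.css('article.product_pod > h3 > a::attr(href)').extract()
        price_list = response.css('article.product_pod > div.product_price > p.price_color::text').extract()
        image_link = response.css('article.product_pod > div.image_container > a > img::attr(src)').extract()

        list_data = []
        i = 0
        for book_title in book_list:
            data = {
                'book_title': book_title,
                'price': price_list[i],
                'image-url': image_link[i],
                'url': link_list[i]
            }
            i += 1
            list_data.append(data)

        with open(filename, 'a+') as f:  # Append one JSON object per line
            for data in list_data:
                app_json = json.dumps(data)
                f.write(app_json + "\n")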

Introducing the Script

import scrapy
import json
filename = "book.json" # To save store data

These lines declare the imports and determine the output file. In this tutorial I will export the output to a JSON file.

class IntroSpider(scrapy.Spider):
    name = "book_spider" # Name of the scraper

The “name” attribute sets the name of the spider being created; you can use another name instead. We use this class attribute to launch the spider from the command line, and bear in mind that in Scrapy every spider must have a unique name.

urls = ['http://books.toscrape.com/catalogue/page-{x}.html'.format(x=x) for x in range(1, 50) ]

The urls list specifies which pages we want to scrape; here it covers pages 1 through 49 of the book catalogue.
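A quick check in a Python shell shows what this list comprehension expands to (first two entries):

>>> urls[:2]
['http://books.toscrape.com/catalogue/page-1.html', 'http://books.toscrape.com/catalogue/page-2.html']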

The other piece is the parse method, which is responsible for parsing the DOM; it’s where we write the CSS selector expressions to extract the data.
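The extraction lines below sit inside that method; its signature looks like this (Scrapy calls it with each downloaded page as response):

def parse(self, response):
    # 'response' holds the downloaded page; the selector lines below go here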

book_list = response.css('article.product_pod > h3 > a::attr(title)').extract()  # accessing the titles
link_list = response.css('article.product_pod > h3 > a::attr(href)').extract()  # accessing the title links
price_list = response.css('article.product_pod > div.product_price > p.price_color::text').extract()  # accessing the prices
image_link = response.css('article.product_pod > div.image_container > a > img::attr(src)').extract()  # accessing the image links

To find the CSS selectors you need, just visit the website and do the following:

  • Open the website:
http://books.toscrape.com/catalogue/page-2.html
  • Press “Ctrl+Shift+I” on your keyboard, or right-click and choose “Inspect Element” in your browser
  • Finally, right-click the element you want to scrape and copy its CSS selector; you can also copy the full CSS path and trim it into a selector on your own (a way to test selectors interactively is shown below)
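If you want to try a selector before putting it in the spider, Scrapy ships with an interactive shell; for example, with the title selector used above:

scrapy shell 'http://books.toscrape.com/catalogue/page-2.html'
>>> response.css('article.product_pod > h3 > a::attr(title)').extract()[:3]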

And the last thing: we write the output to the JSON file in the format we want with this code:

list_data = []  # list_data must be initialized before the loop
i = 0
for book_title in book_list:
    data = {
        'book_title': book_title,
        'price': price_list[i],
        'image-url': image_link[i],
        'url': link_list[i]
    }
    i += 1
    list_data.append(data)
with open(filename, 'a+') as f:  # Writing data to the file
    for data in list_data:
        app_json = json.dumps(data)
        f.write(app_json + "\n")

Each record in the resulting JSON file looks like this:

{ "book_title": "the title", "price": "the price", "image-url": "the image url", "url" : "the title url" }

Execute the Scraper

Before executing your scraper, you must be in the project directory, so from the command line or your terminal navigate to your project folder:

cd <project_name>
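In my case the project is called web_scrap, so:

cd web_scrap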

Now from your command line, you can launch the spider using the command

scrapy crawl <spider_name> 

<spider_name> is the value of the “name” attribute from earlier, so in my case I will run:

scrapy crawl book_spider

Output

book.json

If successful, you will have an output file named “book.json”, which contains the dataset we need to create our search engine.

Reference:

Code: https://github.com/Andika7/searchbook

Proceed to the next part!

I hope you’ve found the second part of this tutorial helpful. We learned how to collect a dataset with Python Scrapy.

In the next part, we’re going to build the indexer and query scripts with Python.

Please refer to these links for the other parts:

Part 1 : https://medium.com/analytics-vidhya/how-to-create-your-own-search-engine-with-python-language-and-laravel-framework-step-1-of-4-f25e5ba1ab92

Part 3 : https://builtin.com/machine-learning/index-based-search-engine-python

Part 4 : https://medium.com/analytics-vidhya/how-to-create-your-own-search-engine-with-python-language-and-laravel-framework-step-4-of-4-4e91cf2557d6
