From 0 to 1: how to build a web crawler from scratch in Python. Part II.

Lena Li
Nov 5

Let’s continue from the previous section. We have extracted the URLs of all the pages of the website. Now let’s grab the data points from the pages behind those URLs.

Part II: data extraction

First, let’s open one of the URLs and inspect the elements. The title, for example, is usually in the <h1> tag.

We covered XPath syntax in the previous section; here’s the code to extract data from the page:

As you can see, both XPath and CSS selectors work here. You can use either one to extract data.

With similar syntax, we can extract the price, description, availability, rating, reviews, etc. Now comes a tricky part: extracting the images. Let’s see what the image URLs look like:

element inspection of image URL

But the full URL is: http://books.toscrape.com/media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg

So we need to replace ‘../..’ with ‘http://books.toscrape.com’. Now we have a lot of data from this URL. Here’s the code:
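A sketch of the URL fixup: the plain replace() described above works for this site, and urljoin() is an alternative that resolves any relative path against the page URL (the page URL below is illustrative):

```python
from urllib.parse import urljoin

# The src attribute on the page is relative to the current URL.
relative_src = '../../media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg'

# Option 1: plain string replacement, as described above.
image_url = relative_src.replace('../..', 'http://books.toscrape.com')

# Option 2: resolve the relative path against the page URL
# (illustrative page URL), which works for any relative src.
page_url = 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'
image_url_joined = urljoin(page_url, relative_src)
```

In a spider, Option 2 would typically use response.urljoin(relative_src), which resolves against the response’s own URL.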

Let’s move on to the table at the bottom of the page:

We’d like to extract this data as well, but repeating the same syntax for every row is tedious. Let’s write a helper function instead and extract the values one by one.

def product_info(self, response, value):
    return response.xpath('//th[text()="' + value + '"]/following-sibling::td/text()').extract_first()

Here comes the code to extract all the info we need:

You can run the books spider and check the results.

Export data to databases

We now have a script that scrapes data from the web pages. How do we export the data?

You can export the results to .csv or .json with the -o option:

scrapy crawl books -o results.csv

Open the CSV file and you’ll see all the data we scraped:

output file

Scrapy also provides pipelines to databases such as MySQL and MongoDB. Here I’ll use MongoDB as an example. You can register a free MongoDB account and get 500 MB of storage at no charge.

After installing MongoDB, you also need to install pymongo to connect to the database. Next, go to settings.py and find the ITEM_PIPELINES setting. Uncomment it and register a pipeline class, for example “MongoDBPipeline”, which we will then define in pipelines.py.

Then we assign the MongoDB settings:

MONGODB_SERVER will be ‘localhost’;

MONGODB_PORT will be 27017;

MONGODB_DB, the name of the database, will be ‘bookstore’;

MONGODB_COLLECTION, the name of the collection, will be ‘books’.

Of course, you can name those as you wish.

ITEM_PIPELINES = {
    'bookstore.pipelines.MongoDBPipeline': 300,
}
MONGODB_SERVER = 'localhost'
MONGODB_PORT = 27017
MONGODB_DB = 'bookstore'
MONGODB_COLLECTION = 'books'

After changing settings.py, we can implement our pipelines.py file. We need to read the settings, then connect to the database.
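Here is a sketch of what pipelines.py could look like, assuming the MONGODB_* setting names defined above. The from_crawler classmethod reads them from settings.py; pymongo is imported inside open_spider so the sketch stays importable even without pymongo installed:

```python
class MongoDBPipeline:
    """Store scraped items in MongoDB (a sketch, assuming the
    MONGODB_* settings shown in settings.py above)."""

    def __init__(self, server, port, db_name, collection_name):
        self.server = server
        self.port = port
        self.db_name = db_name
        self.collection_name = collection_name

    @classmethod
    def from_crawler(cls, crawler):
        # Read connection details from the project settings.
        s = crawler.settings
        return cls(s.get('MONGODB_SERVER'), s.getint('MONGODB_PORT'),
                   s.get('MONGODB_DB'), s.get('MONGODB_COLLECTION'))

    def open_spider(self, spider):
        import pymongo  # imported here so the sketch stays importable
        self.client = pymongo.MongoClient(self.server, self.port)
        self.collection = self.client[self.db_name][self.collection_name]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Insert a plain-dict copy of the item and pass it on.
        self.collection.insert_one(dict(item))
        return item
```

Scrapy calls open_spider once when the crawl starts, process_item for every scraped item, and close_spider when the crawl ends.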

Then you can run the spider again and find all the data in MongoDB.


Finally, some tips

Please be careful while crawling websites, or you may get banned. Here are some tips to keep in mind while web crawling.

1. Space out your requests: in settings.py, activate the DOWNLOAD_DELAY option, or manually add some code to sleep for a random number of seconds, e.g. sleep(random.randrange(1, 3)) (this needs import random and from time import sleep).

2. Use a user agent: in settings.py, set the USER_AGENT option. Defining your user agent makes your crawler look more like a browser driven by a human than like a robot.

3. Change IP with proxies: find external proxies and rotate IP addresses while scraping. You can use the scrapy-proxies package for this purpose.
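In settings.py, the first two tips can look like this (the delay value and user-agent string are examples, not recommendations):

```python
# Wait between requests; Scrapy multiplies this delay by a random
# factor between 0.5 and 1.5 when RANDOMIZE_DOWNLOAD_DELAY is on
# (which is the default).
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True

# Identify as a browser instead of the default Scrapy user agent.
USER_AGENT = ('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36')
```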

Something more before the end

Now, with these tutorials, you know what Scrapy is; how to build a basic spider to scrape data from a website; how to iterate over multiple pages and scrape data from each page; how to use XPath to extract data; and how to output the extracted data or export it to a database.

During this practice, we have seen that Scrapy tries to solve not only content extraction (scraping), but also navigation to the relevant pages for further extraction (crawling). So you write a small amount of Python code within Scrapy’s framework, and the framework triggers your functions at the right time.

Scrapy is very powerful; it can also be combined with Splash or Selenium to build crawlers for dynamic web pages. If you cannot fetch data directly from the website (you need to load the page, fill in a form, click somewhere, scroll down, and so on), then Splash or Selenium works well alongside Scrapy to handle AJAX calls and JavaScript execution.

Besides that, Scrapy is also very efficient at concurrency. It is built on an asynchronous networking framework (Twisted), so you do not have to wait for one request to finish before making another. With asynchronous code, you can build a much more efficient crawler in Scrapy.

Scrapy is the most popular tool for web scraping and crawling written in Python. It is simple and powerful, and it provides many of the functions required for downloading websites and other content on the internet, making development quicker and less programming-intensive. This tutorial is very basic, but it is a good starting point for a more advanced web crawler. If you are interested in more advanced functionality, I’d strongly recommend taking the course on Udemy.

Resources:

https://docs.scrapy.org/en/latest/intro/overview.html

https://www.udemy.com/course/scrapy-tutorial-web-scraping-with-python/learn/lecture/11127000#overview

https://www.analyticsvidhya.com/blog/2017/07/web-scraping-in-python-using-scrapy/

https://www.irjet.net/archives/V4/i2/IRJET-V4I225.pdf
