Solving Real-World Data Science Tasks With Python Scrapy: “Book Reviews Dataset Creation”
Without a systematic way to start and keep data clean, bad data will happen. — Donato Diorio
Achieving a model with the best predictive performance depends on our training methodology and the data we feed it. In the real world, data isn’t always ready for us to use: you can’t just find the perfect CSV file that contains exactly what you need for any project you have in mind. Instead, you can go collect whatever data you need for your model all by yourself; this is called “web scraping”.
Throughout this tutorial, we’ll be scraping Goodreads, the world’s largest site for readers and book recommendations. Our goal is to extract data for 50 books in the data science field. For each book we’ll have its title, author, global rating, reviews, and ratings, all gathered in a JSON file.
1- What is Web Scraping? Web scraping refers to the creation of software to automate data extraction from online sources. The data is collected and then exported in a format that is more useful to the user. While web scraping can be done manually, automated tools are preferred because they cost less and work faster. In most cases, web scraping is not an easy task: websites come in many shapes and forms, so web crawlers vary in functionality and features.
2- Tools:
- VS Code: a code editor redefined and optimized for building and debugging modern web and cloud applications.
- Scrapy: a fast, open-source web crawling framework written in Python, used to extract data from web pages with the help of XPath/CSS selectors. It generates feed exports in formats such as JSON, CSV, and XML.
3- Installation: You can install Scrapy and its dependencies from PyPI with:
pip install Scrapy
4- Creating a project: Before we start scraping, we need to create a new Scrapy project:
scrapy startproject scraping_goodreads
This will create a scraping_goodreads directory with the following contents:
scraping_goodreads/
    scrapy.cfg            # deploy configuration file
    scraping_goodreads/   # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
5- Creating our spider: In the spiders directory, create a new file spider_goodreads.py; this is where we’ll save our spider. Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites). The scraping cycle goes something like this:
- Find the URL that you want to scrape
- Inspect the page and find the data you want to extract
- Write the code
- Run the code, extract the data, and store it in the required format
Let’s apply this to extract data from goodreads website:
Step 1: Find the URL that you want to scrape. Here we want to extract book data from the data science shelf, so the URL we need is https://www.goodreads.com/shelf/show/data-science.
Step 2: Find the data you want to extract. First, we need to extract the first 50 book links; then we need to access those links to extract the title and the author of each book, its global rating, and finally all the reviews with their ratings. How can we find all this? Easy! We need to inspect the page:
Extracting the links:
#links css selector
Once we access a book’s link, we can start extracting the needed information. The title:
#title css selector
The author:
#author css selector
The global rating:
#global rating css selector
The reviews and their ratings:
For this part, the extraction is a little more complex, for four main reasons:
- The “more” button leaves us with the first part of the review duplicated: we only want to keep the whole review.
- The shortest reviews are contained in the same tags as the “more” button’s text: we need to separate the short reviews out of that tag.
- Spoilers in the reviews need to be removed, if there are any.
- The ratings are displayed as stars, so to get the value of a rating we need to count the number of filled stars contained in the span tag.
Step 3: Write the code. Now we need to translate step 2 into code. For this process we’ll be working with two files. items.py defines the model for our scraped items: every item your crawler uses to store data needs to be defined in this file.
and spider_goodreads.py for our crawler:
- We import scrapy and ScrapingGoodreadsItem, the item class we defined in items.py above, then define:
- reviews_ratings: a dictionary for storing our scraped reviews/ratings.
- informations: an instance of our ScrapingGoodreadsItem() to store all the scraped items.
Then we create our crawler class GoodreadsSpider(scrapy.Spider):
- We give it a name, ‘goodreads’, to simplify the execution call later.
- We define our start_urls: the URL we want to scrape.
Now, we define our parsing methods:
The first parsing method extracts all the links; parse takes self and the response we get from our start URL as arguments:
books contains the CSS selector that targets the links and extracts them from the response. We extract the URLs of each page inside the link (there is a pagination system for the reviews/ratings) and pass them to our second parsing method via a callback inside response.follow:
yield response.follow(url, callback=self.parse_reviews)
The second parsing method extracts the information from the passed links:
The difference between the two functions is that in parse_reviews we need to instantiate our ScrapingGoodreadsItem(), because now we are scraping data we need to store in our files, not just links we need to access.
We extract the title, the author, and the global rating with the CSS selectors we defined earlier and store them in informations.
Following the conditions we set in step 2, we extract the reviews and ratings and store them in reviews_dict, which needs to be reinitialized at the beginning of each extraction pass after appending the stored data to reviews_ratings. Finally, we pass reviews_ratings to informations.
Step 4: Run the code, extract the data, and store it in the required format. You only need one line of code to export your data in the format you want, whether it’s CSV, JSON, etc. In your VS Code terminal, execute this command:
scrapy crawl goodreads -o goodreadsdata.json
where goodreads is the name of our spider and goodreadsdata.json is the name of our output file. If you want a CSV file instead, you only need to change the extension:
scrapy crawl goodreads -o goodreadsdata.csv
Once the scraping is done, you’ll find your file in the project directory.
End Notes: I hope this tutorial on web scraping with Python has been informative and adds value to your knowledge. You can now create your own datasets, which will ease your exploratory data analysis because you already know what’s most important to highlight in what you’ve created.
You can find the project Here.