How to Scrape All Types of Websites with Python — Part 2
A comprehensive guide on how I scraped 19 thousand Medium posts with Scrapy and Splash.
In the previous article (part 1 of this tutorial) we learnt how to set up the environment for this project. We downloaded and installed Anaconda Navigator, Scrapy, Docker and Splash.
If you already have these installed then, voila! Otherwise, take a few minutes to go through the previous post and set up the environment.
Now let's get started!!
Project goal: Scrape thousands of Medium articles from Towards Data Science
Things to learn
- Program in VS Code
- Write a Splash script
- Extract patterns with Scrapy
- Store data in CSV, JSON and XML
Creating a new project
a. Launch Anaconda Navigator and click on Environments.
b. Click the play button next to the environment we created in the previous tutorial (in my case web_scraping_project) and open a terminal.
c. Create a new directory with mkdir and then change into it with cd [directory name].
d. Create a new project with the Scrapy command scrapy startproject [project_name].
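For example (the outer directory name here is just an illustration; the project name matches the folder we open in the next step):
mkdir medium_scraping
cd medium_scraping
scrapy startproject medium_splash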
e. On your desktop, open the folder medium_splash and you will see the following files and sub-folders.
Let’s learn a few things about the various files.
scrapy.cfg
This is the project configuration file; it also helps us deploy our spider.
spiders
Within this folder is where we will write our spider scripts.
items.py
Used to define the fields in which we will store the scraped data; an illustrative example follows after these file descriptions.
middlewares.py
This contains hooks that process the request and response objects as they pass through Scrapy.
Don’t worry we will learn about these two objects along the way.
pipelines.py
We use it to process the scraped items, for example to clean them or store them in a database.
settings.py
We can configure our project settings here.
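For illustration, the fields we will eventually scrape (header, sub-header, content, claps and tags) could be declared in items.py like this; the class name is just an example, and yielding plain dictionaries from the spider works as well:
import scrapy


class MediumSplashItem(scrapy.Item):
    # one field per piece of data we want to keep for each article
    header = scrapy.Field()
    sub_header = scrapy.Field()
    content = scrapy.Field()
    claps = scrapy.Field()
    tags = scrapy.Field()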
f. Change directory to the project folder and generate our spider with scrapy genspider [spider_name] [page_link], where spider_name is a spider name of your choice (in my case md) and page_link is the link to the page to scrape.
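In my case the spider name is md; a typical invocation (the exact link you pass may differ) is:
scrapy genspider md towardsdatascience.com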
g. Install the scrapy-splash package. This enables us to send Splash requests from Scrapy scripts.
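From the same terminal this is typically:
pip install scrapy-splash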
h. Launch Anaconda Navigator and run VS Code. If you see an Install button instead of a Launch button, click it to install VS Code first, then launch it.
i. Click on Open Folder, navigate to your project directory, and select the folder we created. If you are following along, select the folder named medium_splash.
j. The project should open as below. We are currently in the spiders > md.py file. This is where we will write all our code.
Now we have successfully created a new project and launched it in VS Code.
In this section, we will learn:
- How to configure our project settings.
- How to write our scraping script.
- How to run it and extract our data set.
a. Open the settings.py file.
robots.txt
This is a file many websites use to tell crawlers such as Scrapy which pages they may visit, and Scrapy obeys it by default. We will turn that off.
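The generated settings.py contains ROBOTSTXT_OBEY = True; change it to:
ROBOTSTXT_OBEY = False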
b. Some websites will quickly block you if they detect Scrapy's default user agent, so change it to your browser's user agent. You can find yours by simply searching "my user agent" in Google. Remember to replace the value with your own user agent.
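For example (the value below is only a sample Chrome user agent string; paste your own):
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'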
c. The following changes can be found in the official scrapy-splash GitHub repository; I encourage you to check out its README.
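As listed in the scrapy-splash README, register the Splash downloader and spider middlewares, the duplicate filter and the cache storage in settings.py:
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'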
d. At the bottom of the settings.py file, add the line below and save it.
SPLASH_URL = 'http://localhost:8050'
That is the URL and port on which Splash will be running; it is the same address you can open in the browser.
Writing our code
a. Within the spiders folder, open the spider file, in our case md.py.
By default, we already have the boilerplate that scrapy genspider created.
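The generated template looks roughly like this (the allowed_domains and start_urls values depend on the link you passed to the command):
import scrapy


class MdSpider(scrapy.Spider):
    name = 'md'
    allowed_domains = ['towardsdatascience.com']
    start_urls = ['http://towardsdatascience.com/']

    def parse(self, response):
        pass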
We will import SplashRequest from the scrapy_splash library:
from scrapy_splash import SplashRequest
When you check out the main archive of Towards Data Science at https://towardsdatascience.com/archive, you can see that each page contains a list of articles, and that the archive for a specific day can be visited by appending the year, month and day, as in https://towardsdatascience.com/archive/2020/01/01
b. So let's generate all the main page links as follows
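One straightforward way to do this (a sketch, not necessarily the author's exact code) is to build one archive URL per day with the datetime module; the date range below is only an example:
from datetime import date, timedelta

# one archive page per day, e.g. https://towardsdatascience.com/archive/2020/01/01
archive_urls = []
day = date(2019, 1, 1)              # example start date
while day <= date(2020, 12, 31):    # example end date
    archive_urls.append(
        f'https://towardsdatascience.com/archive/{day.year}/{day.month:02d}/{day.day:02d}'
    )
    day += timedelta(days=1)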
c. When you run a spider, Scrapy looks for two methods: start_requests and parse.
Let's create them.
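Here is a minimal sketch of both methods, assuming a simple Lua script that opens the page, waits for it to render and returns the HTML, and reusing the archive_urls list from the previous step; the parse_article callback is written in the next step, and the author's exact gist may differ:
import scrapy
from scrapy_splash import SplashRequest

lua_script = """
function main(splash, args)
  -- open the requested page
  assert(splash:go(args.url))
  -- give the page a moment to render
  assert(splash:wait(2))
  -- return the rendered HTML to Scrapy
  return splash:html()
end
"""


class MdSpider(scrapy.Spider):
    name = 'md'

    def start_requests(self):
        # send every archive page through Splash so it is fully rendered
        for url in archive_urls:
            yield SplashRequest(url=url, callback=self.parse,
                                endpoint='execute', args={'lua_source': lua_script})

    def parse(self, response):
        # follow the "read more" link of every article listed on the archive page
        for link in response.xpath(".//div[@class='postArticle-readMore']/a/@href").getall():
            yield SplashRequest(url=link, callback=self.parse_article,
                                endpoint='execute', args={'lua_source': lua_script})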
The Lua script at the top is the Splash part of the code; its explanation has been included as comments within it.
XPath is a language for selecting nodes in XML documents, and it can also be used with HTML. The pattern .//div[@class='postArticle-readMore']/a/@href is basically the path to the link of every article listed on the page. You can find this path with the browser developer tools (Ctrl+Shift+I).
I strongly recommend a quick read on how to use XPath and CSS selectors here.
d. We will create a function to extract our items (the header, sub-header, article content, number of claps and tags) from the HTML response returned by the Splash script.
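Here is a hedged sketch of such a method, added inside the spider class. The XPath expressions are assumptions about Medium's markup, which changes over time, so inspect the article page with the developer tools and adjust them; you could equally fill the fields of the item defined in items.py instead of yielding a dictionary:
    def parse_article(self, response):
        # pull out the fields we care about; tweak the selectors to match the page
        yield {
            'header': response.xpath('//h1//text()').get(),
            'sub_header': response.xpath('//h2//text()').get(),
            'content': ' '.join(response.xpath('//article//p//text()').getall()),
            'claps': response.xpath("//button[contains(@aria-label, 'clap')]//text()").get(),
            'tags': response.xpath("//a[contains(@href, '/tag/')]/text()").getall(),
        }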
Putting everything together, the complete md.py script is simply the pieces above combined.
Running our Spider
a. Launch Docker Desktop.
b. Open a command prompt and issue this command to run the Splash server with Docker:
docker run -p 8050:8050 scrapinghub/splash --max-timeout 3600
c. In VS Code, click on View and then on Terminal to open the terminal within VS Code.
d. Run the spider with scrapy crawl md within the terminal.
e. To store the scraped items in a file, simply run
scrapy crawl md -o [filename].[json, csv or xml]
e.g., let's store the data in JSON format:
scrapy crawl md -o mydata.json
Now our spider should be running perfectly.
Conclusion
- In this tutorial, we learnt how to write and run a simple spider with Scrapy and Splash
- We stored the data in either CSV, JSON or XML
- The items scraped are the header, sub-header, article content, number of claps and tags
You can get the entire code from my GitHub repository.