How to scrape all types of websites with Python — part 2

Joachim Kuleafenu
Published in Analytics Vidhya · 6 min read · Aug 10, 2021


A comprehensive guide on how I scraped 19 thousand Medium posts with Scrapy and Splash.


In the previous article (part 1 of this tutorial) we learnt how to set up the environment for this project. We downloaded and installed Anaconda Navigator, Scrapy, Docker and Splash.

If you already have these installed then, voila! Otherwise, take a few minutes to go through the previous post for the environment setup.

Now let's get started!!

Project goal: Scrape thousands of Medium articles from Towards Data Science

Things to learn

  1. Learn how to program in VS Code
  2. Write Splash Script
  3. Extract patterns with Scrapy
  4. Store data in CSV, JSON and XML

Creating a new project

a. Launch Anaconda Navigator and click on Environments

b. Click on the play button next to the environment we created in the previous tutorial (in my case web_scraping_project) and open a terminal

c. Create a new directory with mkdir, then change into it with cd [directory name].

d. Create a new project with the Scrapy command scrapy startproject [project_name].

e. On your desktop open the folder medium_splash and you will see the following files and sub-folders.

Let’s learn a few things about the various files

scrapy.cfg: This helps us deploy our spider.

items.py: Used to store scraped data in some fields we will create.

middlewares.py: This is responsible for request and response objects. Don’t worry, we will learn about these two objects along the way.

pipelines.py: We use it to store the items we scrape in a database.

settings.py: We can configure our project settings here.

spiders: Within this folder is where we will write our spider scripts.

f. Change directory to the project folder and generate our spider with
scrapy genspider [spider_name] [page_link]
spider_name is a spider name of your choice; in my case, md.
page_link is the link to the page to scrape.

g. Install the scrapy-splash package (pip install scrapy-splash). This enables us to send Splash requests from Scrapy scripts.

h. Launch the Anaconda Navigator and run VS Code. If you see an Install button instead of Launch, click it to install VS Code first, then launch it.

i. Click on Open Folder, navigate to your project directory, and select the folder we created. If you are following along, select the folder named medium_splash.

j. The project should open as below; we will work in the spiders > md.py file. This is where we will write all our code.

Now we have successfully created a new project and launched it in VS Code.

In this section, we will learn:

  • How to configure our project settings.
  • Write our scraping script.
  • Run it to extract our dataset.

a. Open the settings.py file.

robots.txt is a file many websites use to tell crawlers which pages they may not visit, and Scrapy obeys it by default, which can block your spider.
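In settings.py this behaviour is controlled by the ROBOTSTXT_OBEY flag. Setting it to False (as this tutorial assumes) tells Scrapy to ignore robots.txt; be considerate with your request rates if you do:

```python
# settings.py: Scrapy obeys robots.txt by default; disable that here
ROBOTSTXT_OBEY = False
```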



b. Some websites will quickly block you if Scrapy's default user agent is detected, so change it to your browser's user agent.

You can find your user agent by simply searching "my user agent" on Google.
Remember to replace it with your own user agent.
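In settings.py this is the USER_AGENT option. The string below is only an example Chrome user agent, not necessarily yours; paste in whatever the Google search returned for you:

```python
# settings.py: replace Scrapy's default user agent with your browser's
# (example Chrome user-agent string; substitute your own)
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/92.0.4515.131 Safari/537.36')
```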



c. These changes can be found in the official Scrapy-splash GitHub repository. I encourage you to check it out.
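For reference, the middleware settings recommended in the scrapy-splash README look like this (the priority numbers come from the official repository):

```python
# settings.py additions from the scrapy-splash README
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```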





d. At the base of the file add the below script and finally save it.

SPLASH_URL = 'http://localhost:8050'

That is the URL and port on which the Splash server listens.

Writing our code

a. Within the spiders folder, open the spider file, in our case md.py.

By default we already have:

We will import the scrapy_splash library

from scrapy_splash import SplashRequest

When you check out the main archive of towardsdatascience.com,

you can see that each page contains a list of articles, and each day's page can be visited by appending the year, month and day, e.g. towardsdatascience.com/archive/2021/08/10.

b. So let's generate all the main page links as follows
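One way to sketch this in plain Python, assuming the /archive/YYYY/MM/DD pattern described above, is to walk a date range and format one URL per day:

```python
from datetime import date, timedelta


def archive_links(start, end):
    """Yield one archive-page URL per day from start to end (inclusive)."""
    day = start
    while day <= end:
        yield (f"https://towardsdatascience.com/archive/"
               f"{day.year}/{day.month:02d}/{day.day:02d}")
        day += timedelta(days=1)


links = list(archive_links(date(2021, 1, 1), date(2021, 1, 3)))
# three links, one for each of Jan 1-3, 2021
```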

c. When you run a spider, Scrapy searches out for two functions, start_requests and parse functions.

Let's create them

The first part of the script is the Splash (Lua) code; its explanation is infused as comments in it.

XPath is a language for selecting nodes in XML documents, which can also be used with HTML.

The pattern .//div[@class='postArticle-readMore']/a/@href is basically the path to the link of every listed article on the page. You can discover this path with the browser developer tools (Ctrl+Shift+I).

I strongly recommend a quick read on how to use XPath and CSS selectors here.

d. We will create a function to extract our items from the HTML response returned by the splash script.

Putting everything together; the script should look like this.

Running our Spider

a. Launch the docker desktop

b. Open a command prompt and issue this command to run the Docker server:

docker run -p 8050:8050 scrapinghub/splash --max-timeout 3600

c. On the tabs within the VS Code, click on view and then on terminal, to open the terminal within the VS Code.

d. Run the script scrapy crawl md within the Terminal

e. To store the scraped items in a file, simply do

scrapy crawl md -o [filename].[json or csv or XML]

e.g. let’s store the data in JSON format

scrapy crawl md -o mydata.json

Now our spider should be running perfectly.


  1. In this tutorial, we learnt how to write and run a simple spider with Scrapy and Splash.
  2. We stored the data in either CSV, JSON or XML.
  3. The items scraped are header, sub-header, article content, number of claps and tags.

You can get the entire code on my Github repository.


