From Spiders to R and back…

A text-mining project for the social scientist (who likes computers, but knows not how to code)

Parvathi A. Subbiah
Nov 20, 2020 · 12 min read

For those who are curious about what these new computational methods and machine learning have to offer the discipline, here’s a taster! I’m going to take you through how to start (and hopefully finish) a computational social science project over the course of this post (and two more).

Granted, I’m going to take you through a highly niche — can’t stress that enough — area of study: groups on the extreme left in the United Kingdom that are supporters of the Maduro government in Venezuela. Niche indeed. For some context on the situation in Venezuela at the moment (and as a means to draw conclusions on your own) I recommend you check out this documentary on BBC’s iPlayer. I will refrain from giving my own opinion on the matter — let’s just say it’s too close to home and the discussion might take hours (you can definitely ask me about it though, if you’re curious, as I’ve been writing about it for 4 years!)

In this analysis, I’m going to focus on the online output of these groups in the U.K.: specifically, I will look at their most extensive written output, which is why text mining and textual analysis methods are appropriate. For this tutorial, I will only focus on one particular group in the U.K., the Hands off Venezuela campaign (HOV). (There are actually several Maduro supporter groups in the U.K., in case you were wondering. Strange, I know.) I will also focus only on their blog output, rather than their Facebook or Twitter output, which I’ll save for a different project.

Our first task is therefore to get all HOV’s written blog output together in a lovely table. One way to go about collecting this data would be to copy and paste all the articles from the HOV site. You know, with a mouse and a web-browser. But this, I will strongly argue, is a dreadful idea. Not only is it time-consuming, mind-numbingly boring, and error-prone, it is also non-reproducible. And as an aspiring data scientist I must tell you, non-reproducibility is a complete no-no (again, I’m happy to discuss why if you ask nicely).

So, first we have to learn how to ‘scrape’ this information from the HOV blog. In other words, we have to learn how to let our computer do the copying and pasting for us through a set of instructions. How, you ask? Enter Scrapy: a nifty little application written in Python that we can use to finish this task for us. (See Scrapy’s documentation here.)

I will give a taster of what Scrapy can do, and if you’re not too intimidated by your computer’s console (or terminal) you can follow along by typing the commands as they’re shown below. The actual Spider built to scrape the HOV site is in my repository on GitHub. Here you can also download the file that I used to conduct the textual analysis, and follow more advanced instructions (lucky you! You can skip this tutorial altogether!)

Some legal things to note: A website’s terms and conditions will state its scraping policy. Some sites very explicitly prohibit scraping (IMDB, for instance) and others less so. The data we are trying to capture increases the server load, so companies have reasons to want to limit scraping. A website’s robots.txt document should lay this out clearly. You can access this information by appending /robots.txt to the site’s root URL (e.g. example.com/robots.txt).
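Python’s standard library can even read these files for us. Here is a minimal sketch, using a hypothetical robots.txt for example.com (invented for illustration), of how urllib.robotparser decides what a polite scraper may fetch:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt, parsed offline for illustration.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Anything outside /private/ is fair game for any user agent...
print(rp.can_fetch("*", "https://example.com/blog/post"))  # True
# ...but /private/ is off-limits.
print(rp.can_fetch("*", "https://example.com/private/x"))  # False
```

Checking can_fetch before crawling a path is an easy way to stay on the right side of a site’s policy.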

For now, let us build a simple spider that will teach us the basics of scraping, erm… legally. For more comprehensive introductions to Scrapy I highly recommend these articles:

  1. Daniel Li’s ultimate intro
  2. A tutorial for scraping images and understanding how to crawl along different pages.

If you would like to learn how to work with textual data that you already have, skip this Python webscraping section, and head over to the next section where I discuss how to do textual analysis in R.

To install Scrapy, head over to your console. Feel free to copy and paste. Note that everything after a hash (#) is ‘commented out’: the console doesn’t interpret it, so it’s safe to copy along with the commands. It’s meant for human eyes only.

On Windows (hence the pypiwin32 dependency), we 1) make sure our pip is up to date, 2) install the dependency (in this case pypiwin32), and 3) only then install Scrapy:

# Install pip
python -m pip install --upgrade pip

# Install dependencies
pip install pypiwin32

# Install scrapy
pip install scrapy

On Linux (Debian/Ubuntu, given the apt-get command below), we install the system dependencies, then make sure pip is up to date, and only then install Scrapy:

# Install dependencies
sudo apt-get install python3 python3-dev python-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev

# Upgrade pip
pip install --upgrade pip

# Install scrapy
pip install scrapy

On macOS, we update the Xcode command-line tools, make sure Homebrew and Python are up to date, and only then install Scrapy. (Note that depending on which version of Python (2 or 3) is set as default, you may need to use pip3 instead of pip below.)

# Xcode update
xcode-select --install

# Install homebrew
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Update path variable so that homebrew starts before system packages
echo "export PATH=/usr/local/bin:/usr/local/sbin:$PATH" >> ~/.bashrc

source ~/.bashrc

# Make sure brew and python are up to date
brew update; brew upgrade python

# Install scrapy
pip install scrapy

To start, decide on a memorable name for your project and run the following command to (you guessed it!) startproject:

scrapy startproject wikiNobel

This will generate a group of files that we will need to run our spider. I’m particularly terrified of real spiders, so even spelling and reading the word doesn’t feel great (I’m a visual creature). The fact that I’m writing so much about spiders, despite how I feel about the real ones, speaks volumes to the utility the little software versions of them serve. They are amazing!

You should now have a group of folders that looks like this:

├── scrapy.cfg          --> deploy configuration of the Scrapy project.
└── wikiNobel           --> this is your Scrapy project module.
    ├── __init__.py     --> this initialises the module.
    ├── items.py        --> project item definition .py file.
    ├── middlewares.py  --> project middleware .py file.
    ├── pipelines.py    --> project pipeline .py file.
    ├── settings.py     --> project settings .py file.
    └── spiders         --> save your spiders here!

We are now going to create a new Python script within the /wikiNobel/spiders folder. (The exact file name doesn’t matter much; Scrapy finds spiders through their name attribute, as long as the script is saved inside spiders.) If you’ve never created a Python script, don’t fret: you won’t need to know the language fluently, and the intro here should suffice for writing your spider. You should check out Al Sweigart’s famous guide as an introduction to the language! It is a lot of fun, I promise.

I do recommend that you get some Python programming experience before you start flying solo (that is, writing your own spiders), as they can get quite complex, as you will see.

To create this Python script, we need a text editor. We can write this in any text editor; even your trusty TextEdit will do. But there are some other lovely code editors out there that highlight code and suggest entries (much like the suggestions on your smartphone when you are typing). I personally have a soft spot for Atom.

Once inside the editor, type:

import scrapy

class NobelSpider(scrapy.Spider):
    name = "wikiNobel"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["https://en.wikipedia.org/wiki/List_of_female_Nobel_laureates"]

Ok. Let’s get to grips with all this…

First, import scrapy tells Python to load the Scrapy library, and therefore gives us access to its functions and classes.

Second, with the class statement we are creating a subclass of the generic spider Scrapy already provides (scrapy.Spider), and we are naming it wikiNobel. The name is important, as we will use it to call this specific spider from the command line.

We are then setting the limits on the webpages the spider can access (in this case, only Wikipedia):

allowed_domains lists the domains the spider is allowed to scrape

start_urls lists the URL from where the spider starts its ‘crawl’; i.e. it is the first URL our spider will ‘read’.
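If the class vocabulary is new to you, here is a plain-Python analogy (the Spider class below is a stand-in I made up, not the real scrapy.Spider): subclassing lets us inherit all of Scrapy’s crawling machinery while overriding a few class attributes to configure our particular spider.

```python
class Spider:
    """Stand-in for scrapy.Spider: the generic machinery would live here."""
    name = None

class NobelSpider(Spider):
    # Overriding class attributes is all the configuration we need.
    name = "wikiNobel"
    allowed_domains = ["en.wikipedia.org"]

print(NobelSpider.name)  # wikiNobel
```

This is why our script is so short: everything we don’t write ourselves is inherited from the parent class.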

Let’s practice extracting the title of the page (for now). Indented one level inside the class, type:

def parse(self, response):
    data = {}
    data['title'] = response.xpath('//h1[@class="firstHeading"]/text()').extract()
    yield data

The data = {} defines an empty dictionary where Scrapy will save our extracted title. Everything inside the function needs to be indented one level below the def line! parse is the spider’s main function. As a note, DON’T change the name of this function.

Let’s dissect what the next line of code means — in very (very) blunt terms:

We have told Scrapy to extract(), from the entire HTML code saved in its response, the text() of the <h1> heading which contains the attribute class="firstHeading", all of this using xpath notation.
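You can get a feel for this kind of selection without Scrapy at all. Python’s standard library module ElementTree understands a small subset of XPath, enough to mirror our selector on a toy snippet (the HTML below is made up for illustration, not the real page):

```python
import xml.etree.ElementTree as ET

# A made-up fragment in the spirit of the real page.
html = """
<html>
  <body>
    <h1 class="firstHeading">List of female Nobel laureates</h1>
    <h1 class="other">Not this one</h1>
  </body>
</html>
"""

root = ET.fromstring(html)
# Mirrors //h1[@class="firstHeading"]/text(): only the matching <h1> survives.
titles = [h1.text for h1 in root.findall(".//h1[@class='firstHeading']")]
print(titles)  # ['List of female Nobel laureates']
```

The attribute predicate [@class='firstHeading'] is what lets us ignore every other heading on the page.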

NOTE: Make sure you save your Python script under the /wikiNobel/spiders/ directory; failing to do so means Scrapy will not be able to find it!

The full little spider should look like this once it’s saved:

import scrapy

class NobelSpider(scrapy.Spider):
    name = "wikiNobel"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["https://en.wikipedia.org/wiki/List_of_female_Nobel_laureates"]

    def parse(self, response):
        data = {}
        data['title'] = response.xpath('//h1[@class="firstHeading"]/text()').extract()
        yield data

Let’s run the spider! Head over to your console (make sure you’re still inside the wikiNobel project directory) and type:

scrapy crawl wikiNobel

Now you’ll see a bunch of printed text in the console (scary stuff, right??)

Among the information Scrapy prints, you should see something like this if everything has gone smoothly (the date will obviously be different):

2020-11-19 11:18:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://en.wikipedia.org/wiki/List_of_female_Nobel_laureates>
{'title': ['List of female Nobel laureates - Wikipedia']}

Bingo! You have built your first spider that extracted the title of the Wikipedia page for female Nobel laureates! Not very useful, yet, I admit. We’ll get to it.

Now… wouldn’t it be great to build a spider that could extract the names of all the laureates, and then head over to their wiki-entry to extract their birthdays, and the category of their Nobel prize?

Such a spider would need to:

  1. Get a wiki-entry for each of the laureates;
  2. Set forth on a crawling journey that visits each entry and extracts both the date of birth and the name of the award.

Edit your def parse(self, response): section, maintaining all the indentation, to match the new lines of code shown below.

NOTE: Don’t mix tabs and spaces for indentation, or Python will not know how to read the code! When you copy/paste, text editors can sometimes convert the indents into spaces and you will never know what’s going on: the code looks exactly the same. Best practice is to type the code out yourself. Trust me, I’ve been there, and it wasn’t fun.

As you can see xpaths can become quite complex!!

def parse(self, response):
    for href in response.xpath('//span[@class="vcard"]//a/@href').extract():
        url = response.urljoin(href)
        req = scrapy.Request(url, callback=self.parse_bday)
        yield req

Let’s take a look at this first parse function, which is quite a step up from the last one. Here, response.xpath('//span[@class="vcard"]//a/@href').extract() looks for all the <span> elements in the HTML that have vcard as a class.

The xpath then points at the <a> element (within that <span> section) that contains an href. hrefs in HTML point to other URLs. There are many hrefs in our HTML, but in this case we are looking for the links to the wiki-entries of the laureates. These happen to appear under the vcard class.

Because the href is given in abbreviated, relative form (that is, as /wiki/Bertha_von_Suttner rather than as https://en.wikipedia.org/wiki/Bertha_von_Suttner), we have to ask Scrapy to go through the list of hrefs it has extracted and urljoin them. This convenient join function gives us the full, or absolute, address.
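Scrapy’s response.urljoin does the same job as the standard library’s urllib.parse.urljoin, so we can see the effect offline:

```python
from urllib.parse import urljoin

# The page our spider starts on...
base = "https://en.wikipedia.org/wiki/List_of_female_Nobel_laureates"
# ...turns a relative href into a full, absolute address.
full = urljoin(base, "/wiki/Bertha_von_Suttner")
print(full)  # https://en.wikipedia.org/wiki/Bertha_von_Suttner
```

Without this step, a Request for the bare /wiki/... path would have nowhere to go.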

For each absolute URL, we then create a Request and hand it over to the spider’s second task, parse_bday, which we define below.

Now we need to ask our spider to loop over each one of those hrefs it found, extract the name of the laureate, her birthdate, and the category of Nobel she received.

Again, pay attention to indentation here!

Here’s our code:

def parse_bday(self, response):
    for sel in response.css('html').extract():
        data = {}
        data['title'] = response.xpath('//h1/text()').extract()
        data['bday'] = response.xpath('//span[@class="bday"]/text()').extract()
        data['award'] = response.xpath('//a[contains(text(), "Nobel")]/@title').extract_first()
        yield data

Yikes! Ok let’s break it down…

Our new function is parse_bday but we will give the spider instructions to extract the other stuff as well…

For each of the links the spider has extracted from the main laureates list, it should:

  1. Create an empty dictionary called data (think of this as our ‘table’)
  2. Extract the title (the <h1> tag of the current entry) under the column ‘title’
  3. Extract the birthday, (the text of all span elements that are under class="bday"), under column ‘bday’.
  4. Extract the first instance (i.e. using the extract_first function instead of extract) of <a> elements that contain the text() “Nobel”. Save it under the ‘award’ column. (We extract only the first instance because each entry mentions the award several times. We extract it from the <a> element, rather than from the text as a whole, because it contains a link to the wiki-entry of the specific prize category. This should, in theory, ensure we won’t just be extracting the word Nobel on its own; unless there was a link to Alfred Nobel himself, which we would then need to check for.)
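The extract() versus extract_first() distinction has a close stdlib analogue: ElementTree’s findall returns every match, while find returns only the first. A toy illustration (the snippet is invented; in the real entries the prize really is linked several times):

```python
import xml.etree.ElementTree as ET

# Invented fragment: the award is linked more than once in one entry.
html = """
<div>
  <a title="Nobel Peace Prize">Nobel Peace Prize</a>
  <a title="Nobel Peace Prize">the same prize, linked again</a>
</div>
"""
root = ET.fromstring(html)

all_titles = [a.get("title") for a in root.findall(".//a")]  # like extract()
first_title = root.find(".//a").get("title")                 # like extract_first()
print(all_titles)   # ['Nobel Peace Prize', 'Nobel Peace Prize']
print(first_title)  # Nobel Peace Prize
```

Taking only the first match keeps one award per row instead of a list of duplicates.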

Now we go back to the console and run the spider again. This time though we are going to save the output as a .csv file.

scrapy crawl wikiNobel -o nobels.csv
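The -o flag writes each yielded data dictionary as a row of the CSV. Once the file exists, you can inspect it with Python’s standard csv module; here is a sketch over a hypothetical two-row sample (the rows are illustrative, not the spider’s actual output):

```python
import csv
import io

# A hypothetical sample of what nobels.csv might look like.
sample = io.StringIO(
    "title,bday,award\n"
    "Marie Curie,1867-11-07,Nobel Prize in Physics\n"
    "Bertha von Suttner,1843-06-09,Nobel Peace Prize\n"
)

# DictReader maps the header row onto each data row.
rows = list(csv.DictReader(sample))
print(rows[0]["title"], rows[0]["bday"])  # Marie Curie 1867-11-07
```

A quick read like this is an easy sanity check that the columns came out the way you intended.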

Xpath selectors can get quite complex; we’ve only barely scratched the surface. Generally we need to study the HTML code in detail to understand what we want our spider to extract, so a good understanding of how HTML is built is useful. For now, think of the way the tags are structured: <h1>, <h2>, etc., signal headings; hrefs are usually under the <a> tag; written text is usually under <p>; and so on. For more about HTML and how it’s built, check out this course on the basics.
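To see that tag structure in action, Python’s built-in html.parser can walk the tags itself. A minimal sketch that collects every href from the <a> tags in a made-up fragment:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it encounters."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

# Made-up fragment in the spirit of the laureates page.
collector = LinkCollector()
collector.feed(
    '<p>See <a href="/wiki/Marie_Curie">Marie Curie</a> and '
    '<a href="/wiki/Bertha_von_Suttner">Bertha von Suttner</a>.</p>'
)
print(collector.links)  # ['/wiki/Marie_Curie', '/wiki/Bertha_von_Suttner']
```

This is essentially what our xpath //a/@href asks Scrapy to do, just spelled out by hand.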

We can test xpaths by opening Chrome’s DevTools. We select the part of the web page we want to extract, right-click on it, and choose “Inspect” from the options. This opens the entire HTML (you’ll see it on the side). Once we have the entire HTML alongside the page we want to scrape, press ⌘F (Command+F on a Mac, Ctrl+F on Windows). This opens a search tool where we can type our xpaths; Chrome will highlight what it finds in the HTML document (see image below).

Search tool in DevTools using Chrome

In this example you can tell Chrome has found 58 instances with the xpath we have given it: //span[@class="vcard"]//a/text(). You can press the down-arrow to move through all the instances it has found. In the example provided, Chrome is highlighting the 2nd of all the names it has found, in this case good old Bertha von Suttner (laureate 2 of 58). This should mean there are 57 laureates in total: Marie Curie won the Nobel twice, and is the only person to have won it in two different sciences! (Talk about being an overachiever!! What a legend!!) It is good practice to check the number of instances your xpath returns, just to make sure it is getting what you need, and not more (you might need to use extract_first, for instance).

Don’t despair if you find yourself having difficulties with Xpath! The learning curve is steep.

Scrapy includes its own ‘shell’, that is, its own bespoke terminal where you can directly input Scrapy commands. The shell helps us examine the response as we code our selectors; in other words, we can type in response.xpath('//h1/text()').extract() and it will return what it finds (if it finds anything at all). That way we make sure Scrapy is selecting the information we want before we “set it in stone” over in our text editor.

For more comprehensive intros to xpath check out this article, this article and this very concise but brilliantly written article.

You can find the spider I used for extracting the information on HOV here.

The paths are complicated as you can see, and require a lot of trial and error as well!

In our next tutorial, we will be getting into the analysis part of the project… I will assume you have your data with you, and it will be a question of excavation!

If you are very new to coding, you will realise that there is A LOT of information out there to help you deal with errors, troubleshooting and general ‘newbie-ness’; consider yourself in good company. The programming and Data Science communities are extremely generous with their time, and you can find answers to most of your questions on Stack Overflow. If you’ve never tried to find answers there, you are in for a ride! It’s truly amazing.

See you soon when we start with an entirely different beast of a language, my fav (at least when it comes to analysis) R!

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem

Parvathi A. Subbiah

Written by

PhD Politics and Sociology | University of Cambridge | Gates Scholar | Social_Data_Science
