Mapping controversies with digital methods: scrapers, crawlers, APIs

Tutorial 02 in a series on controversy mapping

Ethnographic Machines
9 min read · Feb 12, 2019

In this tutorial, you will learn how to harvest information from a large volume of Wikipedia pages for use in your controversy mapping. We will cover three basic techniques, namely:

  • Scraping
  • Crawling
  • Interacting with Application Programming Interfaces (APIs)

A scraper is a tool that automatically grabs specific information from a web page. Scrapers are sometimes known as web scrapers, because they work off the HTML structure of web pages, or as screen scrapers, because they exploit the fact that the HTML contains instructions about how information should be displayed on your screen. They are extremely useful to the controversy mapper because of their versatility and their ability to gather data systematically in a way that is adaptable to the purpose of the mapping.

A crawler is a tool that automatically navigates a set of web pages according to a set of predefined rules. Used in combination with a scraper, crawling allows us to explore the web within certain parameters while looking for data to harvest.

Finally, an API, or Application Programming Interface, is a tool that most media platforms offer so that other programs (such as an app) can interact with them. An API has a syntax and a set of ‘calls’ or ‘requests’ that you will have to learn in order to know how to ask it for things. APIs are not as versatile as scrapers, but they are often significantly faster and provide data that is less messy.

Question: Can you explain the difference in principle between a scraper, a crawler and an API?

In the introductory example below we will use an online tool called ‘Seealsology’ to get a sense of scraping, crawling and API interaction in practice. Seealsology is developed by Density Design and the SciencesPo médialab as a quick and dirty way to explore the semantic environment of a set of Wikipedia pages. It is based on the so-called ‘See also’ section, which Wikipedia editors can choose to include at the bottom of a page as a way to suggest other relevant Wikipedia pages to readers. Below is a screenshot of the ‘See also’ section for the page on ‘Circumcision controversies’.

‘See also’ section for the page on ‘Circumcision controversies’ (February 2019)

As you open the Seealsology interface you will be able to paste the URL of the Wikipedia page (or set of pages) that you wish to scrape into the box on the left. We will call this page the seed because it is the location from where the exploration begins (if there are several start pages we can talk about a list of seeds). We will use the URL for the page on ‘Circumcision controversies’ (namely: https://en.wikipedia.org/wiki/Circumcision_controversies) as our seed. Before we click ‘START CRAWLING’, however, we have two more parameters to set: ‘Distance’ and ‘Parent links’. For now, let us untick ‘Parent links’ and set ‘Distance’ to 1. Click ‘START CRAWLING’ and observe what happens.

Seealsology interface with ‘Distance’ set to 1, ‘Parent links’ unticked, and the URL for the ‘Circumcision controversies’ page as the seed.
Result after scraping the ‘Circumcision controversies’ page for ‘See also’ links at crawl distance 1.

You will observe that ‘Seealsology’ has now scraped the list of ‘See also’ links from the seed page and represented them as a network where the seed node is colored red. This is possible because of the way the HTML of a Wikipedia page is structured. You can explore this yourself in most browsers by using the inspect tool. In Chrome, for instance, right-click the ‘See also’ section and select ‘Inspect’ to open a panel with the corresponding HTML code. As you can see, each link on the list is an ‘<a href=…>’ tag and sits inside a pair of ‘<li>’ and ‘</li>’ tags. The list itself follows a section heading whose ‘<span>’ carries the id ‘See_also’. Together, these tags allow a scraper to be programmed to find exactly the right links under the right heading on any Wikipedia page, making sure, for instance, not to pick up all the links in the body text or all the links in the footnotes instead.

Inspecting the HTML code (left) of a ‘See also’ section (right) in Google Chrome (to open the panel, right-click the ‘See also’ section and select ‘Inspect’).
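
To make this concrete, here is a minimal scraping sketch in Python using the requests and beautifulsoup4 packages. It is an illustration of the general technique rather than Seealsology’s actual code, and it assumes the markup described above (the exact tags and ids may vary as Wikipedia’s templates change).

import requests
from bs4 import BeautifulSoup

def get_see_also_links(url):
    """Scrape the 'See also' links from a single Wikipedia page."""
    html = requests.get(url).text
    soup = BeautifulSoup(html, "html.parser")
    # Locate the heading span with id 'See_also', as seen in the inspector.
    heading = soup.find("span", id="See_also")
    if heading is None:
        return []  # the page has no 'See also' section
    # The links sit in <li> tags inside the first list after the heading.
    link_list = heading.find_next("ul")
    return ["https://en.wikipedia.org" + a["href"]
            for a in link_list.find_all("a", href=True)
            if a["href"].startswith("/wiki/")]

print(get_see_also_links("https://en.wikipedia.org/wiki/Circumcision_controversies"))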

Question: Can you explain what a scraper does? And can you explain how Seealsology uses scraping? What is the ‘seed’ and how does the scraper find more pages from this ‘seed’?

The ‘Distance’ parameter that we set to 1 refers to the crawl distance of the scraper. We can increase it to 2 or 3 and run the crawl again. This will cause Seealsology to scrape the ‘See also’ links from the pages found through the ‘See also’ links from the seed (crawl distance 2), and in turn the ‘See also’ links from this new set of discovered pages as well (crawl distance 3). On the network visualization, the crawl distance at which a page is found is represented by the color of the node (called ‘level’ in the top left legend). While Seealsology works as a scraper on each individual page, we say that it is crawling when it automatically navigates from page to page according to a set of predefined rules (in this case the ‘Distance’ parameter).

Result after scraping the ‘Circumcision controversies’ page for ‘See also’ links at crawl distance 2.
Result after scraping the ‘Circumcision controversies’ page for ‘See also’ links at crawl distance 3.
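
Building on the get_see_also_links() sketch above, the ‘Distance’ rule can be implemented as a breadth-first traversal. Again, this is a minimal illustration of the technique, not Seealsology’s actual implementation.

def crawl(seed, max_distance):
    """Follow 'See also' links outward from the seed, up to max_distance."""
    edges = []           # (source, target) pairs for the network
    level = {seed: 0}    # the crawl distance at which each page was found
    frontier = [seed]
    while frontier:
        page = frontier.pop(0)
        if level[page] >= max_distance:
            continue     # the predefined rule: do not scrape beyond max_distance
        for target in get_see_also_links(page):
            edges.append((page, target))
            if target not in level:  # a newly discovered page
                level[target] = level[page] + 1
                frontier.append(target)
    return edges, level

edges, level = crawl("https://en.wikipedia.org/wiki/Circumcision_controversies", 2)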

Question: Can you explain what a crawler does? And can you explain how Seealsology uses crawling? What difference does it make if we change the crawl ‘distance’?

Used in combination, scraping and crawling afford an extremely powerful method for gathering large volumes of digital traces in a structured manner. In principle, we could redesign the scraper to gather any type of data from any kind of webpage, as long as the data we are looking for is organized under a recognizable set of HTML tags. Similarly, we could redesign the crawler to follow any kind of path to other web pages as long as this path is visible to the scraper in the HTML. However, note that the fact that this is possible does not necessarily mean that it is ethical or indeed legal in all scenarios (notably if we were to scrape data behind paywalls or other access restrictions).

What we begin to see from the exploration of ‘See also’ links connecting pages around ‘Circumcision controversies’ is a pattern. At one end of the network is a cluster of pages about the anatomy of the penis; at the other, a cluster of pages about reproductive rights and women’s issues. In the center is a cluster of pages on children’s rights and violence against men. We know that these clusters are formed by the fact that their articles can be found up to 3 ‘See also’ links away from the page on ‘Circumcision controversies’ and the fact that some of these articles refer more to each other than they do to the rest.

What we are not yet able to see are the articles that may refer directly to the ‘Circumcision controversies’ page without ever receiving a reference from the pages in its ‘See also’ network. These pages are difficult to reach through scraping since we do not find links to them in the HTML of the pages we are looking at. That does not make them any less relevant for our exploration of the topic!

Question: Can you explain why scraping links from a specific seed page does not necessarily permit us to find all the links to this seed from other pages?

Seealsology solves this problem by interacting with Wikipedia’s API. If we tick ‘Parent links’ and run the crawl again we will see a new set of articles show up on the network as yellow nodes (‘level -1’). These pages are obtained by making an API call to Wikipedia requesting a list of all pages linking to the ‘Circumcision controversies’ page. The Wikipedia API does not have a call that gets ‘See also’ links specifically, so Seealsology visits all the pages on the list returned by the API and uses the scraper to determine whether or not the link to ‘Circumcision controversies’ is, in fact, a ‘See also’ link. In this case, including these ‘Parent links’ with the assistance of the API allowed us to find an actor like Genital Autonomy America, whose mission is to protect the birthright of “children and babies to keep their sex organs intact”. Their page links to ‘Circumcision controversies’, although the same is not true the other way around.

Result after scraping the ‘Circumcision controversies’ page for ‘See also’ links at crawl distance 2 and with ‘Parent links’ ticked.
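
Wikipedia’s API exposes this as the ‘backlinks’ call. The sketch below shows how such a request could look in Python; the call itself is part of the real MediaWiki API, while its use here to mimic the ‘Parent links’ step is only an approximation of what Seealsology does.

import requests

API_URL = "https://en.wikipedia.org/w/api.php"

def get_backlinks(title):
    """Ask the Wikipedia API for all article pages linking to `title`."""
    params = {
        "action": "query",
        "list": "backlinks",
        "bltitle": title,
        "blnamespace": 0,   # restrict results to article pages
        "bllimit": "max",
        "format": "json",
    }
    backlinks = []
    while True:
        data = requests.get(API_URL, params=params).json()
        backlinks += [page["title"] for page in data["query"]["backlinks"]]
        if "continue" not in data:
            break                        # no further pages of results
        params.update(data["continue"])  # follow the API's pagination
    return backlinks

parents = get_backlinks("Circumcision controversies")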

Question: Can you explain what an API is? And can you explain how Seealsology interacts with Wikipedia’s API?

Crawling and scraping a category on Wikipedia

While we could use Seealsology as a tool to build a dataset of circumcision-related pages on Wikipedia (indeed, this is the intended use), we can also go straight to the editors of Wikipedia, who explicitly take on this task of categorizing how pages relate to larger topics or debates. They do so when they build so-called ‘Categories’ (see the one on circumcision here).

A category is a topical collection of Wikipedia articles. A category can both have sub-categories (with collections of articles on various sub-topics) and be part of more general categories (with collections of articles on more general topics). If we were to crawl and scrape this category we would, therefore, have to think about crawl depth (similar to the distance above). Inside the ‘Circumcision’ category, we find a number of subcategories: ‘Circumcision debate’, for example, or ‘Female genital mutilation’. And if we open these categories, more subcategories appear. We could thus say that pages found directly in the ‘Circumcision’ category, such as ‘Religious male circumcision’ or ‘Holy Prepuce’, are at depth 0, and we could decide to delimit our crawl by stating that it will go no further than depth 2 from the seed.
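
To get a sense of the shape of such a crawl, here is a minimal Python sketch using the API’s ‘categorymembers’ call (a real call in Wikipedia’s API; the depth logic is a simple illustration, not the full script linked below).

import requests

API_URL = "https://en.wikipedia.org/w/api.php"

def get_category_members(category):
    """List the pages and subcategories directly inside a category."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": "max",
        "format": "json",
    }
    members = []
    while True:
        data = requests.get(API_URL, params=params).json()
        members += data["query"]["categorymembers"]
        if "continue" not in data:
            break                        # no further pages of results
        params.update(data["continue"])  # follow the API's pagination
    return members

def crawl_category(seed_category, max_depth):
    """Collect article titles from a category tree, no deeper than max_depth."""
    pages = set()
    frontier = [(seed_category, 0)]  # pages directly in the seed are at depth 0
    while frontier:
        category, depth = frontier.pop(0)
        for member in get_category_members(category):
            if member["title"].startswith("Category:"):
                if depth < max_depth:   # descend into subcategories within the limit
                    frontier.append((member["title"], depth + 1))
            else:
                pages.add(member["title"])  # an article found at this depth
    return pages

pages = crawl_category("Category:Circumcision", 2)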

Question: Can you explain what crawl depth refers to in a structure of sub-categories inside categories?

There is no standardized solution for finding a good seed category or setting the right crawl depth. This requires careful, manual consideration from case to case. In the particular instance of circumcision, however, a depth of 2 from the seed category provides a very manageable output of 173 pages that are all related to the topic. We used a custom Python script, which also includes some API calls, to scrape the category. You can download it as a Jupyter Notebook here and try it for yourself.

Question: Can you find a good seed and depth limitation to crawl and scrape a category on Wikipedia?

Scraping relationships between pages in a category

Once we have our list of circumcision-related Wikipedia pages we can use Seealsology to map how they relate to each other. This could be with the objective of spotting a pattern in how pages are organized thematically, for instance with the Seealsology tool we just used above. In order to turn our list of page names from the Circumcision category into a list of URLs that can be pasted into Seealsology we need to add the following prefix to all the page names: ‘https://en.wikipedia.org/wiki/’. This can be done easily in a spreadsheet editor such as Google Sheets (see below), where a column is added and a formula combining the root URL with a page title is pasted into each cell. Explore the resulting spreadsheet here. The list of URLs can then be input directly into Seealsology and the crawl started at an appropriate depth.
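
The same conversion can also be scripted; for instance, a couple of lines of Python do the job (note that spaces in page titles become underscores in Wikipedia URLs):

# Page titles as returned by the category crawl.
titles = ["Religious male circumcision", "Holy Prepuce"]
urls = ["https://en.wikipedia.org/wiki/" + title.replace(" ", "_")
        for title in titles]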

Converting page titles (column A) to Wikipedia URLs (column C) in Google Sheets.
Result in Seealsology using all pages in the Circumcision category as seeds and the crawl depth set to 3 (no parent links).

You will notice that many of the seed pages (red nodes) remain unconnected to the network after the crawl has finished running. This is because the editors of these pages have not equipped them with a ‘See also’ section. Clearly, if we are interested in mapping how circumcision pages on Wikipedia relate to each other, restricting relationships to ‘See also’ links has considerable weaknesses. Even in those cases where a ‘See also’ section is, in fact, present, it can quickly be established that many relevant links to other circumcision pages are provided outside of this section (for instance as part of the text).

You have now learned how scrapers, crawlers, and APIs offer different opportunities for curating digital datasets. In the next tutorial, you will learn how to use a combination of Python scripts and visual network analysis to go beyond Seealsology and ‘See also’ links to map other relationships around circumcision on Wikipedia.
