Data Mining Script for Text Analysis: Fanfiction from Archive of Our Own

Vee Kalkunte
4 min read · Aug 17, 2020


This article details a Python script that scrapes the fiction text from any subsection of Archive of Our Own, a site for fanfiction and other fan works. To access the scraper code and an example dataset (the top 200 Coffee Shop AU fanfictions), here’s the GitHub link.

What is this?

Archive of Our Own (AO3) is “a fan-created, fan-run, nonprofit, noncommercial archive for transformative fanworks, like fanfiction, fanart, fan videos, and podfic,” run by the Organization for Transformative Works. It hosts more than 38,730 fandoms, 2,739,000 users, and 6,396,000 works.

It’s a major hub for writing and reading fanfiction, among other things, and boasts a well-organized tagging system that lets users search for exactly the kind of fanfiction they want to read. Searching for fanfiction by some attribute brings up a search results page like the following:

On this page, you can navigate across the pages of results (the orange box), refine the search with the sort and filter options (the blue box), and see works matching the current filters, like the one highlighted in purple.

A work on Archive of Our Own looks like the following:

Here, the yellow box marks the text within the work: the fiction itself.

The scraper works as follows: given the URL of the first page of the search results (assuming you’ve already narrowed down what you want), the number of results pages you want to scrape, and the name of the output file you want to create, it extracts the fiction from every work within those bounds and writes it to a single text file.
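As a sketch of one piece of that flow: AO3 search results paginate with a `page=N` query parameter, so the list of results-page URLs can be built from the first page’s URL. The helper name below is mine, not necessarily the one used in the script.

```python
def build_page_urls(first_page_url, number_of_pages):
    """Build the URLs of the results pages to scrape.

    AO3 search results paginate with a ?page=N query parameter; the first
    page is the URL you copied, and later pages append page=2, page=3, ...
    """
    sep = "&" if "?" in first_page_url else "?"
    return [
        first_page_url if n == 1 else f"{first_page_url}{sep}page={n}"
        for n in range(1, number_of_pages + 1)
    ]
```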

This text file can be used for many different things, from training your own GPT-2 model to generate similar text, to general text and sentiment analysis. An example output is included at the GitHub link: the top 200 completed Coffee Shop AU fanfictions on AO3.

How can you use this?

Once you’ve downloaded and opened the .py file in your environment:

First: Narrow Down Search Results

On Archive of Our Own, use the sort and filter section to narrow down the tags until the search results page shows what you’re looking for. Make sure you’re on the first page of results, then copy the resulting URL.

Enter this URL in the page variable at the top of the script.

Second: Specify Parameters

Keeping in mind that each results page contains 20 works, and that the scraper only handles written works (no art, no podfics, no custom-coded pages), set the number of results pages you want to scrape in the NumberOfPages variable.

Specify the output file name in the nameOfFileCreated variable.
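Put together, the top of the script might look something like this. The values below are illustrative (the URL is a made-up example search, not the article’s dataset):

```python
# Illustrative parameter values -- adjust to your own search.
# The URL is an example AO3 search URL, not the article's actual one.
page = ("https://archiveofourown.org/works"
        "?work_search%5Bquery%5D=coffee+shop+au")

NumberOfPages = 10               # 20 works per page, so 10 pages = top 200 works
nameOfFileCreated = "coffee_shop_au.txt"
```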

Third: Run the Script, and Wait

The scraper takes a while: about an hour for 200 fanfictions. When it finishes, it produces a text file with the name you specified.

Have fun with your fanfiction splice!

How does it work?

The scraping process is built on the BeautifulSoup package: the HTML of a page is stored in an object, and tags with specific attributes can be located within it using the find() and find_all() functions. This is a good tutorial for learning how it works. The urllib package is used to safely request URLs.
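To illustrate find() and find_all() on a toy snippet: AO3 wraps a work’s text in a container div (the id and class names below reflect AO3’s markup as I understand it, so treat them as an assumption), and the paragraphs inside it can be collected like so:

```python
from bs4 import BeautifulSoup

# A tiny stand-in for an AO3 work page. The "workskin"/"userstuff"
# names are an assumption about AO3's markup, for illustration only.
html = """
<div id="workskin">
  <div class="userstuff">
    <p>First paragraph of the fic.</p>
    <p>Second paragraph.</p>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
work = soup.find("div", id="workskin")        # find(): first matching tag
paragraphs = work.find_all("p")               # find_all(): every matching tag
text = "\n".join(p.get_text() for p in paragraphs)
```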

A note about a potential error you may encounter.

If you request too many URLs from the server in a short period of time, you may get a timeout error. The fix is simply to increase the number in the time.sleep() line of code from 5 to something longer, which slows down the rate at which URLs are requested.
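One way to handle this automatically, rather than editing the sleep value by hand, is to double the delay after each failed request. This is a sketch of that idea, not the script’s actual error handling:

```python
import time
import urllib.error
import urllib.request

BASE_DELAY = 5  # seconds to wait before each request; raise this if you see timeouts

def backoff_delays(base=BASE_DELAY, retries=3):
    """Delay before each attempt, doubling after every failure: 5, 10, 20, ..."""
    return [base * (2 ** i) for i in range(retries)]

def fetch(url, retries=3):
    """Fetch a URL politely, slowing down further after each failure."""
    last_error = None
    for delay in backoff_delays(retries=retries):
        time.sleep(delay)  # be gentle with the server
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except urllib.error.URLError as err:
            last_error = err
    raise RuntimeError(f"gave up on {url}: {last_error}")
```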

Wanna try it out for yourself? Here’s the link to the GitHub page, with the practice data.


Vee Kalkunte

Austin College ’22. Part time data nerd and terminally online gamer (They/Them)