RedditMiner: A Humanist’s Journey into Enemy Territory
In January, I was sitting in my first digital humanities seminar ever. I had very little idea of what digital humanities meant, but I was beyond excited. I like digital things — my iPhone controls my home lighting, and I can speak to the point of exhausting anyone’s patience about my favorite video game of all time and likely the best game ever produced: Deus Ex (of course). I also like humane things — I am a classicist by undergraduate training, and I have spent the last decade or so of my life trying to wrap my head around the intersections between language, literature, and philosophy. For as long as I can remember, I have been forced to understand these two interests as diametrically opposed and hopelessly separate. So when I saw something called “digital humanities” listed among my department’s course offerings? Done deal.
In that very first class, Prof. Warnick asked us all to come up with something we wanted to accomplish during the semester. I thought for a moment and finally decided that I wanted to try coding something. I had previous experience working in C++ and Java, so this wasn’t entirely out of left field. During my first semester of high school, a computer science course had been required. By the end of my sophomore year, I had exhausted all the programming courses my school offered and was coding for the robotics team.
But this was all a long time ago. Since then, because I am interested in liberal artsy things, I’ve been pigeonholed—a computer science course in college was out of the question. I never really wanted to stop working with computers and technology, though, so I saw this whole “digital humanities” thing as an opportunity to shake off some of the rust. I submit for your consideration, in partial fulfillment of the requirements for three credits of digital humanities coursework at Virginia Tech, this Medium post: the story of my final project in ENGL 5074 and of my wandering at the edge of the digital and the humane.
“I Use it to Look at Cats and Argue with Strangers”
I didn’t start my final project with the intention of coding anything, though that objective had stayed at the back of my mind throughout the semester. I started out simply with the intention of examining Reddit, the internet’s favorite repository of cat pictures and related bombastic, oftentimes fallacious ephemera.
If memes and pun threads aren’t your thing, I’ll provide a bit of a primer: Reddit is an online social media sharing community where anonymous users can post links or text and then vote and/or comment on the submissions of others. A sorting algorithm populates the front page with the newest, most community-approved (“up-voted”) posts. Smaller communities, called subreddits, allow posters to direct their content toward people with similar interests, and all of these subreddits have their own smaller version of the main front page.
Reddit is the self-proclaimed “front page of the internet,” and while that may have been a bit of an overly aggressive claim when the site first launched, it has since gained some cultural clout. Politicians and celebrities including Barack Obama, Ron Paul, Snoop Lion, and Arnold Schwarzenegger all maintain Reddit presences, and the news media seem to be increasingly turning to Reddit for tips.

I am interested in studying Reddit for several reasons. First, I have used the site myself for several years now and watched it grow and change. It always fascinates me as an online instantiation of real-world communal behavior, for better or for worse. Moreover, as a rhetorician, I find Reddit an interesting place to investigate. Due to the vote-based infrastructure, it is easy to see immediately if a given comment or post has “worked” or not: to what extent it has addressed a specific rhetorical situation. Finally, because Reddit posts are published live online, Reddit is ripe for the sort of digital analysis that copyright laws will not allow us to apply to recent works or that privacy settings make difficult to implement on other social networks. Posts and comment threads on the front page or in other public subreddits are already digital and freely accessible to the public — millions of texts just waiting to be analyzed by rhetoricians, linguists, anthropologists, sociologists, or anyone else so inclined.
There is, however, a small catch: if you want to run these texts through textmining programs and generate useful results, you need to systematically archive the text and metadata you wish to study in some permanent, meaningful way. Simply copying and pasting an entire thread into a text analysis tool will capture a lot of text you don’t need, and that noise could skew your findings. It also fails to capture metadata that only exists in the underlying HTML—things like absolute times and dates and complete vote counts never appear on the rendered page. The problem of transcription would ultimately lead me back to my goal for the semester. It would lead me back to code.
Trial and Error
When I started thinking about my Reddit project, I didn’t really have a set question in mind. I first wanted to see what my options were for getting data from Reddit comments assembled in such a way as to be easily readable both by me and by textmining software like CATMA. I began the “old-fashioned way”: simply copying and pasting data, one field at a time, from a Reddit page into an Excel spreadsheet. The process was incredibly time consuming. One comment thread, small by Reddit standards at around 50 comments, took me more than five hours to completely transcribe. While this would have been fine if I only wanted to look at a few threads, I figured that the really interesting stuff we could learn from Reddit would involve large-scale text analysis—looking at tens of thousands of comments to see which specific patterns of language and usage “fail” and which “succeed” over time and between communities.

While to human beings these sorts of mundane transcription tasks are incredibly time consuming and tedious, this is exactly the sort of work that computers were made for. Computers love iteration. In fact, that’s exactly how they work: they run through simple programmed commands again and again until told otherwise, doing everything from basic arithmetic to drawing this very Medium post on your display as you read it. In the time it took me to parse 50 Reddit comments into their relevant content and metadata fields, a computer could work through thousands, sorting and storing them for future analysis. I started to shift my project’s focus from analyzing Reddit to making a tool that would help me and others analyze Reddit. I began to think about coding again. I checked a Java textbook out of the library.
As I mentioned above, I had already been using the HTML code for individual Reddit pages to find information like time and date stamps while manually transcribing data, so when drafting my initial basic transcription program, I first turned my attention to these files. I found which HTML tags indicated the start of a given piece of metadata through a painstaking examination of samples of code and wrote a test program that scanned through locally saved HTML Reddit pages looking for and returning that data.
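Roughly, that first pass looked something like the sketch below. The file name and the datetime marker are stand-ins for whichever tags a given page actually used, not the program’s real values:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class HtmlTagScan {
    public static void main(String[] args) throws IOException {
        // Read a locally saved copy of a Reddit page into a single string
        String html = new String(Files.readAllBytes(Paths.get("saved_thread.html")));

        // "datetime=\"" is an illustrative marker; the real tags I keyed on
        // varied with how each page happened to render its timestamps
        String marker = "datetime=\"";
        int start = html.indexOf(marker);
        while (start != -1) {
            int valueStart = start + marker.length();
            int valueEnd = html.indexOf('"', valueStart);
            if (valueEnd == -1) break; // malformed tag; stop scanning
            System.out.println(html.substring(valueStart, valueEnd));
            start = html.indexOf(marker, valueEnd);
        }
    }
}
```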

There were several downsides to this initial approach. First, it required manually saving each page’s HTML as a text file for the program to work through, an extra step that could grow cumbersome at scale. Additionally, because subreddits are free, to some extent, to alter their physical appearance, tags could look different depending on how a given page renders posts and comments, necessitating different versions of the program for different subreddits.
Around this time, I stumbled upon Reddit’s API, or “application programming interface,” a feature common on many sites that allows third-party application developers to access certain data. Reddit’s API allows programmers to access posts and comment threads in a variety of formats, including RSS feeds, XML files, and JSON strings. I reviewed how to open an HTTP connection in Java, which let my program access Reddit’s data directly via the API and eliminated the need to save data files locally for the program to work with. In addition, using the API to access site data rather than working with HTML files allowed the same mining program to work effectively across all subreddits, since the underlying data structure stays the same even as the physical layout and appearance change from subreddit to subreddit.
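In rough outline, the connection step looks something like the sketch below. The ID36 in the URL is a placeholder rather than a real post, and appending .json to a thread’s address is how the API hands back data instead of a rendered page:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class RedditApiFetch {
    public static void main(String[] args) throws Exception {
        // "abc123" is a placeholder for a real post's ID36 code
        URL url = new URL("https://www.reddit.com/comments/abc123.json");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Reddit's API guidelines ask clients to send a descriptive user agent
        conn.setRequestProperty("User-Agent", "RedditMiner prototype");

        // Read the response line by line into one long JSON string
        StringBuilder json = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                json.append(line);
            }
        }
        System.out.println("Received " + json.length() + " characters of JSON");
    }
}
```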

I decided to work with JSON strings from Reddit’s API because they contained richer metadata than the other available options. It was also easier for me to work with JSON objects and arrays: although JSON (JavaScript Object Notation) is rooted in JavaScript rather than Java, its object-and-array structures felt very familiar to me as a Java coder. The new, more robust version of my program successfully pulled the JSON information directly from Reddit and parsed the strings using the freely accessible JSON.org libraries.
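Here is a minimal sketch of that parsing step using the JSON.org classes. It assumes the string fetched in the previous step, and field names like “title” and “ups” are simply the ones I found in the strings Reddit returned:

```java
import org.json.JSONArray;
import org.json.JSONObject;

public class JsonParseSketch {
    // jsonString is the raw text pulled down from the API in the previous step
    public static void printPostSummary(String jsonString) {
        // The API hands back an array whose first element describes the post
        // and whose second element holds the comment thread
        JSONArray listing = new JSONArray(jsonString);

        JSONObject post = listing.getJSONObject(0)
                                 .getJSONObject("data")
                                 .getJSONArray("children")
                                 .getJSONObject(0)
                                 .getJSONObject("data");

        // "title", "author", and "ups" are field names as they appear in the
        // JSON Reddit returns; plenty of other metadata sits alongside them
        System.out.println(post.getString("title") + " by " + post.getString("author")
                + " (" + post.getInt("ups") + " up-votes)");
    }
}
```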
The Rule of Modularity
At a very basic level, this version of my program worked to access the appropriate data and could probably have been made to ultimately output it in the desired format, but it wasn’t what one might call elegant. It was one gigantic program, more than 200 lines long, that brute-forced its way through the JSON strings, adding little in the way of additional functionality and user-friendliness. It also wasn’t very efficient. While this didn’t make a big difference to me for pulling a thread or two down at a time, it could add up when trying to mine thousands upon thousands of comments down the road. So I set about working to take advantage of Java’s object-oriented environment.
Programming environments like Java work well because they easily allow for the creation of extremely modular code. “Methods” are the basic building block of Java programs and carry out the nuts-and-bolts tasks a program needs to complete to perform a given function. A “class” is a set of methods that work together to bring about a desired data transformation. Methods in a class can call other methods in that same class to take advantage of their features, and can also call methods from other classes by importing or extending those classes or by constructing “objects,” instances of a different class. This modularity makes it very easy to add functionality to a Java program by reusing what other classes already do. It also allows individual classes to be kept very short and efficient. Because my program was just one big class, it was not taking advantage of the flexibility and fluidity that the object-oriented Java environment could offer.
But what should count as an “object” for the sake of my Reddit-mining program? I turned my attention to the JSON strings from Reddit to see what Reddit counted as an object and defined my classes the same way. The quick and dirty version: in JSON form, a Reddit post arrives as an array of two objects, one for the post itself and one for the accompanying user comment thread. The post is a single object and an individual comment is an object, but a comment thread is an array that contains an object for each individual comment and its replies. I defined my program’s class structure around this data structure.
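To make that structure concrete, here is a short sketch that walks the comment half of the array. The handling of “replies” reflects what I found in Reddit’s JSON, where a comment with responses carries another listing and a comment without them carries an empty string:

```java
import org.json.JSONArray;
import org.json.JSONObject;

public class ThreadShapeSketch {
    public static void listTopLevelComments(String jsonString) {
        JSONArray listing = new JSONArray(jsonString);

        // Element 1 of the array is the comment thread; its "children" hold
        // one object per top-level comment
        JSONArray comments = listing.getJSONObject(1)
                                    .getJSONObject("data")
                                    .getJSONArray("children");

        for (int i = 0; i < comments.length(); i++) {
            JSONObject comment = comments.getJSONObject(i).getJSONObject("data");
            // A comment with replies carries another listing under "replies";
            // a comment without replies carries an empty string instead
            boolean hasReplies = comment.opt("replies") instanceof JSONObject;
            System.out.println(comment.optString("author") + ", replies: " + hasReplies);
        }
    }
}
```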

In my current program’s class structure, RedditMiner provides the user interface (such as it is) and defines the ultimate output format of the mined data and how the program recurs. RedditMiner instantiates a new RedditUtil object for each post to be mined, which the user indicates by the unique ID36 code in its URL. RedditUtil opens an HTTP connection to Reddit’s API and accesses the JSON data for the selected post, which it then parses into post and comment data. RedditUtil then instantiates RedditPost and RedditCommentThread objects defined by the relevant JSON data. A comment thread is parsed as a set of individual RedditComment objects. In my favorite recursive move, RedditComment calls RedditCommentThread again to parse a comment’s responses if it finds that it has any, just as Reddit treats a comment with replies as its own thread. Parsing comments with replies proved to be one of my biggest challenges, but it was solved quite simply and elegantly with reference to the Unix philosophy.
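Below is a stripped-down sketch of that recursive relationship. The constructors are illustrative rather than the program’s actual code, and most fields and all of the output logic are left out, but the mutual constructor calls are the shape of the move:

```java
import java.util.ArrayList;
import java.util.List;
import org.json.JSONArray;
import org.json.JSONObject;

// A thread is simply a list of comments built from a "children" array
class RedditCommentThread {
    final List<RedditComment> comments = new ArrayList<RedditComment>();

    RedditCommentThread(JSONArray children) {
        for (int i = 0; i < children.length(); i++) {
            comments.add(new RedditComment(children.getJSONObject(i).getJSONObject("data")));
        }
    }
}

// A comment holds its own data and, if it has responses, its own thread
class RedditComment {
    final String author;
    final String body;
    RedditCommentThread replies; // stays null for a comment with no responses

    RedditComment(JSONObject data) {
        author = data.optString("author");
        body = data.optString("body");
        // If Reddit reports replies, they arrive as another listing, so the
        // comment simply constructs another RedditCommentThread: the recursion
        Object r = data.opt("replies");
        if (r instanceof JSONObject) {
            replies = new RedditCommentThread(
                    ((JSONObject) r).getJSONObject("data").getJSONArray("children"));
        }
    }
}
```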
Final Steps
While I am excited to be where I am, there is still work to be done. I have started to study CSV data structures in order to add methods to my classes that will write the parsed data to spreadsheet files that can then be fed into textmining software. This has been tricky, since characters like quotation marks, commas, and carriage returns in the text of posts and comments must be escaped so that spreadsheet software does not misread them. I am working through this and don’t expect it to be a major challenge. In addition, I hope to add functionality that will allow users to select which metadata fields they want mined, eliminating unnecessary columns, and a crawling feature that will capture entire pages of posts at once.
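For the curious, the core of the CSV problem is a small escaping rule: wrap any field that contains a comma, quotation mark, or line break in quotes, and double any quotation marks inside it. Something like the helper below, which is a sketch of what I have in mind rather than code already in the program:

```java
public class CsvEscape {
    // Wraps a field in quotes when it contains characters that would
    // otherwise break the row, doubling any embedded quotation marks
    static String escape(String field) {
        if (field == null) {
            return "";
        }
        boolean needsQuoting = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuoting) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        // A comment body with a comma, a quote, and a line break survives intact
        String body = "I said \"hello\",\nand nobody up-voted me";
        System.out.println(escape("author_name") + "," + escape(body));
    }
}
```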
My goal was to have all of these features implemented by this time, but that goal now appears to have been a bit too lofty. In the last couple of months, I’ve re-taught myself Java, scrapped several versions of my program to start over from scratch, and learned about some programming languages and data structures I’d never even heard of previously. I’ve spent at least 100 hours on the project and, at this point, I need to take a break before I start defenestrating things. I also had to write my project up at some point for a grade in my aforementioned digital humanities class, and now seemed as good a place as any to stop, reflect, and take stock. Rest assured, I’ll be back to finish it up soon.

It would not surprise me were someone to respond with “Oh, here’s the GitHub page for the code that already exists that does everything you just talked about doing plus it makes a really mean vichyssoise,” but regardless, this project has helped me to get back some of my programming chops and to understand Reddit and its underlying structures in a new way. Admittedly, I do feel displaced at times in seminar: the traitorous positivist in the room poring over Java textbooks and JSON files against a backdrop of comments like “we shouldn’t expect humanities students to know how to code.” But I see a bright, humane future for RedditMiner and for this sort of work in general, so long as we keep grounded in the self-aware theoretical and critical frameworks that provide the foundation for any work in the humanities. Even coding.
I am indebted to the friendly folks at StackOverflow and r/JavaHelp for helping me out of some initial jams, as well as to the Newman Library reference desk librarians for helping me find some really killer Java textbooks. I hope once I finish this project to make some version of my code available for general use. If you have any suggestions for additional functionality that I could add, or if you might be interested in using my software to mine Reddit posts for your own analysis, send me a tweet. I’m always happy to help out a fellow humanist, digital or otherwise.
-A