Computational Thinking with Reddit at the Wolfram Summer School

Around this time last year, I was trying to figure out what to do with my summer. It had been a rather hellacious semester in graduate school, but I was finally done with core coursework in my doctoral program in sociology at Louisiana State University. At the end of the semester, I decided to drive from Baton Rouge to Knoxville, where the heat and humidity don’t cause your skin to melt. I was debating taking the summer “off,” though in graduate school that typically means collecting data and working on papers. Because I had either been working and attending school or taking independent reading courses every summer since 2009, I figured I deserved a break.

However, I was too preoccupied with text analysis and my dissertation project to relax. One of the last seminars I took was on social network analysis. The professor leading the seminar had come up with a novel way to analyze interview transcripts as part of her dissertation work (PDF link) at Duke University: she proposed network text analysis as a method for handling large amounts of interview data.

If you’re not familiar with these types of data, sociologists often rely on transcripts from in-depth interviews to better understand the social world. For example, my MA work involved interviews with 20 subjects. (You can read about it here on VICE.com, and read here why I chose to publish the work in a nonacademic outlet.)

In some cases, sociologists use interview data from hundreds of subjects. I was planning on using a combination of survey and interview data to investigate the use of pre-exposure prophylaxis (PrEP) and the drug Truvada among HIV-positive individuals and those at risk of contracting HIV.

During my MA work, my mentors taught me how to use a method called grounded theory, along with handwritten notes and a software program called Atlas.ti, to organize and analyze the interview data.

Example of a transcript. Imagine this times one thousand.

The process was…daunting, unorganized, and not reproducible. It was also incredibly old-fashioned. So I was trying to think of better and more efficient ways to analyze interview data. I came across a few software packages, including Provalis QDA Miner, that seemed to do a decent enough job at text mining, but it still wasn’t good enough. I kept searching online, and I happened upon Wolfram Research’s Mathematica.

I was somewhat familiar with complexity research, and I had been thinking of ways to tie it into sociological theory (especially how language and interaction are generative in creating meaning), so Stephen Wolfram’s work was not entirely unfamiliar to me. While searching the Wolfram website for tutorials on text analysis in Mathematica, I came across a link to the Wolfram Summer School. I was intrigued, so I asked a few colleagues if they knew anything about it, and one of them said I should definitely apply. So I did, thinking there was no way on earth a sociology graduate student would be admitted to a program that seemed geared toward programmers, physicists, computer scientists, and other STEM fields.

What was interesting was that when I applied, there was a coding challenge to complete before my application would be considered. It was something like, “Write a function so that every other integer in a list of integers is removed.” Having little programming experience aside from some basics in statistical software, I found it slightly intimidating. However, I went to the Wolfram Documentation Center and was able to figure it out after a few tries. The next thing I knew, I received an email from the academic director to schedule a brief interview. After talking with him about my interests, research, and goals, I received an email notifying me that I had been accepted. I was quite ecstatic, as I had been looking at some of the fascinating projects past students had worked on, along with the impressive resumes of current and past instructors. Oh, and getting to meet Stephen Wolfram was pretty exciting.
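For the curious, here’s one way a challenge like that can be solved in the Wolfram Language. This is just a sketch from memory; the function name is mine, and the exact challenge wording may have differed:

    (* keep the 1st, 3rd, 5th, ... elements, removing every other integer *)
    dropEveryOther[list_List] := Take[list, {1, -1, 2}]

    dropEveryOther[{1, 2, 3, 4, 5, 6}]   (* returns {1, 3, 5} *)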

So, it was suggested that I read, or at least be very familiar with, A New Kind of Science (NKS) and practice using the Wolfram Language before arriving in Waltham, Massachusetts, for the Summer School. If you’re not aware, NKS is about 1,200 pages long. I had a friend drive me to the Knoxville library so I could use his library card to tackle that tome in a few weeks. The full text is online, but I prefer actual books, for whatever reason. Anyway, I got through as much of it as possible and was awestruck by the implications of simple programs like cellular automata in nature and elsewhere.

To that end, once I was at the Summer School, held at Bentley University, our first assignment was to search the computational universe for an interesting 2-D, three-color, totalistic cellular automaton (CA). Below are some images of the CA I found; what was interesting was the irregular edges that appear through their evolution.

Code 144305877 of a 2-D Three-Color Totalistic Cellular Automaton at step 500

They all have similar characteristics but with different borders and growth patterns.
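If you want to generate images like these yourself, here’s a minimal sketch. I’m assuming the standard nine-neighbor totalistic setup ({3, 1} means three colors with equal neighbor weights) and a single nonzero starting cell:

    (* evolve 2-D, three-color totalistic code 144305877 from a single cell *)
    steps = 150;  (* the image above is at step 500; fewer steps run faster *)
    evolution = CellularAutomaton[{144305877, {3, 1}, {1, 1}}, {{{1}}, 0}, steps];
    ArrayPlot[Last[evolution]]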

And if we start to explore some of these rule spaces, i.e. searching the computational universe for interesting CA, it is kind of relaxing; it takes you to a different visual space of alien landscapes and possibilities. Some of the patterns look like they could be used for design purposes, or already have been, in applications such as textiles. Visually, CA can be stunning. Academically, they have been shown to produce patterns seen in biology. Practically, they can be used to generate random numbers. More broadly, equations have been used since the time of Newton to describe physical phenomena, but growing evidence suggests that CA and other types of simple programs might better model some aspects of reality; agent-based modeling is a fine example. But as Thomas Kuhn argued in The Structure of Scientific Revolutions, this type of progress often goes unrecognized in real time; rather, it is appreciated as a historical process.

The first week of the Summer School mainly consisted of lectures and Stephen Wolfram conducting a live experiment. During the second week, we each met during lunch with Stephen Wolfram so he could choose the project we would be working on with our mentors. He has called this his annual “extreme professoring” moment.

The lunch was really cool, because there were about half a dozen other students there. Stephen asked us about our interests and research. It was pretty fascinating, because he had so many questions spanning such a wide range of disciplinary fields. There were theoretical physicists, someone who studied algorithmic finance, and another person who studied the small-world networks found in C. elegans neurons.

After lunch, we then met with Stephen and our mentors individually. When I walked into the room, they already had a project in mind, and it was super cool: I was to use a new feature in the Wolfram Language that connects to the Reddit API. When I asked Stephen exactly what I should do in the project, he replied, “Show me the sociology of Reddit.” Wow. A tall order, but it allowed me the freedom to take the project wherever I wanted. So, I immediately started doing a little research. I initially wanted to figure out a way to profile certain users with the Big Five psychological scheme from user-generated text. I also wanted to use network text analysis to efficiently map out what was going on in subreddits. And with the help of my mentor and some of the other students, I was able to build up a little code to generate some networks.
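Pulling the raw data is the easy part, thanks to the built-in service connection. A minimal sketch; the connection opens a browser dialog to authenticate, and the available request names are best discovered from the connection itself:

    reddit = ServiceConnect["Reddit"]   (* authenticate with your Reddit account *)
    reddit["Requests"]                  (* list the requests this connection supports *)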

Using some very simple code, I first analyzed an AMA (ask me anything) that Stephen Wolfram did and got a nice network.

Text network from a Stephen Wolfram AMA
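The actual code isn’t reproduced here, but the general idea is easy to sketch: treat each question as a node and connect two questions whenever they share a content word. In this reconstruction, the toy questions list and the shared-word similarity rule are my own illustrations, not the project’s exact method:

    questions = {
       "What is the future of computation?",
       "How did you come up with rule 30?",
       "What advice do you have for graduate students?",
       "Will computation eventually replace mathematics?"};

    (* the content words of a string, lowercased, with stopwords removed *)
    words[s_String] := DeleteStopwords[TextWords[ToLowerCase[s]]]

    (* point each question at any earlier question it shares a word with *)
    edges = Flatten[
       Table[If[Intersection[words[questions[[i]]], words[questions[[j]]]] =!= {},
          DirectedEdge[i, j], Nothing], {i, 2, Length[questions]}, {j, 1, i - 1}]];

    g = Graph[Range[Length[questions]], edges, VertexLabels -> Automatic]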

From there, I got rid of the directed edges and added tooltips to the graph, so that when you hover over a node you can see the chunk of text (a question from the AMA) it represents.

Text network from Stephen Wolfram AMA with Tooltip function
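Reusing the questions, edges, and graph from the sketch above, one way to do this is to rebuild the graph with undirected edges and place each question string as a tooltip label:

    g2 = Graph[Range[Length[questions]], UndirectedEdge @@@ (List @@@ edges),
       VertexLabels -> Table[i -> Placed[questions[[i]], Tooltip], {i, Length[questions]}]]

Hovering over a node now pops up the full question text instead of cluttering the layout with labels.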

So I then went a little further to try out some of the graph/network functions in the Wolfram Language and Mathematica. In social network analysis, we’re often interested in similarities, cliques, social capital, and other measures. The Wolfram Language has something called CommunityGraphPlot that makes it easy to group nodes together; in this case, it groups similar questions from the AMA together.

Community graph plot of Stephen Wolfram AMA
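With the sketch graph from above, that’s a one-liner. FindGraphCommunities returns the grouping as lists of vertices if you want to inspect it, and CommunityGraphPlot draws it:

    FindGraphCommunities[g2]   (* the communities, as lists of question indices *)
    CommunityGraphPlot[g2]     (* a community plot of the same graph *)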

And to further visualize this, we can adjust the node size based on the upvote score of a particular question.

Community graph plot of a Stephen Wolfram AMA scaled by popularity.
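Here’s a sketch of how that scaling might be set up, passing VertexSize through to the plot. The scores list stands in for hypothetical upvote counts, one per question, and Rescale keeps the node sizes within a sensible range:

    scores = {120, 85, 40, 15};   (* hypothetical upvote scores for the four questions *)
    CommunityGraphPlot[g2,
       VertexSize -> Thread[Range[Length[scores]] -> Rescale[scores, MinMax[scores], {0.2, 0.8}]]]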

I did all sorts of things from there, such as using a built-in sentiment classifier and a topic classifier that uses Facebook data as a training set, and I started work on a classifier that would be able to identify certain social-psychological characteristics in text. But that’s for another post.
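The first two are available out of the box via Classify; for example:

    Classify["Sentiment", "I absolutely loved exploring these rule spaces."]
    (* returns a class such as "Positive" *)

    Classify["FacebookTopic", "Here is a network I built from an AMA thread."]
    (* returns a topic class inferred from the text *)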

As you can see, I was able to accomplish a lot in two weeks working on the project with the excellent help of my mentor. In fact, it proved so useful that I had a dissertation project ready to pursue when the IRB yanked approval of my HIV/PrEP work due to the sensitive nature of health-related issues in clinics. Considering I entered the school with fairly novice-level programming abilities, it was quite impressive (at least I thought so) that I was able to carry this project forward.

And as a bonus, guess what? I now work for Wolfram Research, as have several other alumni over the years. So not only did I get to learn valuable new skills, I also got a dream job at a tech company.

So, if you’re looking for something to do over the summer, or if you want to learn how to program, consider the Wolfram Summer School, or play around with the Open Code feature in Wolfram|Alpha. You never know where it will take you, and you may even get a job out of it.