Reddatait: Analyzing over a TB of Reddit Comments to Construct the Largest Publicly Available Social Network Evolution Dataset
TL;DR: We used the recently published Reddit dataset, containing over 1.65 billion comments, to construct the largest publicly available social network corpus to date. This dataset contains detailed information on the evolution process of 11,965 social networks of various scales. It opens the door for new and exciting research opportunities in a variety of fields, such as social network analysis, social network security and privacy, and complex networks.
In our first study using this social network corpus, we considered how new users joining a network affect the network topology. Our results present evidence that the different patterns in which users join the network have a vast impact on the network topology. Additional details on our study, including an interface for interacting with the data, are available on the Network Dynamics project website.
As I mentioned in my previous post, I find it really fun to be a data scientist these days when there are so many diverse research opportunities out there. New datasets are becoming publicly available all the time. One of the most interesting datasets that I stumbled upon a few months ago was the Reddit dataset, which was released by Jason Michael Baumgartner in July 2015.
To those of you who aren’t familiar with Reddit, it is a news aggregation website and online social platform launched in 2005 by Steve Huffman and Alexis Ohanian. Reddit users (also known as “redditors”) can submit links on the website, which are then commented upon, and upvoted or downvoted by other users in order to increase or decrease the submission’s visibility. Redditors can also create their own subreddit on a topic of their choosing, make it public or private, and let other redditors join it. This makes Reddit a collection of online communities, centered around a variety of topics such as books, gaming, science, and asking questions.
The released Reddit dataset contains over 1.65 billion comments that were posted from October 2007 through May 2015. These comments were created by 13,213,173 users, with unique usernames, in 239,772 different subreddits. The dataset contains information on the exact time and date each comment was posted. Moreover, for each comment, the dataset contains the comment’s ID, as well as information on the user who posted it and the ID of the parent comment, i.e., the ID of the item to which the current comment replied.
This dataset is one of the largest and most detailed social network datasets that I believe exists out there. Moreover, it is definitely one of the largest datasets I have had the opportunity to analyze to date.
My advisor, Prof. Carlos Guestrin, and I used the Reddit dataset to explore the evolution process of social networks, drawing on the numerous examples of real-world social networks that can be constructed from it. We used the dataset to construct the largest publicly available social network corpus, with a size of over 41 GB, which contains detailed information on the evolution process of 11,965 social networks created from some of the most popular subreddits. Due to its huge size, analyzing this dataset was very challenging. After several trial-and-error attempts, we succeeded in extracting each subreddit’s social network evolution data using GraphLab Create and AWS. During our study, we mainly used EC2 memory-optimized spot instances (r3.4xlarge and r3.8xlarge) and stored the analyzed data in S3.
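To make the construction step more concrete, here is a minimal sketch in plain Python of how a per-subreddit reply network could be derived from comment records. It is an illustration only, not the GraphLab Create pipeline used in the study. The field names (`id`, `author`, `parent_id`, `subreddit`, `created_utc`), the `t1_` prefix marking a parent that is itself a comment, and the `[deleted]` author placeholder are assumptions about the record format, and `build_reply_edges` is a hypothetical helper name.

```python
from collections import defaultdict


def build_reply_edges(comments):
    """Group timestamped reply edges by subreddit.

    Returns {subreddit: [(replier, parent_author, created_utc), ...]}.
    Assumed record fields: id, author, parent_id, subreddit, created_utc.
    """
    # Map each comment ID to its author so parent comments can be resolved.
    author_by_id = {c["id"]: c["author"] for c in comments}

    edges = defaultdict(list)
    for c in comments:
        parent = c["parent_id"]
        # Assumption: a "t1_" prefix means the parent is a comment;
        # anything else (e.g. a reply to the submission itself) is skipped.
        if not parent.startswith("t1_"):
            continue
        parent_author = author_by_id.get(parent[len("t1_"):])
        replier = c["author"]
        # Skip unresolved parents, self-replies, and deleted accounts
        # (assumed to appear as "[deleted]").
        if parent_author is None or parent_author == replier:
            continue
        if "[deleted]" in (replier, parent_author):
            continue
        edges[c["subreddit"]].append((replier, parent_author, c["created_utc"]))
    return dict(edges)


# Toy records, made up for illustration.
sample = [
    {"id": "c1", "author": "alice", "parent_id": "t3_s1",
     "subreddit": "books", "created_utc": 100},
    {"id": "c2", "author": "bob", "parent_id": "t1_c1",
     "subreddit": "books", "created_utc": 200},
]
print(build_reply_edges(sample))
```

Keeping the timestamp on every edge is what lets a network be replayed over time, which is the kind of evolution information the corpus is about. At the scale of the full dataset, an in-memory dictionary like this would not be practical, so treat it as a description of the logic rather than a recipe.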
Then, we used this social network corpus to study how the patterns in which new users join a network affect the network topology. Intuitively, we wanted to check whether a social network’s structure differs depending on whether users join the network one at a time or a group of users joins all at once. Using the Reddit dataset, we discovered evidence that the rate at which users join the social network is a central factor in molding a network’s structure; that is, different arrival patterns create different topological properties.
What is more interesting is that we showed it is possible to uncover the types of user joining patterns by analyzing a social network’s topology and using machine learning algorithms. In other words, we demonstrated that by looking at a social network’s structure, we can in many cases predict the rate at which users join the network. This gives insight into how social networks evolve and can be used to develop more accurate complex network evolution models that are useful for a wide range of research fields.
To help other people better understand the social network corpus, we developed a web interface for exploring the various subreddits’ social networks. More details on our study can be found in our paper:
- Michael Fire and Carlos Guestrin, “Analyzing Complex Network User Arrival Patterns and Their Effect on Network Topologies,” 2016.
- IPython Notebook Code Tutorial
Personally, what makes this work really exciting is that the social network corpus that we created in this study can open the door for a realm of new studies. Moreover, this corpus can be used as a ground-truth dataset for many studies in the field of social networks. Some examples of what can be done with this corpus are:
- Understanding topological factors that may help a post to go viral.
- Helping to better understand diffusion models and to validate them using real-world data.
- Using this dataset as a ground-truth dataset for evaluating entity-matching algorithms, based on our observation that many redditors are members of several subreddits.
- Helping understand the connection between content and social networks; for instance, this dataset can provide insight on what type of content makes users more likely to interact with each other.
We also created the user arrival curves of each of the subreddits, which can be downloaded from our website. This time-series dataset is really interesting. Similar to Google Trends, it can help in understanding how the popularity of different topics changes over time. For example, our dataset revealed how certain subreddits oriented around TV shows increased in popularity when a new season of the show was about to be aired (and we also developed an algorithm for identifying interesting anomalous user arrival curves).
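For readers who want a feel for what a user arrival curve is, here is a minimal sketch in plain Python. It assumes that a user "arrives" in a subreddit at the time of their first comment there, and that each record has `author`, `subreddit`, and `created_utc` (seconds since the epoch) fields. Both the arrival definition and the helper name `arrival_curve` are assumptions made for illustration, not necessarily the definition used for the downloadable curves.

```python
from collections import Counter


def arrival_curve(comments, bucket_seconds=86400):
    """Count first-time commenters per time bucket for each subreddit.

    Returns {subreddit: [(bucket_index, new_user_count), ...]} sorted by
    bucket, where bucket_index = created_utc // bucket_seconds.
    """
    # Earliest comment time for every (subreddit, author) pair.
    first_seen = {}
    for c in comments:
        key = (c["subreddit"], c["author"])
        ts = int(c["created_utc"])  # tolerate timestamps stored as strings
        if key not in first_seen or ts < first_seen[key]:
            first_seen[key] = ts

    # Tally how many users arrived in each bucket, per subreddit.
    counts = {}
    for (subreddit, _author), ts in first_seen.items():
        counts.setdefault(subreddit, Counter())[ts // bucket_seconds] += 1
    return {s: sorted(c.items()) for s, c in counts.items()}


# Toy records, made up for illustration; one bucket per 100 seconds.
sample = [
    {"author": "alice", "subreddit": "books", "created_utc": 10},
    {"author": "bob", "subreddit": "books", "created_utc": 50},
    {"author": "carol", "subreddit": "books", "created_utc": 150},
    {"author": "alice", "subreddit": "books", "created_utc": 200},
]
print(arrival_curve(sample, bucket_seconds=100))
```

With a daily or weekly bucket, a spike in such a curve would correspond to many users showing up in a short window, which is the kind of pattern described above for TV show subreddits.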
If you have any questions regarding this dataset, please feel free to contact me.