Scraping your Shower Thoughts

Acknowledgments: This story is adapted from excellent notebooks and tutorials by Dr. Brian Keegan’s Information Exploration class.

Scraping and analyzing web data is a big part of data science. Pulling information from a website can help us understand human behavior or challenge the way we think and do things. Of course, scraping can also be misused against people in general or against specific individuals. Either way, the code is not as difficult as it might seem.

Reddit, “the front page of the internet,” is full of users and data. Subreddits are a glorious window into our current culture, locally, nationally, and in some ways globally, too. People from all over can find memes and weird content, answer questions, or just follow along.

Step 1: Create a Reddit account.

Afterwards, go to: https://old.reddit.com/prefs/apps

This will allow you to create a web application and obtain data from different subreddits. You will need to take note of the app's client ID and secret.

Step 2: Install PRAW

From your terminal, type in: conda install -c conda-forge praw

PRAW, short for Python Reddit API Wrapper, is a wrapper library built to communicate with the Reddit API.

Step 3: Scrape Reddit

After opening up a new Jupyter notebook, you'll first want to import a few useful libraries: requests, json, BeautifulSoup, datetime, pandas, matplotlib, and praw.
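
Something like the following should cover the imports (exact versions will vary; bs4 is the package BeautifulSoup is imported from):

import requests
import json
from bs4 import BeautifulSoup
from datetime import datetime
import pandas as pd
import matplotlib.pyplot as plt
import praw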

From there, you can create an API connector object (r), shown below, that will authenticate with the API and handle making the requests.
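
A minimal sketch of that connector looks something like this; the client_id, client_secret, and user_agent strings are placeholders for the codes you noted in Step 1:

r = praw.Reddit(client_id='YOUR_CLIENT_ID',
                client_secret='YOUR_SECRET',
                user_agent='showerthoughts scraper by u/your_username')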

From here, you're all set up to start diving into a specific subreddit. Take, for example, r/Showerthoughts.

Create an object to store the various attributes of this subreddit. This subreddit object will have a lot of different attributes and methods you can call on it. Mine will be called shower_subreddit.
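
In code, that's just one line, using the API connector r from Step 3:

shower_subreddit = r.subreddit('Showerthoughts')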

Step 4: Check the Packaging

A few attributes you can call on it include:

  • The time the subreddit was founded: shower_subreddit.created_utc

Note that the API returns this as a UNIX timestamp, meaning the answer is given in seconds since 1 January 1970. So you'll want to convert it into a more readable datetime, using utcfromtimestamp.

print(datetime.utcfromtimestamp(shower_subreddit.created_utc))

  • Number of subscribers: '{0:,}'.format(shower_subreddit.subscribers)
  • If the subreddit users must be over 18: shower_subreddit.over18
  • The active user count: shower_subreddit.active_user_count
  • The description: print(shower_subreddit.description)
  • The rules of the subreddit: shower_subreddit.rules()['rules'] (this will return a list of rules as dictionaries)
  • When each rule was created (use a for loop to iterate through each rule and pull out its timestamp; see the sketch after this list)
  • Get a list of moderators
  • Get a list of submissions using various methods:

.controversial()

.hot()

.new()

.rising()

.search()

.top()
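
Here's a rough sketch of the rules and moderators bullets; the rule dictionary keys (short_name, created_utc) and the moderator() call are taken from the PRAW documentation, though older and newer PRAW versions may differ slightly:

# print when each rule was created
for rule in shower_subreddit.rules()['rules']:
    print(rule['short_name'], datetime.utcfromtimestamp(rule['created_utc']))

# get the list of moderators
moderators = [mod.name for mod in shower_subreddit.moderator()]
print(moderators)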

For the submission methods, you can use .top(), for example, to obtain the top 25 submissions of the subreddit from the past 12 months. (This portion uses the PRAW library.) The resulting top25_news object is a ListingGenerator defined by PRAW: it lets you loop through the results and perform operations, but it doesn't actually go out and fetch the data until you iterate over it.

top25_news = r.subreddit('showerthoughts').top(time_filter='year', limit=25)

From there, you can create an empty list and loop through the generator to collect the top 25 submissions.

top25_submissions = []

for submission in r.subreddit('showerthoughts').top(time_filter='year', limit=25):
    top25_submissions.append(submission)

Step 5: Analyze the Data

When you extract features from each individual submission, you'll make an API call for each one. A straightforward approach is to store each submission's features in a dictionary and append that dictionary to an external list. You can then turn that list of dictionaries into a pandas DataFrame.
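
As a rough sketch, using a few common submission attributes from the PRAW docs (title, score, num_comments, created_utc) and the top25_submissions list from Step 3:

submission_dicts = []

for submission in top25_submissions:
    submission_dicts.append({'title': submission.title,
                             'score': submission.score,
                             'num_comments': submission.num_comments,
                             'created': datetime.utcfromtimestamp(submission.created_utc)})

top25_df = pd.DataFrame(submission_dicts)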

Say you want to know whether there's a correlation between the popularity of a post and the number of comments it receives. You can plot that relationship using the score and the number of comments. (The score is roughly the number of upvotes minus downvotes a post receives.)
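
A minimal way to draw that plot, assuming the top25_df DataFrame built above:

top25_df.plot.scatter(x='score', y='num_comments')
plt.xlabel('Score')
plt.ylabel('Number of comments')
plt.show()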

This graph shows that this subreddit is not a highly debated one. Even as the score increases, the number of comments doesn't typically increase with it. People are more interested in the main post and tend to upvote or downvote it; many don't seem to stay and read the comments that follow.

Conclusion:

This small post barely scratches the surface of what you can do with web scraping. The best advice I can offer for getting better at scraping is to keep practicing with as many different datasets as you can. As I am not an expert on this topic, I, too, have a lot to learn, and I appreciate any comments, tips, or suggestions!
