r/singapore Simulator

Generating Reddit comments with Markov chains

Previously, we generated some new sentences from a small pool of existing sentences. Now, we’ll generate some new comments based on all the existing comments in a Reddit thread!

This post is a continuation of “A Primer on Markov Chains”, which introduces Markov chains and how they can be used to generate text:

r/singapore viewed from space

After reading the daily r/singapore random discussion and small questions thread last week, I wondered if it would be possible to generate comments which looked like they came from there. Now, we’ll be doing exactly that with today’s thread:

We’ll use the Python Reddit API Wrapper, or PRAW, to download all the comments from a Reddit thread. While the Reddit API is quite interesting and working with the Reddit comment tree could be an interesting exercise on its own, for now we’ll stick to PRAW for simplicity.

We’ll also use the Markovify library for building our Markov model instead of our own implementation, once again just out of convenience. Markovify is already widely used for a variety of text generation applications, while our own implementation doesn’t even handle casing or punctuation at the moment.

Downloading Reddit comments

Did you know that you can append .json (or .xml) to a subreddit or thread URL to get the JSON (or XML) representation of its content? For example:

https://www.reddit.com/r/singapore.json
https://www.reddit.com/r/singapore/comments/8eoesx/rsingapore_random_discussion_and_small_questions.json

This is similar to what you’ll get through the Reddit API itself, although probably with stricter rate limits than through the API.
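As a rough illustration of what that JSON looks like, here is a minimal sketch that walks the comment tree of a thread's .json response using only the standard library. It assumes the usual shape of the response: a list of two listings (the post, then the comments), where each comment has kind "t1" and its "replies" field is either an empty string or another listing. The function names are made up for this example.

```python
import json
import urllib.request


def fetch_json(url, user_agent):
    """Download and parse a Reddit .json URL (this makes a network request)."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def iter_bodies(listing):
    """Yield comment bodies from a comment listing, depth first."""
    for child in listing["data"]["children"]:
        if child["kind"] != "t1":
            # "more" placeholders stand in for comments that were not returned.
            continue
        data = child["data"]
        yield data["body"]
        # "replies" is an empty string when a comment has no replies.
        if data.get("replies"):
            yield from iter_bodies(data["replies"])
```

For a thread URL, the parsed response could be unpacked as `post, comments = fetch_json(url, user_agent)` and the bodies read with `iter_bodies(comments)`. Note that the "more" placeholders are simply skipped here, so large threads will come back incomplete.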

In order to use the Reddit API, we’ll have to create a new Reddit application and obtain some OAuth2 credentials, which the Reddit API uses for authentication.

Creating a Reddit application

Log in to your Reddit account and head to the bottom of https://www.reddit.com/prefs/apps to create a new Reddit application. You should see the following form:

Take a look at the API usage guidelines, but don’t worry about registering for production usage right now — we won’t be doing anything that could be considered production usage.

Make sure you’ve selected “script”. Give your new application a name and a description, and put in an about url (perhaps this blog post? haha). The redirect url shouldn’t matter for a script application, so put anything you want or just point it at localhost:

After you’ve created your application, you’ll be able to obtain your OAuth2 client ID and secret. Your client ID is at the top right underneath the app name and type, and you can see your client secret in the “secret” field.

Iterating over comments with PRAW

I want to write about working directly with the Reddit API next time, but right now this is all we need to start using PRAW. Comment extracting and parsing seems to be a pretty common use case for PRAW, so there’s a comprehensive tutorial in the PRAW documentation already:

The following code will iterate over all the comments in a thread and print the author and comment body. The Reddit API does not necessarily return every comment at once; sometimes the response contains placeholders. submission.comments.replace_more(limit=None) ensures that we replace all the placeholders with the actual comments they represent. For large threads, this step will take up most of the time spent.
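A minimal sketch of that loop might look like the following. The helper names, the "[deleted]" label and the user agent string are placeholders made up for this example; the PRAW calls themselves (praw.Reddit, reddit.submission, replace_more and comments.list) are the ones covered in the PRAW comment tutorial.

```python
def make_reddit(client_id, client_secret, username):
    """Create a read-only PRAW client for a "script" application."""
    import praw  # third-party: pip install praw

    return praw.Reddit(
        client_id=client_id,
        client_secret=client_secret,
        # Placeholder app name and version; use your own.
        user_agent=f"python:rsingapore-simulator:v0.1 (by /u/{username})",
    )


def iter_comments(submission):
    """Yield (author name, body) for every comment in a thread."""
    # Replace every "more comments" placeholder with the real comments.
    # On large threads this is the slow part.
    submission.comments.replace_more(limit=None)
    # .list() flattens the comment tree into a single list.
    for comment in submission.comments.list():
        # author is None when the account has been deleted.
        author = comment.author.name if comment.author is not None else "[deleted]"
        yield author, comment.body


def print_thread(reddit, thread_id):
    """Print the author and body of each comment in the given thread."""
    for author, body in iter_comments(reddit.submission(id=thread_id)):
        print(author, body, sep="\n", end="\n\n")
```

With real credentials, something like `print_thread(make_reddit(client_id, client_secret, username), "8eoesx")` would print the whole thread.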

You’ll notice the comments are a little messy, especially with large amounts of whitespace here and there. We’ll need to do a bit of cleaning up before we can feed them to Markovify to build our Markov model.

Note on user agents

https://github.com/reddit-archive/reddit/wiki/API

The Reddit API rules call for applications accessing the API to declare unique User-Agent strings and include your username as contact information, hence our user agent string in the example. This is also mentioned in the PRAW documentation.
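The format suggested in the API rules is `<platform>:<app ID>:<version string> (by /u/<reddit username>)`. A tiny helper for putting one together could look like this; the app ID, version and username below are placeholders, not values from the original script.

```python
def build_user_agent(platform, app_id, version, username):
    """Build a user agent string in the format the Reddit API rules suggest."""
    return f"{platform}:{app_id}:{version} (by /u/{username})"


# Placeholder values; substitute your own app name and Reddit username.
print(build_user_agent("python", "rsingapore-simulator", "v0.1", "your_username"))
```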

Cleaning comments

The most important thing we need to do is ensure that sentences in comments end with a punctuation mark, so that Markovify knows where sentences end. This depends on the commenter’s style but is usually not the case for short comments. We’ll have to go through the comments and add full stops where necessary.

We’ll also remove unnecessary whitespace to make the comments more readable without affecting the generated comments too much (since whitespace is ignored in the Markov model anyway).

We’ll perform the following transformations on the comment bodies:

import re
text = comment.body

# strip whitespace
text = text.strip()

# collapse multiple line breaks
text = re.sub('\n+', '\n', text)

# strip each line
text = '\n'.join(line.strip() for line in text.split('\n'))

# add a full stop if a line doesn't end with a punctuation mark already
text = re.sub('([^.?!])(\n|$)', '\\1.\\2', text)
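To see the effect on a concrete input, the same four transformations can be wrapped in a function and run on a made-up comment. The function name and the sample text are just for illustration.

```python
import re


def clean_comment(body):
    """Apply the whitespace and punctuation clean-up to one comment body."""
    text = body.strip()
    # Collapse runs of line breaks into a single line break.
    text = re.sub(r"\n+", "\n", text)
    # Strip leading and trailing whitespace from each line.
    text = "\n".join(line.strip() for line in text.split("\n"))
    # Add a full stop to any line that doesn't already end with . ? or !
    text = re.sub(r"([^.?!])(\n|$)", r"\1.\2", text)
    return text


sample = "  lunch at maxwell anyone\n\n\n   Which stall?  "
print(clean_comment(sample))
# lunch at maxwell anyone.
# Which stall?
```

The first line gains a full stop, while the second is left alone because it already ends with a question mark.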

Writing our script

Here’s a simple script which can be run from the command line to download and clean comments from a given thread and save them to a file. You’ll still have to manually put your OAuth credentials and username into the script though — perhaps it would be better to use environment variables?
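On the environment variable idea, here is one way the command-line and credential handling could be sketched. This covers only argument parsing and credential loading, not the download itself. The variable names (REDDIT_CLIENT_ID and so on), the --output long option and the default output filename are assumptions made for this example.

```python
import argparse
import os

# Hypothetical environment variable names for the OAuth credentials.
ENV_NAMES = {
    "client_id": "REDDIT_CLIENT_ID",
    "client_secret": "REDDIT_CLIENT_SECRET",
    "username": "REDDIT_USERNAME",
}


def build_parser():
    """Parse a thread ID plus an optional output file, e.g. -o comments.txt 8eoesx."""
    parser = argparse.ArgumentParser(
        description="Download and clean the comments in a Reddit thread."
    )
    parser.add_argument("thread_id", help="ID of the thread, e.g. 8eoesx")
    parser.add_argument(
        "-o", "--output", default="comments.txt", help="file to write comments to"
    )
    return parser


def load_credentials(env=None):
    """Read the credentials from the environment instead of the source code."""
    env = os.environ if env is None else env
    missing = [name for name in ENV_NAMES.values() if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {key: env[name] for key, name in ENV_NAMES.items()}
```

This keeps the client secret out of the script, so the file can be shared or committed without leaking it.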

Let’s download all the comments from today’s r/singapore random discussion and small questions thread:

(venv) PS > python .\get_comments.py -o comments.txt 8eoesx
15:36:32 Downloading comments...
15:36:32 Expanding more children...
15:37:07 Cleaning comments...
15:37:07 Done!
(venv) PS >

Time to generate some new ones!

Generating new comments

Using Markovify

Markovify’s basic usage is extremely simple. From its README:

import markovify

# Get raw text as string.
with open("/path/to/my/corpus.txt") as f:
    text = f.read()

# Build the model.
text_model = markovify.Text(text)

# Print five randomly-generated sentences
for i in range(5):
    print(text_model.make_sentence())

# Print three randomly-generated sentences of no more than 140 characters
for i in range(3):
    print(text_model.make_short_sentence(140))

We just need to pass a string containing our corpus — in our case, all our comments — to the markovify.Text constructor to create a model, then we can call .make_sentence() on our model to get as many new sentences as we want.

Markovify options

The .make_sentence() method takes several parameters which affect the generated output:

  • max_overlap_total and max_overlap_ratio
By default, markovify.Text tries to generate sentences that don't simply regurgitate chunks of the original text. The default rule is to suppress any generated sentences that exactly overlaps the original text by 15 words or 70% of the sentence's word count. You can change this rule by passing max_overlap_ratio and/or max_overlap_total to the make_sentence method. Alternatively you can disable this check entirely by passing test_output as False.

By reducing the values of max_overlap_total and max_overlap_ratio, you can make the generated sentences less similar to the original text.

  • tries
By default, the make_sentence method tries, a maximum of 10 times per invocation, to make a sentence that doesn't overlap too much with the original text. If it is successful, the method returns the sentence as a string. If not, it returns None. To increase or decrease the number of attempts, use the tries keyword argument, e.g., call .make_sentence(tries=100).

Sometimes, you may notice that .make_sentence() returns None. This happens when every one of its attempts (up to tries of them) produced a sentence that was too similar to the original text. You can increase the tries parameter to allow more attempts.

Writing another script

This script reads text files containing the comments we downloaded and cleaned, builds a model using Markovify, then generates new comments:

Here, I’ve set max_overlap_ratio=0.5, and tries=20.
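The core of such a script could be sketched roughly as follows. The function names and the max_attempts cap are assumptions made for this example; markovify.Text and the max_overlap_ratio and tries arguments to make_sentence are the ones described above.

```python
def build_model(paths):
    """Build one Markovify model from one or more cleaned comment files."""
    import markovify  # third-party: pip install markovify

    corpus = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            corpus.append(f.read())
    return markovify.Text("\n".join(corpus))


def generate_comments(model, count, max_overlap_ratio=0.5, tries=20, max_attempts=None):
    """Return up to `count` generated sentences, skipping failed attempts."""
    if max_attempts is None:
        # Arbitrary cap so that a tiny corpus cannot make this loop forever.
        max_attempts = count * 10
    comments = []
    attempts = 0
    while len(comments) < count and attempts < max_attempts:
        attempts += 1
        sentence = model.make_sentence(
            max_overlap_ratio=max_overlap_ratio, tries=tries
        )
        # make_sentence returns None when all of its tries overlapped too much.
        if sentence is not None:
            comments.append(sentence)
    return comments
```

Calling `generate_comments(build_model(["comments.txt"]), 10)` would then return up to ten new comments as a list of strings.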

The results

Let’s generate 10 new comments from today’s r/singapore random discussion and small questions thread:

(venv) PS > python3 .\generate_comments.py -n 10 .\comments.txt
Hope you have to be the smartest person on Earth, and I can tell, I am a spoilers kind of person too but oh man, it's taking all my guy friends cupcake.
pls you this kind of person you have to be cheated. im sure someone out there will be after 5pm till closing, because people are working.
This movie gives the audience a chance to see when the add to cart option would be a starlet that emerges haha.
Dint manage to get over this heartbreak. friends say going through the had work of working in retail before so this is an HR issue.
He deleted his acc already, and is trying to come up with people who live in war-torn regions would have to come back later saying it's not super urgent then wait lor haha.
Someone wanted to get in a giant spaceship, sucking the life out of bed, spend time on yourself.
On the other end that I know what I'm watching, and yesterday, I read that sunblock is the most profitable.
I think he alr got devoured by the Russo Brother since Age of Ultron were by Joss Whedon.
It's just the owner, so it's not being taken for granted if you see it as clean and safe as singapore though?
Maybe I would think that is going through the pre-movie advertisements.
(venv) PS >

If you look carefully, you can barely spot where these fragments were generated from:

But could you tell the generated comments apart on their own?

Final thoughts

This was a fun diversion. Before I did this, I didn’t realise how simple and effective Markov chains were for text generation.

At the same time, it almost feels like cheating the way one just needs to call the (excellent nevertheless) Markovify library and throw some text at it. This was why I decided to try making my own simple implementation as well, which I demonstrated in my previous post.

(For the same reason, I’ll also be working on my own wrapper for downloading and expanding all comments from the Reddit API without using PRAW, and hopefully I’ll be able to share it in the near future.)

Perhaps an extension to this project could be a Reddit web application (no longer just a script) where users put in a Reddit URL and have fake comments generated and displayed in a mock-up of a Reddit comment thread.

Another application of this technique could be something similar to r/SubredditSimulator, but for individual subreddits: like a parody subreddit, but with entirely automatically generated comments.