Making Reddit Safer For Work with /u/RiskyClickerBot
In my previous post, I described the basic idea behind RiskyClickerBot — a Reddit bot which scans comments for links to risky images and analyzes those images. This is a more in-depth and technical post describing how the bot works. If you want to dive right into the source code, you can do that here.
The Gears And Wheels
This bot runs on Reddit, which means it requires access to Reddit’s API. PRAW: The Python Reddit API Wrapper is what you’ll need for the job. The bot is meant to analyze images, and Reddit uses Imgur heavily for hosting images, so access to Imgur’s API is very handy. The Python bindings for that are available here.
The core technology behind the bot is a Convolutional Neural Network, which is trained to classify images as either NSFW or SFW. Now this could have been a tall task in itself. Fortunately, Yahoo open-sourced this exact model last year. Unfortunately, the code is written using Caffe, and it requires setting up a Google Cloud or AWS GPU instance, which, while free, is fairly labour-intensive. Clarifai, on the other hand, offers the same tech through API calls.
All of this allows us to build the bot in Python.
But how do you keep it running?
Typically, a bot can be kept running locally using a cron job. This, however, requires your PC to stay on for it to work. The neater way of making this work is to use a cloud platform like Heroku. So that’s what I did.
So we have our tools for the job ready. Let’s actually build the bot.
Setting up /u/RiskyClickerBot
I started building a simple test bot by following this tutorial. Since that covers API set-up for PRAW, I’m not going to go into that again.
The Imgur API can be set up by using this link. This video tutorial came in handy. Note that you need an Imgur account to access the API. However, since the bot is not going to post, vote, or comment on Imgur, we can use the API from an anonymous context as shown here.
Clarifai’s API can be set up very easily with three lines of code.
Once all the APIs have been set up, we can actually get into the bot’s code.
Getting the comments stream
Let’s say you have a Reddit bot set up as the tutorials showed earlier. We require the bot to read Reddit’s comment stream. To make the bot available site-wide, we read the comments from r/all.
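Here is a minimal sketch of reading that stream. It assumes you already have an authenticated praw.Reddit instance from the set-up tutorial; the helper name stream_comments is my own and not something from the bot's source.

```python
def stream_comments(reddit, subreddit_name="all"):
    """Yield comments from a subreddit as they are posted.

    `reddit` is assumed to be an authenticated praw.Reddit instance.
    """
    # subreddit("all") covers the whole site; stream.comments() is a
    # generator that keeps yielding new comments as they arrive.
    for comment in reddit.subreddit(subreddit_name).stream.comments():
        yield comment
```

With a real Reddit instance, the bot's main loop would then be a plain `for comment in stream_comments(reddit):` that never ends on its own.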
Calling the bot into action
We only want the bot to act when someone says “risky click”. So…
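One way to check for the phrase is a small case-insensitive regex. This is a sketch under my own assumptions: the name is_trigger and the exact pattern are illustrative, and the real bot may match the phrase differently.

```python
import re

# "risky click" in any letter case, with any amount of whitespace
# between the two words. The \b anchors stop it matching inside
# longer words such as "riskyclicker".
TRIGGER = re.compile(r"\brisky\s+click\b", re.IGNORECASE)


def is_trigger(comment_body):
    """Return True if the comment text contains the trigger phrase."""
    return TRIGGER.search(comment_body) is not None
```

In the main loop this would be called on comment.body for every comment coming off the stream.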
This phrase is usually said in response to some other comment like in this image here. We want to analyze that parent comment.
And to do that, we first have to get the parent comment:
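A hedged sketch of that step, using PRAW's Comment.is_root attribute and Comment.parent() method. The function name get_parent_comment is hypothetical, and skipping top-level comments is my assumption about what the bot should do.

```python
def get_parent_comment(comment):
    """Return the comment that `comment` replies to, or None.

    `comment` is assumed to be a praw.models.Comment.
    """
    # A top-level comment replies to the submission itself, so there
    # is no parent comment to analyze.
    if comment.is_root:
        return None
    # For a reply, parent() returns the Comment one level up.
    return comment.parent()
```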
Analyzing a comment
The code snippet above shows the basic action that the bot performs when triggered.
The generate_bot_comment() function has a lot of bells and whistles which you can find in my source code, but the basic idea is this:
We need to find the risky URLs in the parent comment, and then handle those URLs. This we can do by using a Regular Expression.
The regex was largely taken from this Stack Overflow answer. Once we have the list of URLs to analyze, we need to handle them. A URL may be a link to an Imgur album (multiple images), an Imgur image, an Imgur gallery, a direct link to an image on a different host, or not an image at all. Each of these is handled slightly differently.
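To make the extraction step concrete, here is a deliberately simplified sketch. The pattern below is my own and much cruder than the one from the Stack Overflow answer; find_urls is a hypothetical name.

```python
import re

# Anything starting with http:// or https:// up to the next whitespace,
# angle bracket, square bracket or parenthesis. Stopping at brackets
# lets this cope with Reddit's [text](url) link syntax, at the cost of
# truncating URLs that legitimately contain parentheses.
URL_PATTERN = re.compile(r"https?://[^\s<>\[\]()]+")


def find_urls(text):
    """Return the unique URLs found in `text`, in order of appearance."""
    urls = []
    for match in URL_PATTERN.findall(text):
        # Drop punctuation that usually belongs to the sentence.
        url = match.rstrip(".,;:!?\"'")
        if url and url not in urls:
            urls.append(url)
    return urls
```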
Since the Clarifai API produces an SFW/NSFW score for a single image, the score for an album is the average of the scores of its images. (Not ideal, but works well in practice.)
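The averaging itself is a one-liner. In this sketch, album_nsfw_score is a hypothetical name and the scores are assumed to be floats between 0 and 1, one per image in the album.

```python
def album_nsfw_score(image_scores):
    """Average the per-image NSFW scores of an album.

    Returns None for an empty album so the caller can skip it.
    """
    if not image_scores:
        return None
    return sum(image_scores) / len(image_scores)
```

This also shows why the average is not ideal: one clearly explicit image among several safe ones gets diluted, so an album scoring 0.9, 0.1 and 0.1 averages out to roughly 0.37.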
Imgur links have a few different formats. So it’s useful to know where it comes from. Gallery links could be to an album, an image, a gif, etc.
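Sorting a URL into one of those buckets could look roughly like this. The category names, the list of image extensions, and the path rules are my assumptions for illustration; Imgur has more URL shapes than this sketch covers.

```python
from urllib.parse import urlparse

# Extensions treated as "direct link to an image" (an assumption).
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")


def classify_link(url):
    """Guess what kind of link `url` is from its host and path."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    path = parsed.path
    if host == "imgur.com" or host.endswith(".imgur.com"):
        if path.startswith("/a/"):
            return "imgur_album"
        if path.startswith("/gallery/"):
            # Could still be an album, an image or a gif underneath.
            return "imgur_gallery"
        return "imgur_image"
    if path.lower().endswith(IMAGE_EXTENSIONS):
        return "direct_image"
    return "not_image"
```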
Once we know the kind of link that we have received, we can actually do the analysis using Clarifai’s API.
The Clarifai API returns a dictionary containing the results. So the result is a key:value pair that you need to access.
prediction = model.predict_by_url(link)
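Pulling the scores out of that dictionary might look like the sketch below. The nested layout (outputs, then data, then a list of concepts with a name and a value) is my assumption about the response shape, so check it against an actual response before relying on it. extract_scores is a hypothetical helper.

```python
def extract_scores(prediction):
    """Map each concept name (e.g. 'nsfw', 'sfw') to its score.

    Assumes `prediction` is shaped like
    {"outputs": [{"data": {"concepts": [{"name": ..., "value": ...}]}}]}.
    """
    concepts = prediction["outputs"][0]["data"]["concepts"]
    return {concept["name"]: concept["value"] for concept in concepts}
```

The NSFW score for the image would then be `extract_scores(prediction)["nsfw"]`.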
The Clarifai NSFW/SFW prediction model
The crux of it is a Convolutional Neural Network which is trained specifically to distinguish between NSFW and SFW images. A good explainer on ConvNets can be found here.
Sending the bot’s reply
With this result, we have to frame a reply message, keeping in mind the Reddit markup syntax. In all honesty, this was the bigger problem in comparison to getting the rest of the bot to work.
Once the reply is framed, we can make the comment by doing:
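As a rough sketch of both steps: format_reply builds a small Reddit markdown table and post_reply sends it with PRAW's Comment.reply(). Both function names and the table layout are made up for illustration; the bot's real message has more to it.

```python
def format_reply(results):
    """Build a Reddit markdown table from (url, nsfw_score) pairs.

    Scores are assumed to be floats between 0 and 1.
    """
    # Reddit tables need a header row followed by an alignment row.
    lines = ["Link | NSFW score", ":--|--:"]
    for url, score in results:
        lines.append("[link]({}) | {:.0%}".format(url, score))
    return "\n".join(lines)


def post_reply(comment, results):
    """Reply to `comment` (assumed to be a praw Comment) with the table."""
    comment.reply(body=format_reply(results))
```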
Some bells and whistles
Heroku has been a great way to set up this bot. However, one key issue is that the bot should not process the same comment twice. This can be handled by setting up a log of all comments previously processed.
The gotcha is that any log stored on Heroku is lost when Heroku periodically restarts the bot. Solving this problem isn’t too difficult, however: we need an external storage server for the log. Heroku’s Memcached Cloud add-on is a great tool for this. I found this tutorial really easy to follow in setting that up.
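The log itself can be very small. The sketch below assumes only that the storage client exposes get(key) and set(key, value), as Memcached clients generally do; ProcessedLog, InMemoryStore and the "seen:" key prefix are names I made up for illustration.

```python
class ProcessedLog:
    """Remembers which comment IDs the bot has already handled."""

    def __init__(self, store, prefix="seen:"):
        # `store` is any object with get(key) and set(key, value),
        # for example a Memcached client.
        self.store = store
        self.prefix = prefix

    def already_processed(self, comment_id):
        # A cache miss is assumed to come back as None.
        return self.store.get(self.prefix + comment_id) is not None

    def mark_processed(self, comment_id):
        self.store.set(self.prefix + comment_id, "1")


class InMemoryStore:
    """Stand-in for the external store, handy for local testing."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value
```

Swapping InMemoryStore for a real Memcached client is what makes the log survive restarts; the in-memory version has exactly the problem described above.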
Another really great add-on is the Papertrail log keeper. This add-on allows you to set up alerts. I’ve set up alerts to let me know whenever the bot crashes, and hooked them up to my Slack account!
(These two add-ons are free to use, but they require adding a credit card to your Heroku developer account.)
And there we go. This is how I built RiskyClickerBot. Let me know if something isn’t very clear — I did indulge in a certain amount of hand-waving here :D
If you see RiskyClickerBot do a good job on Reddit, don’t forget to say “good bot”! And if it botches up, let me know! I use the handle /u/PigsDogsAndSheep. :)