AI: Is This Porn?

Daniel Shapiro, PhD
4 min read · Aug 13, 2017


Porn drives the format wars. It has for a generation. The internet is full of porn.

The internet is really, really great… for porn.

— Avenue Q

Detecting and filtering out unwanted content is essential for an open society. “Keep it in the bedroom” is a good business policy. But what is “bad/naughty” content? How do we identify it?

I shall not today attempt further to define the kinds of material I understand to be embraced within that shorthand description [“hard-core pornography”], and perhaps I could never succeed in intelligibly doing so. But I know it when I see it, and the motion picture involved in this case is not that.

— Supreme Court Justice Potter Stewart, 1964

Filtering out porn is a key business objective for search giant Google and social media giant Facebook, both of which need to present clean content. Those selling access to online porn are not eager to be censored. An old trick among porn sellers was to change a single pixel in a censored image, turning a blocked image into an unblocked “new” one. But technology caught up: perceptual hashes can now identify content marked as porn even through watermarks and other manipulations. In recent years, Convolutional Neural Networks (CNNs), which are exceptional image classifiers, have been deployed to classify and filter content, and there is no shortage of porn and not-porn images on the internet to train these binary classifiers.
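Perceptual hashing is easy to experiment with. Below is a minimal sketch using the Python ImageHash library; the file names and the distance threshold are my own illustrative choices, not from this article. Two images whose hashes differ by only a few bits are almost certainly the same picture, even after a one-pixel edit or a small watermark.

```python
# pip install Pillow ImageHash
from PIL import Image
import imagehash

# Perceptual hashes summarize an image's structure, so tiny edits
# (one changed pixel, a small watermark) barely move the hash.
original = imagehash.phash(Image.open("blocked.jpg"))       # hypothetical file
candidate = imagehash.phash(Image.open("resubmitted.jpg"))  # hypothetical file

# Subtracting two hashes gives the Hamming distance in bits.
distance = original - candidate
THRESHOLD = 8  # illustrative cutoff; tune on real data

if distance <= THRESHOLD:
    print(f"Likely the same image (distance={distance}): block it")
else:
    print(f"Probably a different image (distance={distance})")
```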

Is this picture of a woman behind a surfboard obscene? Obviously not. However, a computer does not know that there are likely clothes behind the surfboard unless we put in some effort to train it.

Detecting objects, even hot dogs, is a nuanced business. Filtering porn out of pictures of people is even more nuanced. Let’s give it a try.

DETECTING AND FILTERING

To keep this article nice and squeaky clean, let's filter out faces instead of porn. So, any pictures of humans are fine, as long as there are no faces.

Let’s start by collecting images of faces. The Visual Geometry Group (VGG) at Oxford built their Oxford Buildings Dataset using a short set of keywords to search Flickr. Yes, this is the same VGG that many famous CNNs are named after. Let’s do something similar. Using a Node.js scraper, we grab images from Bing and Google Images using the keywords: faces, faces angry, faces happy, faces sad, and, just for fun, celebrity faces. We end up with 2,945 JPEG images. To flatten this all into one folder and remove junk, we use a couple of the commands from my article on commands I always forget.
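The exact commands are in that article; as a rough sketch, here is a Python equivalent of what they do (flatten every nested folder into one, keep only JPEGs, and drop tiny junk files). The folder names and size cutoff below are illustrative assumptions.

```python
# Flatten a scraped image tree into one folder and drop obvious junk.
import os
import shutil

SRC = "scraped"      # hypothetical folder the scraper wrote into
DST = "faces"        # hypothetical flattened output folder
MIN_BYTES = 10_000   # illustrative cutoff for thumbnails and junk

os.makedirs(DST, exist_ok=True)
kept = 0
for root, _dirs, files in os.walk(SRC):
    for name in files:
        path = os.path.join(root, name)
        # Keep only reasonably sized JPEGs; everything else is junk.
        if not name.lower().endswith((".jpg", ".jpeg")):
            continue
        if os.path.getsize(path) < MIN_BYTES:
            continue
        # Prefix with a counter to avoid filename collisions.
        shutil.copy(path, os.path.join(DST, f"{kept:05d}_{name}"))
        kept += 1
print(f"Kept {kept} images")
```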

We now end up with 1,372 face files. After removing a LOT of smiley faces and other junk, we end up with 789 images. The quality of these images is poor, but that’s what you get for low effort. Below is an example from the remaining images in the dataset: a distorted image of Gary Busey. If this dataset does not work out, we can always grab the VGG faces dataset here. The trouble there is that each file in that dataset is not an image of a face; instead, it holds a link to an image to download. Annoying, much?
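If we did fall back on the VGG faces dataset, the download step is easy to script. Here is a hedged sketch, assuming each metadata file is a plain-text list of image URLs, one per line; the file names, paths, and timeout are my assumptions.

```python
# Download face images from a URL list, one URL per line.
import os
import requests

URL_LIST = "vgg_face_urls.txt"  # hypothetical metadata file of image links
OUT_DIR = "vgg_faces"
os.makedirs(OUT_DIR, exist_ok=True)

with open(URL_LIST) as f:
    for i, line in enumerate(f):
        url = line.strip()
        if not url:
            continue
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
        except requests.RequestException:
            continue  # dead links are common in scraped datasets
        with open(os.path.join(OUT_DIR, f"{i:06d}.jpg"), "wb") as out:
            out.write(resp.content)
```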

RESULTS

Now, what do we get when we ask a CNN to find images that contain faces and filter them out? Let’s use the CNN from my article on video analysis, “Exam Time for a Binge Watching AI”. The code runs on an AWS EC2 p2 instance with the Deep Learning AMI Ubuntu Linux — 1.5_Jun2017 (ami-96796fef). Our goal is to remove images from the final episode of The West Wing that contain faces.
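The CNN itself is described in that earlier article. As a sketch of just the filtering step, here is roughly what scoring the frames looks like in Keras, assuming a trained binary classifier saved to disk and frames already extracted to a folder (the model file, folder names, input size, and sigmoid output are all my assumptions):

```python
# Score extracted video frames with a trained face/no-face classifier
# and set aside the frames the model flags as containing a face.
import os
import shutil
import numpy as np
from keras.models import load_model
from keras.preprocessing import image

MODEL_PATH = "face_classifier.h5"   # hypothetical trained binary CNN
FRAMES_DIR = "west_wing_frames"     # hypothetical extracted frames
BLOCKED_DIR = "blocked_frames"
os.makedirs(BLOCKED_DIR, exist_ok=True)

model = load_model(MODEL_PATH)

for name in sorted(os.listdir(FRAMES_DIR)):
    path = os.path.join(FRAMES_DIR, name)
    # Resize each frame to the model's input size (224x224 assumed here).
    img = image.load_img(path, target_size=(224, 224))
    x = image.img_to_array(img)[np.newaxis] / 255.0
    p_face = float(model.predict(x)[0][0])  # sigmoid output assumed
    if p_face > 0.5:
        shutil.move(path, os.path.join(BLOCKED_DIR, name))  # filter it out
```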

Let’s see the results:

Of the 89 randomly picked video frames that were tested, 64 contained faces. My labels on the images agreed with the CNN 66.3% of the time. Of the 30 images labeled incorrectly, 17 contained faces that were not detected (false negatives), while 13 contained no faces but were labeled as having faces (false positives).
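Those counts pin down the full confusion matrix, and the derived metrics follow directly. The precision and recall below are my own back-of-envelope additions, computed only from the numbers above:

```python
# Reconstruct the confusion matrix from the reported counts.
total = 89          # frames tested
actual_faces = 64   # frames that truly contained faces
fn = 17             # faces the CNN missed (false negatives)
fp = 13             # no-face frames flagged as faces (false positives)

tp = actual_faces - fn           # 47 faces correctly flagged
tn = total - actual_faces - fp   # 12 no-face frames correctly passed

accuracy = (tp + tn) / total     # 59/89 ≈ 0.663, matching the 66.3%
precision = tp / (tp + fp)       # 47/60 ≈ 0.78
recall = tp / (tp + fn)          # 47/64 ≈ 0.73
print(accuracy, precision, recall)
```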

CONCLUSION

We could further improve the system by gathering more and better data, and by adding a similarity filter: images that resemble the ones our CNN blacklisted can then be blocked even when the CNN itself misses them. Face detection is a whole field unto itself. The goal here was really to show that content in videos can be filtered out in an automated way, using training data to teach the AI what to key in on. Porn on the internet is not going away, so the techniques used to filter content will have to stay one step ahead of the bad guys.
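That similarity filter can reuse the perceptual hashes from earlier: keep a blacklist of hashes for frames the CNN flagged, and block any new frame within a small Hamming distance of one of them. A minimal sketch, where the threshold and file paths are again my assumptions:

```python
# Block frames that are near-duplicates of anything already blacklisted,
# catching variants the CNN itself misses.
from PIL import Image
import imagehash

blacklist = [imagehash.phash(Image.open(p))
             for p in ["blocked_frames/00017.jpg",    # hypothetical
                       "blocked_frames/00042.jpg"]]   # blacklisted frames
THRESHOLD = 8  # illustrative Hamming-distance cutoff

def similarity_blocked(path):
    h = imagehash.phash(Image.open(path))
    return any(h - b <= THRESHOLD for b in blacklist)

print(similarity_blocked("new_frame.jpg"))  # hypothetical new frame
```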

If you like this post, then please recommend it, share it, or give it some love (❤). I’m also happy to hear your feedback in the comments.

Happy Coding!

-Daniel
daniel@lemay.ai ← Say hi.
Lemay.ai
1(855)LEMAY-AI
