My journey using machine learning to find porn star lookalikes

Lookalikes taken from google images

A few weeks ago I thought about writing up my experience getting into machine learning and building a side project that lets you find a porn star lookalike from a picture: faplike. At first you tell yourself that nobody will be interested in something that isn’t “How I built a $1 million MVP in 3 days” or “Scaling X to 5 million users”, but here I am. Plenty has been written about impostor syndrome, so let’s beat it.

Getting into the machine learning hype

I got caught almost one year ago when I took the famous machine learning course by Andrew Ng (I’m now taking his new course on deep learning), and after weeks of watching videos I realized that, despite the media hype, this really was something big. Once I finished it I wanted to build some of the many ideas I had. One of them became real thanks to Adam Geitgey’s machine learning series, specifically the post about face recognition.

Examples of neural networks

You have to finish your side project

Before building this idea I had a few projects that were never completed, probably for one main reason: trying to use every new technology out there. Of course there are other reasons too: you work 8 hours a day at your company in front of a computer and at some point you need to disconnect.

My first goal was to apply a machine learning algorithm and the second was to learn another language, in this case Python. So there I was again: I need pictures, let’s use Python. I need a web page, Django and React because why not? What about face recognition? The dlib library sounds interesting and you can use either its C++ examples or Python. But in the end, probably because of my previous experience at work, I focused on my main goal and nothing else.

I know Symfony and jQuery, so the web would be built with them. Want to use a bit of Python? Scrapy is the best framework for scraping, seriously, you just need a few lines of code. What about face recognition? OpenFace has everything you need: you read the Python code and tweak it, and I had already seen how to use it.

openface

Every model needs its data

In order to train your machine learning model you need data. But before anything else, what you really need is to test your idea and find out which format the data has to be in.

It’s pretty easy to collect data for days, weeks or months and then realize it can’t be used, so again, try your idea with a handful of examples first. My examples focus on images, because that is what I need in order to predict how close the face in the picture you upload is to the ones in my dataset.

As said before, Scrapy is really useful. I built a few spiders to crawl different web pages, and all you need to do is spend some time on your XPath or CSS expressions. The framework does the rest for you, including the user agent, concurrent requests, delays and other things you may need, for example a pipeline. My pipeline just downloaded the images and stored them with the actress name as the folder name.
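To make the "actress name as folder name" idea concrete, here is a minimal sketch in plain Python of how a storage path could be derived for each downloaded image. The function names `folder_name` and `image_path` are hypothetical and this is not the project's actual pipeline; in Scrapy, a helper like this could be called from an overridden `file_path` method of an images pipeline, but that wiring is left out here.

```python
import hashlib
import re
from pathlib import PurePosixPath
from urllib.parse import urlparse


def folder_name(person_name: str) -> str:
    """Turn a display name into a filesystem-safe folder name (hypothetical helper)."""
    slug = re.sub(r"[^a-z0-9]+", "-", person_name.lower()).strip("-")
    return slug or "unknown"


def image_path(person_name: str, image_url: str) -> str:
    """Build '<folder>/<hash><ext>' so every image lands under its person's folder."""
    # Keep the original extension when the URL has one, default to .jpg otherwise.
    ext = PurePosixPath(urlparse(image_url).path).suffix.lower() or ".jpg"
    # Hashing the URL gives a stable file name and avoids collisions between sites.
    digest = hashlib.sha1(image_url.encode("utf-8")).hexdigest()
    return f"{folder_name(person_name)}/{digest}{ext}"
```

Having one folder per person is convenient later on, because the folder name can double as the label when training.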

Then you run it and wait until you get your data. I only spent a bit more time getting images from Google, because it loads them after the HTML is generated. In case you wonder why I used Google to get pictures of porn actresses, it’s because it has a filter to search by face.

You can see the spiders code here.

Having “fun” training the model

The second step, once you have the data, is to train a model on it so that it can predict the output Y for your input X. In my case, that means which face in my dataset is the closest. I had around 700,000 images of something like 12,000 porn stars.
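One way to picture "which is the closest face" is a nearest-neighbour lookup. The sketch below assumes every face has already been converted into a numeric embedding vector, which is what face recognition libraries such as OpenFace produce. The post mentions a trained classifier, so this is a swapped-in simpler technique and not the project's code; `closest_identity` and the gallery layout are made up for illustration.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def closest_identity(query, gallery):
    """Return (name, distance) of the gallery embedding nearest to `query`.

    `gallery` maps a person's name to a list of embeddings (hypothetical layout).
    """
    best_name, best_dist = None, math.inf
    for name, embeddings in gallery.items():
        for emb in embeddings:
            dist = squared_distance(query, emb)
            if dist < best_dist:
                best_name, best_dist = name, dist
    return best_name, best_dist
```

A linear scan like this gets slower with every image added to the gallery, which is one reason to train a classifier or build an index once the dataset grows.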

When you start training, you soon realize that maybe the time you spent getting data wasn’t worth it. For example, some images are too small (16x16), prediction confidence decreases as the number of actresses increases, you need a better computer to save hours or to avoid crashing when it runs out of memory, etc. Regarding the last point, Python’s pickle format is definitely not the best choice for generating a classifier.

But the fun begins when you try to improve the accuracy and take a look at your pictures. The library detects all the faces in a picture and crops them. A lot of the pictures are from sex scenes, so there is more than one person in them, which means I had to manually remove the men and any other women. But that isn’t the worst part. I don’t know why, but very often nipples were detected as faces. Not to mention paintings in the background with something resembling a face, and much more.

Examples of images detected as faces

Of course, there are ways to avoid doing this manual work yourself, for example Amazon Mechanical Turk, but for a side project meant just to learn something it wasn’t worth it.

Showing your idea on the web

Now comes the easy part: I just had to build the web with the technologies I know. A few lines of PHP and JavaScript did it.

What I couldn’t resist at the beginning was trying to make it super fancy with the great animations I had in mind: moving parts of the page on image upload, fading the image from greyscale to full color, a quick carousel of random images while the prediction runs. One thing: forget about it. Go back to your goal and leave the fanciness in your mind. I did make it responsive though, which reminded me how much I hate CSS (if you’re like me, check these points).

faplike.com web

Even though this project wasn’t made to earn money, I added some videos of the predicted actress using hubtraffic, but unless you get thousands of unique visits per day you won’t make money.

You can see the web code here.

The learning part

Once your project is ready on your machine, the next step is to find a server. I was looking for cheap options, as I didn’t want to spend much on it, so I started with Hetzner, recommended by a friend, and later switched to Contabo.

You soon realize that what worked on your laptop won’t work the same on a server. I switched to the second server due to lack of RAM. At this point you start testing with more than one concurrent user, which crashes it. This can be solved with a queue system, for example, but the process already took 30 seconds, and the more people I add to the classifier the worse it gets. Another solution is simply to pay more for a cloud server such as AWS, but for a side project what I had was enough.
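The queue idea mentioned above could look roughly like the following sketch: a single worker thread takes uploads one at a time, so two concurrent users never run the memory-hungry prediction simultaneously. It was not part of the project; `start_worker`, `submit` and the `predict` callable are hypothetical names, and a real deployment would more likely use a dedicated job queue.

```python
import queue
import threading


def start_worker(predict, jobs):
    """Start one background thread that processes queued jobs sequentially."""
    def run():
        while True:
            job = jobs.get()
            if job is None:  # sentinel value: stop the worker
                jobs.task_done()
                break
            image, result_box = job
            try:
                result_box.put(predict(image))
            except Exception as exc:  # hand the error back instead of killing the worker
                result_box.put(exc)
            finally:
                jobs.task_done()

    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    return worker


def submit(jobs, image):
    """Queue an image and return a one-slot queue that will receive the result."""
    result_box = queue.Queue(maxsize=1)
    jobs.put((image, result_box))
    return result_box
```

A request handler would call `submit(jobs, image)` and then wait on `result_box.get(timeout=...)`. This keeps the server from crashing under concurrent uploads, but it does nothing about the 30 seconds each prediction takes: users simply wait in line.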

That’s why you won’t see the project running, as some of my friends did: I only kept it online for one month. You can still use the code if you want to test it with someone you know :)

linux top command

I also started using Docker with OpenFace, which comes as a Docker image. Once you’ve had to install it a dozen times you start learning the basic commands: keeping the container up, giving your container a name because the id is generated again every time you reset it, how much garbage it leaves behind, or sharing data through volumes. Certainly, I don’t have the knowledge to use Docker Compose for a microservices architecture, but I’m still glad I got an idea of how it works.

I also had to use bash many times, mostly for moving files, creating directories with the structure I wanted, or merging them. Even though I had used it at work, trying new things was refreshing. You can’t know every bash command out there, and there will always be someone on Stack Overflow who can do it in one line :)

What’s next?

As a side project, I have to say I’m satisfied with the results, but I still have some ideas left on my TODO list: adding men, keeping a history of your searches, connecting it to Facebook so you can check which of your friends look like a porn star, or maybe even find out what he’s doing in his free time… But none of these ideas will get done, as my real goal has already been achieved.

If there is one thing to take away from this post, it’s to try to finish what you started and not get frustrated by all the new technologies and crazy ideas you may have.

I’m now moving on to the “ideas list” I keep on my phone, where I add something every now and then. For example, I’m pretty interested in the concept of Generative Adversarial Networks. Another topic that upsets me is all the fake news out there, and how something related to natural language processing could help solve it. Not everything is related to machine learning, and I have some funny ideas that will probably never see the light, but who knows?


You can follow me on Twitter @biruwon, write me an email or take a look at my LinkedIn profile.