Learning London Dating Profiles
This blog post describes how I optimised a Generative Adversarial Network and a Recurrent Neural Network using data from 40,000 scraped Tinder accounts to generate fake dating profiles.
At no point in the blog post or elsewhere on the internet will I share the data that I obtained, so please don’t ask for it — I have already deleted it.
Our Public Data
Recently I have been thinking about the privacy, ownership and availability of data.
Like many, in the wake of the Cambridge Analytica scandal, I found the magnitude of data harvested from Facebook by a third party astonishing. Collectively I believe we are becoming more conscious that we, as the users, are the products for sale for online services like Facebook, and are increasingly becoming aware of the dangers of being caught in an echo chamber, or worse, a deliberate and targeted echo chamber. It was put articulately in an article by Francois Chollet, the author of the renowned deep learning library Keras:
Your [data] can be cross-correlated with that of thousands of similar people, achieving an uncanny understanding of what makes you tick — probably more predictive than what yourself could achieve through mere introspection (for instance, Facebook “likes” enable algorithms to better assess your personality that your own friends could). This data makes it possible to predict a few days in advance when you will start a new relationship (and with whom), and when you will end your current one. Or who is at risk of suicide. Or which side you will ultimately vote for in an election, even while you’re still feeling undecided.
The last sentence is particularly poignant given that this quote predates and predicts behaviour demonstrated in the Cambridge Analytica scandal. Chollet then goes on to say:
If Facebook gets to decide, over the span of many years, which news you will see (real or fake), whose political status updates you’ll see, and who will see yours, then Facebook is in effect in control of your worldview and your political beliefs. Facebook’s business lies in influencing people. That’s what the service it sells to its customers — advertisers, including political advertisers.
It is this ability to manipulate views, and profit from increasingly relevant advertisements, that explains why Facebook spends so much money on the best Machine Learning engineers to optimise its newsfeed. These services exploit us at a root level, and other corporations are able to get secondary access to our data on a mass scale. This data is not just being used analytically; it is being used algorithmically and generatively, not only to take from the digital sphere but also to put back into it. A particularly recent example is the headlines about masses of Twitter accounts used to spread propaganda from the Russian troll factories.
Today it is possible to create convincing fake data which is very often generated using neural networks. Examples of these emerging trends are deepfake pornography and Nvidia’s recent GAN efforts. Consuming this data at regular intervals may become much more normalised than it is today, and this is a worrying notion given the scope of what has already been happening globally, exemplified by the Cambridge Analytica scandal.
Prior to this project, and on a smaller scale, I created a Facebook chatbot. It has only interacted with around two hundred people to date, but I was astonished by how willingly those people parted with sensitive and identifying information about themselves, including birthdays, full names, email addresses, education histories and even signatures!
The ease with which data can be harvested, stored and shared raises a question that is crucial to the progression of this project. Naturally, releasing, sharing or selling this information would be immoral, but is it okay to release a neural network that merely models the probability distribution of the data?
Provided the model has not overfitted the data and ‘memorised’ the original dataset, we can assume that the neural network in question is merely approximating the data, and thus not divulging any truly personal details. It is for this reason that I have no ethical qualms about this project. However, to prevent any premature mob-lynching, I will not actually be releasing the model in question, just some data sampled from the model.
As part of my degree, I am exploring artificial intelligence within an artistic context, and I wanted to train some generative systems to examine one strand of this narrative of data harvesting and privacy on a social networking application. What can one guy do with a laptop over a few weeks?
Artists And Programmers Exploring Privacy
Some fantastic artists are already exploring the role of privacy in data and studying things like dating profiles. Roger Luke DuBois, speaking at a TED talk in 2016, described how, amongst other projects, he scraped 21 different online dating sites, pulling the profiles of 19 million single Americans and exploring the users’ self-identity by location.
Artist Kyle McDonald also has a history of creating projects built around privacy and data. After recording his face while using his computer over a period of time, he was surprised to note a lack of facial emotion. What of other people? To question this, he decided to test a wider audience. McDonald was subsequently raided by the American Secret Service after he installed a script that pushed faces taken in an Apple store on the Mac Photobooth application to a private website. He talks about this and other projects here.
He closes the above video with the comment: Artists are always going to be necessary for pushing boundaries and helping us understand where things break.
In contrast, plenty of programmers and artists are looking at the problem of data and privacy from a preventative stance. On my own turf at Goldsmiths, one such artist and colleague that I respect is Orange, and I have come to appreciate his stance on privacy, security and technology in general. He has recently been building a prototype that enables the highly privacy-focused operating system Tails to be discreetly and easily installed on a USB stick that can subsequently be used on most computers. This democratises the ability to use the Internet anonymously, without leaving a trace on the computer you are using, as well as providing numerous cryptographic tools to encrypt your files, emails and messaging. Tools in a similar vein may well become more relevant in our society as data becomes more readily available to third parties.
Getting The Data
To get the project off the ground, I needed to acquire image and textual data to train the neural networks on. I subsequently harvested 40,000 Tinder profiles. There are publicly documented ways to get your hands on Tinder data, so I won’t cover them in this post. All you need is a few Python modules and some fake Gmail profiles. A decent VPN might help too!
I pointed my scripts at various geolocations around London and collected the Tinder profiles from these areas.
Generating Dating Biographies
After downloading the profiles, the first thing to do was to create a class that could store the sequences of numerical tokens that the neural network learns from. The idea is that the downloaded biographies are stored in a list of strings and are converted into a larger list of fixed-length sequences of tokens.
The class is documented and is featured below. Excuse the size of the code: each function is relatively small, and hopefully the comments make it clear what each part is doing. If you find you are building a similar dataset, but using higher dimensional information like images, then implementing a generator that uses the Python yield keyword to produce batches of data points could be a good idea to prevent excessive memory consumption.
I should add that the dataset contained emojis, which I initially stripped out, but then thought it was more fun to keep them in!
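As a rough illustration of the idea (this is a minimal sketch, not the class used in the project), the conversion from biography strings to fixed-length token sequences might look something like the following. The class name, method names and the sequence length are all assumptions made for the example. Emojis are kept, since each one is simply another character in the vocabulary, and a small generator yields batches lazily.

```python
class BiographyDataset:
    """Turns a list of biography strings into fixed-length token sequences."""

    def __init__(self, biographies, sequence_length=5):
        self.sequence_length = sequence_length
        # Build the vocabulary from every unique character (emojis included).
        characters = sorted(set("".join(biographies)))
        self.char_to_token = {c: i for i, c in enumerate(characters)}
        self.token_to_char = {i: c for c, i in self.char_to_token.items()}
        self.inputs, self.targets = [], []
        for bio in biographies:
            tokens = self.encode(bio)
            # Slide a window over the bio: each window predicts the next token.
            for start in range(len(tokens) - sequence_length):
                self.inputs.append(tokens[start:start + sequence_length])
                self.targets.append(tokens[start + sequence_length])

    @property
    def vocabulary_size(self):
        return len(self.char_to_token)

    def encode(self, text):
        return [self.char_to_token[c] for c in text]

    def decode(self, tokens):
        return "".join(self.token_to_char[t] for t in tokens)

    def batches(self, batch_size):
        """Yield (inputs, targets) batches lazily instead of one large array."""
        for start in range(0, len(self.inputs), batch_size):
            end = start + batch_size
            yield self.inputs[start:end], self.targets[start:end]


dataset = BiographyDataset(["I like dogs 🐶", "I like music"], sequence_length=5)
first_inputs, first_targets = next(dataset.batches(batch_size=4))
```

Each input window is paired with the single character that follows it, which matches the many-to-one arrangement of the network described further on.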
Once the textual data had been munged, it was ready to be fed into the neural network. I organised the neural network into a class that exposes the two principal operations in deep learning: training and inference. The model can also be saved and loaded. Basic usage of the neural network with the dataset class above would look something like this:
The model is defined as below. Again, apologies for the size; it’s mostly documentation strings, but each function should be reasonably easy to break down. I defined the neural network in the build model method of the class. It is simply a multi-layer recurrent neural network arranged in a many-to-one manner, so that it takes a sequence of embedding vectors and outputs a prediction vector whose length is the number of unique characters in the dataset. The recurrent cell is an LSTM, which is widely known for its ability to mitigate vanishing gradients, amongst other benefits, and there are many great resources where you can read more about that.
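To make the many-to-one arrangement concrete, here is a deliberately tiny, untrained forward pass written in plain Python. It is only a sketch under several assumptions: it uses a single LSTM layer rather than several, it omits bias terms, and the class name and layer sizes are invented for the example. A real model would be built and trained with a deep learning framework.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


class TinyCharLSTM:
    """Many-to-one: a sequence of tokens in, one score per character out."""

    def __init__(self, vocabulary_size, embedding_size=8, hidden_size=16, seed=0):
        rng = random.Random(seed)
        self.hidden_size = hidden_size
        self.embeddings = random_matrix(vocabulary_size, embedding_size, rng)
        # One weight matrix per gate, applied to [input ; previous hidden state].
        self.gates = {
            name: random_matrix(hidden_size, embedding_size + hidden_size, rng)
            for name in ("input", "forget", "output", "candidate")
        }
        self.projection = random_matrix(vocabulary_size, hidden_size, rng)

    def step(self, token, hidden, cell):
        joined = self.embeddings[token] + hidden  # list concatenation
        i = [sigmoid(v) for v in matvec(self.gates["input"], joined)]
        f = [sigmoid(v) for v in matvec(self.gates["forget"], joined)]
        o = [sigmoid(v) for v in matvec(self.gates["output"], joined)]
        g = [math.tanh(v) for v in matvec(self.gates["candidate"], joined)]
        # The additive cell update is what helps gradients survive long sequences.
        cell = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, cell, i, g)]
        hidden = [ok * math.tanh(ck) for ok, ck in zip(o, cell)]
        return hidden, cell

    def forward(self, tokens):
        hidden = [0.0] * self.hidden_size
        cell = [0.0] * self.hidden_size
        for token in tokens:
            hidden, cell = self.step(token, hidden, cell)
        # Only the final hidden state is projected: many-to-one.
        return matvec(self.projection, hidden)


model = TinyCharLSTM(vocabulary_size=30)
logits = model.forward([3, 7, 1, 12])
```

The output has one score per character in the vocabulary; passing those scores through a softmax gives the probability distribution that is sampled from when generating text.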
Training the model on the data produced comme ci comme ça results. When sampling from the model, I tried different temperatures. A higher temperature yields a softer probability distribution over the possible characters, and encourages more diversity in the generated material. At low temperature settings, the model didn’t include emojis, which is likely because they were far less frequent than the average alphabetical character. I tuned the temperature value so that the majority of the generated text featured emojis, whilst minimising the apparent randomness of the text. After settling on a temperature of 0.5, the model generated somewhat entertaining biographies. They almost make sense; here are a few of my favourites:
1. Looking for someone to see what I always like:
Looking for someone to the world to see what I always like.
2. Likes to chill, photography and dogs: what’s not to like?
I like the chill and photography and dogs of a lot of studying but don't be a bit of.
3. Looking for someone to make my someone.
Looking for someone to make my someone to see what you're a good.
4. Possibly incestual AI.
I like me the play music, conversations, travelling, see the sister with.
The results aren’t coherent: they are often misspelled and nearly always grammatically incorrect.
However, some of them are ‘almost’ sentences, and the incoherence is to be expected if you consider the dataset; the neural network’s highly varied and random output likely reflects the bios written by thousands of different authors. Popular charRNNs used in the wild have often produced more consistent results, but their training material has almost always come from a single author or a small number of authors, or has followed a strict style, as in the case of the charRNN trained on the Linux kernel source, to which many developers contribute.
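The effect of the sampling temperature discussed earlier can be sketched in a few lines. This is a generic illustration of temperature sampling rather than the project's own code; the function names and the toy probabilities are assumptions, and every probability is assumed to be greater than zero.

```python
import math
import random


def apply_temperature(probabilities, temperature):
    """Rescale a probability distribution; higher temperature flattens it."""
    scaled = [math.log(p) / temperature for p in probabilities]
    peak = max(scaled)  # subtract the max for numerical stability
    exponentials = [math.exp(s - peak) for s in scaled]
    total = sum(exponentials)
    return [e / total for e in exponentials]


def sample_token(probabilities, temperature, rng=random):
    """Draw one token index from the temperature-adjusted distribution."""
    adjusted = apply_temperature(probabilities, temperature)
    return rng.choices(range(len(adjusted)), weights=adjusted, k=1)[0]


# A rare "emoji" token (index 2) next to two common letters.
predicted = [0.70, 0.28, 0.02]
cold = apply_temperature(predicted, 0.5)
warm = apply_temperature(predicted, 1.5)
token = sample_token(predicted, 0.5, rng=random.Random(0))
```

At a temperature of 0.5 the rare token becomes even less likely than the model predicted, while at 1.5 it becomes more likely, which is consistent with emojis disappearing at low temperatures.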
Generative Profile Pictures
Another objective of this project was to generate images for the profile pictures of the fake Tinder dating profiles. Early experiments in training a Generative Adversarial Network (GAN) on unedited photos showed that the source images varied too much in terms of their backgrounds, orientations, resolutions and dimensions. Because of this, the GANs I used didn’t produce particularly recognisable results.
To make the image generation problem easier I created some code to crop the faces from each photo. Using cropped faces enabled the learning process for the GAN to be massively simplified, as it increased the level of correlation across the data points in the dataset.
In pictures, the code below would take a photo like this:
And produce cropped faces like the ones below, storing them in a new dataset for the GAN to learn from.
The code makes use of dlib, which has an excellent face detector with a low false positive rate. Of course, constructing a dataset of cropped faces to improve the reconstruction of the GAN discards a massive amount of aesthetic information present previously in the original images downloaded from Tinder. It was a conscious decision to abandon this information to produce pictures that somewhat resembled dating profiles. With more time, it would be fantastic to return to this project without needing to crop the source photos, improving the reconstruction of a dataset of images with more varied background and better framing of individual photos.
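A minimal sketch of this kind of face cropping is shown below. It is not the original script: the function names, the margin value and the choice of a padded square crop are assumptions. The dlib calls used are its frontal face detector and image loader, and dlib is imported inside the function so that the box arithmetic can be read and run on its own.

```python
def square_crop_box(face, image_width, image_height, margin=0.25):
    """Expand a (left, top, right, bottom) face box into a padded square.

    The result is clamped to the image bounds, so it may be slightly
    non-square for faces near an edge.
    """
    left, top, right, bottom = face
    side = max(right - left, bottom - top)
    half = int(side * (1 + margin) / 2)
    centre_x, centre_y = (left + right) // 2, (top + bottom) // 2
    return (
        max(0, centre_x - half),
        max(0, centre_y - half),
        min(image_width, centre_x + half),
        min(image_height, centre_y + half),
    )


def crop_faces(image_path, margin=0.25):
    """Detect faces with dlib and return one cropped array per face."""
    import dlib  # imported lazily so the helper above works without dlib

    image = dlib.load_rgb_image(image_path)  # height x width x 3 array
    detector = dlib.get_frontal_face_detector()
    height, width = image.shape[:2]
    crops = []
    for rect in detector(image, 1):  # 1 = upsample once to find smaller faces
        face = (rect.left(), rect.top(), rect.right(), rect.bottom())
        left, top, right, bottom = square_crop_box(face, width, height, margin)
        crops.append(image[top:bottom, left:right])
    return crops
```

Each crop would then be resized to a common resolution before being added to the training set, since the GAN expects images of identical dimensions.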
Once a dataset of faces was constructed, a neural network could be trained to produce new images of faces. The current state of the art is the Progressive Growing of GANs (PGGAN) from Nvidia that blew everyone away 177 days ago (as of the time of writing this blog post). Given more time it would have been great to have a crack at implementing this model, as I find the idea of dynamically growing the depth and size of the model as the optimisation progresses fascinating. However, there wasn’t enough time for this, nor enough compute power available to hack around frequently with a model of this scale.
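For readers unfamiliar with the idea, progressive growing can be summarised in two small pieces: the networks are trained at a low resolution that doubles stage by stage, and each newly added layer is faded in by blending its output with the upsampled output of the previous stage. The sketch below only illustrates that schedule and blend on plain lists; the function names and the final resolution of 256 are assumptions, and no actual network is involved.

```python
def resolution_schedule(start=4, final=256):
    """Resolutions visited as the networks grow, doubling each stage."""
    resolutions = []
    size = start
    while size <= final:
        resolutions.append(size)
        size *= 2
    return resolutions


def fade_in(upsampled_old, new_output, alpha):
    """Blend the new layer's output with the upsampled older output.

    alpha rises from 0 to 1 over the course of a stage, so the new
    layer is introduced gradually rather than all at once.
    """
    return [
        (1.0 - alpha) * old + alpha * new
        for old, new in zip(upsampled_old, new_output)
    ]


stages = resolution_schedule()
blended = fade_in([0.0, 1.0], [1.0, 1.0], alpha=0.25)
```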
The availability of computing power was an issue in this project, due to the scale of the neural networks that I was considering using. I ended up signing up for a free Google Compute cloud account. On signup, you receive $300 worth of your local currency to spend on whatever services you want. Handy tip — you can get the same signup bonus with every new Gmail account you make with them! ;)
When increasing your maximum allocation of GPUs from zero to greater than zero, you will, however, need to pay Google a small amount of money to show you are serious about being able to front the bills if you run out of free credits. Once this is done, you will need to create a new instance on the cloud. If you want a tutorial on that, this might help. I wrote a small script to run when you first SSH into your new instance; it installs CUDA 9, cuDNN 7.0, TensorFlow and PyTorch without any hassle. Hopefully, this is useful (but quite possibly out of date by the time you read this)!
Once I had sorted the issue of available compute, I elected to download an implementation of PGGAN to use with the newly created Tinder faces dataset. I chose a PyTorch implementation of PGGAN, although since I started using it, the TensorFlow version from the authors of the original paper has officially been released and should probably be your starting point if you are looking to do similar work. The implementation I used was incredibly useful, so kudos to the author, but there were a few issues with saving the model and later loading it, which made it challenging to run inference at the right resolutions. As a warning, training these kinds of models to resolutions of 256, 512 or higher can take weeks on a single GPU, so check that the model saves and can be recovered correctly before devoting such time to training it!
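One cheap way to act on that warning is to run a save-and-reload round trip before starting a long training job. The sketch below uses pickle and a plain dictionary as a stand-in for real model state; the function names and checkpoint fields are assumptions, and with PyTorch the same check would be written around its own saving and loading functions. Storing the current resolution alongside the weights is included because a progressively grown model cannot be rebuilt correctly without it.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, weights, resolution, step):
    """Write everything needed to resume, including the current resolution."""
    checkpoint = {"weights": weights, "resolution": resolution, "step": step}
    with open(path, "wb") as handle:
        pickle.dump(checkpoint, handle)


def load_checkpoint(path):
    with open(path, "rb") as handle:
        return pickle.load(handle)


def checkpoint_round_trips(weights, resolution, step):
    """Save to a temporary file, reload, and compare with the original."""
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "checkpoint.pkl")
        save_checkpoint(path, weights, resolution, step)
        restored = load_checkpoint(path)
    return (
        restored["weights"] == weights
        and restored["resolution"] == resolution
        and restored["step"] == step
    )


# Stand-in weights; a real run would store the generator and discriminator state.
ok = checkpoint_round_trips({"generator": [0.1, -0.2]}, resolution=64, step=1000)
```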
After a lot of lengthy training, spanning many weeks, I finally got a model making some relatively decent generative images. For instructions on how to use the models, see the respective GitHub pages.
Creating The Website
Because of the dataset, it seemed appropriate to present the GAN and the charRNN’s generative material in a card-based interface similar to Tinder. I made use of the swing.js library and webpack to bundle everything into a small and simple website that displayed a stack of dating profiles as cards.
It is hosted here, and some of the examples look like this:
Concluding Remarks And Future Work
The image generation was far from perfect: there were some apparent blemishes, and it would have been great to have the compute power to train a model that produced higher resolution images like the famed Nvidia ones. It would have been better to collect more data and to attempt not to crop it; collecting more data might have enabled the GAN to produce fake profile pictures more stably than it initially did. Of course, this project was a highly compute-intensive one; creating the minimum viable product of some fake profile photos still took weeks!
The text generation for the fake biographies was much easier to implement and comprehend. Looking at the outputs, which are highly incoherent (while capturing the textual essence of the nonsense often present in real biographies), it is questionable whether the RNN performed any better than a more straightforward approach such as a Markov model would have. It would indeed be interesting to compare the two methods.
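For comparison, a character-level Markov model of the kind mentioned above fits in a couple of dozen lines. This is a sketch of a possible baseline, not something that was run for this project; the order of 3, the function names and the use of ^ and $ as start and end markers (assumed not to matter if they occur in real bios) are all assumptions, and the two example bios are made up.

```python
import random
from collections import defaultdict


def build_markov_model(biographies, order=3):
    """Map each `order`-character context to the characters that followed it."""
    transitions = defaultdict(list)
    for bio in biographies:
        padded = "^" * order + bio + "$"  # ^ and $ are assumed start/end markers
        for i in range(len(padded) - order):
            transitions[padded[i:i + order]].append(padded[i + order])
    return transitions


def generate_biography(transitions, order=3, max_length=200, rng=random):
    """Walk the chain from the start context until the end marker is drawn."""
    context, output = "^" * order, []
    while len(output) < max_length:
        next_char = rng.choice(transitions[context])
        if next_char == "$":
            break
        output.append(next_char)
        context = context[1:] + next_char
    return "".join(output)


bios = ["I like dogs and music", "I like travelling and dogs"]
model = build_markov_model(bios)
sample = generate_biography(model, rng=random.Random(0))
```

Because duplicates are kept in each list of followers, choosing uniformly from the list samples characters in proportion to how often they followed that context in the training bios.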
It would also be interesting to further tie in the resultant ‘gander🔥’ website with extensions on the themes of data, privacy and curatorial control mentioned in the introduction of this blog post. Earlier, on the topic of curatorial control, we looked at how algorithms determine what we see, which in turn can impact our outlook on life. The user of the web app could swipe right or left to either like or dislike the fake profile. The swiping action would provide a fitness signal for the current dating profile, and we could optimise the generated profiles for each user with a simple metaheuristic like a Genetic Algorithm. The Genetic Algorithm would tune the noise fed to the generator of the GAN and the starting state for the RNN, to produce more relevant content for the user.
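A single generation of such a Genetic Algorithm could be sketched as follows. Everything here is hypothetical: the function name, the population of eight-dimensional latent vectors, and the use of a binary like or dislike as the fitness signal are assumptions for illustration. In the proposed system each vector would be fed to the GAN's generator to render a profile picture, and the same scheme could be applied to the RNN's starting state.

```python
import random


def evolve_latents(population, fitness, rng, mutation_scale=0.1, elite=2):
    """One generation: keep the best-liked vectors, breed and mutate the rest."""
    ranked = [
        vector
        for _, vector in sorted(
            zip(fitness, population), key=lambda pair: pair[0], reverse=True
        )
    ]
    parents = ranked[:max(elite, len(ranked) // 2)]
    children = [list(v) for v in ranked[:elite]]  # elitism: carry over unchanged
    while len(children) < len(population):
        mother, father = rng.choice(parents), rng.choice(parents)
        # Uniform crossover followed by Gaussian mutation.
        child = [
            rng.choice(pair) + rng.gauss(0.0, mutation_scale)
            for pair in zip(mother, father)
        ]
        children.append(child)
    return children


rng = random.Random(0)
latents = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(6)]
swipes = [1, 0, 0, 1, 0, 1]  # 1 = swiped right (like), 0 = swiped left
next_generation = evolve_latents(latents, swipes, rng)
```

The next stack of cards would be generated from next_generation, and the loop repeats with the user's new swipes as the fitness values.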