Crowdsourced OKCupid Crawling

Toward Automating OKCupid with Question/Answer Data

Ryan MacArthur
Feb 4, 2014

Wired recently published an article about a math wizard who ‘hacked’ OKCupid to find true love. Cutting through all the fluff in the article, it became apparent the most important data on OKCupid was the questions. If you want to automate any part of OKCupid, you need the question and answer data across a large number of users.

When I brainstorm ideas with my friends, one perennial research idea keeps coming up: “Let’s automate OKCupid”. In light of the recent Wired article, we decided to quickly whip up an OKCupid crawler, implemented in JavaScript (by Mohamed Mansour) and distributed as a Chrome extension. The goal of this bot is to collect all OKCupid questions and answers and store them on our Google App Engine NDB backend. This will enable us to do some pretty cool and, dare I say, useful analytics, since we haven’t seen anything recently from OKTrends.

Most previous attempts at crawling OKCupid, including the one in the Wired article, were stunted by running only a few bots locally. The so-called math genius was quickly banned by OKCupid and had to simulate user activity to evade IP/username bans. Instead, we dreamed of a crowdsourced model where OKCupid users would opt in to helping us collect data. The benefit to users who run our crawler is that their profile will browse thousands of other profiles, netting them visibility without doing any work.

The current implementation is available on GitHub here.

One goal of this bot is to eventually enable automatic messaging to users with intelligent and enticing ‘openers’ based on questions the bot has collected. The openers are seeded with predefined content, with a tone leaning toward each possible answer.

The logic behind this is that it’s hard to take a user’s profile and automatically create a dynamic, enticing message. As an example, try parsing three different profile snippets:

“I put Cheese on everything”

“Things I am allergic to: Ryan, Cheese”

“I once heard a band called Cheese”

… and automatically generating a message to each user that references “Cheese”.

My NLP/ML/IR/Data Mining fu is weak, but creating an accurate, intelligent, witty, and, most importantly, human response dynamically is hard.

An easier approach would be to crawl all user questions and answers, rank them based on the OKCupid-provided importance, and then find matches for the person using the bot:

“Regardless of whether or not you smoke marijuana, do you think it should be legalized for adults?”

Her answer: Yes

Your Answer: Yes
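To make the matching step concrete, here is a minimal Python sketch of what it could look like. Everything in it is an assumption for illustration: the `Answer` record, its field names, the integer `importance` scale, and the `shared_answers` helper are hypothetical and are not taken from the actual crawler or backend.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Answer:
    # Hypothetical record shape; the real crawler may store this differently.
    question: str
    choice: str
    importance: int  # assumed scale: a higher number means the user cares more


def shared_answers(mine, theirs):
    """Return the target's answers that match the bot user's answers,
    ordered so the questions most important to the target come first."""
    mine_by_question = {a.question: a for a in mine}
    matches = []
    for their in theirs:
        own = mine_by_question.get(their.question)
        # Keep only questions both users answered with the same choice.
        if own is not None and own.choice == their.choice:
            matches.append(their)
    return sorted(matches, key=lambda a: a.importance, reverse=True)
```

Sorting by the target’s importance rather than the bot user’s is a design choice in this sketch: the opener is meant to catch her attention, so the questions she cares about most seem like the best candidates.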

This data is much easier to parse and requires no interpretation. Users have to answer each question using a predefined answer (no fill-in-the-blanks), so we know a priori all possible answers to the OKCupid question set.

Armed with this data, we can build a list of ‘ice breakers’, such as: “Hey {target}, I also agree that marijuana should be legalized for adults. Think of the tax incentives!”
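Since every question has a fixed set of answers, the ice breakers can live in a simple lookup table keyed by question and answer. The Python sketch below shows one way that might work; the `OPENERS` table and the `build_opener` function are hypothetical names made up for this example, not part of the extension.

```python
LEGALIZE_QUESTION = (
    "Regardless of whether or not you smoke marijuana, "
    "do you think it should be legalized for adults?"
)

# Hypothetical table: one predefined opener per (question, shared answer) pair.
OPENERS = {
    (LEGALIZE_QUESTION, "Yes"): (
        "Hey {target}, I also agree that marijuana should be legalized "
        "for adults. Think of the tax incentives!"
    ),
}


def build_opener(target, question, answer, openers=OPENERS):
    """Fill in the opener for a shared answer, or return None if none is defined."""
    template = openers.get((question, answer))
    if template is None:
        return None
    # Only the {target} placeholder is substituted; the rest is fixed text.
    return template.format(target=target)
```

Calling `build_opener("Alex", LEGALIZE_QUESTION, "Yes")` fills in the name, while a pair with no predefined opener returns `None`, so the bot can skip that question instead of sending something generic.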

The goal is to send quick one-liners that pass the ‘sniff test’ and entice the target to, at a minimum, browse to the user’s profile and, at best, respond to the message and open up a dialogue. This is where you as the user come in, take over the conversation, and find your true love.

Install the Chrome extension now! Here.

If you are interested in this dataset, feel free to mail me: ryan.macarthur / gmail
