Stalkers or just hopeless romantics? Here’s what I found.
Analysis of the infamous Craigslist ad section dedicated to lost love and strangers hoping to find their missed connection.

Craigslist’s missed connections section provides some of the internet’s most comic attempts at finding love. Most of us use Craigslist to find some old furniture (hoping it’s an antique of some sort?!) or perhaps an apartment or a car, but there are those optimists who are looking for love. Craigslist has numerous personals sections, with posts ranging from individuals seeking a simple platonic relationship to posts with some weird … let’s just leave it at stuff! Over the years it has been a source of several viral stories, ranging from sweet, heartwarming love stories to prostitution scandals.

Craigslist started as an informal, simple mailing list for free classified ads in San Francisco and soon became popular for its personals section. Although I have never posted, I have often used it as a source of quick humor. There is complete anonymity and no judgement.
The idea hit me over some drinks with friends. As we sat there talking about dating and the surprising variety of dating apps, the conversation turned to Craigslist’s missed connections. I asked myself: who is still posting on this site, and why? I quickly realized that I could potentially answer these questions using natural language processing tools.
I was able to collect around 8,000 posts from 10 different US cities over a 10-day period. I picked 5 cities from each coast, as shown in the map below:

The circles in the map above represent the population of each city, and the graph below shows the total number of posts collected during the 10-day period:
I was then curious about the timing of these posts. Are people posting from work? From home?
I had a strong feeling that most of them were being posted later in the day, when people are in the comfort of their beds, contemplating their lives, especially the thing that’s missing from them. Well... I was wrong. To my surprise, the distribution shows that users are posting throughout the day (even when they are at work?!).
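For the curious, here is a minimal sketch of how posts could be bucketed by hour of the day. It assumes each scraped post comes with an ISO-formatted timestamp; the function name `posts_per_hour` and the sample values are made up for illustration and are not the actual data.

```python
from collections import Counter
from datetime import datetime


def posts_per_hour(timestamps):
    """Return a 24-item list with the number of posts for each hour (0-23)."""
    hours = Counter(datetime.fromisoformat(ts).hour for ts in timestamps)
    # Include hours with zero posts so the distribution has no gaps.
    return [hours.get(hour, 0) for hour in range(24)]


# Hypothetical timestamps, only to show the shape of the input.
sample = ["2017-03-01T09:15:00", "2017-03-01T09:47:00", "2017-03-01T23:05:00"]
counts = posts_per_hour(sample)
```

Plotting `counts` as a bar chart gives the kind of hourly distribution described above.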
Next I looked at sexual preference and the distribution of posts based on whether they were tagged “[m4m]”, “[w4w]”, “[m4w]” or “[w4m]”. Which cities do you expect to have an equal gay-to-straight ratio?
What’s next?
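Pulling the tag out of a post can be sketched with a small regular expression. This assumes the tag appears somewhere in the post title; `parse_tag` and the example titles are hypothetical and only illustrate the idea.

```python
import re

# Matches m4m, m4w, w4m and w4w, with or without surrounding brackets.
TAG_PATTERN = re.compile(r"\b([mw])4([mw])\b", re.IGNORECASE)


def parse_tag(title):
    """Return poster/target info for a title, or None if no tag is found."""
    match = TAG_PATTERN.search(title)
    if match is None:
        return None
    poster, target = match.group(1).lower(), match.group(2).lower()
    return {
        "tag": f"{poster}4{target}",
        "poster": poster,
        "target": target,
        "same_sex": poster == target,
    }


# Hypothetical titles, not taken from the scraped posts.
print(parse_tag("Blue scarf on the L train - m4w"))
print(parse_tag("Saw you at the gym [W4W]"))
```

Counting the `same_sex` flag per city is then enough to get a gay-to-straight ratio for each one.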
To my surprise, most of the posts come from men and women in their 30s. Are 30-year-olds paying more attention to their surroundings 🤔? Another interesting thing I noticed was a little spike about every five years.
Is there a difference in writing between males and females, especially in these posts? To analyze this, I first compared the word count (the total number of words used in each post) between males and females in different cities.
In all the cities, females generally tend to write more than males. I kind of expected that, but how about age? And what about other measurements, such as the Flesch readability index?
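As a rough illustration of these two measurements, the sketch below computes a word count and the Flesch Reading Ease score, 206.835 − 1.015 × (words / sentences) − 84.6 × (syllables / words). The syllable counter is a crude vowel-group heuristic that I am assuming for the sake of the example, so the scores are approximate; a dedicated readability library would give better estimates.

```python
import re


def count_syllables(word):
    """Rough heuristic: each run of consecutive vowels counts as one syllable."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def word_count(text):
    """Total number of words in a post."""
    return len(re.findall(r"[A-Za-z']+", text))


def flesch_reading_ease(text):
    """Flesch Reading Ease; higher scores mean easier text."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return None  # The score is undefined for empty text.
    syllables = sum(count_syllables(w) for w in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )


# Hypothetical post text, only for illustration.
post = "You smiled at me. I smiled back!"
print(word_count(post), flesch_reading_ease(post))
```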
In the heat map above, if you look at the bottom two rows, you can see that none of these metrics has a strong correlation with sex or age. This was a bit of a disappointment, because I wanted to use them as features for building a model that predicts a user’s gender and age, since some of the posts lack this information.
I then went on to use a count vectorizer, which counts how often each word appears in a document and so reveals the most frequently used words. I also used tf-idf, which scores how important a word is to a document relative to the rest of the collection.
The three different buckets below show the words most frequently used by straight males, straight females, and individuals targeting their own sex (can you guess which one is which?):
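To make the two ideas concrete, here is a small standard-library sketch. It uses the textbook definition tf × log(N / df) with no smoothing or normalization, so the numbers will differ from what a library vectorizer produces; the function names and sample documents are assumptions for illustration only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def term_counts(docs):
    """Count vectorizer idea: one word-frequency table per document."""
    return [Counter(tokenize(doc)) for doc in docs]


def tf_idf(docs):
    """Textbook tf-idf: term frequency times log(N / document frequency)."""
    counts = term_counts(docs)
    n_docs = len(docs)
    doc_freq = Counter()
    for doc_counts in counts:
        doc_freq.update(doc_counts.keys())  # each word counted once per doc
    scores = []
    for doc_counts in counts:
        total = sum(doc_counts.values())
        scores.append({
            word: (n / total) * math.log(n_docs / doc_freq[word])
            for word, n in doc_counts.items()
        })
    return scores


# Hypothetical posts, not from the scraped data.
docs = ["you wore a red hat on the train", "you smiled at me on the train"]
print(term_counts(docs)[0].most_common(3))
print(tf_idf(docs)[0])
```

Words that appear in every document (like "you" here) get a score of zero, which is exactly why tf-idf is handy for surfacing the words that set one group of posts apart from another.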

The top left is from straight males, the top right is from straight females, and the bottom one is from gay individuals.
Then I went on to look at the similarity between posts by combining the features extracted from the count vectorizer and tf-idf and calculating the cosine similarity between them. I did this to compare the different cities:
Basically, the dendrogram above shows the similarity between the 10 cities based on the words extracted from the count vectorizer and tf-idf. Below you can see the top words that occur in New York or Honolulu but not in both. This gives you an idea of what sort of features should be extracted in order to classify users from these cities.
Even without the labels on top, you can easily guess which of these graphs shows words collected from New York and which from Honolulu (Hawaii).
I used the same method, this time with a heat map, to compare writing styles between the different genders in all 10 cities.
As you can see, males in all 10 cities are talking about similar stuff and have higher cosine similarity scores than females (also keep in mind that 30% of the males are writing to other males).
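Finding the words that show up in one city but not the other boils down to filtering two word-count tables against each other. The sketch below assumes each city's posts have already been reduced to a `Counter` of word frequencies; `exclusive_top_words` and the counts are invented for illustration.

```python
from collections import Counter


def exclusive_top_words(counts_a, counts_b, k=10):
    """Return the top-k words unique to each of the two word-count tables."""
    only_a = Counter({w: n for w, n in counts_a.items() if w not in counts_b})
    only_b = Counter({w: n for w, n in counts_b.items() if w not in counts_a})
    top_a = [word for word, _ in only_a.most_common(k)]
    top_b = [word for word, _ in only_b.most_common(k)]
    return top_a, top_b


# Hypothetical counts, not the real New York or Honolulu data.
city_a = Counter({"subway": 5, "train": 3, "coffee": 2})
city_b = Counter({"beach": 4, "surf": 2, "coffee": 1})
print(exclusive_top_words(city_a, city_b))
```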
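For reference, cosine similarity between two groups of posts can be sketched as below. It assumes each group has been reduced to a word-to-weight dictionary (raw counts or tf-idf scores); the function and the sample vectors are illustrative, not the exact pipeline behind the plots.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse word-weight vectors."""
    shared = set(vec_a) & set(vec_b)
    dot = sum(vec_a[w] * vec_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat an empty vector as having no similarity.
    return dot / (norm_a * norm_b)


# Hypothetical word weights for two groups of posts.
group_1 = {"train": 2, "smile": 1, "coffee": 1}
group_2 = {"train": 1, "smile": 2, "beach": 1}
print(cosine_similarity(group_1, group_2))
```

A score close to 1 means two groups use much the same vocabulary in similar proportions, while a score near 0 means they share almost no words.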
To be continued...