The “tiny” field guide to dating - A Data Science Tutorial
Before I get into it, spoiler alert! The same way reading about martial arts doesn’t make you a black belt, reading this article won’t land you dates directly (if it does, do buy me a pizza). What it might do is improve your chances and, hopefully, give you some insight into how to extract information from data.
Dating has always been one of the most interesting and hard to understand parts of life for almost anyone. But can we try to identify some trends which could be followed to potentially get you a date? I found a dataset that will allow us to explore this question a little further. So let us begin.
There will be lots of figures so go ahead and look at them.
A little introduction to the dataset is in order before we use it. The data comes from a speed dating experiment conducted by Columbia University. Experimental dating events held from 2002 to 2004 were observed, and at the end of every speed date the participants were asked if they would like to see their date again. At different points in the process, data was gathered on demographics, beliefs, and even how participants rated their own attributes.
So let’s dive into it. Follow along in this notebook
We first have a look at the data and find that there are 195 attributes and more than 8,000 rows.
We also find out that there are a lot of missing values. Rather than try to fill them in, we drop every attribute that is missing in more than 4,000 rows, since columns that sparse might skew our analysis and give us a headache later on.
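The dropping step can be sketched roughly as below. This is a minimal illustration on a made-up table, not the notebook's code: the column names are hypothetical and the cutoff is scaled down from 4,000 to 2 so that it makes sense for five rows.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the speed dating table; column names and values are made up.
df = pd.DataFrame({
    "age": [24, 27, np.nan, 30, 22],
    "income": [np.nan, np.nan, np.nan, 52000.0, np.nan],
    "match": [0, 1, 0, 0, 1],
})

# The article uses 4,000 for its 8,000+ rows; 2 is a scaled-down assumption.
MAX_MISSING = 2

# Number of missing values in each column.
missing_per_column = df.isnull().sum()

# Keep only the columns whose missing count does not exceed the cutoff.
df_clean = df.loc[:, missing_per_column <= MAX_MISSING]

print(missing_per_column.to_dict())
print(list(df_clean.columns))
```

Here `income` is missing in four of the five rows, so it is dropped, while `age` and `match` survive.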
That leaves us with 130 attributes, which is still a lot. Now we keep only the people who actually did get a second date, because at the end of the day, that is the aim of our analysis.
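A small sketch of that selection step, assuming the outcome is stored in a 0/1 column. The name `match` and the toy values are assumptions for illustration, not taken from the notebook.

```python
import pandas as pd

# Toy data; "iid" and "match" (1 = got a second date) are assumed column names.
df = pd.DataFrame({
    "iid": [1, 2, 3, 4, 5],
    "match": [0, 1, 0, 0, 0],
})

# Boolean mask: keep only the rows where a match happened.
matched = df[df["match"] == 1]

# The mean of a 0/1 column is the fraction of ones, i.e. the match rate.
match_share = df["match"].mean()

print(len(matched), f"{match_share:.0%}")
```

The same `matched` frame is what the later steps would work on.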
We find out that fewer than 21% of the people actually got a second date. But are we really surprised? (xD) Let’s dig in a little more. How about we take some of the attributes and try to extract a little more information from them?
Let us first take ethnicity as an attribute. We use a function called crosstab from the pandas library. It builds a frequency table: for every combination of unique values in the columns we pass in, it counts how many rows have that combination.
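To make the crosstab idea concrete, here is a tiny example. The column names `samerace` and `match` (both coded 1 = yes, 0 = no) are assumptions, and the six rows are invented.

```python
import pandas as pd

# Toy data; "samerace" and "match" are assumed column names (1 = yes, 0 = no).
df = pd.DataFrame({
    "samerace": [0, 0, 1, 1, 0, 0],
    "match":    [1, 0, 1, 0, 1, 0],
})

# Rows of the table are samerace values, columns are match values,
# and each cell counts the rows with that combination.
table = pd.crosstab(df["samerace"], df["match"])

print(table)
```

Reading down the `match == 1` column of `table` tells you how many matches came from same-ethnicity pairs and how many did not.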
We find out that, among the people who got a second date, more were paired with someone of a different ethnicity than with someone of the same ethnicity. It should be noted that this data could be skewed based on the collection method.
How about we consider age? We now plot age against the number of people who actually got a second date.
We find out that the majority are between 21 and 30 years of age. If you are over 30, your chances are relatively smaller, but the very fact that they are not zero shows that you still have a chance. Go for it!
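The numbers behind such a plot are just a count per age. A rough sketch, assuming a matched subset with an `age` column (the name and the values here are made up):

```python
import pandas as pd

# Toy matched subset; "age" is an assumed column name, values are invented.
matched = pd.DataFrame({"age": [22, 25, 25, 28, 31, 25, 28]})

# Count how many matched people there are at each age, ordered by age.
age_counts = matched["age"].value_counts().sort_index()

print(age_counts)
```

If matplotlib is installed, `age_counts.plot(kind="bar")` turns these counts into a bar chart.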
Now let us consider jobs.
Sadly for the engineers and the math nerds out there, you might have a little harder time. But we all knew that, didn’t we? If you are doing business, economics or finance, or even physics and chemistry (hats off), you seem to fare better.
Now just for fun let us consider an attribute that tries to identify why the person registered for the speed date in the first place.
From this, we find out that among the people who did get a match, a majority of them were out to either meet new people or they just felt that it would be a fun night out. My tip? Go out and have fun. Even if you did not get a date at least you had fun.
Those looking for a serious relationship don’t seem to like speed dates. I wonder why?
Now, how about deciding what to do on your first date? To do this, we take both genders and find the activities that overlap between them.
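One way to find such an overlap is sketched below. It assumes each activity is a column holding an interest score and that gender is coded 0/1. Those details, the column names, and the "top 3 per gender" rule are all assumptions made for illustration and may differ from what the notebook does.

```python
import pandas as pd

# Toy matched data; the gender coding and activity columns are assumptions,
# and the interest scores (1-10) are invented.
matched = pd.DataFrame({
    "gender":   [0, 0, 1, 1],
    "dining":   [9, 8, 8, 9],
    "music":    [8, 9, 9, 7],
    "gaming":   [2, 3, 8, 9],
    "shopping": [9, 8, 2, 3],
})

activities = ["dining", "music", "gaming", "shopping"]
TOP_K = 3  # how many favourite activities to keep per gender

# Average interest in each activity, computed separately for each gender.
mean_interest = matched.groupby("gender")[activities].mean()

# The TOP_K highest-rated activities for each gender, as sets of names.
top_per_gender = {
    gender: set(row.nlargest(TOP_K).index)
    for gender, row in mean_interest.iterrows()
}

# Activities that appear in every gender's top list.
shared = set.intersection(*top_per_gender.values())

print(sorted(shared))
```

In this toy table only dining and music make both top lists, so those would be the safe conversation starters.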
That is some very tiny text. But just by looking at the plots (and zooming in a lot), we can say that you might have a better chance at getting a good date if you work on Dining (good food sure does matter), Music (come on..), Reading (you can always bring up interesting topics), Hiking or Clubbing. This tells us that these are common hobbies that might make it easier to start a conversation.
Now let us look at how many times a person goes out every week. We first write a small function, which we will reuse later, that gives us the plots from our matched dataset in a single line.
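A helper along these lines might look like the sketch below. The function name `value_counts_for`, its `plot` flag, and the `go_out` column are hypothetical. It returns the counts a plot would show and only draws the chart when asked, so it also runs where matplotlib is not installed.

```python
import pandas as pd


def value_counts_for(frame, column, plot=False):
    """Count how often each value of `column` occurs in `frame`."""
    counts = frame[column].value_counts().sort_index()
    if plot:
        # Draws a bar chart; this branch needs matplotlib to be installed.
        counts.plot(kind="bar", title=column)
    return counts


# Toy matched data; "go_out" is an assumed column name, values are made up.
matched = pd.DataFrame({"go_out": [1, 1, 2, 1, 3, 2]})

go_out_counts = value_counts_for(matched, "go_out")
print(go_out_counts)
```

With this in place, each later attribute becomes a one-liner such as `value_counts_for(matched, "some_column", plot=True)`.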
We can see that most of the people who got a match are people who go out a lot. That does not mean that stepping out of your room 5 times a week somehow makes you more desirable. (Depends on where you go from there though. Especially if you go to some fancy office. Just kidding.)
Now for something I found interesting. I wanted to see how many of these people who got themselves a date thought themselves attractive or felt that they were fun to be around.
Okay, wow. This is surprising. We can see that these people have a pretty great amount of self-confidence. Awesome! It is great to be confident in yourself and believe that you deserve it.
But… maybe not too much. Let us take the parameter which shows how many matches each person expects to get, and divide our data into two parts: the ones who got a match and the ones who did not.
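Splitting into the two groups and comparing them can be sketched with a groupby. The column names `match` and `expected_matches` are assumptions, and the numbers are invented purely to show the mechanics, not to reproduce the article's result.

```python
import pandas as pd

# Toy data; column names are assumed and the values are made up.
df = pd.DataFrame({
    "match":            [1, 1, 0, 0, 0],
    "expected_matches": [2, 4, 6, 5, 7],
})

# Average expectation within each group: 0 = no match, 1 = match.
mean_expectation = df.groupby("match")["expected_matches"].mean()

print(mean_expectation)
```

Comparing the two entries of `mean_expectation` shows which group expected more matches on average.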
Oops. The ones who did not get a match seem to have noticeably higher self-confidence. Maybe a bit too much(?). So, believe in yourself, but do not believe too much. Just the right amount.
Another interesting aspect of this data was an attribute called rating. The participants who did get a match were asked to rate their date. And here we go.
Hold on.. did.. someone.. get.. a.. 10?!! Multiple people seem to be there. I will skip over the code as I am using the same things as before except this little bit.
Do feel free to use this snippet and extend it for the other bits we talked about. Maybe you might come up with something new.
Hopefully, you now have a date and a nice Jupyter Notebook too.
PS. This is a data analytics tutorial. So don’t come running after me if what you used from here did not get you a date.