Machine Learning Matchmaking

Using Machine Learning to find compatible partners with R

Shariq Ahmed
Towards Data Science



A simple question, ‘How do you find a compatible partner?’, is what pushed me to take on this project: finding a compatible partner for any person in a population. The motive behind this blog post is to explain my approach to the problem as clearly as possible.

You can find the project notebook here.

If I asked you to find a partner, what would be your next step? And what if I had asked you to find a compatible partner? Would that change things?

A simple word such as ‘compatible’ can make things tough, because humans, apparently, are complex.

The Data

Since we couldn’t find any single dataset that could cover the variation in persona, we resorted to using the Big5 personality dataset, the Interests dataset (also known as the Young-People-Survey dataset), and the Baby-Names dataset.

Big5 personality dataset: We chose the Big5 dataset because it captures an individual’s personality through the Big5/OCEAN personality test, which asks a respondent 50 questions, 10 each for Openness, Conscientiousness, Extraversion, Agreeableness & Neuroticism, measured on a scale of 1–5. You can read more about Big5 here.

Interests dataset: covers the interests & hobbies of a person by asking them to rate 50 different areas of interest (such as art, reading, politics, sports, etc.) on a scale of 1–5.

Baby-Names dataset: helps in assigning a real and unique name to each respondent.

The project is written in R (version 4.0.0), with the help of the dplyr and cluster packages.

Processing

Loading the Big5 dataset, which has 19k+ observations with 57 variables, including Race, Age, Gender and Country besides the personality questions.
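A minimal sketch of this loading step, assuming the raw Kaggle export is a tab-separated file named data.csv (the file name and separator are assumptions):

```r
library(dplyr)

# Load the raw Big5 responses; "data.csv" and the tab separator are
# assumptions about how the downloaded file is laid out.
big5 <- read.delim("data.csv", sep = "\t", stringsAsFactors = FALSE)
dim(big5)  # expect roughly 19k+ rows and 57 columns
```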

Removing the respondents who left some questions unanswered, as well as respondents with implausible age values such as 412434, 223 and 999999999.

Taking a healthy sample of 5,000 respondents, since we don’t want the laptop to go on a vacation when we compute Euclidean distances between thousands of observations for clustering :)
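A sketch of the cleaning and sampling steps above; the question-column names follow the Kaggle codebook, while the 0-means-unanswered coding and the exact age cutoffs are assumptions, since the post doesn’t spell them out:

```r
set.seed(42)  # so the sample is reproducible

# The 50 question columns are named E1-E10, N1-N10, A1-A10, C1-C10, O1-O10
big5_items <- grep("^[ENACO](10|[1-9])$", names(big5), value = TRUE)

# Keep only fully answered rows (0 is assumed to mean "unanswered")
# and plausible ages, then draw a sample of 5,000 respondents
answered    <- rowSums(big5[big5_items] == 0 | is.na(big5[big5_items])) == 0
big5_clean  <- big5[answered & big5$age >= 13 & big5$age <= 90, ]
big5_sample <- big5_clean[sample(nrow(big5_clean), 5000), ]
```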

Loading the Baby-Names dataset and adding 5,000 unique, real names, so that each observation reads as a person rather than just a number.

Loading the Interests dataset, which has 50 variables, each representing an interest or a hobby.

The heatmap shows that some areas, such as medicine & chemistry, theatre & musical, and politics & history, are correlated. This observation is important, as we will be using this knowledge ahead.
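A quick way to reproduce such a heatmap in base R, assuming the interests file is a CSV with one numeric column per interest (the file name is a placeholder):

```r
# Load the 50 interest/hobby ratings; "interests.csv" is a placeholder name
interests <- read.csv("interests.csv")

# Pairwise correlations between interests, drawn as a base-R heatmap
cor_mat <- cor(interests, use = "pairwise.complete.obs")
heatmap(cor_mat, symm = TRUE, scale = "none",
        main = "Correlation between interests")
```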

After loading all of the datasets, we combine them into one master dataframe named train, which has 107 variables, shown here:
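How the three datasets are stitched together isn’t shown in full, so here is one plausible sketch; pairing each Big5 respondent with a sampled interest profile, and the babynames.csv file and its name column, are assumptions:

```r
names_df <- read.csv("babynames.csv")  # placeholder file; "name" column assumed

set.seed(42)
idx <- sample(nrow(interests), 5000, replace = TRUE)  # pair interest profiles

train <- big5_sample %>%
  bind_cols(interests[idx, ]) %>%
  mutate(name = sample(unique(names_df$name), 5000))  # needs >= 5000 unique names
```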

A few plots to see how our data is spread out in terms of Age and Gender

We can see that the majority of respondents are young, and that we have more female respondents than male ones.
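A minimal base-R version of those plots:

```r
par(mfrow = c(1, 2))
hist(train$age, breaks = 30, main = "Age distribution", xlab = "Age")
barplot(table(train$gender), main = "Respondents by gender",
        xlab = "Gender code")  # 1 = male, 2 = female in the Big5 codebook
```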

Principal Component Analysis

Remember the correlation we saw in the heatmap? Well, this is where Principal Component Analysis comes in. PCA combines the effect of several similar variables into a Principal Component column, or PC.

For those who don’t know what Principal Component Analysis is:
PCA is a dimensionality-reduction technique that creates entirely new variables, called Principal Components (PCs for short), as weighted combinations of the original variables, each constructed to capture as much of the variation in the data as possible.

In simple terms, PCA will let us use only a few components, which capture the most important and most varying parts of the data, instead of all 50 variables. You can learn more about PCA here.

Important: We run PCA on Interests variables and Big5 variables separately, since we don’t want to mix interests & personality.

After running PCA on the Interest variables, we get 50 PCs. Now here is the fun part: we won’t be using all of them. Here’s why: the first PC is the strongest, i.e. the component that captures the most variation in our data; the second PC is weaker and captures less variation, and so on down to the 50th PC.
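A sketch of this PCA run with prcomp, assuming the interest columns in train carry the same names as in the interests file and have no missing values by this point:

```r
interest_cols <- names(interests)  # the 50 interest/hobby columns

# Scale the items so no single interest dominates the components
pca_interests <- prcomp(train[, interest_cols], center = TRUE, scale. = TRUE)
summary(pca_interests)  # one PC per original variable, 50 in total
```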

Our objective is to find the sweet spot between using 0 and 50 PCs, and we will do that by plotting the variance explained by the PCs:
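The proportion of variance explained (PVE) per component comes straight from the PCA’s standard deviations; a minimal sketch:

```r
pve <- pca_interests$sdev^2 / sum(pca_interests$sdev^2)

par(mfrow = c(1, 2))
plot(pve, type = "b", xlab = "Principal Component", ylab = "PVE")
plot(cumsum(pve), type = "b", xlab = "Principal Component",
     ylab = "Cumulative PVE")
abline(h = 0.6, lty = 2)  # the 60% target discussed below
```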

The plots show the Proportion of Variance explained by each PC. Left: each PC’s individual contribution (for example, the first PC explains around 10% of the variation in our data, while the 15th explains only 2%).
Right: the cumulative version of the plot on the left.
We see that after 10 PCs the individual contribution is very low.
But we will stretch a little further, to 14 PCs, to cover 60% of the variance.

The result? We just shrank the number of variables from 50 to 14, while still explaining 60% of the variation in the original Interest variables.

Similarly, we do PCA on Big5 variables:

Again we see that the first PC is the strongest, explaining more than 16% of the variance.
While the slope starts flattening after the 8th PC, we will go with 12 PCs, to capture around 60% of the variance.

Now that we have reduced the variables in Interests from 50 to 14, and in Big5 from 50 to 12, we combine them into a dataframe separate from train. We call it pcatrain.
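A sketch of building pcatrain; which descriptive columns are carried over is an assumption, and the PC columns are renamed here to keep the two blocks apart:

```r
# PCA on the 50 Big5 question columns, mirroring the interests run
pca_big5 <- prcomp(train[, big5_items], center = TRUE, scale. = TRUE)

int_pcs <- as.data.frame(pca_interests$x[, 1:14])
names(int_pcs) <- paste0("Interest_PC", 1:14)

big5_pcs <- as.data.frame(pca_big5$x[, 1:12])
names(big5_pcs) <- paste0("Big5_PC", 1:12)

# Descriptive columns kept here (name, age, gender, country) are an assumption
pcatrain <- bind_cols(train %>% select(name, age, gender, country),
                      int_pcs, big5_pcs)
```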

A glimpse of the pcatrain dataframe with the variable names

Clustering

As a good practice, we first use Hierarchical Clustering to find a good value for k (the number of clusters).

Hierarchical Clustering

What is Hierarchical Clustering? Here is an example: think of a house party of 100 people. We start with every single person representing a cluster of one. The next step? We merge the two people/clusters standing closest together into one cluster, then merge the next two closest clusters, and so on, until we have gone from 100 clusters down to 1. Hierarchical Clustering forms clusters on the basis of the distance between them, and we can then inspect that process in a dendrogram.
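In R this is a dist() plus hclust() call; the Ward linkage below is an assumption, since the post doesn’t name the linkage method:

```r
pc_cols <- grep("_PC", names(pcatrain))  # the 26 principal components

d  <- dist(pcatrain[, pc_cols])        # Euclidean distances by default
hc <- hclust(d, method = "ward.D2")    # linkage method assumed

plot(hc, labels = FALSE, hang = -1, main = "Cluster dendrogram")
abline(h = 40, col = "red")            # illustrative cutoff height
```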

The red line might be a cutoff point

After running Hierarchical Clustering we can see our own cluster dendrogram here. As we go from bottom to top, we see the clusters converging; the more distant two clusters are from each other, the longer the step they take to converge, which you can see by looking at the vertical joins.

Based on that distance, we use the red line to cut the tree into a healthy group of 7 diverse clusters. The reason behind 7 is that these 7 clusters take long steps to converge, i.e. they are distant from one another.

K-Means Clustering

We use the Elbow Method in K-Means to make sure that taking around 7 clusters is a good choice. We won’t dive deep into it, but to summarize: as k grows, the total within-cluster distance between individuals keeps falling, and the point where an extra cluster stops buying much improvement, the elbow, lands at 6 clusters.
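A common way to draw the elbow plot, sketched here with the total within-cluster sum of squares for k = 1 to 15:

```r
set.seed(42)
wss <- sapply(1:15, function(k) {
  kmeans(pcatrain[, pc_cols], centers = k, nstart = 25)$tot.withinss
})
plot(1:15, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")  # look for the elbow
```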

K-Means clustering with 6 clusters

We run K-Means clustering with k = 6, check the size of each cluster, and see which cluster each of the first 10 people is assigned to. Finally, we add this cluster variable to our pcatrain dataframe, which now has 33 variables.
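A sketch of that step (the seed and nstart value are arbitrary choices):

```r
set.seed(42)
km <- kmeans(pcatrain[, pc_cols], centers = 6, nstart = 25)

km$size             # size of each of the 6 clusters
km$cluster[1:10]    # cluster assignments of the first 10 people
pcatrain$cluster <- km$cluster   # attach the cluster label to pcatrain
```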

Final steps

Now that we have assigned clusters, we can start finding close matches for any individual.

We select Penni as a random individual, for whom we will find matches from her cluster, i.e. cluster 2.

On the left, we first find people from Penni’s cluster, then keep only those who are in the same country as Penni, of the opposite gender, and in Penni’s age category.
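With dplyr, that filter might look like this; the ±5-year band standing in for Penni’s age category is an assumption:

```r
penni <- pcatrain %>% filter(name == "Penni")

candidates <- pcatrain %>%
  filter(cluster == penni$cluster,        # same cluster
         country == penni$country,        # same country
         gender  != penni$gender,         # opposite gender
         abs(age - penni$age) <= 5,       # age band assumed as +/- 5 years
         name != "Penni")                 # exclude Penni herself
```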

Some people who belong to the same cluster, country and age group as Penni

Okay, so now we have our filtered list of people. Is that it?

No. Remember the question we asked in the beginning?

‘How do you find a compatible partner?’

Even though we have found people with the same interests and age group, we must find the people whose personality is most similar to Penni’s.

This is where the Big5 personality variables come in handy.

Through Big5, we will be able to find people who have the same level of Openness, Conscientiousness, Extraversion, Agreeableness and Neuroticism as Penni’s.

What we did here is compute, for each personality variable, the difference between Penni’s response and each filtered person’s response, and then sum those differences across all variables.
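A sketch of that scoring, joining the candidates back to their raw Big5 responses in train; matching by name, and using absolute differences, are assumptions, since the post doesn’t say whether the differences are signed:

```r
# Raw Big5 answers of the candidates and of Penni
cand_resp <- train %>%
  filter(name %in% candidates$name) %>%
  select(name, all_of(big5_items))

penni_vec <- unlist(train[train$name == "Penni", big5_items])

# Sum of absolute per-question differences from Penni, smallest first
cand_resp$sumdifference <- apply(cand_resp[big5_items], 1,
                                 function(r) sum(abs(r - penni_vec)))
head(cand_resp[order(cand_resp$sumdifference), c("name", "sumdifference")])
```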

For example: Brody has sumdifference = 8.9, i.e. across the 50 Big5 questions, Brody’s responses differ from Penni’s by only 8.9 points in total.

So now we know, if Penni is looking for a partner, she should first try to meet Brody.

A summary of what we did to find a compatible person for Penni:

  1. Clustered people on the basis of their interests.
  2. Found people who have similar interests and belong to the same age group as Penni.
  3. Ranked those filtered people on the basis of how closely their personality matches Penni’s personality.

Thank you for sticking with me till the end!

You can connect with me on:

Github

LinkedIn
