Reengr: Predicting Your Next Friend

Jonathan P Evans
14 min read · Dec 13, 2018


By Andrew Bohl, Emily Buzzelli, Hope Knopf, Sherlley Loo, Jonathan Evans

Have you ever started talking to someone only to realize that you have absolutely nothing in common with that person? Have you ever hated a job because you didn't fit in with anyone there? Or maybe you've been in an unhealthy relationship because, once the honeymoon phase ended, you realized you didn't actually like that person. Most of us have dealt with at least one of these issues at some point, and unfortunately, in some cases, have wasted a significant amount of time trying to undo our mistakes.

In an attempt to eliminate this issue, Jonathan Evans started a company called Reengr (pronounced "ringer"). At its core, Reengr is a technology that predicts how likely people are to like each other. It accomplishes this by using 21 psychological questions and 5 sociodemographic questions that are meant to create an all-encompassing picture of a user's personality. However, these are far from the typical questions you would see on a Myers-Briggs or Big Five test. Reengr attempts to capture a person's true nature by asking questions like "would you like to start a business one day?" and "do you think you're funny?". We also asked respondents to specify whether each question resonated with them. This was meant to add another layer of depth to the survey, but we ended up ignoring it in our models; we may come back to it at a later date to test whether it has any significant effect on their accuracy. If you would like to see the full list of questions, here is the link to the survey: https://goo.gl/forms/PXTMZVe1ZoQfReIp2

In order to create the initial predictive models for Reengr, we needed a y variable. This came in the form of the “best friend(s)/romantic partner(s)” question. At the beginning of the survey, the respondent is asked to input their first name and the first two letters of their last name and do the same for all of their best friends and their current romantic partner (or partners if they’re adventurous). This allowed us to have a positive class. We knew who people definitely liked, and this was a great start for us. Unfortunately, this created an issue. We had no idea who people didn’t like. Essentially, we had no negative class. This led us down the path of positive and unlabeled learning.
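To make the pairing concrete, here is a minimal sketch of how a positive class can be assembled from the listed names. The column names (user_id, listed_friends, Friend?) and the toy rows are illustrative, not our actual schema:

import pandas as pd
from itertools import combinations

# Hypothetical schema: one row per respondent, with an identifier like
# "first name + first two letters of last name" and the identifiers of the
# best friends / partners they listed.
survey = pd.DataFrame({
    "user_id": ["jonathan ev", "emily bu", "andrew bo"],
    "listed_friends": [["emily bu"], ["jonathan ev", "andrew bo"], []],
})

pairs = []
for (_, a), (_, b) in combinations(survey.iterrows(), 2):
    # 1 (positive) if either respondent listed the other; everyone else is
    # unlabeled rather than negative, since we simply don't know.
    label = int(b["user_id"] in a["listed_friends"] or
                a["user_id"] in b["listed_friends"])
    pairs.append({"user_a": a["user_id"], "user_b": b["user_id"], "Friend?": label})

pairs_df = pd.DataFrame(pairs)

Every pair that nobody mentioned stays unlabeled, which is exactly the situation described next.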

Positive Unlabeled Learning

Positive-unlabeled (PU) learning gets its name from exactly this situation: models are fit using only positive and unlabeled examples. An example is considered unlabeled if we don't know whether it is positive or negative. In our project, the survey asked respondents to list their current friends, but never asked them to list people they are not friends with. Because of how the survey was constructed, we assigned a value of 1 (positive) to known friend pairs and a value of -1 (unlabeled) to anyone not explicitly mentioned. A value of 0 was reserved for a negative class but not introduced until later in the modeling.

The premise of PU learning is to classify new, unlabeled data by comparing it to the labeled data. One major difficulty is the heavy imbalance between the unlabeled class and the positive class. Think of comparing the people someone actually meets in their life (positives) to everyone they merely walk past (unlabeled): even when the positive class is large, the unlabeled possibilities are far greater. For this reason, predicting which unlabeled examples are actually negative can be just as important as finding new positives.

Two-Step

Our first approach to dealing with the highly unbalanced data was a two-step decision tree approach. It was run on a person-by-person basis, which slowed down the process of finding matches for every user; for a single user, however, the process is relatively fast. To explain the two-step approach, consider an example user A, who explicitly listed their friends as H, E, J, and S.

Firstly, a new data frame was created containing all users except the user in question, with the Y variable set to 1, 0, or -1 depending on whether each user was a known match; at the start, only 1 (known match) and -1 (unlabeled) appear.
Step 1 was to train a decision tree on this initial data frame and then predict the probability of seeing a 1 for our Y variable. In our example, the rows representing users H, E, J, and S would have a 1 as the Y variable, while all others would start with -1.

Step 2 involved creating new labels for unknown matches in the following way: for all known positives, we found the range of probabilities the tree predicted. In our example, if the predicted probabilities for the positives were H: .6, E: .55, J: .7, and S: .8, the range would be [.55, .8]. For every unknown point we then compared its predicted probability with this range: any value below the lower bound was considered a new negative, any value above the upper bound was classified as a new positive, and anything falling inside the range remained unlabeled.
Lastly, we replaced the initial Y values with our new labels and repeated steps 1 and 2 until no new labels were assigned.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = preproc()  # survey preprocessing helper (defined elsewhere)
y = df["Friend?"]
ys = 2*y - 1  # known positives = 1, unlabeled = -1
X = df.drop("Friend?", axis = 1)

def two_step(X, y, ys):
    # Step 1: fit a random forest (an ensemble of decision trees) on the
    # original labels, with unlabeled points treated as 0
    rf = RandomForestClassifier(n_estimators = 1000)
    rf.fit(X, y)
    pred = rf.predict_proba(X)[:,1]

    # Min and max predicted probabilities over the known positives
    pos = (ys > 0).values
    range_P = [pred[pos].min(), pred[pos].max()]

    # Unlabeled points scoring above the range become new positives,
    # those below it become new negatives
    iP_new = ys[(ys < 0) & (pred >= range_P[1])].index
    iN_new = ys[(ys < 0) & (pred <= range_P[0])].index
    ys.loc[iP_new] = 1
    ys.loc[iN_new] = 0

    rf2 = RandomForestClassifier(n_estimators = 1000)
    for i in range(10):
        # If step 1 didn't find new labels, end
        if len(iP_new) + len(iN_new) == 0 and i > 0:
            break
        # Step 2: retrain on the new labels and get new scores
        rf2.fit(X, ys)
        pred = rf2.predict_proba(X)[:,-1]
        pos = (ys > 0).values
        range_P = [pred[pos].min(), pred[pos].max()]
        # Repeat step 1 on the points that are still unlabeled
        iP_new = ys[(ys < 0) & (pred >= range_P[1])].index
        iN_new = ys[(ys < 0) & (pred <= range_P[0])].index
        ys.loc[iP_new] = 1
        ys.loc[iN_new] = 0
    return ys, pred
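Assuming the preprocessing above, the routine is then run once over the full feature matrix, and the returned labels can be inspected to see how many unlabeled points were resolved (a quick check, assuming y still holds the original 0/1 labels):

ys_new, scores = two_step(X, y, ys)

n_new_pos = int((ys_new == 1).sum() - y.sum())  # positives found beyond the known ones
n_new_neg = int((ys_new == 0).sum())            # unlabeled points relabeled as negative
print("new positives:", n_new_pos, "new negatives:", n_new_neg)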

Unfortunately, after completing the two-step approach we were left with only new negatives, no new positives. This left us with some questions on what to do with this new information. While it would have been nice to have new positives, knowing negatives cut down on the possibilities we faced in attempting to match two new users.

What To Do?

After performing the two-step approach, we were left, for each user, with a list of all non-negative entries, which we called that user's possible friend list. We determined that this trimmed-down list contained matches that were probable but not necessarily true in all cases. Two different models, each resting on its own assumption, were then fit to determine which of these remaining unlabeled users were actual friends.
The first was a simple approach built on the assumption that one-sided friendships do not exist. Under this assumption, friends were paired up by cross-checking the possible friend lists: if user A was in user B's list of possible friends and vice versa, then A and B were classified as friends, as in the sketch below.
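A minimal sketch of that mutual check, assuming a dict that maps each user to the set of users who survived the two-step filtering for them (possible_friends and the toy entries are illustrative names, not our real data structures):

from itertools import combinations

# possible_friends: {user: set of users still plausible as friends for that user}
possible_friends = {
    "A": {"B", "C"},
    "B": {"A"},
    "C": {"D"},
    "D": {"C"},
}

# Under the "no one-sided friendships" assumption, two users are matched only
# if each appears in the other's possible friend list.
matches = [
    (u, v)
    for u, v in combinations(possible_friends, 2)
    if v in possible_friends[u] and u in possible_friends[v]
]
print(matches)  # [('A', 'B'), ('C', 'D')]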

Secondly, a more complex approach used a one-class support vector machine (SVM) for outlier detection. Under this approach, one-sided friendships are allowed to exist, and friends were classified on a per-user basis from the predictions of the one-class SVM. Unlike most SVMs, where the decision boundary is a hyperplane, a one-class SVM builds a closed boundary (roughly a hypersphere) around the known positives. Any point falling inside this boundary is classified as a friend for the given user, while points outside the boundary are classified as negative (not friends).
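A sketch of that per-user setup with scikit-learn's OneClassSVM; the nu and gamma values, and the random stand-in features, are placeholders rather than our tuned settings:

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Coded survey answers for user A's known friends (the positives for this user)
# and for the candidates that survived the earlier filtering; random data
# stands in for the real 26-question encoding here.
X_known_friends = rng.normal(0, 1, size=(4, 26))
X_candidates = rng.normal(0, 1, size=(50, 26))

# Fit the one-class boundary around the known friends only
oc_svm = OneClassSVM(kernel="rbf", nu=0.2, gamma="scale").fit(X_known_friends)

# predict() returns +1 for points inside the boundary (predicted friends)
# and -1 for points outside it (predicted non-friends)
predicted = oc_svm.predict(X_candidates)
friend_idx = np.where(predicted == 1)[0]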

Positive Only Method

For the positive only method, we referred to the paper Learning classifiers from only positive and unlabeled data (2008) by Elkan and Noto. Its central result is that, as long as the labeled positives are selected completely at random from all positives, a standard classifier trained on positive and unlabeled examples produces scores that differ from those of a classifier trained on true positive and negative examples only by a constant factor, so the first can stand in for the second after rescaling.

The approach is as follows: we treat the positive and unlabeled examples as positives and negatives, respectively, and train a standard classifier on the data. The classifier assigns a score to each data point, with positives generally scoring higher than negatives. Then, among the unlabeled points, which are temporarily labeled as negative, the ones with the highest scores are considered the most likely to be positives.
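A condensed sketch of the Elkan and Noto adjustment: train a classifier to separate labeled positives from unlabeled points, estimate the constant c = P(labeled | positive) as the average score on held-out labeled positives, and divide the scores by c to approximate P(friend | x). The classifier choice, split size, and function name below are illustrative:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def pos_only_scores(X, s, random_state=0):
    # s is 1 for labeled positives and 0 for unlabeled points (Elkan & Noto's notation)
    X_tr, X_hold, s_tr, s_hold = train_test_split(
        X, s, test_size=0.2, stratify=s, random_state=random_state)

    # Step 1: a standard classifier that predicts "is this point labeled?"
    g = RandomForestClassifier(n_estimators=500, random_state=random_state)
    g.fit(X_tr, s_tr)

    # Step 2: c = E[g(x) | x is a labeled positive], estimated on the hold-out set
    c = g.predict_proba(X_hold[s_hold == 1])[:, 1].mean()

    # Step 3: scale the scores; g(x) / c approximates P(y = 1 | x)
    return np.clip(g.predict_proba(X)[:, 1] / c, 0, 1)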

Clustering

Clustering was used as an EDA technique to get an understanding of the collected data. The team had two main goals for this technique: first, to understand the types of people in the dataset by grouping them, and second, to understand what types of people those in each group listed as friends. Because the data was all categorical, K-Modes was used as the clustering algorithm. K-Modes determines cluster centroids based on the most commonly occurring combination of features in the data set, and assigns data points to a cluster based on how many of their features match the centroid's features. For more info on K-Modes, refer to: https://shapeofdata.wordpress.com/2014/03/04/k-modes/.

No special weighting was used to determine clusters, meaning every answer that differed between an observation and a centroid was penalized by the same amount. Also, missing values in the dataset were treated as "informational": a separate category was created for any question a respondent left blank. The team made the assumption that not answering a question is itself an insight into a person's character and personality.
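Because K-Modes works on categorical codes, the "informational" treatment of missing answers can be as simple as giving blanks their own category before encoding; the column names and toy values here are illustrative:

import pandas as pd

# Hypothetical slice of survey answers; None means the respondent skipped the
# question, which is kept as its own informative category rather than imputed.
answers = pd.DataFrame({
    "wants_to_start_business": ["yes", None, "no"],
    "thinks_they_are_funny":   ["yes", "yes", None],
})

# Give every question a "no_answer" category, then map categories to integer codes
coded = answers.fillna("no_answer").apply(lambda col: col.astype("category").cat.codes)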

To prepare the data for clustering, all answers were manually coded into numbers representing each of the different responses. Each observation's coded responses, along with its coded demographic information, were input into the K-Modes algorithm, and 7 clusters were determined.

import pandas as pd
from kmodes.kmodes import KModes

# Initialize K-Modes with 7 clusters (new_data is the coded survey matrix,
# col holds its column names)
km = KModes(n_clusters=7, init='Huang', verbose=0, random_state = 526747)

# Fit the clusters and assign each observation a cluster label
clusters = km.fit_predict(new_data)

# Inspect the cluster centroids (the modal answer for each question per cluster)
clust = pd.DataFrame(km.cluster_centroids_, columns = col)

The number of clusters was determined by trial and error, until the observations were split reasonably evenly between clusters. Each observation was assigned to one of the 7 clusters by the algorithm. The resulting cluster attributes and the number of observations assigned to each cluster are shown below.
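That trial-and-error search looked roughly like the loop below: fit K-Modes for several values of k and check how evenly the observations split (the k range and the use of the model's cost as a tie-breaker are illustrative choices):

import numpy as np
from kmodes.kmodes import KModes

# new_data is the coded survey matrix from the preprocessing step above
for k in range(3, 10):
    km_k = KModes(n_clusters=k, init='Huang', verbose=0, random_state=526747)
    labels = km_k.fit_predict(new_data)
    sizes = np.bincount(labels, minlength=k)
    # A "reasonable" split: no cluster hogging most observations or nearly empty
    print(k, sizes, "cost:", round(km_k.cost_, 1))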

Next, the friends of each observation were also assigned to their clusters. The "friends of each observation" are the people that an observation listed as friends; in order to be assigned to a cluster, a listed friend also had to have taken the survey. Once all observations and their listed friends were assigned to clusters, we counted the number of times people in each cluster listed a friend from every other cluster, and then normalized these counts by the total number of listed friends for that cluster.

# clusterDict maps each cluster label to the list of cluster labels of the
# friends its members listed (built from the assignments above)

# Count the number of times a cluster shows up among another cluster's listed
# friends, along with the total number of listed friends per cluster
counts = {}
for key in clusterDict:
    key_count = {}
    total_count = 0
    for item in clusterDict[key]:
        total_count = total_count + 1
        for cluster in list(data['labels'].drop_duplicates()):
            if item == cluster:
                try:
                    key_count[cluster] = key_count[cluster] + 1
                except KeyError:
                    key_count[cluster] = 1
    key_count['total'] = total_count
    counts[key] = key_count

# Normalize each count by the total number of listed friends in that cluster
for key in counts:
    for item in counts[key]:
        if item != 'total':
            counts[key][item] = round(counts[key][item] / counts[key]['total'] * 100, 2)

Finally, the normalized listed-friend scores for each cluster were plotted. The resulting graphs can be seen below:

A few generalizations can be made from these clusters. One thing to note is that the clusters made up mostly of women (0, 1, 4, and 5) tend to list members of their own cluster as their best friends, meaning that women tend to be friends with people who are most similar to them according to our algorithm. The clusters made up mostly of men, on the other hand, tend to be friends with other male-dominated clusters but have low or zero scores with people in their own cluster, suggesting that men tend to prefer friends who are less similar to them than women do.

Logistic Regression

According to a Scientific American article called "A Friend Like Me," people tend to select friends who are similar to them when given a choice. For the logistic regression, we therefore coded our available data (the known positive matches) for each pair of people according to how similar their survey answers were. We wanted to predict whether two people would be a match, with 1 as positive and 0 as negative. The score from the PosOnly method is the probability that two people will be a match (friends or otherwise). We used a threshold of 0.5: pairs scoring above it were considered to have a high probability of being a match, and pairs below it a low probability, based on similarity. We used an 80/20 train-test split on our data. We started with 20 true positives in our test set, and logistic regression reported 20 times as many positives as there were true labeled positives.
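A simplified sketch of that pairwise coding and fit; the agreement-indicator features and the random stand-in data are an illustration of the idea, not our exact encoding:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
# answers: one coded row of survey responses per person; the labels below are
# random stand-ins for the known-match pairs from the survey
answers = rng.integers(0, 4, size=(60, 26))
pairs = [(i, j, int(rng.random() < 0.2)) for i in range(60) for j in range(i + 1, 60)]

# One row per pair: 1 where the two people gave the same answer, 0 otherwise
X_pairs = np.array([(answers[i] == answers[j]).astype(int) for i, j, _ in pairs])
y_pairs = np.array([label for _, _, label in pairs])

X_train, X_test, y_train, y_test = train_test_split(
    X_pairs, y_pairs, test_size=0.2, random_state=42)

logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# Pairs scoring above 0.5 are treated as likely matches
pred_match = logreg.predict_proba(X_test)[:, 1] > 0.5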

# Import necessary modules
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Compute predicted probabilities on the held-out pairs: y_pred_prob
y_pred_prob = logreg.predict_proba(X_test)[:, 1]

# Generate ROC curve values: fpr, tpr, thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
auc = roc_auc_score(y_test, y_pred_prob)

# Plot ROC curve
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='LogReg ROC (area = %.3f)' % auc)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve for LogReg, AUC = .837')
plt.legend()
plt.show()

We explored other metrics to assess the performance of the model and referred to "Recovering True Classifier Performance in Positive-Unlabeled Learning" by Shantanu Jain, Martha White, and Predrag Radivojac, which finds that accuracy and ROC curves computed on positive and unlabeled data typically underestimate the performance the model would show if it were evaluated on fully labeled positive and negative data.
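One intuition for the underestimate: if a fraction alpha of the "negative" (really unlabeled) test points are secretly positive, the measured AUC mixes genuine positive-versus-negative comparisons with positive-versus-positive comparisons, which average to 0.5. Under that simple mixture view, observed AUC is roughly (1 - alpha) * true AUC + alpha / 2, so a rough correction would be possible if alpha were known. The alpha below is a made-up value purely for illustration, not an estimate from our data:

# Back-of-the-envelope correction under an assumed fraction of hidden positives
auc_observed = 0.837
alpha = 0.10  # made-up value for illustration
auc_corrected = (auc_observed - alpha / 2) / (1 - alpha)
print(round(auc_corrected, 3))  # ~0.874 under this assumed alpha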

Neural Networks

We also ran a neural network, keeping the train/test split and the coding of the data consistent with the logistic regression model. We used a multilayer perceptron (MLP) classifier and tuned the learning rate, hidden layer sizes, momentum, and so on. The neural net actually performed slightly worse than logistic regression, which surprised us. However, it predicted more positives than logistic regression, so in that sense it performed better.

from sklearn.neural_network import MLPClassifier

clf = MLPClassifier(hidden_layer_sizes=(10,), max_iter=500, random_state=50,
                    learning_rate_init=0.0001, learning_rate='constant', momentum=0.9)
estimator = clf.fit(X_train, y_train)
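The hyperparameter search was roughly of the form below; the grid values are placeholders rather than the full set of settings we tried:

from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

param_grid = {
    "hidden_layer_sizes": [(10,), (20,), (10, 10)],
    "learning_rate_init": [1e-4, 1e-3],
    "momentum": [0.8, 0.9],
}

search = GridSearchCV(
    MLPClassifier(max_iter=500, learning_rate="constant", random_state=50),
    param_grid, scoring="roc_auc", cv=3)
search.fit(X_train, y_train)
print(search.best_params_)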

Conclusion

Ethics

The survey asks for race, age and area where you live (city and urban/suburban/rural) which is often used as a proxy for socioeconomic status. With this kind of data, there is a lot of potential for bias in our algorithm if our training data isn’t representative of all the different groups. For example, most of our survey respondents had at least some college education. Because there’s not a good representation of less-educated people, the model isn’t well trained on that group and may not predict very accurately for someone without a college degree.

There’s also the potential for the model to reinforce social structures that we don’t necessarily want to reinforce. We all know that people are more likely to be friends with and have romantic relationships with people that look like them, have a similar background or beliefs, or some kind of commonality. That’s obviously a generalization, but it definitely exists in our society and is reflected in our data. The concern here is that we don’t want our model to only recommend friends of the same gender and race to a user, and have that user miss out on other compatible friends that are of different gender, race or some other characteristic.

We want to take these issues very seriously, and make sure our algorithm is as unbiased as possible. Going forward, to mitigate these issues, we hope to increase the diversity and size of our training data as well as continually work to reduce bias and the perpetuation of segregated societal structures in the Reengr algorithms.

Check out the following link to learn more about ethical issues in data science: http://deon.drivendata.org/examples/

Takeaways

We were able to learn a great deal from this endeavor. We also realized that we might be on to something with our models. Clustering and logistic regression seem like the best options for long term success since they are lightweight and scalable models. We are also looking for ways to use these models in conjunction with each other to further improve our accuracy.

However, there are also some drawbacks that came with our dataset. Since it was (and still is) a small dataset, the conclusions that we can draw from our results are extremely limited. Additionally, our models were not all fit on the same X and y variables: some of us made a row for each possible match in the dataset, and some of us had one row for each individual respondent. There were also some features (only sociodemographic ones) that were excluded from some models but present in others. Going forward, we should all work with a uniform dataset to ensure that our results are comparable and valid.

Next Steps

Regarding next steps for the business, Jonathan needs to make sure that the algorithms run in a secure location; if the models run on users' phones, this could pose a serious security issue. Additionally, if he wants Reengr to work for companies and other groups, he needs to extend the algorithm from predicting pairwise matches to predicting fit within whole groups of people. He also needs to gather more survey responses to ensure that the results obtained from these models hold up with more data. Finally, he needs to spend time designing the UI and UX and creating user stories that will encourage people to give the final product a try.
