Predicting whether partners will stay together
I wanted to answer a question that practically everyone finds interesting: will this couple stay together? Long story short, you can predict the outcome, and you can predict it pretty accurately.
Try it yourself at couplenet.stephenro.se
Summary
Ensemble test set (out of sample) performance
- Accuracy: 0.84
- F1 Score: 0.88
- ROC AUC: 0.88
If the model says you will stay together, that’s good news, because it is very likely to be true (~90.5%). However, the model is weaker at predicting the other case. The ensemble is still better than chance at predicting a breakup, but it is not as accurate. That said, from trying it out with many people in the real world who had either already broken up or broke up months later, the model appears more accurate for this label than just 60%. This apparently high real-world accuracy could be because the model labelled all of these people as broken up for primarily the same reason (they were young, under 25 years old).
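The ~90.5% figure reads as a per-label precision: of the couples predicted to stay together, the share that actually did. Below is a minimal sketch of how the headline metrics and the per-label precision can be computed with scikit-learn. The labels and scores are toy values made up for illustration, not CoupleNet data, and the 0.5 threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, roc_auc_score

# Toy labels and scores for illustration only (1 = stayed together, 0 = broke up)
y_true = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])
y_prob = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.65, 0.3, 0.2, 0.1])

# Turn probabilities into hard labels with an assumed 0.5 threshold
y_pred = (y_prob >= 0.5).astype(int)

accuracy = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_prob)  # uses the scores, not the hard labels

# Precision per label: how often a given prediction turns out to be right
stay_precision = precision_score(y_true, y_pred, pos_label=1)
breakup_precision = precision_score(y_true, y_pred, pos_label=0)
```

With an imbalanced data set, the precision for the minority label is usually the number that suffers first, which matches the weaker break-up predictions described above.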
Modeling and Performance
The final model used on CoupleNet is an ensemble of bagged gradient boosted classifiers, a random forest, and classification neural networks. Since the majority of people in the data set stayed together, all modelling was treated as a class imbalance problem. A lot of feature engineering was also required to make these models use the input data well. I used a package I made called rosey to do most of the feature engineering.
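How the three model families are combined is not spelled out here. One common option, shown purely as an assumption, is soft voting: average the predicted probabilities from each model. The function name average_probabilities and the example numbers are hypothetical.

```python
import numpy as np

def average_probabilities(probabilities, weights=None):
    """Soft-voting sketch: (weighted) mean of each model's predicted probabilities.

    probabilities: one array per model, each of shape (n_samples,)
    weights: optional per-model weights; equal weighting when None
    """
    stacked = np.asarray(probabilities, dtype=float)  # shape (n_models, n_samples)
    return np.average(stacked, axis=0, weights=weights)

# Hypothetical "stay together" probabilities from three models for two couples
forest_proba = np.array([0.9, 0.2])
gbm_proba = np.array([0.7, 0.4])
network_proba = np.array([0.8, 0.6])

ensemble_proba = average_probabilities([forest_proba, gbm_proba, network_proba])
```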
Random Forest
Random forests are pretty trivial to train and optimise, which is why I’ll cover them first. Simple grid search cross-validation was enough for me to find parameters that did a good job predicting on the test set. The only caveat is that I trained the models with a balanced class weight to correct for the class imbalance described above.
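A minimal sketch of that setup with scikit-learn follows. The synthetic data, the parameter grid, and the F1 scoring choice are assumptions for illustration; the actual grid used for CoupleNet is not given.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in data (about 70% positive class)
X, y = make_classification(n_samples=400, n_features=10,
                           weights=[0.3, 0.7], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Hypothetical grid; class_weight="balanced" reweights samples
# inversely to class frequency to counter the imbalance
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    param_grid,
    scoring="f1",
    cv=3,
)
search.fit(X_train, y_train)

best_forest = search.best_estimator_
test_accuracy = best_forest.score(X_test, y_test)
```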
Bagged Gradient Boosted Classifier
Gradient boosted models are great. There’s nothing about them I don’t like! They also do a great job here. For training, I just set a relatively small learning rate and a number of estimators that was way too high.
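One way to score every intermediate model size in a single pass is scikit-learn's staged predictions. The sketch below is an assumption about how validation_scores could be built; the synthetic data, the learning rate, the 300 estimators, and the ROC AUC metric are all illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data
X, y = make_classification(n_samples=400, n_features=10,
                           weights=[0.3, 0.7], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, stratify=y, random_state=0)

# Small learning rate, deliberately too many estimators
gbm = GradientBoostingClassifier(
    learning_rate=0.05, n_estimators=300, random_state=0)
gbm.fit(X_train, y_train)

# staged_predict_proba yields predictions after 1, 2, ..., 300 trees,
# so validation_scores[i] is the score when using i + 1 estimators
validation_scores = [
    roc_auc_score(y_val, stage_proba[:, 1])
    for stage_proba in gbm.staged_predict_proba(X_val)
]
```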
I then take the argmax of the validation scores to find my best guess for the optimal number of estimators to use in my gradient boosted classifier.
import numpy as np
best_gbm_n = np.argmax(validation_scores) + 1  # assumes validation_scores[i] is the score with i + 1 estimators
I now train 50 bagged gradient boosted classifiers as a way of smoothing the decision surface.
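A rough sketch of the bagging step, assuming scikit-learn's BaggingClassifier wrapped around a gradient boosted classifier. The data is synthetic, and the value of best_gbm_n below is a placeholder rather than the number found for CoupleNet.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier

# Synthetic, imbalanced stand-in data
X, y = make_classification(n_samples=300, n_features=10,
                           weights=[0.3, 0.7], random_state=0)

best_gbm_n = 40  # placeholder; in practice this comes from the validation scores

# Each of the 50 classifiers is fit on a bootstrap sample of the training data;
# averaging their predictions smooths out the quirks of any single fit
bagged_gbm = BaggingClassifier(
    GradientBoostingClassifier(
        learning_rate=0.05, n_estimators=best_gbm_n, random_state=0),
    n_estimators=50,
    random_state=0,
)
bagged_gbm.fit(X, y)

stay_probability = bagged_gbm.predict_proba(X)[:, 1]
```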
Classification Neural Networks
I didn’t like the neural networks for this problem. Training was slow compared to the gradient boosted models and random forests. As for prediction, it was fine. An advantage of the neural network, however, is that the probabilities it reports for predictions are much better calibrated than those reported by the other two models. This is because neural networks (like logistic regression) can be made to optimise the log odds directly.
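Calibration can be checked by comparing predicted probabilities with observed frequencies. The sketch below shows one way to do that with scikit-learn's calibration_curve and the Brier score. The synthetic data and the small MLPClassifier are stand-ins, not the CoupleNet networks, and the outcome on toy data says nothing about which model is better calibrated on the real data.

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Hypothetical stand-in models for the comparison
models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "neural_network": MLPClassifier(
        hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
}

brier_scores = {}
calibration_curves = {}
for name, model in models.items():
    proba = model.fit(X_train, y_train).predict_proba(X_test)[:, 1]
    # Per bin: observed fraction of positives vs mean predicted probability;
    # a well-calibrated model has the two roughly equal in every bin
    frac_positive, mean_predicted = calibration_curve(y_test, proba, n_bins=5)
    calibration_curves[name] = (frac_positive, mean_predicted)
    # Brier score: mean squared error of the probabilities (lower is better)
    brier_scores[name] = brier_score_loss(y_test, proba)
```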
Conclusion
CoupleNet was a charming and fun model to make. Many folks who like data science enjoyed the explanations (generated by LIME), and regular people just like these BuzzFeed-style quiz games. Keep in mind that all these models are queried on a cloud server with one shared virtual CPU and 700 MB of RAM.
If you have any questions, let me know!