Modeling

Grid searching for perfection

HypeMergence’s world has emerging artists and not-so-emerging artists, which makes this a binary classification problem. The models I used to tackle it are:

  • Logistic regression: Throwing our data onto a sigmoid. The model treats the log-odds of an artist being emerging as a linear function of the features, and the sigmoid turns those log-odds into a probability. Thresholding that probability (0.5 by default) gives us the decision boundary. A well-fitted logistic regression model has a low deviance, meaning its predicted probabilities sit close to the actual labels. Artists whose predicted probability lands on the wrong side of the threshold are misclassified.
  • Support vector machines: Under the assumption that our data is linearly separable (which, to my surprise, it was), the model creates a decision boundary according to its support vectors, the points closest to that boundary. SVMs aim to maximize the margin while minimizing errors, given by points that fall inside the margin or on the wrong side of the boundary. Other kernels, such as the Radial Basis Function (RBF), can be used to separate data that is not linearly separable.
  • Random forest classifier: Composed of multiple decision trees, each trained on a random sample of the data, our random forest uses the majority rule to classify. At each node, a tree looks at a random subset of the features (two, in this case) and splits on whichever one gives the lowest entropy, i.e. the most information gained. From this model, we are also given the feature importance. SKLearn’s random forest reports it through feature_importances_, which ranks each feature by how much it reduces impurity, averaged across all of the trees. Permuting a feature to see how much the score drops is a separate technique, called permutation importance.
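To make the logistic regression bullet a bit more concrete, here is a minimal sketch of how a fitted model turns features into a label. The weights, bias, and feature values are made up for illustration (they are not HypeMergence's), and `predict_emerging` is a hypothetical helper name.

```python
import math


def sigmoid(z):
    """Map a log-odds score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_emerging(features, weights, bias, threshold=0.5):
    """Return (probability, label) for a single artist.

    The log-odds are a linear function of the features; the sigmoid
    converts them to a probability, and the threshold turns that
    probability into a 0/1 label.
    """
    log_odds = bias + sum(w * x for w, x in zip(weights, features))
    prob = sigmoid(log_odds)
    return prob, int(prob >= threshold)


# Hypothetical coefficients and artists, for illustration only.
weights = [0.8, -0.3]
bias = -1.0

prob_a, label_a = predict_emerging([2.0, 1.0], weights, bias)  # log-odds = 0.3
prob_b, label_b = predict_emerging([0.0, 0.0], weights, bias)  # log-odds = -1.0

print(f"artist A: p={prob_a:.3f}, emerging={label_a}")
print(f"artist B: p={prob_b:.3f}, emerging={label_b}")
```

Raising `threshold` above 0.5 makes the model pickier about who it calls emerging, which generally trades recall for precision.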

For all three of the models above I used SKLearn’s library. I fed my feature matrix into the models, ran a grid search over various parameters (including different SVM kernels), validated the models by splitting my data into train and test sets, and got the following results:
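The workflow described above could look roughly like the sketch below. This is not the original code: the data is a synthetic stand-in generated with `make_classification`, the parameter grids are illustrative guesses, and the feature scaling step for logistic regression and the SVM is my own assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real feature matrix (assumption).
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=5,
    weights=[0.75, 0.25], random_state=0,
)

# Hold out a test set; the grid search cross-validates on the rest.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Candidate models with illustrative (not the author's) parameter grids.
searches = {
    "logistic regression": (
        make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        {"logisticregression__C": [0.1, 1, 10]},
    ),
    "SVM": (
        make_pipeline(StandardScaler(), SVC()),
        {"svc__kernel": ["linear", "rbf"], "svc__C": [0.1, 1, 10]},
    ),
    "random forest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [50, 100], "max_depth": [None, 5]},
    ),
}

results = {}
for name, (estimator, grid) in searches.items():
    search = GridSearchCV(estimator, grid, cv=5, scoring="accuracy")
    search.fit(X_train, y_train)
    # After fitting, the search predicts with the best estimator it found.
    y_pred = search.predict(X_test)
    results[name] = {
        "best_params": search.best_params_,
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
    }

for name, scores in results.items():
    print(name, scores)
```

Swapping `scoring="accuracy"` for `"precision"` or `"recall"` would make the grid search pick parameters by that metric instead.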

  • Logistic regression: accuracy 91.38%, precision 95.75%, recall 68.18%
  • SVM: accuracy 94.73%, precision 98.39%, recall 85.91%
  • Random forest: accuracy 89.95%, precision 97.87%, recall 69.69%

Figure: ROC curve (true positive rate vs. false positive rate) for our SVM model

With the results above, it was a no-brainer to go with the SVM model. On top of that, there was another factor: I wanted to optimize the model’s precision, as I didn’t want to waste my time on artists who really aren’t emerging, i.e. false positives. False positives in HypeMergence’s case would be musicians labeled as stars who aren’t really bright. In my opinion, that would cost us as listeners and/or record labels more than missing a real star would, and the SVM had the highest precision of the three.
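For anyone who wants the false positive vs. missed star trade-off spelled out, here is a small sketch of how the three metrics come out of a confusion matrix. Precision is hurt by false positives (artists flagged as emerging who aren't), while recall is hurt by false negatives (real stars we missed). The counts are made up for illustration and are not HypeMergence's results; `classification_metrics` is a hypothetical helper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, and recall from confusion-matrix counts.

    tp: emerging artists correctly flagged as emerging
    fp: non-emerging artists wrongly flagged as emerging
    fn: emerging artists we missed
    tn: non-emerging artists correctly left alone
    """
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        # Of everyone we flagged, how many were really emerging?
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Of everyone really emerging, how many did we flag?
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up counts, for illustration only.
metrics = classification_metrics(tp=40, fp=2, fn=10, tn=48)

for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

With these counts, only 2 of the 42 flagged artists are duds (high precision), even though 10 of the 50 real stars slip through (lower recall).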

I know there might be a few other models I could’ve looked at. Now that there’s no time constraint, do you guys have any in mind?