Solving a weird name-tagging rule
In this post, we are going to solve an exam problem from my Data Mining class. The question itself is not difficult; it is just strange. Really strange. So, here it goes:
Given a file with random names and a label for each, come up with a model to predict the label.
Aaron Feigelson +
Alexander M. Meystel -
Alma Whitten -
Ameur Foued -
Andrew W. Moore -
Aurora Perez +
Avrim Blum -
Bala Kalyanasundaram +
Barak A. Pearlmutter +
Bernhard Pfahringer +
Bhaskar Dasgupta -
Keep in mind that this is a problem given during a test! So, no computer. Just pen and paper. It is also worth 15% of your total score, so you do not want to skip it. Lastly, the exam gives you 30 names, not just the few above. Here is the whole data file.
So, how did I do?
I bombed it. No idea how to solve it.
But it kept me thinking, so I went home and tried to solve it immediately. And here I am, writing on Medium to share my solution with you. There is another test for me tomorrow. I do not really know what my life is about, but today, it is about this.
Split ’em names!
The names are unique, and I cannot find any correlation just by looking at them. Maybe female names are labeled with “-” and male names with “+”? Even-length names versus odd-length names? The answer is no for all of them. So, I came up with a more computational method.
I want frequency bins: 26 buckets, one for each letter. For example, “Minh Hoang” fills 7 of the buckets: 1 in m, 1 in i, 2 in n, 2 in h, 1 in o, 1 in a, and 1 in g. Here is what comes out after pre-processing.
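If you prefer code to pen and paper, the binning step is easy to sketch in Python. A minimal version using only the standard library (the function name `letter_bins` is mine, not from the exam):

```python
from collections import Counter
import string

def letter_bins(name):
    """Map a name to 26 letter-frequency buckets (case-insensitive; ignores spaces and dots)."""
    counts = Counter(c for c in name.lower() if c in string.ascii_lowercase)
    return [counts[letter] for letter in string.ascii_lowercase]

# "Minh Hoang": 1 a, 1 g, 2 h, 1 i, 1 m, 2 n, 1 o -- 9 letters spread over 7 buckets
print(letter_bins("Minh Hoang"))
```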
Next, I train multiple classifiers on this data and evaluate them with 10-fold cross-validation. My baseline model simply predicts the most frequent label, and it achieves an accuracy of 71.42%.
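I ran the experiments in Weka, but the same baseline can be sketched in scikit-learn. The label counts below are stand-ins I picked to match the roughly 71% majority rate, since the full data file is not reproduced here:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Stand-in labels with roughly the class skew implied by the 71.42% baseline
y = np.array(["-"] * 210 + ["+"] * 84)   # 294 instances, ~71.4% majority class
X = np.zeros((len(y), 26))               # the dummy classifier ignores the features

baseline = DummyClassifier(strategy="most_frequent")
scores = cross_val_score(baseline, X, y, cv=10)
print(f"baseline accuracy: {scores.mean():.4f}")
```

With stratified folds, the most-frequent baseline should land near the 71.4% reported above.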
Naive Bayes works well for words! Let’s see how it does.
Correctly Classified Instances 192 (65.3061 %)
Kappa statistic 0.0083
Oh wow. A kappa of 0.0083 means the model is barely better than always guessing the majority class. My first attempt was horrible.
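Why is a near-zero kappa so damning? Kappa corrects accuracy for the agreement you would get by chance, so a classifier that always predicts the majority class can score 70% accuracy and still earn a kappa of exactly 0. A quick hand-rolled check (the skew here is illustrative, not the exam data):

```python
def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(y_true)
    labels = set(y_true) | set(y_pred)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    p_e = sum((y_true.count(l) / n) * (y_pred.count(l) / n) for l in labels)
    return (p_o - p_e) / (1 - p_e) if p_e != 1 else 1.0

# Always predicting the majority class: 70% accuracy, but kappa is 0
y_true = ["+"] * 70 + ["-"] * 30
y_pred = ["+"] * 100
print(cohen_kappa(y_true, y_pred))  # → 0.0
```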
How about Logistic Regression?
Correctly Classified Instances 205 (69.7279 %)
Kappa statistic 0.1087
Still horrible. How about Random Forest?
Correctly Classified Instances 209 (71.0884 %)
Kappa statistic 0.086
J48? Pretty, please…
Correctly Classified Instances 189 (64.2857 %)
Kappa statistic 0.0959
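For reference, the whole sweep above can be sketched in scikit-learn. The outputs I showed came from Weka, so this is a stand-in: `DecisionTreeClassifier` plays the role of J48, and the features and labels here are random placeholders because the real file is not included.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier  # closest sklearn analogue of Weka's J48
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.integers(0, 4, size=(294, 26))                   # stand-in for the 26 letter bins
y = rng.choice(["-", "+"], size=294, p=[0.7, 0.3])       # roughly the class skew in the post

models = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Decision Tree (J48 analogue)": DecisionTreeClassifier(random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)         # 10-fold CV, as in the post
    print(f"{name}: {scores.mean():.4f}")
```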
Sadly, none of them beats the baseline. I am obviously heading in the wrong direction. How else should I pre-process the data? Well, that is a topic for another day. Disappointing post. Such terrible. Much sad. I know, but I have a test tomorrow.
There is a solution to this problem; my professor already gave it away. But I refuse to take a look, or to give up! Hints are welcome though ;) So, reply below or send me a message on Facebook. Until then, stay tuned for the second part!