Solving a weird name tagging rule

In this post, we are going to tackle an exam problem from my Data Mining class. The question itself is not difficult; it is just strange. Really strange. So, here it goes:

Given a file with random names and a label for each, come up with a model to predict the label.

Aaron Feigelson +
Alexander M. Meystel -
Alma Whitten -
Ameur Foued -
Andrew W. Moore -
Aurora Perez +
Avrim Blum -
Bala Kalyanasundaram +
Barak A. Pearlmutter +
Bernhard Pfahringer +
Bhaskar Dasgupta -

Keep in mind that this is a problem given during a test! So, no computer. Just pen and paper. It is also worth 15% of your total score, so you do not want to skip it. Lastly, you are given 30 names on the exam, not just the few shown above. Here is the whole data file.

So, how did I do?

I bombed it. No idea how to solve it.

But it kept me thinking. So, I went home and tried to solve it immediately. And here I am, writing on Medium to share my attempt with you. There is another test for me tomorrow. I do not really know what my life is about, but today, it is about this.

Split ’em names!

The names are unique, and I cannot find any correlation just by looking at them. Maybe female names are labeled with “-” and male names with “+”? Maybe names of even length get one label and names of odd length get the other? The answer is no to all of them. So, I came up with a more computational method.

I want frequency bins: 26 buckets, one for each letter, each holding the number of times that letter appears in the name. For example, “Minh Hoang” fills 7 buckets: 1 in m, 1 in i, 2 in n, 2 in h, 1 in o, 1 in a, and 1 in g. Here is what comes out after pre-processing.
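To make the bucket idea concrete, here is a minimal Python sketch of the letter-counting step. The function name letter_counts is my own placeholder, and I am assuming that case, spaces, and punctuation (like the dot in a middle initial) are ignored; the actual pre-processing may differ.

```python
from collections import Counter
import string


def letter_counts(name):
    """Return 26 counts, one per letter a..z, ignoring case and non-letters."""
    counts = Counter(ch for ch in name.lower() if ch in string.ascii_lowercase)
    return [counts[letter] for letter in string.ascii_lowercase]


vector = letter_counts("Minh Hoang")

# Show only the buckets that are not empty.
non_empty = {
    letter: n for letter, n in zip(string.ascii_lowercase, vector) if n
}
print(non_empty)
```

Each name becomes a fixed-length vector of 26 numbers, which is the shape most off-the-shelf classifiers expect.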

Next, I train multiple classifiers on this data and evaluate each with 10-fold cross-validation. My base model simply predicts the most frequent label, and it achieves an accuracy of 71.42%.
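For reference, the base model is easy to sketch in Python. The example below uses only the 11 sample labels listed earlier in this post, not the whole data file, so its number will not match the 71.42% above. The function name majority_baseline_accuracy is a placeholder of mine, and the accuracy is computed over the whole sample rather than per cross-validation fold.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Always predict the most frequent label; return it with its accuracy."""
    most_common_label, count = Counter(labels).most_common(1)[0]
    return most_common_label, count / len(labels)


# Labels of the 11 sample names shown earlier, in the same order.
sample_labels = ["+", "-", "-", "-", "-", "+", "-", "+", "+", "+", "-"]

label, accuracy = majority_baseline_accuracy(sample_labels)
print(label, round(accuracy, 4))
```

Any classifier worth keeping has to beat this number, because the baseline gets it without looking at the names at all.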

Naive Bayes works well for words! Let’s see how it does.

Correctly Classified Instances         192 (65.3061 %)
Kappa statistic 0.0083

Oh wow. A kappa score of 0.0083 means the model agrees with the true labels barely better than chance. My first attempt was horrible.
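In case the kappa statistic is new to you: it compares the observed accuracy with the accuracy you would expect by chance given how often each label occurs, using kappa = (observed - expected) / (1 - expected). Here is a small Python sketch of that standard formula. The function name cohen_kappa and the toy labels are mine, not taken from the data file.

```python
from collections import Counter


def cohen_kappa(actual, predicted):
    """Cohen's kappa: agreement between two label lists, corrected for chance."""
    n = len(actual)
    observed = sum(a == p for a, p in zip(actual, predicted)) / n

    actual_freq = Counter(actual)
    predicted_freq = Counter(predicted)
    # Chance agreement: probability both lists pick the same label at random.
    expected = sum(
        actual_freq[label] * predicted_freq[label] for label in actual_freq
    ) / (n * n)

    if expected == 1:
        return 0.0  # Only one label on both sides, so kappa is undefined.
    return (observed - expected) / (1 - expected)


# Toy example: 7 "-" and 3 "+", with a model that always predicts "-".
actual = ["-"] * 7 + ["+"] * 3
always_minus = ["-"] * 10
print(cohen_kappa(actual, always_minus))
```

The always-"-" model in the toy example is right 70% of the time, yet its kappa is 0, which is why a decent-looking accuracy can still come with a kappa near zero.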

How about Logistic Regression?

Correctly Classified Instances         205 (69.7279 %)
Kappa statistic 0.1087

Still horrible. How about Random Forest?

Correctly Classified Instances         209 (71.0884 %)
Kappa statistic 0.086

J48? Pretty please…

Correctly Classified Instances         189 (64.2857 %)
Kappa statistic 0.0959

Sadly, none of them beats the base model. I am obviously heading in the wrong direction. How else should I pre-process the data? Well, that’s a topic for another day. Disappointing post. Such terrible. Much sad. I know, but I have a test tomorrow.

There is a solution to this problem; my professor already gave it away. But I refuse to take a look, or to give up! Hints are welcome though ;) So, reply below or send me a message on Facebook. Until then, stay tuned for the second part!

Links

My code for this problem