Week IV - BOOK GENRE PREDICTOR

Hakan Akyürek
bbm406f18
Dec 24, 2018

Theme: Multi-label text classification

Team members: Hakan AKYÜREK, Sefa YURTSEVEN

After studying the experimental results we got from our models, we noticed we were doing some things wrong and needed to approach our problem differently.

When we look at the top 4 classes, we can see something is seriously wrong: half of them are misclassified. But why? The reason is actually rather simple. Our model is not truly misclassifying all of them; some of those books actually belong to both classes.

After studying our dataset more carefully, we realized that most of the books belong to multiple classes, and treating each book as a member of only a single class is wrong.

So, what to do?

There is something we can do: change how we evaluate our models. After discussing it with our TA, we came up with an evaluation different from the usual accuracy metric.

Instead of checking whether the predicted class equals a single target class, we check whether the predicted class is one of the classes the book belongs to. This way, we think we can evaluate our models more fairly.
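The relaxed metric can be sketched in a few lines of Python. The genre names and lists below are made up for illustration; our real labels come from the book dataset.

```python
def relaxed_accuracy(predictions, label_sets):
    """Count a prediction as correct if it matches ANY of the book's
    true genres, instead of one fixed target label.

    predictions: list of predicted genres, one per book.
    label_sets: list of sets of true genres, one set per book.
    """
    correct = sum(pred in labels for pred, labels in zip(predictions, label_sets))
    return correct / len(predictions)


# Toy example: 3 books, each with one predicted genre.
preds = ["Fantasy", "Romance", "History"]
truth = [{"Fantasy", "Fiction"}, {"Fiction"}, {"History", "Nonfiction"}]
print(relaxed_accuracy(preds, truth))  # 2 of the 3 predictions hit a true genre
```

Under strict single-label accuracy, a book tagged only as "Fiction" in the target column would count the "Fantasy" prediction as wrong even when the book is both.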

But, there is something else…

After some research on the internet, we learned that there has actually been active research on this subject in recent years. It is called multi-label classification.

Multi-label Classification

In multi-label classification, the training set is composed of instances each associated with a set of labels, and the task is to predict the label sets of unseen instances through analyzing training instances with known label sets.

In other words, each document belongs to one or more classes. In our case, a book might be both ‘Fantasy’ and ‘Fiction’ at the same time. So instead of predicting a single class, our model may need to predict two or more classes.

The main difference between multi-class classification and multi-label classification is that in multi-class problems the classes are mutually exclusive, whereas in multi-label problems each label represents a different classification task, though the tasks are somehow related.
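One common way to handle multi-label problems is binary relevance: train one independent binary classifier per label. A minimal sketch using scikit-learn is below; the book descriptions and genres are invented for illustration, and this is not our actual model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

texts = [
    "dragons and ancient magic in a lost kingdom",
    "a detective hunts a killer through the city",
    "magic and a murder mystery in a fantasy city",
]
labels = [{"Fantasy"}, {"Crime", "Fiction"}, {"Fantasy", "Crime"}]

# Turn the label sets into a binary matrix: one column per genre.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

X = TfidfVectorizer().fit_transform(texts)

# OneVsRestClassifier fits one binary classifier per label column,
# so each book can receive any subset of the genres.
clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y)
pred = clf.predict(X)
print(mlb.inverse_transform(pred))  # a set of predicted genres per book
```

The key difference from multi-class classification is that the target `Y` is a binary matrix rather than a single column, so predictions are sets of labels instead of one label per instance.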

After trying this different approach to evaluating our models, we got at most 63% accuracy with artificial neural networks, with Naive Bayes following behind. So we aim to work on multi-label classification from now on. Curse our lack of knowledge, but learning is what this is about… isn’t it?

References:

https://towardsdatascience.com/journey-to-the-center-of-multi-label-classification-384c40229bff
