WEEK II- BOOK GENRE PREDICTION

Hakan Akyürek
Published in bbm406f18
4 min read · Dec 9, 2018

Theme: Text classification with deep learning

Team members: Hakan AKYÜREK, Sefa YURTSEVEN

SUMMARY OF PROJECT PLAN

The goal of the project is to construct an artificial neural network that can predict a book's genre from its summary. Then, if possible, we will add extra features, such as predicting the author of a book or the typical genre of an author.

Our checkpoints in the project are:

(1) Cleaning the dataset

(2) Predicting genres from book summaries

(3) Adding extra features, like predicting the author of a book, if everything stays on course

The first task in our project was to clean our dataset, and we have completed it.

DATASET

Our dataset contained 16,000 book summaries along with their genres and authors. After deleting books without any genre information, roughly 13,000 remained. Most of the remaining books had multiple genres, so we reduced the genre count to one per book, trying to keep the dataset as homogeneous as possible while doing so. To avoid noisy data, we kept only the 15 most common genres in the dataset, leaving around 12,000 books in total.

Pie chart of the dataset with respect to genres
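The genre-filtering step above could be sketched roughly as follows. The book titles, genre lists, and tie-breaking rule here are hypothetical toy data and assumptions, not our actual dataset or code:

```python
from collections import Counter

# Hypothetical toy sample: each book maps to its list of genres
# (the real dataset has thousands of books with genre metadata).
books = {
    "Book A": ["Fantasy", "Adventure"],
    "Book B": ["Science Fiction"],
    "Book C": [],                      # no genre info -> dropped
    "Book D": ["Fantasy", "Mystery"],
}

# Drop books without any genre information
books = {title: genres for title, genres in books.items() if genres}

# Count genre frequencies and keep only the most common genres
# (15 in our project; the toy sample has fewer)
counts = Counter(g for genres in books.values() for g in genres)
top_genres = {g for g, _ in counts.most_common(15)}

# Reduce each book to a single genre; as one possible heuristic,
# prefer the book's rarest kept genre to keep classes balanced
labeled = {}
for title, genres in books.items():
    candidates = [g for g in genres if g in top_genres]
    if candidates:
        labeled[title] = min(candidates, key=lambda g: counts[g])
```

After this pass, every remaining book has exactly one genre label drawn from the most common genres.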

Then we cleaned the summaries: we removed numbers and punctuation and converted all of them to lower case. The result is a clean dataset that is ready to use.
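A minimal sketch of this kind of summary cleaning, using only the Python standard library (the exact cleaning rules in our pipeline may differ):

```python
import re
import string

def clean_summary(text: str) -> str:
    """Lowercase a summary, strip digits and punctuation,
    and collapse runs of whitespace."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)  # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

print(clean_summary("In 1984, Winston Smith works... at the Ministry!"))
# -> in winston smith works at the ministry
```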

METHODS RESEARCHED

In our earlier blog post we said that we were going to research deep learning methods for classifying text. We researched various methods: CNNs, RNNs, raw (plain feed-forward) NNs, and HANs. CNNs seem to outperform the others in terms of speed, while HANs and RNNs return better results.

If we look at the comparisons above, each approach reaches very high validation accuracy. But considering our time limit, we are most interested in CNNs and raw neural networks, since training them is much faster than training an RNN or a HAN.

Text Classification with CNN

Even though CNNs are mostly used for image classification, they have recently been applied to NLP and the results look promising. There are a few properties of text classification with CNNs that affected our preference.

Unlike images, our summaries (documents) will probably not all have the same size. Each row of the input matrix corresponds to a token, typically a word, though it could be a character. Typically, these row vectors are word embeddings like word2vec, but they could also be one-hot vectors that index the word in a vocabulary. To make the inputs equal in size, we will need to add padding. Normally this is not a problem, but considering that the shortest summary might be 5 sentences long while the longest can be 50, it just might become one.
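The padding idea can be illustrated with a small pure-Python sketch that mimics what a utility like Keras's `pad_sequences` does (the `maxlen` value and the token-id sequences below are made up for illustration):

```python
def pad_sequences(sequences, maxlen, pad_value=0):
    """Pad (or truncate) tokenized summaries to a fixed length so
    they can be stacked into one matrix for the network."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]  # truncate summaries that are too long
        padded.append(seq + [pad_value] * (maxlen - len(seq)))
    return padded

# Token-id sequences of very different lengths, like our summaries
short = [4, 17, 9]
long_ = list(range(1, 12))
print(pad_sequences([short, long_], maxlen=8))
# -> [[4, 17, 9, 0, 0, 0, 0, 0], [1, 2, 3, 4, 5, 6, 7, 8]]
```

The cost of this is visible in the example: a very short summary becomes mostly padding, which is exactly the concern raised above when summary lengths vary by an order of magnitude.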

reference: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/

As we discussed, CNNs perform quite well for NLP and we are really fond of them. Therefore, we will most likely use a CNN model instead of the model in the baseline code we found.

Our Approach

While we haven't reached a final decision on this, we have already found baseline code that uses the Keras library. The sample reaches 87% validation accuracy on its own dataset, which is really promising.

The baseline code we found uses Keras's Sequential model, which is basically a linear stack of layers. Keras lets us easily construct the model and train it with a few lines of code. Furthermore, the dataset this sample code uses has 20 different output classes, similar to ours, and its size is similar to ours as well.
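To give an idea of what such a linear stack of layers looks like, here is a minimal Keras Sequential sketch of a 1D-CNN text classifier. This is not the baseline code itself; the vocabulary size, embedding dimension, filter counts, and sequence length are all assumed values for illustration:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

vocab_size = 20000   # assumed vocabulary size
num_classes = 15     # our 15 genres

# A linear stack of layers: embed tokens, convolve over the
# sequence, pool to a fixed-size vector, then classify
model = keras.Sequential([
    layers.Embedding(vocab_size, 128),
    layers.Conv1D(128, 5, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(64, activation="relu"),
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Sanity check: two padded summaries of 500 token ids each
probs = model.predict(np.zeros((2, 500), dtype="int32"), verbose=0)
print(probs.shape)  # (2, 15): one probability per genre per book
```

With `fit()` on the padded summaries and integer genre labels, this is the whole training loop, which is what makes Keras attractive given our time limit.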

Next week we intend to reach a consensus on our approach after discussing it a bit more with our TA. Then we will focus on constructing the model, once our neural network assignment is done.
