[Week 3 — Features and Classification]

Published in bbm406f16, Dec 12, 2016

Last week we started collecting our own Turkish dataset. This week we began investigating how to use the comments we collected and how the classification process will work. There are two basic steps we need to carry out.

Extracting Features

The first step is extracting features. We need to translate restaurant comments into a form the machine can understand. This step is important for classifying the reviews as well as possible. To begin with, we used CountVectorizer and HashingVectorizer.
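To make the feature extraction step concrete, here is a minimal sketch of how the two scikit-learn vectorizers turn comments into numeric matrices. The sample comments and the `n_features` value are assumptions for illustration only; they are not taken from our dataset or our actual settings.

```python
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

# Hypothetical sample comments (not from the collected dataset).
comments = [
    "yemek çok lezzetli",
    "servis çok yavaş",
    "lezzetli ama servis yavaş",
]

# CountVectorizer learns a vocabulary from the comments and counts
# how often each word occurs in each comment.
count_vec = CountVectorizer()
X_count = count_vec.fit_transform(comments)

# HashingVectorizer keeps no vocabulary: each word is hashed into one of
# n_features columns, so there is nothing to fit. n_features is an assumed value.
hash_vec = HashingVectorizer(n_features=2**10, alternate_sign=False)
X_hash = hash_vec.transform(comments)

print(sorted(count_vec.vocabulary_))  # the learned words
print(X_count.shape, X_hash.shape)    # (comments, features) for each matrix
```

CountVectorizer makes it easy to inspect which words became features. HashingVectorizer uses a fixed amount of memory, but the columns cannot be mapped back to words.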

Classification

This is the main part of our project: we will try to find the right classifier and obtain as high an accuracy as possible. There are many classifiers we can use. Some of them:

Bernoulli Naive Bayes
Gaussian Naive Bayes
Multinomial Naive Bayes
k-Nearest Neighbors
Nearest Centroid Classifier
Ridge Classifier
Stochastic Gradient Descent Classifier
Perceptron
Passive Aggressive Classifier
Random Forest Classifier
Support Vector Classifier
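A comparison of several of these classifiers could be sketched as follows. This is only an illustration under assumptions: the tiny training and test sets and their labels (1 = positive, 0 = negative) are made up, the hyperparameters are arbitrary, and `LinearSVC` stands in for the support vector classifier. Gaussian Naive Bayes is left out because it requires dense input, while the vectorizer produces a sparse matrix.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Perceptron, RidgeClassifier
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical labeled comments (1 = positive, 0 = negative).
train_texts = [
    "yemek çok lezzetli",
    "harika lezzet tavsiye ederim",
    "çok güzel ve taze",
    "yemek soğuk ve tatsız",
    "berbat lezzet hiç beğenmedim",
    "çok kötü ve bayat",
]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ["lezzetli ve taze", "tatsız ve kötü"]
test_labels = [1, 0]

# Assumed hyperparameters; none of these are tuned values.
classifiers = {
    "Multinomial Naive Bayes": MultinomialNB(),
    "Bernoulli Naive Bayes": BernoulliNB(),
    "k-Nearest Neighbors": KNeighborsClassifier(n_neighbors=3),
    "Ridge Classifier": RidgeClassifier(),
    "Perceptron": Perceptron(random_state=0),
    "Random Forest Classifier": RandomForestClassifier(n_estimators=50, random_state=0),
    "Support Vector Classifier": LinearSVC(),
}

results = {}
for name, clf in classifiers.items():
    # Each pipeline vectorizes the text and then trains the classifier.
    model = make_pipeline(CountVectorizer(), clf)
    model.fit(train_texts, train_labels)
    results[name] = accuracy_score(test_labels, model.predict(test_texts))

for name, acc in results.items():
    print(f"{name}: {acc:.2f}")
```

With a real dataset, the held-out comments should come from a proper train/test split or cross-validation rather than a hand-written list.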

The results for the three evaluations, obtained with different algorithms, are shown below.

Accuracy rates are lower than we expected. This can have different causes. We can increase accuracy by collecting more training data or by finding features that are better suited to the attributes we want to predict. But the most important problem is the collected comments themselves: some are meaningless or non-matching, and, most commonly, reviews are not compatible with each of the evaluations (speed, service, flavor).

Some example reviews that may be a problem:

Another problem for classification is misspelled words. Last week we found a simple solution to this problem, but it is very superficial and insufficient.

We need a more complex spelling corrector for more reliability and accuracy. As the project progresses, we can try different libraries (Zemberek etc.) for data manipulation.
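To show what a simple corrector can and cannot do, here is a minimal sketch based on string similarity from Python's standard library. It is not the solution we used last week and it is not Zemberek. The vocabulary, the `cutoff` value and the function names `correct_word` and `correct_text` are all assumptions for illustration.

```python
import difflib

# Hypothetical vocabulary of correctly spelled words.
vocabulary = ["lezzetli", "servis", "yavaş", "güzel", "yemek"]


def correct_word(word, vocab, cutoff=0.8):
    """Return the most similar vocabulary word, or the word itself if none is close."""
    # Note: str.lower() is not Turkish-aware (for example, "I" becomes "i", not "ı").
    matches = difflib.get_close_matches(word.lower(), vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_text(text, vocab):
    """Correct each whitespace-separated word of a comment independently."""
    return " ".join(correct_word(w, vocab) for w in text.split())


print(correct_text("yemek lezetli ama servs yavaş", vocabulary))
```

A corrector like this only looks at character similarity and knows nothing about Turkish suffixes, which is why a morphology-aware library is a more promising direction.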

Thank you for your interest…

References:

http://machinelearningmastery.com/how-to-prepare-data-for-machine-learning/

Working With Text Data - scikit-learn 0.18.1 documentation (scikit-learn.org)

http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html

https://github.com/ahmetaa/zemberek-nlp

Written by Sentiment Analyzer