[Week 3 — Emotion Detection]

Ali Baran Taşdemir · bbm406f18 · Dec 16, 2018

Team Members: Ali Baran Taşdemir, Akif Çavdar

Over the past two weeks we introduced our project’s topic and the related work. This week we will get into the more detailed and technical parts of our project.

For text analysis, we will use some extra tools and techniques to prepare our data and get the best results. So let’s talk about them.

Stemming

Stemming is the process of reducing a word to its root, or stem. In this context, the stem is not necessarily the exact morphological root, so it is not a form of the word that you would find in a dictionary. For example, an algorithm may produce the stem ‘consol’ for the word ‘consoling’.

A typical application of stemming is grouping together all instances of words that share a stem, for example in a search library. So, if a user searches for documents containing ‘friend’, they can also find ones containing ‘friends’ or ‘friended’.
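A minimal sketch with NLTK’s PorterStemmer (one common choice, not necessarily the exact stemmer we will use in the project) reproduces the examples above:

>>> from nltk.stem import PorterStemmer
>>> stemmer = PorterStemmer()
>>> [stemmer.stem(w) for w in ['consoling', 'friends', 'friended']]
['consol', 'friend', 'friend']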

Lemmatization

Lemmatization returns the lemma of a given word. Basically, it gives the corresponding dictionary form of a word. In some ways, it can be considered a more advanced form of stemming. It can also be used for similar purposes: it ensures that all the different forms of a word are correctly linked to the same concept.

For instance, it can transform all instances of ‘cats’ into ‘cat’ for search purposes. However, it can also distinguish between ‘run’ as in the verb ‘to run’ and ‘run’ as in the noun synonym of ‘jog’.
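A minimal sketch with NLTK’s WordNetLemmatizer (assuming the WordNet data has been downloaded) shows both behaviors:

>>> from nltk.stem import WordNetLemmatizer   # requires nltk.download('wordnet')
>>> lemmatizer = WordNetLemmatizer()
>>> lemmatizer.lemmatize('cats')              # default part of speech is noun
'cat'
>>> lemmatizer.lemmatize('running', pos='v')  # treated as the verb 'to run'
'run'
>>> lemmatizer.lemmatize('running', pos='n')  # treated as a noun, left unchanged
'running'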

Stop Words

Stop words are words which are filtered out before or after processing of natural language data. Though “stop words” usually refers to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list. Some tools specifically avoid removing these stop words to support phrase search.

For example, the stop word list used in the sklearn library contains 318 words such as ‘how’, ‘of’, ‘do’, ‘to’, ‘too’ and ‘were’, while the stop word list in the nltk library contains 179 words.
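Both lists are easy to inspect; a quick sketch (the exact counts may differ slightly between library versions):

>>> from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
>>> from nltk.corpus import stopwords         # requires nltk.download('stopwords')
>>> len(ENGLISH_STOP_WORDS)
318
>>> len(stopwords.words('english'))
179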

Any group of words can be chosen as the stop words for a given purpose. For some search engines, these are some of the most common, short function words, such as the, is, at, which, and on. In this case, stop words can cause problems when searching for phrases that include them, particularly in names such as “The Who”, “The The”, or “Take That”.

Count Vectorizer


Count vectorizer is a useful tool for text analysis. It converts a collection of text documents into a matrix of token counts: it tokenizes the words and maps them to numbers, on which we can easily perform math operations.

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = [
'I love machine learning.',
'Do you love machine learning?',
'We will convert texts to numbers now.']
>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> print(vectorizer.get_feature_names())
['convert', 'do', 'learning', 'love', 'machine', 'now', 'numbers', 'texts', 'to', 'we', 'will', 'you']
>>> print(X.toarray())
[[0 0 1 1 1 0 0 0 0 0 0 0]
 [0 1 1 1 1 0 0 0 0 0 0 1]
 [1 0 0 0 0 1 1 1 1 1 1 0]]

TF-IDF

The TF (term frequency) of a word is the frequency of the word in a document, i.e. the number of times it appears divided by the total number of words in the document. When you know it, you can see whether a term is used too much or too little.

For example, when a 100-word document contains the term “cat” 12 times, the TF for the word ‘cat’ is:

TF(cat) = 12/100 = 0.12

The IDF (inverse document frequency) of a word is the measure of how significant that term is in the whole corpus.

For example, say we have a corpus of 10,000,000 documents and the term “cat” appears in 300,000 (0.3 million) of them. Then the IDF is the logarithm of the total number of documents (10,000,000) divided by the number of documents containing the term “cat” (300,000).

IDF(cat) = log(10,000,000/300,000) ≈ 1.52
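To double-check the arithmetic, here is the same calculation in plain Python (assuming a base-10 logarithm, as in the example above):

>>> import math
>>> tf_cat = 12 / 100                        # term frequency of 'cat' in one document
>>> idf_cat = math.log10(10000000 / 300000)  # inverse document frequency
>>> round(tf_cat, 2), round(idf_cat, 2)
(0.12, 1.52)
>>> round(tf_cat * idf_cat, 2)               # tf-idf weight of 'cat' in that document
0.18

Note that sklearn’s TfidfVectorizer uses a natural logarithm and some extra smoothing, so its weights will differ slightly from this hand calculation.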

So far we have introduced some tools and techniques. We then used some of them in our project, applied some basic machine learning algorithms, and got some first results.

If you want to run the experiments on your own computer, there is a GitHub repository.

We used 5 basic ML algorithms and ran cross-validation tests to see which features affect each algorithm.
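The full experiment code is in the repository; the sketch below only illustrates the general setup, with hypothetical parameter choices, where texts and labels stand in for our documents and their emotion labels:

>>> from sklearn.pipeline import Pipeline
>>> from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
>>> from sklearn.naive_bayes import MultinomialNB
>>> from sklearn.model_selection import cross_val_score
>>> # one feature combination: English stop words removed, 1- to 3-grams, tf-idf weighting
>>> model = Pipeline([
...     ('counts', CountVectorizer(stop_words='english', ngram_range=(1, 3))),
...     ('tfidf', TfidfTransformer()),
...     ('clf', MultinomialNB())])
>>> scores = cross_val_score(model, texts, labels, cv=5)
>>> print(scores.mean())

Swapping the classifier (SGDClassifier, LogisticRegression, Perceptron) and the vectorizer settings gives the feature combinations reported below.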

Multinomial Naive Bayes

The first algorithm we used is the Multinomial Naive Bayes algorithm.

Multinomial Naive Bayes cross validation test result

The figure shows the 5 most accurate feature combinations for the Naive Bayes algorithm. The top scorer is the one that uses tf-idf, stop words and 3-grams, with 48.67% accuracy.

Stochastic Gradient Descent

Stochastic Gradient Descent cross validation test results

For the SGD algorithm, the most accurate combination is tf-idf, stop words and 4-grams with the squared hinge loss function. The accuracy of that combination is 49.57%.
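In sklearn that loss is selected directly on the classifier; a one-line sketch, not the full experiment:

>>> from sklearn.linear_model import SGDClassifier
>>> clf = SGDClassifier(loss='squared_hinge')  # squared hinge loss, as in our best SGD run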

Logistic Regression

Logistic Regression cross validation test results

For logistic regression, unlike the others, we see that the best result comes with 1-grams and without tf-idf. The accuracy is 45.29%.

Perceptron Algorithm

Perceptron Algorithm cross validation test results

Here is another different result: stop word filtering is not used in the perceptron’s best combination, while 1-grams and tf-idf are. The accuracy is 45.77%.
