How CoinAnalyst identifies positive and negative polarities in social media posts

Kucuksungur
6 min read · Dec 12, 2018


We are approaching a big day for the backers and potential users of our platform! Instead of promotional talk, we prefer to inform you about our real work: what have we accomplished, and what are we still working on? Through a series of articles, we intend to give you some insight from a technical perspective. How does the intelligent reasoning of CoinAnalyst really work under the hood?

In the latest episode of our Artificial Intelligence series we talked about Support Vector Machines. We explained in plain language how this text classification system categorizes imported data. The idea is to find the best hyperplane that divides a dataset into classes or groups: the system identifies the hyperplane whose distance to the nearest element of each class is the largest. In that episode we did not get to two other extremely powerful techniques that are noteworthy. Let's start by taking a look, in easy everyday language, at the possibilities of the N-gram technique.
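To make that recap concrete, here is a minimal sketch of such a text classifier in Python with scikit-learn. This is not our production pipeline; the tiny corpus and its labels are made up purely for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny hand-made corpus, purely for illustration.
texts = [
    "great project, team delivers, buying more",
    "total scam, stay away from this coin",
    "very promising roadmap and strong community",
    "worst token I have ever held, selling everything",
]
labels = ["positive", "negative", "positive", "negative"]

# TF-IDF turns each text into a vector; the linear SVM then finds the
# hyperplane separating the two classes with the largest margin.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)

print(model.predict(["promising project, buying"]))  # -> ['positive']
```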

Sentiment Analysis
With ever-growing social networking and online marketing sites, the reviews and blogs they produce act as an important source for analysis and improved decision making in the complex crypto world.

On the basis of extensive social media analysis, our CoinAnalyst platform can provide buy or sell advice by showing whether the vast majority has a positive or negative tendency concerning a particular cryptocurrency. Research on such a scale is not feasible for an individual. In addition, biased information may unfairly affect one's trading decisions. With our analyses, which process millions of data records, reliability increases considerably!

These reviews are mostly unstructured by nature and thus need processing, such as classification or clustering, to provide meaningful information for future use. Reviews and blogs may be classified into different polarity groups, such as positive, negative, and neutral, in order to extract information from the input dataset.

N-gram model
It is a method of taking 'n' contiguous words or sounds from a given sequence of text. The model helps to predict the next item in a sequence; in sentiment analysis, it helps to analyze the sentiment of a text or document.
A unigram is an n-gram of size 1, a bigram an n-gram of size 2, and a trigram an n-gram of size 3. Higher-order n-grams are referred to as four-grams, five-grams, and so on.

The n-gram method can be explained using the following example. Consider the sentence "This token is not a good one".

- Unigrams (single words): "This", "token", "is", "not", "a", "good", "one"
- Bigrams (pairs of words): "This token", "token is", "is not", "not a", "a good", "good one"
- Trigrams (three words at a time): "This token is", "token is not", "is not a", "not a good", "a good one"
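In code, producing these n-grams is a one-liner over a token list. A minimal Python sketch (the function name ngrams is our own) reproducing the example above:

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of the token list as strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "This token is not a good one".split()
print(ngrams(tokens, 1))  # ['This', 'token', 'is', 'not', 'a', 'good', 'one']
print(ngrams(tokens, 2))  # ['This token', 'token is', 'is not', 'not a', 'a good', 'good one']
print(ngrams(tokens, 3))  # ['This token is', 'token is not', 'is not a', 'not a good', 'a good one']
```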

In a previous blog post we explained Term Frequency–Inverse Document Frequency (TF-IDF) weighting on top of the 'bag-of-words' model with unigram features, meaning that we split the entire text into single words and count the occurrence of each word. Such a model does not take the position of each word in the sentence, its context, or the grammar into account! That is why a unigram TF-IDF model has low accuracy in detecting the sentiment of a text document.

Multiple studies, on the other hand, have verified that the N-gram model is the most suitable representation model for this task, due to a series of advantages: first, it allows fuzzy and sub-string matching, functionality of primary importance in open domains like Twitter; second, it is a language-neutral method that makes no assumptions about the underlying language. Apart from its high effectiveness, this model also exhibits high classification efficiency, because it involves a limited number of features that depends solely on the corresponding number of classes. In the future, we plan to investigate coupling it with evidence drawn from the social graph of Twitter as well as with behavioral patterns of its users, thus providing a holistic and highly effective solution to polarity classification.

For example, with a unigram TF-IDF model the following two sentences will be given the same sentiment score:
1. "This is not a good coin" –> 0 + 0 + 0 + 0 + 1 + 0 –> positive
2. "This is a very good coin" –> 0 + 0 + 0 + 0 + 1 + 0 –> positive
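You can see the problem in a few lines of Python. The one-entry lexicon below is hypothetical, mirroring the scores in the example above, where only "good" carries any weight:

```python
# Hypothetical unigram subjectivity lexicon: only "good" carries a score;
# every other word defaults to 0, exactly as in the example above.
LEXICON = {"good": 1}

def unigram_score(sentence):
    return sum(LEXICON.get(word.lower(), 0) for word in sentence.split())

print(unigram_score("This is not a good coin"))   # 1 -> classified positive
print(unigram_score("This is a very good coin"))  # 1 -> same score, same class
```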

If we include features consisting of two or three words, this problem can be avoided: "not good" and "very good" become two different features with different subjectivity scores. The biggest reason bigram and trigram features are not used more often is that the number of possible word combinations grows exponentially with the number of words. Theoretically, a vocabulary of 2,000 words allows 2,000 possible unigram features, 4,000,000 (2,000²) possible bigram features, and 8,000,000,000 (2,000³) possible trigram features.
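In scikit-learn, for instance, switching from unigrams to unigrams-plus-bigrams is a single parameter. A quick sketch showing how the two sentences above stop looking identical:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ["This is not a good coin", "This is a very good coin"]

# ngram_range=(1, 2) emits unigrams *and* bigrams. Note that scikit-learn's
# default tokenizer drops single-letter words like "a", so the two sentences
# now differ on bigram features such as "not good" versus "very good".
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(sentences)

print(sorted(vectorizer.get_feature_names_out()))
```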

However, if we consider this problem from a pragmatic point of view, most of the word combinations that can be formed are grammatically impossible or do not occur often enough to need to be taken into account.

Actually, we only need to define a small set of words (negations, intensifiers, and the like) that we know change the meaning of the words following them and/or the rest of the sentence. If we encounter such an 'n-gram word', we do not split the sentence there but after the next word. In this way we construct n-gram features consisting of the specified words and the words directly following them. Some examples of such words, which also appear in the snippets below, are "not", "very", and "highly".

There are a few conditions this n-gram function needs to fulfill (a sketch implementing them follows after the examples below):

When it iterates through the tokenized text and encounters an n-gram word, it needs to concatenate this word with the next word. So ["I", "do", "not", "recommend", "this", "Cryptocurrency"] needs to become ["I", "do", "not recommend", "this", "Cryptocurrency"]. At the same time, it needs to skip the next iteration, so that the next word does not appear twice.

It needs to be recursive: we might encounter multiple n-gram words in a row, in which case all of them need to be concatenated into a single n-gram. So ["This", "is", "a", "very", "very", "good", "Cryptocurrency"] needs to become ["This", "is", "a", "very very good", "Cryptocurrency"]. If n words are concatenated into a single n-gram, the next n iterations need to be skipped.

In addition to concatenating words with the words following them, it might also be interesting to concatenate them with the word preceding them. For example, forming n-grams from the word "Cryptocurrency" and its preceding word leads to features like "worst Cryptocurrency", "best Cryptocurrency", "successful Cryptocurrency", etc.

Using this simple function to concatenate words into n-grams will lead to features that strongly correlate with a specific (negative/positive) class, like 'highly recommend', 'best Cryptocurrency', or even 'high potential project'.
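Below is a minimal Python sketch of such a function. The trigger set is a hypothetical seed built from this article's own examples; a real deployment would use a curated list:

```python
# Hypothetical seed set of meaning-changing words (see the examples above).
TRIGGERS = {"not", "very", "highly"}

def concat_ngrams(tokens):
    result, i = [], 0
    while i < len(tokens):
        if tokens[i].lower() in TRIGGERS:
            gram = [tokens[i]]
            i += 1
            # Absorb consecutive trigger words ("very very") into one n-gram.
            while i < len(tokens) and tokens[i].lower() in TRIGGERS:
                gram.append(tokens[i])
                i += 1
            # Attach the first non-trigger word that follows, if any,
            # and skip it so it does not appear twice.
            if i < len(tokens):
                gram.append(tokens[i])
                i += 1
            result.append(" ".join(gram))
        else:
            result.append(tokens[i])
            i += 1
    return result

print(concat_ngrams("I do not recommend this Cryptocurrency".split()))
# ['I', 'do', 'not recommend', 'this', 'Cryptocurrency']
print(concat_ngrams("This is a very very good Cryptocurrency".split()))
# ['This', 'is', 'a', 'very very good', 'Cryptocurrency']
```

Extending this sketch to also look at the preceding word (yielding features like "worst Cryptocurrency") is a small variation on the same loop.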

Next Episode
The n-gram technique is powerful and complements other methods, but it certainly cannot perform an optimal sentiment analysis on its own, because this system also has its shortcomings.

Now that you have a better understanding of text classification terms like Support Vector Machines, TF-IDF, and n-grams, we can start using classifiers for sentiment analysis. Stay tuned for our next release, guys, in which Bayesian Networks will be the central topic and (in case it won't get too lengthy) also the Maximum Entropy method.
