WatermelonBlock Technical Blog #1 — Sentiment Analysis Data Transformation Process
The team at WatermelonBlock are avid proponents of the ‘Make hay while the sun shines’ philosophy, which is why we strive to use state-of-the-art infrastructure and deploy advanced deep learning algorithms (thanks to our strategic partnership with IBM!) to help our crypto-investors make good investment decisions that can not only reduce their short-term/long-term risks, but also boost their profits on a day-to-day basis.
The rapid growth of cryptocurrencies is the result of both increased investor speculation and the introduction of various new cryptocurrencies, with current estimates of the total number topping 1,500 different coins. With so many cryptos in play and investors’ money at stake, the rapid, near-exponential increase in cryptocurrency prices suggests that price fluctuations are driven primarily by retail investor speculation.
The data scientists at WatermelonBlock, who were instrumental in designing the Melon Score framework, burn the midnight oil harnessing the cumulative power of social sentiment signals derived from the Internet, alongside other correlated factors such as economic signals of volume, USD exchange prices, overall trading volume, etc., to construct predictive machine learning models: ones that could actually be used to predict price fluctuations and ascertain whether a particular cryptocurrency is worthy of your investment. More recently, it has been shown that data from social media platforms such as Twitter, Reddit, LinkedIn and Facebook can be used to track investor sentiment and price changes, both in the crypto market as a whole and in the predominant cryptocurrencies.
Since the cryptocurrency market is mostly dominated by retail investors and a few institutional investors, their reviews, discussions and critical sentiments on social media websites provide a viable medium to capture overall investor behavior.
So how exactly is the social sentiment captured?
All hail NLP, a.k.a. Natural Language Processing, which helps finish 60% of the tasks involved in sentiment analysis. With the evolution of the digital landscape, NLP has become a growing field within artificial intelligence and machine learning, one that helps machines analyze natural-language data (English and other human languages).
Since our adoption of IBM’s Watson API and their AC922 Power AI (Accelerated Compute) servers for performing NLP, our output accuracy and data-crunching power have been unrivalled.
Text comes in many forms, from a list of individual words, to long sentences, to complex multi-paragraph passages with special characters. Our training and testing data is scraped in real time from a diverse range of sources (social media sites, news websites, discussion forums, etc.) to give unbiased scoring and help investors. Unfortunately, there’s a lot of work that needs to be done before the real machine learning process can begin.
Raw data such as tweets, Reddit messages and news comment threads is syntactically inconsistent and requires more cleaning than one might expect. Transforming text into something an algorithm can digest is a challenging task.
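To make that cleaning concrete, here is a minimal sketch of what a first pass over a raw tweet might look like. The stop-word list, the regular expressions and the sample tweet are all illustrative assumptions, not our production pipeline.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real projects use a much larger one.
STOP_WORDS = {"a", "an", "the", "is", "in", "of", "to", "and"}

def wrangle(text):
    """Lowercase the text, strip links/handles/symbols, and drop stop-words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # remove links
    text = re.sub(r"@\w+", " ", text)           # remove @handles
    text = re.sub(r"[^a-z0-9\s]", " ", text)    # remove punctuation and symbols
    return [tok for tok in text.split() if tok not in STOP_WORDS]

# Made-up example tweet, used only for illustration.
tokens = wrangle("RT @CryptoFan: Bitcoin is MOVING to the moon!!! https://example.com #BTC")
print(tokens)
```

Note that the order of the substitutions matters: links and handles are removed before the generic symbol filter, otherwise their fragments would survive as stray tokens.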
There are 5 intricate steps that have to be performed for a successful Natural Language Processing project:
● Wrangling consists of getting rid of the less useful parts of the text by removing stop-words (frequently occurring, insignificant words in the text corpus), normalizing capitalization, handling special characters, etc.
● Text Annotation consists of applying a markup scheme to the text data and implementing part-of-speech tagging for each word in each sentence (noun, verb, adjective, etc.).
● Text Normalization is the most important step in the data transformation process. It consists of translating (mapping) the text terms in the scheme to a machine-understandable format, or applying linguistic reductions through Stemming, Lemmatization and other standardization techniques.
(i) Stemming: a process where words are reduced to a root by removing inflection, that is, by dropping unnecessary characters, usually a suffix. There are several stemming models, including Porter and Snowball. The results can be used to identify relationships and interesting commonalities across large datasets.
E.g.: Unstemmed Version: [“RT @ Bitcoin is moving to highest places in a weirder way”]
Stemmed Version: [“rt”, “bitcoin”, “move”, “to”, “high”, “place”, “weird”, “way”]
It is easy to see where reductions may produce a “root” word that isn’t an actual word. This doesn’t necessarily adversely affect its efficiency, but there is a danger of “overstemming” where words like “universe” and “university” are reduced to the same root of “univers”.
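The overstemming problem is easy to reproduce. The sketch below uses a toy suffix-stripping rule of our own invention (it is not the Porter or Snowball algorithm, and the suffix list is an assumption chosen for the demo), yet it already collapses “universe” and “university” into the same root.

```python
# Toy suffix-stripping stemmer: an illustrative assumption,
# NOT the Porter or Snowball algorithm.
SUFFIXES = ("ity", "ing", "es", "e", "s")  # checked in this order

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters of stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ("moving", "places", "universe", "university"):
    print(w, "->", toy_stem(w))
```

For real work, an established implementation (for example, the Porter stemmer shipped with NLTK) is the sensible choice; the toy version is only meant to show why roots such as “mov” or “univers” appear.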
(ii) Lemmatization: an alternative approach to stemming for removing inflection. By determining the part of speech and utilizing WordNet’s lexical database of English, lemmatization can get better results than stemming.
E.g.: Raw Version: [“Ethereum is falling like dried leaves”]
Lemmatized Version: [“ethereum”, “fall”, “like”, “dry”, “leaf”]
Stemmed Version: [“ethereum”, “fall”, “like”, “dry”, “leav”]
Notice the difference?
Lemmatization is more computationally intensive and slower than stemming, but more accurate. Stemming may be more useful for database queries, whereas lemmatization works much better when trying to determine the sentiment of a text.
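The role of the part of speech can be shown with a dictionary lookup. In the sketch below, the tiny `LEMMAS` table is a hypothetical stand-in for a real lexical database such as WordNet, and the part-of-speech tags are supplied by hand rather than by a tagger.

```python
# Hypothetical mini-lexicon keyed by (word, part of speech);
# a hand-written stand-in for a real lexical database such as WordNet.
LEMMAS = {
    ("falling", "verb"): "fall",
    ("dried", "adj"): "dry",
    ("leaves", "noun"): "leaf",
    ("leaves", "verb"): "leave",
}

def toy_lemmatize(word, pos):
    """Return the dictionary form for (word, pos), or the lowercased word if unknown."""
    word = word.lower()
    return LEMMAS.get((word, pos), word)

# Part-of-speech tags are assigned by hand here; a real pipeline would use a tagger.
tagged = [("Ethereum", "noun"), ("falling", "verb"), ("like", "prep"),
          ("dried", "adj"), ("leaves", "noun")]
lemmas = [toy_lemmatize(w, p) for w, p in tagged]
print(lemmas)
```

The same surface form “leaves” maps to “leaf” as a noun and “leave” as a verb, which is exactly the distinction a suffix-stripping stemmer cannot make.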
● Word Embedding consists of statistically probing, manipulating and generalizing important topic items from the dataset for dimensional feature vector analysis. This process involves converting words into their respective numerical vectors, which can be plotted along 2D axes to understand the importance of various terms in the text corpus.
● Feature Vector Analysis
Consider the following cleaned tweet — “ripple has a bright future in the world’s economy.”
The features for text classification consist of a vector of all unique words in the data set lexicon. Since the vector encompasses all possible unique entries, it is sparse even for the longest of tweets. Suppose that there are ‘n’ words in the learning algorithm’s vocabulary. A text classification feature vector for this model is then x = (x_1, x_2, ..., x_n), where entry x_i corresponds to the i-th word of the vocabulary.
When a particular word is observed at least once, a binary value of one is recorded in the position of that word in the feature vector, so each x_i is either 0 or 1. When the total count of each word is instead recorded in the same feature vector format, the input is modeled as multinomial rather than Bernoulli, and each x_i takes a non-negative integer value (0, 1, 2, ...).
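The difference between the two encodings is easiest to see side by side. The following sketch builds both vectors over a two-tweet corpus; the second tweet and the helper names are made up for illustration, and the apostrophe in “world’s” has been dropped for simplicity.

```python
from collections import Counter

def build_vocabulary(docs):
    """Sorted list of every unique token in the corpus."""
    return sorted({tok for doc in docs for tok in doc})

def bernoulli_vector(tokens, vocab):
    """1 if the vocabulary word occurs at least once in the tweet, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocab]

def multinomial_vector(tokens, vocab):
    """Number of times each vocabulary word occurs in the tweet."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

corpus = [
    "ripple has a bright future in the world economy".split(),
    "bitcoin bitcoin price is falling".split(),  # hypothetical second tweet
]
vocab = build_vocabulary(corpus)

for tweet in corpus:
    print(bernoulli_vector(tweet, vocab))
    print(multinomial_vector(tweet, vocab))
```

The two encodings only differ for the second tweet, where the repeated “bitcoin” is recorded as 1 in the Bernoulli vector and as 2 in the multinomial one. Even in this tiny corpus most entries are zero, which is the sparsity mentioned above.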
The training and testing feature vectors for sentiment analysis models are fundamentally different from the word-occurrence vectors used for text classification. To generate them, pre-processed tweets are analyzed word-by-word by the IBM Watson API.
The API returns scores between zero and one for each word’s positivity, negativity, and neutrality. These scores are then aggregated into a single feature vector.
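One simple way to picture the aggregation step is to average the per-word scores. In the sketch below, the `WORD_SCORES` table is entirely made up and merely stands in for an API response, and averaging is our assumption for the demo; other aggregation schemes are equally plausible.

```python
# Hypothetical per-word scores standing in for an API response.
# The numbers are made up for illustration only.
WORD_SCORES = {
    "ripple": {"positive": 0.2, "negative": 0.1, "neutral": 0.7},
    "bright": {"positive": 0.8, "negative": 0.1, "neutral": 0.1},
    "future": {"positive": 0.5, "negative": 0.1, "neutral": 0.4},
}

LABELS = ("positive", "negative", "neutral")

def aggregate(tokens, scores=WORD_SCORES):
    """Average the positive/negative/neutral scores of the words that have scores."""
    scored = [scores[tok] for tok in tokens if tok in scores]
    if not scored:
        return [0.0, 0.0, 0.0]  # no scored words: fall back to a zero vector
    return [sum(s[label] for s in scored) / len(scored) for label in LABELS]

vector = aggregate(["ripple", "has", "bright", "future"])
print(dict(zip(LABELS, vector)))
```

Unlike the sparse word-occurrence vectors, this vector has a fixed, small length no matter how large the vocabulary grows.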
Model Training and Validation
After the above 5-step transformation process is complete, the real model training/validation begins, and the machines learn to capture the sentiment signals through ‘ensembled algorithms’, a.k.a. weightage-based scoring algorithms. For the text classification approach, an ensemble of classification algorithms such as Naive Bayes, Logistic Regression and Support Vector Machines (SVM) is used.
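As a rough picture of what weightage-based scoring can mean, the sketch below combines three classifiers’ estimated probabilities of a price increase into one weighted average. The probabilities, the weights and the 0.5 threshold are all hypothetical; this post does not specify how our ensemble actually weighs its members.

```python
def weighted_vote(probabilities, weights):
    """Combine each model's P(price goes up) into one weighted-average probability."""
    total = sum(weights[name] for name in probabilities)
    return sum(probabilities[name] * weights[name] for name in probabilities) / total

# Hypothetical outputs for one time window; a real pipeline would get
# these from trained models rather than hard-coded numbers.
probs = {"naive_bayes": 0.70, "logistic_regression": 0.55, "svm": 0.40}
# Hypothetical weights, e.g. derived from each model's validation accuracy.
weights = {"naive_bayes": 1.0, "logistic_regression": 2.0, "svm": 1.0}

p_up = weighted_vote(probs, weights)
prediction = "increase" if p_up >= 0.5 else "decrease"
print(round(p_up, 2), prediction)
```

Giving a model a larger weight lets it pull the combined score towards its own estimate, which is useful when one classifier has proven more reliable on held-out data.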
The goal of each algorithm is to predict whether the price of a cryptocurrency will increase or decrease over a set time frame, based on the overall social sentiment. Stay tuned to find out more about our technology and engineering culture in the upcoming blogs of this series!