How does CoinAnalyst categorize large data sets with Support Vector Machines?

To recap the previous episode on the artificial intelligence built into CoinAnalyst’s solution: in the context of text analytics, machine learning can be defined as a set of statistical techniques for identifying parts of speech, entities, sentiment and other characteristics of text. The techniques can be expressed as a model that is trained on labeled examples and then applied to other text, which is known as supervised learning. They can also be a set of algorithms that work across large data sets to extract meaning without labels, which is known as unsupervised learning.

As promised, after reading this series of articles you will be well ahead in your knowledge of current advances in text analytics, and you will understand how machine learning helps to distinguish between textual publications on social media and the wider web.

Support Vector Machines

In text classification, a large number of documents are assigned to predefined classes. The task is to divide hypertext documents into several categories according to their content. This is what we elaborated on in our previous post about supervised learning. Problems of this kind appear in almost every area: email filtering, spam detection, web search, subject indexing and the classification of news stories, for example.
We explained that, starting from a labeled data set that is fed into the algorithm, the machine learning technology takes over the remaining work and classifies a raw batch of data. You might think: piece of cake. But how does it do that?

In text classification systems, the classifier is the key component that determines both the quality and the efficiency of the classification. Almost all the important machine learning algorithms have been applied to text classification, and the support vector machine (SVM) is one of them. An SVM is a supervised machine learning algorithm that can be used for both classification and regression, but it is more commonly used for classification problems.

SVMs are based on the idea of finding a hyperplane that best divides a data set into two classes. A hyperplane is a flat, linear separator with one dimension fewer than the space the data lives in: a line when the data is two-dimensional, a plane when the data is three-dimensional, and a hyperplane in the general sense when the data has four or more dimensions.

The best hyperplane is the one that maximizes the margin, the distance between the hyperplane and the points closest to it. These points are the support vectors: they lie nearest to the hyperplane and determine where it sits. They are the critical elements of the data set, because removing them would alter the position of the dividing hyperplane.

Intuitively, the further our data points lie from the hyperplane, the more confident we are that they have been correctly classified. We therefore want our data points to be as far away from the hyperplane as possible while still being on the correct side of it, since this tends to give the lowest generalization error.
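For readers who prefer notation, the maximum-margin idea is usually written as the optimization problem below. This is the standard textbook hard-margin formulation, not something specific to CoinAnalyst, and the symbols are our own assumptions for this sketch: w is the normal vector of the hyperplane, b its offset, x_i a data point and y_i (either -1 or +1) its class.

```latex
\min_{w,\,b}\ \frac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \,(w \cdot x_i + b) \ge 1 \quad \text{for all } i
```

Under these constraints the nearest points lie at a distance of 1/‖w‖ from the hyperplane, so making ‖w‖ as small as possible makes the margin as large as possible.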

All of this may still sound vague to you, so let’s illustrate the concept with an example. Imagine we have two tags: Red and Blue. The data has two features: X and Y. We are looking for a classifier that, given a pair of x, y coordinates, outputs either a red or a blue symbol. See the following plot.

A support vector machine takes these data points and outputs the hyperplane, a simple line in the case of two dimensions, that best separates the tags. This line is the decision boundary: anything that falls on one side of it is classified as blue, and anything on the other side as red.

But how do we find the best hyperplane? It is the one that maximizes the margin to both tags. In other words: the line whose distance to the nearest element of each tag is the largest.
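To make "largest distance to the nearest element" concrete, here is a minimal sketch that compares three hand-picked candidate lines by their margin. The points, the labels (-1 for red, +1 for blue) and the candidate lines are all made up for illustration. A real SVM does not try out a handful of lines; it solves an optimization problem to find the best one directly.

```python
import math

# Hypothetical sample points: label -1 stands for red, +1 for blue.
points = [((1, 1), -1), ((2, 1), -1), ((1, 2), -1),
          ((4, 4), +1), ((5, 4), +1), ((4, 5), +1)]

# Hand-picked candidate lines, each written as w[0]*x + w[1]*y + b = 0.
candidates = {
    "A": ((1.0, 1.0), -5.5),
    "B": ((1.0, 0.0), -3.0),
    "C": ((1.0, 1.0), -4.0),
}

def signed_distance(w, b, point, label):
    """Distance from point to the line, negative if it lies on the wrong side."""
    x, y = point
    return label * (w[0] * x + w[1] * y + b) / math.hypot(w[0], w[1])

def margin(w, b):
    """Smallest signed distance over all points: the margin of this line."""
    return min(signed_distance(w, b, p, label) for p, label in points)

margins = {name: margin(w, b) for name, (w, b) in candidates.items()}
best = max(margins, key=margins.get)

# Support vectors: the points that sit exactly at the margin of the best line.
w, b = candidates[best]
support_vectors = [p for p, label in points
                   if math.isclose(signed_distance(w, b, p, label), margins[best])]

for name, value in margins.items():
    print(f"line {name}: margin = {value:.3f}")
print("best line:", best)
print("support vectors:", support_vectors)
```

All three lines separate red from blue, but line A keeps the greatest distance from the nearest points, and only those nearest points (the support vectors) determine where it sits.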

Nonlinear data
What about more complex cases, such as the nonlinear data in the image below?

Clearly, no straight line can separate these two groups. However, the vectors are visibly segregated, and it looks as though it should be easy to separate them.

So, here’s what we’ll do: we add a third dimension, z, and define it with a convenient equation: z = x² + y² (the equation of a circle around the origin, so z is the squared distance of a point from the origin).

This gives us a three-dimensional space. Taking a slice of that space, it looks like the left image below. Since we are in three dimensions now, the hyperplane is a plane parallel to the x-y plane at a certain height z (let’s say z = 1). Mapped back to two dimensions, this plane becomes a circular boundary, as in the lower right illustration. This is the idea behind kernels.
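The lifting step can be sketched in a few lines of Python. The points below are invented for illustration (an inner cluster and an outer ring), and the names "inner" and "outer" are our own labels rather than anything taken from the plots.

```python
# Hypothetical points: an inner cluster and an outer ring around the origin.
inner = [(0.2, 0.1), (-0.3, 0.4), (0.0, -0.5)]
outer = [(1.5, 0.0), (-1.2, 1.1), (0.3, -1.8)]

def lift(point):
    """Add the third coordinate z = x^2 + y^2 to a 2D point."""
    x, y = point
    return (x, y, x * x + y * y)

def classify(point, threshold=1.0):
    """Separate with the flat plane z = threshold in the lifted space."""
    _, _, z = lift(point)
    return "inner" if z < threshold else "outer"

for p in inner + outer:
    print(p, f"z = {lift(p)[2]:.2f}", classify(p))
```

In the original two dimensions no straight line separates the two groups, but after lifting, a single threshold on z does the job.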

Kernels let a Support Vector Machine build a nonlinear classifier as if the data had been lifted into a higher dimension, without our having to transform the data explicitly. We only replace the dot product with that of the space we want, and the SVM will chug along. The kernel takes over the complex calculation, so we never have to touch our data.
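A small numeric check shows what "changing the dot product" means. As an assumption for this sketch we use the degree-2 polynomial kernel K(p, q) = (p · q)², which corresponds to the explicit mapping (x, y) → (x², √2·x·y, y²). The first and third coordinates of that mapping add up to the z = x² + y² used earlier. The function names are ours.

```python
import math

def phi(p):
    """Explicit degree-2 feature map: (x, y) -> (x^2, sqrt(2)*x*y, y^2)."""
    x, y = p
    return (x * x, math.sqrt(2) * x * y, y * y)

def dot(u, v):
    """Plain dot product of two equally long vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def poly_kernel(p, q):
    """Degree-2 polynomial kernel, computed from the original 2D points only."""
    return dot(p, q) ** 2

a, b = (1.0, 2.0), (3.0, 4.0)
explicit = dot(phi(a), phi(b))   # dot product after lifting both points to 3D
via_kernel = poly_kernel(a, b)   # same quantity without ever lifting
print(explicit, via_kernel)
```

Both numbers agree up to floating-point rounding, which is why the SVM can work in the higher-dimensional space while only ever computing with the original coordinates.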

Simplification
Data engineers and data scientists will probably wonder how often real observations are distributed so neatly that a hyperplane can simply be drawn between them. True, I share that doubt. The following illustration is more realistic.

Interpreting these data observations is now more complex, and there is no direct relationship to read off. To classify this non-linear data set we need to move from a two-dimensional to a three-dimensional view, as explained above. How? Well, the easiest way is to imagine that these colored balls are lying on a sheet. Lifting the sheet launches the balls into the air. While the balls are up in the air, we use the sheet to separate them. This lifting represents the mapping into a higher dimension, known as kernelling.

In the three-dimensional view, our hyperplane can no longer be a line. It becomes a plane, as shown in the example below. The idea is that the data continues to be mapped into higher dimensions until a hyperplane can be formed that segregates it into classes.

Text Processing
 
Being able to classify vectors in a multidimensional space opens up various opportunities for text classification. You might wonder how all the text documents that CoinAnalyst scans on social media and other channels are transformed into vectors of numbers. One option is word frequencies: every word in the text becomes a feature, and the value of that feature is the number of times the word occurs in the text. Dividing that count by the total number of words gives a relative frequency.

Consider two statements on social media: “The xxxToken is a successful project due to its cooperation with party xyz” and “I think xxxToken has a lot of potential and will see huge demand the coming weeks”. Taken together they contain 29 words, so the word ‘xxxToken’ has a frequency of 2/29 and the word ‘cooperation’ has a frequency of 1/29.
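The counting itself is easy to reproduce. The sketch below treats the two statements as a single text and uses a deliberately simple tokenizer (lowercase, split on whitespace). Both choices are assumptions on our part, and a production pipeline would normally tokenize more carefully.

```python
from collections import Counter
from fractions import Fraction

statements = [
    "The xxxToken is a successful project due to its cooperation with party xyz",
    "I think xxxToken has a lot of potential and will see huge demand the coming weeks",
]

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()

# Treat both statements as one text and count every word.
tokens = [token for s in statements for token in tokenize(s)]
counts = Counter(tokens)
total = len(tokens)

def frequency(word):
    """Relative frequency of a word: occurrences divided by total word count."""
    return Fraction(counts[word.lower()], total)

print("total words:", total)
for word in ("xxxToken", "cooperation"):
    print(word, frequency(word))
```

Using `Fraction` keeps the result readable as "count over total" instead of a rounded decimal.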

Making use of these frequencies, we generate meaningful observations for our engineers with the help of two methods: the N-Grams and TF-IDF techniques.

TF-IDF
You might wonder how the system deals with less important or unimportant words within documents. We use a weighting factor that is common in information retrieval and text mining: TF-IDF, short for Term Frequency–Inverse Document Frequency.
The TF-IDF value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word. This helps to adjust for the fact that some words appear more frequently in general.

This technique helps us to play down stop words in our text classification activities. For example, the terms “the”, “and” and “a” are so common that plain term frequency tends to incorrectly emphasize documents which happen to use these words more often, without giving enough weight to the more meaningful terms “xxxToken” and “successful”. The term “the” is not a good keyword for distinguishing relevant from non-relevant documents, unlike the less common words “xxxToken” and “successful”. An inverse document frequency factor is therefore incorporated, which diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely. Thus, the specificity of a term can be quantified as an inverse function of the number of documents in which it occurs.
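As a rough sketch of the weighting, the snippet below computes TF-IDF for a made-up corpus of three short documents using the plain variant tf × log(N / df). Many libraries use smoothed or otherwise adjusted variants, so the exact numbers will differ from tool to tool. The corpus and the function name are assumptions for illustration.

```python
import math
from collections import Counter

# Hypothetical mini-corpus of three short documents.
documents = [
    "the xxxToken is a successful project",
    "the market is a risky place",
    "the team is a large group",
]

def tf_idf(docs):
    """Return one {word: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word occur?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights

weights = tf_idf(documents)
for word, weight in sorted(weights[0].items(), key=lambda item: -item[1]):
    print(f"{word:12s} {weight:.3f}")
```

Words that occur in every document, such as "the", receive a weight of zero, while words that occur in only one document keep a positive weight.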

Now that we’ve done that, every text in our data set is represented as a vector with thousands (or tens of thousands) of dimensions, each one representing the frequency of one of the words in the text. This is what we feed our Support Vector Machine for training. We use several other supporting techniques besides the SVM, such as stemming and n-grams.
To prevent this article from becoming too tedious, I will keep it short: stemming reduces inflected (or derived) words to their root or base form.
 
To illustrate: the words ‘crypto’, ‘crypto’s’ and ‘cryptocurrencies’ are all reduced to the stem ‘cryptocurrency’.
 
The N-Grams technique, on the other hand, is a much more advanced method of analyzing documents. Because it is so comprehensive, we will stick to our current topic here and promise to cover this method in our next article.

Now that we have the vectors, the only thing left to do is to choose a kernel function for our model. Every problem is different, and the right kernel function depends on what the data looks like. If the data is arranged in circles, for example, we need a kernel that matches that shape. Classifiers for natural language processing, however, use thousands of features, since they can have up to one for every word that appears in the training data. With that many dimensions the classes are often already separable by a flat hyperplane, so a simple linear kernel is usually a reasonable first choice.

Now the only thing left to do is training! We take our set of labeled texts, convert them to vectors using word frequencies, and feed them to the algorithm, which uses our chosen kernel function to produce a model. Then, when we have a new unlabeled text that we want to classify, we convert it into a vector and give it to the model, which outputs the tag of the text.
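The whole loop (labeled texts, vectors, training, prediction) can be sketched in plain Python. Everything here is an assumption made for illustration and not CoinAnalyst's actual implementation: the four training texts, the tags "positive" and "negative", the linear kernel (a plain dot product), and the simple subgradient-descent trainer with its hyperparameters `lam`, `lr` and `epochs`. In practice one would rely on an established machine learning library rather than a hand-written trainer.

```python
from collections import Counter

# Hypothetical labeled training texts: +1 means "positive", -1 means "negative".
training = [
    ("successful project with huge potential", +1),
    ("reliable team and strong demand", +1),
    ("failed project with poor potential", -1),
    ("unreliable team and weak demand", -1),
]

def tokenize(text):
    """Lowercase, split on whitespace and strip surrounding punctuation."""
    words = (w.strip(".,!?\"'") for w in text.lower().split())
    return [w for w in words if w]

# One feature per distinct word in the training data.
vocab = sorted({w for text, _ in training for w in tokenize(text)})

def vectorize(text):
    """Relative word frequencies over the training vocabulary."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero for empty texts
    return [counts[w] / total for w in vocab]

def dot(u, v):
    """Linear kernel: the plain dot product."""
    return sum(ui * vi for ui, vi in zip(u, v))

def train(X, y, lam=0.01, lr=1.0, epochs=300):
    """Linear SVM: batch subgradient descent on the regularized hinge loss."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [lam * wi for wi in w], 0.0
        for xi, yi in zip(X, y):
            if yi * (dot(w, xi) + b) < 1:  # point violates the margin
                grad_w = [g - yi * xj / n for g, xj in zip(grad_w, xi)]
                grad_b -= yi / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

X = [vectorize(text) for text, _ in training]
y = [label for _, label in training]
w, b = train(X, y)

def predict(text):
    """Convert a new text to a vector and return the tag chosen by the model."""
    return "positive" if dot(w, vectorize(text)) + b >= 0 else "negative"

print(predict("huge demand for a reliable project"))  # positive
print(predict("weak project, poor team"))             # negative
```

Words that never appeared in the training data (such as "for" and "a" in the first test sentence) have no feature of their own and therefore do not influence the decision.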

Conclusion
 
Today’s article can be read as a follow-up that clarifies the previous one and offers a more technical view of text classification. You now have a basic understanding of how our systems create order in the data overload by sorting cryptocurrency projects into various subjects and categories. If some of you are thinking “what a complex process!”, I have to tell you that these are just the preparatory steps for creating clear overviews from a ‘crisscross’ data set. The real brainteasers begin afterwards: deriving reliable predictions from those overviews and, last but not least, performing sentiment analysis!

Unfortunately, I did not manage to cover the N-Grams and Bayesian Network techniques in this week’s publication. Let’s take up these topics in our next release! With less than five weeks remaining until the end of the ICO, we are approaching a big day that many crypto enthusiasts have long been waiting for. Soon we will say goodbye to all unreliable cryptocurrency projects!


Originally published at medium.com on August 25, 2018.