Practical Machine Learning for Blockchain Datasets: Understanding Semi and Omni Supervised Learning
Part of our work at IntoTheBlock is trying to leverage cutting-edge machine/deep learning techniques to obtain the best intelligence possible from blockchain datasets. We are living in the golden era of machine learning research, in which big technology labs and academic institutions are pushing the envelope of what’s possible with artificial intelligence (AI) systems. As a result, we have a new arsenal of methods and techniques that can be applied to blockchain datasets. A few days ago, I wrote about a few novel machine learning techniques that could be effective when applied to blockchain analytics problems. Today, I would like to dive deep into a couple of methods that are relatively new in machine learning theory but seem very applicable to blockchain intelligence scenarios. I am referring to semi-supervised and omni-supervised learning methods.
Semi- and omni-supervised learning methods might sound foreign even to machine learning experts. Both types of models have been created with one goal in mind: to operate in environments without a lot of labeled data. This is the nature of blockchain analysis models. While blockchains provide an incredibly data-rich environment for machine learning models, the fact that the data is semi-anonymous introduces a level of challenge that can’t be overcome by most data intelligence applications. Not surprisingly, most blockchain analytics models focus on generic constructs like addresses or transactions that offer limited intelligence about the behavior of crypto-assets. The lack of labeled datasets that qualify the information in blockchain networks remains an important roadblock for the introduction of machine learning models and, as a result, we need to rely on methods that can operate efficiently under those circumstances.
Understanding the Lack of Labeled Data Challenge in Blockchain Analytic Applications
Suppose that we are trying to create a machine learning model that tackles a traditional task in blockchain analytics applications, such as identifying malicious actors or centralized exchanges. The traditional school of thought would dictate the following workflow:
1) Create a model based on a specific technique such as classification or linear regression.
2) Train the model using a dataset that includes malicious transactions or centralized exchanges.
3) Test the model against a similar dataset to evaluate its effectiveness.
4) Apply the model against a new blockchain dataset in order to identify records that match our target classification.
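As an illustration, the four steps above can be sketched with scikit-learn. The "transaction" features and "centralized exchange" labels below are synthetic stand-ins I made up for the sketch; the point is the workflow, not any real blockchain data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Steps 1-2: a classification model trained on a labeled dataset.
# Synthetic features stand in for real transaction data.
X = rng.normal(size=(500, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # 1 = "centralized exchange"

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

# Step 3: test against a similar held-out dataset.
accuracy = model.score(X_test, y_test)

# Step 4: apply the model to new, unseen records.
new_records = rng.normal(size=(10, 4))
predictions = model.predict(new_records)
```

Note that the whole workflow hinges on that first labeled dataset existing, which is exactly the assumption that breaks down in blockchain settings.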
This traditional supervised learning workflow sounds great in theory, but what happens when we don’t have a large labeled dataset to train our model upfront? After all, the number of high-quality labeled datasets in the blockchain space is extremely limited.
The challenge of limited labeled data is not unique to blockchain analytics applications. Many scenarios in machine learning depend on large-volume labeled datasets which, quite often, makes them unsustainable to scale in real-world settings. This was the challenge that inspired companies like Google, Microsoft, and Facebook to look for alternative ideas in machine learning research. Among those ideas, semi-supervised and omni-supervised learning techniques have emerged as some of the most viable models for operating with limited labeled datasets.
In the spectrum of machine learning models, semi-supervised learning sits in the middle between supervised and unsupervised learning. Conceptually, semi-supervised learning tries to accomplish the training of machine learning models using a small amount of labeled data and a larger amount of unlabeled data. The origins of semi-supervised learning can be traced back to the 1970s with the work of scientists like Vladimir Vapnik. However, the seminal paper in semi-supervised learning was published in 1995 by Joel Ratsaby and Santosh S. Venkatesh, in which they investigated the tradeoff between labeled and unlabeled datasets when learning a classification rule.
The main objective of semi-supervised learning is to overcome the drawbacks of both supervised and unsupervised learning. Supervised learning requires a huge amount of training data to classify the test data, which is a costly and time-consuming process. On the other hand, unsupervised learning doesn’t require any labeled data; it clusters the data based on the similarity among data points using either a clustering or maximum likelihood approach. The main downfall of this approach is that it can’t classify unknown data accurately. To overcome these issues, the research community proposed semi-supervised learning, which can learn from a small amount of training data and then label the unknown (or test) data. Semi-supervised learning builds a model with a few labeled patterns as training data and treats the rest of the patterns as test data. Semi-supervised learning models draw proximity associations between unlabeled data records and groups of labeled data.
The relationship between supervised and semi-supervised models can be illustrated in the following figure. In low-data regimes, semi-supervised learning can, indeed, yield significant performance gains during training compared to supervised alternatives.
Going back to our blockchain classification scenario, semi-supervised learning can be used to train a model using a dataset that contains some labeled addresses and a large volume of unlabeled ones. The semi-supervised model will draw associations between those records, creating a richer training dataset. In practice it is far more complex, but hopefully you get the idea.
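One hedged way to picture this scenario is with scikit-learn's SelfTrainingClassifier, which iteratively pseudo-labels its own confident predictions. The per-address features and the "exchange-like" labeling rule below are entirely invented for illustration, not a real address dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

rng = np.random.default_rng(1)

# Invented per-address features, e.g. [tx_count, avg_value, unique_peers].
X = rng.normal(size=(300, 3))
true_y = (X[:, 0] > 0).astype(int)   # 1 = "exchange-like" (synthetic rule)

y = np.full(300, -1)                 # most addresses start unlabeled
labeled_idx = rng.choice(300, size=20, replace=False)
y[labeled_idx] = true_y[labeled_idx] # only 20 labeled addresses

# The base classifier repeatedly pseudo-labels confident predictions
# and retrains, growing the effective training set.
clf = SelfTrainingClassifier(LogisticRegression(), threshold=0.8).fit(X, y)
preds = clf.predict(X)
```

Here the 20 labeled addresses bootstrap a classifier over all 300, which is the "richer training dataset" effect described above, in toy form.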
Omni-supervised learning is a fairly recent machine learning technique pioneered by the Facebook AI Research (FAIR) group. Conceptually, omni-supervised learning tries to address some of the practical limitations of semi-supervised approaches. Part of the challenge is the fact that semi-supervised learning techniques depend on simulated labeled/unlabeled data produced by splitting a fully annotated dataset, and are therefore likely to be upper-bounded by fully supervised learning with all annotations. In other words, semi-supervised learning will only be as good as the equivalent fully supervised learning method running against a labeled dataset.
To address some of the challenges of semi-supervised learning models, the omni-supervised approach relies on a concept called data distillation. The data distillation method attempts to “distill” knowledge from unlabeled data without the requirement of training a large set of models. Conceptually, data distillation involves four steps:
(1) training a model on manually labeled data (just as in normal supervised learning).
(2) applying the trained model to multiple transformations of unlabeled data.
(3) converting the predictions on the unlabeled data into labels by ensembling the multiple predictions.
(4) retraining the model on the union of the manually labeled data and automatically labeled data.
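The four steps above can be sketched in a toy form. Here, simple Gaussian feature perturbations stand in for the geometric transformations used in FAIR's original computer-vision work, and all data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Step 1: train a model on a small manually labeled dataset.
X_lab = rng.normal(size=(50, 4))
y_lab = (X_lab[:, 0] > 0).astype(int)
model = LogisticRegression().fit(X_lab, y_lab)

# Step 2: apply the trained model to multiple transformations of the
# unlabeled data (small Gaussian perturbations as a stand-in).
X_unl = rng.normal(size=(500, 4))
votes = [model.predict_proba(X_unl + rng.normal(0, 0.05, X_unl.shape))
         for _ in range(5)]

# Step 3: ensemble the multiple predictions into automatic labels.
y_auto = np.mean(votes, axis=0).argmax(axis=1)

# Step 4: retrain on the union of manual and automatic labels.
distilled = LogisticRegression().fit(
    np.vstack([X_lab, X_unl]), np.concatenate([y_lab, y_auto]))
```

The key design choice is step 3: averaging predictions over several transformed views of the same record yields labels that are more reliable than any single prediction, so a single student model can absorb them without training a large ensemble.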
In the context of blockchain analysis, omni-supervised learning can be applied to the same scenarios as the semi-supervised approach.
Both semi-supervised and omni-supervised learning models look promising for addressing some of the labeled data challenges in blockchain datasets. Practically speaking, semi-supervised approaches have seen many more real-world applications and are easier to implement using the current generation of deep learning frameworks. However, some of the ideas pioneered by Facebook with omni-supervised learning look incredibly promising. If nothing else, blockchain datasets offer a blank canvas to experiment with all sorts of interesting machine learning methods that could produce new levels of intelligence in this emerging asset class.