Using Machine Learning to Understand the Ethereum Blockchain

ConsenSys Media · Apr 3, 2018 · 8 min read

Find out how ConsenSys projects like Alethio and Rakr are using AI to make sense of decentralized data.

by Paul Lintilhac, Quantitative Developer at ConsenSys

One of the most active fields in data science at the moment is machine learning, a form of AI that uses algorithms to study large sets of data. It's used for everything from sequencing DNA to studying financial markets and brain-machine interfaces. There are many different kinds of machine learning, with differing data requirements and objectives. In the past year, ConsenSys has made a push to develop its analytics and data science capabilities with projects like Alethio, an analytics platform that helps users visualize, interpret, and react to blockchain data in real time.

The immutable, public records and decentralized nature of blockchain networks provide an exciting sandbox for data scientists, offering a whole new world of data to analyze and patterns to recognize. To understand how we pull meaning out of this seemingly chaotic data environment, we'll start by describing the two main categories of machine learning being developed by data scientists at ConsenSys, and give a few examples of how each can be applied in practice.

Supervised vs. Unsupervised Learning

Unsupervised learning involves finding patterns in large datasets and using them to extract meaning. Unsupervised learning models are not predictive in nature, though they could play a role in a larger predictive modeling system. Rather, unsupervised learning seeks to reduce a large and complex dataset to simpler high-level patterns or themes. These themes can then be used as a reference to characterize individual data points and put them into a useful context.
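
To make this concrete, here is a minimal sketch of what reducing a dataset to a few themes can look like: clustering accounts by simple behavioral features. The features, data, and use of k-means are illustrative assumptions, not Alethio's actual pipeline.

```python
# Minimal sketch: reducing account features to a few high-level "themes"
# with k-means clustering. The features and data are hypothetical, purely
# to illustrate unsupervised pattern discovery.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical per-account features: [balance (ETH), txs per day, mean tx value (ETH)]
accounts = np.array([
    [1200.0, 450.0, 2.1],   # behaves like an exchange hot wallet
    [0.4,      0.1, 0.05],  # behaves like a dormant retail account
    [3.2,     30.0, 0.2],   # behaves like a bot or relayer
    [950.0,  400.0, 1.8],
    [0.6,      0.2, 0.04],
])

# Scale features so no single unit dominates the distance metric.
X = StandardScaler().fit_transform(accounts)

# Reduce the dataset to k high-level clusters ("themes").
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # cluster id per account
```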

Anomaly and novelty detection systems are examples of unsupervised learning models. By reducing a large dataset into a small number of common themes, one can learn what it means for a particular transaction or account to be "normal." By comparing any given transaction or account to this learned definition of normal, we can determine the extent to which it is anomalous compared to the global average (anomaly detection), or compared to a recent historical average (novelty detection). These anomaly detection systems can then be used to alert users when anything unusual is happening on the whole blockchain, or within a particular subset of interesting accounts or transactions. Alethio currently offers an anomaly detection system for transactions, blocks, and accounts.
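
One simple way to express the difference between the two in code is with z-scores: score a new observation against statistics computed over the full history (anomaly detection) or over a recent window only (novelty detection). The feature, data, and threshold below are illustrative, not Alethio's production system.

```python
# Toy anomaly vs. novelty detection on a single transaction feature
# (e.g. transfer value in ETH), using z-scores.
import numpy as np

history = np.array([0.8, 1.1, 0.9, 1.2, 1.0, 0.7, 1.3, 0.9, 1.1, 1.0])  # past transfer values
new_value = 12.0       # incoming transfer to score
THRESHOLD = 3.0        # flag anything more than 3 standard deviations out

def zscore(x, sample):
    return abs(x - sample.mean()) / (sample.std() + 1e-9)

# Anomaly detection: compare against the global average over the whole history.
is_anomaly = zscore(new_value, history) > THRESHOLD

# Novelty detection: compare against a recent historical window only.
recent = history[-5:]
is_novelty = zscore(new_value, recent) > THRESHOLD

print(is_anomaly, is_novelty)  # the two flags can disagree when recent behavior drifts
```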

Other kinds of analysis offered by Alethio that could arguably be considered unsupervised learning include ranking algorithms and influence analysis such as PageRank. While these are not commonly referred to as machine learning algorithms at all (rather, just algorithms), they serve the same purpose of finding overall patterns in a dataset and using them to add context.
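
As an illustration, an influence analysis of this kind can be run over the graph of value transfers between accounts. The edge list below is hypothetical, and the sketch uses the networkx library rather than any Alethio-internal tooling.

```python
# Sketch of an influence-style ranking over a tiny, hypothetical account graph.
# Each edge means "account A sent value to account B"; PageRank then scores
# accounts by how much "attention" flows into them.
import networkx as nx

transfers = [
    ("0xAlice", "0xExchange"),
    ("0xBob", "0xExchange"),
    ("0xCarol", "0xExchange"),
    ("0xExchange", "0xColdWallet"),
    ("0xBob", "0xCarol"),
]

G = nx.DiGraph(transfers)
scores = nx.pagerank(G, alpha=0.85)

# Accounts that receive value from many others rank higher.
for account, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(account, round(score, 3))
```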

Supervised learning seeks to take a set of observations with known features and use them to estimate the corresponding value of some other variable (a response or label) for each observation. This can be broken down into two common categories: prediction and classification. Using historical data to estimate the future value of a variable (a response) is known as prediction. Using existing data about an entity to determine whether that entity belongs to a certain category (assigning a "label") is known as classification.
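
As a toy illustration of the prediction case, one could fit a regression to an account's historical balances and extrapolate forward. The numbers below are made up, and a real model would use far richer features.

```python
# Toy "prediction" example: fit a line to an account's historical daily
# balances and estimate its balance a few days into the future.
import numpy as np
from sklearn.linear_model import LinearRegression

days = np.array([[0], [1], [2], [3], [4], [5]])             # observation times (days)
balances = np.array([10.0, 10.4, 11.1, 11.5, 12.2, 12.6])   # ETH balance on each day

model = LinearRegression().fit(days, balances)
print(model.predict([[8]]))  # estimated balance three days past the last observation
```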

Generally speaking, the "knowns" on the blockchain consist of raw, protocol-level data that is available on-chain, such as transaction data. This raw data can be used to extract features for accounts, such as their total balance, average transaction frequency, average age of currency held, etc. Recent efforts by Alethio to augment protocol-level data with semantic lifting have expanded the set of "knowns" beyond the protocol layer to include application-level data, such as whether a contract is a token and, if so, which standard it complies with. All of these known quantities can be used as the basis for features in a supervised learning model.
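
To illustrate, a minimal feature-extraction pass over a flat table of raw transactions might look like the sketch below. The column names and derived features are assumptions for the example, not Alethio's actual schema.

```python
# Hypothetical feature extraction: derive per-account features from a flat
# table of raw transactions. The schema (from, to, value_eth, timestamp) is
# an assumption for illustration only.
import pandas as pd

txs = pd.DataFrame({
    "from":      ["0xA", "0xB", "0xA", "0xC", "0xA"],
    "to":        ["0xB", "0xC", "0xC", "0xA", "0xB"],
    "value_eth": [1.0,    0.5,   2.0,   0.2,   0.8],
    "timestamp": pd.to_datetime([
        "2018-03-01", "2018-03-02", "2018-03-05", "2018-03-07", "2018-03-09",
    ]),
})

# Per-account aggregates over outgoing transactions.
sent = txs.groupby("from").agg(
    total_sent_eth=("value_eth", "sum"),
    tx_count=("value_eth", "size"),
    first_tx=("timestamp", "min"),
    last_tx=("timestamp", "max"),
)
# Total value received per account.
received = txs.groupby("to")["value_eth"].sum().rename("total_received_eth")

features = sent.join(received, how="outer")
features["total_received_eth"] = features["total_received_eth"].fillna(0.0)
features["net_flow_eth"] = features["total_received_eth"] - features["total_sent_eth"]
features["active_days"] = (features["last_tx"] - features["first_tx"]).dt.days

print(features)
```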

On the other hand, the unknown quantity (the label or response) is by definition not a piece of currently available on-chain data; otherwise it would already be known and captured by our data pipelines. The unknown quantity might be the future value of some on-chain data, such as the balance of an account on some future date. More commonly, the unknown quantity is some value that is never available on-chain at all. If you are trying to predict whether an account belongs to some category, such as being a decentralized exchange, a DoS-related account, or a Ponzi scheme, you will need to look off-chain for this data.

The ETHStats dashboard, tracking blockchain data in real time.

The Importance of Datasets

This is where the data requirements for supervised learning on the blockchain become an important problem (read: opportunity!). In order to train and calibrate a supervised learning model, there must be some large initial set of data for which the value of the labels or responses is known. Training calibrates the model so that the predicted and actual responses are as close as possible. This means that when a new observation comes in where the response is unknown, the prediction will be close to the true value, assuming the new observation is generated by a process similar to the one that generated the original dataset. Once the training phase is complete and the model is calibrated, it can be applied to new observations where the response is unknown.

In the case of price prediction, this means having a large database of historical prices. In the case of classification of accounts, this means having an initial set of accounts that are already labeled as being a decentralized exchange, a DoS-related account, or a Ponzi scheme.
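
To make the train-then-apply flow concrete, here is a minimal sketch with scikit-learn. The features, labels, and numbers are placeholders for the kind of labeled account data described above, not a real dataset or Alethio's actual model.

```python
# Minimal sketch of the supervised workflow: calibrate a classifier on
# accounts whose labels are already known, then apply it to a new,
# unlabeled account. All features and labels are made up for illustration.
from sklearn.ensemble import RandomForestClassifier

# Per-account features: [balance_eth, txs_per_day, mean_tx_value_eth, age_days]
X_train = [
    [15000.0, 900.0, 1.2, 700],   # labeled "exchange"
    [12000.0, 750.0, 0.9, 540],   # labeled "exchange"
    [300.0,    60.0, 5.0,  40],   # labeled "ponzi"
    [250.0,    45.0, 4.2,  30],   # labeled "ponzi"
    [2.0,       0.3, 0.5, 400],   # labeled "other"
    [1.5,       0.1, 0.4, 600],   # labeled "other"
]
y_train = ["exchange", "exchange", "ponzi", "ponzi", "other", "other"]

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)   # training / calibration phase

# Apply the calibrated model to a new account whose label is unknown.
new_account = [[280.0, 50.0, 4.5, 35]]
print(model.predict(new_account))        # predicted label
print(model.predict_proba(new_account))  # class probabilities
```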

In these classification examples, the labels in the dataset used for training are often only available through significant effort. Possibilities include pulling data from websites like CoinMarketCap or Etherscan, building ETLs to import interesting data from other blockchain businesses, or relying on the painstaking effort of trained research assistants who gather data about on-chain accounts by surfing the web and analyzing source code.

The realization of the importance of gathering external data about accounts (metadata) for the purposes of machine learning was the motivation for creating a new spoke at ConsenSys called Rakr. Through collaboration with Alethio and other spokes and services within the mesh, Rakr hopes to provide a platform for gathering and sharing this valuable metadata. While the implications of integrating blockchain metadata with raw on-chain data go far beyond machine learning, the applicability of this metadata for supervised machine learning will continue to be a primary use case for the Rakr platform. By combining Alethio’s powerful analytics platform with the valuable metadata provided by Rakr, the applications of data science at ConsenSys will be limited only by the imagination.

In Practice

The first example of a supervised learning model produced at ConsenSys was the Ponzi model developed by Alethio, which will be described in more detail in the sequel to this article. The development of this model lays the groundwork for many future analytics possibilities for Alethio, which hopes to expand it into a more general fraud model in the near term.

More generally, the feature extraction pipelines built during this model development effort can be reused to classify any account according to one of the labels in the Rakr database, including whether an account or contract is an exchange, an art DAO, a casino, a DoS-related account, and much more. As the set of interesting metadata provided by Rakr continues to grow, new models will become possible. And as the analytics capabilities of Alethio grow and more useful features are created, these models will become more powerful and versatile.

Knowing whether a given account is fraudulent or related to a DoS attack is crucial for managing financial and network risk on the Ethereum network. If we want to productionize models that provide actionable insights about new accounts and very recent behavioral data, they must satisfy special requirements. For example, we must make sure that they are updated in real time, and that the features being used for classification and prediction are reliable and complete at the time the model is run. This means that certain features that can be used for classification of "old" accounts, such as "whether a contract eventually self-destructed," cannot be applied to accounts in real time. Since the value of the feature may change in the future, its true value is not really known at the time the model is run.
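
One simple way to enforce this in a pipeline is to tag every feature with whether its value is already final when the model runs, and to keep only the real-time-safe ones for live classification. The feature names and structure below are hypothetical.

```python
# Hypothetical sketch: mark each feature with whether its value is final at
# prediction time, and keep only real-time-safe features for live scoring.
ACCOUNT_FEATURES = {
    "total_balance_eth":          {"final_at_prediction_time": True},
    "avg_tx_frequency":           {"final_at_prediction_time": True},
    "avg_age_of_currency_held":   {"final_at_prediction_time": True},
    # Only knowable once the contract's full lifetime has been observed:
    "eventually_self_destructed": {"final_at_prediction_time": False},
}

def realtime_safe(features: dict) -> list:
    """Return the names of features whose true value is known when the model runs."""
    return [name for name, meta in features.items() if meta["final_at_prediction_time"]]

print(realtime_safe(ACCOUNT_FEATURES))
# ['total_balance_eth', 'avg_tx_frequency', 'avg_age_of_currency_held']
```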

Real-time machine learning models present unique challenges and opportunities that go beyond those of historical modeling techniques. With that said, the ability to classify accounts as frauds goes beyond real-time risk management; classification models can still be valuable even if they are applied “in the past”. Being able to accurately classify historical frauds is useful for research purposes, even if those accounts are no longer active. More generally, attaching tags to accounts on the blockchain allows users to define semantically interesting subsets of accounts on the blockchain (such as “ICOs” or “exchanges”), rendering the blockchain searchable based on criteria that humans care about.

Creating a database of empirical human knowledge about on-chain entities is already a valuable and challenging task, and a necessary foundation for many other products and services. But with over 30,000,000 Ethereum accounts and contracts to date and roughly 100,000 new accounts created every day, it is simply impossible for humans to tag the entire history of Ethereum accounts, most of which have no useful information (such as contract source, a website, or any other identifying information) that humans could use to classify or tag them. This is why machine learning models are crucial: they scale far beyond what human tagging can, and can classify accounts using only the raw data characterizing their on-chain behavior.

By augmenting human knowledge about the blockchain with powerful analytics and machine learning, we envision a blockchain where every account and entity is enriched with useful classifications and properties, whether empirical and created by humans, or predicted and created by statistical models. This will be a major step forward for the transparency and accessibility of knowledge on the blockchain, both of which are essential for blockchain technology to flourish.

Keep an eye out for the next article by Paul Lintilhac, which will give an exposition of one of Alethio’s recent data science initiatives: the Ponzi Model.

Disclaimer: The views expressed by the author above do not necessarily represent the views of Consensys AG. ConsenSys is a decentralized community with ConsenSys Media being a platform for members to freely express their diverse ideas and perspectives. To learn more about ConsenSys and Ethereum, please visit our website.
