The Elliptic Data Set: opening up machine learning on the blockchain
Elliptic has recently released the Elliptic Data Set, a graph of 200,000 partially labeled bitcoin transactions. Here we give some more background and details about the data set, with the hope that it will inspire the academic and crypto community to help build a safer financial system based on crypto currencies.
The bitcoin blockchain as a transaction graph
There are several possible ways of representing the bitcoin blockchain as a graph. Arguably the easiest one is a graph where the nodes represent transactions and the edges the flow of bitcoins between one transaction and the other.
In this representation the bitcoin blockchain is a directed acyclic graph (DAG). Excluding edge cases (where multiple outputs of a transaction are subsequently spent as inputs of the same transaction), the in-degree of a node gives the number of inputs of a transaction, the out-degree the number of outputs that have been spent. The only nodes without incoming edges are coinbase transactions, which give to miners the block rewards plus additional transaction fees and inject new currency into the system.
A time stamp is associated with each transaction, representing an approximate time when the transaction is broadcasted to the bitcoin network. This allows for the inclusion of the temporal information in a graph visualization, as in Figure 1 below,
Anyone running a bitcoin node has access to all the transactions in the blockchain and can thererore build the entire graph from it. As of today, the bitcoin transaction graph is made of more than 438 million nodes and 1.1 billion edges. It is an ever growing graph, considering that there are about 350,000 new confirmed bitcoin transactions every day.
The Elliptic Data Set
The Elliptic Data Set is a sub-graph of the bitcoin graph, made of 203,769 nodes and 234,355 edges. Together with the graph information, it also categorizes the nodes into three classes: “licit”, “illicit” or “unkown”. A node is deemed “licit” / “illicit” if the corresponding transaction has been created by an entity that belongs to a licit (exchanges, wallet providers, miners, financial service providers, etc.) or illicit (scams, malware, terrorist organizations, ransomware, Ponzi schemes, etc.) category respectively.
The task on the dataset is to classify the illicit and licit nodes, given a set of features and the graph topology. Since not all the nodes are labeled, the problem can be approached in a semi-supervised setting that includes information carried by the unlabeled nodes.
Nodes and edges
Two percent (4,545) of the nodes are labelled class1 (illicit); twenty-one percent (42,019) are labelled class2 (licit). No information is given on the other nodes, which are classified as “unknown”.
There are 166 features associated with each node. The temporal information is encoded by a time step running from 1 to 49, a measure of the actual transaction time stamp. The time steps are evenly spaced with an interval of about two weeks; each one of them contains a single connected component of transactions that appeared on the blockchain within less than three hours between each other. There are no edges connecting the different time steps. Figure 2 gives an example visualization of the connected components in the data set,
Each connected component is composed of between 1,000 and 8,000 nodes. Most nodes belong to the “unknown” class — as shown in Figure 3.
Out of the 166 features, the first 94 represent local information about the transaction — including the time step described above, number of inputs/outputs, transaction fee, output volume and aggregated figures such as average BTC received (spent) by the inputs/outputs and average number of incoming (outgoing) transactions associated with the inputs/outputs. The remaining 72 features represent nonlocal (graph) information in the form of aggregated features, obtained using transaction information one-hop backward/forward from the center node — giving the maximum, minimum, standard deviation and correlation coefficients of the neighbour transactions for the same information data (number of inputs/outputs, transaction fee, etc.).
Machine learning on the blockchain
The release of the Elliptic Data Set coincides with a new tutorial paper Elliptic data scientists have co-authored with researchers from the MIT-IBM Watson AI Lab . The paper, entitled, “Anti-Money Laundering in Bitcoin: Experiments with Graph Convolutional Networks for Financial Forensics,” has been presented by IBM Research Staff Members Mark Weber and Giacomo Domeniconi at the Anomaly Detection in Finance workshop of the Knowledge Discovery and Data Mining Conference (KDD) on August 5, 2019 — a leading AI conference.
The paper shares early experimental results using a variety of methods, from standard classification models (Logistic Regression, Random Forest, Multi Layer Perceptron) to the more sophisticated Graph Convolutional Networks (GCN).
GCN are a fairly new class of machine learning algorithms. This field of research has seen an impressive progress over the course of a few years, inspired by some key developments in spectral graph theory  and the realization that these could be used in neural network applications . These algorithms allow to naturally include graph information into the learning process of a model and combine this with the power of a neural network architecture.
The analysis in  compares in detail all these various models, and studies the effect of including graph information — in a tabular format — in standard classification methods. Several conclusions can be drawn from this work, in particular that:
- the inclusion of nonlocal information — specifically, information about the neighbours of a center node — always improves the performance of the models;
- standard (non graph-based) classification models all benefit from using additional features given by the GCN embeddings;
- Random Forest appears to be the best classification model for this task.
That GCN are not the best-performing models is an interesting (if not surprising) finding. More specifically, on the Elliptic Data Set GCN performs significantly better than Logistic Regression but worse than Random Forest. As suggested in , this invites the idea of combining the power of graph-learning algorithms (such as GCN) with decision trees or random forest models.
Another challenge that the Elliptic Data Set poses is in its temporal dynamics with the emergence/disappearence of new entities in the blockchain. An example is at time step t = 43, which follows a Dark Market shutdown. Figure 4 shows that none of the models performs well after this event, including the most recent EvolveGCN which is designed to capture the system dynamics of a graph . Solving problems like this one, which may be interpreted as sudden changes in the underlying behavior of a system, represent some major challenges for the community.
 D. K. Hammond, P. Vandergheynst, R. Gribonval, Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, Elsevier, 30 (2), pp.129–150 (2011).
 T. N. Kipf and M. Welling, Semi-supervised classification with graph convolutional networks, arXiv:1609.2907 (2017).
 M. Weber, G. Domeniconi, J. Chen, D. K. I. Weidele, C. Bellei, T. Robinson, C. E. Leiserson, “Anti-Money Laundering in Bitcoin: Experimenting with Graph Convolutional Networks for Financial Forensics”, KDD ’19 Workshop on Anomaly Detection in Finance, August 2019, Anchorage, AK, USA.
 A. Pareja, G. Domeniconi, J. Chen,T. Ma,T. Suzumura, H. Kanezashi, T. Kaler, and C. E. Leiserson, EvolveGCN: Evolving Graph Convolutional Networks for Dynamic Graphs, arXiv:1902.10191 (2019).