Etherie v0.1, Information Extraction on Ethereum Smart Contracts

Published in Madiba Security Group, Mar 8, 2020

Over the past decade, growth in computational power, digital storage capacity, and new network protocols has allowed huge corpora of written documents to accumulate on the web, from source-code repositories to literature and scientific papers. These repositories of semi-structured text are free resources from which knowledge can be extracted for educational or even industrial purposes.

Figure 1. Photo by Markus Spiske on Unsplash

One obvious example of such a free corpus is GitHub. Millions of commits from thousands of developers are freely available. Since reading the data is free of charge, it is only a matter of time before companies build Information Extraction engines to pull out key features of the development process, features that could be used to rank developers by their expertise or behavioural patterns.

Blockchain, Bitcoin vs Ethereum

Let’s look at blockchain technology from another angle. How much information is stored on a blockchain, and how can we extract and use it? What could we gain by applying textual Information Extraction techniques to a public blockchain?

In addition to open-source repositories such as GitHub, we are now entering an era of blockchain-based data repositories. From an Information Extraction point of view, not only is reading data from the blockchain free (or at least cheap), but the blockchain also provides two assurances:

The integrity of the data has not been compromised, so the output of an Information Extraction engine is more trustworthy.
People have paid to store the data (transactions, smart contracts, etc.) on the blockchain, so it is reasonable to assume a strong collective incentive to keep the documents clean and functional.

Let’s look at the two best-known blockchains. Bitcoin is an append-only ledger that keeps track of transactions. Its scripting capabilities are limited and largely unexploited, so it is fair to say that most of the ledger is a sequential chain of transactions, and the only information to extract from it is transaction properties and hard-to-analyse, tangled transaction graphs.

In contrast, the Ethereum blockchain hosts smart contracts, and the Solidity sources behind thousands of deployed bytecodes are available for auditing. In addition, Solidity was designed to be human-readable, and developers have an incentive to keep their code clean and unambiguous, both in its logic and in the choice of variable and function names. Intuitively, then, a corpus of Ethereum smart contracts should be a good candidate for standard Information Extraction techniques.

Etherie, Ethereum Information Extraction Engine

Figure 2. Etherie results on a sample query

Etherie v0.1 is my first attempt to extract information from these publicly available smart contracts. In a nutshell, it finds the smart contracts most similar to a query entered by the user. Fig. 2 shows the pilot front-end of the current version: a handful of words around a topic are entered, and the search engine returns the 50 most similar contracts from a corpus of 31,000 preprocessed Mainnet smart contracts.

So, what happens under the hood?

To analyse a set of human-written text documents, one can use Natural Language Processing (NLP) techniques. An NLP process can be described simply as a pipeline of iterative extraction, cleansing, and measurement steps over textual content, which uncovers hidden relationships among informative pieces of data and aggregates that knowledge to support more sophisticated tasks. In [1], Adam Geitgey explains the typical steps of an NLP pipeline and shows some sample Python implementations.

Because smart contracts are already clean and well structured, as argued above, a smart contract Information Extraction engine needs only a few basic steps. Etherie currently uses these four main steps:

Dumping: The available Solidity source code of smart contracts is dumped using Smart Contract Sanctuary [2]. This open-source script downloads Solidity smart contracts from the Etherscan and Etherchain websites.
Structural preprocessing: Using another open-source tool (the ConsenSys Python Solidity AST Parser [3]), I generated Abstract Syntax Trees (ASTs) of the dumped smart contracts and stored them in a NoSQL database (MongoDB) for later nested searches over the trees.
Natural Language preprocessing: While traversing the ASTs, a simple, straightforward NLP pipeline extracts the meaningful words in the function and variable names that developers used to express what different parts of the code do. The steps are Segmentation, Stop-Word Removal, Stemming, and Named Entity Recognition. From the extracted words, a Bag-of-Words model of each contract, its TF-IDF (Term Frequency-Inverse Document Frequency) representation, and an overall dictionary of all words used are generated.
Calculating similarity: The cosine similarity between the entered query (after simple preprocessing) and each contract’s TF-IDF representation is calculated, and a list of contracts sorted by similarity is presented to the user.
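
The last two steps can be illustrated with a minimal, standard-library-only Python sketch. This is not Etherie's actual code: the helper names (split_identifier, tokenize, build_index, search), the tiny stop-word list, and the three sample contracts are all hypothetical, and stemming and Named Entity Recognition are left out for brevity. The IDF weighting used here, log(N/df) + 1, is one common variant and is an assumption, since the exact formula used by Etherie is not stated.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"get", "set", "is", "the", "of", "to", "a"}


def split_identifier(name):
    """Segment a camelCase or snake_case identifier into lowercase words."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]


def tokenize(names):
    """Segment every name and drop stop words (no stemming in this sketch)."""
    words = []
    for name in names:
        words.extend(w for w in split_identifier(name) if w not in STOP_WORDS)
    return words


def build_index(contracts):
    """Build TF-IDF vectors from a dict of contract id -> identifier names."""
    bags = {cid: Counter(tokenize(names)) for cid, names in contracts.items()}
    n_docs = len(bags)
    # Document frequency: in how many contracts each word appears.
    doc_freq = Counter(word for bag in bags.values() for word in bag)
    # Assumed IDF variant: log(N / df) + 1.
    idf = {w: math.log(n_docs / df) + 1.0 for w, df in doc_freq.items()}
    vectors = {
        cid: {w: tf * idf[w] for w, tf in bag.items()}
        for cid, bag in bags.items()
    }
    return idf, vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def search(query, idf, vectors, top_k=50):
    """Rank contracts by cosine similarity to the query, best match first."""
    counts = Counter(tokenize(query.split()))
    # Query words that never occur in the corpus are ignored.
    q_vec = {w: tf * idf[w] for w, tf in counts.items() if w in idf}
    scored = [(cosine(q_vec, vec), cid) for cid, vec in vectors.items()]
    scored.sort(reverse=True)
    return [(cid, score) for score, cid in scored[:top_k] if score > 0.0]


# Hypothetical sample data: names that an AST traversal might have collected.
contracts = {
    "Token": ["totalSupply", "balanceOf", "transferFrom", "approve"],
    "Auction": ["highestBid", "bidder", "auctionEnd", "withdraw"],
    "Wallet": ["owner", "deposit", "withdraw", "balance"],
}
idf, vectors = build_index(contracts)
results = search("token transfer balance", idf, vectors)
for contract_id, score in results:
    print(contract_id, round(score, 3))
```

In this toy corpus the query shares two words with the Token contract and one with Wallet, so Token is ranked first and Auction, which shares none, is dropped. For a real corpus of tens of thousands of contracts, a sparse-matrix implementation from an established library would likely be a better fit than plain dictionaries.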

Related works

There are many Information Extraction studies, each designed for a different context. The LexNLP library [4] is a fully implemented Information Extraction pipeline that works on legal and regulatory corpora. In [5], Adnan and Akbar proposed a generic information extraction system that also takes advantage of other data types such as visual resources. The task in these studies is to process unstructured documents. In our case, however, Solidity syntax and developers’ incentives keep smart contracts almost fully structured, so information extraction tasks are cheaper and more accurate.

Future works

Adding other textual mappings, such as Latent Dirichlet Allocation (LDA), or semi-supervised techniques to improve ranking accuracy.
Clustering Abstract Syntax Tree sub-graphs to enable search based on a contract’s logic.
Clustering the bytecode of smart contracts. In contrast to Solidity source code, smart contract bytecode is available through the EVM itself, so the size of the study domain would differ hugely (from tens of thousands to millions of samples). However, bytecode contains no additional data for the NLP pipeline that has been proposed.

Conclusion

Etherie v0.1 is an attempt to take advantage of freely available smart contracts by making already-developed source code searchable. It is a tool that beginner Solidity developers (like me) can use to find code patterns around arbitrary topics.

References

[1] https://medium.com/@ageitgey/natural-language-processing-is-fun-9a0bff37854e

[2] https://github.com/tintinweb/smart-contract-sanctuary/tree/master/utils

[3] https://github.com/ConsenSys/python-solidity-parser

[4] https://github.com/LexPredict/lexpredict-lexnlp

[5] Adnan, K., Akbar, R. An analytical study of information extraction from unstructured and multidimensional big data. J Big Data 6, 91 (2019). https://doi.org/10.1186/s40537-019-0254-8

Written by Mahdi Nejadgholi