Can we detect anomalous activity in the Ethereum Transaction Network using machine learning algorithms?
Ethereum was first announced in 2015 as an entirely decentralized platform. Unlike other cryptocurrencies such as Bitcoin, it supports not only fund transfers between two parties but is also designed to execute and store code in a distributed manner. Ethereum is an open-source, public platform on which smart contracts can be created and deployed. A smart contract allows transactions to be completed without the intervention of intermediaries; such transactions are irreversible and fully traceable. Ether is the native cryptocurrency of the Ethereum platform and can be transferred between users by sending Ether from one user’s address to another. The transactional data in the Ethereum network contains the users’ addresses, but their real identities are masked.
In the Ethereum blockchain, pseudonymity refers to this state of masked identity. In economic networks, fraudulent activities are generally anomalous in behaviour, and members of a network want to detect anomalies to prevent them from damaging the network. It is difficult for government agencies to detect inconsistent patterns on the blockchain, and it is no surprise that the decentralized and anonymous nature of these currencies makes such malicious activity more attractive.
Many machine learning techniques have been proposed to address this problem. In this thesis, we study anomalous behaviour and patterns in the Ethereum transaction network using machine learning algorithms. Our main focus is to discover which users and transactions are malicious or deceptive. To this end, we apply several supervised and unsupervised machine learning algorithms and evaluate the accuracy of each through experiments. This study is a good starting point for future research on detecting suspicious behaviour in the Ethereum Transaction Network.
Research Question: As stated in the title, “Can we detect anomalous activity in the Ethereum Transaction Network using machine learning algorithms?”
This question is subdivided into the following questions:
- How and from where do we obtain the Ethereum transaction data?
- Which machine learning algorithms or models do we use for our study?
- Which attributes can be selected as features for the clustering and classification problems?
- How do we evaluate the results obtained from the attribute-enabled training data-set?
Before moving on to the algorithms, we will first review the basic concepts of blockchain.
1.1 Business Problems
Before we start exploring the Ethereum concept in detail, you may be curious why the industry is so immersed in blockchain. The key is that a blockchain does not merely store data for an individual; it is advantageous for numerous people or groups, especially for parties who must cooperate and share data but do not fully trust each other.
So, let us consider the mechanisms currently used by enterprises such as banks, financial institutions, and supply-chain systems. Suppose that two or more independent organizations want to trade with each other. Before their partnership, each maintains its own data. If they have to work together, what are the possible ways for these individual, independent companies to share data with each other?
1.1.1 Fully Distributed Model
Most current programs follow the model shown in Figure 1.1, where all three institutions maintain their own separate data and communicate through web-service protocols. There are several problems with such a process [Debanjani Mohanty: Ethereum for Architects and Developers].
- Much of the information would be redundant, with every organization maintaining its own copy.
- Data across organizations might not stay in sync because of latency issues.
- Processes would be impractical, and settlements can be expensive.
1.1.2 Fully Centralised Model
We have seen the problems with the distributed model above; how can we move towards a completely shared model that is unbiased and secure for all of these institutions? As shown in Figure 1.2, most institutions achieve this by delegating this responsibility to a third party, which acts as a common platform for the group of companies to store and share their data. Let us look at the issues here.
- It is an expensive approach, because third parties charge handling fees for their services.
- The third party may have disagreements with particular organizations, and a particular group may not accept the data for some reason.
- There could be legal issues governing data regulation.
So, what could be a medium through which several groups can share their data most efficiently, so that
- redundancy is nonexistent,
- data is synchronized across the whole network,
- settlement costs are lower, and
- auditing is easy?
The mechanism that emerged later is distributed ledger technology (DLT).
1.1.3 Decentralised Peer-to-Peer Model
Distributed ledger technology, as shown in Figure 1.3, is a mechanism that operates in a peer-to-peer fashion, which distinguishes it from the older models. With DLT, we can build applications and platforms whose ownership is shared across the network of participating companies, completely removing the need for a third party to handle the applications on our behalf.
- Common procedures and information are shared as a carefully designed single source of truth, which completely removes the need for traditional reconciliation, data translation, duplication, and redundancy.
- Data synchronization and consensus are provided by the DLT platform. Applications are built once, shared, and used by many groups.
1.2 Blockchain Concepts
As the name suggests, a blockchain is a set of blocks of data cryptographically linked to one another. Blockchain is a data structure that makes it possible to build a digital ledger of transactions and share it among a distributed network of computers. It provides a “trusted computing service through a distributed protocol, run by nodes connected over the Internet”.
It is a decentralized, transparent ledger that records transactions: the database is shared by all the nodes in the network, miners update it, everyone can observe it, and no one owns or controls it. It is like an immense online application that everyone can access to observe and authenticate the unique nature of digital transactions that transfer funds. A blockchain, unlike a conventional ledger, need not be limited to storing monetary transaction data. It can be used as a database for stock or trade across a range of industries such as supply chains, finance, or business. The resources registered, tracked, and observed can be hard assets, physical assets, or intangible assets such as votes or medical data. The blockchain allows the decentralization of all transactions of any type between all parties on a global basis.
The distributed ledger contains all transactions that have ever been executed. Once transactions are entered into the ledger, they cannot be modified. This gives rise to a tamper-proof ledger, as it is computationally infeasible to change or replace entries. Adding a transaction to the ledger requires all nodes on the network to acknowledge its validity. To reach this consistent decision, every node runs a consensus algorithm, and a transaction is accepted only if the majority of participants on the chain agree that it is valid. In a blockchain, miners validate transactions: they support the blockchain by approving, monitoring, and recording new transactions, and must approve a new entry before it can be appended to the ledger. This process differs from a conventional ledger, where transactions are submitted to a central party whose sole responsibility is validating, updating, and distributing the ledger.
When people speak of the blockchain, they are referring to one of two things: first, the blockchain network, which is the network of nodes or computers; second, the chain of blocks that makes up the distributed ledger. One main feature of blockchain is that it establishes a trustless environment. Blockchain is decentralized: the data is not stored on a central server. Instead, it is a common data structure, with every user having access to the same copy of the blockchain, allowing everyone to view the same replica of the ledger and see previous transactions.
1.3 Benefits of Blockchain
We must acknowledge that the blockchain was first offered through Bitcoin, a cryptocurrency, and it was conceived to address demands of digital currency that a conventional database cannot.
- The data written to a blockchain cannot be altered.
- It is a highly secure database that uses private and public keys for transactions.
- The database is publicly accessible for everybody to verify and add transactions.
- Since blockchain is decentralized, there is no downtime, and hence we can add transactions at any time from anywhere in the world.
- A blockchain can be public or private, as demanded by a particular organization or business, and hence it is flexible.
- The ledger is open for auditing at any time.
As of now, we know that the blockchain is a decentralized database where data is stored with a universal consensus among all the groups. Wikipedia says, “Consensus decision-making is a group decision-making process in which group members develop, and agree to support, a decision in the best interest of the whole. Consensus may be defined professionally as an acceptable resolution, one that can be supported, even if not the ‘favourite’ of each individual.” According to Merriam-Webster, consensus is defined first as general agreement and second as group solidarity of belief or sentiment. Consensus protocols are used in decentralized systems where there is no central authority to govern participants. The protocol is constructed so that it is not biased towards any member; the vote cast by each member must have equal weight. Figure 1.4 shows Bitcoin’s energy consumption compared with that of different countries. The consensus protocols for Ethereum are:
1.4.1 Proof of Work (POW)
Proof of work was introduced as the first consensus mechanism, with Bitcoin; the protocol was designed by Satoshi Nakamoto and is also used by many other cryptocurrencies. In POW, all miners try to solve a computationally hard mathematical problem in order to become the miner of the current block. The problem is as follows: find a nonce for which the SHA-256 hash of (nonce + block hash) starts with a given number of zero bits. The difficulty level is determined by the number of bits required to be zero, and so is the computation required to solve the problem. The main complication with Bitcoin is its energy consumption, which is higher than that of some small countries. Once a miner finds a solution, other miners start validating it until the network reaches agreement (51 percent or 90 percent, depending on the configuration). In other words, if forks are created because different miners follow different side chains, then the longest, fastest-growing chain is the most trustworthy; soon the others will start following that chain, and the remaining side chains are discarded.
Used By: Bitcoin, Ethereum
Advantages: safe, time-tested
Disadvantages: very slow, huge consumption of power
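The nonce search described above can be sketched in a few lines of Python. This is a simplified illustration of the idea, not Bitcoin's actual mining code (which double-hashes a full block header); the block hash and difficulty here are arbitrary example values.

```python
import hashlib

def mine(block_hash: str, difficulty_bits: int, max_nonce: int = 2**24):
    """Search for a nonce such that SHA-256(nonce + block hash)
    starts with `difficulty_bits` zero bits."""
    # Any hash value below this target has the required leading zero bits.
    target = 2 ** (256 - difficulty_bits)
    for nonce in range(max_nonce):
        digest = hashlib.sha256(f"{nonce}{block_hash}".encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce, digest.hex()
    return None, None  # no solution found within the search budget

# With 16 difficulty bits, roughly 65,000 hashes are expected on average.
nonce, digest = mine("0xabc123", difficulty_bits=16)
```

Doubling the difficulty by one bit doubles the expected work, which is why real networks consume so much power at high difficulty.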
1.4.2 Proof of Stake (POS)
Proof of stake consensus involves no mining, but it still authenticates blocks and adds them to the blockchain. This collateral-based consensus algorithm depends on a validator's economic stake in the network: each validator must own some share in the network by depositing money into it. In POS-based public blockchains, a set of validators take turns proposing and voting on the next block, and the weight of each validator's vote depends on the size of its deposit.
Used By: Ethereum’s next Casper model of consensus
Advantages: energy efficiency, security, and reduced risk of centralization
Disadvantages: more open to attack, since there is no computational factor, as in POW, to keep the network safe
1.4.3 Proof of Authority (POA)
POA is an upgrade of POS where identity is at stake instead of monetary value. In this consensus model, blocks and transactions are authenticated by approved accounts, termed validators. An individual gets the right to be an approving authority only after providing valid proof of identity. Hence, there is no need for mining.
Used By: Ethereum’s Parity
Advantages: security, high scalability, good performance, and no mining
Disadvantages: none
All currencies need some way to control supply and enforce security properties to prevent cheating. As the word suggests, cryptocurrencies make extensive use of cryptography. Cryptography provides a mechanism for securely encoding the rules of the cryptocurrency system in the system itself, and we can encode rules for the creation of new units of the currency into a mathematical protocol.
A cryptocurrency is a digital asset or currency that can be used as a medium of exchange. It is secured by cryptographic techniques, which are used to create digital coins and to prevent double-spending: if user A transfers user B five coins, user A cannot transfer user C the same five coins. Unlike fiat currencies, the security rules of cryptocurrencies need to be enforced technically, without relying on any central authority. There are various cryptocurrencies, and they are decentralized networks based on blockchains. By design, cryptocurrencies are not controlled by any central authority. Cryptocurrencies can be sent directly between two parties with the use of private and public keys, and this process incurs minimal processing fees, allowing users to avoid the steep fees charged by traditional financial institutions.
In simple terms, Bitcoin is a decentralized digital cryptocurrency without a central bank or single administrator. Bitcoin's initial release was in January 2009. At its most fundamental level, Bitcoin is a breakthrough in computer science, one that builds on 20 years of research into cryptographic currency and nearly 40 years of research in cryptography by thousands of scientists and researchers around the world.
Bitcoin follows the concept set out in a white paper by the pseudonymous software developer Satoshi Nakamoto, whose real identity is undisclosed. Satoshi proposed Bitcoin in 2008 as an electronic payment system based on mathematical proof. The idea was to produce a means of exchange, independent of any central body, that could be transferred electronically in a verifiable, secure, and immutable way.
A Forbes article says, “Ethereum is the first global blockchain platform that allows users to easily develop and deploy their decentralized and trustless applications on top of it. It has created incredible opportunities in the fintech space.” In this chapter, we will first see who started Ethereum and what the Ethereum Virtual Machine is. We will then discuss how the smart-contract idea came about, what can be written in such a contract, and which programming languages are used to write smart contracts. There are several programming languages for writing smart contracts, but we will discuss the most popular one, Solidity. After that, we will look at the gas concept. In the following sections, we discuss Ethereum accounts and their state types, then Ethereum mining and storage costs, and finally Ethereum data access and tokens.
Vitalik Buterin, a Russian-Canadian programmer and writer, proposed the technology back in 2013 with the release of his white paper, and Ethereum was launched in 2015. Buterin, who previously worked on Bitcoin, was unsatisfied with the way Bitcoin worked, so he came up with his enhanced version of the blockchain framework, called Ethereum.
Comparing Ethereum and Bitcoin is like comparing apples and oranges: both were introduced with different objectives. The sole objective of Bitcoin was to create an alternative digital currency and thus a payment and transaction system that is completely transparent and safe, whereas Ethereum was developed as a platform that facilitates peer-to-peer contracts and applications via its own cryptocurrency, called Ether. The primary purpose of Ether is to facilitate and monetize the operation of Ethereum and to empower developers to build and execute distributed applications (called Dapps). Ethereum also introduced a revolutionary new concept, the smart contract, programmed in a “Turing-complete” language.
2.1 Ethereum Virtual Machine
A Turing machine is an abstract mathematical model of computation invented by Alan Turing in 1936, defining a machine that manipulates symbols on a strip of tape according to a table of rules. It can simulate any algorithm, no matter how complicated. A machine, computer, or language is considered “Turing complete” if it can solve any problem that a Turing machine can, given an appropriate algorithm and the necessary time and memory.
The Ethereum Virtual Machine is Turing complete; this means it provides a foundation, via a programming language, with which we can write contracts that can solve any reasonable computational problem. The Ethereum Virtual Machine (EVM) powers Ethereum: it is a consensus-based virtual machine that decodes contracts compiled to bytecode and executes them on the Ethereum network nodes. The Ethereum blockchain network is a group of EVMs, or nodes, connected in a peer-to-peer fashion. Every node contains a copy of the entire blockchain data store and competes with other nodes to mine the next block by validating transactions. When a new block is created, it is added and propagated to the entire network, so that every node stays in sync.
2.2 Smart Contract
In 1994, a cryptographer named Nick Szabo came up with the idea of recording contracts in the form of computer code. Such a contract would be activated automatically when certain terms and conditions were met. This idea could potentially eliminate the need for trusted third-party companies (such as banks).
In simple terms, a smart contract is computer code between two parties which runs on top of the blockchain and consists of a set of rules agreed upon by the involved parties. Upon execution, if these predefined conditions are met, the smart contract executes itself to generate its output. This chunk of code allows decentralized automation by facilitating, verifying, and enforcing the conditions of an underlying agreement. Smart contracts allow you to exchange anything of value, including money, shares, and property.
Smart contracts are sets of sequential instructions, written in the programming language Solidity, which work on the basis of IFTTT logic, i.e. IF-THIS-THEN-THAT logic: if the first set of conditions is satisfied, the contract executes the next function, and so on, until it reaches the end of the contract.
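The IF-THIS-THEN-THAT chaining can be illustrated with a toy escrow agreement. Real Ethereum contracts are written in Solidity and run on the EVM; the Python sketch below, with hypothetical parties and amounts, only models the rule-based, self-executing logic.

```python
class EscrowContract:
    """Toy model of IFTTT-style contract logic (illustrative only;
    real contracts are Solidity code executed by the EVM)."""

    def __init__(self, seller: str, buyer: str, price: int):
        self.seller, self.buyer, self.price = seller, buyer, price
        self.deposited = 0
        self.delivered = False
        self.settled = False

    def deposit(self, amount: int) -> None:
        # IF the buyer deposits funds THEN the contract holds them.
        self.deposited += amount
        self._try_settle()

    def confirm_delivery(self) -> None:
        self.delivered = True
        self._try_settle()

    def _try_settle(self) -> None:
        # IF the full price is deposited AND delivery is confirmed
        # THEN payment is released automatically, with no intermediary.
        if self.deposited >= self.price and self.delivered:
            self.settled = True

contract = EscrowContract("0xSellerAddr", "0xBuyerAddr", price=5)
contract.deposit(5)
contract.confirm_delivery()
```

Each state change re-checks the conditions, so settlement fires as soon as the last condition becomes true, regardless of the order of events.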
2.3 Ethereum Data Access
Etherscan is one of the leading block explorers and analytics platforms for the Ethereum blockchain. A block explorer is essentially a search engine that allows users to easily look up, confirm, and validate transactions that have taken place on the Ethereum blockchain. Users can look up transactions and check whether they are pending or have been validated.
Andrew Collier developed and launched the ‘ether’ package for R in January 2018. This package connects to the Ethereum network through the JSON-RPC API (Application Programming Interface). It gives R users access to the Ethereum blockchain by querying blocks and their transactions. The ether package has predefined functions that let the user access information such as the most recent block number, the current gas price, the Ether balance of any given account, and the number of transactions associated with a particular address.
3. Data Collection
To retrieve the data-set for the Ethereum blockchain transaction network, we used the Etherscan APIs and CryptoScamDB. They provide secure and reliable access to Ethereum data. Parameters must be named as in the examples given below. The output of the Etherscan API calls is stored in JSON format. The data-set we study covers block 0 to block 9178295.
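As a sketch of how such a query is assembled, the snippet below builds an Etherscan `account`/`txlist` request URL for the block range studied here. The zero address and the `YourApiKeyToken` key are placeholders, and the exact parameter set should be checked against the Etherscan API documentation.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.etherscan.io/api"

def build_query(module: str, action: str, **params) -> str:
    """Assemble an Etherscan API query string.  'YourApiKeyToken' is a
    placeholder; a real key comes from an Etherscan account."""
    params.update(module=module, action=action, apikey="YourApiKeyToken")
    return f"{BASE_URL}?{urlencode(params)}"

# List normal transactions for one (illustrative, zero) address as JSON,
# over the block range used in this thesis.
url = build_query("account", "txlist",
                  address="0x0000000000000000000000000000000000000000",
                  startblock=0, endblock=9178295, sort="asc")
# The URL can then be fetched with urllib.request.urlopen or requests.get,
# and the JSON body parsed with the json module.
```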
3.1 Block structure in Ethereum
Every blockchain platform maintains its own block structure. The Ethereum block structure contains the items shown in Figure 4.1. Here is an example of a block: it contains the JSON-RPC response of the Etherscan API with the parameter eth_getBlockByHash, which returns information about a block given its hash. All the response parameters form a dictionary, i.e. a collection of key-value pairs. The figure shows the keys in dark red and the values in green. Table 4.1 explains the role of each attribute of the response; an empty field indicates that the attribute has no size in bytes.
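In such responses, quantities like the block number and timestamp arrive as hex strings and must be decoded before analysis. The sketch below does this for a trimmed, made-up response (the miner address and transaction hashes are illustrative, not real data):

```python
# A trimmed eth_getBlockByHash-style response; quantity fields are hex strings.
block = {
    "number": "0x8c0e67",
    "timestamp": "0x5e2f2b60",
    "miner": "0x5a0b54d5dc17e0aadc383d2db43b0a0d3e029c4c",  # illustrative address
    "gasUsed": "0x98961e",
    "transactions": ["0xaaa...", "0xbbb..."],  # illustrative tx hashes
}

def decode_block(raw: dict) -> dict:
    """Convert hex-encoded quantity fields to integers for analysis."""
    return {
        "number": int(raw["number"], 16),
        "timestamp": int(raw["timestamp"], 16),
        "miner": raw["miner"],
        "tx_count": len(raw["transactions"]),
        "gas_used": int(raw["gasUsed"], 16),
    }

decoded = decode_block(block)
```

The decoded integer fields can then be written to CSV rows for the statistical analysis described later.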
4. Machine Learning and Algorithms
In this case, the data needs pre-processing so that the algorithms can interpret it well. For example, in financial markets there is a constant need for price predictions for equipment or assets, and algorithms must be applied to gigantic amounts of unstructured data coming from many providers. So, in order to apply an algorithm to the data, we first need to transform the unstructured data into structured data. Once the data is structured and clean, it is ready for the algorithm. We apply machine learning models to the data and repeat the process until we obtain a model that is stable and good enough to be deployed. Once we deploy our model, applications start using it. The entire process needs to be repeated at regular intervals, because the situation and its influencing factors change and we need to stay up to date. The complete machine learning process is shown in Figure 4.1.
Machine learning is an approach in which we write a program, or instruct our computer, in such a way that the system can read and learn from the data. A more precise definition is provided by Tom M. Mitchell: “A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.”
4.1 Accessing and Storing the Data-set
As the Ethereum data-set is quite big, we focus on analyzing it over a certain time interval. In general, we choose periods in which there is intense activity on the Ethereum network. The purpose of selecting such an interval is to ensure that there is a prolonged stream of transactions, with blocks being created regularly. It might be possible to spot clusters in the data just by looking at it with the naked eye, and feeding trivially separable data to a clustering algorithm would be a waste of time and effort. Ensuring that the selected data is of large volume justifies running the K-means clustering algorithm, making it possible to detect anomalous behaviour within the data-set.
To access the Ethereum data, we use Python and retrieve the data through the JSON-RPC interface. The data is stored in CSV files. CSV (comma-separated values) stores data in tabular format and is compatible with almost every software package or editor for opening and visualization. CSV is also compatible with Python and R, which gives us flexibility in statistical analysis and computation. Ethereum has a huge database, with millions of transactions from 2015 to the present day, so it is very difficult to study the overall data-set and analyze the results. For the daily Ether token price and the number of transactions per day, we analyzed transactions on the Ethereum network for the last 30 days; for the clustering and classification studies, we use the last two years of data (2018-2020). Figure 6.1 shows the Ether price over the last month; the price peaked in 2018 at a value of nearly 1105. The value of Ether was 143.70 on 13th January 2020 and rose to 165.86; the last value recorded in the graph was 175.24 on 29th January 2020. From this we can conclude that the price of the Ether token is gradually increasing, so many users prefer the Ether token over other available tokens.
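The daily transaction counts used in these plots can be derived from the CSV export with the standard library alone. The column names below are illustrative stand-ins for the actual export schema:

```python
import csv
import io
from collections import Counter

# Small in-memory stand-in for the transaction CSV; column names are
# illustrative, not the exact schema used in the thesis.
sample = io.StringIO(
    "TimeStamp,FromAddress,ToAddress,ValueEther\n"
    "2020-01-15,0xaa,0xbb,1.5\n"
    "2020-01-15,0xcc,0xdd,0.2\n"
    "2020-01-25,0xee,0xff,3.0\n"
)

def daily_tx_counts(csv_file) -> Counter:
    """Count transactions per calendar day (first 10 chars of the timestamp),
    as needed for the daily-activity plots."""
    reader = csv.DictReader(csv_file)
    return Counter(row["TimeStamp"][:10] for row in reader)

counts = daily_tx_counts(sample)
```

The same counting applied to the full export yields figures such as the 698,399 transactions reported for 15th January 2020.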
4.2 Exploratory Data Analysis
The intent of the exploratory analysis was to establish the qualitative facts about the Ethereum transaction network and to use this analysis as a resource. Using the Anaconda Navigator software (a desktop graphical user interface), we launch Jupyter Notebook, which allows us to run numerous algorithms at once. Jupyter Notebook is pre-loaded with frameworks and libraries and also allows us to install any additional libraries we need. With the help of these libraries, we can produce specific results without writing our own custom functions, which saves computational time. After evaluating the Ethereum data-set, we plot the total number of transactions on the Ethereum network per day. We ran the analysis for the last 30 days; see Figure 6.2. The maximum number of transactions, 698,399, was recorded on 15th January 2020, meaning that the largest number of transactions between sender and receiver accounts occurred on that day. The lowest number of transactions was recorded on 25th January 2020, indicating that fewer user accounts were active on that date. According to the interpretation of the graph, the probability of suspicious transactions in the Ethereum network at that point was low.
What are the main categories and subcategories of the scams?
The data-set we retrieved was highly imbalanced with respect to malicious addresses: around 6911 malicious addresses out of 81 million. All of these 81 million addresses are externally owned accounts (EOAs). The scams mostly belong to the categories phishing, scamming, fake ICO, and malware. The largest number of malicious activities, 4623 in total, was marked under the category ‘Phishing’. ‘MyEtherWallet’ is the most common subcategory of phishing, with 2791 activities attributed to MyEtherWallet alone. Since MyEtherWallet is a free, open-source, client-side interface for generating Ethereum wallets (www.myetherwallet.com), many attackers target users’ private keys. The top 5 categories and subcategories are shown in Figure 4.4 and Figure 4.5.
Which tokens are most favoured by users?
The most favoured token captured during the time interval (January 2020) was ETH. By measuring the density of each token, we can see that ETH is the most popular, with 96.43% of the total token transactions; that is, the vast majority of users use the Ethereum transaction network for their exchanges. The second most used token is BTC, with 2.10% density. The token density diagram is shown in Figure 4.6. We ignored tokens whose density is less than 0.05%. All tokens except ETH and BTC are below 1% and less popular.
Who are the top miners?
During the last fifteen days of January, we observed that a total of 93,240 blocks were mined by 87 active miners. Spark Pool is the top miner in the list: over this period, Spark Pool mined 32.94%, i.e. approximately 33%, of the total blocks. Ethermine is second with 20.74%, and the third largest is F2Pool with 11.13%; the rest are under 10%. Looking closely at Figure 4.5, for the top 25 miners by block count in the last 14 days (16/01/2020–30/01/2020), we can see analogous patterns over different time frames. In 2018, Ethermine was the top miner, but at the end of 2019 and the beginning of 2020, Spark Pool took the top position, with Ethermine holding second place on the block-mining list.
4.3 Unsupervised Learning
In this experiment, we apply the K-means clustering algorithm and determine the optimal number of clusters (k) in the Ethereum data-sets using the silhouette method and the elbow method. We do not use K-medoids clustering; as we will show in the results section, it adds little here. We reached our conclusions by performing all experiments both with and without principal component analysis.
It is very useful to focus on the most correlated variables in the data-set. In the plot below, the correlation coefficients are coloured according to their value: as the value increases, the colour moves towards blue; as it decreases, the colour moves towards red. The correlation coefficient indicates both the direction and the strength of the relationship. The intensity of the colour reflects the magnitude of the coefficient: values close to one are more visible, and values close to zero less visible. See Figure 4.6.
The above graph shows a total of eight features and their relationships with one another. Positive correlations are displayed in blue and negative correlations in red. A value of exactly 1 indicates a perfect positive relationship between two variables, and -1 a perfect negative relationship; an empty box indicates a correlation of 0, meaning there is no relationship between the two variables. So we can say that the incoming Ether value and the outgoing Ether value are two separate quantities that do not depend on each other. We can see that BlockNumber, TimeStamp, TransactionFeeInEther, and TransactionFeeInUSD have near-perfect positive relationships with one another. TransactionFeeInEther correlates strongly with HistoricalPriceEther, which means that in previous years the transaction fee was high because the Ether token price was high, while at the start of 2019 the price of the Ether token was low. TransactionFeeInUSD shows the same relationship as TransactionFeeInEther, because if the Ether token value is high, the equivalent price in dollars is also high. CurrentValueEther depends on InValueEther and OutValueEther: if the transaction frequency between sender and receiver accounts is low, then the incoming and outgoing transaction values, on which the Ether value depends, are also low.
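The quantity plotted in each cell of the heatmap is the Pearson correlation coefficient. A minimal sketch of its computation, on toy columns standing in for two of the eight features:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient: covariance of x and y divided by
    the product of their standard deviations; always in [-1, 1]."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy columns, not real network data:
block_number = [1.0, 2.0, 3.0, 4.0]
timestamp    = [10.0, 20.0, 30.0, 40.0]  # linearly increasing together: r = 1
fee_usd      = [4.0, 3.0, 2.0, 1.0]      # moving in opposition: r = -1
```

In practice the full matrix over the eight features is computed in one call (e.g. `DataFrame.corr()` in pandas), but each entry is exactly this quantity.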
Case I: Silhouette Method
In this experiment, we used the silhouette method to determine the optimal number of clusters k. We made the data consistent and applied the k-means clustering algorithm with the initial configuration 'nstart' equal to twenty-five and the value of k suggested by the silhouette method. In Figure 4.7, we plotted the number of clusters k against the average silhouette width. We obtained k = 6 as the number of clusters with the highest average silhouette width. In the silhouette method, values near 1 indicate points that are well matched to their own cluster, and values close to -1 indicate points assigned to the wrong cluster. Out of the sixteen features in the data set we chose only eight: "BlockNumber", "TimeStamp", "InValueEther", "OutValueEther", "CurrentValueEther", "TransactionFeeInEther", "TransactionFeeInUSD", and "HistoricalPriceEther". From these eight features we obtained six final optimal clusters. We plotted the clusters in six different colors, each with the region or boundary enclosing the points around its mean value.
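For reference, the silhouette width of a single point is s = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the nearest other cluster. A minimal sketch on hypothetical 1-D values (not from our data set):

```python
def silhouette_width(point, own_cluster, other_clusters, dist):
    """Silhouette s = (b - a) / max(a, b) for one point."""
    others = [p for p in own_cluster if p != point]
    a = sum(dist(point, p) for p in others) / len(others)  # mean intra-cluster distance
    # b: mean distance to the nearest of the other clusters
    b = min(sum(dist(point, p) for p in c) / len(c) for c in other_clusters)
    return (b - a) / max(a, b)

dist = lambda x, y: abs(x - y)
c1, c2 = [1.0, 1.2, 0.8], [9.0, 9.5, 8.5]
s = silhouette_width(1.0, c1, [c2], dist)
print(round(s, 3))  # -> 0.975, i.e. well matched to its own cluster
```

Averaging s over all points for each candidate k gives the curve plotted in Figure 4.7.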
Case II: Elbow method
In this experiment, we used the elbow method to find the optimal value of k in k-means. We made the data consistent and applied the k-means clustering algorithm, considering a total of eight features. For each k we calculated the total within-cluster sum of squares. The value at the bend of the line in the plot is taken as the best-fitting number of clusters. In this case, we obtained the best value at k = 4. See (Figure 4.9)
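The quantity plotted against k in the elbow method is the total within-cluster sum of squares (WSS), which drops sharply until the "elbow" and flattens afterwards. A minimal sketch on hypothetical 1-D values (not from our data set), comparing WSS for one cluster versus two:

```python
def wss(clusters):
    """Total within-cluster sum of squares for a list of 1-D clusters."""
    total = 0.0
    for c in clusters:
        centre = sum(c) / len(c)                # cluster mean
        total += sum((x - centre) ** 2 for x in c)
    return total

data = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
print(wss([data]))                              # k = 1: one big cluster, large WSS
print(wss([[1.0, 1.2, 0.8], [9.0, 9.5, 8.5]]))  # k = 2: the sharp drop marks the elbow
```

Beyond the elbow, further splits reduce WSS only marginally, which is why the bend is read off as the best k.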
4.4 Supervised Learning
For supervised learning, we first set up an extremely unbalanced data set. We marked 694 malicious addresses out of 87 million unique Ethereum addresses, and split these 694 addresses into a training set of 554 and a test set of 138. Data preparation is one of the biggest challenges in supervised machine learning, because it involves collecting data from different sources and analyzing which features matter for the algorithm; understanding the data structure and its significance took a long time. After data preparation, feature importance and feature selection are carried out according to our requirements. For the Ethereum transaction network, the features we marked are the addresses of the Ethereum developers, and the date and category of the data. We extracted another data set from CryptoScamDB through its JSON API. CryptoScamDB is an open-source database that exposes malicious categories, subcategories and URLs. Since the categories are known from the data set, we labelled non-malicious addresses '0' and malicious addresses '1', labelling 7681 malicious addresses overall. The idea is to label the data set and split it into 80% training and 20% test data, so that with the help of the training and test sets we can predict the output. We were not able to find a single most accurate function that would make flawless predictions.
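The labelling and 80/20 split step can be sketched as follows. This is an illustrative sketch only: the addresses and the `known_malicious` set are hypothetical placeholders, not entries from CryptoScamDB.

```python
import random

# Hypothetical known-malicious set (in our work this came from CryptoScamDB's JSON API)
known_malicious = {"0xabc...", "0xdef..."}

def label_and_split(addresses, malicious_set, train_frac=0.8, seed=42):
    """Label each address (1 = malicious, 0 = non-malicious), shuffle, split 80/20."""
    labelled = [(a, 1 if a in malicious_set else 0) for a in addresses]
    random.Random(seed).shuffle(labelled)   # fixed seed for reproducibility
    cut = int(len(labelled) * train_frac)
    return labelled[:cut], labelled[cut:]

addresses = ["0xabc...", "0x111...", "0x222...", "0xdef...", "0x333..."]
train, test = label_and_split(addresses, known_malicious)
print(len(train), len(test))  # -> 4 1
```

With the real data, the imbalance (694 malicious among millions of addresses) persists in both splits, which is exactly the difficulty discussed below.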
In machine learning, a confusion matrix is a table that helps characterize the performance of a classifier (or 'model') on a test data set for which the true values are known. The table below shows the confusion matrix, in which malicious is taken as the negative class and non-malicious as the positive class.
Explanation of the Terms:
True Positive (TP): the algorithm's observation is correct; a non-malicious address is predicted as non-malicious.
True Negative (TN): the algorithm's observation is correct; a malicious address is predicted as malicious.
False Negative (FN): the algorithm's observation is incorrect; a non-malicious address is predicted as malicious.
False Positive (FP): the algorithm's observation is incorrect; a malicious address is predicted as non-malicious.
Accuracy: the fraction of correct observations out of all observations made by the algorithm.
Precision: how many of the instances predicted positive are actually positive.
Recall: how many of the actual positives the model captures by labelling them as positive.
F-measure: the harmonic mean of precision and recall, which seeks a balance between the two.
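These four metrics follow directly from the confusion-matrix counts. A small sketch with hypothetical counts (not the actual counts from our experiments):

```python
def metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F-measure from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f_measure

# Hypothetical counts for illustration only:
acc, prec, rec, f1 = metrics(tp=240, tn=10, fp=5, fn=0)
print(round(acc, 3), round(prec, 3), round(rec, 3), round(f1, 3))
```

On an unbalanced data set like ours, accuracy alone is misleading (predicting "non-malicious" for everything already scores high), which is why precision, recall and F-measure are reported alongside it.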
4.4.1 Supervised Learning Classifier
K-Nearest Neighbor (k-NN): The first classifier or model we chose is the k-NN classifier. We trained the model on 80% of the data and tested it on the remaining 20%. Table 6.4 below shows the result when the model is evaluated on the test set. The test set contains 249 non-malicious and 6 malicious data points out of a total of 255. We can see that the classifier does not classify any regular user as malicious.
The figure below shows the best value of k. To improve the performance of the k-NN model, we tuned the parameter k and adjusted the number of selected features. As we increase the value of k, the accuracy decreases. The accuracy reaches its highest value of 99.5% at k = 1, but as we increase k from 1 toward 32 the k-NN model's accuracy drops continuously, reaching 96.8% at k = 31. We ran the loop up to 32 iterations using the Euclidean distance.
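The tuning loop over k can be sketched as below. The sketch uses a single hypothetical 1-D feature and tiny made-up training/test sets purely to show the mechanics (nearest-neighbour lookup, majority vote, accuracy per k); it also reproduces the qualitative effect that a too-large k hurts the minority class.

```python
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k nearest training points (1-D Euclidean distance)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical 1-D feature values with labels (1 = malicious)
train = [(0.1, 0), (0.2, 0), (0.3, 0), (5.0, 1), (5.1, 1)]
test = [(0.15, 0), (5.05, 1)]

for k in (1, 3, 5):
    correct = sum(knn_predict(train, x, k) == y for x, y in test)
    print(k, correct / len(test))   # accuracy collapses once k exceeds the minority class size
```

At k = 5 every vote includes all five training points, so the majority class always wins and the malicious point is missed: the same failure mode our unbalanced data set shows at large k.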
Decision Tree: The data set is highly unbalanced. For classification, we need to balance the data set with homogeneous features and remove unused character-based data to achieve better classifier performance.
The table contains the output when the classifier is evaluated on the test data set. The test set contains 2 malicious points out of a total of 279 data points. In Figure 6.11, we plotted the fourfold plot, which is well suited to visualizing odds ratios, especially for 2 × 2 × k contingency tables. The plot shows the cell frequencies numerically, with the margins for non-malicious and malicious equated. It clearly indicates that the non-malicious class '0' is far larger than the malicious class. Our model's prediction does not classify any regular user as malicious, meaning that, in general, normal end users are not involved in malicious activity.
SVM Classifier: The Support Vector Machine (SVM) classifier is another supervised machine learning algorithm used for classification problems. Each data point is plotted in n-dimensional space, where n is the number of features, and the task is then to find the right hyperplane separating the classes. Again we split the data set into an 80% training set and a 20% test set. For better classification we additionally applied feature scaling, because our variables are not in the same units. We evaluated the accuracy from the confusion matrix shown in the table and obtained an accuracy of 0.94. There are 119 malicious data points out of a total of 282.
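The feature scaling step mentioned above is standardization (z-scoring), which puts variables measured in very different units onto a comparable scale before the SVM computes distances. A minimal sketch with hypothetical fee values (not from our data set):

```python
import math

def standardize(column):
    """Scale one feature column to zero mean and unit variance (z-score)."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / n)  # population std
    return [(x - mean) / std for x in column]

# Hypothetical ether fee values on a tiny scale:
fees = standardize([0.001, 0.002, 0.003, 0.004])
print([round(v, 3) for v in fees])  # -> [-1.342, -0.447, 0.447, 1.342]
```

Without this step, a feature like BlockNumber (in the millions) would completely dominate a feature like TransactionFeeInEther (fractions of a token) in the SVM's distance computations.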
In the figure below, we predicted the test set results. Green shows the non-malicious data observed from the confusion matrix, and red shows the malicious data within the confidence ring. We achieved 94% accuracy, which is not particularly good; the classifier we chose did little to increase the model's accuracy on this data. In Figure 6.13, we plot the prediction on the training set.
The dots in the plot are real data: red dots are malicious data points, green dots are clean (non-malicious) data points, and the background is determined pixel by pixel. We used a linear kernel to reduce the computational cost. As we can see in the plot, the two distant data points near 40 and 50 are outliers. The green and red dots look almost linearly separable, meaning the hyperplane divides the data set into two sets, malicious and non-malicious. We therefore set the training model aside and examined the test data set, where all data points lie close to the hyperplane with only one outlier; see the figure below. We can conclude that our model performs well on the test data set.
Naive Bayes Classifier: This classifier solves classification problems with a probabilistic approach. The assumption behind the Naive Bayes algorithm is that the predictor variables in the model are independent of each other: the output of the model depends on a collection of independent variables that have nothing to do with one another. As in the approaches above, we first prepared the data and, to train the Naive Bayes model, split it into an 80% training set and a 20% test set. Figure 6.15 shows the confusion-matrix plot of the distribution of actual and predicted data, and the table shows the accuracy, precision, recall, and F-measure we calculated.
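A Gaussian Naive Bayes classifier can be sketched as below for a single numeric feature. This is an illustrative sketch only: the feature values and labels are hypothetical, and the real model used several features (each contributing an independent likelihood term, per the independence assumption).

```python
import math
from collections import defaultdict

def fit_gnb(train):
    """Per-class mean, variance and prior for a single numeric feature."""
    groups = defaultdict(list)
    for x, y in train:
        groups[y].append(x)
    model = {}
    for y, xs in groups.items():
        m = sum(xs) / len(xs)
        v = sum((x - m) ** 2 for x in xs) / len(xs) or 1e-9  # avoid zero variance
        model[y] = (m, v, len(xs) / len(train))
    return model

def predict_gnb(model, x):
    """Pick the class maximising log prior + log Gaussian likelihood."""
    def score(y):
        m, v, prior = model[y]
        return math.log(prior) - 0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
    return max(model, key=score)

# Hypothetical single feature, e.g. gas used (1 = malicious)
train = [(100.0, 0), (120.0, 0), (110.0, 0), (900.0, 1), (950.0, 1)]
model = fit_gnb(train)
print(predict_gnb(model, 105.0), predict_gnb(model, 920.0))  # -> 0 1
```

With multiple features, the per-feature log-likelihoods are simply summed, which is exactly where the "naive" independence assumption enters.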
In the figure, we can see the confusion matrix with a total of 278 data points, of which 31 are malicious or error points. Our predictions are based on the test data set. These outputs were obtained after we had already considered out-degree nodes as malicious nodes. In the illustration in Figure 6.16, it is clear that 'gas' is the most significant variable for predicting the output, with 'gasUsed' the second most significant.
We now discuss the metrics we obtained from the unsupervised and supervised machine learning algorithms described in Section 6.3. It is essential to understand which machine learning algorithm achieved the best number of clusters and the best accuracy, and finally the differences in results between the algorithms we used for clustering and for classifying the Ethereum transaction network data set.
In the unsupervised learning approach, we first saw how to prepare the data and how correlation relates the attributes to themselves and to each other. Positive correlations are shown in blue and negative correlations in red. The color density is proportional to the correlation coefficient's magnitude. Where a relationship exists between features, the coefficient indicates both the strength and the direction of the correlation. From the plot we can clearly see that BlockNumber, TimeStamp, TransactionFeeInEther and TransactionFeeInUSD have strong relationships with one another. This means the number of blocks created per unit of time captured by the system is the major factor in block creation, and these attributes depend on each other. Likewise, a transaction between a sender address and a receiver address in the Ethereum network is denominated in the token, and the transaction fee is also related to its equivalent value in US dollars (USD).
In k-means clustering, we first looked at the silhouette method. Initially we considered k = 2 clusters; the k-means function has an 'nstart' option that attempts multiple combinations of initial configurations and keeps the best one.
For example, with nstart = 25, the output includes the cluster assignment, a vector of integers indicating the cluster to which each point is allocated, and the cluster sizes, which give the number of points in each cluster; we obtained 2 clusters with sizes 10 and 2990. The output also lists the cluster centres of the two groups across the six variables ("InValueEther", "OutValueEther", "CurrentValueEther", "TransactionFeeInEther", "TransactionFeeInUSD", "HistoricalPriceEther"). Figure 7.1 shows the result of the algorithm for k = 2 together with the data sizes. The pattern of the data in the graph is not well distributed when k = 2, because the mean of each cluster does not sit at the exact center; as we increase k, with the best value k = 6 we obtain six clusters whose means sit at the exact centers. With the elbow clustering model we obtained four clusters, but for our Ethereum data set the optimal number of clusters is 6: as the figure shows, the data set forms well-shaped groups with six clusters.
Figure 5.2 shows the data distribution for different numbers of clusters: if we increase the value of k, the number of clusters increases and the data forms new, tighter groups.
On the other hand, in the k-medoids clustering approach each cluster is represented by one of its data points, known as the cluster medoid. In our experiment we used partitioning around medoids (PAM). K-medoids is less sensitive to outliers than k-means, because it uses medoids as cluster centers instead of means. In Figure 7.3, the k-medoids clustering with k = 2 shows that the outliers are not handled well, and the figure lets us clearly differentiate between k-means and k-medoids: with k = 2, the k-means clusters are well enclosed by their boundaries, while the k-medoids clusters overlap each other. In this scenario, therefore, we conclude that k-means performed better than k-medoids on our data. In the clustering setting, finding a particular malicious address that came from a particular account is quite challenging for our Ethereum data set, but it becomes possible once we have a balanced data set.
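The difference between the two center definitions is easy to see on hypothetical values: the medoid must be an actual data point, so a single outlier drags the mean far more than the medoid. A minimal sketch (values are illustrative, not from our data set):

```python
def medoid(cluster, dist):
    """The medoid is the cluster member minimising total distance to all members."""
    return min(cluster, key=lambda c: sum(dist(c, x) for x in cluster))

dist = lambda a, b: abs(a - b)
cluster = [1.0, 2.0, 3.0, 100.0]      # 100.0 is an outlier
mean = sum(cluster) / len(cluster)
print(mean, medoid(cluster, dist))    # the mean is dragged toward the outlier; the medoid is not
```

Here the mean lands at 26.5, far from every real point, while the medoid stays at 2.0, inside the dense part of the cluster.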
Table 5.1 shows a comparison of the results of all the classifiers we ran on the Ethereum data set:
First of all, from the table, we used the k-NN classifier and obtained 96.8%. We ran the k-NN algorithm while tuning the hyper-parameter k and the number of selected attributes to improve the model's performance. As we increase the value of k, the success rate decreases. Initially we achieved an accuracy of 99.5% at k = 1, but we discarded this iteration because that accuracy was obtained on only a few attributes with unbalanced data. The accuracy continued to drop until k = 13, went back up to 98.2% at k = 14, and then decreased further, so that at k = 31 we achieved 96.8%. Precision is 81%, which is low; the sensitivity (recall) of the model is high at 97%; and the F-measure did not strike a good balance between precision and recall. We therefore learned that the k-NN model was not a good fit for the data set we selected, since the accuracy decreased continuously as k increased. Next, only 2 malicious cases were found out of 279 total cases, and we achieved the highest accuracy, 99.8%, with the Decision Tree model. The Decision Tree performs better on test data, so we can say the model trained on the training data set generalizes well. The precision we calculated is 99.8%, and recall also performs well at 98.9%, so the balance between precision and recall is 99.3%. We can conclude that the Decision Tree improved the accuracy of correct classification on the Ethereum data set.
On the other hand, SVM performed linear classification using the kernel trick to reduce the computational cost of the algorithm, and achieved 94.0% accuracy. The balance between precision and recall is low in the SVM model, at 73.2%. Finally, we worked on the Naive Bayes model and achieved 96.7% accuracy, better than SVM but less than k-NN. We fed only six features into the final Naive Bayes model with 10-fold cross-validation, so we obtained lower accuracy than the Decision Tree. If we had had more numeric data to feed into the Naive Bayes model, we would have obtained better results. The final result differs on each run, as the sampling uses a random function. The output we produced is not the highest, but for a very simple model it is really good. We consider Naive Bayes our second-best model, ahead of the k-NN and SVM models, because k-NN was slow, its accuracy kept decreasing, and it was time-consuming.
We applied many unsupervised and supervised learning algorithms to the Ethereum data set to detect which algorithms or classifiers are most accurate on the data. For unsupervised learning we used k-medoids (Partitioning Around Medoids), which is less sensitive to outliers than k-means. For supervised learning we used four methods: we detected 6 malicious data points out of 256 with k-NN, predicted 2 malicious out of 279 with the Decision Tree, got 119 out of 282 known cases with SVM, and evaluated 31 out of 278 with Naive Bayes. We can therefore say that, of all the classifier runs, the Decision Tree was best at separating erroneous from good data, while the remaining classifiers flagged the most data as malicious. In this chapter, we compared the results of all the supervised learning algorithms and the unsupervised learning algorithms.
According to this study, we can say that it is possible to find malicious activity in the data set and detect data patterns in the Ethereum network using machine learning algorithms, but finding a particular malicious address behind the false transactions is quite difficult, since Ethereum has a highly unbalanced data set. We found that k-means clustering is more efficient than k-medoids clustering. Using the silhouette, elbow and principal component analysis methods, the accuracy of the cluster or group formation increases. We also identified outliers better with k-means clustering than with k-medoids clustering. The only drawback of k-means is that we have to assume the preconditions k = 2 and nstart = 25.
With the help of machine learning classifiers or models, we can characterize all the addresses in the data set as malicious or non-malicious. We can also find highly impulsive data by applying a predictive class method, make any kind of visualization, predict the accuracy of the data and plot graphs to understand its behavior. This was a good start to understanding Ethereum transaction data through a machine learning approach: on our highly unbalanced data set, we found 7487 potentially malicious addresses out of 87 million.
We believe the most important thing to do before applying a machine learning algorithm is to prepare the data. Data pre-processing is the factor that most affects the performance of our models: if the data is well balanced, we can make good predictions. Beyond these conditions, the available computational power directly determines the performance of software like R-Studio or a Python IDE (Integrated Development Environment). If our machine does not have high processing speed, our models run slowly, which affects the cost. To improve performance we can use cloud computing services such as the Anaconda Navigator cloud or R-Studio Cloud; cloud computing is good for multitasking and multi-threading, especially when we have to analyze and process highly unbalanced data like the Ethereum transaction network data. We also saw the importance of the file format for the durability, longevity, transition state and scalability of the data. It is very important for a machine learning approach to have the data in shape, because data stored in the right format is very easy to retrieve and manage.
As far as our study goes, there is no benchmark set for achieving an exact output. In this study we made an attempt to understand the Ethereum transaction network data. With the above-mentioned approach we could also build a recommender system, although we found that applying a random algorithm and evaluating precision and recall on the test data set gave random performance (that is, such a recommender system was inefficient). We also understood that we should not expect large improvements from tuning hyperparameters alone; we can instead switch to a specific model or classifier according to the requirements and the expected output. Qualitative evaluation will give reasonably good recommendations on a small data set.
This study is also helpful for users who want to understand the concept of Ethereum and its token. Similar studies can be done for other coins or tokens as well. It would be good to build a recommender system that also helps us find what we want from our data: which token's future looks promising and which token is the best fit for us.
In the end, we can say that it is not simple to work with highly unbalanced data, but after proper data preparation we can find incongruous behavior in the Ethereum transaction network using machine learning algorithms.
6.1 Future Work
At the same time, it will be fascinating and challenging to focus on data pre-processing in depth, because Ethereum data is like a dense cloud in which we have to detect incongruous behavior or anonymous activity. Building data pipelines may lead us to historical and future Ethereum transaction data, and perhaps let us group all the scams and tokens separately. Studying Ethereum data is much easier when it is labeled with consistent categories; the data we studied is not well labeled, with scam categories named both 'Scamming' and 'scam' even though the two words have the same meaning. Once balanced data is available, we can consider applying new machine learning algorithms as appropriate.
Finally, we emphasize that the data we access through Etherscan as JSON responses is not well structured. If the data were well structured, we could easily apply a dedicated algorithm based on our research questions. We could also study and understand Ethereum with graph analysis, and provide recommender systems. Beyond the existing Ethereum data sets on the market, there are still many areas of the Ethereum blockchain that remain largely unexplored. We look forward to seeing more Ethereum data in the future.