Classifying Ethereum users using blockchain data

A data-driven examination of the largest addresses on Ethereum

The Ethereum blockchain hosts many different types of users. Most are ordinary people, but the bulk of holdings is concentrated among a few specific categories. Some of the most prominent ones are:

Icons made by turkkub, Freepik, Ddara, srip from www.flaticon.com

The wealth of the holders above, even for a single entity, is often distributed among many Ethereum wallets. Quite often, people intentionally make their holdings harder to track. However, we can still glean insight from examining the addresses with the most substantial balances by reviewing their on-chain transaction activity.

Why is this interesting?

There are many reasons why one might want to track these accounts. You could be:

  • An investor on the lookout for signals for predicting price movements.
  • A token team looking to airdrop to the top active addresses.
  • A government agency looking to audit suspicious blockchain activity.
  • A researcher interested in better understanding what’s occurring on the blockchain.

Luckily, there is a plethora of on-chain information on addresses that can help track the different groups of ETH accounts.

The most obvious place to start is the Ether balance distribution. If you were expecting more wealth equality in crypto compared to fiat, you might be surprised!

The graph above is log scaled, so the first section represents the top 10 addresses, then from 11–100 and so on until the 35 million mark, ordered by ETH balance.

Put another way…

The top 10,000 addresses own 83.3% of total Ether!

And if that’s not surprising enough — the top 10 addresses own 11.4% of Ether holdings. This concentration of Ether among a small group makes tracking the top accounts even more valuable.
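As a rough illustration of how concentration figures like these can be derived, here is a minimal sketch in Python. It assumes you already have a list of address balances (for example in wei, as integers). The function names `top_share`, `rank_bucket` and `bucket_totals` are hypothetical helpers introduced for this example, not part of any particular library or of the pipeline used for this post.

```python
from collections import defaultdict


def top_share(balances, k):
    """Fraction of the total balance held by the k largest balances."""
    ordered = sorted(balances, reverse=True)
    total = sum(ordered)
    if total == 0:
        return 0.0
    return sum(ordered[:k]) / total


def rank_bucket(rank):
    """Power-of-ten bucket for a 1-based rank: 1-10 -> 1, 11-100 -> 2, ..."""
    if rank < 1:
        raise ValueError("rank is 1-based")
    # Counting the digits of (rank - 1) avoids floating-point log10 edge cases.
    return len(str(rank - 1))


def bucket_totals(balances):
    """Sum balances per log-scaled rank bucket, richest address first."""
    ordered = sorted(balances, reverse=True)
    totals = defaultdict(int)
    for rank, balance in enumerate(ordered, start=1):
        totals[rank_bucket(rank)] += balance
    return dict(totals)
```

Calling `top_share(balances, 10_000)` on the full balance list would give the kind of share quoted above, and `bucket_totals` groups addresses the same way a log-scaled rank axis does (top 10, 11 to 100, and so on).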

Anonymity on the Ethereum blockchain makes big holders hard to track. Fortunately, the blockchain is also transparent, and every movement is traceable. At the time of writing, there are 5.7 million blocks in the Ethereum network, each holding valuable transaction data that can help classify addresses into the groups described in the introduction (e.g., Exchanges, Token Teams).

Visualising on-chain data

By plotting blockchain data for each of the top 10,000 addresses, we can observe whether any patterns emerge. The amount of data each address has is impossible to fit coherently into one graph, so let’s start with four important variables: Sent and Received transactions (horizontal axes), Current Account Balance (vertical axis), and Different Tokens Held (colour intensity). These variables represent the activity and holdings of an account, arguably its most important values.

It’s a lot to take in. This data emphasises that there are very diverse groups of actors amongst the top Ether holders in the network. Firstly, it shows that ‘power-senders’ don’t receive many transactions, and vice versa. It also shows that the bulk of top Ether holders (with balances ranging up to 1.5 million Ether!) are relatively inactive. Addresses holding a large number of unique tokens also seem to receive many more transactions than they send.
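To make the four variables concrete, the sketch below shows one way they could be assembled per address. It assumes the raw data is available as a list of `(sender, receiver)` pairs plus two dictionaries keyed by address. The `AddressFeatures` class and `build_features` function are hypothetical names for this example and do not describe how the data for this post was actually prepared.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class AddressFeatures:
    sent: int            # number of transactions sent
    received: int        # number of transactions received
    balance_eth: float   # current account balance in Ether
    unique_tokens: int   # number of different tokens held


def build_features(transfers, balances_eth, token_counts):
    """Combine raw inputs into one feature record per address.

    transfers: iterable of (sender, receiver) address pairs (assumed format)
    balances_eth: dict mapping address -> Ether balance
    token_counts: dict mapping address -> number of different tokens held
    """
    transfers = list(transfers)  # allow generators, since we iterate twice
    sent = Counter(sender for sender, _ in transfers)
    received = Counter(receiver for _, receiver in transfers)
    return {
        address: AddressFeatures(
            sent=sent[address],
            received=received[address],
            balance_eth=balance,
            unique_tokens=token_counts.get(address, 0),
        )
        for address, balance in balances_eth.items()
    }
```

Only addresses present in `balances_eth` get a record, which matches the idea of starting from a ranked list of top holders and then attaching activity data to each one.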

By using a dataset of labelled addresses, we can further examine whether the behaviour of these groups differs, this time looking only at received transactions, sent transactions, and balance (with colour now indicating the address type).

The labelled dataset is restricted to Exchanges, Mining Pools, and Token Team Wallets, so we will focus on these categories for now.

Categories manually derived from labelled addresses on etherscan.io

There are some clear distinctions between each user group.

On average, Mining Pools send many transactions, Token Team Wallets have high token balances, and Exchanges send and receive many transactions.

The fact that these three account types are so distinct from one another is striking, though logical. Pools will be receiving block rewards and sending transactions to miners; token teams will be holding on to their funding; people will be using Exchanges to both send and receive Ether and Tokens.

This is not the only available on-chain data, however. Another dimension we can examine is the unique token holdings of each address, meaning the number of different types of tokens an address holds. For example, holding 50 MANA and 30 EOS counts as two unique tokens. This could give us some insight into addresses, as some types of entities, for example investors, may be more interested in holding many types of tokens. A further valuable dimension is differentiating between smart contract addresses and externally owned accounts. A smart contract is a distinctive address type that rules out most normal user types, and can thus help categorise the function of an address.

Some insight can be quickly gathered from the graph above:

  • Token teams are primarily smart contracts; mining pools are all externally-owned accounts.
  • Exchange addresses often have 100+ different tokens, while the others do not.
  • Exchanges can be externally owned accounts or smart contracts, which indicates whether they are centralised (e.g., Binance) or decentralised exchanges (e.g., EtherDelta), respectively. This relationship may seem obvious, but it allows us to further categorise different address types.
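The two extra dimensions discussed above are straightforward to compute once the raw data is at hand. The sketch below assumes token balances arrive as a dictionary of symbol to amount, and that contract status is inferred from the hex string returned by the standard `eth_getCode` JSON-RPC call, which is `"0x"` for an account with no deployed code. The helper names are hypothetical.

```python
def count_unique_tokens(token_balances):
    """Number of different tokens with a non-zero balance.

    token_balances: dict mapping token symbol -> amount held (assumed format)
    """
    return sum(1 for amount in token_balances.values() if amount > 0)


def is_contract(code_hex):
    """True if an eth_getCode-style hex string contains any bytecode.

    Externally owned accounts have no code, so the call returns "0x".
    Caveat: a self-destructed contract also has no code and would be
    reported as an externally owned account here.
    """
    return code_hex.lower() not in ("", "0x")
```

With the example from the text, `count_unique_tokens({"MANA": 50, "EOS": 30})` counts two unique tokens regardless of the amounts held.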

Creating archetypes of users on the blockchain

We can summarise these findings into archetypes, each consisting of a combination of on-chain metrics that allows us to more easily identify unlabelled addresses. These models are represented in the diagrams below.
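As a rough illustration, archetypes like these could be turned into simple rules. The sketch below is an assumption-laden toy, not the model behind the diagrams: the 100-token cut-off echoes the observation about exchanges above, while the `busy` threshold of 10,000 transactions is purely hypothetical and would need tuning against labelled data. Because the population is already the top addresses by balance, the balance itself is not checked again here.

```python
def classify_address(sent, received, unique_tokens, is_contract,
                     busy=10_000, many_tokens=100):
    """Assign a coarse archetype from four on-chain metrics.

    busy: hypothetical transaction count above which an address is "active"
    many_tokens: number of different tokens suggesting an exchange
    """
    heavy_sender = sent >= busy
    heavy_receiver = received >= busy

    # Exchanges: send and receive a lot, and hold many different tokens.
    if heavy_sender and heavy_receiver and unique_tokens >= many_tokens:
        return "exchange"
    # Mining pools: externally owned accounts that mostly send (payouts).
    if heavy_sender and not heavy_receiver and not is_contract:
        return "mining pool"
    # Token teams: smart contracts holding funds with little activity.
    if is_contract and not heavy_sender and not heavy_receiver:
        return "token team"
    return "unknown"
```

Rules this strict will leave many addresses as "unknown", which is arguably preferable to forcing a label where the evidence is thin.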

Future Exploratory Paths

With a dataset of only 77 labelled addresses, we were able to map some of the most prominent Ethereum address groups. This emphasises the need for more labelled data, as it will allow for the development of a more comprehensive picture of entity types.

The scope for future exploration is vast. For example, active trader accounts can be better tracked by examining interaction metrics with exchange addresses. As another idea, early investor activity can be observed by analysing accumulation patterns in addresses spanning from the first Ethereum blocks.
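One possible interaction metric for the trader idea is the share of an address’s transactions whose counterparty is a known exchange address. The sketch below assumes transactions are given as `(sender, receiver)` pairs and that addresses are already normalised to a consistent case. The function name and the metric itself are illustrative suggestions rather than an established measure.

```python
def exchange_interaction_ratio(address, transfers, exchange_addresses):
    """Share of an address's transactions that involve a known exchange.

    transfers: iterable of (sender, receiver) address pairs (assumed format)
    exchange_addresses: set of labelled exchange addresses
    """
    involved = [(s, r) for s, r in transfers if address in (s, r)]
    if not involved:
        return 0.0
    with_exchange = sum(
        1
        for sender, receiver in involved
        # The counterparty is whichever side of the transfer is not us.
        if (receiver if sender == address else sender) in exchange_addresses
    )
    return with_exchange / len(involved)
```

A ratio close to 1 for a busy address would suggest trading activity, though a threshold for what counts as an "active trader" would have to be chosen empirically.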

These different paths will be further examined in future posts. If there is anything you think would be particularly interesting to explore in-depth, or if you have any thoughts or criticisms, please do leave a comment with your feedback!

Data used for this post comes from tokenanalyst.io and etherscan.io


TokenAnalyst parses and classifies every on-chain transaction (currently from the Ethereum blockchain) with the goal of deriving fundamental insights to value crypto assets.