Analyzing Ethereum Classic with Google BigQuery

I’m happy to announce, through collaboration with Allen Day and Evgeny Medvedev of Google, Ethereum Classic is now part of the Google BigQuery dataset. There’s even a CoinDesk article about it.

This means that now, it’s more easier than ever to query the Ethereum Classic network using regular SQL which allows for more seamless data analysis. You can also download the datasets directly in Kaggle to use in your notebooks for analysis.

Why is this a Big Deal?

Blockchains are most accessible to cryptography and blockchain engineers, who are most familiar with the inner workings of the client and viewing the data. On the finance side, analysts mostly observe market data. For your average data scientist or entrepreneur who wants to do quick analysis of block data, they’re stuck with running their own Geth or Parity node and trying to parse it. Even if that’s successful, they need to continue doing so to keep up to date with more recent Ethereum Classic data.

With the Ethereum Classic dataset being constantly available on Google BigQuery and continuously updating daily, researchers, entrepreneurs and stakeholders can quickly analyze Ethereum Classic’s network and blocks without having to worry about the data engineering aspects or their cloud infrastructure or node setups. They can just focus on the data science and let us do all the rest!

How Did This Project Come About?

We at the ETC Cooperative always believed the most important stories can be told with data, and Ethereum Classic is no exception. We were planning on quantifying Ethereum Classic’s decentralization for a while now, along with many other analysis we want to explore. For your average data scientist, getting blockchain data can be tricky, if not daunting.

It immediately seemed clear to us there was no easy way to query Ethereum Classic’s blockchain history, so we set about looking for solutions.

Apache Airflow is used to update the dataset daily

I’ve stumbled upon the Ethereum-ETL library developed by Evgeny Medvedev to parse EVM based data, and quickly worked with him on adding more features to allow it to parse Ethereum Classic nodes. We also used Apache Airflow, a powerful workflow library, to update the Ethereum Classic dataset daily. It was all adapted from the original project for Ethereum dataset. If you would like to know how it was built or you’d like to build your own, check out this post.

Analyzing The Ethereum Classic Dataset

The Ethereum Classic dataset contains several tables, like blocks, transactions and traces which contains lots of interesting information about the blockchain activity. Furthermore, you can even analyze traces and smart contracts (who wants to analyze the DAO smart contract before and after the hack?). For more ideas on how what you can analyze with smart contracts, check out this post.

In this Medium post, we will show our analysis of a few interesting things we found in Ethereum Classic.

We set out to quantify decentralization in Ethereum Classic. For that, we will measure the Gini Coefficient of Ethereum Classic using BigQuery. The Gini coefficient is a measure of the income or wealth distribution of a population.

We will use it in two instances here: 
1) Daily Top 10K Account Balances
2) Daily Mining Rewards.

Furthermore, we will be using Balaji Srinivasan’s “Nakamoto Coefficient” for further analysis of Ethereum Classic, as discussed in his blog post Quantifying Decentralization. Nakamoto Coefficient is a proposed measure by Balaji regarding a blockchain’s subsystem. It measures what is the minimum number of entities that can influence more than 51% of a subsystem. To check out that, head on over to Analyzing ETC With Nakamoto Coefficient section below. Over there, we also introduce the nakamoto Python library I’ve built for data analysis.

Running SQL on BigQuery

The following queries and plots have full source code in this Kaggle notebook that you can clone and run on your own.

With BigQuery, we can run a query to get the average daily hash rate as shown in the following plot.

We can plot the output of the following query for the hash rate using Plotly with the following graph shown here, measured in Giga hash.

Now, let’s run an analysis of the top mining addresses by the number of block rewards they received in the past 30 days. The SQL query is found here. The plot for it is shown below. I’ve limited the plot to top 10 miners to show it much clearly.

What about if we try to combine the total daily mining blocks found for the miners and see the change in miner accumulation of blocks found over time since genesis?

BigQuery to the rescue! With the ETC Dataset, (SQL Query is found here), you can.

The result of the query can be plotted like shown here with Google DataStudio.

Daily Miner Rewards on Ethereum Classic

As can be seen here in the Google Data Studio chart, there was a sharp drop in mining rewards around the DAO Hard fork.

Now, let’s do some deeper analysis.

Given our daily block rewards are distributed to different addresses, we can make a few assumptions. We can first assume that each mining address belongs to a single entity. This helps us in our analysis by simplifying things. We will also not try to make any guesses on which addresses belong to a pool. Here, 1 address is equal to 1 entity.

Also, for simplification, we will only limit our analysis to miners who have mined more than 100 blocks a day, so that we can ignore small time individual miners. Sorry, fellas.

Now, we have the perfect parameters to measure an inequality distribution.

We can measure the daily Gini Coefficient of the total mined blocks of each miner to see if there’s a centralization or inequality around a few certain miners.

The Gini coefficient, as mentioned previously, is a statistical measure of distribution used to measure income or wealth distribution among a population.

From the Investopedia article:

A country in which every resident has the same income would have an income Gini coefficient of 0. A country in which one resident earned all the income, while everyone else earned nothing, would have an income Gini coefficient of 1.

The formula for the Gini is shown here:

The SQL for calculating the Gini is borrowed from Evgeny Medvedev and Allen Day of Google. What I have focused on here is to write the SQL necessary to generate the daily mining reward and how many each miner received. Then, I applied the Gini formula to measure the Gini index for each day.

Let’s run this query and plot it using Plotly.

For fun, we will also compare this query to the one generated for Ethereum and plot both together.

The plotly link is shown here. I have attached a screenshot below.

Where to even begin!?

What’s interesting here is that after the DAO Hard Fork, the centralization of block rewards among certain miners dropped for ETC for about a good year and a half until come early 2018 where what I call the “flippening” of Gini coefficients happen to both ETH and ETC.

I am not quite sure what happened around the early 2018 period to account for the Gini coefficient of block rewards by miners to increase on ETC. My main hypothesis here is, it must be a miner or two shifting more hash power to ETC and concentrating the mining rewards daily for themselves. It could also hint at less major mining operations happening around ETH due to fears of Proof of Stake, so the lesser Gini for ETH might just mean miners have a more equal opportunity of dividing the daily block rewards among themselves.

Let’s look at another analysis.

For here, we will get the rich list of the top 10k accounts. The query to do this can be found in this post by Evgeny Medvedev. If we plot the top 20 balances, we will get this pie chart.

Now, let’s try to think about the other Gini study we can do.

We can measure the Gini coefficient of a rich list. It’s basically taking into account the assumption that one address belongs to a single entity. It also is ignoring the fact that bigger addresses belong to exchanges and just look at top 10k balances.

The query for the Gini index for daily rich list balances is written by Evgeny Medvedev and Allen Day of Google, which I adapted it to include the Simple Moving Average of the Gini index for the 7 days and 30 days window. Link to the query is here.

For the top 10k addresses, there seems to be a major rise in the Gini coefficient around the DAO Hard fork, where it goes up by about 13 percent! We don’t know yet what caused this, but a few hypotheses are that, major movement of Ethereum Classic on exchanges post the DAO hard fork. It’s possible it was major buys of ethereum classic when the price was cheap post the fork and this was a movement of ether to whale accounts. One other thing might be it’s the DAO hacker moving funds. We will be investigating this further in future Medium posts.

The Gini remains around 0.85 from the DAO Hard fork until early 2018 where it starts dropping for wealth distribution of ethereum classic by wallet addresses.

The time periods of DAO Fork to Early 2018 are very interesting for daily top balance Gini because they form a similar behavior to the Gini plot of daily block reward for miners. The daily reward Gini in that time period has dropped, while in the case for Gini daily balances, it rises. There is a clear negative correlation happening here.

This reverse pattern in Gini coefficient movement between mining rewards and Ethereum Classic balance centralization/decentralization does warrant further investigation into what’s causing this.

What makes a blockchain’s Gini coefficient rise up after it becomes a minority chain due to the majority chain hard forking away? Is it being heavily manipulated by others as the distribution of wealth points to inequality?

What about the drop in the Gini coefficient for mining rewards per miner right after the fork? Clearly, the DAO Hard Fork of Ethereum does tell an interesting story about the behavior of chains before and after community splits. We need to investigate this much deeper, and luckily, with BigQuery, we have all the tools needed to do just that.

Now, let’s take Ethereum into account and compare the daily Gini coefficient of both of them.

It’s the same query but you target the Ethereum blockchain dataset on BigQuery to compare.

Here’s the plot for it.

Interesting results! Not only is Ethereum Classic’s Gini rising by 13%, but Ethereum’s actually drops over time.

My only explanation for this is Ethereum’s relative stability in the hype market post-DAO Hard fork followed by the ICO season and further distribution of Ethereum to smart contract addresses in exchange for tokens. ETC stabilizes later around the same point as the highest Gini point for ETH.

We can also compare with ETC with Bitcoin in Gini of daily balances.

It’s worth noting that the UTXO record keeping model of Bitcoin would result in more accounts than needed because many transactions are actually tied into a wallet. In the account-balancing model of ETC and ETH, it’s more like a bank so you get a more accurate picture of the balance per account. Therefore, this might actually push the Gini up or down depending on where there’s more transactions tied to a wallet in the Bitcoin case.

Here, we see how clearly distributed Bitcoin is, hovering evenly around 0.55 before reaching 0.5 during the market crash. Bitcoin has the best Gini by far.

Analyzing ETC with Nakamoto Coefficient

We wanted to make things a little bit interesting before you took off, so we added one final set of analyses you can do with the BigQuery ETC dataset.

The Nakamoto Coefficient was proposed by Balaji Srinivasan as a measurement to compliment the Gini Coefficient as a way of quantifying decentralization in his post “Quantifying Decentralization”.

The Nakamoto coefficient is taken from the Lorenz curve and counts the minimum number of entities required to gain more than 51% influence in a sector of a blockchain.

The Lorenz Curve is a graphical representation of the inequality distribution.

Nakamoto coefficient measures the minimum number of individuals or entities you need to compromise or control more than half of the network’s subsystem. The following is a mathematical formula taken from Balaji’s Medium post for the Minimum Nakamoto Coefficient.

Formula for Minimum Nakamoto by Balaji Srinivasan

The higher the Nakamoto Coefficient, the more decentralized the sector or subsystem is.

6 Sectors or subsystems were identified in the article that must be measured to quantify the decentralization of the blockchain.

It’s not necessarily only applied to network hash rate, but can be applied to:

  • Daily Balances: How many entities own more than 51% of currency.
  • Mining Rewards: How many miners discover more than 51% of the blocks.
  • Codebase Repository: How many developers contribute more than 51% of code commits for the blockchain client.
  • Market: How many exchanges control more than 51% of the volume of the currency.
  • Client Usage: How many clients are used by more than 51% of the miners.
  • Miner Geography: How many countries have more than 51% of all miners.
New Nakamoto Library We Built

For that, I felt it’s best to create a Python library specific for the analysis, so I went ahead and built nakamoto a tool that helps one analyze the Gini and Nakamoto coefficient of a blockchain. It can also be used to plot Lorenz Curves. The Github link to nakamoto is found here if you’re interested.

You can easily install it with:

pip install nakamoto

Furthermore, I’ve provided a Kaggle notebook that goes over all the analysis and plotting with fully open-sourced code in more details, which you can check out.

The nakamoto Python module can analyze all 6 sectors for you.

To demonstrate the library, we will do 2 analyses, one using BigQuery SQL, and another by analyzing a Github repository.

For this analysis, we will measure the decentralization of Ethereum Classic’s top rich list and codebase.

For the top rich list, we will measure the minimum amount of entities that have more than 51% supply of Ethereum Classic. We will also plot the Lorenz Curve.

For top rich list, we run the query for daily rich balance here. We then use nakamoto as the following:

from nakamoto.sector import CustomSector

rich_list = CustomSector(rich_list_data, 
currency,
sector_type,
**nakamoto_config)
rich_list.get_gini_coefficient()

This would result in:

0.90683

That’s a fairly high number, meaning more inequality of ETC as it’s mostly in bigger pockets or it could be because it’s a smaller network. Also, over time, it’ll probably improve as the network grows.

If we run rich_list.get_nakamoto_coefficient()

The Nakamoto coefficient here would return 48 which is the number of entities with more than 51% supply of all of ethereum classic circulating.

Now, let’s plot the Lorenz curve like this:

import plotly.tools as tls
from IPython.core.display import HTML
plot_url = rich_list.get_plot_url()
plot_html = tls.get_embed(plot_url)
HTML(plot_html)

The entities marked in red are the ones with collectively more than 51% influence on the daily balance. The red mark is the upper 51% bound of the Lorenz curve and the lower blue marks what’s lower.

Measuring Decentralization of IOHK’s Mantis Client

Mantis, the Ethereum Classic client built by IOHK

For the codebase, we will be analyzing IOHK’s Mantis Client. More specifically, we will measure the code contributions by each developer calculated by number of commits. This will give us an indication of how decentralized the IOHK Mantis Client is and what is the minimum amount of developers you need to compromise the system. We realize it’s not always a practical take, but the idea behind this was inspired by the 51% attack needed to control a network. In reality, you’d just need to comprise a few people to compromise a codebase due them injecting harmful codes that affect miners and network.

You can do codebase analysis using the special sector class Repository found in nakamoto

from nakamoto.sector import Repository
github_url = 'https://github.com/input-output-hk/mantis'
currency = 'ETC'
repository = Repository(github_url, 
github_api,
currency,
**nakamoto_config)
repository.get_gini_coefficient()

The Gini coefficient of Mantis will be:

$ 0.381

This means that the IOHK repository distribution leans heavily towards the side of equality and decentralization. That’s good to hear!

Let’s get the Nakamoto coefficient by calling:

repository.get_nakamoto_coefficient()

This will give us: $ 3

3 developers are the minimum amount you need to influence 51% of the Mantis Ethereum Classic client Github repository. That’s a good sign of more distribution among developers, but obviously 4 is better. Still, along with the gini coefficient leaning towards line of equality, it’s good enough. If we plot the Lorenz Curve, we get the following graph.

Minimum Nakamoto

One last thing you can do with with the nakamoto library is getting the minimum Nakamoto Coefficient of all the sector classes to find the most vulnerable sector. You can also get the maximum Gini, which shows most centralized/most unequal distribution among sectors.

from nakamoto.coefficient import Nakamoto

sector_list = [sector_1, 
sector_2,
sector_3,
sector_4,
repository,
custom_sector]
nakamoto = Nakamoto(sector_list)
nakamoto.get_summary()

This generates a nice summary of all your sectors as a Pandas Dataframe.

Minimum Nakamoto Breakdown
nakamoto.get_minimum_nakamoto()
`2.0`
nakamoto.get_maximum_gini()
`0.930`

Wrapping Up

Well, I hope you enjoyed this guide into Google BigQuery and the Ethereum Classic Dataset! It would be awesome to see what sort of stories the data tells us. The Gini signals around the DAO Hardfork and its aftermath indicate it’s worth analyzing what happened around the DAO. With the Google BigQuery Ethereum Classic Dataset, you can do just that. You can also analyze the 51% attack in more details.

If you want more code examples, I’ve written two Kaggle notebook kernels to accompany this Medium Post.

One Kaggle notebook is the general BigQuery analysis queries and plotting.

The other Kaggle notebook goes over the Nakamoto library and BigQuery in more details.

The Nakamoto library is fully open sourced and I welcome all contributions and PR. For a Github repository on all queries I’ve run for Ethereum Classic SQL on BigQuery, check it out here. Please provide any feedback and I hope you enjoyed this data analysis guide.