Exploring Ethereum with BigQuery and Jupyter Notebook

Yusup
Yusup
Sep 4, 2018 · 4 min read

Raison d’etre

As of now, Ethereum is 3 years old, and it requires several hundred Gigabytes of disk space to run a full node.Thus it is a bit painful to explore Ethereum if you are limited by hardware resources. Since Ethereum API is pretty limited for many cases, even if have limitless resources, you have to load daily transactions into some database day by day.

Last week, Google announced the addition of Ethereum to its BigQuery public datasets, which ignited my interest to tinker with BigQuery for the first time.

BigQuery is NOT FREE! Only the first TB for each month is free, and please use it at your own risk!

First Impression

Here is the query for finding out the top gas burners for the August, 2018.

SELECT transactions.to_address AS addr,  sum(CAST(transactions.gas as float64) * CAST(transactions.gas_price as float64)) AS gas_cost
FROM `bigquery-public-data.ethereum_blockchain.transactions` AS transactions WHERE transactions.to_address IS NOT NULL AND DATE_TRUNC(DATE(transactions.block_timestamp), MONTH) = '2018-08-01'
GROUPY BY transactions.to_address
ORDER BYgas_cost DESC
LIMIT 100

CAST to float64 is necessary to avoid int64 overflow.

When transactions.to_address is null, it means new contract creation

https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators

BigQuery supports wide range of functions and expressions. It is super fast, and the query output is exportable via json, csv, and Google Spreedsheet formats.

Only caveat is that there is no builtin tool to visualize query output and more importantly sometimes simply querying a dataset alone does not cut it. This is where BigQuery Jupyter Notebook integration comes.

Jupyter Notebook Integration

Follow the instructions in the Getting Started with Authentication page to set up application default credentials.

export GOOGLE_APPLICATION_CREDENTIALS="/home/user/Downloads/[FILE_NAME].json"pip install --upgrade google-cloud-bigquery[pandas]
jupyter notebook

Jupyter should now be running and open in a browser window.

I always thought that daily/monthly/… transaction count is somewhat correlated to the trading price. Since the big guns are ready, now I can prove/disapprove this hypothesis. For this particular hypothesis, I use Ethereum data from 2015–07 to 2018–08 period.

# init bigquery client
from google.cloud import bigquery
client = bigquery.Client()

Create a bigquery client.

sql = """
SELECT DATE_TRUNC(DATE(transactions.block_timestamp), MONTH) AS tx_month, COUNT(*) AS tx_count
FROM `bigquery-public-data.ethereum_blockchain.transactions` AS transactions
WHERE DATE(transactions.block_timestamp) > DATE('2015-07-31') and DATE(transactions.block_timestamp) < DATE('2018-09-01')
GROUP BY tx_month
ORDER by tx_month ASC
"""
df = client.query(sql).to_dataframe()
df.head()

Run the query.

df.plot(kind='bar', x='tx_month', y='tx_count',title='Ethereum Monthly Tx Count', figsize=(15,8))

Trading price data can be fetched via Coinmarketcap API.

import pandas as pd
import time
import seaborn as sns
import matplotlib.pyplot as plt
import datetime
import numpy as np
eth_market_info = pd.read_html("https://coinmarketcap.com/currencies/ethereum/historical-data/?start=20130428&end="+time.strftime("%Y%m%d"))[0]
# convert the date string to the correct date format
eth_market_info = eth_market_info.assign(Date=pd.to_datetime(eth_market_info['Date']))
# convert to int
eth_market_info['Volume'] = eth_market_info['Volume'].astype('int64')
# look at the first few rows
eth_market_info.head()

Now Ethereum daily price is ready , and monthly mean can be calculated as following code.

import numpy as np
import matplotlib.pyplot as plt
eth_market_info.sort_values(by=['Date']).head()
grouped = eth_market_info.groupby([eth_market_info.Date.dt.year, eth_market_info.Date.dt.month]).mean()
grouped_tuple = grouped.aggregate(lambda x: tuple(x))
prices = grouped_tuple['Open*']
prices = prices[:-1] # exclude september
plt.plot(prices)
price_delta = np.diff(prices)tx_count_delta = df['tx_count'].diff()[1:]
price_delta_arr = np.asarray(price_delta)
normalized_price_delta = (price_delta_arr - price_delta_arr.min())/(price_delta_arr.max() - price_delta_arr.min())
tx_count_delta_arr = np.asarray(tx_count_delta)
normalized_tx_count_delta = (tx_count_delta_arr - tx_count_delta_arr.min())/(tx_count_delta_arr.max() - tx_count_delta_arr.min())
plt.figure(figsize=(15,8))
plt.plot(normalized_tx_count_delta, label='tx count change')
plt.plot(normalized_price_delta, label='price change')
plt.legend()
plt.show()
np.corrcoef(normalized_tx_count_delta, normalized_price_delta)[0][1]# 0.8030832019911807

Finally, we can get the correlation coefficient 0.803, which is pretty close to 1.

We can conclude that Ethereum price has a strong correlation with Ethereum transaction count.

Full Source Code: https://github.com/lsgrep/notebooks/blob/master/BigQuery%20Jupyter%20Integration.ipynb

DISCLAIMER: I am pretty new to Python, and Statistics , suggestions & corrections are welcome. Thanks

Yusup

Written by

Yusup

doing some compression

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade