How to interact with the Ethereum blockchain and create a database with Python and SQL

Introductory workshops about blockchain often start with the easy-to-digest story of peer-to-peer network and bank ledgers, and then jump straight to coding smart contracts, which is quite abrupt. So instead imagine yourself walking into a jungle and think of the Ethereum blockchain as a strange creature you are just about to study. Today we’ll observe the creature, interact with it and gather all the data about it into a centralized storage for your own use.


Setting up for the first encounter

First, you will need to install web3py. Web3py is a Python library for connecting with the Ethereum blockchain. What you need to know beforehand is that there is no central administrative system where the data can be downloaded from. The interconnected nodes (“peers”), which share resources amongst each other, store a verified copy of the data (or its part). The network executes the Ethereum protocol, which defines the rules of interaction of nodes with each other and/or smart contracts over that network.

If you want to access information about transactions, balances, blocks or whatever else is written into the blockchain you don’t know of yet, the protocol requires you to connect to the nodes. The nodes continuously share new data with each other and verify the data, so in this way you are sure you get 1) data which has not been tampered with and 2) which is most up-to-date.

There are two basic categories of nodes you could use in your first approach to the creature: local or hosted. A local node can run on your machine, which means you first need to download a client like geth that will sync the blockchain to your device, occupying storage and taking time to complete. For the first encounter, a hosted node is a better choice — it is controlled by somebody else but you can easily connect to it and play around with the blockchain on your own.

Go to Infura and make your own free account to access such a hosted node. When you’re done, you’ll see a list of networks you could connect to: the mainnet (the main Ethereum blockchain), and a bunch of testnets, which are there to basically test your smart contracts, so that you can make mistakes on them and correct them before you deploy costly code to the mainnet.

Time for the first approach. Import the Web3 object and establish an HTTP connection.

from web3 import Web3 
web3 = Web3(Web3.HTTPProvider("https://mainnet.infura.io/your-own-personal-number"))

And you’re all set! Now you can explore the data structure with the web3 API.

Retrieving info about specific blocks…

#current block number
>>> web3.eth.blockNumber
5658173
#get the content of the most recently mined block
>>> web3.eth.getBlock('latest')

This command returns the AttributeDict data structure, which is a dictionary of key-value pairs that looks like this:

Not all of these variables will be immediately useful to you, as some are quite technical and their meaning will only make sense once you have a deeper understanding of how blockchain actually works. You can read more about them in the so-called ‘Yellow Paper’ or skip them for the time being and work with the easily understandable ones.

In short, a block contains the block header, a list of verified transactions written to it and a list of uncles (block identifiers of miners who were slightly too slow with their blocks to make it to the main blockchain but still got rewarded with Ether for their computational effort). Below you can read what the meaning is of each variable, which I divided into subcategories.

General

Mining-related

Uncles

Technical

…transactions and their receipts

Now we can also look up single transactions in a block by their unique identifiers, i.e. transaction hashes.

As previously, web3py returns us an attribute dictionary. The table below summarizes what each key stands for.

Finally, we can also look into transaction receipts:

A transaction receipt contains a few repeated and new entries; the new ones are explained below.

For the reference, I included various additional resources besides the Yellow Paper to compile these tables [2, 3, 4, 5].

As you can see, with just a few simple commands you can already connect to the network and get basic info about the transactions, blocks, or states in the raw format. This opens a new window to what can be done with such data!

Database management system

When planning to write your data to a proper database, you probably realize that there are many solutions for management systems out there for Python enthusiasts, such as serverless SQLite, or server-based MySQL, PostgreSQL, or Hadoop. Depending on what you are intending to do, you will have to determine which option is the best for your project. In general, I’ve found these points to be helpful:

  • What is the intended size of the database (i.e. can it be processed on a single machine system)?
  • Are the entries going to be frequently edited or will they remain fixed?
  • Is the database supposed to be accessed and edited by multiple parties/apps simultaneously?

The Ethereum blockchain is growing steadily over time, getting close to 1 TB as of June 2018, which is small, hence not optimal for a distributed processing system like Hadoop. The blockchain database will be written once and then only expanded with new entries, leaving old entries unchanged. The intended use case of this database is to be written by one channel and accessed read-only by the other channels, so we do not really need to run it on a server. Keeping the database locally on your machine will result in a quick read-out, which is desirable and achievable with a serverless management system like SQLite. And Python has a built-in library sqlite3, thus we don’t even need to install new packages.

Database design

The next step is designing your database. Keep in mind which data fields are the most relevant for your analysis, and aim to optimize both search and storage. For example, if you do not plan to use stateRoot, you may want to completely skip it or keep it in a separate table. A table with fewer columns can be searched through faster, and if you later on realize that you actually have a use-case for the stateRoot, you will still be able to access it. You may also want to separate block information from the transaction information; if you don't, block properties like timestamp will be repeated N times for all transactions in the block, wasting lots of space. Matching a transaction with its block properties will be easy with the JOIN operation later on.

The database I designed consists of 3 tables:

  • Quick: most relevant transaction info for quick access & analysis,
  • TX: all remainder transaction info,
  • Block: block-specific info.

The naming convention of variables has been slightly altered with respect to the original web3py to get rid of ambiguities, such as calling both block hash and transaction hash the “hash”, or using “from”/”to” as column names, which in SQL have a different meaning and would crash the program.

Transaction values, balances and other big numbers need to be stored in the database as strings. The reason is that SQLite can handle only signed integers stored in up to 8 bytes, with a maximum value of 2⁶³-1 = 9223372036854775807. This is often much lower than the transaction values in wei (e.g. only 1 ETH= 10¹⁸ wei).

Building your mini database

The full code can be found on GitHub. It will organize the blockchain info according to the upper schema and output a blockchain.db file containing data of a pre-specified number of blocks. To test it, go to database.py file and pick a reasonable number for the number of blocks to be written, e.g.

Nblocks = 10000

By default, you should point the web3 object to your Infura endpoint. You can also switch to the IPC Provider if you have one (i.e. your local node), just uncomment the line

# or connection via node on the VM 
#web3 = Web3(Web3.IPCProvider('/path-to-geth.ipc/'))

and fix the path. Then simply run in your command line python database.py. The code will dump the number of the last written block into the file lastblock.txt, in case you need to restart where you left off.

How to use the database

Once you’ve written the first entries to the database, you can start communicating with it via ipython shell. For example, to print the first 5 rows of the table “Quick”, you can run the code below.

Local node vs. Infura

If you want to build a big database, you should download geth and sync a node. The synchronization can be done in 3 basic modes:

If you do not need past account states, you can sync your node in fast mode [6].

Below is a plot showing you the speed at which this code writes to a database, communicating with the fully synced node locally (IPC) vs. an address on Infura (Infura). As you can see, it pays off to run this code on a local node, as you get the speed boost of nearly 2 orders of magnitude (aka 100x)!

The time it takes to write 10 blocks of transactions between blocks 2000000 and 2000400. Time is on the logarithmic scale (10⁰=1, 10¹=10 and so on ;).

Summary

Now that you have your own local database of what happened and happens on blockchain, you can start exploring it. For example, you can count the number of transactions since its genesis, see how many addresses are generated as a function of time — the sky is the limit for what you can learn about your creature. We just set the stage for your data science playground. So go ahead and explore it, or check the next posts for potential applications.


Contact analytics@validitylabs.org if you are in interested in blockchain analytics services of Validity Labs.