Transforming Blockchain Data for Better Analytics

Dan Kleiman
Flipside Crypto
Published in
7 min readApr 26, 2019

You may already be familiar with blockchain as an immutable public ledger — a way to capture and store information as a shared source of truth. But have you ever asked exactly what type of information you can pull from the blockchain? What kinds of questions can blockchain-backed data answer?

In this post, we want to explore those questions, highlighting what kinds of questions are naturally and easily answered by blockchain-backed storage and what kinds of questions are more easily answered by utilizing other storage formats.

At Flipside Crypto, our customers — crypto projects who want to deeply understand their users — need to answer a whole series of questions that are pretty much impossible to answer by pulling data directly from the blockchain. Instead, we transform, store, and aggregate data into more traditional relational formats in order to calculate metrics like network activity or visualize trends in adoption and retention.

We’ll get into those metrics below, but first we should explore the data that we can pull directly from the chain.

What data can you actually pull from the chain?

While any particular blockchain might have its own implementation, at a high level, you can generalize the shape of blockchain data as being a series of linked blocks, each composed of a group of transactions, where the transactions are interactions between “addresses” — cryptographically controlled identifiers on the chain.

Common blockchain libraries allow you to query for things like “the latest block in the chain” or “balance for a specific address”. The query interfaces all make sense when you consider how the chain gets built up. Users, who control addresses, submit transactions they would like to record to a pool of potential transactions. These transactions are then “mined”, where the miners attempt to group them into a valid block that will eventually be recorded on the chain.

Using the Python web3 library as an example, we can explore the Ethereum blockchain over JSON-RPC. Once you have a connection to an Ethereum node, an easy entry point is pulling the latest block:


from web3 import Web3
from web3.providers.rpc import HTTPProvider
web3 = Web3(HTTPProvider(YOUR_ETH_NODE_ENDPOINT))
web3.eth.getBlock(‘latest’)

Which returns:


{
‘difficulty’: 1795771723407806,
‘extraData’: HexBytes(‘0x4477617266506f6f6c’),
‘gasLimit’: 8000000,
‘gasUsed’: 7967459,
‘hash’: HexBytes(‘0xe65324119713c5452c0c4c1d03567d0ec9fff7ff87a385e5f995fa11b74d468f’),
‘logsBloom’: HexBytes(‘0x0808202800200011040000040a004807000082988110090005c08405d30010000000002020041c00290020088050040480508890c0200014418a0860233d500c0001144201000c010112000820520c0440400420820410400a3020000d0002000800000233004a048108418004e26a101400840008c0200141110850003a620180810805000208480002400108c0100908d1400900240120100eb0690480001126228840081890140504000004d08600018001000100102c0080824024004540820480824002040091101040000302203488620a400404104d08400001206801119040253009508000880002810888201004920020000000c040000040908008’),
‘miner’: ‘0x2a65Aca4D5fC5B5C859090a6c34d164135398226’,
‘mixHash’: HexBytes(‘0xf87d5849ba31b83a89cee93d7bffe685976b3243b2fcb716a482cb8a21456ae1’),
‘nonce’: HexBytes(‘0xc99b7c4005f1603a’),
‘number’: 7637496,
‘parentHash’: HexBytes(‘0x3caee2ba302620bd42b28e9425877b4fdebfe65146915a53bb224aa64522a34e’),
‘receiptsRoot’: HexBytes(‘0xd7e9b766b499b2947d44a5871caefe46ffa6c90fb2b98b5c86a303268271266f’),
‘sha3Uncles’: HexBytes(‘0x1dcc4de8dec75d7aab85b567b6ccd41ad312451b948a7413f0a142fd40d49347’),
‘size’: 17935,
‘stateRoot’: HexBytes(‘0x68526f828d1cd56ed5a36887b82b3b3712ebf0a200ebd324c51bf681bc5ad3ff’),
‘timestamp’: 1556208160,
‘totalDifficulty’: 9960526572756421265222,
‘transactions’: [HexBytes(‘0xce6e1dbcc3afa4cc6b0f60bfb467328e423a32a4df8d440023a4bb3d558aa9fc’)…],
‘transactionsRoot’: HexBytes(‘0xe313ae328075a3792f3c015bf056ee6a48949c253f8234b670c09fbc0b6b22ba’),
‘uncles’: []
}

At the block level, there is:

  • context metadata about how and when the block was mined
  • who mined it
  • how it can be validated
  • and for our parsing needs: an array of transaction hashes under the `transactions` key.

We can in turn ask for the transaction information using the transaction hash:


tx_hash = block.get(‘transactions’)[0]
tx = web3.eth.getTransaction(tx_hash)

Which returns:


{
‘blockHash’: HexBytes(‘0xe65324119713c5452c0c4c1d03567d0ec9fff7ff87a385e5f995fa11b74d468f’),
‘blockNumber’: 7637496,
‘from’: ‘0x98457a0B2A8169BF14Cb1A185e943165BdB807A4’,
‘gas’: 21000,
‘gasPrice’: 99000000000,
‘hash’: HexBytes(‘0xce6e1dbcc3afa4cc6b0f60bfb467328e423a32a4df8d440023a4bb3d558aa9fc’),
‘input’: ‘0x’,
‘nonce’: 66,
‘r’: HexBytes(‘0xd9aa38aa2d1fab1a41cc4cadb7076f8257964c661932adface7190f593e7db13’),
‘s’: HexBytes(‘0x6df6eeddf75a62fb84bd8eb267f06d193e74625208e74420f6b1cd2dfa8f0dd0’),
‘to’: ‘0x91BcD9C854bFde4F85e8639DF2176e03672720bC’,
‘transactionIndex’: 0,
‘v’: 38,
‘value’: 6000000000000000000
}

The transaction includes:

  • block-level reference data
  • fees paid to execute and record the transaction
  • sender and recipient data
  • input data which becomes important in Ethereum for more complex transactions (in our example, it’s actually an empty value)

You can see a cleanly decoded version of this transaction here on Etherscan.io.

You could continue pulling block after block and decoding transactions, which is what we do in our ETL flow to parse each chain, but let’s step back and ask what we’re doing here from a data access perspective.

In order to record transactions on the blockchain miners need to be able to perform a couple simple lookups:

  • “give me the latest block” — since they will be adding to the chain
  • “does this user have enough currency to spend on this transaction” — to validate each transaction that can form the next block of transactions

And when they mine the block, they will record inputs and outputs for each transaction. On a smart contract platform like Ethereum, these outputs can get very complicated because they contain the results of executing any arbitrary contract call — think of like the output of any function call in any given programming language…and think about what it would take to encapsulate all those possible outcomes to be decoded later on. Feel bad for our data team yet??

So “the truth” of each of these transactions is forever recorded on the blockchain…but what have we not done here?

Miners do not need to compute:

  • how the usage of this token has declined or increased in the last thirty days
  • whether the user-base on this platform growing
  • the rate at which apps are launching on the platform

In other words, the data stored on the chain and the interfaces built to access it serve to mostly answer point-lookup-style queries, not aggregate analytics.

What can you do with on-chain data?

Since our customers want to answer questions about adoption and usage trends, and not just do point lookups, we have to go beyond just crawling the chain. We have store transaction data in a way that lets us flexibly segment and aggregate it. The good news is, that for many of the questions we end up asking at Flipside Crypto, a straightforward relational data model is all we need.

For each block, we decode all the transactions and store the them as one or more individual records. An Ethereum transaction looks pretty similar to the rpc response above, with the following fields that get pulled from the block, the transaction, some additional metadata and decoded outputs from any potential contract calls:

Ethereum Transaction storage schema

Metric Example 1: Economic Throughput

With these values, we can slice and dice the network data by project to show larger trends, like economic throughput (explained in depth in the context of Bitcoin here by Nic Carter).

Our individual transactions easily roll up by project, and so we can calculate average transaction size and number of transactions over time. If we plot average transaction size against number of transactions on a project-by-project basis, we can start to see how different projects are used:

Economic Throughput by Project

Projects farther to the left are more similar to payment systems, like Paypal, that is, a high number of smaller value transactions. In contrast, projects with larger transfer sizes act more like stores of value. A project’s placement in this plot could confirm whether users are doing what the creators intended.

While “high-level averages across all projects” is interesting to help us get oriented to the ecosystem as a whole, our customers want to go deeper, to better understand their users. Since we start at the individual transaction level, we can refine the aggregate stats and break down transfer metrics in other ways.

Metric Example 2: Concentration of User Activity

For a single project, we can start to understand how active their user base is over time. In the plot below, we are looking at average transfer size again, but this time, we can compare it to the number of transfers sent per user per month to reveal the concentration of user activity:

Monthly User Activity

Often, crypto projects we work with have a profile of their ideal user behavior. By looking at these behavioral groupings over time, they can better understand if usage of their project is trending towards or away from their ideal.

Metric Example 3: User Retention

Finally, our analytics become even more powerful when we let projects map their off-chain efforts to on-chain activity.

In this user retention chart, we’re looking at the percentage of users from each month that are active six months later:

User Retention Tracking

A project can look at this chart and better answer questions like: Did a marketing campaign one month pay off by recruiting the right new users? How did a product redesign impact long-term usage?

By capturing retention from on-chain data, we can help them track and measure their off-chain engagement efforts in a much more concrete way.

Chaining it all together

As a data engineer, it’s been fascinating to see how a transformation in data storage — from blockchain to relational — has enabled so much insight for our customers. For all the merits of blockchain as a distributed datastore, capturing high level analytics isn’t one of them!

Just last night we were discussing some of these metrics with a data scientist who looked at us and said “What are you talking about? There are no customers on the blockchain!” We hope you’re a little surprised by these insights. We think it is possible to explain how people are using blockchain-based projects in the way that helps projects refine their missions, and ultimately leads to a healthier ecosystem.

We have many more data transformations ahead of us too. Each time we experiment with a new format, we find new questions we can answer.

If you’re interested in answering these questions and exploring how to provide more insight into what’s being built in the blockchain space, feel free to reach out.

Find me on twitter or email data@flipsidecrypto.com

--

--