Search and analyze blockchain data

Tomohiro Nakamura
catabira
Published in
6 min readMay 23, 2019

Hi, this is Tomo(Twitter: HAIL), CTO of catabira.

There are a lot of services and games that use blockchain, but how do you browse the data? What kind of data can I actually get? This story is like that.

Blockchains are simply not for advanced search and analysis.

When you want to analyze blockchain data and use it in product development, What do you do? If you wanted to see as small as a few dozen transactions, you could go to a Blockchain Explorer site like Etherscan, search for each one, copy and paste the data into a spreadsheet… but not hundreds or thousands. What should we do then?

Blockchain values the fact that they can form a consensus in a decentralized and programmatic manner and is highly tamper resistant. Therefore, it is not suitable for searching or analyzing data. When you take something, you lose something. ElasticSearch is a better data source for searching, and BigQuery is better for analyzing.

Copy the blockchain data

So, what should I do? The answer is simple: just copy the data on the blockchain to a data source that meets your needs. Just as Blockchain Explorer services do so. Blockchain exists as “correct data” there and the data can be used in different shapes depending on the purpose. There are two things you have to prepare for copying the data: the blockchain nodes and the tool to copy the data. Consider the following Ethereum example.

You can run the blockchain nodes by yourself, or you can use 3rd-party nodes like Infura, Ginco Nodes, AWS Managed Blockchain (The two latter services are not available for Ethereum as of May. 2019) for Ethereum.

As for tools, there are several OSS. For example, Blockchain ETL and TrueBlocks(QuickBlocks), and actually our company catabira has contributed a little to the Blockchain ETL.

However, while the nodes and tools are already in place, it is not easy for non-engineers to generate copies of the data. To satisfy the needs, BigQuery provides public datasets for several blockchains, including Bitcoin (bigquery-public-data:crypto_bitcoin) and Ethereum (bigquery-public-data:crypto_ethereum). It might be a good idea to use them. By the way, they are generated using the Blockchain ETL that I mentioned earlier.

For a simple analysis, such an existing database may suffice. If you or others have a BigQuery environment, give it a try. For your information, here are two sample queries:

Need for enterprise products

But even though the public dataset in BigQuery is available, you may want to see data in different formats for each purpose, or in more detail. Also, public data can be delayed but you may want to see the data updated in real-time.

For example, if you are targeting AML (Anti Money Laundering), you might want to use a graph-type database like Neo4j to analyze the distance between addresses given the set of transactions. Furthermore, Dapps operators and competitors may need to analyze which and how smart contract functions are often used.

Again, all these are analyzing transactions that have occurred on the blockchain. However, submitting queries directly into the blockchain for analysis is very inefficient, and copying to a suitable data source for analysis is overwhelmingly efficient. And, in the first place, there are various experts such as “use of data in marketing” “analysis of graphs”, not only blockchain experts should be involved. So, it is not realistic that everyone should know more about blockchains to use the data.

There are various types of use-cases and given that layer-2 transactions will increase and technologies will diversify to deal with scalability problems, I believe that enterprise products will be launched from a company which has advantages of both data and blockchains. OSS and public databases will not be the only way to analyze.

Side note: JOIN with off-chain data

Many of blockchains that are in production, not just being tested, are costly and not scalable enough to manage a large data. As a result, you may know, they often hold a part of data outside the blockchain (RDBMS, IPFS,…). The Ethereum standard ERC721, used for the NFT (Non-Fungible Token), defines its function as tokenURI(uint256 _tokenId) external view returns (string). This means “using the token’s ID as input, return a URI that contains information about the token placed outside the blockchain ”. Let's try with the popular blockchain game My Crypto Heroes (MCH).

Go to the Etherscan MCH Hero Token page. You’ll see a tab called Read Contract. This tab enables you to execute functions implemented in the contract.

When you scroll down, you will find the tokenURI function. Let’s try to get the URI for the following Hero:

When I typed this ID and queried it, it gave me a URI. And I accessed it, it returned the following information in JSON format. If you’re not an engineer, you might be scared, but think it’s the same as CSV. lol

{
"name": "MCH Hero: #100176912 Lv.4",
"description": "HeroName: MCH Warrior",
"image": "<https://www.mycryptoheroes.net/images/heroes/2000/10001.png>",
"attributes": {
"active_skill": "Rest",
"agi": 17,
"hero_name": "MCH Warrior",
"hp": 51,
"id": 100176912,
"int": 15,
"lv": 4,
"passive_skill": "Crypto Slash",
"phy": 19,
"rarity": "Unknown"
},
"external_url": "<https://www.mycryptoheroes.net/heroes/100176912>",
"image_url": "<https://www.mycryptoheroes.net/images/heroes/2000/10001.png>",
"home_url": "<https://www.mycryptoheroes.net>",
"extra_data": {
"active_skill_id": 2001,
"art_history": [

],
"current_art": "",
"current_stamina": 46,
"hero_type": 10001,
"max_stamina": 46,
"passive_skill_id": 1017
},
"timestamp": 1557380508
}

It shows the level of this hero, the amount of HP, the skill, and other important information for game strategy. If all of this information is put on Ethereum every time it is updated, the cost becomes high, and the processing speed may not be enough. That’s why they put this information outside the blockchain. In fact, There are several ways to solve this problem, but I won’t go into the details here.

These days, tokens are traded on the blockchain through markets like OpenSea. Naturally, users will decide to buy or not the Hero or set the reasonable price considering data outside the blockchain. With the URI I mentioned, we can see the token’s “current Parameters”, but we can’t see at any point in time like “parameters at the time the transaction occurred”.

So, I tried making something that can be analyzed including it!

Each time a token transfer occurs, it stores the token’s information by accessing the tokenURI, and is linked to a dashboard service that can analyze it.

Using this, I have mapped the correlation between the level of Hero at the rare level of Uncommon and the ETH price of the deal. The information such as the level is outside the blockchain as I mentioned earlier, and the transacted ETH is on the blockchain. We can now connect these pieces of information from disparate locations and analyze them.

This is a graph of the hero’s name and the its average ETH price. You can see that Himiko , Genghis Khan and George Washington are traded at high prices.

Innovation happens gradually, and putting information in different places for the primary purpose is not a bad thing. However, it should be solved that the data is difficult to analyze. Even if there is a benefit to use a blockchain, it is not used if there is too much loss. I think blockchain should not be a hard-to-use technology for the people who are used to analyze data with Tableau or Looker. I hope this kind of system will solve the problem.

Next is about the market analysis of DApps and NFT by Naochika Hanamura of our team! I think it will be more data than technology, so look forward to it!

--

--