A new way of using open data: Query the chain

Bernardo Vieira
Published in Analytics Vidhya, Nov 12, 2019
Photo by Launchpresso on Unsplash

“Data is the new gold”

“Is Big Data the New Black Gold?”

“Data isn’t the next gold. It’s the next uranium”

“The world’s most valuable resource is no longer oil, but data”

Spoiler alert: I’m a blockchain developer, not a data scientist, and this article is about accessing data with TheGraph. But it is the fall of 2019, and we still believe (or are starting to believe) that data is the next gold. Some might question that, but it’s not a topic to discuss here.

Data is, without a doubt, one of the best resources we’ve ever had. Not because of its monetary value (unlike gold), but because of what it tells us and what it says about us. Even if it sounds uncomfortable, we sometimes find things that we didn’t know. After all, how could an ML or AI system learn or train without data?

Once again, the most valuable aspect of data here is the quality that it brings to our lives. We’ve been able to do studies at a much faster pace. We’ve been able to collect thousands of pictures and create smile recognition systems. We’ve collected thousands of words and built systems that recognize the subject of a text, among many other things.

Not only, but also

As I mentioned in the beginning, I’m not a data scientist. But I know that when a company receives data to study, a few steps are required before actually starting to study it. Before going into any ML or AI system, it needs to be standardized and anonymized.

With that said, let’s assume that the following data is public.

There are many questions that can be asked, but some of the most important will be:

  • What if this data was yours? Nobody will know it’s you.
  • Could this be relevant for studies?

I know it’s healthcare-related data, but the truth is that it says nothing about anyone. And it could lead us to very interesting conclusions: how busy a hospital is likely to be, based on past information; the most common reasons for admission on given dates; and how our age and lifestyle influence that.

Besides the monetary value this data can have, it’s no secret that it can dramatically improve our lives.

Public data from the blockchain

It was recently discovered that the gas price has its peak around lunchtime. Isn’t that interesting?

Before moving to public data not yet available, I want to talk about public data from the blockchain. In this case, from any chain in the Ethereum ecosystem.

Let me introduce you to EIP 1767, intended to define a GraphQL schema for Ethereum node data. At the time of writing, you can only query data using JSON-RPC, over HTTP or WebSocket. That interface has a few problems, described in the EIP, and it’s possible to do better.
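To make the JSON-RPC side concrete, here is a rough sketch of what such a query looks like from client code. It only builds (and prints) an eth_getLogs request for a single contract; it does not send anything. The helper name buildGetLogsRequest and the block range are made up for illustration, and the address is the one used later in this article.

```typescript
// Shape of a JSON-RPC 2.0 request as sent to an Ethereum node.
interface JsonRpcRequest {
  jsonrpc: "2.0";
  id: number;
  method: string;
  params: unknown[];
}

// Hypothetical helper: build an eth_getLogs request for one contract
// over a block range. Block numbers are hex-encoded quantities.
function buildGetLogsRequest(
  address: string,
  fromBlock: number,
  toBlock: number,
  id: number = 1
): JsonRpcRequest {
  return {
    jsonrpc: "2.0",
    id,
    method: "eth_getLogs",
    params: [
      {
        address,
        fromBlock: "0x" + fromBlock.toString(16),
        toBlock: "0x" + toBlock.toString(16),
      },
    ],
  };
}

const getLogsRequest = buildGetLogsRequest(
  "0xf2Dee5975A808f16f93bf4Fd55aB5481a8B20497",
  0,
  255
);
console.log(JSON.stringify(getLogsRequest, null, 2));
```

The node answers with raw, hex-encoded logs, so decoding, filtering and aggregating are left to the client. That is the work an indexing layer takes over.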

It’s already possible to execute some GraphQL queries using Grid and ethql. Better than those two, in my opinion, is TheGraph.

Although I only discovered it recently, TheGraph was started two years ago. Its main goal is to have “scalable queries for a decentralized future”. The project as a whole is complex and beyond what I can cover in this article. For now, I’d just like to show how the tool works.

Queries with TheGraph

TheGraph works with smart contracts already deployed on a network, so it needs to know the contract addresses before a subgraph is deployed. Indexing data is a big problem that would otherwise take a lot of time to solve in every project. TheGraph solves it for you.

Let’s assume that you have a project already deployed in a network. To simplify it, let’s assume that you are using the simple-ethereum-dapp from TechHQ.

Clone that repository, and make a small change in the SimpleStorage.sol file so that it emits an event to be processed by TheGraph. TheGraph works by indexing events, making it faster to retrieve information.

Before the storedData variable, define an event with event SetX(uint256 x);

Then on the set function, emit the SetX event with emit SetX(x);

To have a working blockchain environment feeding events to TheGraph, first install the dependencies (using yarn), start the network (run yarn start:ganache:dev) and deploy the contracts (using npx truffle migrate --network development). Then set up the subgraph.

First install graph-cli (with yarn add @graphprotocol/graph-cli) and initialize the subgraph with npx graph init --from-contract 0xf2Dee5975A808f16f93bf4Fd55aB5481a8B20497 --network development. When prompted for the name, type start/simple; the directory can be simple, which is the default. When asked for the network, choose any: it makes no difference, because we are using a local one. The contract address should have been filled in automatically. In case it fails to find the ABI, just provide the path to the SimpleStorage.json file (it should be ./build/contracts/SimpleStorage.json).

Now you’ve got a new folder named simple, containing TheGraph code. Let’s have a look at the new folders and files.

  • abis — contains files with contracts’ ABI
  • generated — don’t touch it, it’s generated automatically using all the other files
  • src — contains the source code to fuel the graphql server
  • schema.graphql — defines the graphql schema
  • subgraph.yaml — defines the subgraph

Deploy a subgraph

A subgraph is like a set of instructions, deployed and linked to a URL. It allows you to query information about a group of contracts, using graphql.

You could deploy a subgraph to a hosted service, but it is sometimes easier to do it locally from the command line.

Move to the generated simple folder.

The first time, you need to create the subgraph (with yarn create-local). Once that succeeds, deploy the subgraph (with yarn deploy-local).

By the end of the last step, if it was successful, you should have got a link similar to http://localhost:8000/subgraphs/name/start/simple. By following this link you will get a GraphiQL webpage. For those of you that are not familiar with graphql, I’ll walk you through this simple and useful UI.

First of all, let’s get all the actions (setting a different value for x) that were done.
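The embedded query did not survive here, so the following is only a sketch of what it could look like. It assumes the entity scaffolded by graph init is still called ExampleEntity (queried as exampleEntities) and has the fields id, count and x; check the generated schema.graphql for the real names. The query text between the backticks can be pasted into GraphiQL as is, and the hypothetical helper buildGraphQLRequest shows how the same query would be wrapped for an HTTP client.

```typescript
// Assumed endpoint: the local URL printed by `yarn deploy-local`.
const SUBGRAPH_URL = "http://localhost:8000/subgraphs/name/start/simple";

// Assumed entity and field names; the real ones live in schema.graphql.
const ALL_ACTIONS_QUERY = `{
  exampleEntities {
    id
    count
    x
  }
}`;

// Hypothetical helper: build the HTTP POST options that carry a GraphQL query.
function buildGraphQLRequest(query: string): {
  method: string;
  headers: Record<string, string>;
  body: string;
} {
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query }),
  };
}

const allActionsRequest = buildGraphQLRequest(ALL_ACTIONS_QUERY);
console.log(SUBGRAPH_URL);
console.log(allActionsRequest.body);
```

The returned object has the shape that fetch accepts as its second argument, so fetch(SUBGRAPH_URL, allActionsRequest) would send the query while the local graph node is running.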

Using the above code in GraphiQL, you will get an empty array in the response, because nothing has happened yet. But if you change the value a few times and query it again, there will be some data shown.

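Move to the root folder of the cloned repository (not simple) and call the set function a few times with different values, for example from a Truffle console connected to the development network.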

If you execute the query above again, you will get some information. You will see only the last result, because we are indexing actions by the user who sent them. You can change that in src/mapping.ts, at let entity = ExampleEntity.load(event.transaction.from.toHex())

To get more events you could change from to hash. Then TheGraph will index transactions by hash instead of user address, saving all transactions.
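Why the id matters can be shown without TheGraph at all. The sketch below is plain TypeScript, not mapping code: a Map stands in for the entity store, and the SetXEvent interface and indexEvents helper are hypothetical names of my own. Saving an entity under an id that already exists overwrites it, which is what happens when the id is the sender address.

```typescript
// Minimal stand-in for the event data a mapping handler receives.
interface SetXEvent {
  txHash: string; // unique per transaction
  from: string;   // sender address, repeated across transactions
  x: number;
}

// Mimic load-or-create followed by save: an existing id is overwritten.
function indexEvents(
  events: SetXEvent[],
  idOf: (event: SetXEvent) => string
): Map<string, number> {
  const store = new Map<string, number>();
  for (const event of events) {
    store.set(idOf(event), event.x);
  }
  return store;
}

// Three transactions, all sent by the same (made-up) account.
const sampleEvents: SetXEvent[] = [
  { txHash: "0x01", from: "0xaa", x: 3 },
  { txHash: "0x02", from: "0xaa", x: 7 },
  { txHash: "0x03", from: "0xaa", x: 9 },
];

const bySender = indexEvents(sampleEvents, (e) => e.from);
const byHash = indexEvents(sampleEvents, (e) => e.txHash);

console.log(bySender.size); // 1: only the last value for 0xaa survives
console.log(byHash.size);   // 3: one entity per transaction
```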

Query users that changed x to values between 5 and 9, during the last 2 minutes

It sounds like a complex challenge, but in reality, it is very simple. To query this new data, we need the contract to emit it.

The event now emits the user who changed the value, and the timestamp according to block.timestamp.

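Let’s then update the event handler in subgraph.yaml so that it matches the new event signature.

Consequently, we also need to update schema.graphql so that the entity stores the new fields.

And abis/Contract.json needs to reflect the new event as well.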

The last step is to change src/mapping.ts, but first, we need the newly generated code. It can be obtained with yarn codegen.

And lastly, update src/mapping.ts so that the handler uses the transaction hash as the entity id and saves the new fields.

TheGraph is now ready to index data using the transaction hash as the id, and to save the new x value, the account address that changed the value and the timestamp.

Before executing a new query, you have to deploy the new contract and update the subgraph. Remember that subgraph.yaml contains the contract address, and redeploying the contract will most likely give you a new address. Since you are running a ganache instance (it must be the one provided by the cloned repository, started with yarn start:ganache:dev), you can take advantage of its deterministic behavior: if you restart ganache and deploy the contracts again (see the command above), you will get the same addresses.

So, just make sure to have the right address and deploy the subgraph again (with yarn deploy-local).

If you do the following query (in GraphiQL) now, you will get the desired result
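The original query is missing here, so what follows is a sketch rather than the exact query. It relies on the _gte and _lte filter suffixes that TheGraph generates for numeric fields, and it assumes the entity is still queried as exampleEntities with fields named x, user and timestamp; adapt these to your schema.graphql. The helper buildRecentRangeQuery is a hypothetical name, and I am reading “between 5 and 9” as inclusive. If x and timestamp are declared as BigInt, the values may need to be passed as quoted strings.

```typescript
// Hypothetical helper: build a GraphQL query for entities whose x lies in
// [minX, maxX] and whose timestamp is at most windowSeconds older than
// nowSeconds. Entity and field names are assumptions, not taken from the
// generated schema.
function buildRecentRangeQuery(
  minX: number,
  maxX: number,
  windowSeconds: number,
  nowSeconds: number
): string {
  // block.timestamp is in seconds, so the cutoff is computed in seconds too.
  const since = nowSeconds - windowSeconds;
  return `{
  exampleEntities(where: { x_gte: ${minX}, x_lte: ${maxX}, timestamp_gte: ${since} }) {
    id
    x
    user
    timestamp
  }
}`;
}

// Values between 5 and 9, set during the last 2 minutes (120 seconds).
const nowInSeconds = Math.floor(Date.now() / 1000);
console.log(buildRecentRangeQuery(5, 9, 120, nowInSeconds));
```

Paste the printed query into GraphiQL. Because the cutoff is an absolute timestamp, the query has to be rebuilt each time you want a fresh two-minute window.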

You can find more detailed information about queries in TheGraph documentation.

Conclusion

We’ve seen that data can bring us a lot of information, valuable information.

Until now, getting any kind of data was very hard and took very long. These new tools are showing that we can do it better.

Data Scientists, AI, and ML experts, if you are around, join the movement. You will get the amazing opportunity of getting tons of open standardized clean data.

Find me on social media, get in touch.

Special thanks to Alberto Cañada for reviewing, as well as to TechHQ.
