How to load Bitcoin into Neo4j in one day

Cesar Pantoja, PhD
TokenAnalyst
Feb 22, 2019 · 10 min read

Bitcoin can arguably be better understood as a graph data structure than as a table. I’m not talking about the block-to-block “chain” relationship. I’m talking about the way the actual transactions happen on the chain, where the units of trading are the transactions’ outputs.

Outputs are created by transactions and locked to a specific address; brand-new coins enter the graph through the coinbase transaction of each newly mined block. Outputs are later used (or unlocked) as inputs to new transactions, which in turn generate their own sets of outputs.

Blocks are just collections of transactions that move and split outputs around.

Graph of a simple transaction

In the previous figure, we can visualise a model of a simple transaction. Here Alice has 50 BTC in an output locked by her (either because she received it as a reward for mining a new block or because another transaction produced an output locked by her), and she wants to send 25 BTC to Bob. She would then create a transaction which uses the output locked by her as input and produces two outputs: one sending 25 BTC to Bob and the other returning 25 BTC of change back to her. The unlocked 50 BTC output is now considered spent (and thus can’t be used again), but the new outputs are unspent (so they become Unspent Transaction Outputs, or UTXOs) and can be used in further transactions.
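
To make the model concrete, here is a minimal sketch of how that Alice-to-Bob transaction could be created in Neo4j from Python. The Output and Address labels, the LOCKED_BY and UNLOCKED_BY relationships, and the bitcoinValue property match the ones used later in this post; the Transaction label, the OUTPUTS relationship name, and the connection details are illustrative assumptions.

from neo4j import GraphDatabase  # pip install neo4j

CYPHER = """
CREATE (t:Transaction {hash: $txHash}),
       (alice:Address {address: $alice}),
       (bob:Address {address: $bob}),
       (in50:Output {bitcoinValue: 50.0}),
       (out25a:Output {bitcoinValue: 25.0}),
       (out25b:Output {bitcoinValue: 25.0}),
       (in50)-[:LOCKED_BY]->(alice),   // Alice's original 50 BTC output
       (in50)-[:UNLOCKED_BY]->(t),     // spent as the input of the new transaction
       (t)-[:OUTPUTS]->(out25a),       // 25 BTC to Bob
       (t)-[:OUTPUTS]->(out25b),       // 25 BTC of change back to Alice
       (out25a)-[:LOCKED_BY]->(bob),
       (out25b)-[:LOCKED_BY]->(alice)
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))
with driver.session() as session:
    session.run(CYPHER, txHash="example-tx", alice="1AliceAddress", bob="1BobAddress")
driver.close()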

You can start to see the graph pattern emerge, where a big network of transactions uses outputs from other transactions to generate new outputs, as seen in this figure:

Bitcoin’s transactions graph

At TokenAnalyst, we believe that although modelling the Bitcoin domain in a tabular format has its value (and we in fact do it for different use cases), modelling it as a graph gives us extra power to perform analyses that would otherwise be very difficult to achieve in a regular database: network centrality, path analysis, community detection, and many more. This is why we decided to implement a Neo4j database for our Bitcoin data store.

This blog post explains the general architecture of our solution and how, after a series of optimisations, we managed to reduce the bootstrap time of a new deployment, end to end, to under one day.

Data Pipeline

Here comes the juicy part. The blockchain data resides in the p2p network nodes, so the challenge is how to get the data out of the nodes and into Neo4j (or wherever else we need it) efficiently. We were familiar with Greg Walker’s great bitcoin-to-neo4j project; however, his implementation takes weeks to ingest the whole chain’s data into Neo4j, which for our use case was far too long. Instead, we chose to use JSON-RPC calls to the nodes to get the data and to split the work among a few machines.
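
To give an idea of what the extraction looks like, here is a minimal sketch of fetching a single block (with its fully decoded transactions) from a Bitcoin Core node over JSON-RPC. The endpoint, credentials, and error handling are placeholders; getblockhash and getblock are standard Bitcoin Core RPC methods.

import requests

RPC_URL = "http://127.0.0.1:8332"        # your node's RPC endpoint
RPC_AUTH = ("rpcuser", "rpcpassword")    # placeholder credentials

def rpc_call(method, params=None):
    payload = {"jsonrpc": "1.0", "id": "extractor", "method": method, "params": params or []}
    response = requests.post(RPC_URL, json=payload, auth=RPC_AUTH)
    response.raise_for_status()
    result = response.json()
    if result.get("error"):
        raise RuntimeError(result["error"])
    return result["result"]

def fetch_block(height):
    block_hash = rpc_call("getblockhash", [height])
    # verbosity=2 returns the block with its transactions fully decoded
    return rpc_call("getblock", [block_hash, 2])

block = fetch_block(550_000)
print(block["hash"], len(block["tx"]), "transactions")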

New nodes take a few days to sync their chain data from the p2p network up to the tip of the chain, so the first optimisation to reduce the ingestion time was to back up the chain data files from an existing node and bootstrap new nodes using that backup. Thanks to this, we can have new Bitcoin nodes up and running in a matter of seconds.

Data pipeline overview

From there, we devised a data pipeline that allows us not only to load the historical data quickly and efficiently, but also to keep the database up to date with the latest blocks and transactions coming into the chain. The general architecture of our pipeline is shown in the figure above. We split the process into two phases:

Phase 1 — Historical:

  • Load data from the nodes using JSON-RPC calls and send it to a data lake in compressed Avro format.
  • Generate CSVs from the compressed Avro batch files.
  • Ingest the CSVs into Neo4j.

Phase 2 — Ongoing:

  • The JSON-RPC client sends messages to a Kafka topic.
  • Kafka records are consumed and ingested into Neo4j.

In the next sections, we will guide you through a day in the life of loading Bitcoin into Neo4j, with a little more detail on each stage of the data pipeline.

Phase 1

08:00–11:00: Ingesting from RPC into Data Lake

Our day begins like any good day, with a cup of hot Colombian coffee ☕️ and by extracting the data from the nodes. We wrote an application which made JSON-RPC calls to the nodes and sent the retrieved information to a data lake in both Avro and Parquet formats (these also allow us to perform other types of data analysis tasks, but we’ll leave that for another post).

The data lake is our source of truth for chain data. Not only that: having the data conveniently extracted in a format we control saves us from repeating expensive JSON-RPC calls, which carry the overhead of the whole TCP/IP stack. This allows us to more efficiently perform other tasks, like machine learning and other advanced data analytics, and to bulk load the data into other data stores, in this case Neo4j.
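
As a rough illustration of the data lake side, the following sketch appends a batch of extracted blocks to a compressed Avro file using the fastavro library. The library choice and the (heavily simplified) schema are assumptions for the example, not our production schema.

from fastavro import writer, parse_schema

BLOCK_SCHEMA = parse_schema({
    "name": "Block",
    "type": "record",
    "fields": [
        {"name": "hash", "type": "string"},
        {"name": "height", "type": "long"},
        {"name": "time", "type": "long"},
        {"name": "tx_count", "type": "int"},
    ],
})

def write_blocks(path, blocks):
    records = [
        {"hash": b["hash"], "height": b["height"], "time": b["time"], "tx_count": len(b["tx"])}
        for b in blocks
    ]
    with open(path, "wb") as out:
        writer(out, BLOCK_SCHEMA, records, codec="deflate")  # compressed Avro

# reusing fetch_block from the JSON-RPC sketch above
write_blocks("blocks_0000000_0000009.avro", [fetch_block(h) for h in range(10)])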

In this step, we performed another optimisation: the initial extraction from JSON-RPC was parallelised across several instances (10 in our case). Each instance grabbed a batch of blocks and extracted only that batch. When Bitcoin was still in its infancy, blocks didn’t include many transactions, so those blocks are smaller. Because of this, the batches covering the earliest blocks were bigger (that is, they included more blocks).
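
A sketch of what that batching could look like is below. The batch sizes, the 200,000-block cutoff, and the use of a local process pool (rather than separate machines) are made-up values for illustration, reusing fetch_block and write_blocks from the sketches above.

from concurrent.futures import ProcessPoolExecutor

def block_batches(tip_height):
    height = 0
    while height <= tip_height:
        # early blocks are tiny, so each early batch covers many more of them
        size = 50_000 if height < 200_000 else 5_000
        yield height, min(height + size - 1, tip_height)
        height += size

def extract_batch(start, end):
    # each worker extracts only its own slice of the chain and ships it
    # straight to the data lake (write_blocks from the sketch above)
    blocks = [fetch_block(h) for h in range(start, end + 1)]
    write_blocks(f"blocks_{start:07d}_{end:07d}.avro", blocks)

with ProcessPoolExecutor(max_workers=10) as pool:
    for start, end in block_batches(560_000):
        pool.submit(extract_batch, start, end)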

Thanks to this optimisation, we are able to load all the chain data (currently about 560,000 blocks and many, many more transactions, inputs, and outputs) in about 3 hours into the data lake.

11:00–14:00: Generating CSVs from Avro Data Format

For the ingestion of the historical data into Neo4j, we went for the CSV import route using Neo4j’s CSV Import Tool, which is different from the LOAD CSV Cypher command and a lot faster than any other method of ingesting data into Neo4j, because it skips the transactional layer of the database. The caveat of this method is that it can only be used on a new database, which is not an issue for our use case.

This required us to transform the records stored in our data lake into CSV files. We chose to extract from the Avro files not only because they allow for a compact representation of the records, but also because they allow for stream-based, record-by-record access to the data. This enabled us to perform all sorts of optimisations, like loading in parallel (32 threads per extractor) and reading from disk only what is necessary, without overloading memory.
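
Here is a minimal sketch of that transformation for a single Avro file: records are streamed one by one and written into a gzipped CSV part file, with the header kept in its own file. The column names and Avro field names are illustrative, not our exact schema.

import csv
import gzip
from fastavro import reader

def avro_to_csv(avro_path, header_path, data_path):
    # the header lives in its own file, as the import tool allows
    with open(header_path, "w", newline="") as header_file:
        csv.writer(header_file).writerow(["outputId:ID(Output)", "bitcoinValue:double"])

    with open(avro_path, "rb") as avro_file, gzip.open(data_path, "wt", newline="") as data_file:
        out = csv.writer(data_file)
        # fastavro streams record by record, so memory usage stays flat
        for record in reader(avro_file):
            out.writerow([record["output_id"], record["value"]])

avro_to_csv("outputs_000.avro", "outputs-header.csv", "outputs-part-000.csv.gz")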

Using these optimisations, we were able to generate the CSV files for the entire blockchain in about 3 hours. The total size of the generated (compressed) files was ~300 GB.

At this time of the day, you can leave the task running and grab some lunch 🍕

A few things to consider when generating CSVs to be used in the Neo4j Import Tool:

  • Files have to be in a very specific format. If you miss a thing or two, you might not get notified about the problem until after the process has been running for some time.
  • If you are dealing with very large files (as you should be if you are using this tool), we recommend you keep the CSV headers in a separate file.
  • The Neo4j Import Tool can read compressed files. If you are dealing with very large files, you should compress them. Or, as we did, use a compression library in your environment of choice to compress the files as you generate them.
  • We recommend you name your CSV files after the nodes and relationships they will become in your database. This will make the import command a lot easier to understand.
  • As always, read the documentation.

14:00–15:10: Importing CSVs into Neo4j

Once the CSVs are generated and you have had your lunch, it’s time to load the records into Neo4j using its CSV Import Tool. There’s not much we can do to optimise here, as we rely completely on Neo4j’s tool, but we did find that the --ignore-duplicate-nodes option slows down the process massively, as it has to check, for every single node, that it has not already been inserted.

In our case, the addresses CSV file had duplicates (one record for each time an address was seen in a transaction). Once the duplicates are removed using GNU’s sort -u command (a process that takes about 30 minutes), the import can run without --ignore-duplicate-nodes and takes about 40 minutes.
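
For reference, an import invocation could look roughly like the sketch below (shown via Python’s subprocess for consistency with the other examples; in practice it is just a shell command). The flag syntax follows the Neo4j 3.5-era import tool (check the docs for your version), the file names follow the naming convention mentioned above and are illustrative, and --ignore-duplicate-nodes is omitted because the addresses file has already been deduplicated.

import subprocess

subprocess.run([
    "neo4j-admin", "import",
    # node files: separate header file, then one or more compressed data files
    "--nodes:Transaction=transactions-header.csv,transactions.csv.gz",
    "--nodes:Output=outputs-header.csv,outputs.csv.gz",
    "--nodes:Address=addresses-header.csv,addresses-dedup.csv.gz",
    # relationship files, named after the relationship types they become
    "--relationships:LOCKED_BY=locked_by-header.csv,locked_by.csv.gz",
    "--relationships:UNLOCKED_BY=unlocked_by-header.csv,unlocked_by.csv.gz",
], check=True)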

Neo4j’s import tool scales very well when you throw bigger hardware at it. In our case, we had a machine with 32 cores, and the process properly parallelised its workload to use all of them.

Phase 2

15:10 onwards: Ongoing Load From Kafka Streams into Neo4j

The previous steps imported the historical chain data into Neo4j. To keep the database up to date, we modified the JSON-RPC client to also send the data to Kafka streams. We then built a Kafka processor which consumes from the Kafka topics and inserts new records into Neo4j using the Bolt protocol and Cypher MERGE statements.

There is a topic for blocks and another for transactions. Since there is only one block roughly every 10 minutes, inserting blocks into Neo4j isn’t much of an issue.

However, special considerations must be made when processing transactions, as there can be several thousand transactions in a given block (the only limit being the maximum block size). For example, as of today, blocks since block 550,000 have contained on average about 2,000 transactions, each with roughly two inputs and two outputs.

The naïve approach, where every (Bitcoin) transaction, along with its inputs and outputs, is inserted in its own (database) transaction, is not feasible, for a couple of reasons:

  • Creating Neo4j connections and transactions adds a prohibitive overhead when there are many small, atomic inserts.
  • Since we are using MERGE statements, nodes must be searched before being inserted.

For the first issue, we came up with a couple of optimisations. First, we open 4 connections that are used for the lifetime of the application, so they are only opened once and can be used in parallel. Second, we batch every 100 new Bitcoin transactions into a single Neo4j transaction (using prepared statements with parameters), which minimises the overhead of creating and committing a transaction.
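
Putting those optimisations together, the ongoing loader could be sketched like this: consume transaction messages from a Kafka topic, group them into batches of 100, and MERGE each batch into Neo4j in a single transaction over a long-lived Bolt session. The kafka-python library, the topic name, and the exact Cypher and property names are illustrative stand-ins for our actual processor (which also runs several such sessions in parallel over the 4 connections mentioned above).

import json
from kafka import KafkaConsumer   # pip install kafka-python
from neo4j import GraphDatabase   # pip install neo4j

BATCH_CYPHER = """
UNWIND $txs AS tx
MERGE (t:Transaction {hash: tx.hash})
WITH t, tx
UNWIND tx.outputs AS o
MERGE (out:Output {outputId: o.id})
SET out.bitcoinValue = o.value
MERGE (t)-[:OUTPUTS]->(out)
MERGE (addr:Address {address: o.address})
MERGE (out)-[:LOCKED_BY]->(addr)
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))
consumer = KafkaConsumer("btc-transactions",                 # assumed topic name
                         bootstrap_servers="localhost:9092",
                         value_deserializer=lambda m: json.loads(m.decode("utf-8")))

batch = []
with driver.session() as session:
    for message in consumer:
        batch.append(message.value)
        if len(batch) >= 100:
            txs, batch = batch, []
            # one Neo4j transaction per 100 Bitcoin transactions
            session.write_transaction(lambda tx: tx.run(BATCH_CYPHER, txs=txs).consume())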

For the second issue, we created indexes in the database on the properties we were using to match. These indexes are needed anyway for the regular queries we want to perform; you just need to be mindful to create them before starting the ongoing ingestion process.
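
For example, using the driver from the previous sketch, the indexes could be created like this before the consumer loop starts. The Cypher is shown in the Neo4j 3.x syntax current at the time of writing (newer versions use CREATE INDEX FOR (n:Address) ON (n.address)), and the exact set of indexed properties is an assumption.

# driver is the long-lived Bolt connection from the previous sketch
with driver.session() as session:
    for statement in [
        "CREATE INDEX ON :Transaction(hash)",
        "CREATE INDEX ON :Output(outputId)",
        "CREATE INDEX ON :Address(address)",
    ]:
        session.run(statement)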

With these optimisations, we are able to ingest a week’s worth of recent Bitcoin data in about 8 hours. This ensures that the gap left after ingesting the CSV files can be quickly closed.

Our choice of technology also offers other advantages. Thanks to Kafka, we can write the data into any other places we might find necessary in the future and provide WebSocket API access to real-time Bitcoin data. And using Bolt means we could potentially swap in another graph database that supports the protocol, such as Memgraph.

That’s it, we’re done! We’ve loaded Bitcoin into Neo4j in less than a day! Take the rest of the day off and know that your shiny new graph database will be kept up to date. 💆

Closing Thoughts

We’ve shown you how to load Bitcoin into Neo4j in a single day. The resulting database is about 1.3 TB, including data and indexes.

Thinking about the Bitcoin blockchain as a graph has many advantages. For example, just by looking at the image of Bitcoin’s transactions graph above, can you see which output nodes are the UTXOs? That’s right: the red nodes (outputs) that are not connected by an “Unlocked By” relationship to a yellow node (transaction). This can be expressed with a simple Cypher query:

MATCH (o:Output) 
WHERE NOT (o)-[:UNLOCKED_BY]->()
RETURN o

Or if you wanted to find the balance of an address, you could sum up all the Bitcoin values of the outputs locked by this address that have not been unlocked by other transactions. Or in Cypher terms:

MATCH (a:Address {address: "<address>"})<-[:LOCKED_BY]-(o:Output)
WHERE NOT (o)-[:UNLOCKED_BY]->()
RETURN sum(o.bitcoinValue) AS balance

Greg Walker’s GitHub repo has some examples of nice queries you can perform using Cypher. And we haven’t even touched on the advanced network analytics we can perform on Bitcoin’s graph. In a future post, my colleague Simon O’Hanlon will talk in a bit more detail about the kind of analytics the graph enables.

This is the first iteration of our data pipeline, and we are sure we can improve it further to load everything, end to end, in a couple of hours. Please let us know in the comments what you think, and if you have any questions.

A shout out also to my colleague @madewithtea for all his input and collaboration on this project, and to all the people at TokenAnalyst for their help with this article. You can follow me on Twitter @chpanto and visit our website https://www.tokenanalyst.io/ for more of the cool stuff we do. Until next time!

TokenAnalyst builds infrastructure to track granular on-chain data with the goal of deriving fundamental insights to understand crypto-assets. Follow us on Twitter, check out our open-source contributions, and try out our SDK.
