L is for Large

Growing up BitDB in a World of Unlimited Bitcoin

_unwriter
7 min read · Jan 24, 2019

Today I will discuss some of the challenges with scaling BitDB, how they have been and will be tackled, and some behind-the-scenes details on BitDB's design decisions that have never been shared before.

I will also discuss the Block 566476 Incident, which crashed three BitDB nodes (Genesis, Chronos, and Babel) and was directly related to these scaling issues.

Why MongoDB?

There is no right or wrong answer when it comes to choosing a database; anyone who tells you otherwise is an amateur. It completely depends on the use case.

And for the current purpose of BitDB — indexing Bitcoin scripts, graph data, and block data — MongoDB is almost perfect. I decided on MongoDB after actually prototyping BitDB on all kinds of databases, including:

  • Relational databases
  • Key-Value databases
  • Graph databases
  • Other NoSQL databases
  • Elasticsearch-like full text index
  • Experimental decentralized database projects

Some of these have certain benefits over MongoDB but when you average them out, MongoDB shines above all others.

Most important of all is MongoDB's query language: it's written in JSON. JSON is portable, queryable, and validate-able, and since it's the most widely used data exchange format today, the query language inherits all of the benefits JSON itself has.

And this is how Bitquery could happen: a single JSON-based query language that describes a specific state of the blockchain, combining the MongoDB query language with JQ, a Turing-complete query language.

For example, because the query is represented as a JSON object, the following query will FOREVER represent this SPECIFIC state of the Bitcoin SV blockchain, which means you can treat these queries like URIs.
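As a rough, hypothetical illustration (this is not the query from the original embedded example), a Bitquery like the one below will always describe the same question about the Bitcoin SV blockchain, and its base64-encoded form can be appended to a BitDB endpoint URL and shared as a permanent link:

{
  "v": 3,
  "q": {
    "find": { "out.s2": "hello" },
    "limit": 10
  }
}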

That said, we do have some challenges.

MongoDB Limits

Let’s start with the most basic limitations of MongoDB.

There are three important limitations to MongoDB which are relevant for today’s discussion:

  1. The maximum document size is 16MB.
  2. An indexed key can’t exceed 1024 Bytes.
  3. One collection cannot have more than 64 indexed keys.

Keep these in mind as you read along.

The Internals of BitDB Indexing

Currently, the Babel, Chronos, and Genesis BitDBs all add indexes to as many push data fields as they can:

First, the base64 encoded keys:

b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, b10, b11, b12, b13, b14, b15

Next, the UTF8 encoded keys:

s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, s13, s14, s15

And remember constraint #3:

Constraint #3: One collection cannot have more than 64 indexed keys.

Because we also have to index other keys like blk, tx, etc., we can't go all out and keep indexing b16, b17, and beyond. Everything needs to fit within 64 indexed keys, so we only index up to b15 and s15.

Note that this DOES NOT mean “one collection can’t have more than 64 keys”.

It just means the index count can't exceed 64. You can have more than 64 keys and BitDB will store all of them; it simply won't index the push data beyond b15 and s15, so those keys don't get faster queries.

“Not adding an index” simply means you can’t query efficiently using those keys.
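To make the index budget concrete, here is a minimal pymongo sketch of how such indexes could be declared (the database and collection names are placeholders, and this is an illustration rather than BitDB's actual bootstrapping code):

from pymongo import MongoClient, ASCENDING

# Placeholder names; BitDB's real database and collection names may differ.
coll = MongoClient("mongodb://localhost:27017")["bitdb"]["confirmed"]

# Metadata keys that also need indexes (they eat into the 64-index budget).
for key in ["tx.h", "blk.i", "blk.h", "blk.t"]:
    coll.create_index([(key, ASCENDING)])

# Index only the first 16 push data slots, b0..b15 and s0..s15, so that the
# total index count stays under MongoDB's 64-indexes-per-collection cap.
for i in range(16):
    coll.create_index([("out.b%d" % i, ASCENDING)])
    coll.create_index([("out.s%d" % i, ASCENDING)])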

Querying with something like the following example will be lightning quick because it queries with an indexed key (out.s2):

{
  "v": 3,
  "q": {
    "find": { "out.s2": "hello" }
  }
}

But the following query will be slow because out.s42 is not indexed:

{
  "v": 3,
  "q": {
    "find": { "out.s42": "hello" }
  }
}

Scaling Challenges

Now that we’ve discussed the internals of BitDB, let’s look at some of the scaling challenges with BitDB.

First, constraint #2:

Constraint #2: An indexed key can’t exceed 1024 Bytes.

This means that if you try to insert a document where an indexed key's value is larger than 1024 bytes, the insert will fail.
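Here is a minimal sketch that reproduces the failure with pymongo (assuming MongoDB 4.0 or older, where the default failIndexKeyTooLong behavior rejects oversized index keys; the collection name is a placeholder):

from pymongo import MongoClient, ASCENDING
from pymongo.errors import OperationFailure

coll = MongoClient()["test"]["docs"]
coll.create_index([("out.s1", ASCENDING)])  # out.s1 is an indexed key

try:
    # A ~17KB string in an indexed key blows past the 1024-byte index key limit.
    coll.insert_one({"out": [{"s1": "x" * 17000}]})
except OperationFailure as err:
    print("insert rejected:", err)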

And THIS was exactly what happened with all BitDB nodes today at block 566476.

Because one push data contained far more than 1024 bytes (an entire Alice in Wonderland book), that document failed to insert.

17KB would definitely not fit in a 1KB key.

There needed to be a way to get around this 1024 bytes constraint.

Problem

Here’s what the previous BitDB schema for Genesis, Chronos, and Babel looked like:
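The original post embedded a screenshot of a document here. As a simplified, hypothetical stand-in, an OP_RETURN output under that schema looked roughly like this, with every push data small enough to index:

{
  "i": 0,
  "b0": { "op": 106 },
  "b1": "aGVsbG8=",
  "s1": "hello",
  "b2": "d29ybGQ=",
  "s2": "world"
}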

As you can see, each push data (b0, b1, …, s1, s2, …) is at most a few bytes.

And this is why it used to just work out of the box, with indexes added to all of these fields: they never exceeded 1024 bytes.

But now that Bitcoin SV carries larger push data that cannot be indexed, we have a problem.

Since the problem is that we can't index super large keys, the solution is to store them but NOT index them. And it needs to stay compatible with the existing schema and Bitquery. How do we do this?

To solve this, we create a new category of attributes.

“L” is for Large

The new “L” class attributes are:

  1. Reserved for large push data (larger than 512 bytes).
  2. NOT indexed.
  3. Yet still stored in the DB.
  4. Never co-exist with the regular attribute of the same index. For example, you can have either lb1 or b1, but not both.

So, here are the L versions of each attribute:

  • For b0, b1, b2, b3, … we have lb0, lb1, lb2, lb3, …
  • For s0, s1, s2, s3, … we have ls0, ls1, ls2, ls3, …

Think of it as sort of a “BIGINT” concept in programming. Here are the rules:

  1. When a push data is detected as being larger than 512 bytes, it gets stored with the L prefix.
  2. In that case, the push data is ONLY stored as the L-prefixed attribute, not as the original one.
  3. But if it's smaller (regular), then it's business as usual: no L prefixes.

For example, you can see below that the output:

  • contains lb1 and ls1
  • but does NOT contain b1 and s1.
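Here is a simplified, hypothetical stand-in for the screenshot that was embedded here (values truncated):

{
  "i": 0,
  "b0": { "op": 106 },
  "lb1": "PGh0bWw+PGJvZHk+...",
  "ls1": "<html><body>...",
  "b2": "aGVsbG8=",
  "s2": "hello"
}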

So you will only see either lb1 or b1, but not both, in the same transaction object.

The rationale is as follows:

  1. If the push data is that large, chances are it's only used for Reading, not Querying. You would most likely be saving a file or a large blob and don't care about querying by that field.
  2. The best way to query these large push data is NOT to query by the large key itself, but to add another unique identifier push data and query with THAT field (see the sketch right after this list).
  3. The benefit is that ALL L-prefixed attributes are stored but not indexed, which means we can still insert these huge push data transactions without the database throwing the 1024-byte limit error.
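As a sketch of point #2 (the prefix and identifier values here are purely hypothetical), you keep a small identifier push data indexed and query by that, while the giant payload rides along in an L-prefixed field such as out.ls2 that you simply read from the result:

{
  "v": 3,
  "q": {
    "find": { "out.s1": "my-app-prefix", "out.s3": "file-id-1234" },
    "limit": 1
  }
}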

You can check out an example query here:

https://babel.bitdb.network/query/1DHDifPvtPgKFPZMRSxmVHhiPvFmxZwbfh/ewogICJ2IjogMywKICAicSI6IHsgImZpbmQiOiB7ICJ0eC5oIjogImVmMjFlNzFkMDBiOWZjZTE3NDIyMmU2Nzk2NDBiMDllMjlhYzhhNTVmMzIxYzkzZTY0YjE2Y2MzMTA5OTU5ZjgiIH0sICJsaW1pdCI6IDEgfQp9
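For reference, the base64 portion of that URL decodes to the following Bitquery, a find on a specific transaction hash:

{
  "v": 3,
  "q": { "find": { "tx.h": "ef21e71d00b9fce174222e679640b09e29ac8a55f321c93e64b16cc3109959f8" }, "limit": 1 }
}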

Caveat

Currently, full text search is not yet supported for L-prefixed attributes, but this will be worked out soon.

Demo

Thanks to these updates we now have:

A website in a single OP_RETURN transaction output:

Even better, a website in ONE Push Data:

A Media Viewer for Single Push Data OP_RETURN transactions

OP_PUSHDATA4

Before we finish, there's one last thing we haven't discussed yet: the hard 16MB MongoDB cap on each database document.

Constraint #1: The maximum document size is 16MB.

Wait a minute, so we can only store up to 16MB transactions? What happened to storing gigabytes of data? What about 4.3GB OP_PUSHDATA4?

Well, we can use the same trick we used with the L-prefixed attributes discussed in this article. We can add another class of attribute (maybe XL?) to indicate a transaction that:

  1. Will be stored on the host machine
  2. But NOT indexed as a key
  3. And lastly, requires a larger-than-16MB storage

So how do we store transactions larger than 16MB as a whole?

We can use MongoDB's native GridFS, which lets you store more than 16MB while still treating the data like database entries instead of files in a separate file system. This is the most obvious solution right now, and it's one of the reasons I chose MongoDB: GridFS is built into the database seamlessly, so out of the box we can minimize the problems that arise from keeping a database and a file system in sync.
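As a minimal sketch of the idea using pymongo's gridfs module (the names are placeholders; this is not how BitDB is implemented today):

import gridfs
from pymongo import MongoClient

# Placeholder database name and GridFS collection prefix.
db = MongoClient()["bitdb"]
fs = gridfs.GridFS(db, collection="xl")

# Stand-in for a raw transaction too large for a single 16MB document.
raw_tx = b"\x00" * (20 * 1024 * 1024)

# GridFS splits the payload into chunks (255KB each by default) across many
# documents, so the 16MB per-document cap no longer applies.
file_id = fs.put(raw_tx, filename="some-tx-hash")

# Read it back as a single blob.
restored = fs.get(file_id).read()
assert restored == raw_tx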

That said, there are other options that will be explored, which I won't discuss in this article yet. GridFS is the safest low-hanging fruit at the moment, and it's something that isn't easy to get in other types of databases.

Thankfully we still have some time as miners are not yet mining transactions as large as 16MB!

Anyway, what’s important is that this problem DOES have a solution, and it’s just a matter of implementing the best option.

So, don’t worry about the future of BitDB.

It will scale Infinitely along with Bitcoin itself.
