Bringing blockchain properties to centralized government databases

Andrew Dryga
Published in Nebo #15 · Jul 27, 2017 · 6 min read

Right now we are building a large healthcare project for our government. It involves lots of interesting tasks, and I want to share our ideas on one of them: making it cryptographically infeasible to alter records in a database, even for someone with full access to the system.

Why does this matter? First of all, we want to make sure that it is not possible to remove evidence of a medical treatment that led to a patient injury whenever such a case is under investigation. Second, our government is going to reimburse all services recorded in this database, and we don't want to leave a way to abuse the system and receive money that was not fairly earned.

For now this is an RFC and we continue to look for the best implementation options, so comments and suggestions of any kind are very welcome.

Nowadays, when you want to solve a "trust" problem, there is a widespread buzzword offered as the solution: blockchain. Full adoption of blockchain principles would require us to decentralize our database and pick a consensus algorithm that best fits our use case (the common ones are proof of work and proof of stake, and there are many more theoretical ones, e.g. proof of elapsed time).

But when you work for the government, there are limitations that won't allow you to go down this path:

  • This database stores extremely sensitive data (e.g. patients and prescriptions, country-wide) which, for both ethical and legal reasons, cannot be shared with third parties;
  • You can distribute encrypted data, but in the long term any kind of encryption can be broken: implementation bugs are found pretty often and quantum computing is knocking on the door. We could expose lots of sensitive data, and people would not be happy about that;
  • All proven consensus algorithms would require us to burn additional CPU time, skyrocketing the infrastructure costs of a project with a limited budget.

How can we get blockchain properties with an off-the-shelf database?

This is possible by leveraging one of our project requirements: everything submitted to the database is signed with the private key of a physician. Private keys are distributed by accredited certification centers, licensed third parties that are allowed to issue keys whose signatures are considered legally equal to handwritten ones in Ukraine.

The actual stored data can look like this:

         Column          |  Type  | Modifiers
-------------------------+--------+-----------
 master_patient_index_id | uuid   | not null
 patient_first_name      | string | not null
 patient_last_name       | string | not null
 patient_passport_number | string | not null
 ...
 physician_personnel_id  | uuid   | not null
 ...
 previous_block_hash     | string | not null
 physician_signature     | binary | not null

physician_signature is a cryptographic signature over all the columns of a row (the row is the actual document) that were added on behalf of the physician.
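To make this concrete, here is a minimal Elixir sketch of signing and verifying a row with Erlang's built-in :public_key module. The canonical_payload/1 helper and the exact field list are illustrative assumptions, not the production code:

defmodule Records.Signature do
  # Concatenate the signed columns in a fixed order so the signer and
  # the verifier hash exactly the same bytes. The field list here is
  # illustrative.
  def canonical_payload(record) do
    Enum.join(
      [
        record.master_patient_index_id,
        record.patient_first_name,
        record.patient_last_name,
        record.patient_passport_number,
        record.physician_personnel_id,
        record.previous_block_hash
      ],
      "|"
    )
  end

  # Sign the payload with the physician's private key (an Erlang
  # :public_key key record, e.g. loaded from a PEM file).
  def sign(record, private_key) do
    :public_key.sign(canonical_payload(record), :sha256, private_key)
  end

  # Verify the stored signature against the physician's public key.
  def valid?(record, public_key) do
    :public_key.verify(
      canonical_payload(record),
      :sha256,
      record.physician_signature,
      public_key
    )
  end
end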

previous_block_hash is the Merkle tree root of the previous block, or a genesis hash if we have just started the system. Blocks can be created on any fixed timeout, with a tradeoff: all data within the latest block can still be altered, so we don't want blocks to be too big; but we also cannot accept new records while a new block is being created, so we can't generate blocks too often.

Hash chain in the database
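For illustration, a block's Merkle root can be computed by pairwise hashing the record hashes level by level. This is a minimal sketch; duplicating the last hash on odd-sized levels is one common convention, not necessarily the one we will end up with:

defmodule Blocks.MerkleRoot do
  # Compute the Merkle root of a non-empty list of record hashes.
  def compute([]), do: raise(ArgumentError, "a block must contain at least one record")
  def compute(record_hashes), do: reduce(record_hashes)

  defp reduce([root]), do: root

  defp reduce(hashes) do
    hashes
    # Duplicate the last hash when the level has an odd number of nodes.
    |> pad_to_even()
    |> Enum.chunk_every(2)
    |> Enum.map(fn [left, right] -> :crypto.hash(:sha256, left <> right) end)
    |> reduce()
  end

  defp pad_to_even(hashes) do
    if rem(length(hashes), 2) == 1, do: hashes ++ [List.last(hashes)], else: hashes
  end
end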

Let's go through a few cases to make sure this would actually work.

Changes to transactions that are not grouped into a block yet

These records are impossible to change unless you acquire a physician's private key. However, you can delete a record, and it would not be easy to notice that it ever existed.

This is similar to transactions in Bitcoin: they are not guaranteed to be mined, and some merchants even wait for a few additional blocks on the chain before counting your payment as succeeded.

We are okay with that; this issue can be mitigated by very common security measures: access management and an audit trail. We might even use this property to give our clients a "rollback within a day" feature.

Changes to transactions that are in the latest block

Since every newly arriving transaction includes the previous block hash in its signed content, it is not possible to make any changes without corrupting the hash chain.

If you really want to do that, you need either to erase the fact that the block ever existed along with all records inserted after its creation, or to acquire the private key of every single physician who submitted data afterwards, so you can update the previous block hash column for those records too (or hack all the certification centers that issued those keys).
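Detecting such tampering is a straightforward walk over the chain. A sketch, assuming blocks come ordered from oldest to newest and carry hypothetical id, merkle_root, and records fields (each record holding its own hash):

defmodule Blocks.IntegrityCheck do
  # Walk the chain from the genesis hash, checking that every record
  # points at the previous block and that each stored Merkle root
  # matches the one recomputed from the records.
  def run(blocks, genesis_hash) do
    result =
      Enum.reduce_while(blocks, genesis_hash, fn block, expected_prev_hash ->
        cond do
          Enum.any?(block.records, &(&1.previous_block_hash != expected_prev_hash)) ->
            {:halt, {:error, {:broken_link, block.id}}}

          Blocks.MerkleRoot.compute(Enum.map(block.records, & &1.hash)) != block.merkle_root ->
            {:halt, {:error, {:corrupted_block, block.id}}}

          true ->
            {:cont, block.merkle_root}
        end
      end)

    case result do
      {:error, reason} -> {:error, reason}
      _latest_hash -> :ok
    end
  end
end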

When time starts to play on our side

The total number of transactions in newer blocks works like the "score" property of a blockchain; in Bitcoin, this would be the total amount of hashing power on the chain. As this number grows, so does the confidence that data cannot be changed without being noticed.

Even better, we could publicly share the list of known blocks on a website or in an Ethereum smart contract, so that even if we fail in our responsibility to notice a missing block, there are plenty of other people who would do it for us.

What are the implementation options?

There are four options that come to mind:

  • Pick a database that already provides these properties (anyone?);
  • A native DB extension that enforces the constraints when data is inserted, and on demand;
  • A database adapter for Ecto that encapsulates the additional logic (this may be a good option if we end up using some rare database with built-in encryption which we can't afford to modify);
  • A custom Elixir application that fetches data from the database and enforces all this logic.

Unless the first option actually exists in some database, the second one looks the most efficient, since there would be no need to query large amounts of data every time we do an insert.
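For the Ecto-based options, the check could live in a changeset, so a row with an invalid signature never reaches the database. A rough sketch, where Physicians.public_key!/1 and the Records.Signature module from the sketch above are assumptions:

defmodule Declarations do
  import Ecto.Changeset

  @signed_fields [:master_patient_index_id, :patient_first_name,
                  :patient_last_name, :patient_passport_number,
                  :physician_personnel_id, :previous_block_hash]

  # Reject inserts whose signature does not verify against the
  # physician's public key.
  def changeset(declaration, params) do
    declaration
    |> cast(params, @signed_fields ++ [:physician_signature])
    |> validate_required(@signed_fields ++ [:physician_signature])
    |> validate_signature()
  end

  defp validate_signature(%{valid?: false} = changeset), do: changeset

  defp validate_signature(changeset) do
    record = apply_changes(changeset)
    public_key = Physicians.public_key!(record.physician_personnel_id)

    if Records.Signature.valid?(record, public_key) do
      changeset
    else
      add_error(changeset, :physician_signature, "does not match the signed columns")
    end
  end
end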

Which database to pick if you want to implement this logic?

In my opinion, databases inspired by the Amazon Dynamo white paper (Cassandra, Riak, and many more) would be the best fit for our needs. When the data is not tightly coupled, simple K/V lookups combined with map/reduce for block generation and reports are more than enough, and such databases scale pretty well.

Visual representation of a DHT ring, from the official Riak documentation

There are also a few solutions on top of PostgreSQL that make sharding easy (CitusDB and Postgres-XL). However, I do not have an in-depth understanding of their internals; presumably they add a query-planning layer that aggregates results after querying all the shards.

But for a while we will fit on a single PostgreSQL instance, and since we don't have to deal with scale yet, we use an HA PostgreSQL deployment as a battle-tested tool we have experience with.

Consistency models and read availability in NoSQL databases

We also want to be highly available to all the medical service providers that send data to us. Let's start with some theory.

The CAP theorem states that, in the presence of network partitions, no system can be both consistent (CP) and highly available (AP).

The CAP theorem

Some people may say that there are CA systems; they are wrong. Partitions do happen in real life, and you can't do anything about that.

Databases built on top of DHT rings tend to allow customizing the NWR values for different types of queries, where N is the data replication factor (how many nodes store a copy of a single document), W is the number of nodes that must confirm a write for the overall transaction to be completed, and R is the number of nodes we read from.

With these values enforced at the table level, we can achieve the best consistency/availability tradeoff: strong consistency where our system needs it and high availability for API consumers.

This can be done by writing transactions with eventual consistency; writing the previous block hash with strong consistency (W=ALL, since reading with R=ALL would hurt availability for new transactions); and building blocks with strong read consistency (a map/reduce with R=ALL).
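As an illustration of those per-query tunables (the KV module and option names below are hypothetical, loosely modeled on Riak's n_val/w/r settings):

# New transactions: cheap, eventually consistent writes keep the API
# highly available for medical service providers.
:ok = KV.put("transactions", transaction_id, signed_record, w: 1)

# The previous block hash: every replica must acknowledge the write
# (W = ALL), so even a single-replica read sees the latest block and
# read availability does not suffer.
:ok = KV.put("blocks", block_number, merkle_root, w: :all)

# Block building: a strongly consistent map/reduce over the records
# (R = ALL) before computing the Merkle root.
{:ok, records} = KV.map_reduce("transactions", since: last_block_at, r: :all)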

Open questions

  • What is the right procedure when an integrity check fails? A possible option: restore the corrupted block from a backup, re-run the integrity check, and then cancel everything since the last valid transaction.
  • What to do with newly arriving transactions that are linked to a block which is no longer the latest? A possible option: decline them and require the user to send a new one linked to the latest block.
