Discovering Non-Deterministic Behavior in Provenance Blockchain and Cosmos SDK
One of the core tenants of blockchain is the ability to trustlessly verify the state machine from genesis, arriving at the same results as the network itself. Read about what happened when our team discovered strange behavior on our public testnet.
Blockchains and Determinism
One of the core tenants of blockchain technologies is the ability to trustlessly verify the state machine from genesis, arriving at the same results as the network itself. To do this, the transactions must be deterministic. Anything preventing a clean replay of state will break this rule, and thus break the trust of the correctness of the data and network. Blockchains are designed with this in mind, and issue halts if too much of the network disagrees on the state of a block. Thru jailing, penalties, and required fees to rejoin, the networks police the validators to assure correctness.
Testnet Oddities
Recently, after a Cosmos SDK upgrade, the Provenance Blockchain team began noticing strange behavior in our public testnet. Any nodes being brought online with a current data snapshot, as well as the validator network, were all progressing forward as intended. However, nodes brought up from older data snapshots or initialized from genesis were failing to sync with an app hash mismatch error. App hash mismatches in the Tendermint state machine happen when the current running node calculates a different hash for the current block and transactions than what was agreed upon in the validator network. More often than not, we have encountered this error when a user or organization runs the incorrect version of the Provenance Blockchain node binary at the wrong height. However, in this case, the nodes were running the exact same version as the testnet validators. There should have been no reason for an app hash mismatch error, yet the problem persisted after multiple attempts to restart nodes from different data backups. On my local testnet node, this kept occurring at block 3639413. So thus, the digging began.
Give Me the Data
The first step was to fetch the good block data from the testnet. Using our trusty Provenance Blockchain Explorer, I was able to pull down the data that the network had successfully committed at that height. Within that block there was only one message type, an authz grant for a MarkerTransferAuthorization
:
// Removed standard signature and header data for brevity.
// -- snip --
{
"authorization": {
"@type": "/provenance.marker.v1.MarkerTransferAuthorization",
"transferLimit": [
{
"denom": "figure-v2.psa.stock",
"amount": "333"
}
]
},
"expiration": "2021-10-05T20:45:47.233020Z"
}
// -- snip --
The MarkerTransferAuthorization
message is located in the Provenance Blockchain Marker module. Reviewing the handlers and keepers did not reveal anything particularly compelling, as seen below.
func (k Keeper) authzHandler(ctx sdk.Context, admin sdk.AccAddress, from sdk.AccAddress, amount sdk.Coin) error {
markerAuth := types.MarkerTransferAuthorization{}
authorization, expireTime := k.authzKeeper.GetCleanAuthorization(ctx, admin, from, markerAuth.MsgTypeURL())
if authorization == nil {
return fmt.Errorf("%s account has not been granted authority to withdraw from %s account", admin, from)
}
accept, err := authorization.Accept(ctx, &types.MsgTransferRequest{Amount: amount})
if err != nil {
return err
}
if accept.Accept {
limitLeft, _ := authorization.(*types.MarkerTransferAuthorization).DecreaseTransferLimit(amount)
if limitLeft.IsZero() {
return k.authzKeeper.DeleteGrant(ctx, admin, from, markerAuth.MsgTypeURL())
}
return k.authzKeeper.SaveGrant(ctx, admin, from, &types.MarkerTransferAuthorization{TransferLimit: limitLeft}, expireTime)
}
return fmt.Errorf("authorization was not accepted for %s", admin)
}
Everything looked like standard run-of-the-mill authz keeper calls. One small thing noted was that GetCleanAuthorization
issues a delete on expired grant (to proactively prune the grants list), and the marker module does the same in the event of limit being used up; but within the scope of a blockchain transaction, both are perfectly acceptable behavior. So my eyes were turned towards the IAVL stores for the affected modules (bank, authz, auth, marker).
Time to Climb the (IAVL) Tree
The IAVL tree is a versioned immutable AVL Tree used within the Cosmos SDK to house the state and application data as the chain progresses forward. Each module is allowed to write to their view of the tree using keepers to store data during transaction execution. More info about keepers can be found in the Cosmos SDK docs located here.
The good folks over at the Cosmos SDK team supply a handy tool called iaviewer to inspect the state machine data stores and the current tree shape. Using this tool, we can peer into the state at any given block height for a network, provided we have the node’s data directory. Using a known-good data directory dump from one of our in-sync sentinel nodes, and the bad data directory dump from my halted local testnet node, I started to dig through the two trees. I put together a small bash script to quickly generate exports and diffs to compare the good and bad states.
#!/usr/bin/env bashout=./diffs[ ! -d "$out" ] && mkdir -p "$out"block=3639413for db in application state; do
for typ in shape data; do
for module in authz auth bank marker; do
echo "Processing db $db for key k:$module for $typ"
outfile="$out/$typ-$module-$db"
goodDb="chains/goodnode/data/$db.db"
badDb="chains/badnode/data/$db.db" iaviewer $typ $goodDb s/k:$module/ $block > $outfile.good
if [ "$?" != "0" ]; then
echo "Failed to run iaviewer $typ $goodDb $module"
fi iaviewer $typ $badDb s/k:$module/ $block > $outfile.bad
if [ "$?" != "0" ]; then
echo "Failed to run iaviewer $typ $badDb $module"
fi diff -u $outfile.good $outfile.bad > $outfile.diff
done
done
done
It’s not much but it got the job done. From there, I verified all the changes that occurred on the good node but not the bad node using find ./diffs/ -not -empty -name \*.diff -ls
to list the non-empty diff files and ignore the rest. This block showed only two modules with differing data between the nodes, authz and bank.
521905 4 -rw-rw-r-- 1 phil phil 369 Sep 28 21:24 diffs/shape-authz-application.diff
527480 952 -rw-rw-r-- 1 phil phil 971838 Sep 28 21:24 diffs/data-bank-application.diff
525838 1264 -rw-rw-r-- 1 phil phil 1291038 Sep 28 21:24 diffs/shape-bank-application.diff
522956 4 -rw-rw-r-- 1 phil phil 646 Sep 28 21:24 diffs/data-authz-application.diff
Every block containing transactions should have movement in the bank module, since every transaction requires gas payment, and thus rewards to the validators. However, since the focus was more on the auth and authz module issues, the diffs for the authz module were reviewed.
Shapes and States
Since the IAVL trees are, at their core, rebalancing binary trees, the shape should change during each addition since a new node is added. I verified this using the diff between the good node and bad node’s shape on this block.
$ cat diffs/shape-authz-application.diff
--- ./diffs/shape-authz-application.good 2021-09-28 21:24:46.733497072 +0000
+++ ./diffs/shape-authz-application.bad 2021-09-28 21:24:47.129527545 +0000
@@ -1,2 +1,2 @@
Got version: 3639413
-*0 011464F28E2E365D68C47A693E010D2AB28F7B75C1E7146A3C0C5392B47E08F0B5E50DB7C20C86BA018D462F70726F76656E616E63652E6D61726B65722E76312E4D73675472616E7366657252657175657374
+<nil>
This diff proves that the good node did successfully write out the authz grant to the state store, and the bad node did not. If you decode the hex value, you can see the protobuf byte data that was written out at node 0 into the tree for the above noted MarkerTransferAuthorization
.
▷▷ echo "011464F28E2E365D68C47A693E010D2AB28F7B75C1E7146A3C0C5392B47E08F0B5E50DB7C20C86BA018D462F70726F76656E616E63652E6D61726B65722E76312E4D73675472616E7366657252657175657374" | unhex
*��{u��j<>
�� S����
���F/provenance.marker.v1.MsgTransferRequest%
The good node’s data gained the marker transfer grant (as seen in the transaction from the Provenance blockchain Explorer), but the bad node’s data did not have this grant. Again, we can verify from the data the difference in the authz module state store as well.
$ cat diffs/data-authz-application.diff
--- ./diffs/data-authz-application.good 2021-09-28 21:24:51.789886168 +0000
+++ ./diffs/data-authz-application.bad 2021-09-28 21:24:52.185916644 +0000
@@ -1,6 +1,4 @@
Got version: 3639413
Printing all keys with hashed values (to detect diff)
- 011464F28E2E365D68C47A693E010D2AB28F7B75C1E7146A3C0C5392B47E08F0B5E50DB7C20C86BA018D462F70726F76656E616E63652E6D61726B65722E76312E4D73675472616E7366657252657175657374
- 61848FAB9772C1FEDCAB2E38B8E962553D427EA5B1CC960457D55E7F06AE2989
-Hash: 46D6028D29055581097F24BAB92F456B69B5765A3E4214247DF2A97192EA73D4
-Size: 1
+Hash: E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855
+Size: 0
You can see the same MarkerTransferAuthorization
binary protobuf data in the data diff as noted in the shape diff, however, the total calculated sizes and hashes for these two nodes can also be seen in the data diff.
Into Validation We Go
I now had proof that the authz grant was somehow in the up-to-date nodes, but wasn’t making it to the fast-syncing nodes. A coworker, Danny Wedul, had brought up how we hadn’t looked into checkTx
yet, instead focusing on the data and our own module changes and updates. After some back and forth, we both stepped into the rabbit hole of Cosmos SDK message validations.
In the Tendermint ABCI world, the flow of a block incoming passes thru checkTx() for every replay during fast sync. Within Cosmos SDK, checkTx
then iterates over every message in each block transaction and calls Message.ValidateBasic()
. This validation has the expectation that all messages are to be validated via deterministic methods and calculations. How can authz.MsgGrant
break that?
Inside the authz.MsgGrant struct, there is an authz.Grant
field.
// ValidateBasic implements Msg
func (msg MsgGrant) ValidateBasic() error {
granter, err := sdk.AccAddressFromBech32(msg.Granter)
if err != nil {
return sdkerrors.Wrap(sdkerrors.ErrInvalidAddress, "invalid granter address")
}
grantee, err := sdk.AccAddressFromBech32(msg.Grantee)
if err != nil {
return sdkerrors.Wrap(sdkerrors.ErrInvalidAddress, "invalid granter address")
} if granter.Equals(grantee) {
return sdkerrors.Wrap(sdkerrors.ErrInvalidRequest, "granter and grantee cannot be same")
}
return msg.Grant.ValidateBasic()
}
At the end of the message validation, it delegates out to the Grant
field’s validation.
func (g Grant) ValidateBasic() error {
if g.Expiration.Unix() < time.Now().Unix() {
return sdkerrors.Wrap(ErrInvalidExpirationTime, "Time can't be in the past")
} av := g.Authorization.GetCachedValue()
a, ok := av.(Authorization)
if !ok {
return sdkerrors.Wrapf(sdkerrors.ErrInvalidType, "expected %T, got %T", (Authorization)(nil), av)
}
return a.ValidateBasic()
}
Within the Grant
field’s validation, there is a call to time.Now(). A constantly shifting vector such as the current clock time is inherently non-deterministic.
The specific validation being the if g.Expiration.Unix() < time.Now().Unix() { ... }
check. This assertion passes for the up-to-date nodes since time.Now()
is within the grant expiration window at the time of block minting and validation across the network. However, while replaying the state data in the future, this conditional fails since the time has shifted into the future and the expiration for the grant now exists before Now()
. We had found our point of non-determinism.
What could possibly go wrong?
So, why is this bad after all? Couldn’t we continue to simply use data directory dumps from the in-sync network?
No. A time skew and maliciously crafted payload could cause a validator network halt if the attack were timed correctly, such as during an upgrade where the validators are offline and begin minting and replaying at different times. Another downside would be any node attempting to sync from genesis will fail, breaking some of the main tenants of blockchain technology (trustless, verifiable, independent).
Disclosure
Once the issue was discovered, our team discretely made contact with the Cosmos SDK team to communicate the findings, hold for their fix, and allow time for other networks to upgrade to mitigate the problem before public disclosure occurred.
PHIL STORY
Phil is a software engineer from Montana working on the Provenance Blockchain ecosystem. He currently drives the R&D group, developing next generation concepts and technologies on top of the protocol. Outside of work, he enjoys the outdoors, gaming, and spending time with his family.