Order-Statistics-Based Hash Algorithm

(Note that the article is open for discussion, and some results are preliminary)

Motivation

The mining of Bitcoin is dominated by ASIC miners because the efficiency of ASIC (in terms of joule per hash) can be 1000x higher than CPU. This creates the concern of centralization because the manufacturers of the high-performance ASIC are concentrated to a few ones. Several algorithms are proposed to lower the efficiency of ASIC mining including Equihash, CyptoNight, and Ethash. Unfortunately, many of these algorithms have some ASIC mining machines, which claims to achieve various mining efficiency gains (See https://github.com/ifdefelse/ProgPOW for a good summary).

Among these algorithms, Ethash is probably the one with the least potential ASIC efficiency gain. The core idea of Ethash is to perform memory intensive operations instead of computationally intensive operations so that the memory read performance is the bottleneck of the hash algorithm. Giving the assumption that the memory read speed can hardly be accelerated by customized hardware, the performance gain of ASIC on Ethash should be limited.

Idea of Order-Statistics-Based Hash Algorithm

Motivated by Ethash, we propose an algorithm that aims to lower the ASIC acceleration from another aspect — by lower the benefits of parallelization offered by ASIC:

  • A set of fix instructions can be broken down into a circuit pipeline so that for every clock cycle (sometimes multiple), the ASIC can compute the instructions for multiple inputs of the hashes. For example, a + b + c + d can be pipelined so that for every cycle, 3 different inputs can be computed simultaneously: 1, a0 + b0; 2, b1 + c1; 3, c2+d2
  • Multiple circuit logic can be built into ASIC to calculate the instructions concurrently. For example, a + b + c + d can be implemented as (a + b) + (c + d) so that the computation can be done in 2 cycles.

Currently, the pipeline idea has been also widely used in modern processors such as x86, which have pipelined microprocessor with branch predictor[2]. One way to break pipeline is to have multiple if-then-else and then execute different branches with different code paths so that pipeline and branch prediction can hardly work.

To break the concurrency of the execution, we could rely on state dependency — any future instructions have relied on the current state, which can be changed frequently. The mean that we cannot perform any early execution of future instructions.

Order-Statistics-Based Hash Algorithm

In the section, we will introduce Order-Statistics-Based Hash Algorithm, which tries to break the pipeline and makes the code execution path to be more random. Before introducing the proposed algorithm, let us revisit Ethash core algorithm to generate a hash:

Input: 
- state: 128-byte state
- datablock: an array of large amount of data, each data is 64 bytes
- H(x, y): a fast hash algorithm, x and y has the same size, return the hash value with the same size as x
- R(x): return an 32-bit random integer derived from x
Algorithm:
for i in range(64):
p = R(state) % (len(datablock) - 1)
newdata = [datablock[p], datablock[p + 1]]
state = H(state, newdata)
return state

The draft of the proposed algorithm is

Input: 
- state: 128-byte state
- datablock: an long array with each entry being 8 bytes
- H(x, y): a fast hash algorithm, x and y has the same size, return the hash value with the same size as x
- R(x): return an 64-bit random integer derived from x
Algorithm:
for i in range(64):
p = R(state) % len(datablock)
newdata = []
for j in range(128 / 8):
newdata = newdata.add(datablock.find_by_order(p))
         # Remove the pth smallest element from datablock
datablock.remove_by_order(p)
         # Add a random data to the datablock, e.g.,
# datablock.insert(R([newdata[end]]))
         # Find the next index, e.g.,
# p = R([state, p]) % len(datablock)
state = H(state, newdata)
return state

The key differences of the proposed algorithm vs Ethash are

  • Instead of looking for the data with a random index p, the algorithm looks for the data with the pth smallest value.
  • After accessing one entry in the datablock, the entry will be removed and a new random entry will be inserted into the datablock.

Since datablock is a dynamic list with ordered data lookups, an efficient implementation of datablock is a dynamic search tree (DST) with ordered statistics (e.g., AVL tree, red-black tree, B+ tree). A tree delete/insert operations are unfriendly for the pipeline as the execution path is highly input-dependent and is uncertain for random inputs.

Performance Comparison of CPU vs. FPGA

We compare the performance of insert/delete of DST of CPU and FPGA [3]implementations. The code of CPU can be found here. For CPU implementation, we use the following configurations:

  • CPU: Intel i7-7700K
  • OS: Ubuntu 16.04 LTS
  • Compiler: g++ 5.4.0
  • Compilation command: g++ -O3 -std=gnu++17
  • Number of threads: 1
  • Number of keys: 64K
  • Key type: unsigned 64-bit random integers

Performance numbers:

  • FPGA: 3.97 million inserts/deletes per sec
  • CPU: 4.46 million inserts/deletes per sec

A couple of comments are of interests:

  • Compared to FPGA search performance (242 million DST searches per sec), where every search can be done in one clock cycle because of pipeline, the performance of insert/delete operations are much lower because more cycles are required for per insert/delete operations.
  • In addition, the FPGA performance is based on Virtex 5 LX330 FPGA, which may be out-dated. With the latest FPGA, the performance could be greater.
  • The performance of the CPU implementation is single thread/single core, which could be greater if multiple threads/cores are used.
  • The key size of the CPU performance is 64-bit, which is larger than 32-bit of the FPGA version.

References:

  1. https://github.com/ifdefelse/ProgPOW
  2. Branch preditor, Wikipedia, https://en.wikipedia.org/wiki/Branch_predictor
  3. Yang, Y-H. E. and Prasanna, V. K., High Throughput and Large Capacity Pipelined Dynamic Search Tree on FPGA, 18th Annual ACM/SIGDA Int. Symp. on Field Programmable Gate Arrays, 2010