Since ProgPoW gained some mainstream press, we’ve started to receive a number of questions about it. We’d like to address a few of the most common.
Q. What’s your stance on Ethereum governance questions?
We don’t have one. We think it’s best left up to the community to answer questions such as whether, and when, ProgPoW should be adopted. We’ve proposed an algorithm and are happy to answer technical questions about it.
Q. Where did ProgPoW come from?
IfDefElse is a small team that analyzes and optimizes PoW algorithms. We have seen the community repeatedly ask for a PoW algorithm where specialized ASICs have a minimal advantage over commodity hardware available to everyone. It’s been frustrating to watch so many algorithms proposed that are clearly vulnerable to specialized ASICs. The community has been understandably upset each time a specialized ASIC is inevitably produced.
One day last spring we had an idea for how to modify Ethash to accomplish the goal for GPUs. Since we created the basic algorithm, all development and tuning have taken place on our public GitHub.
Q. Who has looked at ProgPoW?
Once the algorithm had settled and we had received feedback from a close peer group, we were lucky enough to have an email review that included engineers from the Ethereum Foundation, Ethereum Core Devs, Nvidia and AMD. The Nvidia and AMD engineers gave the algorithm a generally positive review.
Q. What was AMD’s response?
AMD’s response addressed two major concerns:
- If the Ethash PoW algorithm was replaced with ProgPoW, wouldn’t ASIC manufacturers be able to quickly look at the open source code, and create a specialized ASIC to mimic it?
- Would ProgPoW make it more difficult for GPU miners to mine on Ethereum?
An AMD engineer stated that, yes, in theory you could create an ASIC for ProgPoW, but they would need GPU-specific knowledge, specifically around memory controllers.
Furthermore, they shared concerns about the cache size (stored in LDS on AMD chips). At 8KB or 16KB, they stated in an email, AMD and NV would have similar performance. At 32KB or 64KB there could be a significant impact on both GPU vendor architectures, and there would be some incompatibilities on Polaris and Vega.
Based on their input we set PROGPOW_CACHE_BYTES to be 16KB.
Q. What was NVIDIA’s response?
The Nvidia engineer generally agreed with our approach. They said that the algorithm “fills in the holes” between memory accesses with compute instead of the “GPU sitting idle as a glorified memory controller”.
The main concern they shared was that if too much random math was added the algorithm could become compute bound instead of memory bound. An ASIC for a compute bound algorithm could have a larger efficiency gain.
Based on their input we tuned PROGPOW_CNT_CACHE and PROGPOW_CNT_MATH to ensure the algorithm remains memory bound on most modern GPUs.
Q. Wouldn’t a ProgPoW ASIC be drastically more efficient due to the kiss99() and modulo calls in the main loop for picking random instructions?
This is a common misunderstanding when first looking over the algorithm. The calls to kiss99() and modulo in the main loop are evaluated by the CPU to generate a random program, which the CPU then compiles. The GPU will execute optimized code that’s already resolved what instruction to execute and what mix state to use.
We will attempt to clarify this in the spec.
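The split between CPU and GPU work can be sketched as follows. This is a minimal illustration and not the spec’s generator: `Kiss99` follows Marsaglia’s published KISS99 recurrence, while `OPS`, `NUM_REGS`, `generate_program`, the seeding constants, and the emitted source format are simplified assumptions made for this example.

```python
from dataclasses import dataclass

MASK32 = 0xFFFFFFFF


@dataclass
class Kiss99:
    """State of Marsaglia's KISS99 generator (four 32-bit words)."""
    z: int
    w: int
    jsr: int
    jcong: int

    def next(self) -> int:
        self.z = (36969 * (self.z & 0xFFFF) + (self.z >> 16)) & MASK32
        self.w = (18000 * (self.w & 0xFFFF) + (self.w >> 16)) & MASK32
        mwc = ((self.z << 16) + self.w) & MASK32
        self.jsr ^= (self.jsr << 17) & MASK32
        self.jsr ^= self.jsr >> 13
        self.jsr ^= (self.jsr << 5) & MASK32
        self.jcong = (69069 * self.jcong + 1234567) & MASK32
        return ((mwc ^ self.jcong) + self.jsr) & MASK32


# Hypothetical op templates (eleven of them, to mirror the 11-way choice
# discussed in this FAQ). The real spec defines its own ops and source format.
OPS = [
    "{d} = {a} + {b};", "{d} = {a} * {b};", "{d} = mul_hi({a}, {b});",
    "{d} = min({a}, {b});", "{d} = rotl32({a}, {b});", "{d} = rotr32({a}, {b});",
    "{d} = {a} & {b};", "{d} = {a} | {b};", "{d} = {a} ^ {b};",
    "{d} = clz({a}) + clz({b});", "{d} = popcount({a}) + popcount({b});",
]
NUM_REGS = 32  # assumed register count for this sketch


def generate_program(seed: int, num_ops: int) -> str:
    """Runs on the CPU: every kiss99 call and modulo is resolved here."""
    # The last three seed words are arbitrary nonzero constants for illustration.
    rng = Kiss99(seed & MASK32, 362436069, 521288629, 380116160)
    lines = []
    for _ in range(num_ops):
        op = OPS[rng.next() % len(OPS)]
        a, b, d = (rng.next() % NUM_REGS for _ in range(3))
        lines.append(op.format(d=f"mix[{d}]", a=f"mix[{a}]", b=f"mix[{b}]"))
    return "\n".join(lines)


print(generate_program(seed=1234, num_ops=4))
```

The printed text is what would be handed to the GPU compiler: plain statements with fixed ops and fixed register indices. No random number generation or modulo appears in the generated source.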
Q. Will miners need to install the AMD or Nvidia SDK in order to compile the generated source code?
No. Both AMD and Nvidia’s drivers include OpenCL compilers, the same way they include DirectX and Vulkan compilers. For CUDA a small portion of the SDK will need to be distributed with the miner binary. See this issue for more details.
Q. Does ProgPoW favor one GPU architecture over another?
No. It was designed to have as level a playing field as possible. The OpenCL and CUDA implementations are essentially identical. The 16KB cache size was chosen specifically to work well on both architectures.
We avoided ops that only work on one architecture such as 16 or 24 bit multiplies, AMD’s indexed register file, or Nvidia’s LOP3. All the ops used are well supported by both architectures across multiple generations.
Performance of a GPU for ProgPoW in mining workloads will reflect the average gaming performance of that GPU.
Q. Why did you choose 32-bit mul over 24-bit mul?
We chose the middle ground between two architectures, which would not put too much pressure on either architecture.
32*32 = 32
AMD: 4x Penalty
Utilizes v_mul_lo_u32, which is 4 times slower than most instructions. A “full rate” instruction has a latency of 4 clocks, and is capable of reaching a throughput of one per cycle. The v_mul_lo_u32 instruction has a latency of 16 clocks.
NVIDIA: 3x Penalty
Maxwell and Pascal’s native 16*16=32 multiplication needs 3 instructions to produce the result.
For the normal 32-bit (low) multiply, AMD has a penalty of 4x versus Nvidia’s penalty of 3x. For the 32-bit high multiply, however, AMD has a penalty of 4x versus Nvidia’s 5x. This is +5% in Nvidia’s favor in the case of the low multiply, and +8.3% in AMD’s favor in the case of the high one. The chance of the high multiply occurring is exactly the same as that of the low one. For a truly random unsigned 32-bit n, n modulo 11 distributes over the 11 possibilities almost perfectly (off by roughly 1 in (2**32)/11).
This means that both architectures are effectively equal.
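The near-uniform split over 11 possibilities can be checked exactly without sampling. The helper name `residue_counts` below is ours, introduced only for this sketch.

```python
def residue_counts(modulus: int, bits: int = 32) -> list[int]:
    """Exact number of `bits`-wide unsigned values n for each value of n % modulus."""
    total = 1 << bits
    base, extra = divmod(total, modulus)
    # Residues 0..extra-1 occur once more than the remaining residues.
    return [base + (1 if r < extra else 0) for r in range(modulus)]


counts = residue_counts(11)
print(counts)
print("largest imbalance:", max(counts) - min(counts), "out of roughly", counts[-1])
```

Because 2**32 is not a multiple of 11, a few residues are hit one extra time. That difference of one value in several hundred million is the only bias in the choice.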
24*24 = 32
AMD: No Penalty
Utilizes v_mul_u32_u24, which executes just as fast as most instructions. It is “full rate”, having a latency of 4 clock cycles and being capable of a throughput of one operation per cycle. No penalty.
NVIDIA: 5x Penalty
This operation has no hardware support. It can be emulated by ANDing both inputs to produce 24-bit numbers, followed by a normal 3-instruction 32-bit multiply. 5x penalty.
16*16 = 32
AMD: 3x Penalty
This operation has no hardware support. It can be emulated by ANDing both inputs to produce 16-bit numbers, followed by a v_mul_u32_u24.
NVIDIA: No Penalty
This operation is a normal instruction at full speed.
16*16 = 16
AMD: No Penalty
This operation is one full rate instruction, v_mul_lo_u16, which has a latency of 4 clock cycles and is capable of a throughput of one operation per cycle.
NVIDIA: 2x Penalty
This operation would require two instructions: first a multiply that has a 32-bit result, then a bitwise AND to discard the high 16 bits.
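The emulation sequences described for the narrower multiplies can be written out as a small model. This sketch only reproduces the arithmetic on 32-bit values; the function names are ours, and the instruction counts in the comments restate the figures given in this FAQ rather than anything measured here.

```python
MASK32 = 0xFFFFFFFF


def mul32_lo(a: int, b: int) -> int:
    """Low 32 bits of a 32-bit by 32-bit product."""
    return (a * b) & MASK32


def mul24_emulated(a: int, b: int) -> int:
    """24*24 = 32 without a native op: two ANDs, then a 32-bit multiply."""
    return mul32_lo(a & 0xFFFFFF, b & 0xFFFFFF)


def mul16_to_32_emulated(a: int, b: int) -> int:
    """16*16 = 32 without a native op: two ANDs, then one wider multiply."""
    return (a & 0xFFFF) * (b & 0xFFFF)


def mul16_truncated(a: int, b: int) -> int:
    """16-bit result from a 32-bit multiply: multiply, then AND away the high bits."""
    return mul32_lo(a, b) & 0xFFFF


print(hex(mul24_emulated(0xFFFFFFFF, 2)))
print(hex(mul16_to_32_emulated(0x1FFFF, 3)))
print(hex(mul16_truncated(0xFFFF, 0xFFFF)))
```

Each extra AND in these helpers corresponds to one extra instruction on the architecture that lacks the native operation, which is where the penalties come from.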
Q. Why does my GPU with a heavily modified VBIOS slow down significantly more than the 2x expected difference between Ethash and ProgPoW?
ProgPoW reads 2x as much memory per hash as Ethash, so the expected hash rate is 1/2. All our tuning and the sample hashrates reported previously (see “Results: Hashrate”) were done on GPUs running at stock frequencies. Heavily modified VBIOS configurations that lower the core frequency will result in the algorithm becoming compute-limited instead of memory-limited.
As with any new algorithm, VBIOS modifications and tuning will need to be redone.
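The effect of lowering the core clock can be illustrated with a simple two-limit model: the hashrate is the smaller of what memory bandwidth allows and what the compute core allows. Everything numeric below is a hypothetical assumption chosen for illustration (the bandwidth, the clocks, and the cycles-per-hash figures); only the 2x ratio in bytes read per hash comes from this FAQ, and the 8192-byte Ethash figure is itself an assumption.

```python
def hashrate(mem_bw: float, bytes_per_hash: float,
             core_hz: float, cycles_per_hash: float) -> float:
    """Hashes per second under a two-limit (memory vs. compute) model."""
    mem_limit = mem_bw / bytes_per_hash
    compute_limit = core_hz / cycles_per_hash
    return min(mem_limit, compute_limit)


MEM_BW = 256e9                       # hypothetical memory bandwidth, bytes/s
ETHASH_BYTES = 8192                  # assumed bytes read per Ethash hash
PROGPOW_BYTES = 2 * ETHASH_BYTES     # 2x Ethash, per this FAQ
ETHASH_CYCLES = 20                   # hypothetical whole-GPU cycles per hash
PROGPOW_CYCLES = 80                  # hypothetical whole-GPU cycles per hash

for label, core_hz in [("stock core clock", 1.5e9), ("lowered core clock", 1.0e9)]:
    eth = hashrate(MEM_BW, ETHASH_BYTES, core_hz, ETHASH_CYCLES)
    ppw = hashrate(MEM_BW, PROGPOW_BYTES, core_hz, PROGPOW_CYCLES)
    print(f"{label}: ProgPoW / Ethash = {ppw / eth:.2f}")
```

With these made-up numbers the stock configuration lands on the expected 1/2, while the lowered core clock pushes ProgPoW under its memory limit and the ratio drops below 1/2. Ethash is unaffected because it needs so little compute.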
Q. Can you walk through how an Ethash ASIC can be 2x more efficient than a GPU?
The Ethash algorithm requires just 3 components to execute:
- High bandwidth memory (for DAG access)
- Keccak f1600 engine (for the initial/final hashes)
- Tiny compute core (for the inner loop FNV and address modulo)
The Acorn line of FPGAs has shown that the power associated with the Keccak computations can be reduced to a negligible amount.
We estimate that roughly 1/2 of the GPU’s power while executing Ethash is spent on memory accesses. With the Keccak and compute cores taking negligible power, an Ethash ASIC’s power is simply the power of the memory accesses: 1/2 that of a GPU, for a 2x improvement.
A quick summary of current Ethash mining hardware:
The first Ethash ASIC, Bitmain’s Antminer E3, did not have any efficiency advantage over GPUs. This is because the DDR3 memory it contains uses significantly more power than the GDDR memory GPUs use.
The Innosilicon A10 ETHMaster, which to our knowledge has not been released, claims to be significantly more efficient. Considering Innosilicon sells GDDR6 IP we presume the miner makes use of GDDR6 memory. This makes it 2.1x as efficient as the RTX 2070, currently the most efficient consumer GPU.
Q. What about HBM?
Our initial algorithm evaluation assumed an apples-to-apples comparison using the same memory type. HBM uses much less power but it is also much more expensive, making it somewhat impractical. For example, an Nvidia Titan V with HBM is just slightly less efficient than the A10 ETHMaster, but at a cost of $3000 it is obviously impractical.
AMD’s Vega cards with HBM are reasonably priced but for some reason produce only 175 kilohash/s/watt. We’re unsure as to what is limiting Vega’s efficiency. Increasing the access size helps (it goes from 61% to 75% bandwidth utilization — see “Results: Hashrate”), but it still burns significant power. We expect the just-announced Radeon VII with >2x the bandwidth to be significantly more efficient.
We estimate that HBM consumes roughly 1/2 the power of GDDR6. If an expensive Ethash ASIC were made using HBM, it would achieve more than 1 megahash/s/watt, or around 4x the efficiency of any consumer GPU.
Q. How much more efficient could a ProgPoW ASIC be?
ProgPoW is designed to drastically reduce the efficiency gains available to a specialized ASIC. The ProgPoW algorithm requires the following components to execute:
- High bandwidth memory (for DAG access)
- Keccak f800 engine (for the initial/final hashes)
- Large SIMD register file (for the mix state)
- High throughput SIMD integer math (for the random math)
- High throughput SIMD cache (for the random cache accesses)
The size of the Keccak has been reduced, so it already consumes negligible power on a GPU. An ASIC reducing the power even more makes little difference.
In order to execute the random sequence a ProgPoW ASIC would need to implement something very similar to the compute core within existing GPUs. All the SIMD register access, math calculations, and cache accesses would need to be performed just like on a GPU. It’s true that a ProgPoW ASIC’s ISA could be more closely crafted to exactly match the ProgPoW algorithm, such as removing floating point and adding explicit merge() operations. This specialization would provide a marginal benefit, not an order-of-magnitude benefit.
To be optimistic, let’s assume that a finely crafted ProgPoW ASIC ISA could remove 1/4 of the compute core’s power consumption. Since the GPU core is much more active when executing ProgPoW, we estimate the memory interface consumes roughly 1/3 of the GPU’s power. A ProgPoW ASIC that used GDDR would have a relative power consumption of:
1/3 (memory) * 1 + 2/3 (compute) * 3/4 = 5/6
For a 1.2x advantage.
If the ProgPoW ASIC made use of HBM then the relative power would be:
1/3 (memory) * 1/2 + 2/3 (compute) * 3/4 = 2/3
For a 1.5x advantage.
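The arithmetic behind these estimates, and behind the earlier Ethash ones, can be reproduced with exact fractions. The function `relative_power` and its parameter names are ours; the input fractions are the estimates stated in this FAQ, not measurements.

```python
from fractions import Fraction as F


def relative_power(mem_share: F, mem_scale: F, compute_scale: F) -> F:
    """ASIC power relative to a GPU (GPU = 1).

    mem_share:     fraction of GPU power spent on memory accesses
    mem_scale:     ASIC memory power relative to the GPU's (1 = same memory type)
    compute_scale: ASIC compute power relative to the GPU's
    """
    compute_share = 1 - mem_share
    return mem_share * mem_scale + compute_share * compute_scale


cases = {
    # Ethash: memory is 1/2 of GPU power, ASIC compute power treated as negligible.
    "Ethash ASIC, GDDR": relative_power(F(1, 2), F(1), F(0)),
    "Ethash ASIC, HBM": relative_power(F(1, 2), F(1, 2), F(0)),
    # ProgPoW: memory is 1/3 of GPU power, ASIC compute keeps 3/4 of the power.
    "ProgPoW ASIC, GDDR": relative_power(F(1, 3), F(1), F(3, 4)),
    "ProgPoW ASIC, HBM": relative_power(F(1, 3), F(1, 2), F(3, 4)),
}

for name, power in cases.items():
    print(f"{name}: relative power {power}, advantage {float(1 / power):.2f}x")
```

The advantage is the reciprocal of the relative power, so a relative power of 5/6 corresponds to 1.2x and 2/3 corresponds to 1.5x.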
Q. Could ProgPoW run on an FPGA?
First, there are practical issues with running ProgPoW on an FPGA. Since the random program changes every 12.5 minutes, a new bitstream needs to be compiled and loaded that often. The tools and infrastructure required to do this generally don’t exist.
Even ignoring that issue, ProgPoW does not map well to an FPGA. FPGAs can be quite efficient for compute-dense algorithms, such as Keccak or Lyra. These algorithms can see a significant performance increase and power decrease by packing multiple ops into a single clock cycle and running many ops in parallel.
The ProgPoW loop has many cache reads interleaved in the sequence. This drastically reduces what ops can be packed into a single clock cycle or run in parallel, both decreasing the performance and increasing the pipeline length. The increased pipeline length becomes a problem due to the large mix state (16 lanes * 32 regs * 4 bytes = 2 kilobytes). If this large mix state is copied along every pipe stage that is a significant amount of wasted power. If the mix state is stored in a register file the compute core is starting to look a lot like an ASIC’s or GPU’s compute core, and FPGAs are always significantly less efficient than an ASIC.
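To give a feel for the cost of carrying the mix state through a pipeline, the sketch below computes the state size from the figures above and multiplies it by a pipeline depth. The depth of 100 stages is a purely hypothetical value picked for illustration, as is the helper name `pipeline_register_bits`.

```python
LANES = 16
REGS_PER_LANE = 32
BYTES_PER_REG = 4

mix_state_bytes = LANES * REGS_PER_LANE * BYTES_PER_REG


def pipeline_register_bits(stages: int) -> int:
    """Bits of storage needed if the whole mix state is copied at every stage."""
    return stages * mix_state_bytes * 8


print("mix state:", mix_state_bytes, "bytes")
# Hypothetical pipeline depth; a real design would differ.
print("bits held across 100 stages:", pipeline_register_bits(100))
```

Every one of those bits would be rewritten on each clock in a fully pipelined design, which is the wasted power described above.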
Q. This has gotten really long. Can you give me a summary?
Our original 2x and 1.2x estimates for Ethash and ProgPoW assumed an apples-to-apples comparison using the same memory type. In writing this FAQ we’ve realized it’s important to also include the apples-to-oranges comparison of an ASIC using HBM while most consumer GPUs use GDDR.