Performance and Tuning
ProgPoW — short for ‘Programmatic Proof-of-Work’ (sometimes jokingly called PorgyPoW, after the Porgs of Star Wars: The Last Jedi) — is a GPU-tuned extension of Ethash that minimizes the efficiency gap available to fixed-function hardware.
As a design paper, this document presents the technical specifications and performance metrics for commonly available GPU cards. Philosophical rationale or justification for adoption is not presented here; we encourage each coin's community to do its own research on whether or not ProgPoW is the right fit for its network.
“Historically, proof-of-work mining has taken a fixed algorithm and modified the hardware to be ‘efficient’ at executing the algorithm. With ProgPoW, we’ve put this paradigm in reverse — we have taken hardware, and modified the algorithm to match it.”
An efficient algorithm for hardware needs to match the access patterns and available space of that hardware. This is why AMD GPUs with firmware edits saw large performance gains on Ethereum — because the access patterns of memory chips were matched to the access patterns of Ethash.
Ethash, as used by Ethereum today, is a memory-hard proof-of-work algorithm that runs reasonably well on commodity GPUs. It requires a reasonably large framebuffer (currently 2.5 GB) and as much memory bandwidth as possible — both of which a commodity GPU provides. However, this is all that Ethash requires, which becomes obvious when a graphics card is profiled with ethminer, the reference implementation for Ethash GPU mining.
The SMs (streaming multiprocessors — the computational cores of an NVIDIA GPU, which create, manage, schedule, and execute instructions from many threads in parallel) consume most of the GPU’s die area, yet they run at less than 30% utilization.
While CodeXL presents fewer details, it shows similar behaviour on AMD hardware — the Vector ALUs (arithmetic logic units, digital circuits that perform arithmetic and bitwise operations) also execute at less than 30% utilization.
One of the deficiencies of Ethash is that it performs small, 128-byte reads from main memory. This small access size is one reason that GPUs with GDDR5X memory were inefficient at executing Ethash. The public version of ETHlargement matched GDDR5X’s access patterns to Ethash’s, enabling 128-byte loads to run at full speed.
Another issue causes the core utilization of a GPU to be overstated — Keccak, the hash function called at the start and end of Ethash, can be executed much more efficiently on an FPGA or an ASIC. The Acorn line of FPGAs was designed specifically to offload the Keccak computation, saving system power and increasing performance. Profiling Ethash with Keccak removed shows that the compute cores of a card are really utilized only about 20% of the time, allowing for a roughly 10% efficiency gain.
Summarizing the per-source line profiling data shows that more than 20% of the instructions are Keccak, which can be offloaded.
These results show that a specialized ASIC targeting Ethash could be designed with the following properties:
- A high-bandwidth memory interface (most commonly GDDR6 or HBM2).
- A Keccak engine.
- A small compute core to do inner loop FNV and modulo operations.
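The FNV merge mentioned above is tiny in hardware terms, which is why the hypothetical ASIC's compute core can be so small. A minimal sketch of Ethash's FNV merge on 32-bit words:

```python
# Sketch of the FNV merge used in Ethash's inner loop (not the full
# algorithm): each 32-bit mix word is combined with DAG data via a
# single multiply and XOR, which needs only a trivial amount of silicon.
FNV_PRIME = 0x01000193
MASK32 = 0xFFFFFFFF

def fnv(v1: int, v2: int) -> int:
    """FNV merge on 32-bit words, as used by Ethash."""
    return ((v1 * FNV_PRIME) ^ v2) & MASK32

# Example: merging one mix word with one DAG word.
mixed = fnv(0x12345678, 0x9ABCDEF0)
```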
The resulting ASIC would be much smaller and have a much lower power footprint than existing commodity GPUs.
The design for ProgPoW began by taking Ethash and modifying it to utilize as much of a commodity GPU card as possible. The target was the hardware most commonly available to the blockchain consumer community at large: at the time of writing, AMD’s Polaris and Vega lines of graphics cards, and NVIDIA’s Pascal line of graphics cards.
The Keccak hashes used at the start and end of Ethash were reduced from f1600 (with a word size of 64 bits) to f800 (with a word size of 32 bits). f1600 requires at least twice as many instructions to execute on a graphics card, since GPUs have 32-bit datapaths. Ethash itself does not use the extra data processed by f1600, so reducing the amount of data processed has no effect on the security of the algorithm. However, it does reduce the possible efficiency gain from offloading the Keccak computation from the GPU.
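The halving follows directly from the Keccak-f[b] state layout: the permutation always operates on 25 lanes, so the lane width is b/25 bits. A quick check of the arithmetic:

```python
# Keccak-f[b] permutes a state of 25 lanes, each b/25 bits wide.
lanes = 25
f1600_word_bits = 1600 // lanes   # 64-bit lanes
f800_word_bits = 800 // lanes     # 32-bit lanes

# On a 32-bit GPU datapath, each 64-bit lane operation costs at least
# two 32-bit instructions, so f1600 is at least twice the work of f800.
instruction_ratio = f1600_word_bits // f800_word_bits
```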
The number of accesses to the DAG (which is also the number of loop iterations) is unchanged from Ethash at 64.
The DAG data read size is increased from 128 bytes to 256 bytes to allow ProgPoW to execute efficiently on all current, and hopefully near-future, DRAM technologies without requiring overclocked timings.
GPU cores are most efficient when they issue 16-byte (4-word) loads. To perform 256-byte loads, there must be 256 bytes / 16 bytes per lane = 16 lanes working together in parallel.
Working back from the framebuffer interface, GPUs have L2, L1, and texture caches. Unfortunately, we have not discovered a method of using these caches that is both performant and portable across GPU architectures, so ProgPoW does not target them — DAG loads simply pass through.
Next to the L1 cache, GPUs have a small amount of scratchpad memory, a high-speed internal memory used for temporary storage of calculations, data, and other work in progress. NVIDIA and CUDA refer to this as shared memory, whereas AMD and OpenCL refer to this as local memory. The defining feature of this memory is that it’s highly banked with a large crossbar, allowing accesses to random addresses to be processed quickly.
NVIDIA’s Pascal line supports scratchpads of up to 96 KB, while AMD’s Polaris and Vega lines support up to 64 KB. The AMD OpenCL kernel currently requires additional scratchpad space to exchange data between lanes (this wouldn’t be the case if AMD’s cross-lane operations were exposed in OpenCL). To execute efficiently on all existing architectures without limiting occupancy, the cached portion of the DAG is set to 16 KB.
The compute core of a GPU has a large number of registers that feed high throughput programmable math units. The inner loop of Ethash has just the DAG load and then an FNV to merge the data into a small mix state. ProgPoW adds a sequence of random math instructions and random cache reads that are merged into a much larger mix state.
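The way random data is folded into the mix state is described by a small merge function in the ProgPoW specification. The sketch below is a hedged Python rendering of that function; the exact selector cases and constants should be checked against the spec version in use:

```python
# Hedged sketch of ProgPoW's merge() helper, which combines new data
# (from DAG loads, cache reads, or random math) into a mix register
# while preserving entropy. Selector constants follow the published
# spec as best recalled; verify against the reference implementation.
MASK32 = 0xFFFFFFFF

def rotl32(x: int, n: int) -> int:
    """Rotate a 32-bit value left by n bits."""
    n &= 31
    return ((x << n) | (x >> (32 - n))) & MASK32

def merge(a: int, b: int, r: int) -> int:
    x = ((r >> 16) % 31) + 1   # rotation count, never zero
    if r % 4 == 0:
        return (a * 33 + b) & MASK32
    if r % 4 == 1:
        return ((a ^ b) * 33) & MASK32
    if r % 4 == 2:
        return rotl32(a, x) ^ b
    return rotl32(a, 32 - x) ^ b   # rotate right by x
```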
How large the mix state is, and how many math instructions and cache reads to perform, was found empirically: the parameters were increased until the compute utilization matched the memory utilization.
With the above settings, ProgPoW is able to saturate both compute (the SMs) and memory bandwidth at once.
The scratchpad memory (or, shared memory in CUDA terminology) is also saturated:
An RX 580 also has similar compute utilization.
While the Keccak calculation is halved, ProgPoW adds a series of KISS99 calculations in the fill_mix stage to initialize the mix state. fill_mix produces too much data to offload to an external FPGA, but it could be implemented on-chip within an ASIC using a small accelerator. However, Keccak and fill_mix together account for only about 7% of the compute (SM) utilization:
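KISS99 itself is Marsaglia's well-known combination generator; a single step is a handful of cheap 32-bit operations, which is why a small on-chip accelerator would suffice. A sketch of one step (seed values here are Marsaglia's original example seeds, used purely for illustration, not ProgPoW's actual seeding):

```python
# One step of Marsaglia's KISS99 PRNG on 32-bit words, the generator
# ProgPoW's fill_mix stage uses. `st` holds the four state words.
MASK32 = 0xFFFFFFFF

def kiss99(st: dict) -> int:
    st['z'] = (36969 * (st['z'] & 65535) + (st['z'] >> 16)) & MASK32
    st['w'] = (18000 * (st['w'] & 65535) + (st['w'] >> 16)) & MASK32
    mwc = ((st['z'] << 16) + st['w']) & MASK32          # multiply-with-carry
    st['jsr'] ^= (st['jsr'] << 17) & MASK32             # 3-shift register
    st['jsr'] ^= st['jsr'] >> 13
    st['jsr'] ^= (st['jsr'] << 5) & MASK32
    st['jcong'] = (69069 * st['jcong'] + 1234567) & MASK32  # congruential
    return ((mwc ^ st['jcong']) + st['jsr']) & MASK32

# Illustrative seeds from Marsaglia's original posting.
state = {'z': 362436069, 'w': 521288629, 'jsr': 123456789, 'jcong': 380116160}
sample = kiss99(state)
```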
The per-source line summary tells a similar story — more than 90% of the instructions are executed within the inner loop of DAG access, random math and random cache accesses. Keccak and fill_mix account for about 7% of the instructions:
These results show that an ASIC specialized for executing ProgPoW would need to consist of:
- A high bandwidth memory interface.
- A compute core with a large register file.
- A compute core with high-throughput integer math.
- A high-throughput, highly banked cache.
- Small Keccak + KISS99 engines.
This specialized ASIC would look suspiciously similar to existing commodity GPUs. It would be only marginally smaller and would have a similar power profile.
We also managed to get our hands on an RTX 2080, which has GDDR6 memory, to perform some initial benchmarks. The CUDA profiler does not yet fully support the new Turing chip: as a result, a number of performance metrics (including framebuffer utilization) are listed as 0.
The GDDR6 memory appears to have similar issues to GDDR5X with Ethash’s 128-byte loads. With 256-byte loads, ProgPoW is able to saturate the memory bandwidth, running slightly faster than a Titan X (Pascal) even though the RTX 2080 has slightly less memory bandwidth (448 vs. 480 GB/s).
The Turing SMs appear to be able to execute many more math instructions and shared-memory accesses than the Pascal SMs could. Both the core and the shared memory run at roughly half the utilization seen on Pascal.
This means that, given the current tuning of ProgPoW, much more of a Turing ASIC will sit idle than an equivalent Pascal ASIC. Pascal GPUs, therefore, have a higher performance per die area, which is roughly correlated with performance per dollar.
If at some point in the future it was desired that ProgPoW should target the Turing generation of GPUs, it would be a simple matter of changing a few of the tuning parameters (such as PROGPOW_REGS, PROGPOW_CNT_CACHE and PROGPOW_CNT_MATH). With appropriate tuning, Turing GPUs would maintain the same performance, while the current generation of GPUs would become compute limited, and slow down.
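To illustrate how small such a retune would be, the knobs in question are a handful of compile-time constants. The values below are assumed defaults from an early ProgPoW revision, shown for illustration only; consult the reference implementation for the authoritative numbers:

```python
# Hedged sketch: ProgPoW tuning parameters as compile-time constants.
# The specific values are assumptions based on an early spec revision
# and are for illustration only.
PROGPOW_LANES = 16               # lanes cooperating on one hash
PROGPOW_REGS = 32                # mix-state registers per lane
PROGPOW_CACHE_BYTES = 16 * 1024  # cached DAG portion held in scratchpad
PROGPOW_CNT_DAG = 64             # DAG accesses (outer loop), same as Ethash
PROGPOW_CNT_CACHE = 12           # random cache reads per loop iteration
PROGPOW_CNT_MATH = 20            # random math instructions per loop iteration

# Retargeting a newer GPU generation (e.g. Turing) would mean raising
# PROGPOW_CNT_CACHE / PROGPOW_CNT_MATH until compute utilization again
# matches memory utilization.
mix_state_bytes = PROGPOW_LANES * PROGPOW_REGS * 4   # 32-bit registers
```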
The bandwidth utilization columns compare the observed hashrate against the theoretical hashrate if 100% of the GPU’s memory bandwidth could be consumed (which is never possible in the real world).
The theoretical hashrate is calculated as bandwidth / data-per-hash, where the data read per hash is 8 KB for Ethash and 16 KB for ProgPoW.
The general expectation is that ProgPoW should have around half the hashrate of Ethash since it accesses twice as much memory per hash. This holds true for GPUs that utilize GDDR5 memory — the RX 580 and the GTX 1070. GPUs that utilize HBM2, GDDR5X and GDDR6 are more efficient at executing ProgPoW.
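The theoretical-hashrate arithmetic above can be written out directly. The 8 KB and 16 KB figures follow from the loop counts already described (64 accesses of 128 bytes for Ethash, 64 accesses of 256 bytes for ProgPoW); the 448 GB/s figure is the RTX 2080's bandwidth quoted earlier, used here only as a worked example:

```python
# Theoretical hashrate cap = memory bandwidth / data read per hash.
def theoretical_hashrate_mhs(bandwidth_gbs: float, bytes_per_hash: int) -> float:
    """Upper-bound hashrate in MH/s if 100% of bandwidth were usable."""
    return bandwidth_gbs * 1e9 / bytes_per_hash / 1e6

ethash_bph = 64 * 128    # 8 KB per hash
progpow_bph = 64 * 256   # 16 KB per hash

# Worked example for a 448 GB/s card (RTX 2080-class):
eth_cap = theoretical_hashrate_mhs(448, ethash_bph)    # ~54.7 MH/s
pp_cap = theoretical_hashrate_mhs(448, progpow_bph)    # ~27.3 MH/s
```

Since ProgPoW reads exactly twice the data per hash, its theoretical cap is exactly half of Ethash's on any given card, matching the expectation stated above.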
Performance Profiling Reports
The following profiler images are attached in .png format:
Note: You will need to zoom-in to clearly view the data.
There are also two .csv files attached, containing source code annotated with profiler information:
The reference implementation can be found here.