The Cost of ASIC Design

It’s not rocket science. But it is a science.

IfDefElse
Mar 30, 2019
Source: https://www.custompcreview.com/news/amd-polaris-10-die-shot-confirms-radeon-rx-480-utilizes-full-gpu/

Introduction

Speculation about hardware design and development costs, as it pertains to both ProgPoW and Ethash, is usually followed by a statement of authority: trust the author, because they have previous experience in <field>. Sometimes it is cryptocurrency-ASIC production; other times, integrated-circuit design.

For the audience more familiar with code than with fan-out and rise times, some perspective on why this statement of authority does not apply to ProgPoW may be helpful.

A programmer can work on anything and everything: from writing scripts to iPhone apps, from embedded systems to the Windows operating system. But having shipped an app does not make you an authority on the App Store backend or its efficiency; having shipped an RTOS does not make you an authority on the cost trade-offs of scaling Windows.

This does not mean the Windows designer is a ‘superior programmer’. Rather, these backgrounds differ in their understanding and assumptions, especially when speaking on the topic of economies of scale.

In a similar vein, a hardware designer can work on everything from an IC for a toothbrush to the silicon architecture of networking equipment. The toothbrush engineer who ships 100,000 units will have no understanding of the economies of scale available to the networking engineer who ships 1,000,000 units monthly. Similarly, a cryptocurrency-ASIC designer will have very little understanding of the domain of a GPU-ASIC designer; these are industries (and often countries) apart from each other.

One last point: both programming and engineering are skills. Unless it is something you work with daily, you quickly fall behind in both knowledge and authority. This is why a new cryptocurrency-ASIC manufacturer may struggle to enter the SHA256d market: they will need to catch up with engineers who have already studied SHA256d for six years.

Often, these articles prey on the lack of hardware knowledge in the ecosystem. Cryptocurrency has been a predominantly software-driven industry, with most hardware engineering taking place behind closed doors within private companies.

We have seen armchair experts go to great lengths to assure software engineers of their ability to outmaneuver the ecosystem; we have seen this happen with Monero, with Bitcoin, and with ZCash. The reality is that a real challenge has never been presented: these designers were never competing against hardware built by a company with the liquidity of capital markets or multi-decade engineering experience. If Bitmain or Innosilicon tried to make a CPU, do you think they would be able to outmaneuver AMD or Intel?

Economies of scale are always present, be it in cost or in experience.

Breaking Apart an IC Designer’s Arguments

Whether the algorithm is ProgPoW or ETHash, the hashrate is determined by the storage bandwidth of external DRAM.

No. The hashrate of ProgPoW is determined by two factors: the compute core as well as the memory bandwidth. This is why the hashrate varies between Ethash and ProgPoW, as shown below:

Source: https://medium.com/@infantry1337/comprehensive-progpow-benchmark-715126798476

Now the memory demand for profitably mining ETHash has increased significantly. This demand for high-bandwidth memory has prompted the development of next-generation high-speed memory tech such as GDDR6 and HMB2.

The demand for high-bandwidth memory did not come from ‘Ethash’, a 15 billion-dollar market cap of which only a portion is attributable to mining. It came from the demand of the core markets: GPUs, FPGAs, AI, HPC, and gaming. Mining demand does not drive the 1.2 trillion-dollar AI market, the 30 billion-dollar PC gaming market, the 35 billion-dollar console market, or the 29 billion-dollar HPC market.

Because of the similarities that exist between the algorithm as well as architecture of ProgPoW and those of ETHash, I believe that Innosilicon’s next ASIC would be tailored for ProgPoW.

The only similarity between ProgPoW and Ethash is the usage of the DAG in global memory. From a compute perspective, Ethash only requires a fixed keccak_f1600 core, and a modulo function. ProgPoW, on the other hand, requires the ability to execute a 16-lane wide sequence of random math, while accessing a high-bandwidth L1 cache. Designing a compute core that can execute the ProgPoW math sequence is significantly more difficult than implementing a fixed-function hash like keccak.
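
To make that difference concrete, here is a minimal C sketch of the kind of per-lane integer math sequence a ProgPoW-style core must execute. The operation mix, register count, and PRNG are simplified stand-ins, not the actual ProgPoW specification (which uses KISS99, 16 lanes, and a cache read per loop iteration):

```c
/* Illustrative sketch only: shows the *kind* of randomized integer math a
 * ProgPoW-style inner loop executes per lane. The real algorithm uses KISS99
 * seeded from the block number, 16 lanes, more registers, and an L1 cache
 * read per iteration; those details are simplified away here. */
#include <stdint.h>
#include <stdio.h>

static uint32_t rotl32(uint32_t x, uint32_t n) {
    n &= 31;
    return n ? (x << n) | (x >> (32 - n)) : x;
}

/* Apply one randomly selected integer operation. The *selection* is what a
 * fixed-function ASIC cannot hard-wire ahead of time. */
static uint32_t random_math(uint32_t a, uint32_t b, uint32_t selector) {
    switch (selector % 7) {
        case 0: return a + b;
        case 1: return a * b;
        case 2: return (uint32_t)(((uint64_t)a * b) >> 32);  /* mul_hi */
        case 3: return rotl32(a, b);
        case 4: return a ^ b;
        case 5: return a & b;
        default: return (uint32_t)(__builtin_popcount(a) + __builtin_popcount(b));
    }
}

int main(void) {
    /* One "lane" with a few registers; a GPU runs many such lanes in parallel. */
    uint32_t regs[8] = {0x811c9dc5, 0x01000193, 0xdeadbeef, 0x12345678,
                        0x9e3779b9, 0x7f4a7c15, 0xcafebabe, 0x243f6a88};
    uint32_t prng = 0x9e3779b9;   /* toy LCG below, a stand-in for KISS99 */

    for (int i = 0; i < 64; i++) {
        prng = prng * 1664525u + 1013904223u;
        uint32_t dst = prng % 8;
        uint32_t src = (prng >> 8) % 8;
        regs[dst] = random_math(regs[dst], regs[src], prng >> 16);
    }

    printf("lane digest: %08x\n", (unsigned)(regs[0] ^ regs[1] ^ regs[2] ^ regs[3]));
    return 0;
}
```

By contrast, the entire Ethash compute path is a fixed keccak_f1600 plus a modulo to pick a DAG address, which is exactly the kind of logic a fixed-function ASIC hard-wires cheaply.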

It is true that Ethash’s hashrate effectively depends only on memory bandwidth. The point of ProgPoW is that it depends on both memory bandwidth and core compute on randomized math sequences. That distinction is important.

The point of proof-of-work is to have mathematical proof of costs in hardware and energy. Ethash was an algorithm that could not capture a big part of the hardware expense (the compute engine) in that mathematical proof; it only captured the memory interfaces. This is why you could have a cryptocurrency-ASIC that cut out the part that was not being captured in the math.

Because GPU is a general-purpose acceleration chip, it usually takes about 12 months for a GPU to be designed, fabricated and tested, requiring a lot of hardware simulations and software developing to cover different computing scenarios.

ProgPoW simply aims to capture the entirety of the hardware cost (as best as it can). Since the new part of the algorithm is meant to capture the compute hardware running different computing scenarios, down to the architectural wrinkles, it is not just ‘another’ three-to-four-month ASIC design.

This brings up a question: why have floating-point operations been omitted? Simple: floating point is not portable across chips. Different chips handle corner cases related to special values (INF, NaN, and denormals) in different ways. The largest divergence is in the handling of NaN, which will happen naturally when using random inputs. To quote the Wikipedia page:

If there are multiple NaN inputs, the result NaN’s payload should be from one of the input NaNs; the standard does not specify which.

This means that in order to use floating point, essentially every floating-point operation would need to be paired with an if (is_special(val)) val = 0.0 check. This check could trivially be done in hardware, which would be a huge benefit for a cryptocurrency-ASIC.
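
As a rough illustration of what that guard looks like in practice (is_special here is a hypothetical helper built from standard C99 classification macros, not part of any ProgPoW code):

```c
/* Sketch of the guard every floating-point step would need in order to stay
 * portable: flush special values (NaN, infinities, denormals) to zero before
 * they can propagate implementation-defined bit patterns into the hash. */
#include <math.h>
#include <stdio.h>

static int is_special(float v) {
    /* NaN, +/-INF, or denormal: exactly the cases where hardware behavior
     * (especially NaN payload propagation) differs between vendors. */
    return isnan(v) || isinf(v) || (v != 0.0f && !isnormal(v));
}

static float fp_op_guarded(float a, float b) {
    float r = a * b + 1.0f;        /* some arbitrary floating-point step     */
    if (is_special(r)) r = 0.0f;   /* the per-operation check from the text  */
    return r;
}

int main(void) {
    printf("%f\n", fp_op_guarded(3.0f, 2.0f));      /* normal case: 7.0      */
    printf("%f\n", fp_op_guarded(INFINITY, 0.0f));  /* INF*0 -> NaN -> 0.0   */
    return 0;
}
```

On a GPU this costs an extra instruction per floating-point operation; in dedicated hardware it is essentially free, which is exactly the kind of gap a cryptocurrency-ASIC would exploit.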

So what is hashrate, or hash-per-watt, in all of this?

Hashrate is a measure of energy cost. As long as everyone is measured in the same way, the energy consumption per hash does not matter: a miner will continue to invest in as much energy (as hashrate) as they can afford. The operating-cost economics do not change because you switch the unit of measurement from 1 Ethash hash (a smaller unit, like the joule) to 1 ProgPoW hash (a bigger unit, like the calorie). Global hashrate measures the total economic weight of everyone’s contribution to securing the network. As long as everyone’s contributions are measured fairly, with the same units, nothing changes for the typical miner.

The claim about ‘large farms’ is often brought up around ProgPoW, and again, we reiterate: economies of scale will always exist, and are a fact of life.

An ASIC producer can use the smaller GDDR6 memory banks to gain cost advantages over GPUs. 16 GDDR6 4GB memory banks can be used to achieve a 2x bandwidth advantage, while maintaining GDDR6 costs at almost the same level.

First, having 2x the bandwidth will require 2x the compute, so it’s linear scaling, rather than an advantage.

Second, there are currently no production-ready 4-gigabit GDDR6 memory chips. Micron only produces 8-gigabit chips, while Samsung produces 8-gigabit and 16-gigabit chips. The GDDR6 I/O interface area is expensive on the memory die: each generation, the interface takes up more of the die relative to the memory cells, because the PHY design cannot shrink as fast as memory cells can with process shrinks. The memory market is driven by its major buyers over long cycles (gaming consoles, GPUs), who also tend to favor bigger capacities. Memory vendors have no incentive to make a risky 4-gigabit part that cannot drive O(billions/yr) in a consistent manner.
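
For a sense of the interface scaling involved, a quick back-of-the-envelope sketch (the 14 Gb/s per-pin rate and the 32-bit-per-device bus width are typical published GDDR6 figures; board-level routing and power are ignored):

```c
/* Rough GDDR6 interface scaling: how many data pins and devices it takes to
 * hit a target bandwidth. Doubling bandwidth doubles both, i.e. linear
 * scaling on the memory side and on the compute die's PHY, not a shortcut. */
#include <stdio.h>

static void report(double target_GBs) {
    const double gbps_per_pin = 14.0;   /* typical GDDR6 data rate per pin    */
    const int    pins_per_dev = 32;     /* each GDDR6 device has a 32-bit bus */

    double pins    = target_GBs * 8.0 / gbps_per_pin;
    double devices = pins / pins_per_dev;
    printf("%4.0f GB/s -> ~%3.0f data pins, ~%.0f GDDR6 devices\n",
           target_GBs, pins, devices);
}

int main(void) {
    report(256.0);   /* roughly one mid-range GPU's memory bandwidth          */
    report(512.0);   /* "2x bandwidth" needs 2x the devices and 2x the PHY    */
    return 0;
}
```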

There are many modules in the RTX2080 chip that occupy a lot of the chip area and are useless for ProgPoW. These include PCIE, NVLINK, L2Cache, 3072 shading units, 64 ROPs, 192 TMUs et. all.

The RTX 2080 is not a good reference for this discussion. Nvidia’s RTX series includes significant die area for new features, such as the ray-tracing cores. Since ProgPoW is designed to work on existing hardware in the ecosystem from both NVIDIA and AMD, these features cannot be used. A better comparison would be an AMD RX 5xx or an Nvidia GTX 10-series card.

As we’ve stated in our official writeup, there are parts of the GPU that are not utilized, such as floating-point logic, the L2 cache, and graphics-specific blocks like texture caches and ROPs. The shading units are where the vector math is executed, which is absolutely a requirement for ProgPoW. A cryptocurrency ASIC would also want to add die area to implement the keccak function. We estimate that a ProgPoW-ASIC would have a 30% smaller die area than an equivalent GPU; however, it would only be 20% lower power, in the best case. The unused logic on the GPU wastes die area, but consumes minimal power.
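
Taking those estimates at face value, the resulting advantage is modest. The sketch below assumes silicon cost scales with die area and ignores memory, board, and packaging costs, all of which dilute the advantage further:

```c
/* Rough translation of "30% smaller die, 20% lower power" into the two
 * numbers miners care about: hash per silicon dollar and hash per watt,
 * relative to the GPU baseline (same memory system, same hashrate). */
#include <stdio.h>

int main(void) {
    const double area_ratio  = 0.70;  /* ASIC die area vs. GPU (30% smaller) */
    const double power_ratio = 0.80;  /* ASIC power vs. GPU (20% lower)      */

    printf("hash per silicon-dollar advantage: %.2fx\n", 1.0 / area_ratio);
    printf("hash per watt advantage:           %.2fx\n", 1.0 / power_ratio);
    return 0;
}
```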

compared to large chips, small chips have higher yields

Ah, yes, this sounds like someone auditing “Chip Making 101”. Furthermore, their yield-calculation formula comes from a 2006 article; that ignores thirteen years of innovation in yield and process control.

For a chip that has a single functional unit, a smaller die will yield better than a larger die. That is not true for modern GPUs. GPUs today are nearly arbitrarily binnable and recoverable, with a ton of tiny replicated units that can be disabled on defect. As long as each binnable functional unit is small, the chip’s yield can be nearly as high as, or even higher than, that of a smaller chip with bigger functional blocks.

Here’s a simple thought experiment to explain this concept:

  1. Say you have a “Giant ChipA” that takes up the entire wafer. Giant ChipA is made of 100,000 binnable sub-components, and only 80% of the sub-components need to be defect-free for Giant ChipA to work normally. Bad sub-components are functionally bypassed during the binning process.
  2. Separately, say you have a “Tiny ChipB”, which is made up of only one functional block (non-binnable), but small enough to fit 100,000 units in the same wafer. In Tiny ChipB’s case, a single defect means that the Tiny ChipB is bad.
  3. If you have exactly 20,000 defects evenly distributed on every wafer, then the yield of “Giant ChipA” would be 100%, with 20% of the sub-blocks binned-off, whereas the yield of “Tiny ChipB” would only be 80%, since there are no binnable sub-blocks.
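
The conclusion is not an artifact of the “evenly distributed defects” assumption. Here is a minimal C sketch that repeats the experiment with randomly placed defects; the defect rate, sub-block count, and 80% threshold are illustrative assumptions, not foundry data:

```c
/* Monte Carlo version of the thought experiment, with randomly placed
 * defects instead of perfectly even ones. Numbers are illustrative. */
#include <stdio.h>
#include <stdlib.h>

#define TRIALS      20000
#define SUB_BLOCKS  1000   /* binnable sub-blocks in the "giant" chip        */
#define NEED_GOOD   800    /* 80% of sub-blocks must be defect-free          */

int main(void) {
    /* Chance that a given sub-block (or the whole tiny chip, which is the
     * same size as one sub-block) catches a defect. */
    const double defect_rate = 0.15;
    srand(12345);

    int giant_good = 0, tiny_good = 0;
    for (int t = 0; t < TRIALS; t++) {
        /* Giant chip: count surviving sub-blocks, bin off the bad ones. */
        int good = 0;
        for (int b = 0; b < SUB_BLOCKS; b++)
            if ((double)rand() / RAND_MAX > defect_rate) good++;
        if (good >= NEED_GOOD) giant_good++;

        /* Tiny chip: a single block, so any defect kills it. */
        if ((double)rand() / RAND_MAX > defect_rate) tiny_good++;
    }

    printf("giant binnable chip yield:  %.1f%%\n", 100.0 * giant_good / TRIALS);
    printf("tiny monolithic chip yield: %.1f%%\n", 100.0 * tiny_good / TRIALS);
    return 0;
}
```

With a 15% per-block defect rate, the tiny monolithic chip yields roughly 85%, while the giant binnable chip yields essentially 100%, because losing up to 20% of its sub-blocks is tolerated.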

This is why you can find GPUs with a wide range of enabled shader counts that all use the same underlying chip; for example, look at AMD’s various Polaris 20 products or Nvidia’s various GP104 products. The die shot below shows the large number of tiny “binnable” (ignorable) sub-blocks in a GPU.

Source: https://wccftech.com/nvidia-gtx-1080-gp104-die-shot/

The voltage of ASICs can easily be reduced to 0.4V, which is ½ that of GPU’s …Such low-voltage ASIC designs are already utilized by ASIC producers in Bitcoin mining machines and there is no reason to believe that they would not be used in ProgPoW ASICs.

Low-voltage designs can work when the chip consists only of compute logic, such as a SHA256d ASIC. Other elements, like the SRAM required for ProgPoW’s data caching, are extremely difficult or impossible to make functional at such low voltages.

The same power-saving can also be achieved in LPDDR4x DRAM, which has a lower power consumption than GDDR6

LPDDR4x has drastically lower bandwidth than GDDR6 (4.2 Gb/s per pin vs 16 Gb/s per pin). Nearly 4x more memory chips, and 4x as many memory interfaces on the compute chip, would be required to reach the same performance, resulting in significantly higher costs.

It is important to note that high-bandwidth compute chips are often interface-limited: the die area is just big enough to have a perimeter that barely allows all the signals to get off the chip and onto the PCB. An LPDDR4x design would require roughly 4x the die-perimeter pad count to hit the same bandwidth. This means the cost is not just in memory chips, but also in compute die area spent just feeding the memories. Making things worse, on a speed-oriented process that extra die area also means more power wasted in leakage.

Consider why GPUs don’t run on LPDDR4x today: LPDDR4x is terrible in terms of bandwidth per dollar. It is more than 4 times more expensive for a given amount of bandwidth (4 times the number of chips), so there is a dramatic cost increase (roughly $150 for 256 GB/s of LPDDR4x at 9 W, versus less than $40 for GDDR6 at 11 W) with almost no power savings for a whole miner (note that this is the cost of bandwidth, not of capacity).
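
Using the per-pin rates quoted above, the interface arithmetic looks like this (data pins only; command/address, power, and ground pins widen the gap further):

```c
/* Data-pin count needed to sustain 256 GB/s with each memory type, using the
 * per-pin rates cited above (4.2 Gb/s LPDDR4x vs 16 Gb/s GDDR6). Every one
 * of these pins must also exist on the compute die's perimeter. */
#include <stdio.h>

static double pins_needed(double target_GBs, double gbps_per_pin) {
    return target_GBs * 8.0 / gbps_per_pin;
}

int main(void) {
    const double target = 256.0;  /* GB/s, roughly one mid-range GPU */
    double lp = pins_needed(target, 4.2);
    double gd = pins_needed(target, 16.0);

    printf("LPDDR4x: ~%.0f data pins\n", lp);   /* ~488 */
    printf("GDDR6:   ~%.0f data pins\n", gd);   /* 128  */
    printf("ratio:   ~%.1fx more interface for LPDDR4x\n", lp / gd);
    return 0;
}
```

The roughly 3.8x ratio is where the “nearly 4x” figure above comes from.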

A GPU producer like Nvidia employs about 8000 people to develop GPUs, which are much more complicated, whereas an ASIC producer like Linzhi only employs a dozen or so people to focus only on ASICs for ETHash mining. The labor costs of these companies are different by a factor of 100. So ASICs have further advantages in terms of cost and time-to-market than GPU chips.

Economies of scale are an important factor here. The GPU industry is also amortized across diverse sales pipelines across the globe, and a combined $420B: AMD with $11.6B, NVIDIA with $154.5B, and Intel with $254.8B. On memory alone, you have the cost of PHYs and wafers amortized across a $500B industry: the $325.9B Samsung Electronics, with 320,671 employees, which is the largest recipient of active U.S. patents in the world; the $60.1B Micron Technology, with 34,100 employees, which was the first to reveal speeds of 20 Gbps on GDDR6; and the $56.8B SK Hynix, with 187,903 employees, which developed the industry’s first 1Ynm 16Gb DDR5 DRAM. In contrast, the cryptocurrency-ASIC industry sits on a $146B market, with $73B of that belonging to Bitcoin.

Considering time to market and TAM, we can examine the development time of the successor to the famous S9 unit. If even a well-researched and computationally simple algorithm like SHA256d takes three years to iterate on, what guarantees the speedy implementation of a GPU-like ProgPoW-ASIC? We can also look at recent Ethereum cryptocurrency-ASIC work: with GDDR6 samples available for a year now, how long has it taken to ship a publicly available GDDR6 version?

Final Thoughts

ProgPoW is based on the simple idea of accurately proving the real cost of work on widely distributed, off-the-shelf hardware. We targeted a type of hardware that is supported by massive economies of scale, with high visibility and fierce existing competition.

We’re a small team, and each of us has a full-time job, so we’re unable to respond in a timely manner to all of these statements, articles, and other chatter on various forums. And while we are excited about the level of curiosity around hardware design and development, we advise caution: hardware, just like software, is a diverse field, and knowledge of cryptocurrency-ASICs does not make you a subject-matter expert in GPU-ASICs.


IfDefElse

We are the team behind ProgPoW, a GPU-tuned extension of Ethash.