Accelerating the Future of Electronics

Harsh Sharma
13 min read · Apr 10, 2023

Current paradigms extending the “art of electronics” beyond Moore’s Law: Sustainable Solutions for the Future of Computing.

It’s April 2023 and has been about 15 days to the death of one of the biggest pioneer known in the semiconductor industry Dr. Gordon Moore. With an article in the electronics magazine in 1965, the then R&D lead (at Fairchild semiconductor) Moore observed that the number of components (like transistors, resistors, diodes, or capacitors among others)in a dense integrated circuit had doubled approximately every year and speculated that it would continue to do so for at least the next ten years. This so called “unproven” law, a mere prediction which now is limited by the laws of physics has helped realize technology roadmaps to be built for developing more powerful and efficient computing devices. The doubling of transistors on a microchip every two years predicted by Moore’s Law has allowed companies to plan their research and development efforts to keep pace with these advancements. Around the same era, around 1975s, Robert Dennard introduced the Dennard scaling, an idea which is looked at in conjunction to Moore’s law and stipulated the performance improvement of 2X as the years passed by. Dennard had build his credibility already with the ideation of DRAM(Dynamic Random Access Memory) while being at IBM. DRAM in current day is still very prevalent in usage while researchers explore non-Von Neumann architectures with in-memory-compute paradigms. The combination of Moore’s Law and Dennard scaling allowed for the rapid miniaturization of transistors and the development of more efficient microchips. This enabled the development of smaller and more portable computing devices, from laptops to smartphones, that could perform complex computations at ever-faster speeds.

However, in recent years the pace of scaling has slowed, and Dennard scaling has come to an end. Historically, the per-transistor power reduction afforded by Dennard scaling allowed manufacturers to raise clock frequencies substantially from one generation to the next without significantly increasing overall circuit power consumption. That privilege of simply cranking up the frequency has long been lost, with timing and power constraints increasingly hard to meet. Dennard scaling ignored leakage current and threshold voltage, which establish a floor on power per transistor. As transistors get smaller, power density rises because these quantities do not scale with size. The result is a "power wall" that has limited practical processor frequencies to around 4 GHz since 2006, and it was around the same time (2005–2007) that Dennard scaling first appeared to break down. As transistors have become smaller and more densely packed, leakage current poses a greater challenge and the heat generated by a microchip has become harder to dissipate, leading to performance and reliability issues.
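To make the power-wall argument concrete, here is a minimal back-of-the-envelope sketch in Python. The constants are made-up, normalized values chosen only to illustrate why raising frequency stops being free once supply voltage can no longer scale; they are not measurements of any real process node.

```python
# Illustrative model of the "power wall".
# Dynamic power of CMOS logic is roughly P_dyn = alpha * C * V^2 * f,
# while leakage adds a floor that no longer shrinks with feature size.
# All constants below are made-up, normalized values for illustration only.

def chip_power(n_transistors, cap_per_transistor, voltage, frequency,
               activity=0.1, leakage_per_transistor=1e-9):
    """Return (dynamic, leakage) power in arbitrary normalized units."""
    dynamic = activity * n_transistors * cap_per_transistor * voltage**2 * frequency
    leakage = n_transistors * leakage_per_transistor * voltage
    return dynamic, leakage

# Under ideal Dennard scaling, each generation shrinks C and V, so frequency can
# rise at roughly constant power. Once V stops scaling (leakage/threshold limits),
# raising f directly raises power -- hence the practical ~4 GHz ceiling.
for label, (c, v, f) in {
    "older node, Dennard era     ": (1.0, 1.0, 1.0),
    "scaled node, V still scaling": (0.7, 0.7, 1.4),
    "scaled node, V stuck        ": (0.7, 1.0, 1.4),
}.items():
    dyn, leak = chip_power(1e9, cap_per_transistor=c, voltage=v, frequency=f)
    print(f"{label}  dynamic={dyn:.2e}  leakage={leak:.2e}")
```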

Manycore architectures: toward parallelism

Amdahl’s law in conjugation to manycore architectures design was theorized to speed-up beyond, allowing significant increase in parallelism for more “Moore”. This provides a way to continue scaling up computing power by adding more processing cores to a single microchip. This enabled improved energy efficiency with efficient use of processing resources. With multiple cores available, each core can be used for a specific task, rather than relying on a single, power-hungry core to handle all tasks. By using such specialized processing units (like the graphic processing units or scientific application floating point units), computing tasks can be completed more efficiently and with greater accuracy. This enabled them to be scalable. As could be imagined with such a flexible approach of computing, one can be adapted to meet the needs of a wide range of applications, from consumer electronics to high-performance computing on a data center scale scenario.

This line of research often pushed against the reticle limit, pairing multiple processing elements with near-memory or in-memory computing. Application-specific architectures (like this, and this) were proposed to push speedups beyond Moore. In-memory computing and near-memory computing are two approaches that aim to reduce the time and energy spent transferring data between memory and processor. However, both can still be bottlenecked by the memory transfer rate, which limits their effectiveness. The table below compares the two approaches in terms of their advantages and limitations:

In-memory compute vs Near-memory compute
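To give a feel for why moving compute toward memory matters, here is a toy energy model. The per-MAC and per-byte energies are placeholder assumptions chosen only to show the trend (off-chip movement >> near-memory movement >> in-situ compute); they are not figures for any real technology.

```python
# Toy energy model contrasting where compute sits relative to memory.
# All energy numbers are invented placeholders for illustration only.

ENERGY_PER_MAC_PJ = 0.5           # assumed cost of one multiply-accumulate
ENERGY_PER_BYTE_PJ = {
    "conventional (off-chip DRAM)": 100.0,
    "near-memory compute":           10.0,
    "in-memory compute":              0.0,  # operands stay inside the array
}

def total_energy_pj(n_macs, bytes_moved, style):
    return n_macs * ENERGY_PER_MAC_PJ + bytes_moved * ENERGY_PER_BYTE_PJ[style]

n_macs, bytes_moved = 1_000_000, 4_000_000  # roughly one layer of a small CNN
for style in ENERGY_PER_BYTE_PJ:
    print(f"{style:30s} {total_energy_pj(n_macs, bytes_moved, style) / 1e6:8.1f} uJ")
```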

Across all the discussion above about faster and more efficient system design paradigms, one of the biggest bottlenecks has been the memory data-transfer rate. Also known as interconnect bandwidth (BW), it is imperative to consider because the sequential part of any pipeline is constrained by interconnect speed. For more background, refer here. As the years progressed, DRAM bandwidth (the memory we discussed alongside Dennard scaling above) and interconnect BW have not kept up with Moore's law at all, and this gap is largely responsible for the unexpected stall as we approach the end of the Moore era. As shown below, bridging the large gap between interconnect/memory BW and processor speedup is what is currently being researched.

A. Gholami, 2020; Hynix; Chipworks; adapted from the SRC Decadal Plan by Yu Cao at ASU
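A common way to reason about this gap is the roofline view: a kernel whose arithmetic intensity (operations per byte moved) is low will be throttled by bandwidth long before it saturates the compute units. Below is a minimal sketch with assumed hardware numbers, not any specific product's spec.

```python
# Roofline-style sanity check: a kernel is memory-bound when its arithmetic
# intensity (FLOPs per byte moved) is below peak_flops / memory_bandwidth.

PEAK_FLOPS = 20e12   # 20 TFLOP/s of raw compute (assumed)
MEM_BW     = 1e12    # 1 TB/s of memory/interconnect bandwidth (assumed)

def attainable_flops(arithmetic_intensity):
    """FLOP/s actually achievable for a kernel with the given FLOPs-per-byte."""
    return min(PEAK_FLOPS, MEM_BW * arithmetic_intensity)

for name, ai in {"elementwise add": 0.25, "small GEMM": 8, "large GEMM": 64}.items():
    frac = attainable_flops(ai) / PEAK_FLOPS
    bound = "memory-bound" if frac < 1 else "compute-bound"
    print(f"{name:16s} reaches {frac:5.1%} of peak ({bound})")
```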

In-memory computing employs specific memory technologies (ReRAM, SRAM, FeFETs, among others) to perform computations directly, enabling application-specific architectures. In the space of analog computing, commonly framed as hardware-accelerator design, these novel processing-in-memory architectures are used to speed up the multiply-and-accumulate operations at the heart of machine learning workloads (convolutional neural networks, large language models, graph neural networks, graph convolutional networks, and seemingly every other paper at any major EDA conference). Industry demand today is built around deep learning (DL) workloads, which require data-intensive computation dominated by multiply-and-accumulate operations. Deep learning workloads such as deep neural networks (DNNs), convolutional neural networks (CNNs), graph neural networks (GNNs), and their variants are employed in a range of applications, including autonomous vehicles, medical diagnosis, video analytics, recommendation systems, and social networks.
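The analog crossbar idea reduces to Ohm's law plus Kirchhoff's current law: weights sit in the array as conductances, inputs arrive as row voltages, and each column current is a dot product. A deliberately idealized NumPy sketch (no ADCs, noise, or device non-idealities):

```python
import numpy as np

# Idealized ReRAM crossbar MAC: weights are stored as conductances G (siemens),
# inputs are applied as voltages V on the rows, and each column's output current
# is I_j = sum_i V_i * G_ij. One "read" therefore performs a whole
# matrix-vector multiply in place.

rng = np.random.default_rng(0)
G = rng.uniform(1e-6, 1e-4, size=(128, 64))   # 128x64 crossbar of conductances
V = rng.uniform(0.0, 0.2, size=128)           # input activations encoded as voltages

I_columns = V @ G                              # analog MAC: one current per column
print(I_columns.shape)                         # (64,)

# A real design would quantize I_columns with ADCs and handle negative weights
# (e.g., differential column pairs) plus device noise and drift.
```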

Comparing to GPUs

Deep learning (DL) workloads, due to their data-parallel nature, are usually accelerated using GPUs. However, general-purpose GPUs deliver suboptimal performance because they are not customized for executing DL workloads. The most notable limitations of GPU architectures are: 1) high power and area overheads; 2) low performance-per-watt; and 3) limited memory bandwidth (as we saw in the plot above). To address these limitations, DL hardware accelerators must include domain-specific customizations to optimize performance and energy-efficiency tradeoffs. Recent research has demonstrated the potential of resistive random-access memory (ReRAM) for efficient DL training and inference. DL computation kernels mainly involve multiply-and-accumulate (MAC) operations, which can be implemented efficiently with ReRAM-based architectures. ReRAMs also allow for processing-in-memory (PIM), which reduces communication between computing cores and main memory, increasing energy efficiency.

But that was not enough. Given the immense pace at which deeper neural networks and their variants are being developed, with ever higher compute requirements, packing the needed capability into a monolithic chip with multiple processing elements is no longer feasible. As an example from a recent talk I attended, the data a fighter jet is expected to process across an array of FPGAs can be on the order of terabytes for just one specific positioning of the arrays (don't expect me to quote Northrop Grumman directly). Hence, it was time to rethink whether we can move toward a more federated approach. As die sizes, and therefore the cost of making defect-free chips, keep climbing, there is a need to escape the monolithic chip in favor of lego blocks: break the problem into chunks that are fed to multiple chips/machines, so that with low latency or performance penalty we can exchange data and finish a much bigger task. Enter one of the most talked-about architecture designs today, currently gaining a lot of press: the chiplet-based system.
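To make the "lego blocks" intuition concrete before we get to chiplets, here is a toy sketch that splits one large matrix multiply across several smaller compute units; the tile shapes and the chiplet count are arbitrary assumptions.

```python
import numpy as np

# Toy illustration of the federated idea: a matrix multiply too large for one
# monolithic accelerator is split into column blocks, each small enough to fit
# one chiplet's local memory.

def chiplet_matmul(A, B, n_chiplets=4):
    """Split B column-wise across chiplets, compute each block independently,
    then stitch the partial results back together."""
    col_blocks = np.array_split(B, n_chiplets, axis=1)
    partial_results = [A @ block for block in col_blocks]   # one block per chiplet
    return np.concatenate(partial_results, axis=1)

A = np.random.rand(256, 512)
B = np.random.rand(512, 1024)
assert np.allclose(chiplet_matmul(A, B), A @ B)

# The catch, discussed next: the shared operands and partial results must travel
# over the inter-chiplet interconnect, which becomes the new bottleneck.
```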

Old wine in a new bottle: Chiplet-based system

In a context very similar to manycore architectures, consider the possibility of composing larger systems out of multiple processing elements. These elements can also be heterogeneous: some chiplets can be CPUs, GPUs, DRAM, or ReRAM, each with its own interconnect to the others. With different interconnect bus widths, serializer-deserializers, and clock-domain crossings, current packaging methodologies make this technology much more realizable. Like lego, or in commercially acceptable terms, it amounts to reusing IPs with standardized protocols, easing the inertia of innovating newer chips with higher capabilities and new use cases.
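A hypothetical package description makes the reuse argument tangible; the field names and example values below are invented for illustration and do not describe any vendor's actual product.

```python
from dataclasses import dataclass

# A hypothetical "package description" illustrating the lego-block view of a
# 2.5D system: heterogeneous chiplets, each with its own die-to-die interface.

@dataclass
class Chiplet:
    name: str
    kind: str            # e.g. "CPU", "GPU", "DRAM", "ReRAM-PIM"
    process_node_nm: int
    d2d_interface: str   # die-to-die link style, e.g. SerDes-based or parallel
    d2d_width_bits: int

package = [
    Chiplet("cpu0", "CPU",       5,  "parallel", 256),
    Chiplet("gpu0", "GPU",       5,  "parallel", 512),
    Chiplet("mem0", "DRAM",      14, "parallel", 1024),
    Chiplet("pim0", "ReRAM-PIM", 22, "serdes",   64),
]

# Mixing process nodes per chiplet is exactly the reuse argument: only the logic
# that benefits from the newest node has to pay for it.
for c in package:
    print(f"{c.name}: {c.kind} @ {c.process_node_nm} nm, {c.d2d_interface} x{c.d2d_width_bits}")
```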

The demand for computing continues to grow at an increasing rate. Thus, major chip vendors such as AMD, Intel, and NVIDIA have embraced 2.5D interposer-based integration to scale the performance of individual computation nodes. One of the key capabilities enabled by 2.5D systems is the integration of multiple discrete chips (often called chiplets) within a package, either over a silicon interposer or via other packaging technologies. For a big-picture view of what chiplet-based systems make possible for compute-intensive, server-scale workloads, I would suggest reading this and my paper, which I will link here. Furthermore, chiplets can also be integrated using multi-chip-module SoCs (MCM-SoCs).

As discussed above in the context of hardware-accelerator design for deep learning (DL), ReRAMs help reduce communication between computing cores and main memory. This effectively increases energy efficiency and also helps bridge the gap between processing speed and interconnect BW. Unfortunately, as the complexity of DL models continues to grow, tiled ReRAM architectures with tens to hundreds or thousands of processing elements (PEs) are required to support efficient training and inference. Monolithic ReRAM implementations, even with 3D stacking, are no longer viable at this scale due to temperature hotspots and the additional dollar cost of "keep-out regions", which render parts of the chip unusable because power constraints cannot be met there. In contrast, 2.5D integration, which places multiple small chips (i.e., chiplets) on an interposer, is an effective way to improve yield and scale to larger and more complex DL models.
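The yield argument can be sanity-checked with a standard first-order Poisson defect model; the defect density and die areas below are assumed values chosen only to show the trend.

```python
import math

# First-order Poisson yield model: yield = exp(-area * defect_density).
# Splitting one large die into several small chiplets raises the fraction of
# usable silicon, because a single defect only kills one small die.

D0 = 0.1  # defects per cm^2 (assumed, illustrative)

def poisson_yield(area_cm2: float, d0: float = D0) -> float:
    return math.exp(-area_cm2 * d0)

monolithic_area = 8.0                    # one big 800 mm^2 die
chiplet_area = monolithic_area / 16      # sixteen 50 mm^2 chiplets

print(f"monolithic yield : {poisson_yield(monolithic_area):.1%}")
print(f"per-chiplet yield: {poisson_yield(chiplet_area):.1%}")
# Known-good chiplets can then be binned and combined on the interposer,
# which is where the 2.5D cost advantage for large accelerators comes from.
```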

Novel 2.5D chiplet platforms provide a new avenue for compact scale-out implementations of emerging DL workloads. Integrating multiple small chiplets on a large interposer offers significant performance and manufacturing-yield improvements compared to 2D ICs, reducing fabrication cost and making it a promising direction for foundries. It also achieves higher thermal efficiency than 3D ICs and facilitates heterogeneous integration. Hence, it has become possible to envision large-scale accelerators on 2.5D platforms to support resource-intensive DL applications. However, as you would naturally imagine, scalable communication between chiplets is particularly challenging because of the relatively large physical distances between them. In contrast to manycore architecture design, here the physical distance genuinely makes scalable communication much harder, and this opens a critical Pandora's box for the research community: how to design a scalable and trustworthy/robust interposer so that communication does not become the bottleneck. The issue is exacerbated by the poor technology scaling of electrical wires and by the shrinking power budget as chip size grows, which is natural in such a federated multi-chip module. These challenges make it difficult to design a viable network-on-interposer (NoI) that can support ultrahigh-bandwidth, energy-efficient, and low-latency inter-chiplet data transfers without increasing fabrication costs. The demands on the NoI infrastructure will only increase as application complexity continues to scale. In addition, the NoI area overhead alone can be up to 85% of the total system area, which means making it smaller and more efficient is a natural way to optimize chiplet-based system design. However, prior studies have not considered NoI-specific optimization for executing DL workloads on chiplet-based systems as the system size scales, which is what we discuss here. This matters because server-scale compute demands are expected to be orders of magnitude higher; realizing high-performance, energy-efficient systems can help us stay a step ahead of Moore's law. Below is a toy diagram of a chiplet-based system.

Chiplet-based PIM architecture utilized within SWAP. The architecture consists of chiplets with their associated NoI routers. Each chiplet consists of 16 tiles with a buffer, accumulator, activation unit, and pooling unit. Tiles are connected in a mesh network.

So far, the design of various general-purpose as well as application-specific 2.5D systems has been explored. The first family of NoI architectures is based on regular multi-hop networks. IntAct, for example, is a 2.5D prototype with six chiplets stacked on an active interposer with a mesh NoI; its authors demonstrated the scalability of the 2.5D system with low-latency distributed interconnects. Simba is another 2.5D system, with 36 chiplets specifically designed for deep neural network (DNN) inference; it also uses a mesh NoI and employs tiling optimizations to limit inter-chiplet traffic. Recently, the Kite family of NoI topologies was proposed for 2.5D systems, considering synthetic traffic/workloads [8]. NN-Baton is another recently proposed 2.5D architecture that undertakes a design exploration over several DNN applications; the NoI topology adopted in NN-Baton is a ring. The figure below shows NoI architectures designed around regular multi-hop networks.

NoI architecture designs based on multi-hop networks: (a) Kite, (b) SIAM, and (c) SIMBA

It should be noted that all the NoI architectures mentioned above principally use multi-hop networks, which do not scale to higher numbers of chiplets. Moreover, for datacenter-scale applications these multi-hop NoI architectures create a huge performance bottleneck. A high-performance and energy-efficient NoI architecture called SWAP was recently proposed for designing chiplet-based systems at server scale, running multiple DL workloads in parallel. Figure 2 is an illustrative example of the SWAP architecture. SWAP is the first 2.5D accelerator with an inter-chiplet, communication-aware NoI to achieve high performance and energy efficiency at reduced fabrication cost with respect to state-of-the-art alternatives. SWAP leverages an efficient multi-objective optimization (MOO) mechanism to generate a NoI architecture with fewer links and smaller routers than all of the NoI counterparts mentioned above. For more information about the multi-objective optimization, you can read the paper or send me an email here. The irregularity in the SWAP NoI improves overall link utilization in the system and scales across a wide variety of DL workloads and chiplet counts. The GIF below visualizes the multi-objective optimization for a chiplet-based NoI design.

Here, based on the expected traffic, we remove underutilized links from the system to create a high-performance, energy-efficient NoI design using a small-world network idea.
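To build intuition for link pruning, the sketch below (using the networkx library) starts from a small-world NoI, tags each link with a stand-in utilization, and greedily drops the least-used links while keeping the network connected. This is only a caricature of the idea, not SWAP's actual multi-objective search; the graph size, link budget, and random utilizations are assumptions.

```python
import random
import networkx as nx

random.seed(0)

# 36 "chiplets" connected in a small-world topology; each link gets a random
# utilization standing in for traffic profiled from DL workloads.
G = nx.connected_watts_strogatz_graph(n=36, k=6, p=0.3, seed=0)
for u, v in G.edges:
    G.edges[u, v]["utilization"] = random.random()

target_links = 48  # assumed link budget
while G.number_of_edges() > target_links:
    removed = False
    for u, v in sorted(G.edges, key=lambda e: G.edges[e]["utilization"]):
        util = G.edges[u, v]["utilization"]
        G.remove_edge(u, v)
        if nx.is_connected(G):
            removed = True
            break                               # keep this removal
        G.add_edge(u, v, utilization=util)      # would disconnect: put it back
    if not removed:
        break                                   # nothing more can be pruned safely

print(G.number_of_edges(), "links remain; avg hops:",
      round(nx.average_shortest_path_length(G), 2))
```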

As links are removed and replaced with newly perturbed ones to maximize NoI energy efficiency, we end up with far fewer router ports and associated links. This also reduces fabrication cost while delivering much higher performance. The figure below shows the smaller router-port counts of such a chiplet-based system with smarter link placement.

Router-port configuration for Kite, SIAM, and SWAP for a 2.5D system with (a) 36 chiplets, (b) 64 chiplets, and (c) 81 chiplets. The peak of the distribution shifts toward the left (fewer ports) for SWAP.

By using smaller routers and eliminating unnecessary links, SWAP not only reduces the inference latency of DL workloads but also achieves a significant reduction in energy consumption. For instance, SWAP achieves up to 47% lower energy than Kite for an 81-chiplet system, and on average 25% lower energy than SIAM for a system with 81 chiplets. The simultaneous energy and latency benefits result in significant EDP improvements across the entire spectrum of considered data-center-scale scenarios, at much lower fabrication cost. The figure below captures the idea, with utopia at the origin: SWAP, with its application-agnostic link placement, achieves better performance at lower fabrication cost.

Trend in fabrication cost and EDP for (a) Kite and SWAP; (b) SIAM and SWAP with 81 chiplets.
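The EDP arithmetic is easy to reproduce. The 47% and 25% energy reductions come from the results above; the latency reduction plugged in below is an assumed placeholder, included only to show how the compound improvement is computed.

```python
# Energy-delay product (EDP) = energy * latency, so improvements compound.

def edp_improvement(energy_reduction: float, latency_reduction: float) -> float:
    """Fractional EDP reduction given fractional energy and latency reductions."""
    remaining = (1.0 - energy_reduction) * (1.0 - latency_reduction)
    return 1.0 - remaining

# Latency reduction of 20% is an assumed placeholder, not a reported number.
print(f"vs Kite (81 chiplets): {edp_improvement(0.47, 0.20):.1%} lower EDP")
print(f"vs SIAM (81 chiplets): {edp_improvement(0.25, 0.20):.1%} lower EDP")
```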

In summary, smaller routers and fewer, appropriately placed links enable SWAP to achieve lower latency and energy consumption than both the Kite and SIAM NoI architectures.

How does the future look?

With reasonable challenges yet far more interesting possibilities, chiplet-based architectures can open doors to much more innovative system designs. Currently, given the monopoly of the bigger players in the space, unless 50M people demand a certain chip it is not feasible to fabricate. A reusable system such as chiplets could make designs not just cheaper but faster to realize, with fixed timings to be met and power constraints kept in check. IBM has highlighted that these chiplet-based systems have an interesting future full of innovations, but green computing remains one of the bigger challenges on the road to sustainable computing. Moving memory into a chiplet architecture that stacks it closer to the processor can help tackle bigger AI tasks, and it could have massive environmental benefits as well. More than 50% of the power consumed by a computer chip goes to moving data horizontally around the chip, according to Huiming Bu, vice president of global semiconductor research and Albany operations at IBM Research. "With chiplets, you can move the memory closer to the processing unit, saving energy," he added. By some estimates, training an AI model can emit as much carbon as running five cars for their lifetimes. Any energy efficiency gleaned on a single chiplet module could have huge implications when deployed at datacenter scale. This aligns well with the idea of SWAP, where the need for greater energy efficiency is met by removing underutilized links, but that might not be enough: we also need to consider better utilization of computing resources and how to run ML models at the edge.

Energy and policy considerations have been explored in the field of Green AI. Green AI refers to research that yields novel results without increasing computational cost, and ideally while reducing it. Higher computational requirements lead to a larger carbon footprint for both manufacturing and serving such systems. Green AI aims to account for the environmental effects of both the capex (non-recurring) and opex (recurring) costs of the semiconductor industry. It has previously been highlighted that computing requirements have grown roughly 300,000x over the last ten years of deep learning, with training cost doubling every few months. This necessitates larger monolithic chips, which are costlier to manufacture (capex) and have higher energy requirements (opex) than chiplet-based 2.5D systems. Additionally, reusing partially defective chiplets instead of discarding them reduces the carbon footprint (capex). Hence, we should explore this space to establish a performance-sustainability tradeoff and pave the way toward a more environmentally conscious design paradigm, which is the need of the hour and is being pursued by many foundries and industry players over the next decade.

If you like this, please follow for similar articles in the future. I plan to write one every two weeks. You can additionally write to me for any questions or concerns at first [dot] last name [at] wsu [dot] edu.
