The exponential growth and requirements of AI use cases
The increasing complexity of AI models and the explosive growth of model size are both rapidly outpacing innovations in the computing resources and memory capacity available on a single device. AI model complexity now doubles every 3.5 months, roughly 10x per year, driving rapidly increasing demand for AI computing capability. Memory requirements for AI models are also rising as the number of parameters, or weights, in a model grows.
Connectivity is growing in parallel: in 2010, approximately 28% of the world's population had an internet connection; ten years later, 62% had internet access. Furthermore, high-end connection speeds grew 57x in the last ten years, and you can currently expect 5G to be around 3 to 6 times faster than 4G on average, with a much greater difference in some cases. Areas such as VR/AR and higher-quality video and gaming are going to accelerate data demand further.
Growing compute / data bottleneck
Research commissioned by Micron Technology found that 89% of respondents say it is important or critical that compute and memory are architecturally close together. The survey, carried out by Forrester Research, also found that memory and storage are the most commonly cited concerns regarding hardware constraints limiting AI and machine learning today. More than 75% of respondents recognize a need to upgrade or re-architect their memory and storage to remove architectural constraints. Finding ways to bring compute and memory closer together is critical to reducing power and cost and increasing performance.
This will only grow with exponential data growth: AI applications generate about 80 exabytes per year, which is expected to increase to 845 exabytes by 2025. Manufacturers will increase their output of storage accelerators in response, with pricing dependent on supply staying in sync with demand. AI solutions must adapt to changing needs, and those depend on whether an application is used for training or inference. For instance, AI training systems must store massive volumes of data as they refine their algorithms, but AI inference systems only store input data that might be useful in future training. Overall, the demand for storage will be higher for AI training than for inference.
Chris Gardner, Forrester’s senior analyst: “The compute needs to surround the
memory and, to a lesser degree, the storage, rather than compute being at the center. We’ve been running with a CPU kind of mindset for decades now, the idea that we’re going to get out of that mindset is pretty revolutionary.”
CPUs reaching their limits
The chipmakers know that CPUs will soon reach their limits under current architectures. They are great at sequencing tasks, starting programs, and so on, as long as they stay out of the way of the accelerators. The hard work is figuring out what the right accelerator is, because the models are changing all the time. The problem is always the weakest link: fix one element and another emerges as an issue (storage capacity, memory, throughput, etc.), an endless battle where new issues keep popping up.
Developers only need one server to build an initial AI model, but AI applications require many servers during training and many more during production with real data. Autonomous-driving models, for instance, require over 140 servers to reach 97 percent accuracy in detecting obstacles. If the speed of the network connecting servers is slow, as is usually the case, it will cause training bottlenecks.
The complexity of processing irregular data types
Fixed coarse-grained architectures like GPUs need to collect a batch of data in order to process it efficiently. This doesn't work well with some data structures, like sparse matrices, where memory latency severely influences performance. Take the example of 2D vision vs 3D. 2D images are dense, i.e. no pixels are missing, and GPUs are built to efficiently exploit this regularity. 3D point clouds, however, are irregular, sparse, and unordered. For GPUs to be used effectively on point clouds, the clouds must somehow be regularized, which comes at the cost of losing information, reducing accuracy, and generating artifacts.
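The dense/sparse contrast can be sketched in plain Python. This is a toy example: the grid size, density, and point-cloud framing are illustrative, not taken from any real workload.

```python
import random

random.seed(0)
N = 100

# Toy stand-in for sparse, irregular data (think a voxelized point cloud):
# only ~1% of a 100x100 grid is occupied.
points = {}
while len(points) < 100:
    points[(random.randrange(N), random.randrange(N))] = random.random()

dense = [[0.0] * N for _ in range(N)]
for (i, j), v in points.items():
    dense[i][j] = v

vec = [1.0] * N

# Dense matrix-vector product: touches all N*N = 10,000 entries, the
# regular, batchable access pattern GPUs are built to exploit.
dense_result = [sum(dense[i][j] * vec[j] for j in range(N)) for i in range(N)]

# Sparse product: visits only the 100 stored non-zeros (~1% of the work),
# but through irregular index lookups that batch-oriented hardware
# handles poorly, since each lookup can stall on memory latency.
sparse_result = [0.0] * N
for (i, j), v in points.items():
    sparse_result[i] += v * vec[j]
```

Both paths produce the same result; the difference is how much work is done and how regular the memory accesses are, which is exactly where dense-oriented accelerators lose their advantage.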
Shortage of hardware engineers and long development times
According to the 2019 US Labor report, the market has 22 software engineers for every hardware engineer. Moreover, we lack system developers with a good understanding of both. At the least, more software developers should understand hardware tradeoffs and consider optimization during the coding process.
While advances in software tools and frameworks have enabled individuals to create new products within relatively short time frames, hardware designs take large teams multiple years to develop. The classical metric for comparing AI accelerators is "TeraOps/W". Recently, however, time to market has become an equally important metric, as the AI world moves very fast.
Throughput bottlenecks and fixed flows for hardware
AI algorithms are massively parallel in nature, and their execution benefits greatly from the computation throughput of the accelerator. Hardware accelerators need to support ever-increasing computational complexity. Coarse-grained accelerators like GPUs usually rely on some kind of compiler to schedule the algorithm on available resources, and they are always designed and optimized with a specific processing flow in mind.
For decades, the choice has been pretty simple: either build an Application Specific Integrated Circuit (ASIC) or a general-purpose chip. An ASIC is a fully custom design, mass-produced and deployed in devices. You get to tweak and optimize it in the way that works best for your application: best performance, best power efficiency, and so on. However, designing, producing, and deploying an ASIC is a long, expensive, and risky process that requires a highly skilled team. One issue that prevails is the way hardware is designed: current design methodology focuses on optimizing the end result in terms of area and performance rather than on allowing easier maintenance and reducing the cost of change.
The next generation of hardware is runtime-reconfigurable architectures. Toolsets are extremely important to reach scale, and we need to make those toolsets more accessible. There would also be a massive benefit in stitching effective toolsets together and, as the software community has done, embracing open-source collaboration with communities that boost development in that space.
The opportunity for virtual chips
Current GPUs are too power-hungry and ASICs too inflexible. The cost of bringing a server-class chip to market is around half a billion dollars, and it could approach a billion over the next several years. If your application can live with the latency required to collect enough samples to form a full batch, then you should be fine; if not, you will have to run inference on single samples, and throughput will likely suffer. To get the best inference performance, the logical step is a custom chip, and for decades the choice has been pretty simple: either build an ASIC or use an FPGA.
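The batching tradeoff can be made concrete with simple arithmetic. The arrival rate and compute time below are illustrative assumptions, not measurements from any real accelerator:

```python
# Illustrative numbers, not measurements.
ARRIVAL_RATE = 1000.0   # inference requests arriving per second
BATCH_COMPUTE_MS = 5.0  # time to push one batch through the accelerator

def worst_case_latency_ms(batch_size: int) -> float:
    """Latency seen by the first request in a batch: it must wait for the
    rest of the batch to arrive, then for the whole batch to compute."""
    fill_ms = (batch_size - 1) * 1000.0 / ARRIVAL_RATE
    return fill_ms + BATCH_COMPUTE_MS

for batch_size in (1, 32, 256):
    print(batch_size, worst_case_latency_ms(batch_size))
```

With these assumed rates, single-sample inference answers in 5 ms while a batch of 256 keeps its first request waiting 260 ms: large batches buy accelerator utilization at the price of latency, which is the tradeoff the paragraph above describes.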
Since designs deployed on virtual chips are easy to scale and deploy, hardware designers can target more niche markets and still make a profit. This could open up a new market, much as the web industry did 25 years ago.
Xilinx CTO Ivo Bolsens believes FPGAs won't just gain incremental momentum, they will put the CPU out of work almost entirely: "In the future you will see more FPGA nodes than CPU nodes. The ratio might be something like one CPU to 16 FPGAs; acceleration will outweigh general compute in the CPU."
Engineer for flexibility: Are FPGAs the future?
The ability to program the FPGA flexibly allows it to process sparse data orders of magnitude faster, and much more energy-efficiently, than a CPU or GPU. FPGAs can run as many arbitrary functions in parallel as they have logic elements (thousands to several million). When processing sparse data, an FPGA can be programmed to ignore zeros and compute only non-zero values. In addition, random access to memory is far more granular and efficient on an FPGA: on-chip memory access latency is 1 clock cycle versus roughly 200 for GPUs, FPGAs consume less power, and they have a flexible pipeline.
As our knowledge within the AI field progresses, the algorithms will become more computationally optimized, which will inevitably require specialized operations. We have already seen how large a role sparsity can play in AI network optimization. FPGAs have an edge over coarse-grained accelerators in that they can support specialized operations with little overhead.
Acquisitions of FPGA companies emphasize the potential
In recent years, the big chipmakers have increasingly focused on FPGA companies, with large acquisitions as a consequence. Examples include AMD's recent $35 billion acquisition of Xilinx, Intel's acquisition of Omnitek (and, before that, Altera), and NVIDIA's acquisition of ARM. Clearly, more programmable architectures are important to the leaders in the space.
Make it easier to program hardware and FPGAs with higher-level languages
Programming FPGAs is still relatively hard given the low-level languages used. There is a clear opportunity to enable a more agile hardware development flow, making it possible to quickly and easily modify an existing design and experiment with the resulting system. There are many ongoing attempts to improve the way hardware is designed, and it is widely recognized that the FPGA is a great technology that lacks the proper programming model. One approach to making hardware design easier for software engineers is HLS (high-level synthesis) tools, which unfortunately deliver suboptimal performance. Anari AI approaches this by enabling developers to program FPGAs in Python, instantly opening the field to a much larger developer community.
Optimize networks with programmable switches
Although most strategies for improving network speed now involve data-center hardware, developers are investigating other options, including programmable switches that can route data in different directions. This capability will accelerate one of the most important training tasks: resynchronizing input weights among multiple servers whenever model parameters are updated. With programmable switches, resynchronization can occur almost instantly, which could increase training speed by 2 to 10x. The greatest performance gains would come with large AI models, which use the most servers.
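One way to picture what a programmable switch adds is in-network aggregation: the switch sums gradient packets in flight, so each server receives an already-reduced result instead of exchanging gradients host-to-host. A toy sketch in plain Python; the class, parameter counts, and gradient values are purely illustrative:

```python
class ProgrammableSwitch:
    """Toy model of in-network aggregation: the switch accumulates
    gradient packets as they pass through, sparing the servers an
    extra round of host-to-host traffic. Purely illustrative."""

    def __init__(self, num_params: int):
        self.accumulator = [0.0] * num_params
        self.contributions = 0

    def receive(self, grads):
        # Sum each server's gradient packet into the running total.
        for i, g in enumerate(grads):
            self.accumulator[i] += g
        self.contributions += 1

    def broadcast_mean(self):
        # Every server gets the same averaged update, keeping the
        # model replicas synchronized.
        return [g / self.contributions for g in self.accumulator]

switch = ProgrammableSwitch(num_params=4)
for grads in ([0.1, 0.2, 0.3, 0.4],
              [0.3, 0.2, 0.1, 0.0],
              [0.2, 0.2, 0.2, 0.2]):
    switch.receive(grads)

synced = switch.broadcast_mean()  # each server applies the same update
```

The real mechanism runs at line rate in switch ASIC pipelines rather than in software, but the averaging semantics are the same as the all-reduce step used in data-parallel training.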
New data-flow architectures and data-transfer technologies, such as free-space optical data transfer, use photons instead of electrons for transfer. Achieving that would solve the data bottleneck, shifting the problem back to compute. On the compute side, in turn, heterogeneous integration and a 'chiplet'-based approach might offer significant jumps. For instance, you can place 'chiplets' 0.1mm apart and stitch them at the wafer to scale the package that way; in semiconductors, chips have scaled enormously while packages have not. Furthermore, there are of course completely new architectures such as quantum computing that will significantly accelerate capacity for various use cases while also presenting new bottlenecks.
Optimization for exponentially more devices
Estimates put the number of IoT devices at 250 billion within the next 5 years. Over 3G, response times are typically around 60 milliseconds (ms); on 4G they are roughly half that, at about 35ms. In theory, 5G response times will ultimately drop to just 1ms, which will be completely imperceptible.
Fog computing can create low-latency network connections between devices and analytics endpoints. This architecture in turn reduces the amount of bandwidth needed compared to sending that data all the way back to a data center for processing. It can also be used in scenarios where there is no bandwidth to send the data at all, so it must be processed close to where it is created. Higher up the stack, fog computing architectures would also touch core networks and routers, and eventually global cloud services and servers.
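A back-of-the-envelope sketch of the bandwidth saving from summarizing at a fog node before anything crosses the backhaul link. The sensor values, field names, and byte sizes are illustrative assumptions:

```python
# Toy sensor stream: 1,000 raw temperature readings per reporting window.
readings = [20.0 + (i % 10) * 0.1 for i in range(1000)]

# Cloud-only approach: ship every reading upstream.
raw_bytes = len(readings) * 8  # assume ~8 bytes per float

# Fog approach: a node near the sensors reduces the window to a summary,
# and only the summary crosses the backhaul link.
summary = {
    "count": len(readings),
    "min": min(readings),
    "max": max(readings),
    "mean": sum(readings) / len(readings),
}
summary_bytes = len(summary) * 8  # four 8-byte fields

print(f"bandwidth saved: {1 - summary_bytes / raw_bytes:.1%}")
```

Under these assumptions the summary is 32 bytes against 8,000 for the raw window, a saving of over 99%; the same shape of reduction is what lets fog deployments serve links with little or no backhaul bandwidth.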
In a world of exponentially more connected devices, some of them will not have a need for standard CPU architectures.
Flexibility of the cloud
Enabling more people to program and access custom, programmable hardware is best done in the cloud. Especially once the bottleneck of low-level programming languages is mitigated, programmers gain access at scale to build custom designs, and users follow. Perhaps we will see ASIC clouds as well.
Removing (middleware) layers of complexity
AWS, Google, and other cloud service providers are producing layers upon layers of supposedly "easier" solutions for customers. True innovation, however, will come from the bottom: from nimble, agile startups that see the fundamental pain point the tech giants are currently creating. Take Amazon's SageMaker: similar to Google Mail, you have all the self-standing elements, but nothing really talks to each other. This also describes quite well what the current ML tooling workflow space looks like. We still need to piece self-standing elements together, and deploying this at the data science level is just where the headache starts. A company that solves this will go very far.
The app store for virtual chips?
We need to crack the code of reconfigurability in the software space. Anari AI envisions a cloud-based app store for virtual chips, where each company can tailor chips to its own needs.