Questions Every Investor Should Ask an AI Accelerator Start-up

Aron Kirschen
11 min read · Nov 2, 2022


[disclaimer: the author is co-founder of SEMRON, an AI accelerator start-up itself; however, the specific advantages of SEMRON’s technology are not discussed here]

Deep-tech seems to be the only remaining great artistry of our time (Assisi, Basilica of Saint Francis of Assisi)

The more questions I get from interested (or at least allegedly interested) VCs, the more I learn about the strategies some deep-tech start-ups use to cover up their individual technological weaknesses.

At first glance, it is no surprise that deep-tech founders are in a very comfortable position to hide the shortcomings of their technology: they simply know it best.

That’s why investors rely on the judgment of industry experts. Ultimately, it becomes largely a gamble which expert is chosen (paradigm shifts tend to be rejected most strongly by people inside the industry concerned).

While I can’t talk about deep-tech domains other than AI accelerators (computer chips that run trained Deep Learning models), I know which questions are relevant here.

Content

  1. General Questions
  2. SRAM-based in-memory computing
  3. Non-SRAM-based in-memory computing
  4. In-memory computing with emerging memory technologies / new device architectures
  5. Near-memory computing approaches
  6. Devices for Spiking Neural Nets

I will limit this to in-/near-memory computing approaches and spiking-neural-net accelerators to keep it at least somewhat short. However, I may extend this article in the future to cover other approaches as well. If you think I have missed a critical aspect (or that I am hiding something we may be struggling with ;-) PUT IT IN THE COMMENTS, MARK IT RED AND BOLD AND USE CAPITAL LETTERS!

General questions / challenges

What is your business model? The most fundamental question. After all, it revolves around risk versus upside potential. For instance, selling actual chips, as opposed to licensing IP, entails extraordinarily high risk: navigating the ecosystem (surviving among some of the most powerful companies on earth), technological risks, and the capital needed until market readiness. With IP licensing, the risk is somewhat lower, but this comes at a cost: you are likely not a fund returner. Why? If you license a process or a new material to a foundry, you are trapped, and your exit channel narrows to essentially that one partner. Moreover, you lose control over your technology: you will have to ensure that your part fits into the system, and you will only capture a small portion of the value chain (low potential for further innovation). This is particularly true if the economic costs of switching are high, for example when licensing processes for new materials to a foundry.

  1. Are customers able to run any common DL model? This question is essentially about the compiler, and therefore the flexibility, of the AI chip. For example, Gyrfalcon’s Lightspeeur 5801S chip is only capable of running a few selected DL models (and of course, it is WAY easier to be efficient if the hardware is optimised for only some data flow options).

This aspect has gained enormous relevance with the rise of GenAI/LLMs: they are based on transformers, which comprise so-called attention layers. In contrast to classical convolutional neural nets, where the parameters of the model can be kept fixed (which benefits the in-memory computing approach), attention involves operations without fixed operands in every inference. Therefore, you need non-volatile operations (fixed weights) AND volatile ones (attention layers). Many in-memory computing approaches suffer from this, as they need to run the operations for these layers outside the crossbar, which breaks their energy efficiency completely. So one might ask: what do the metrics look like for a ViT model, and where are the attention layers calculated?
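To make this concrete, here is a minimal NumPy sketch (toy dimensions, no claim about any particular chip) contrasting a fully-connected layer, whose weight matrix is fixed across inferences and can therefore live permanently in a crossbar, with the attention score computation, where both matmul operands are derived from the current input:

```python
import numpy as np

d_model, seq_len = 64, 16                   # hypothetical toy dimensions
W_fc = np.random.randn(d_model, d_model)    # fixed weights: can be programmed once into a crossbar
W_q  = np.random.randn(d_model, d_model)    # projection weights: also fixed
W_k  = np.random.randn(d_model, d_model)

x = np.random.randn(seq_len, d_model)       # a new input arrives at every inference

# Fully-connected / convolution-like layer: one operand (W_fc) never changes.
y_fc = x @ W_fc

# Attention scores: Q and K are both functions of x, so the matmul Q @ K.T
# has NO fixed operand -- a weight-stationary crossbar cannot hold either side,
# and the operation falls back to volatile / digital compute.
Q, K = x @ W_q, x @ W_k
scores = Q @ K.T / np.sqrt(d_model)
```

The projections themselves (x @ W_q, x @ W_k) still map nicely onto fixed weights; it is the Q @ Kᵀ (and attention-times-V) products that require volatile operands, and that is exactly where many in-memory architectures fall back to conventional compute.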

2. How is the AI chip integrated into the system? Which interfaces are used? Is there a bandwidth bottleneck?
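A quick back-of-the-envelope check for such a bottleneck (all numbers are illustrative assumptions, not measurements of any particular chip): if the weights do not fit on-chip and have to be re-fetched from external memory for every inference, the required interface bandwidth follows directly from model size and frame rate.

```python
# Rough check: can the interface feed the chip? All figures below are assumptions.
params           = 30e6     # model size (parameters)
bytes_per_param  = 1        # INT8 weights
inferences_per_s = 30       # e.g. a 30 fps vision workload

# Weights streamed from external memory on every inference:
required_bw = params * bytes_per_param * inferences_per_s   # bytes/s
print(f"required bandwidth: {required_bw / 1e9:.2f} GB/s")  # ~0.9 GB/s

# Compare this against the claimed interface bandwidth and against the
# energy cost of moving those bytes across the package boundary.
```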

3. Which metrics are used? It’s easy to generate nice numbers for marketing. For example, TOPS/W is a useless number if the input and ‘weight’ (trained parameter) precision are not specified. What assumptions were made in terms of utilisation, the DL models used and sparsity?

In addition, credible performance numbers are reported for a specific DL model whose configuration is laid out in detail. If such numbers do not exist, one can infer (in the charitable case) that the team simply has not yet been able to produce them.
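As a sketch of how a headline figure deflates once the assumptions are made explicit (all numbers are hypothetical):

```python
# Peak vs. effective efficiency -- all numbers are illustrative assumptions.
peak_tops_per_w = 100      # headline figure, often quoted at very low precision
utilisation     = 0.30     # fraction of MAC units actually busy on a real model
precision_scale = 0.25     # moving from the marketing precision to e.g. INT8
                           # can cost roughly 4x throughput on the same hardware

effective = peak_tops_per_w * utilisation * precision_scale
print(f"effective: {effective:.1f} TOPS/W")   # 7.5 TOPS/W -- a very different story
```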

4. What is the parameter density? This is a very critical property! The number of parameters (again, at a given precision!) per footprint determines how large the DL model can be, in terms of trained parameters, without relying on external memory (which basically destroys any efficiency gains and introduces bottlenecks). Instead of the size of a single DL model, this question can also be framed as how many models can be run in parallel, or how much acceleration is possible through redundancy.

This defines the entire business concept: what is the maximum package size of your targeted application/product? There is room for some dozens of square millimeters in a smartphone. If the in-memory computing approach in question has a parameter density of, let’s say, 100 million parameters per square centimeter, that area budget buys you only a few tens of millions of on-chip parameters, so you probably can’t properly run models with 30 million parameters. This doesn’t sound like the next-gen AI chip for mobile devices, does it?
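The back-of-the-envelope arithmetic behind that statement, with the area budget and density as stated assumptions:

```python
# Does the model fit on-chip at a given parameter density? Illustrative numbers.
area_budget_mm2 = 40          # "some dozens of square millimeters" in a phone
density_per_cm2 = 100e6       # claimed parameters per cm^2 (at the stated precision)
density_per_mm2 = density_per_cm2 / 100

on_chip_params = area_budget_mm2 * density_per_mm2   # 40 mm^2 -> 40 million parameters
model_params   = 30e6                                 # e.g. a 30M-parameter model

print(on_chip_params >= model_params)   # True, but only barely:
# no headroom for redundancy/parallelism, multiple models, or larger networks,
# so "fits" does not yet mean "runs properly".
```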

5. What is the performance density? Again, choosing as reference a deep learning model that happens to fit the chip architecture particularly well is just a bad joke. Ask for the performance metrics across many different DL model topologies!

6. What is the precision of the input values and the trained parameters (‘weights’)? With less than 6-bit integer precision, retraining after quantisation is almost always required. This means that every customer would have to hand over all their data, and the AI accelerator start-up would have to train the model for every customer individually. This question mainly concerns SRAM- or MRAM-based approaches. Moreover, is the device able to implement floating-point precision?
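For readers less familiar with what weight precision implies, here is a generic sketch of symmetric post-training quantisation (not any vendor’s flow); below roughly 6-bit the rounding error typically grows large enough that accuracy only recovers through quantisation-aware retraining, though the exact threshold is model-dependent.

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Map float weights onto a symmetric integer grid of the given bit width."""
    qmax  = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    w_q   = np.round(w / scale).astype(np.int32)
    return w_q * scale                      # de-quantised view, for error analysis

w = np.random.randn(1000) * 0.1             # stand-in for a trained weight tensor
for bits in (8, 6, 4, 2):
    err = np.mean((w - quantize_symmetric(w, bits)) ** 2)
    print(f"{bits}-bit: MSE {err:.2e}")     # error grows rapidly at low bit widths
```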

7. What is the energy efficiency at device / chip / system level? Of course, the direct benefit of in-memory computing is not the energy efficiency of the device itself, but that at system level, including the interfaces. However, the device efficiency is crucial for the next (and last) question:

8. Is the technology able to be scaled up in the third dimension? With the scaling down of technology nodes coming to an end, the future will belong to technologies capable of stacking dozens (or hundreds) of compute layers on top of each other, as in 3D NAND flash, without overheating. [1]

Important addition: Pre-seed investors might not be too concerned if founders haven’t yet spent much time thinking about costs and price points. However, it is questionable whether customers will pay $500 for an AI chip that goes into an AI-enhanced lawnmower. Chiplets will not significantly improve cost-effectiveness here, nor will HBM. 3D monolithic growth will.

3D stacking in the era of the Roman Empire (Forum Romanum)

SRAM-based in-memory computing

SRAM is an established (volatile) memory technology. Its advantages are writing speed and endurance, hence it could be a candidate for training as well. The major drawback comes in two dimensions: first, SRAM cells are large, especially when aggregated into multi-bit SRAM circuits. The second problem is the same as with most digital technologies: variable costs.

In this case, some useful questions could be:

  1. What is the parameter density? How can the disadvantage of over 40x fewer parameters per footprint be reduced to a reasonable level? [2] Derivatives such as switched-capacitor designs, meant to solve the SRAM-specific disadvantages, are even larger (while still not achieving outstanding energy efficiency)!
  2. How do they intend to compete on cost? Assuming a 10x higher wafer price for 300mm wafers in advanced technology nodes and ~2.5x more dies per wafer, (pure) analog-based technologies could be 4x cheaper in production (see the sketch below this list).
  3. (If SRAM is used for binary weights) What is the overall energy efficiency? How will the market entry barrier be overcome if re-training is required after quantisation? Either SRAM suffers from low parameter density or from low precision. Even for inference, more than 6-bit INT weight precision is required to avoid retraining after quantisation. [3]
  4. How big is the additional power consumption due to leakage currents? This is especially critical for smaller technology nodes.
  5. Is it possible to stack multiple compute layers with SRAM? With the industry trending more and more towards 3D integration, it is questionable what the future of SRAM will look like. The IRDS does not expect monolithic integration of multiple layers of SRAM before 2031. What is the strategy to compete with emerging 3D-enabled technologies, other than being forced to hold onto scaling down technology nodes or expensive and limited die-stacking?
Tiles placed next to each other as in 2.5D integration (Florence, Baptistery)
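The cost argument from point 2 is easy to reproduce; a sketch using the ratios quoted above (absolute wafer prices and die counts are placeholders):

```python
# Cost-per-die comparison, using the ratios from the list above.
wafer_price_mature   = 3_000                    # placeholder price of a mature-node 300mm wafer
wafer_price_advanced = 10 * wafer_price_mature  # "10x higher wafer price" in advanced nodes
dies_mature          = 400                      # placeholder dies per wafer, mature/analog flow
dies_advanced        = 2.5 * dies_mature        # "~2.5x more dies per wafer" in advanced nodes

cost_mature   = wafer_price_mature / dies_mature
cost_advanced = wafer_price_advanced / dies_advanced
print(cost_advanced / cost_mature)              # 4.0 -> the claimed ~4x production-cost gap
```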

Non-SRAM-based in-memory computing

Most in-memory computing approaches are analog by nature, but not all of them are based on a new device or a new material. For example, a flash transistor can be used for in-memory computing as well. However, analog approaches have the following challenges in common:

  1. How to deal with device-to-device variations? In contrast to digital approaches, analog devices (where not only the input but also the parameter / weight is analog) often suffer from small changes in their behaviour that lead to computation variance. This can compromise the accuracy of the neural net (a small sketch follows below). [4]
  2. How many writing cycles can be achieved? This is called endurance. It is important not only for training but also for modern SOTA DL models like transformers and graph neural nets. For example, flash cells achieve around 10⁵ cycles, which is too few for the DL model classes mentioned above. However, sufficient endurance in analog technologies is no longer exclusive to SRAM- or MRAM-based in-memory computing. Nevertheless, this should be carefully checked by any deep-tech investor.
Variations in minerals inspired this wonderful piece of art (Uffizi, Panel with view of the port of Livorno); in electronics variations are rather undesirable
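A minimal way to picture the variation question (a generic NumPy sketch, not tied to any particular device): perturb the stored weights of a matrix-vector multiplication with device-to-device variation and watch the output drift. Accumulated over many layers, this drift is what eats into the accuracy of the neural net.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 128))          # ideal weights programmed into the array
x = rng.standard_normal(128)                 # input activations
y_ideal = W @ x

# Model device-to-device variation as multiplicative noise on the stored values.
for sigma in (0.01, 0.05, 0.10):             # 1%, 5%, 10% relative variation (illustrative)
    W_real = W * (1 + sigma * rng.standard_normal(W.shape))
    y_real = W_real @ x
    rel_err = np.linalg.norm(y_real - y_ideal) / np.linalg.norm(y_ideal)
    print(f"sigma={sigma:.0%}: output error {rel_err:.1%}")
```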

In-memory computing with emerging memory technology / new device architectures

Non-volatile memories (NVMs) include emerging memories like MRAM, FRAM, RRAM, etc., each with its own specific challenges.

NVMs usually have a small footprint but have to be carefully fitted into the standard CMOS process (Florence, Baptistery)

They are almost always purely analog approaches. However, since there are some peculiarities here, I have separated them from the other non-SRAM-based technologies. The questions from the previous section also apply to this one. Especially with RRAM, device variations can lead to major problems.

  1. Is it based on new materials? Introducing new materials in a foundry is a difficult and expensive task. Only a few foundries will be willing to do this for small entities. Also, new materials are inherently prone to yield and qualification issues. CMOS compatibility is a key criterion here.
  2. How dependent is the company on a specific foundry? The negotiation power of the foundry will be extraordinarily high if a new material has been developed with its support; it will probably ask for exclusive rights or licenses. Is it possible to move to another foundry? (This also affects the exit scenario, as mentioned in the General Questions part.)
  3. What is the time to market? In contrast to established CMOS technology, multi-project wafers (MPWs) are not feasible if a new process is being developed. This requires a lot of engineering effort and causes long development cycles. In the meantime, more ‘classical’ approaches may target the ‘low-hanging fruits’, creating some momentum.

If an established foundry process is used and MPWs are an option, a different concern arises: how easily can the IP be circumvented?

Grains are a challenge in many emerging NVMs: in HfO₂ for example, they impair the ability to achieve multilevel switching with small devices (Fountain of Neptune, Bologna)

Near-memory computing approaches

By reducing the distance between memory and processor, the power consumption of data transfers is reduced. However, there is no way to achieve the energy efficiency and parameter density of in-memory computing approaches. It is therefore questionable whether this stuck-in-the-middle position allows for a potential beyond niche markets. Well-known examples in that space are Syntiant and Graphcore.

Devices for Spiking Neural Nets

So far, we have only discussed technologies focused on Deep Learning models such as convolutional nets, transformers or graph neural nets. But there are quite a few technologies that are specifically suited to a different type of machine learning: Spiking Neural Nets (SNNs).

While SNNs definitely have their place in the AI landscape, they have some fundamental barriers to overcome.

  1. Can potential customers use the same software stack they use currently? Customers want to use standard tools like TensorFlow or PyTorch, but these tools don’t support SNNs natively.
  2. Which segments other than event-based applications can be targeted as well? This question is mainly about the potential of the company. Will it remain limited to niche markets?
  3. How sustainable is the USP of the event-based approach? Why shouldn’t event-based in-memory computing that implements SNNs, with all its other benefits, become a reality at some point?
  4. What is the largest SNN they can run on a given footprint? Of course, parameter density matters here as well. Mixed-signal technologies like current SNN hardware usually require a very big footprint per parameter to reflect the nature of a spike (see the sketch after this list). What is the strategy if a new device that implements the spike behaviour in one single cell pops up?
  5. What is the accuracy compared to SOTA DL models? Artificial neural networks trained with spike-timing-dependent plasticity (STDP) have a much lower accuracy than DL models trained with backpropagation. This is especially true for large neural networks.
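To illustrate what ‘spike behaviour’ means, here is a generic leaky integrate-and-fire (LIF) neuron in a few lines (all constants are illustrative). Mixed-signal SNN chips essentially have to reproduce this membrane-potential dynamic for every neuron, which is what costs silicon area; and because standard frameworks express layers as dense tensor operations, this event-driven formulation is also why TensorFlow and PyTorch do not support SNNs natively.

```python
import numpy as np

# Generic leaky integrate-and-fire neuron (discrete time), illustrative constants.
tau, v_thresh, v_reset, dt = 20.0, 1.0, 0.0, 1.0
v = 0.0
input_current = np.abs(np.random.default_rng(1).standard_normal(100)) * 0.15

spikes = []
for t, i_in in enumerate(input_current):
    v += dt / tau * (-v) + i_in          # leak towards rest, integrate the input
    if v >= v_thresh:                    # threshold crossing emits a binary spike
        spikes.append(t)
        v = v_reset                      # membrane potential resets after a spike
print(f"spike times: {spikes}")
```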

***

Of course, this is not a definitive list of the aspects that go into evaluating a technology’s potential for edge AI inference, but I hope it is a useful starting point.

I will blog about other aspects of edge AI and in-memory computing every now and then.

Further reading

[1] With YOLOv7 alone accounting for 37 million parameters, a parameter density of at least several hundred million parameters (at INT8 or higher precision) per cm² is required to enable edge devices with strict footprint constraints to run large DL models, or multiple DL models in parallel. Reducing latency requires parallelisation, which depends on the parameter density as well.

[2] SRAM has a footprint of several hundred F², while NOR flash and other NVMs need only 8–12 F². Together with the 8 transistors usually required for 8-bit parameters (NVM: typically only 2), that leads to a disadvantage of a factor of 40!

[3] If the weights of the hidden layers are binary and the energy efficiency for the respective operations is around 600 TOPS/W, but the first layer of the neural net has to have high-precision weights, the overall energy efficiency is only 14 TOPS/W!

[4] One approach is to add noise while training the neural net to make it more robust against variations. However, this would lead to the same re-training issues as with low-precision SRAM. Training a DL model on-chip is obviously not scalable, except for a few rare types of DL models.

--

Aron Kirschen

founder of SEMRON GmbH, industrial engineer, opera and wine enthusiast, Go >> Chess