Questions every investor should ask an AI accelerator start-up

Aron Kirschen
9 min read · Nov 2, 2022


[disclaimer: the author is co-founder of SEMRON, itself an AI accelerator start-up; however, the specific advantages of SEMRON’s technology are not discussed here]

Deep-tech seems to be the only remaining great artistry of our time (Assisi, Basilica of Saint Francis of Assisi)

The more questions I get from interested (or at least allegedly interested) VCs, the more I learn about the strategies some deep-tech start-ups use to cover up their individual technological weaknesses.

At first glance, it is no surprise that deep-tech founders are in a very comfortable position to hide the shortcomings of their technology: they simply know it best.

That’s why investors rely on the judgment of industry experts. Ultimately, it becomes largely a gamble which expert is chosen (paradigm changes tend to be rejected most strongly by the people in the very industry concerned).

While I can’t talk about deep-tech domains other than AI accelerators (computer chips that run trained Deep Learning models), I know which questions are relevant here.

I will limit this to in-/near-memory computing approaches and spiking neural net accelerators to keep it at least somewhat short. However, I may extend this article in the future to cover other approaches as well. If you think I have missed a critical aspect (or that I am hiding something we may be struggling with ;-)), PUT IT IN THE COMMENTS, MARK IT RED AND BOLD AND USE CAPITAL LETTERS!

General questions / challenges

  1. Are customers able to run any common DL model? This question is essentially about the compiler and therefore the flexibility of the AI chip. For example, Gyrfalcon’s Lightspeeur 5801S chip can only run a few selected DL models (and of course, it is WAY easier to be efficient if the hardware is optimised for only a few data flow options).
  2. How is the AI chip integrated in the system? Which interfaces are used? Is there a bandwidth bottleneck?
  3. Which metrics are used? It’s easy to generate nice numbers for marketing. For example, TOPS/W is a useless number without regard to the input and ‘weight’ (trained parameter) precision. What assumptions were made in terms of utilisation, the DL models used and sparsity? (A small example of normalising such a figure follows after this list.)
  4. What is the parameter density? This is a very critical property! The number of parameters (again, of a given precision!) per footprint determines how large a DL model can be, in terms of trained parameters, without relying on external memory (which basically destroys any efficiency gain and comes with bandwidth bottlenecks). Instead of the size of a single DL model, this question can also be about how many models can be run in parallel, or how much acceleration is possible through redundancy.
  5. What is the performance density? Again, choosing a Deep Learning model as a reference just because it fits the chip architecture very well is a bad joke. Ask for the performance metrics across many different DL model topologies!
  6. What is the precision of the input values and the trained parameters (‘weights’)? With less than 6-bit integer precision, retraining after quantisation is almost always required. This means that every customer would have to hand over all their data, and the AI accelerator start-up would have to train the model for every customer individually. This question mainly concerns SRAM- or MRAM-based approaches. Moreover, is the device able to implement floating-point precision?
  7. What is the energy efficiency at device / chip / system level? Of course, the decisive benefit of in-memory computing is not the energy efficiency of the device itself but that at system level, including the interfaces. However, the device efficiency is crucial for the last question:
  8. Is the technology able to scale into the third dimension? With the scaling down of technology nodes coming to an end, the future will belong to technologies capable of stacking dozens (or hundreds) of compute layers on top of each other, as in 3D NAND flash, without overheating. [1]
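
To make question 3 concrete, here is a minimal back-of-the-envelope sketch of how a headline TOPS/W figure can be normalised. All numbers (claimed efficiency, bit widths, utilisation) are hypothetical, and the linear scaling of energy with weight precision is a deliberate simplification.

```python
# Hypothetical numbers -- only to show how a marketing TOPS/W figure shrinks
# once precision and utilisation assumptions are made explicit.
claimed_tops_per_w = 600      # vendor figure, e.g. measured with ...
claimed_weight_bits = 1       # ... binary weights
claimed_utilisation = 0.90    # ... and a model that fits the array perfectly

target_weight_bits = 8        # what a customer's model may actually need
realistic_utilisation = 0.40  # typical for an arbitrary DL topology

# Simplifying assumption: energy per operation grows roughly linearly with
# weight precision (real scaling is often worse than linear).
effective_tops_per_w = (claimed_tops_per_w
                        * claimed_weight_bits / target_weight_bits
                        * realistic_utilisation / claimed_utilisation)

print(round(effective_tops_per_w))  # ~33 TOPS/W instead of the advertised 600
```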
3D stacking in the era of the Roman Empire (Forum Romanum)

SRAM based in-memory computing

SRAM is an established (volatile) memory technology. Its advantages are write speed and endurance, hence it could be a candidate for training as well. The major drawback comes in two dimensions: first, SRAM cells are large, especially when aggregated into multi-bit SRAM circuits. The second problem is the same as with most digital technologies: variable costs.

In this case, some useful questions could be:

  1. What is the parameter density? How can the factor of more than 40x fewer parameters per footprint be reduced to a reasonable level? [2] Derivatives such as switched-capacitor designs — meant to solve the SRAM-based disadvantages — are even larger (while still not achieving outstanding energy efficiency)!
  2. How do they intend to compete on cost? Assuming a 10x higher wafer price for 300 mm wafers in advanced technology nodes and ~2.5x more dies per wafer, (pure) analog-based technologies could be 4x cheaper in production (a rough calculation follows after this list).
  3. (If SRAM is used for binary weights) What is the overall energy efficiency? How can the market entry barrier be overcome if re-training is required after quantisation? Either the SRAM approach suffers from low parameter density or from low precision. Even for inference, more than 6-bit INT weight precision is required to avoid retraining after quantisation. [3]
  4. How big is the additional power consumption due to leakage currents? This is especially critical for smaller technology nodes.
  5. Is it possible to stack multiple compute layers with SRAM? With a trend that calls more and more for 3D integration, it is questionable what the future of SRAM will look like. The IRDS does not expect monolithic integration of multiple layers of SRAM before 2031. What is the strategy to compete with emerging 3D-capable technologies, other than being forced to hold onto scaling down technology nodes or relying on expensive and limited die stacking?
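
A rough sanity check of the cost argument in question 2, using only the two ratios stated there (about 10x higher wafer price at advanced nodes, ~2.5x more dies per wafer); the absolute wafer price and die count are placeholders.

```python
# Placeholder absolute values; only the ratios from question 2 matter.
analog_wafer_price = 1_000                      # arbitrary cost unit
advanced_wafer_price = 10 * analog_wafer_price  # ~10x for advanced nodes

analog_dies_per_wafer = 400                              # arbitrary baseline
advanced_dies_per_wafer = 2.5 * analog_dies_per_wafer    # ~2.5x more dies

analog_cost_per_die = analog_wafer_price / analog_dies_per_wafer
advanced_cost_per_die = advanced_wafer_price / advanced_dies_per_wafer

print(advanced_cost_per_die / analog_cost_per_die)  # 4.0 -> analog die ~4x cheaper
```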
Tiles placed next to each other as in 2.5D integration (Florence, Baptistery)

Non-SRAM-based in-memory computing

Most in-memory computing approaches are analog in nature, but not all of them are based on a new device or a new material. For example, a flash transistor can be used for in-memory computing as well. However, these approaches have the following challenges in common:

  1. How do they deal with device-to-device variations? In contrast to digital approaches, analog devices (where not only the input but also the parameter / weight is analog) often suffer from small changes in their behaviour that lead to computation variance. This can compromise the accuracy of the neural net. [4]
  2. How many write cycles can be achieved? This is called endurance. It is important not only for training but also for modern SOTA DL models such as Transformers and Graph Neural Nets. For example, flash cells achieve around 10⁵ cycles, which is too few for the DL model classes mentioned above. However, sufficient endurance is no longer exclusive to SRAM- or MRAM-based in-memory computing; some analog technologies now achieve it as well. But this should be carefully checked by any deep-tech investor.
Variations in minerals inspired this wonderful piece of art (Uffizi, panel with a view of the port of Livorno); in electronics, variations are rather undesirable

In-memory computing with emerging memory technology / new device architectures

Non-volatile memories may include emerging memories like MRAM, FRAM, RRAM etc., each with its own specific challenges.

NVMs usually have a small footprint but have to be carefully fitted into the usual CMOS technology (Florence, Baptistery)

They are almost always purely analog approaches. However, since there are some peculiarities here, I have separated them from the other non-SRAM-based technologies. The questions in the previous section also apply to this section. Especially with RRAM, device variations can lead to major problems.

  1. Is it based on new materials? Introducing new materials into a foundry is a difficult and expensive task. Only a few foundries will be willing to do this for small entities. Also, new materials are inherently prone to yield and qualification issues. CMOS compatibility is a key criterion here.
  2. How dependent is the company on a specific foundry? The negotiating power of the foundry will be extraordinarily high if a new material has been developed with its support. It will probably ask for exclusive rights / licences. Is it possible to move to another foundry?
  3. What is the time to market? In contrast to established CMOS technology, multi-project wafers are not feasible if a new process is developed. This requires a lot of engineering effort and leads to long development cycles. In the meantime, more ‘classical’ approaches may target the ‘low-hanging fruit’, creating some momentum.
Grains are a challenge in many emerging NVMs: in HfO₂ for example, they impair the ability to achieve multilevel switching with small devices (Fountain of Neptune, Bologna)

Near-memory computing approaches

By reducing the distance between memory and processor, the power consumption of data transfers is reduced. However, there is no way to achieve the energy efficiency and parameter density of in-memory computing approaches. It is therefore questionable whether this stuck-in-the-middle position allows for more than niche markets. Well-known examples in that space are Syntiant and Graphcore.

Devices for Spiking Neural Nets

So far, we have only discussed technologies focused on Deep Learning models such as Convolutional Nets, Transformers or Graph Neural Nets. But there are a number of technologies that are specifically suited to a different type of Machine Learning: Spiking Neural Nets (SNNs).

While SNNs definitely have their place in the AI landscape, they have some fundamental barriers to overcome.

  1. Can potential customers use the same software stack they currently use? Customers want to use standard tools like TensorFlow or PyTorch, but these tools don’t support SNNs natively (see the small sketch after this list).
  2. Which segments other than event-based applications can be targeted as well? This question is mainly about the potential of the company. Will it remain limited to niche markets?
  3. How sustainable is the USP of the event-based approach? Why shouldn’t event-based in-memory computing implementing SNNs, with all its other benefits, become a reality?
  4. What is the largest SNN they can run on a given footprint? Of course, parameter density matters here as well. Mixed-signal technologies like current SNN hardware usually require a very large footprint for each parameter to reflect the nature of a spike. What is the strategy if a new device that implements the spike behaviour in one single cell pops up?
  5. What is the accuracy compared to SOTA DL models? Artificial neural networks trained with spike-timing-dependent plasticity (STDP) have a much lower accuracy than DL models trained with backpropagation. This is especially true for large neural networks.
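
To illustrate why the software-stack question (item 1) is not a detail: the snippet below is a minimal leaky integrate-and-fire neuron in plain NumPy, with purely illustrative constants. Frameworks like TensorFlow or PyTorch have no native notion of membrane potentials, spike events or explicit time steps, so SNN vendors have to provide their own tooling or framework extensions.

```python
import numpy as np

def lif_neuron(input_current, beta=0.9, threshold=1.0):
    """Minimal leaky integrate-and-fire neuron; constants are illustrative."""
    mem = 0.0          # membrane potential
    spike_times = []
    for t, i_t in enumerate(input_current):
        mem = beta * mem + i_t    # leaky integration over time
        if mem >= threshold:      # fire ...
            spike_times.append(t)
            mem = 0.0             # ... and reset
    return spike_times

rng = np.random.default_rng(0)
current = rng.uniform(0.0, 0.3, size=100)   # toy input over 100 time steps
print(lif_neuron(current))  # information is carried by *when* spikes occur
```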

***

Of course, this is not a definitive list of the aspects that lead to an evaluation of a technology’s potential for edge AI inference, but I hope it is a useful starting point.

I will blog about other aspects of edge AI and in-memory computing every now and then.

Further reading

[1] With Yolo v7 alone accounting for 37 million parameters, a parameter density of at least several hundred million parameters (>INT8) per cm² is required to enable edge devices with strict footprint constraints to run large / multiple DL models in parallel. Reducing latency requires parallelisation, which depends on the parameter density as well.
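
A quick plausibility check of [1]; the footprint budget below is an assumed value for a footprint-constrained edge device:

```python
yolov7_params = 37e6          # from [1]
die_footprint_cm2 = 0.25      # assumed area budget of the accelerator

required_density = yolov7_params / die_footprint_cm2   # parameters per cm^2
print(f"{required_density / 1e6:.0f} M parameters / cm^2")  # ~148 M / cm^2
# Running several such models in parallel quickly pushes the requirement
# to several hundred million parameters per cm^2.
```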

[2] SRAM has a footprint of several 100 F², while NOR flash and other NVMs need only 8–12 F². Together with the 8 transistors usually required for 8-bit parameters (NVM: typically only 2), that leads to a disadvantage of a factor of about 40!
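
One possible reading of the numbers in [2], with per-parameter areas assumed within the stated ranges; the point is only that the area ratio lands around the quoted factor of 40:

```python
# Assumed per-parameter areas, loosely based on the ranges in [2] (in F^2).
sram_area_per_param = 8 * 100   # ~8 devices, each on the order of 100 F^2
nvm_area_per_param = 2 * 10     # typically 2 NVM cells of 8-12 F^2 each

print(sram_area_per_param / nvm_area_per_param)  # 40.0
```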

[3] If the weights of the hidden layers are binary and the energy efficiency for the respective operations is around 600 TOPS/W, but the first layer of the neural net has to have high-precision weights, the overall energy efficiency is only 14 TOPS/W!
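
The drop from 600 to roughly 14 TOPS/W in [3] is simply a weighted harmonic mean; the operation split and the first-layer efficiency below are assumptions chosen only to reproduce the footnote's figure:

```python
# Assumptions chosen only to reproduce the figure in [3].
frac_first_layer = 0.07    # assumed share of operations in the high-precision first layer
eff_first_layer = 1.0      # assumed TOPS/W for those high-precision operations
eff_binary = 600.0         # TOPS/W for the binary hidden layers (from [3])

# Overall efficiency = total operations / total energy (weighted harmonic mean).
overall = 1.0 / (frac_first_layer / eff_first_layer
                 + (1.0 - frac_first_layer) / eff_binary)
print(f"{overall:.0f} TOPS/W")  # ~14
```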

[4] One approach is to add noise while training the neural net to make it more robust against variation. However, this leads to the same re-training issues as with low-precision SRAM. Training a DL model on-chip is obviously not scalable, except for very rare types of DL models.
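
A minimal sketch of the noise-injection idea from [4], assuming PyTorch; the layer sizes and the noise level are placeholders. The weights are perturbed with random noise during training so the model learns to tolerate analog device variation, which is exactly why the re-training issue mentioned above appears.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Linear):
    """Linear layer whose weights are perturbed with Gaussian noise during
    training, emulating device-to-device variation of an analog array."""
    def __init__(self, in_features, out_features, rel_sigma=0.05):
        super().__init__(in_features, out_features)
        self.rel_sigma = rel_sigma  # assumed relative std-dev of the weight error

    def forward(self, x):
        if self.training:
            noise = torch.randn_like(self.weight) * self.rel_sigma * self.weight.abs()
            return F.linear(x, self.weight + noise, self.bias)
        return super().forward(x)

# Drop-in replacement for nn.Linear -- but the model now has to be
# (re-)trained with the customer's data, as noted above.
layer = NoisyLinear(128, 64)
out = layer(torch.randn(1, 128))
```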


Aron Kirschen

founder of SEMRON GmbH, industrial engineer, opera and wine enthusiast, Go >> Chess