Why AI will kill the Smartphone

Aron Kirschen
Dec 1, 2021 · 5 min read


It seems that evolution has not taken smartphones into account… (Photo by Hugh Han on Unsplash)

Okay, a clickbait title, and definitely one you've heard before. And now, this is where the but usually comes in…

But there is a kernel of truth in this title :) Facebook (now Meta) has announced it, and others have as well. It's not a bet on one particular technological direction, it's rather a no-brainer to predict this development.

Today, we humans have to adapt to the technology. We have to look down at a small screen to write messages, we have to hold it to our ears to talk, and we even use our fingers to turn speech, a reasonably efficient form of communication, into slow text messages.

Future generations will look back on that as medieval-age technology.

Thus, it’s only a matter of years until we see wide adoption of smart contact lenses with AR, wristbands for gesture recognition and in-ear buds for listening (the last one we will probably have to reinvent completely, since today’s earbuds are not convenient at all, and bone conduction is a solution for smart glasses, not for contact lenses).

But unfortunately, today we don’t have the hardware components we need for that. I will show you why.

Before we get into technical details, let’s take a short look at the impact of a communication system like that.

Imagine the potential it holds: enabling deaf people to ‘see’ the voice of the person they are talking to, displayed in their field of view via Automatic Speech Recognition (ASR).

Walking through the city at night while getting the full picture, as if it were daylight?

Zooming in on a detail in the far distance fully controlled by your eyes?

Getting all the information you need when you meet someone at a conference?

Seamlessly switching to the point of view of a friend who is surfing off the coast of California, while you sit in a café in Berlin?

Letting spoken Google searches pop up in your field of view?

The list continues…

And they all have one thing in common: they need powerful artificial intelligence. And the most powerful (or at least the most mature) class of algorithms we have for that is Deep Learning (DL) models.

Okay, I have no idea what this has to do with my article, but it’s beautiful — and in the city where I live (Photo by Pan Species on Unsplash)

My goal is to show you how this leads to an extremely high demand for computational power optimised for DL.

For that I will use two parameters. The first is the number of operations per second, usually measured in giga- or tera-operations per second (GOPS/TOPS); for those who don’t like this measure because it depends on precision: INT8 is fine for our purposes.

The second is the number of trained parameters that fit on the chip (this matters because we will need in-memory computing ASICs here; GPUs and CPUs are inherently inefficient for DL due to their off-chip memory accesses).
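To make this second parameter a bit more tangible, here is a minimal Python sketch; the model sizes below are rough assumptions I picked for illustration, not measurements of any specific network.

```python
# How much on-chip memory do the trained weights of a DL model need?
# The model sizes below are rough, illustrative assumptions.

def weight_memory_mb(params, bytes_per_param=1):
    """Memory footprint of the weights in MB (1 byte per weight at INT8)."""
    return params * bytes_per_param / 1e6

models = {
    "keyword spotting":       200_000,
    "compact vision network": 4_000_000,
    "small streaming ASR":    30_000_000,
}

for name, params in models.items():
    print(f"{name:>24}: {weight_memory_mb(params):6.1f} MB of weights")

# A conventional processor has to fetch these weights from off-chip DRAM again
# and again; keeping them inside the compute array is the whole point of
# in-memory computing.
```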

The mentioned applications above fit into two classes:

  1. machine vision (for example vision enhancement or eye tracking)
  2. natural language processing (NLP) / speech recognition.

Object annotation is at the intersection of both.

For the first class you need high throughput, because images are large and latency is critical. Thus, it is very important that your chip can deliver a lot of TOPS. For eye tracking you need some TOPS, for night vision including recolourisation some more, and super-resolution should not take less either. Let’s assume 10 TOPS in total for the moment, as sketched below.
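To give a feeling for where a number like 10 TOPS can come from, here is a back-of-the-envelope sketch; the resolutions, frame rates and ops-per-pixel figures are purely illustrative assumptions, not benchmarks of real networks.

```python
# Back-of-the-envelope throughput estimate for the vision tasks above.
# Resolutions, frame rates and ops-per-pixel are illustrative assumptions only.

def vision_tops(width_px, height_px, fps, ops_per_pixel):
    """Required tera-operations per second for one vision task."""
    return width_px * height_px * ops_per_pixel * fps / 1e12

eye_tracking = vision_tops(640, 480, 200, 10_000)    # small crop, very high frame rate
night_vision = vision_tops(1280, 720, 50, 60_000)    # full frame plus recolourisation
super_res    = vision_tops(1280, 720, 50, 140_000)   # super-resolution is ops-hungry

total = eye_tracking + night_vision + super_res
print(f"eye tracking ≈ {eye_tracking:.1f} TOPS")
print(f"night vision ≈ {night_vision:.1f} TOPS")
print(f"super res    ≈ {super_res:.1f} TOPS")
print(f"total        ≈ {total:.1f} TOPS")   # lands in the order of the 10 TOPS assumed above
```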

Processing that workload on-device typically means a battery capacity of less than 2 × 150 mAh at 3.7 V for glasses, and far less for contact lenses. With that you have to power the display, the image sensor and the processor.

If you spend half of that power on the chip and target a battery life of at least 10 hours (not even accounting for the decline of your battery’s capacity over time!), you end up needing more than 200 TOPS/W just for the machine vision tasks.
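For those who want to check the arithmetic, here is the same budget written out as a small sketch, using only the assumptions from above.

```python
# Power-budget check for smart glasses, using the assumptions above.

battery_mah    = 2 * 150      # two 150 mAh cells, an upper bound for glasses
voltage_v      = 3.7
battery_wh     = battery_mah / 1000 * voltage_v   # ≈ 1.11 Wh of total energy

battery_life_h = 10                               # target battery life
avg_power_w    = battery_wh / battery_life_h      # ≈ 0.111 W for the whole device
chip_power_w   = avg_power_w / 2                  # half of the budget for the DL chip

vision_tops       = 10                            # machine-vision workload assumed above
efficiency_needed = vision_tops / chip_power_w    # ≈ 180 TOPS/W

print(f"chip power budget  : {chip_power_w * 1000:.0f} mW")
print(f"required efficiency: {efficiency_needed:.0f} TOPS/W")
# The real battery is smaller than this upper bound and loses capacity over
# time, which pushes the requirement above 200 TOPS/W.
```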

Given the multitude of parameters you need for the vocabulary of ASR / NLP models, you also end up with a lot of TOPS there. And this workload comes on top of what we just calculated.

And now imagine that amount of workload on a contact lens…

This guy used a GPU for his super-resolution contact lens (he later regretted this publicly)

This is the reason why a dozen innovative start-ups are working on a new class of processor chips for DL models. Mythic, Syntiant and SEMRON are just a few of the names in that game.

You may ask yourself: why can’t we just transmit the data to the smartphone and process it there, or in the cloud?

There are three main reasons:

  1. legal problems: imagine all the intimate situations that would be uploaded
  2. inconvenience: relying on the smartphone wouldn’t be much of a replacement for the smartphone, right? And relying on perfect connectivity all the time is still not realistic in a lot of places, and we don’t want our reality to stutter…
  3. but the main reason is the transmission bottleneck

Transmitting an audio signal requires 16 kB/s (16k samples per second at 8 bit, at the very least). Transmitting text requires only 12 B/s (for English: about 3 symbols per syllable, 4 syllables per second, 8 bit per symbol). That is a compression factor of more than 1,000x! Even in audio-only applications, the raw signal would put a lot of pressure on the Bluetooth integration.

For vision the equation becomes even worse: good inference needs more than 50 fps, each frame carrying at least one megabyte of data. Transmitting that is simply not possible without the antenna cooking the device.
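Putting the transmission bottleneck into numbers, with the same rough assumptions as in the text:

```python
# Rough data-rate comparison: raw audio vs. recognised text vs. raw video,
# using the same rough assumptions as in the text above.

audio_Bps = 16_000 * 1        # 16k samples/s at 8 bit -> 16 kB/s of raw audio
text_Bps  = 3 * 4 * 1         # 3 symbols/syllable, 4 syllables/s, 1 byte/symbol -> 12 B/s
video_Bps = 50 * 1_000_000    # 50 fps, at least 1 MB per frame -> 50 MB/s

print(f"raw audio: {audio_Bps / 1e3:.0f} kB/s")
print(f"text     : {text_Bps} B/s  (~{audio_Bps // text_Bps}x less than raw audio)")
print(f"raw video: {video_Bps / 1e6:.0f} MB/s  (far beyond a comfortable Bluetooth link)")
```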

The good news: all the parts needed for this are currently in development. For example, there is fantastic progress with MEMS micro-mirrors for awesome new displays.

Because for the world after the smartphone, we need innovations in every component.

****

For the sake of brevity, I simplified a lot. If I made a mistake or missed something important, please add a comment! After my last article, I got some suggestions for additional articles. I will try to integrate as much of that feedback as possible into my next one.

Thanks for reading!!



Aron Kirschen

founder of SEMRON GmbH, industrial engineer, opera and wine enthusiast, Go >> Chess