Introduction to TinyML

Ayyuce Demirbas
8 min read · Jul 2, 2022


In this blog post, you’ll learn about Arduino Nano 33 BLE and the basics of TensorFlow Lite for Microcontrollers.

Having been chosen to contribute to TensorFlow Lite for Microcontrollers as part of this year’s Google Summer of Code program, I will delve into running machine learning models on microcontrollers. I aim to share the outcomes of my experiments and the insights I’ve gained through a series of blog posts.

TinyML involves making predictions on microcontrollers using machine learning models. The most challenging part of TinyML is the hardware constraints: we have to work with limited memory and computational power. And if the device must run on battery power, we also have to reduce power consumption, since more computation drains the battery faster.

Although there are different frameworks for TinyML applications, Google’s TensorFlow Lite Micro is widely used in this field.

TensorFlow Lite for Microcontrollers supports the following development boards:

· Arduino Nano 33 BLE Sense

· SparkFun Edge

· STM32F746 Discovery kit

· Adafruit EdgeBadge

· Adafruit TensorFlow Lite for Microcontrollers Kit

· Adafruit Circuit Playground Bluefruit

· Espressif ESP32-DevKitC

· Espressif ESP-EYE

· Wio Terminal: ATSAMD51

· Himax WE-I Plus EVB Endpoint AI Development Board

· Synopsys DesignWare ARC EM Software Development Platform

· Sony Spresense

For my experiments, I'm going to use an Arduino Nano 33 BLE.

According to the datasheet, the Arduino Nano 33 BLE has the following features:

Processor

· 64 MHz Arm® Cortex-M4F (with FPU)

· 1 MB Flash + 256 KB RAM

Bluetooth® 5 multiprotocol radio

· 2 Mbps

· CSA #2

· Advertising Extensions

· Long Range

· +8 dBm TX power

· -95 dBm sensitivity

· 4.8 mA in TX (0 dBm)

· 4.6 mA in RX (1 Mbps)

· Integrated balun with 50 Ω single-ended output

· IEEE 802.15.4 radio support

· Thread

· Zigbee

Peripherals

· Full-speed 12 Mbps USB

· NFC-A tag

· Arm CryptoCell CC310 security subsystem

· QSPI/SPI/TWI/I²S/PDM/QDEC

· High speed 32 MHz SPI

· Quad SPI interface 32 MHz

· EasyDMA for all digital interfaces

· 12-bit 200 ksps ADC

· 128-bit AES/ECB/CCM/AAR co-processor

LSM9DS1 (9 axis IMU)

· 3 acceleration channels, 3 angular rate channels, 3 magnetic field channels

· ±2/±4/±8/±16 g linear acceleration full scale

· ±4/±8/±12/±16 gauss magnetic full scale

· ±245/±500/±2000 dps angular rate full scale

· 16-bit data output

MPM3610 DC-DC

· Regulates input voltage from up to 21V with a minimum of 65% efficiency @minimum load

· More than 85% efficiency @12V

Processor

The Arduino Nano 33 BLE uses an ARM Cortex-M4F processor clocked at up to 64 MHz, which means one clock cycle takes 1/(64 MHz) ≈ 0.0156 microseconds (about 15.6 ns). The processor has 32-bit registers, 1 MB of flash memory, and 256 KB of RAM. “The flash (non-volatile memory) can be read an unlimited number of times by the CPU, but it has restrictions on the number of times it can be written and erased and also on how it can be written”. [1] The processor also contains a floating-point unit (FPU), also called a math coprocessor, which performs operations such as addition, subtraction, multiplication, and division on floating-point numbers.

“Arm Cortex processors with Digital Signal Processing (DSP) extensions offer high performance signal processing for voice, audio, sensor hubs and machine learning applications, with flexible, easy-to-use programming. The extensions provide a unique combination of compute scalability, power efficiency, determinism and interface options in order to perform the signal processing required in multi-sensor devices that do not require dedicated DSP hardware. The benefits of DSP extensions in Cortex processors include:

· Simplify the design, lower the bill of materials, reduce power and area with DSP and ML capabilities on Arm processors across a single architecture.

· Reduce system-level complexity by removing the need for shared memory and DSP communication, complex multi-processor bus architectures, and other custom ‘glue’ logic between the processor and DSP.

· Reduce software development costs, as the entire project can be supported using a single compiler, debugger or IDE, programmable in a high-level programming language such as C or C++.” [7]

The ARM Cortex-M4F implements the Armv7E-M architecture. For more information about the Armv7E-M architecture, you can download the architecture reference manual from here. ARM Cortex-M4 processors are based on the Harvard computer architecture, which means there are separate memories and pathways for instructions and data.

Figure 1: (https://developer.arm.com/Processors/Cortex-M4)
Figure 2: Memory Map

Bluetooth

The Arduino Nano 33 BLE uses the u-blox NINA-B306, a powerful 2.4 GHz Bluetooth® 5 Low Energy module with an internal antenna.

Inertial Measurement Unit (IMU)

The LSM9DS1 module detects orientation, motion, and vibration using a 3D accelerometer, gyroscope, and magnetometer.

The Operating System

The Arduino Nano 33 BLE runs Mbed OS, an open-source operating system that targets microcontrollers, Internet of Things devices, and wearables.

TensorFlow Lite for Microcontrollers

TensorFlow Lite Micro is a machine learning framework for microcontrollers, designed for low memory usage and low power consumption.

TensorFlow Lite Micro employs an interpreter-based approach, which allows portability across different hardware platforms. By contrast, a code-generation approach, which compiles a model directly into C code, does not allow such portability. Furthermore, “code generation intersperses settings such as model architecture, weights, and layer dimensions in the binary, which means replacing the entire executable to modify a model. In contrast, an interpreted approach keeps all this information in a separate memory file/area, allowing model updates to replace a single file or contiguous memory area.” [2]

Figure 3: Implementation-module overview [2]
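To make the interpreter-based approach concrete, here is a minimal sketch using the desktop TensorFlow Lite interpreter in Python; the model path is a placeholder, and TensorFlow Lite Micro's C++ interpreter follows the same load, allocate, and invoke pattern:

```python
import numpy as np
import tensorflow as tf

# Load the model from a separate file. This is the key property of the
# interpreter-based approach: updating the model means replacing this
# file, not recompiling the application.
interpreter = tf.lite.Interpreter(model_path="model.tflite")  # placeholder path
interpreter.allocate_tensors()  # plan and allocate tensor memory up front

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed a dummy input with the right shape and dtype, then run inference.
dummy = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], dummy)
interpreter.invoke()
prediction = interpreter.get_tensor(output_details[0]["index"])
print(prediction)
```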

According to the official website of TensorFlow Lite Micro, training a model, deploying it to the device, and running inference requires the following steps:

· Build your model considering the limits of your device. Smaller models can underfit, and larger ones might result in a higher duty cycle, which drains more power.

· You need to convert your TensorFlow model to a TensorFlow Lite model.

· Convert your TensorFlow Lite model to a C array to deploy it to your device (see the sketch after this list).

· Deploy it and run inference.
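As a rough sketch of the middle two steps, assuming a deliberately tiny Keras model (the architecture below is arbitrary), the conversion pipeline might look like this. In practice the C array is usually produced with the xxd -i command; the helper function below imitates it:

```python
import tensorflow as tf

# Step 1 (sketch): a deliberately tiny model, mindful of device limits.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),
])

# Step 2: convert the TensorFlow model to the TensorFlow Lite format.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enable size optimizations
tflite_model = converter.convert()

# Step 3: emit the model bytes as a C array, as `xxd -i model.tflite` would.
def to_c_array(data: bytes, name: str = "g_model") -> str:
    body = ",".join(f"0x{b:02x}" for b in data)
    return (f"const unsigned char {name}[] = {{{body}}};\n"
            f"const unsigned int {name}_len = {len(data)};\n")

with open("model.h", "w") as f:  # hypothetical output file name
    f.write(to_c_array(tflite_model))
```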

Unfortunately, TensorFlow Lite for Microcontrollers doesn’t support all of the TensorFlow operations. You can find a list of the supported operations here.

How Does TensorFlow Lite Optimize Memory Usage?

Intermediate tensors hold intermediate computation results to reduce inference latency, and they can use large amounts of memory; they might even be larger than the model itself. TensorFlow Lite employs several approximation techniques for this problem. These techniques use a data structure named a tensor usage record, which stores how big an intermediate tensor is and when it is used for the first and last time. The memory manager reads these records and computes the tensors' lifetimes to reduce the memory footprint. [6]
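To illustrate the idea, the sketch below defines hypothetical tensor usage records and a simplified greedy planner that lets tensors with non-overlapping lifetimes share the same arena memory; TensorFlow Lite's actual planner is more sophisticated:

```python
from dataclasses import dataclass

@dataclass
class TensorUsageRecord:
    size: int       # bytes needed by the intermediate tensor
    first_use: int  # index of the operation that produces it
    last_use: int   # index of the last operation that reads it

def peak_arena_size(records):
    """Greedily place tensors in a shared arena; tensors whose
    lifetimes do not overlap may reuse the same memory region."""
    placed = []  # (offset, record) pairs for already-assigned tensors
    for rec in sorted(records, key=lambda r: r.size, reverse=True):
        offset = 0
        for off, other in sorted(placed, key=lambda p: p[0]):
            lifetimes_overlap = not (rec.last_use < other.first_use
                                     or other.last_use < rec.first_use)
            # Bump the candidate offset past any conflicting tensor.
            if lifetimes_overlap and offset + rec.size > off:
                offset = max(offset, off + other.size)
        placed.append((offset, rec))
    return max(off + r.size for off, r in placed)

records = [TensorUsageRecord(size=32, first_use=0, last_use=1),
           TensorUsageRecord(size=28, first_use=1, last_use=2),
           TensorUsageRecord(size=36, first_use=2, last_use=3)]
print(peak_arena_size(records))  # 64, instead of 32 + 28 + 36 = 96
```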

Experiments

The authors of the paper MicroNets: Neural Network Architectures for Deploying TinyML Applications on Commodity Microcontrollers [4] ran experiments on different kinds of MCUs using TensorFlow Lite Micro. You can find their models here.

In Figure 4, the authors show an example memory occupancy map for a keyword spotting (KWS) model on an ARM Cortex-M7 with 320 KB of static random-access memory (SRAM) and 1 MB of on-chip embedded flash memory. Weights and biases (and the model architecture) must be stored in the flash memory because it retains data without power. In this example, the weights and biases occupy 500 KB of memory. The model size depends on how many layers and units per layer we have and on the numeric precision in which the weights are stored.

Figure 4: An example memory map of how an audio keyword spotting model mapped onto ARM Cortex M7 with 320 KB of SRAM and 1 MB eFlash [4]
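As a back-of-the-envelope check (my own arithmetic, not a figure from the paper), the flash footprint of the parameters follows directly from the parameter count and the numeric precision:

```python
def param_flash_bytes(num_params: int, bits_per_param: int) -> int:
    """Approximate flash needed to store a model's parameters."""
    return num_params * bits_per_param // 8

# Roughly 500 KB of weights corresponds to:
print(param_flash_bytes(500_000, 8))   # 500000 bytes with int8-quantized weights
print(param_flash_bytes(125_000, 32))  # 500000 bytes with float32 weights
```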

The paper also measures the per-layer latency of TensorFlow Lite Micro (Figure 5).

Figure 5: Latency of each layer in TensorFlow Lite Micro on the ARM Cortex M7 with 512 KB of SRAM and 2 MB flash memory [4]

Figure 6, Figure 7, and Equation 1 explain how convolution works.

Figure 6: Convolution kernel
Figure 7: Convolution (Vincent Dumoulin and Francesco Visin, arXiv, A guide to convolution arithmetic for deep learning, 2016)
Equation 1
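Equation 1 appeared as an image in the original post. It presumably shows the standard discrete convolution; in the spirit of the notation in [5], the 2D convolution and the output-size relation can be written as:

```latex
% Discrete 2D convolution (cross-correlation form):
(I * K)(i, j) = \sum_{m} \sum_{n} I(i + m,\, j + n)\, K(m, n)

% Output size for input size i, kernel size k, padding p, stride s:
o = \left\lfloor \frac{i + 2p - k}{s} \right\rfloor + 1
```

Here o is the output size, i the input size, k the kernel size, p the padding, and s the stride.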

Depthwise separable convolutional layers provide cheaper computation than standard convolutions.

RGB images consist of three channels (matrices): red, green, and blue. Depthwise separable convolution processes each channel separately. Like spatial separable convolution, a depthwise separable convolution layer divides a kernel into smaller kernels.

These layers work in two stages. The first stage is the depthwise convolution, in which each channel of the image is processed separately. Its output is not the final result but an intermediate one, which feeds the pointwise convolution, the second stage of the depthwise separable convolution.

The pointwise convolution uses 1×1 kernels. For RGB images, the depth of this kernel must equal the number of channels, 3. Convolving the intermediate result from the depthwise step produces an output of the same size as that of a standard convolution layer.
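A quick way to see the savings is to compare parameter counts between a standard convolution and its depthwise separable counterpart. The layer sizes in this Keras sketch are arbitrary:

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(32, 32, 3))  # e.g. a small RGB image

# Standard convolution: each of the 64 filters spans all 3 input channels.
standard = tf.keras.layers.Conv2D(64, kernel_size=3, padding="same")
# Depthwise separable: per-channel 3x3 depthwise step, then 1x1 pointwise step.
separable = tf.keras.layers.SeparableConv2D(64, kernel_size=3, padding="same")

print(tf.keras.Model(inputs, standard(inputs)).count_params())
# 1792 = 3*3*3*64 weights + 64 biases
print(tf.keras.Model(inputs, separable(inputs)).count_params())
# 283 = 3*3*3 depthwise + 3*64 pointwise + 64 biases
```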

Figure 8 shows how depthwise separable convolution works.

Figure 8: Depthwise separable convolution

According to Figure 5, however, depthwise convolutional layers still contribute noticeable latency in TensorFlow Lite Micro.

Implementation

Table 1 shows which kinds of neural networks are typically used for different types of data. Convolutional neural networks are quite useful for image data.

Table 1: Survey of TinyML use-cases, models, and datasets [3]

Google ordered an Arduino Nano 33 BLE board for me, but it hasn't arrived yet. In the next blog post, I'll build and train a convolutional model using TensorFlow and convert it to the TensorFlow Lite format and a C array. I'll also cover the basics of deep learning and image classification, using the CIFAR-10 dataset for this project.

References

[1] https://content.arduino.cc/assets/Nano_BLE_MCU-nRF52840_PS_v1.1.pdf

[2] Robert David, Jared Duke, Advait Jain, Vijay Janapa Reddi, Nat Jeffries, Jian Li, Nick Kreeger, Ian Nappier, Meghna Natraj, Shlomi Regev, Rocky Rhodes, Tiezhen Wang, Pete Warden, TensorFlow Lite Micro: Embedded Machine Learning on TinyML Systems

[3] Colby R. Banbury, Vijay Janapa Reddi, Max Lam, William Fu, Amin Fazel, Jeremy Holleman, Xinyuan Huang, Robert Hurtado, David Kanter, Anton Lokhmotov, David Patterson, Danilo Pau, Jae-sun Seo, Jeff Sieracki, Urmish Thakker, Marian Verhelst, Poonam Yadav, Benchmarking TinyML Systems: Challenges and Direction

[4] Colby Banbury, Chuteng Zhou, Igor Fedorov, Ramon Matas Navarro, Urmish Thakker, Dibakar Gope, Vijay Janapa Reddi, Matthew Mattina, Paul N. Whatmough, MicroNets: Neural Network Architectures for Deploying TinyML Applications on Commodity Microcontrollers

[5] Vincent Dumoulin and Francesco Visin, arXiv, A guide to convolution arithmetic for deep learning, 2016

[6] https://blog.tensorflow.org/2020/10/optimizing-tensorflow-lite-runtime.html

[7] https://developer.arm.com/Architectures/Digital%20Signal%20Processing
