AI on a cloud native WebAssembly runtime (WasmEdge) — Part I

Timothy McCallum
Published in Wasm · 11 min read · Jul 18, 2021

This article will demonstrate how to run machine learning models using the edge computing paradigm; specifically, how to run a TensorFlow Lite YOLOv4 model on a WebAssembly (Wasm) runtime called WasmEdge.

WasmEdge is a Cloud Native Computing Foundation (CNCF) Sandbox project.

WebAssembly (Wasm)

WebAssembly, first announced by the W3C in 2015, is an effort to produce a standard high-performance machine-independent bytecode that is also safe. For example, from a memory perspective, Wasm only exposes three distinct isolated memory regions: the stack, global variables, and a linear memory region. By design, these regions must be accessed with different type-safe instructions. This makes it easy for a compiler to ensure that memory accesses are safe when compiling to native code (Johnson et al, 2021). Further, other operating system resources such as networking and multi-threading are governed by high-level security policies. Wasm implements capability-based security by design, so it can be both performant and secure.

The importance of Wasm in edge computing

As we mentioned above, the design of WebAssembly promotes fast and safe programs. Wasm eliminates dangerous features from its execution semantics, while maintaining compatibility with programs written for C/C++ (WebAssembly, 2021). This fast and safe characteristic is particularly important in solving many technological problems.

One such problem is the fragility of the automotive supply chain. The automotive industry needs more functionality and features than ever before. However, the option to simply add more and more microprocessor-based electronic control units (ECUs) is becoming less and less viable.

Now, instead of the auto industry concealing dozens of physical computers throughout vehicles, auto industry manufacturers can potentially share physical hardware (IEEE Spectrum, 2021). This reduces the demand for microprocessors and reduces the cost of manufacturing by simplifying physical hardware requirements.

By changing the software architecture (instead of increasing the amount of hardware required), auto manufacturers can now worry less about supply chain issues and focus on achieving their technological feats in automation, infotainment, performance, comfort, efficiency, and safety.

Edge computing

Automotive applications, e.g. cooperative autonomous driving, have strict latency requirements that cloud solutions would have difficulty meeting. As an alternative to generic “cloud computing”, “edge computing” hosts computation tasks as close as possible to the data sources and the end users (Wang et al, 2020).

For example, data from sensors and cameras can be processed at the edge (in the vehicle), removing potential issues such as network latency, data availability, connectivity and so forth.

Bringing Wasm to the edge (WasmEdge)

WasmEdge brings Wasm to the edge; it enables serverless functions (Wasm executables) to be embedded into many software platforms. For example, WasmEdge can be used:

  • from the cloud’s edge
  • as an API endpoint, i.e. Function as a Service (FaaS)
  • from the Node.js command line
  • in embedded devices, e.g. automobiles (WasmEdge Runtime, 2021)

Ahead Of Time (AOT) compiler optimizations

In its AOT mode, WasmEdge is the fastest Wasm VM on the market today (WasmEdge, 2021).

[Image: mikemacmarketing, CC BY 2.0 <https://creativecommons.org/licenses/by/2.0>, via Wikimedia Commons]

Machine Learning (ML), Natural Language Processing (NLP) and subsequently Artificial Intelligence (AI)

Let’s take a look at how we can train and run ML on WasmEdge. But first, a little background …

TensorFlow Lite on WasmEdge

TensorFlow Lite is TensorFlow’s lightweight solution for embedded devices. It works without needing to make a round trip to a server. This removes network latency and connectivity issues and also maintains privacy because no data actually leaves the device (TensorFlow Lite, 2021).

In order to perform tasks such as object detection, facial recognition and so forth, TensorFlow requires a trained model; specifically a frozen model. What do we mean by models?

GraphDef

GraphDef files are the heart of your model data; they describe your graph in such a way that it can be read by other processes. GraphDef files come in two formats: the .pb extension for binary format and the .pbtxt extension for text format. Whilst the text format is structured, human-readable data, the binary format is much less verbose and more efficient for a machine to parse.
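As a minimal sketch of the difference (using a tiny, hypothetical graph purely for illustration), we can write the same GraphDef in both formats:

import tensorflow as tf

# Build a tiny graph (hypothetical; for illustration only)
tf.compat.v1.disable_eager_execution()
tf.constant([1.0, 2.0], name='example_node')
graph_def = tf.compat.v1.get_default_graph().as_graph_def()

# Write the same graph as human-readable text (.pbtxt) and compact binary (.pb)
tf.io.write_graph(graph_def, '.', 'graph.pbtxt', as_text=True)
tf.io.write_graph(graph_def, '.', 'graph.pb', as_text=False)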

Checkpoint

Checkpoint files contain serialized variables from a TensorFlow graph. A checkpoint file does not contain any structure; just the state of the variables at different stages of the learning process.

Frozen Graph

A Frozen Graph is created by combining the latest Checkpoint file with the GraphDef file. Creating a Frozen Graph is known as “freezing”: we take the definitions from a GraphDef file, take the values from a Checkpoint file, and turn every variable into a constant.
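To make freezing concrete, here is a minimal, hedged sketch in TF1-style code (the tiny graph and node names below are hypothetical, purely for illustration):

import tensorflow as tf

# Build a tiny TF1-style graph with one variable (hypothetical example)
tf.compat.v1.disable_eager_execution()
x = tf.compat.v1.placeholder(tf.float32, shape=[None, 4], name='input')
w = tf.compat.v1.get_variable('weights', shape=[4, 2])
y = tf.matmul(x, w, name='output')

with tf.compat.v1.Session() as sess:
    sess.run(tf.compat.v1.global_variables_initializer())
    # Freezing: replace every variable with a constant holding its current value
    frozen = tf.compat.v1.graph_util.convert_variables_to_constants(
        sess, sess.graph.as_graph_def(), output_node_names=['output'])
    # Write the frozen GraphDef in binary (.pb) format
    tf.io.write_graph(frozen, '.', 'frozen_graph.pb', as_text=False)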

TensorFlow Lite (TFLite)

As mentioned above, TensorFlow Lite is an open source deep learning framework for on-device inference (TensorFlow Lite, 2021). TensorFlow Lite is able to run on smaller devices because it:

  • is more memory efficient
  • uses less code
  • has fewer code dependencies
  • has a smaller binary
  • accepts smaller model sizes
  • has a low-overhead static execution plan
  • uses flat buffers (as opposed to protobufs), so it can read data without deserialising an object

A TFLite file can actually be generated from an existing TensorFlow Frozen Graph. Transforming a Frozen Graph to a TFLite file is done by converting a TensorFlow model into a compressed flat buffer (with the TensorFlow Lite Converter). This method is a few years old now, but it is worth mentioning. If you specifically want to use TensorFlow Lite only, there is good news.
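For completeness, here is a hedged sketch of that legacy conversion path (the file and tensor names below are hypothetical placeholders):

import tensorflow as tf

# Convert a frozen GraphDef into a compressed .tflite flat buffer
# ('frozen_graph.pb' and the tensor names are hypothetical placeholders)
converter = tf.compat.v1.lite.TFLiteConverter.from_frozen_graph(
    graph_def_file='frozen_graph.pb',
    input_arrays=['input'],
    output_arrays=['output'])
tflite_model = converter.convert()

with open('model.tflite', 'wb') as f:
    f.write(tflite_model)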

Rather than go through the model conversion steps above (mainly useful for migrating existing models), you can actually start from scratch and train, test and run your own TensorFlow Lite models. You can do this using the TensorFlow Lite Model Maker library. Let’s go ahead and try the TensorFlow Lite Model Maker library out.

TensorFlow Lite Model Maker library — A natural language classification example

The TensorFlow Lite Model Maker library allows us to start from scratch and create our own .tflite model. This means that we perform all of our own training and testing, then simply export the results as a TensorFlow Lite model. Using the TensorFlow Lite Model Maker means we can skip all of the above steps of transforming TensorFlow artefacts (GraphDef, Checkpoint, Frozen Graph etc.) into TensorFlow Lite. It was necessary to unpack and discuss the TensorFlow architecture (and how it relates/transforms to TensorFlow Lite), but now let’s just use the TensorFlow Lite Model Maker library; you will see that getting from raw data to production usage is actually quite a simple process.

We will first need to install TensorFlow

pip3 install --user --upgrade tensorflow

We then install tflite-model-maker

pip3 install -q tflite-model-maker

We then install a specific version of numpy (to satisfy TensorFlow’s versioning requirements)

pip3 install --user numpy~=1.19.2

The next step is to import dependencies/libraries

import os
import numpy as np
import pandas as pd
import tensorflow as tf
from tflite_model_maker import model_spec
from tflite_model_maker import text_classifier
from tflite_model_maker.config import ExportFormat
from tflite_model_maker.text_classifier import AverageWordVecSpec
from tflite_model_maker.text_classifier import DataLoader

Then obtain the data

data_dir = tf.keras.utils.get_file(
    fname='SST-2.zip',
    origin='https://dl.fbaipublicfiles.com/glue/data/SST-2.zip',
    extract=True)
data_dir = os.path.join(os.path.dirname(data_dir), 'SST-2')

We then write a function that converts each tab-separated file into a comma-separated file, mapping the numeric labels (0 and 1) to the text labels negative and positive along the way

def replace_label(original_file, new_file):
    # Read the tab-separated file into a DataFrame
    df = pd.read_csv(original_file, sep='\t')
    # Map the numeric labels to human-readable text labels
    label_map = {0: 'negative', 1: 'positive'}
    df.replace({'label': label_map}, inplace=True)
    # Write the result out as a comma-separated file
    df.to_csv(new_file)

And execute that function for both the training dataset and the testing dataset

replace_label(os.path.join(data_dir, 'train.tsv'), 'train.csv')
replace_label(os.path.join(data_dir, 'dev.tsv'), 'dev.csv')

Now we can train the model and then evaluate it against the test data

spec = model_spec.get('average_word_vec')
train_data = DataLoader.from_csv(filename='train.csv', text_column='sentence', label_column='label', model_spec=spec, is_training=True)
test_data = DataLoader.from_csv(filename='dev.csv', text_column='sentence', label_column='label', model_spec=spec, is_training=False)
# Train
model = text_classifier.create(train_data, model_spec=spec, epochs=10)
# Test
loss, acc = model.evaluate(test_data)

At this point, we can export the model to the average_word_vec directory.

model.export(export_dir='average_word_vec')
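As a quick sanity check, we can load the exported model back with the TFLite interpreter and inspect its tensors (a hedged sketch; Model Maker typically writes the file as model.tflite, but verify the name in your export directory):

import tensorflow as tf

# Load the exported flat buffer and inspect its input/output tensors
interpreter = tf.lite.Interpreter(model_path='average_word_vec/model.tflite')
interpreter.allocate_tensors()
print(interpreter.get_input_details())
print(interpreter.get_output_details())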

Inspecting the model

If we want to inspect the model visually, we can use a tool such as netron.app (as shown below).

https://netron.app/

Deploy and run

Whilst WasmEdge is available directly on embedded devices, for the purposes of demonstration in this article we will be using another one of WasmEdge’s modes: Wasm as a RESTful function. This way you can run the model via an API endpoint on the open web. More specifically, you can try out live demonstrations such as face detection and image classification.

There is also documentation and source code on the wasm-learning GitHub repo that explains how to make this specific natural language model available as a live demo.

https://second-state.github.io/wasm-learning/faas/tf_lite_natural_language/html/index.html

Optimising machine learning models

The above example is useful to demonstrate how we can train and deploy simple, small models; e.g. the language classification’s .tflite file is only 757 KB. There are models which are vastly more complex. One example is the YOLOv4 model.

YOLOv4 Model Example

You only look once (YOLO) is a state-of-the-art, real-time object detection system. Prior to YOLO, detection systems repurposed classifiers or localizers to perform detection, applying a model to an image at multiple locations and scales; high-scoring regions of the image were considered detections (Redmon, 2021).

In contrast, YOLO applies a single neural network to the full image. This network divides the image into regions and predicts bounding boxes and probabilities for each region. These bounding boxes are weighted by the predicted probabilities (Redmon, 2021).

As a first step, YOLO only generates confidence values in relation to the question “is there an object in that region?”. Once this part is complete, we know where the objects are in the image, but we don’t know what each of the objects is.

Next, YOLO applies conditional probability, i.e. if a grid cell predicts “car”, it is not yet saying that a car is in that grid cell. One more step is still needed in order to confirm that “if there is an object in that grid cell, then that object is a car”.

If we take these conditional probabilities and multiply them by the confidence values (from the first step), we get bounding boxes which are weighted by the probabilities of containing a particular object.

YOLO then uses thresholds to discard areas with low weighting.

Essentially, the way in which YOLO works allows the “object detection” pipeline to run at the same speed as an “image classification” pipeline (just one evaluation is necessary), i.e. YOLO predicts all of these detections simultaneously. The scoring arithmetic described above can be sketched in a few lines, as shown below.
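Here is a toy sketch of that scoring step (the numbers are illustrative, not real model output):

import numpy as np

# Objectness: confidence that an object exists in each of 3 candidate boxes
objectness = np.array([0.9, 0.6, 0.05])

# Conditional class probabilities per box, P(class | object), e.g. [car, person]
class_probs = np.array([
    [0.8, 0.2],
    [0.3, 0.7],
    [0.5, 0.5],
])

# Multiply objectness by the conditional probabilities: class-specific box scores
scores = objectness[:, None] * class_probs

# Use a threshold to discard areas with low weighting
threshold = 0.25
keep = scores.max(axis=1) > threshold
print(scores)
print(keep)  # the third box (objectness 0.05) is discarded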

COCO

Common Objects in COntext (COCO) is a large-scale object detection, segmentation, and captioning dataset. While previous object recognition datasets have focused on (a) image classification, (b) object bounding box localization or (c) semantic pixel-level segmentation, COCO focusses on (d) segmenting individual object instances by introducing a large, richly-annotated dataset comprised of images depicting complex everyday scenes of common objects in their natural context (Lin et al, 2014).

We recently trained the YOLOv4 Model on the COCO dataset. Here is how we then ran it on WasmEdge.

Running the model using WasmEdge

Before we can write code to run this model, we need to understand the model’s inputs and outputs.

Inputs

The inputs can be seen using a free online tool called netron. If we load the yolov4-416.tflite file we can see the input.

Clicking on the input in the web browser shows the model properties in the side bar. More specifically, we can now see the single input called input_1, which is of data type float32[1,416,416,3].
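As a hedged sketch, preparing an image to match that float32[1,416,416,3] input might look like this (using Pillow and NumPy; the normalisation to [0,1] is an assumption that should be checked against the model’s training pipeline):

import numpy as np
from PIL import Image

# Resize the image to the model's expected 416 x 416 RGB input
img = Image.open('image.png').convert('RGB').resize((416, 416))
# Scale pixel values to [0, 1] (assumed) and add the leading batch dimension
flat_img = (np.asarray(img, dtype=np.float32) / 255.0).reshape(1, 416, 416, 3)
print(flat_img.shape, flat_img.dtype)  # (1, 416, 416, 3) float32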

Outputs

The outputs (Identity and Identity_1) can also be seen using netron software.

If we click on the outputs in the web browser, we can also see the data structure of both of these outputs. For example, Identity is float32[1,1,4] and Identity_1 is float32[1,1,80].

Rust source code

The Rust source code needs to:

  • load this model
  • establish the inputs and outputs
  • run the model
  • save the outputs to variables for further use

The Rust source code looks like the following.

// Load the TensorFlow Lite model file into memory
let model_data: &[u8] = include_bytes!("yolov4-416.tflite");
// Create a new session for the TensorFlow Lite model
let mut session = ssvm_tensorflow_interface::Session::new(model_data, ssvm_tensorflow_interface::ModelType::TensorFlowLite);
// Establish the input (the flattened image) and the two outputs
session.add_input("input_1", &flat_img, &[1, 416, 416, 3]);
session.add_output("Identity");
session.add_output("Identity_1");
// Run the model
session.run();
// Save the outputs to variables for further use
let res_vec: Vec<f32> = session.get_output("Identity");
let res_vec_1: Vec<f32> = session.get_output("Identity_1");

If we compile the above code, we generate a new WebAssembly executable. As mentioned above, thanks to WasmEdge, we can run our new WebAssembly executable .wasm file in a number of environments. For this demonstration we chose to run the model using Node.js. Here’s how.

// Import file system library
const fs = require('fs');
// Create wasmedge instance
const wasmedge = require("wasmedge-extensions");
// Load the .wasm file
const path = "/media/nvme/yolo/wasm-learning/faas/yolo-tflite/pkg/yolo_tflite_lib_bg.wasm";
// Create a WebAssembly VM Instance
var vm = new wasmedge.VM(path, { EnableAOT:true, args:process.argv, env:process.env, preopens:{"/": "/tmp"} });
// Create a file path for the ahead-of-time compiled (optimized) binary
const aot_path = "/media/nvme/aot_file.so";
// Make an AOT optimized executable file
vm.Compile(aot_path);
// Create a new VM instance using the AOT binary (as opposed to interpreted Wasm)
var vm_aot = new wasmedge.VM(aot_path, { EnableAOT:true, args:process.argv, env:process.env, preopens:{"/": "/tmp"} });
// Open the image which we will perform object detection on
var img_src = fs.readFileSync("image.png");
// Run by passing in the image as a byte array
var return_value = vm_aot.RunUint8Array("infer", img_src);

Results

Running the AOT-compiled file generates two arrays, Identity and Identity_1 (as shown in the model properties which we inspected using netron above).

Identity

The output from Identity is as follows

[3.1309958, 3.7360609, 7.5596023, 8.140947 ... snip ... 401.64178, 400.8591, 267.096, 256.04437]

Interestingly, Identity is an array with a length of 42588, with values ranging from a minimum of 0.34816748 to a maximum of 725.75775. So how do we interpret this data?

Identity_1

The output from Identity_1 is as follows

[0.0000864767, 0.0000008141958, 0.000010776263 ... snip ... 0.0000000000043190325, 0.0000000000006169203, 0.0000000000014217605]

We can also see from the output that Identity_1 is an array of length 851760, with values ranging from a minimum of 3.323909e-26 to a maximum of 0.9954778. Again, how do we interpret this data?
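One hint before the full interpretation: 42588 = 10647 × 4 and 851760 = 10647 × 80, which suggests 10647 candidate boxes (the standard YOLOv4 count at a 416 × 416 input: (52² + 26² + 13²) × 3 anchors), each carrying 4 box coordinates and 80 COCO class scores. A hedged sketch of the reshape, using random stand-in data in place of the real outputs:

import numpy as np

# Stand-ins for the flat outputs returned from the Wasm run (random illustrative data)
identity = np.random.rand(42588).astype(np.float32)
identity_1 = np.random.rand(851760).astype(np.float32)

# 42588 = 10647 * 4 and 851760 = 10647 * 80, i.e. 10647 candidate boxes:
# (52*52 + 26*26 + 13*13) * 3 anchors = 10647 at a 416 x 416 input
boxes = identity.reshape(10647, 4)            # coordinates of each candidate box
class_scores = identity_1.reshape(10647, 80)  # one score per COCO class, per box
print(boxes.shape, class_scores.shape)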

Interpreting the results

In the next article we will interpret these results: filtering out low-quality boxes, extracting box coordinates and drawing bounding boxes. The end goal is to redraw the input image with all objects identified and labelled.

References

IEEE Spectrum: Technology, Engineering, and Science News. 2021. How Software Is Eating the Car. [online] Available at: <https://spectrum.ieee.org/cars-that-think/transportation/advanced-cars/software-eating-car> [Accessed 17 June 2021].

Johnson, E., Thien, D., Alhessi, Y., Narayan, S., Brown, F., Lerner, S., McMullen, T., Savage, S. and Stefan, D., 2021. Доверяй, но проверяй: SFI safety for native-compiled Wasm. NDSS. Internet Society.

Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P. and Zitnick, C.L., 2014, September. Microsoft COCO: Common objects in context. In European conference on computer vision (pp. 740–755). Springer, Cham.

Long, J., Tai, H.Y., Hsieh, S.T. and Yuan, M.J., 2020. A Lightweight Design for Serverless Function as a Service. IEEE Software, 38(1), pp.75–80. < https://arxiv.org/pdf/2010.07115.pdf >

Redmon, J. and Farhadi, A., 2018. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767.

Redmon, J., 2021. YOLO: Real-Time Object Detection. [online] Pjreddie.com. Available at: <https://pjreddie.com/darknet/yolo/> [Accessed 14 July 2021].

TensorFlow. 2021. TensorFlow Lite. [online] Available at: <https://www.tensorflow.org/lite/> [Accessed 21 June 2021].

Wang, X., Han, Y., Leung, V.C., Niyato, D., Yan, X. and Chen, X., 2020. Convergence of edge computing and deep learning: A comprehensive survey. IEEE Communications Surveys & Tutorials, 22(2), pp.869–904. < https://arxiv.org/pdf/1907.08349.pdf >

WasmEdge, 2021. WasmEdge. [online] Available at: <https://github.com/WasmEdge/WasmEdge> [Accessed 19 June 2021].

Wasmedge.org. 2021. WasmEdge Runtime. [online] Available at: <https://wasmedge.org/> [Accessed 17 June 2021].

Webassembly.org. 2021. WebAssembly. [online] Available at: <https://webassembly.org/> [Accessed 17 June 2021].

Zheng, S., Wang, H., Wu, L., Huang, G. and Liu, X., 2020. VM Matters: A Comparison of WASM VMs and EVMs in the Performance of Blockchain Smart Contracts. arXiv preprint arXiv:2012.01032. < https://arxiv.org/pdf/2012.01032.pdf >
