Journey of putting a YOLO v7 model into a TensorFlow Lite (Object Detection API) model running on Android

Stephen Cow Chau
Published in Geek Culture
8 min read · Sep 1, 2022

Forewords

This article is not a tutorial on how to convert a PyTorch model into a TensorFlow Lite model, but rather a summary of my journey trying to run a YOLO v7 (tiny) PyTorch model on an edge device (for instance, Android).

This article might not describe the best option, as my approach was mainly restricted by my own skill set as well as my goal.

My skill set

I consider myself to know PyTorch well enough; I have very little experience with TensorFlow.js and Keras, and absolutely NO experience with TensorFlow v1 and TensorFlow Lite.

Also, I mainly develop mobile apps with React Native, so learning how to develop an Android app was another challenge.

Why not TensorFlow.js + React Native then?

I have seen people use TensorFlow.js with React Native to put a model on mobile devices. But my goal is to put it on edge devices like a Raspberry Pi or even ESP32/Arduino devices (unlikely with a giant model like YOLOv7), so I am sticking with TensorFlow Lite (or even TensorFlow Lite Micro, which might be more challenging).

End of forewords

I hope this article shares something useful for anyone who aims to do the same or a similar exercise.

My journey / path of conversion

This is the path I took to achieve my goal. There are plenty of articles online on how to convert a PyTorch model to other formats (and some of them do have the same target, TensorFlow Lite), so what's special about my journey here?

The libraries I use

  1. PyTorch to ONNX: the YOLO v7 source code provides the export code, which covers not only the graph but, amazingly, also includes the Non Max Suppression operation in the graph
  2. ONNX to TensorFlow: a library called onnx-tensorflow (Github link); I believe this is the most official library, as it sits under the ONNX Github organization
  3. TensorFlow to TensorFlow Lite: the TensorFlow Lite converter in the official TensorFlow Python library (tf.lite.TFLiteConverter). A minimal sketch of these conversion steps follows this list.
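
For reference, here is roughly what the ONNX to TensorFlow to TensorFlow Lite half of the chain looks like in Python (a minimal sketch; the file and directory names are placeholders, and the PyTorch to ONNX step is done separately with the repo's export.py):

    import onnx
    import tensorflow as tf
    from onnx_tf.backend import prepare

    # ONNX -> TensorFlow SavedModel, using the onnx-tensorflow library
    onnx_model = onnx.load("yolov7-tiny-nms.onnx")
    prepare(onnx_model).export_graph("yolov7_saved_model")

    # TensorFlow SavedModel -> TensorFlow Lite flatbuffer
    converter = tf.lite.TFLiteConverter.from_saved_model("yolov7_saved_model")
    tflite_model = converter.convert()
    with open("yolov7-tiny.tflite", "wb") as f:
        f.write(tflite_model)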

The hiccups / obstacles

The difference in input / output between the PyTorch YOLO v7 model and the TensorFlow Lite Object Detection API requirements

In the first place, why stick with the TensorFlow Lite Object Detection API?

This is because I do not want to develop the client application on Android from scratch, so I am reusing (updating) the example application from TensorFlow Lite, which uses the Object Detection API.

The alternative is to use the lower-level TensorFlow Lite Interpreter API, with which one can do anything before and after the interpreter session runs, but that would require a major effort to modify the Android app.

https://www.tensorflow.org/lite/examples/object_detection/overview

The problem / symptom

So after going through the conversion from PyTorch to a TensorFlow Lite model and loading the model into the app, the app complained that it could not initialize the model.

With the help of “adb logcat” [this is a reminder to myself of how I found it], I could see the error was about the expected input dimension.

According to the Model compatibility requirements:

The input difference

PyTorch image processing uses the NCHW layout (batch size, channel, height, width), while TensorFlow Lite asks for NHWC (batch size, height, width, channel), which requires a permute of the tensor axes.
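
As a toy illustration of the axis permutation (assuming the 640 x 640 RGB input that YOLOv7-tiny uses):

    import torch

    # TensorFlow Lite style input: (batch, height, width, channel)
    nhwc = torch.randn(1, 640, 640, 3)

    # Rearrange to the (batch, channel, height, width) layout PyTorch models expect
    nchw = nhwc.permute(0, 3, 1, 2)
    print(nchw.shape)  # torch.Size([1, 3, 640, 640])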

The output difference

This is about the YOLOv7 model's prediction output: the model concatenates each detection result's bounding box, confidence score and class into one tensor of size 6, like this: [x0, y0, x1, y1, score, class]
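
For contrast, the Object Detection API wants the detections as separate tensors (boxes, classes, scores and a detection count). Purely as an illustration of the two layouts (toy values, not the actual export code):

    import torch

    # Two fake detections in YOLOv7's concatenated layout, one row per detection:
    # [x0, y0, x1, y1, score, class]
    detections = torch.tensor([[10., 20., 200., 300., 0.91, 0.],
                               [50., 60., 400., 500., 0.77, 16.]])

    boxes = detections[:, 0:4]    # (num_det, 4) bounding boxes
    scores = detections[:, 4]     # (num_det,) confidence scores
    classes = detections[:, 5]    # (num_det,) class indices
    num_det = torch.tensor([detections.shape[0]], dtype=torch.float32)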

What I tried / considered

1. [Failed] The Android App / TensorFlow Lite library level

The first thing I explored was the Android app level, and I figured out that the Object Detection API does not seem to be written in Java: the TensorFlow Lite Java library interacts with it through JNI. One can check the source code in the tflite-support Github repo for the JNI implementation of object detection (which is written in native C++ code), so it seems there is no easy way to adapt things at the Object Detection API level.

2. [Used] The model (in the chain) level

Another idea is to wrap the model with transformation steps to align the expected input and output.

There are many models along the path:

TensorFlow Lite model: this is stored in the flatbuffer format, and it looks like it is not expected to be tampered with.

TensorFlow SavedModel: with my limited knowledge of TensorFlow, this saved model should be considered a static graph. People might be able to convert it to a Keras model (or export from ONNX to a Keras model instead of this static graph model) and perform the modification there, but I chose to pass.

ONNX graph: this one was my initial candidate. I believe there is a way to perform such an operation since it is literally a graph (where each operation in the model is a graph node and the tensors flowing between them are the edges).

The approach is feasible with Nvidia's TensorRT onnx-graphsurgeon library; they have examples of how to modify a graph, so I was thinking of adding a permute (NHWC to NCHW) node before the original input and removing the concat node at the end.
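
For what it's worth, a rough, untested sketch of what inserting such a transpose with onnx-graphsurgeon could look like (the file names and the "images_nhwc" input name are made up for illustration):

    import onnx
    import onnx_graphsurgeon as gs

    graph = gs.import_onnx(onnx.load("yolov7-tiny-nms.onnx"))

    # Original graph input, e.g. shape (1, 3, 640, 640) in NCHW
    old_input = graph.inputs[0]

    # New NHWC-shaped input to expose to the caller
    new_input = gs.Variable(name="images_nhwc", dtype=old_input.dtype, shape=(1, 640, 640, 3))

    # Transpose NHWC -> NCHW and feed the result where the old input used to be
    transpose = gs.Node(op="Transpose", attrs={"perm": [0, 3, 1, 2]},
                        inputs=[new_input], outputs=[old_input])
    graph.nodes.append(transpose)
    graph.inputs = [new_input]

    graph.cleanup().toposort()
    onnx.save(gs.export_onnx(graph), "yolov7-tiny-nms-nhwc.onnx")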

I ended up giving up on this idea as I was not too sure about the graph structure back then.

PyTorch model: I did not consider performing the “surgery” on the PyTorch model as my top choice, because it involves a “Non Max Suppression” operation after the forward pass of the model. So even though I can wrap the PyTorch model (an nn.Module) with the permute operation up front, I would still need to figure out how the non max suppression gets into the ONNX exported model (as a graph operation instead of some kind of looping program code; I admit I felt amazed when I saw this).

The NonMaxSuppression operation inside the ONNX exported graph, visualized with Netron (https://netron.app/)

So what I did was open the source code and trace the YOLOv7 export.py code. I saw the lines trying to register an NMS plugin for ONNX, so I guessed the non max suppression is pre-implemented as an ONNX operator, and this is where the magic happens.
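
In other words, the Python-side NMS call is mapped to the built-in ONNX NonMaxSuppression operator at export time. A simplified sketch of the mechanism (a hypothetical custom torch.autograd.Function with a symbolic method, not the repo's exact code):

    import torch

    class ExportableNMS(torch.autograd.Function):
        # Hypothetical example: emits an ONNX NonMaxSuppression node at export time

        @staticmethod
        def forward(ctx, boxes, scores, max_output_boxes_per_class,
                    iou_threshold, score_threshold):
            # Dummy eager result; only the traced graph matters for the export
            return torch.zeros(0, 3, dtype=torch.int64)

        @staticmethod
        def symbolic(g, boxes, scores, max_output_boxes_per_class,
                     iou_threshold, score_threshold):
            # Inserts the ONNX NonMaxSuppression operator into the exported graph
            return g.op("NonMaxSuppression", boxes, scores,
                        max_output_boxes_per_class, iou_threshold, score_threshold)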

My “solution”

Having figured out where the magic happens, I decided to fork the Github project and update the source code.

For the input NHWC => NCHW, I updated the End2End module in experimental.py

Original
Updated with a permute action (conditioned)
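
Roughly speaking, the change amounts to something like the following (a simplified, hypothetical wrapper rather than the fork's actual diff):

    import torch
    import torch.nn as nn

    class NHWCWrapper(nn.Module):
        # Hypothetical wrapper: accept NHWC input and permute to NCHW before the model

        def __init__(self, model, input_is_nhwc=True):
            super().__init__()
            self.model = model
            self.input_is_nhwc = input_is_nhwc

        def forward(self, x):
            if self.input_is_nhwc:
                # (batch, height, width, channel) -> (batch, channel, height, width)
                x = x.permute(0, 3, 1, 2)
            return self.model(x)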

And for the “splitting” of the output, I instead added a conditional case in the ONNX_ORT module in the experimental.py file

Original
Updated to return 4 elements instead of the concatenated result
return selected_scores.reshape(1,-1), (selected_boxes/640).unsqueeze(-1).permute(2,0,1), selected_categories.reshape(1,-1), selected_indices.shape[0].unsqueeze(-1).float()

( [Start] 2023–05–11 EDIT )

Note that there is an update for the above

return selected_scores.permute(1,0), (selected_boxes/640).unsqueeze(-1).permute(2,0,1), selected_categories.permute(1,0), selected_indices.shape[0].unsqueeze(-1).float()

What changed

The score and class tensors changed from using reshape(1,-1) to permute(1,0)

Why this change

Note that in the ONNX context, even the previous implementation works OK, but it breaks once it goes down the path of conversion to TensorFlow (by the onnx_tensorflow or onnx2tf package) and then to TensorFlow Lite (the official tensorflow.lite.TFLiteConverter).

As a side note, even for trick no. 4 mentioned below, we can see the ONNX graph working (implying this node is connected in the ONNX graph), while after converting to TensorFlow it is no longer connected, so it exists only to fulfill the number-of-outputs requirement of the TensorFlow Lite Object Detection API.

After trying different implementations, I would conclude that the TensorFlow Lite Object Detection API makes the app-side implementation easier, but it has its own rigidness.

For me, I prefer the lower-level TensorFlow Lite Interpreter implementation, and the next exploration is to ditch TensorFlow altogether and explore the possibility of using PyTorch Mobile.

( [End] 2023–05–11 EDIT )

There are 4 tricky parts:

  1. The output sequence: the expected output sequence for TensorFlow Lite models should be (bounding box, confidence score, class, number of detected results), but here I returned (score, bounding box, class, number of detected results). This is purely empirical; I am not sure where in the chain of conversion operations the sequence got messed up, and the Android app would complain about the dimensions (that took me quite some time to figure out). A quick way to inspect the exported model's output order from Python is shown after this list.
  2. The bounding box divided by 640: the YOLOv7 tiny model runs on a fixed 640 x 640 input and the bounding box output range is (0, 640), but the Object Detection API expects ratios, so the simple take is to divide by 640 (which is fixed and therefore hardcoded here, for now).
  3. The number of items detected: in the very first trial I used len(selected_boxes), and there was an error (I forget whether it came from an Android app crash or from the export). The reason is simple: the ONNX export expects to trace the data flow of the tensor, and len(selected_boxes) cuts the trace path, so this number-of-detections output would be missing. With some trial and error, selected_indices.shape[0].unsqueeze(-1).float() does the trick.
  4. And why the .float()? Because the PyTorch Tensor.shape output is an integer, which makes this output different from the other 3 elements (which are all float), and TensorFlow Lite does not like that.
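
One way to sanity check the exported flatbuffer before loading it into the Android app is to inspect its input and output tensors with the Python TFLite Interpreter (the model file name is a placeholder):

    import tensorflow as tf

    interpreter = tf.lite.Interpreter(model_path="yolov7-tiny.tflite")
    interpreter.allocate_tensors()

    # The input should be NHWC, with the dtype depending on the export
    for detail in interpreter.get_input_details():
        print(detail["name"], detail["shape"], detail["dtype"])

    # The 4 outputs: check their order, shapes and dtypes here
    for detail in interpreter.get_output_details():
        print(detail["name"], detail["shape"], detail["dtype"])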

Intermission

There is some more to cover on the Android app part, but I think I will pause here for now and provide more updates later.

Anyhow, the TensorFlow Lite model exported through this journey does work in the Android app, but it's too slow (CPU 2-thread inference took ~1500 ms, compared to ~400 ms for a quantized EfficientDet v2 and ~80 ms for a quantized MobileNet v1).

Quantization of YOLO v7… is a task I have been working on without any success so far…

(To be continued…)
