TL;DR: I have documented my wrangling with ML tools to detect cat faces and do interesting things with them. Cats meet ML: what could be better?
First, below are 3 Google Colab Notebooks that:
a) preprocess data [link]
b) train the model using ssd_mobilenet_v2_coco_2018_03_29 [link]
c) convert to a CoreML model [link]. The result is not perfect, hence this article for discussion. This work also builds on a lot of tutorials and hard work by others, which I will try to credit as best I can.
I recently wanted to replicate, as best I could, the quite robust and amazing realtime video pet Cat Filters in the TikTok/DouYin and SNOW apps (both use the SenseTime-developed SenseMoji API for this filter, I believe). Note that this is NOT mapping a cat texture onto a human head; rather, it is the filter that puts stickers on an actual cat's face.
I did some prior research and came across Cat Hipsterizer, where someone trained a MobileNetV2 on the Crawford Cat Dataset from Kaggle to recognise cat faces and cat landmarks. However, since it uses a classifier network to predict the cat face location, it will always return a position even if there is no cat, and it will not work with more than one cat face in the picture. I wanted to extend this to an SSD or YOLO architecture in order to detect 0->N cat faces and return bounding boxes (as suggested by the TODOs in the Cat Hipsterizer readme). What follows are my attempts to do this.
The general idea is to take the same Crawford Cat Dataset from Kaggle and train a mobile friendly model like ssd_mobilenet_v2_coco_2018_03_29 on Google Colab (because it’s free) and convert it to a CoreML model in order to run it accelerated on an iOS Device.
The first 2 steps are based on articles like https://www.dlology.com/blog/how-to-train-an-object-detection-model-easy-for-free/ and this GitHub repo https://github.com/Tony607/object_detection_demo.
Make sure you have an account on Kaggle.com and grab the JSON config file to sign in. The preprocessing step reads the .cat files and converts them to PASCAL VOC format, splits the dataset into train/eval sets, and finally generates TFRecords from them using some Python scripts from https://github.com/Tony607/object_detection_demo. I then zip this lot up and save it to my Google Drive to avoid having to redo it again and again.
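To give a flavour of what the preprocessing notebook does: each image in the Crawford dataset ships with a .cat sidecar file containing a point count followed by landmark coordinates (usually 9 points: eyes, mouth, and three per ear). A minimal Python sketch of parsing those and deriving a face bounding box from the landmarks; the landmark ordering and the padding margin are assumptions on my part, so check the dataset docs before relying on them.

```python
# Parse a Crawford-style .cat annotation: "<n> x1 y1 x2 y2 ... xn yn".
def parse_cat_file(text):
    """Return the landmark list as (x, y) integer tuples."""
    nums = [int(float(t)) for t in text.split()]
    n = nums[0]
    coords = nums[1:1 + 2 * n]
    return list(zip(coords[0::2], coords[1::2]))

def face_bbox(landmarks, margin=0.1):
    """Derive a face box from the landmarks, padded by a relative margin
    (the PASCAL VOC conversion needs a box, not points)."""
    xs = [p[0] for p in landmarks]
    ys = [p[1] for p in landmarks]
    w, h = max(xs) - min(xs), max(ys) - min(ys)
    pad_x, pad_y = int(w * margin), int(h * margin)
    return (min(xs) - pad_x, min(ys) - pad_y, max(xs) + pad_x, max(ys) + pad_y)
```

From there the boxes go into PASCAL VOC XML and then TFRecords via the object_detection_demo scripts.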
Training The Model
The meat of the work. The notebook loads up the relevant Python frameworks; in particular I use the TensorFlow 1.14 GPU build (explicitly, so that it uses the GPU device on Google Colab if available). The reason for 1.14 is that the Python tfcoreml tools (used to convert this to a CoreML model) die a horrible death on unknown TF nodes if using 1.15 or later. I then load up the training and eval data saved earlier from Google Drive, and set up a pipeline to train (fine-tune? transfer learn? I really hate the terminology here) the ssd_mobilenet_v2_coco_2018_03_29 model, starting from the weights learned on the COCO dataset. The only clever thing I do here is change the model training path to point to my Google Drive, so that checkpoints are saved there and training can resume when my Colab instance disappears. The result of all this is a bunch of checkpoint files saved somewhere.
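The fine-tuning setup mostly amounts to patching the model's pipeline.config before launching training. A rough sketch of the kind of patching involved; the field names match the TF Object Detection API config format, but the helper and paths here are placeholders, not the notebook's exact code.

```python
import re

def patch_pipeline(config_text, ckpt, train_record, eval_record, num_classes=1):
    """Point a stock ssd_mobilenet_v2 pipeline.config at our own data.
    num_classes is 1 because we only detect 'cat face'."""
    config_text = re.sub(r'fine_tune_checkpoint: ".*?"',
                         f'fine_tune_checkpoint: "{ckpt}"', config_text)
    config_text = re.sub(r'num_classes: \d+',
                         f'num_classes: {num_classes}', config_text)
    # In the stock config the first input_path is the train set,
    # the second is the eval set.
    paths = iter([train_record, eval_record])
    config_text = re.sub(r'input_path: ".*?"',
                         lambda m: f'input_path: "{next(paths)}"', config_text)
    return config_text
```

With model_dir pointed at Google Drive, the training loop writes its checkpoints somewhere that survives the Colab instance.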
Running / Converting Model to CoreML
In order to use this model, we first have to freeze the graph at the latest checkpoint and then export it as a SavedModel to be used in an inference call. You can see this happening in the Colab notebook above, which also runs the exported model on a bunch of eval images and visualises the results.
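The only slightly fiddly part of the export step is picking the newest checkpoint out of the training directory (TF checkpoints are named model.ckpt-&lt;step&gt;.*). A small sketch of that selection; the helper name is mine, not the notebook's.

```python
import re

def latest_checkpoint(filenames):
    """Return the checkpoint prefix (e.g. 'model.ckpt-2000') with the
    highest step number, or None if no checkpoints are present."""
    steps = set()
    for name in filenames:
        m = re.match(r'model\.ckpt-(\d+)\.', name)
        if m:
            steps.add(int(m.group(1)))
    return f'model.ckpt-{max(steps)}' if steps else None
```

The resulting prefix is what gets handed to the Object Detection API's export script as the trained checkpoint to freeze.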
After this we convert the model to a CoreML file to be loaded by iOS and run on device. The rest of the notebook is 99% based on this article https://machinethink.net/blog/mobilenet-ssdlite-coreml/, so just go there and follow what he has to say (I just didn't start from his model)! Essentially, at the end of it you get a .mlmodel file to put into your Xcode project.
At this point it is also possible to export a TFLite graph using export_tflite_ssd_graph.py and then convert it to a .tflite model for use on device. However, the model I got from this did not work with the (experimental) TensorFlow Lite GPU flag enabled, so I went down the CoreML path instead (for the hardware acceleration).
Running on iOS device
If you continue to follow https://machinethink.net/blog/mobilenet-ssdlite-coreml/, you can then run the .mlmodel via the iOS Vision framework. Once I get a bounding box for a cat face, I feed it to the Cat Hipsterizer premade cat landmark model (which I run using TensorFlow Lite with GPU acceleration) and render something silly at the resulting landmark positions. I got this running at a fair clip of >50fps, but the results are very flaky: although it seems to work well on still images, it fails on the live camera feed. I have debugged the video->Vision pipeline extensively, so the bug is unlikely to be there (there are a LOT of gotchas around the size of the video buffer, Vision framework coordinates, offsets, and annoyingly different coordinate systems). The failure in the pipeline is mostly down to the mlmodel NOT finding a cat face and returning no bounding boxes.
UPDATE: Surprise! The failure in the pipeline WAS actually down to the stupid iOS Vision, ML, UIView and AVCaptureSession coordinate systems being out of whack: I was feeding in cat faces rotated 90 degrees! The ML model works completely fine now. Clearly I had not debugged the video->Vision pipeline enough. Wait for the next article, wherein I talk about how we visualise effects on the bounding box using SpriteKit (failure), UIImageView (moderate success, and a nightmare with bounds vs frames when combined with CIImage), Core Image (bizarro coordinates when cropping), and of course good ol' VisionKit and ML coordinates.
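For anyone hitting the same wall: part of the confusion is that Vision returns normalised bounding boxes with a bottom-left origin, while UIKit views use a top-left origin, so every box has to be flipped and scaled before drawing. A tiny sketch of that arithmetic (in Python purely to show the maths; the real code is of course Swift on device):

```python
def vision_to_uikit(bbox, view_w, view_h):
    """Convert a normalised Vision bounding box (x, y, w, h), origin at
    bottom-left, into top-left-origin pixel coordinates for a UIKit view."""
    x, y, w, h = bbox
    return (x * view_w,                # left edge scales directly
            (1.0 - y - h) * view_h,    # flip the vertical axis
            w * view_w,
            h * view_h)
```

The 90-degree rotation issue is separate: camera buffers arrive in sensor orientation, so the Vision request has to be told the correct image orientation for the feed, otherwise the model sees sideways cats.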
I’ll upload the video results when I get time — but you can see the mAP and image results in the Google Colab notebooks themselves.
So anyway, while I figure out how to make the above model better, I made a different pipeline that just uses the Cat Hipsterizer premade models to capture cat faces, with the help of the iOS Vision animal detector. This replaces the SSD output with the output of the iOS Vision animal detector before feeding it to the Cat Hipsterizer models. This actually works pretty well, EXCEPT that the iOS Vision animal detector is extremely good: it also detects cats seen from the back, which messes up the premade models. Here are the results on our cat Whisper:
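The glue between the two stages is simple: filter the animal detector's results down to confident cat boxes, pad and clamp the crop, and only then hand the crop to the landmark model. A sketch of that filtering step; the label string, threshold, and padding are assumptions for illustration, not values from the actual app.

```python
def cat_crops(detections, img_w, img_h, min_confidence=0.5, pad=0.15):
    """detections: list of (label, confidence, (x, y, w, h)) in pixels.
    Returns padded (x0, y0, x1, y1) crop rects for confident cat boxes."""
    crops = []
    for label, conf, (x, y, w, h) in detections:
        if label != "Cat" or conf < min_confidence:
            continue
        px, py = w * pad, h * pad
        # Clamp the padded box to the image bounds before cropping.
        x0 = max(0, int(x - px))
        y0 = max(0, int(y - py))
        x1 = min(img_w, int(x + w + px))
        y1 = min(img_h, int(y + h + py))
        crops.append((x0, y0, x1, y1))
    return crops
```

The back-of-cat problem shows up exactly here: the detector happily emits a confident "Cat" box with no face in it, and the landmark model then hallucinates landmarks on fur.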
Anyway, if anyone has any tips on how to improve this, or thinks I am doing this completely wrong, it would be appreciated! You can contact me on Twitter @mrfungfung, or more professionally at firstname.lastname@example.org.