Object Detection Training with Apple’s Turi Create for CoreML (2019 Update)
Taking a look at my last post about CoreML object detection, I decided to update the two-part series with the latest Turi Create (now using Python 3.6). The original parts were about detecting an object in the camera frame (photo or video) and drawing a bounding box around it. This post builds on the previous two parts by adding detection for multiple objects. With iOS 12, the Vision framework makes it easier to find detected objects in Swift, and training completes an order of magnitude faster with GPU support!
Installation
A lot of the setup is the same as before, but the runtime needs updating. As of this writing, the latest version of Python that works with Turi Create 5.4 is 3.6.8. Download and install it, then follow the instructions in the GitHub repo to set up the virtual environment and install Turi Create via pip.
Image Setup
This still uses the Simple Image Annotator from the previous post, which generates a CSV file of all of the image annotations. I recommend keeping all the images in one folder and outputting a single CSV; otherwise you will need to combine the CSV files into one. This step still uses Python 2, which is installed by default on macOS and can be used outside of the virtual environment created above.
To keep this update as simple as possible relative to the previous articles, source imagery for object detection should be kept in the same folder format:
training/
├── images/
│ ├── object/ <- named what you’re detecting
│ └── other_object/ <- what else you’re trying to detect
├── prep.py
└── train.py
Preparing annotations is now slightly different from the previous iteration. It used to involve two steps:
- create the annotations column (convert.py previously)
- prepare the files for training using Turi
I've now simplified this into one step with prep.py.
Because the annotations column requires the label (the object and other_object training labels above), combining this into one step made more sense, and it means no longer having to hardcode a training label name. The script now uses the subdirectory names from images/ (object and other_object above), and can prepare any number of objects for detection.
python prep.py input_file.csv
It now takes an input file location (pointing to the data that you output from Simple Image Annotator), and uses Python 3.
Lines 42–53 of prep.py are where the magic happens: they create the data needed for Turi Create to train the model. If all went according to plan and the number of rows in the CSV matches the number of images, the annotations column on the data object will contain an array of detected objects for each image. At the end of the prep.py run, you should also have a training.sframe directory with everything needed for training your model.
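For reference, the heart of prep.py boils down to something like the sketch below. The CSV column names (image, xMin, xMax, yMin, yMax) are assumptions, so check them against what Simple Image Annotator actually wrote for you; the script in the repo is the source of truth.

import csv
import os
import sys
from collections import defaultdict

import turicreate as tc

csv_path = sys.argv[1]

# Group bounding boxes by image path. The label comes from the subdirectory
# name under images/ (e.g. images/object/photo1.jpg -> "object").
boxes = defaultdict(list)
with open(csv_path) as f:
    for row in csv.DictReader(f):
        label = os.path.basename(os.path.dirname(row['image']))
        x_min, x_max = float(row['xMin']), float(row['xMax'])
        y_min, y_max = float(row['yMin']), float(row['yMax'])
        boxes[row['image']].append({
            'label': label,
            # Turi Create expects center-based pixel coordinates.
            'coordinates': {'x': (x_min + x_max) / 2,
                            'y': (y_min + y_max) / 2,
                            'width': x_max - x_min,
                            'height': y_max - y_min},
        })

# Load the images and attach the annotations column Turi Create expects.
# Note: the 'path' values here may be absolute, so the CSV paths may need
# the same treatment before the lookup below will match.
data = tc.image_analysis.load_images('images/', with_path=True)
data['annotations'] = data['path'].apply(lambda p: boxes.get(p, []))
data.save('training.sframe')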
Training the Model
Nothing changed for the train.py script, except that Turi Create 5 added GPU support. Where training 1000 iterations previously took nearly 3 hours, my test with even more imagery took around 17 minutes: nearly 10 times faster to train! So I bumped the number of iterations up to 2500, which takes about 42 minutes on my Radeon Pro 560 with 4 GB.
All that is needed is to set the modelName variable to whatever you want to use in Xcode.
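For reference, train.py boils down to roughly the following; this is a sketch rather than the exact script from the repo.

import turicreate as tc

modelName = 'modelName'  # whatever you want the model to be called in Xcode

# Load the SFrame produced by prep.py.
data = tc.SFrame('training.sframe')

# Train the detector; Turi Create 5 uses the GPU when one is available.
model = tc.object_detector.create(data,
                                  feature='image',
                                  annotations='annotations',
                                  max_iterations=2500)

# Save the Turi Create model and export the Core ML model for Xcode.
model.save(modelName + '.model')
model.export_coreml(modelName + 'Classifier.mlmodel')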
python train.py
Setting 'batch_size' to 32
Using GPU to create model (AMD Radeon Pro 560)
+--------------+--------------+--------------+
| Iteration | Loss | Elapsed Time |
+--------------+--------------+--------------+
| 1 | 6.255 | 12.2 |
| 11 | 6.269 | 22.4 |
…
| 2500 | 0.734 | 2556.8 |
+--------------+--------------+--------------+
At the end of the train script, you should have a modelName.model folder and a modelNameClassifier.mlmodel file ready to drag and drop into your Xcode project.
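Before heading to Xcode, you can optionally sanity check the exported model's inputs and outputs from Python with coremltools, which is already installed as a Turi Create dependency:

import coremltools

# Load just the model spec and print its inputs, outputs, and metadata.
spec = coremltools.utils.load_spec('modelNameClassifier.mlmodel')
print(spec.description)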
Mlmodel to Xcode
With the upgrade to Xcode 10 and iOS 12, the Vision APIs are more user friendly, and we no longer have to do the detection math we did previously. The Vision API now returns an array of VNRecognizedObjectObservation objects, which makes detection much simpler, since each object has a boundingBox and a matching label for what was found. The updates to the Xcode project are based on the sample project from Apple. The sample project does not walk through detection in a single image, but my GitHub repo for this use case does.
To set up the camera for Vision detection, we need to do some work. The updated code does this in a more compartmentalized fashion:
- setUpAVCapture() sets up the AVCaptureSession to use a camera (the back camera by default) using Apple-recommended methods, and adds the camera layer to the screen
- setUpLayers() adds a CALayer for the camera view and for drawing rectangles around the found objects over the camera layer
- updateLayerGeometry() is from the Apple project and helps position the overlay rectangles when rotating the device
- setUpVision() sets up the machine-learned object detection using the Vision framework and provides a handler for when an object is detected (see the sketch after this list)
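Here is a minimal sketch of what setUpVision() might look like with the new API. The class name DetectionViewController and the stored requests array are illustrative; modelNameClassifier is the class Xcode generates from the .mlmodel dropped in above.

import UIKit
import Vision

class DetectionViewController: UIViewController {
    var requests = [VNRequest]()

    func setUpVision() {
        // modelNameClassifier is the class Xcode generates from the .mlmodel file.
        guard let coreMLModel = try? VNCoreMLModel(for: modelNameClassifier().model) else {
            print("Could not load the Core ML model")
            return
        }
        let objectRecognition = VNCoreMLRequest(model: coreMLModel) { request, _ in
            DispatchQueue.main.async {
                // Pull the VNRecognizedObjectObservation values out of
                // request.results here (see the detection sketch further below).
            }
        }
        // Each camera frame gets run through these requests via a VNImageRequestHandler.
        requests = [objectRecognition]
    }
}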
The main change with detection is that the detected objects are now VNRecognizedObjectObservation objects. This makes it easy to get the relevant information on screen. We start off by checking that the results we get back from a detection are of the new type. The API itself has not changed: it still returns an array of Any objects, for what I assume is backwards compatibility with the array of VNCoreMLFeatureValueObservation it previously returned, so we need to be sure we are getting the iOS 12-only VNRecognizedObjectObservation objects from the results.
- Line 6 takes the first object detected, which is the detection with the highest confidence
- Line 8 converts the detected bounding box from the found rectangle to the position to be displayed on top of your camera view, a very important translation
- Line 10 is a custom method to create a layer that will be added to denote where the bounding box of the detected object is (the sketch after this list mirrors these steps)
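Pulling those steps together, the handler might look something like this sketch. The function name is illustrative, and the overlay layer and camera frame size (detectionOverlay and bufferSize in Apple's sample) are passed in as parameters here to keep it self-contained.

import UIKit
import Vision

func handleDetections(_ results: [Any]?, overlay: CALayer, frameSize: CGSize) {
    // Make sure we really got the new iOS 12 observation type back.
    guard let observations = results as? [VNRecognizedObjectObservation],
          let topObservation = observations.first,        // highest-confidence detection
          let topLabel = topObservation.labels.first else { return }

    // Vision's bounding box is normalized (0-1, origin at the bottom left);
    // convert it into pixel coordinates that match the camera layer.
    let objectBounds = VNImageRectForNormalizedRect(topObservation.boundingBox,
                                                    Int(frameSize.width),
                                                    Int(frameSize.height))

    // Draw a simple box for now; the styled, confidence-aware layer is sketched below.
    let boxLayer = CALayer()
    boxLayer.frame = objectBounds
    boxLayer.borderWidth = 2
    boxLayer.borderColor = UIColor.cyan.cgColor
    boxLayer.name = topLabel.identifier    // keep the label around for later
    overlay.addSublayer(boxLayer)
}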
The custom create-layer method does more than just put a rectangle around the detected area: the rectangle style changes based on the confidence, or likelihood, that it detected one of the objects. Dashed lines mean the confidence is under my threshold of 45%, and colors change based on the detected object, so red represents a different object than cyan.
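A sketch of that confidence-aware layer (the helper name, colors, and exact dash pattern here are illustrative, not the project's exact code):

import UIKit
import Vision

// Build a bounding box layer whose style reflects what was detected and how confidently.
func makeBoundingBoxLayer(frame: CGRect, label: String, confidence: VNConfidence) -> CAShapeLayer {
    let layer = CAShapeLayer()
    layer.frame = frame
    layer.path = CGPath(rect: layer.bounds, transform: nil)
    layer.fillColor = nil
    layer.lineWidth = 2
    // A different color per label, so red means a different object than cyan.
    layer.strokeColor = (label == "object" ? UIColor.cyan : UIColor.red).cgColor
    // Detections under the 45% confidence threshold get a dashed outline.
    if confidence < 0.45 {
        layer.lineDashPattern = [4, 4]
    }
    return layer
}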
The advantage of this new Vision API is that we no longer have to determine the predictions ourselves, removing the Turi Create step (or the method call to predictionsFromMultiDimensionalArrays) that we had to do previously. We no longer need to:
Check the IoU (see Evaluation) between it and all the remaining predictions. Remove (or suppress) any prediction with an IoU above a pre-determined threshold (the nmsThreshold we extracted from the meta data).
Conclusion
Turi Create has made some great advancements since its release in late 2017. Along with Apple's updates to the Vision framework, training and detecting objects in images and live video is easier and faster than ever. I am excited to see what iOS 13 brings for CoreML and machine learning, and whether any updates will make it possible to do object detection training right in Xcode 11.