Object Detection Training with Apple’s Turi Create for CoreML (2019 Update)

Ryan Jones
Published in Slalom Build
Jul 9, 2019

Taking a look at my last post about CoreML object detection, I decided to update the two-part series with the latest Turi Create (now using Python 3.6). The original parts covered detecting an object in the camera frame (photo or video) and drawing a bounding box around it. This post builds on those two parts by adding detection for multiple objects. With iOS 12, the Vision framework makes it easier to work with detected objects in Swift, and with GPU support, training now completes roughly an order of magnitude faster!

Installation

A lot of the setup is the same as before, but the runtime needs some updates. The latest version of Python that works with Turi Create 5.4 is 3.6.8 as of this writing. Download and install it, then follow the instructions in the GitHub repo for setting up the virtual environment and installing Turi Create via pip.

Image Setup

This still uses the Simple Image Annotator from the previous post, which generates a CSV file of all the image annotations. It is recommended to use one folder for all the images and output to a single CSV; otherwise, you will have to combine multiple CSV files into one. This step still uses Python 2, which is installed by default on macOS and can be used outside of the virtual environment created above.

To keep the update from the previous articles as simple as possible, source imagery for object detection should be kept in the same folder structure:

training/
├── images/
│ ├── object/ <- named what you’re detecting
│ └── other_object/ <- what else you’re trying to detect
├── prep.py
└── train.py

Preparing annotations is now slightly different from the previous iteration. It used to involve two steps:

  • create the annotations column (convert.py previously)
  • prepare the files for training using Turi

I’ve now simplified this into one step with prep.py.

Because the annotations column requires the label (the object and other_object training labels above), combining this into one step made more sense and removed the need to hardcode a training label name. The script now uses the subdirectory names under images/ (object and other_object above) as labels and can prepare any number of objects for detection.

python prep.py input_file.csv

The script now takes an input file location (pointing to the CSV you output from Simple Image Annotator) and uses Python 3.

Lines 42–53 are where the magic happens: they create the data needed for Turi Create to train the model. If all went according to plan and the number of rows in the CSV matches the number of images, the annotations column on the data object will contain an array of the annotated objects for each image. At the end of the prep.py run, you should also have a training.sframe directory with everything needed for training your model.

Training the Model

Nothing changed in the train.py script itself, except that Turi Create 5 added GPU support. Where 1000 iterations previously took nearly 3 hours to train, my test with even more imagery took around 17 minutes: nearly 10 times faster. So I bumped the number of iterations up to 2500, which takes about 42 minutes to train on my Radeon Pro 560 with 4 GB.

All that is needed is to set the modelName variable to whatever you want to use in Xcode.

python train.py
Setting 'batch_size' to 32
Using GPU to create model (AMD Radeon Pro 560)
+--------------+--------------+--------------+
| Iteration | Loss | Elapsed Time |
+--------------+--------------+--------------+
| 1 | 6.255 | 12.2 |
| 11 | 6.269 | 22.4 |

| 2500 | 0.734 | 2556.8 |
+--------------+--------------+--------------+

At the end of the train script, you should have a modelName.model folder and a modelNameClassifier.mlmodel file ready to drag and drop into your Xcode project.

Mlmodel to Xcode

With the upgrade to Xcode 10 and iOS 12, the Vision APIs are more user friendly. We no longer have to do a lot of the detection math we did previously: the Vision API now returns an array of VNRecognizedObjectObservation objects, which makes detection much simpler, since each observation has a boundingBox and a matching label for what was found. The updates to the Xcode project are based on the sample project from Apple. The sample project does not walk through detection in a single image, but my GitHub repo for this use case does.

To set up the camera for Vision detection, we need to do some work. The updated code does this in a more compartmentalized fashion:

  • setUpAVCapture() sets up the AVCaptureSession to use a camera (back by default) using Apple-recommended methods, and adds the camera layer to the screen
  • setUpLayers() adds a CALayer for the camera view and another for drawing rectangles around the found objects on top of the camera layer
  • updateLayerGeometry() is from the Apple project and helps keep the overlay rectangles aligned when the device rotates
  • setUpVision() sets up the machine-learned object detection using the Vision framework and provides a handler that is called whenever objects are detected (a sketch follows this list)
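
Here is a minimal sketch of what that Vision setup and the per-frame detection can look like. ObjectDetector stands in for the class Xcode generates from whatever modelName you trained, and the function names here are illustrative rather than the exact ones in the project:

import AVFoundation
import CoreML
import Vision

// Build the Vision request for the trained model. ObjectDetector is a
// placeholder for the class Xcode generates from the .mlmodel you dragged in.
// The handler closure receives the raw results (handled further down).
func makeObjectDetectionRequest(handler: @escaping ([Any]) -> Void) -> VNCoreMLRequest? {
    guard let visionModel = try? VNCoreMLModel(for: ObjectDetector().model) else {
        print("Could not load the Core ML model")
        return nil
    }

    let request = VNCoreMLRequest(model: visionModel) { request, _ in
        // Vision calls back on a background queue; hop to main before drawing.
        DispatchQueue.main.async {
            handler(request.results ?? [])
        }
    }
    request.imageCropAndScaleOption = .scaleFill
    return request
}

// Call this from the AVCaptureVideoDataOutput delegate for every camera frame.
func runDetection(on pixelBuffer: CVPixelBuffer, using request: VNCoreMLRequest) {
    let imageHandler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer,
                                             orientation: .up,
                                             options: [:])
    try? imageHandler.perform([request])
}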

The main change with detection is that the detected objects are now VNRecognizedObjectObservation objects. This makes it easy to get the relevant information on screen.

We start by checking that the results we get back from a detection are of the new type. The API itself has not changed: it still returns an array of Any, presumably for backwards compatibility with the array of VNCoreMLFeatureValueObservation it returned previously, so we need to be sure we are getting the iOS 12 only VNRecognizedObjectObservation objects from the results.

  • Line 6 takes the first object detected, which has the highest detection confidence
  • Line 8 converts the detected bounding box from the returned rectangle to the position where it should be displayed on top of your camera view, a very important translation
  • Line 10 calls a custom method that creates a layer marking where the bounding box of the detected object is (sketched below)
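
Put together, the result handling can look roughly like the sketch below. The drawVisionRequestResults, detectionOverlay, bufferSize, and detectionLayer names are placeholders I am using for illustration, not necessarily the names in the actual project:

import UIKit
import Vision

// Handle the results of a Vision request. bufferSize is the pixel size of the
// camera buffer and detectionOverlay is the CALayer the boxes are drawn on.
func drawVisionRequestResults(_ results: [Any],
                              bufferSize: CGSize,
                              detectionOverlay: CALayer) {
    // Clear the boxes drawn for the previous frame.
    detectionOverlay.sublayers = nil

    // Only the iOS 12 VNRecognizedObjectObservation type carries both the
    // boundingBox and its classification labels.
    guard let observation = results
            .compactMap({ $0 as? VNRecognizedObjectObservation })
            .first,
          let topLabel = observation.labels.first else { return }

    // Vision's boundingBox is normalized (0 to 1); convert it into the pixel
    // coordinates of the camera buffer so the box lines up with the video.
    let objectBounds = VNImageRectForNormalizedRect(observation.boundingBox,
                                                    Int(bufferSize.width),
                                                    Int(bufferSize.height))

    // Create the styled rectangle layer (sketched after the next paragraph).
    let boxLayer = detectionLayer(bounds: objectBounds,
                                  label: topLabel.identifier,
                                  confidence: topLabel.confidence)
    detectionOverlay.addSublayer(boxLayer)
}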

The custom create layer method does more than just put a rectangle around the detected area: the rectangle style changes based on the confidence, or likelihood, that it detected one of the objects. Dashed lines mean the confidence is under my threshold of 45%, and colors change based on the detected object, so red represents a different object than cyan.
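
As a sketch of that styling (the label name, colors, and 45% threshold simply mirror the description above; a CAShapeLayer is one way to get the dashed outline):

import UIKit

// Create a bounding-box layer: dashed when confidence is below 45%, with the
// stroke color chosen per label (label name and colors are illustrative).
func detectionLayer(bounds: CGRect, label: String, confidence: Float) -> CALayer {
    let color: UIColor = (label == "object") ? .cyan : .red

    let shape = CAShapeLayer()
    shape.frame = bounds
    shape.path = UIBezierPath(rect: CGRect(origin: .zero, size: bounds.size)).cgPath
    shape.fillColor = UIColor.clear.cgColor
    shape.strokeColor = color.cgColor
    shape.lineWidth = 2
    if confidence < 0.45 {
        // Low-confidence detections get a dashed outline.
        shape.lineDashPattern = [6, 4]
    }
    return shape
}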

The advantage of this new Vision API is that we no longer have to compute the predictions ourselves, removing the Turi Create post-processing step (the predictionsFromMultiDimensionalArrays method call) that we had to do previously. We no longer need to:

Check the IoU (see Evaluation) between it and all the remaining predictions. Remove (or suppress) any prediction with an IoU above a pre-determined threshold (the nmsThreshold we extracted from the metadata).

Conclusion

Turi Create has made some great advancements since it was released in late 2017. Along with Apple's updates to the Vision framework, training and detecting objects in images and live video is easier and faster than ever. I am excited to see what iOS 13 brings with regard to CoreML and machine learning, and whether any updates will make it possible to do object detection training right in Xcode 11.

GitHub Links
