Create your own Object Recognizer — ML on iOS

Hunter Ward
9 min read · Nov 12, 2017


OVERVIEW

GETTING STARTED WITH CORE ML ON iOS

▼ CREATE YOUR OWN OBJECT RECOGNIZER

Creating your own image classifier (aka, your own object recognizer) is a powerful skill. The growing intelligence of our devices means we’re expecting more from them. Whether we want to find the model number of a faulty valve, recognize a dog’s breed, or count the calories in that donut 🍩 … 😋 — the ability for apps to run their own Object Recognizer is a game-changer.

Previously we covered how to load and run a Core ML model in an iOS app. Today we’re going to use our own data to generate our own Core ML model.

Previously we covered iOS inferencing and input (green). Today: the whole pipeline (blue).

This tutorial provides a broad overview of the entire Machine Learning (ML) pipeline: from getting your own data, to running it in your app. For this article, we’ll be using a service so that we can skip the nitty-gritty parts involved in ‘Custom Training’ and ‘Model Conversion’.

As we noted previously: Starting out, our focus will be Real-Time Computer Vision (CV) on mobile. Right now we’re focusing on Object Recognition (not Object Detection, 2D/3D Localization, or time-based recognition … yet).

After an hour, your Object Recognizing application isn’t expected to be perfect or fool-proof. However, it should be sufficiently accurate for POCs, MVPs, and experimental apps. The robustness of your applications will grow alongside your understanding of Machine Learning.

Duration

  • 1–2 hours 🕜

Assumed Knowledge

  • Novice understanding of iOS Development
  • Zero understanding of ML, Deep Learning, Computer Vision

Technologies

  • iOS 11, Core ML, Xcode, Swift 4.0, ARKit

A Five Step Process

You can pick any dataset you’d like: Toys, Utensils, Furniture, etc. Below we’ve opted for an object that you all hopefully have access to: your hands 👐.

▿ Step 1 of 5. Consider your Data ▿

Consider the types of Objects you want to Recognize
Depending on your dataset, there can be many ‘classes’ (groups) within it that we can choose to recognize. For example, there are multiple handshapes: the Ok Hand 👌, Pointing Hand ☝️, Binocular Hands, etc. You may go through these five steps multiple times until you intuit which objects work best. For this example, we’ll skip ahead and reveal that the Open Hand 🖐 and Fist 👊 are the easiest handshapes to distinguish.

Develop a Gut Feeling, an Intuition

Getting an intuition for the type of images required to generate a machine learning model isn’t easy when you’re just starting out. You’re not going to become a master overnight; however, this tutorial will get you started.

I recommend writing down your predicted anti-biases (robust attributes, e.g. lighting) and biases (more on that below). You might be guessing at this point, but it’ll get you thinking about your potential data.

I also recommend experimenting with Google Creative Lab’s (GCL’s) Teachable Machine. Try getting the site running on your mobile and training it on your objects.

The Neural Network they use is different from ours. However, Teachable Machine should give you a sense of what data works, what doesn’t, and which biases you should be attentive to.

Growing intuition takes time — it’s usually a matter of simply experimenting and reading more about ML, DNNs, and CV. In later articles, I’ll point you towards resources that can effectively and efficiently grow your intuition.

Acknowledge Biases & Robustness
With any dataset you create, you should be aware of its inherent biases and robustness. For this tutorial, we are going to train a model to recognize your personal objects.

To generalize beyond your personal artifacts (e.g. to recognize any animal, any hand ✋, etc.) — you would expand your dataset to cover more anti-biases.

For this introductory tutorial, we aim to be:
Unbiased (Robust) towards:
• Different Lighting Conditions (🌕 / 🌔 / 🌓 / 🌒 / 🌑 / 🌘 / 🌗 / 🌖 / 🌕)
• Different Distances of the Hand from the Camera
• Different Locations of the Hand in an image

Biased Towards:
• Your chosen Hand (Left / Right)
• Your own, Human, Hand (👋🏿 / 👋🏾 / 👋🏽 / 👋🏼/ 👋🏻)
• Your own Hand’s accompanying Accessories / Tattoos (💍 / 💅 / ⌚️) etc.
• The environment or room you’re in
• Back of the Hand (robust to 45° rotations)

Other biases / anti-biases — There are always going to be other biases. To guard against the important ones, frequent and diverse user tests should be conducted.


: ::::

▿ Step 2 of 5. Collect Data ▿

To generate our Core ML model, we are going to use a Microsoft Cognitive Service called “Custom Vision”. This service is very beginner friendly as it allows us to create a model with very little data. Usually, we’d need much MUCH more data.

For this dataset, we want photos in these 3 classes (categories):
• Five Hand 🖐
• Fist Hand 👊
• No Hand ❎ *
* It’s generally good practice to have a class for ‘not-item’.

Keep your anti-biases in mind when you’re taking images. In this case, we want some hands to be really close to the camera, some to be really far. Some hands might be to the very left of the camera, some to the bottom right. Some hands might be taken in a well-lit room, others might be harshly lit from the side.

For ‘Custom Vision’: I recommend at most 60 images for each class on your first try. After your first test in step 5, you may decide to add 40 images to improve particular anti-biases, then an additional 20, etc. If your object is fairly simple (e.g. a toy hot dog 🌭), you could even get away with ~10 diverse images.

( Note: Microsoft’s Custom Vision is limited to a total of 1,000 training images and 50 unique tags. )

Shrinking Images

Microsoft Custom Vision doesn’t like large photos like the ones taken with an iPhone. I recommend shrinking them down to roughly 300 pixels in width before uploading. On Mac, you can use ‘Preview’ to resize multiple images.
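If you’d rather script this than click through Preview, below is a rough macOS sketch (my own helper, not part of the tutorial’s tooling) that uses ImageIO to shrink every .jpg in a folder to roughly 300 px on its longest side. The folder path and the in-place overwrite are assumptions; adjust to taste.

// A rough macOS helper (my own sketch, not part of the tutorial’s tooling) that
// shrinks every .jpg in a folder to roughly 300 px using ImageIO.
// Run it as a script:  swift shrink.swift ~/Desktop/hand-photos
import Foundation
import ImageIO
import CoreServices

let folder = URL(fileURLWithPath: CommandLine.arguments[1])
let files = (try? FileManager.default.contentsOfDirectory(at: folder, includingPropertiesForKeys: nil)) ?? []

for url in files where url.pathExtension.lowercased() == "jpg" {
    guard let source = CGImageSourceCreateWithURL(url as CFURL, nil) else { continue }

    // Ask ImageIO for a thumbnail whose longest side is at most ~300 px.
    let options: [CFString: Any] = [
        kCGImageSourceCreateThumbnailFromImageAlways: true,
        kCGImageSourceThumbnailMaxPixelSize: 300
    ]
    guard let small = CGImageSourceCreateThumbnailAtIndex(source, 0, options as CFDictionary),
          let destination = CGImageDestinationCreateWithURL(url as CFURL, kUTTypeJPEG, 1, nil) else { continue }

    // Overwrites the file in place; keep a backup of your originals.
    CGImageDestinationAddImage(destination, small, nil)
    CGImageDestinationFinalize(destination)
}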

:: :::

▿ Step 3 of 5. Generate a Model using Microsoft Custom Vision ▿

To turn our data into a Core ML model, we’ll be using a Microsoft Azure Cognitive Service: ‘Custom Vision’. This is one of the few free, emerging services that can export Core ML models.

Traditionally, we’d start with frameworks like Caffe, Keras, or TensorFlow to train our own models. Services like Custom Vision abstract away that training, using their own evolving model architectures. This makes it super easy to start generating our own models.

3.1 Create an Account at Custom Vision. (It’s free!)

3.2 Create a New Project with the Domain “General (compact)”*
*“General (compact)” is currently the only domain that can export a Core ML mlmodel file.

3.3 Create Tags for each of your classes: Five-Hand 🖐, Fist-Hand 👊, No-Hand ❎.

3.4 Upload Your Data
I recommend grouping your images beforehand in their respective folders, so you can tag them collectively during the upload (rather than one by one on the browser).
Your images should’ve already been resized from Step 2 to ~ 300px in width.

3.5 Train your Data
Hit ‘Train’ to train your data! Microsoft Custom Vision spoils us by doing the training for us (training will be covered in future articles).

After a couple of minutes, you’ll get some Precision and Recall values. (Roughly: precision is how often a ‘Five-Hand’ prediction really is a five hand; recall is how many of the actual five hands the model catches.) I would NOT worry much about these values, as long as they’re above ~60%. Ensuring that your data covers your real-world anti-biases matters much more. A robust dataset with anti-biases will inherently result in lower precision and recall than if your data was biased/homogeneous to begin with.

3.6 Export the model
Hit ‘Export’. Choose iOS 11 (Core ML). Click Download.

Tip: I recommend renaming the .mlmodel file to something descriptive (e.g. HandModel.mlmodel) before importing it into Xcode, since Xcode generates a Swift class named after the file.

Note: Microsoft’s Custom Vision trains a compact model that is only a couple of megabytes. More advanced models, like Inception V3, are around 100 MB. We’ll cover these in future articles.

::: ::

▿ Step 4 of 5. Load and run the Core ML model in your app ▿

Now that we have our Core ML model, we can load it into our iOS app!

You can review the previous tutorial on Loading and Running Core ML on iOS.

Gesture Recognition 101 — Github

This template (with source code) on GitHub can optionally be used to load your model. It runs machine learning in real time alongside ARKit, which means it’s “AR-ready” for your own applications. 🤓
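If you’d rather wire things up from scratch, here’s a minimal sketch of the same idea (not the exact template code): hand every ARKit camera frame to Vision and print the top classification. It assumes you renamed the exported file to HandModel.mlmodel, so that Xcode generates a HandModel class; substitute whatever name you chose.

import UIKit
import ARKit
import Vision

class RecognizerViewController: UIViewController, ARSessionDelegate {

    let sceneView = ARSCNView()

    // Wrap the Xcode-generated Core ML class for use with the Vision framework.
    // "HandModel" is an assumed name; use whatever you called your .mlmodel file.
    lazy var visionModel: VNCoreMLModel = {
        return try! VNCoreMLModel(for: HandModel().model)
    }()

    override func viewDidLoad() {
        super.viewDidLoad()
        sceneView.frame = view.bounds
        view.addSubview(sceneView)

        // Receive every camera frame via ARSessionDelegate.
        sceneView.session.delegate = self
        sceneView.session.run(ARWorldTrackingConfiguration())
    }

    // Called by ARKit for each new camera frame.
    func session(_ session: ARSession, didUpdate frame: ARFrame) {
        let request = VNCoreMLRequest(model: visionModel) { request, _ in
            guard let results = request.results as? [VNClassificationObservation],
                  let top = results.first else { return }
            print("\(top.identifier): \(Int(top.confidence * 100))%")  // e.g. "Five-Hand: 92%"
        }
        // Crop the center square and scale it to the model's input size.
        request.imageCropAndScaleOption = .centerCrop

        // The camera buffer is rotated relative to portrait, hence .right.
        let handler = VNImageRequestHandler(cvPixelBuffer: frame.capturedImage,
                                            orientation: .right,
                                            options: [:])
        try? handler.perform([request])
    }
}

In a real app you’d throttle this (say, classify every few frames on a background queue) rather than hitting the model on every single frame, but this is enough to see your tags streaming into the console.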

:::: :

▿ Step 5 of 5. Test, Rinse and Repeat ▿

With our Core ML model running live on our iPhone or iPad we can walk around and test it out! Try and see if you can fool it based on your anti-bias goals. Note down areas where it’s getting easily fooled (e.g. if it mistakes a wooden floor for a hand), then repeat these steps with some additional training images to make it a bit more robust. (Try at least 3 rounds of this.) If things aren’t working out — don’t be afraid to try something new!

:::::

Future Work

Keep experimenting and growing your intuition. As you play and read, your intuition of this space will grow.

Optional Tip 1: In the short-term, there are established Computer Vision methods that you can try to improve your image classifier. These include pre-processing techniques like Frame Differencing, Edge Detection, etc.
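As a taste of Tip 1, here’s a small Core Image sketch (an assumed approach of my own, not code from this tutorial) that turns a camera frame into an edge map before classification. One caveat: if you pre-process frames at inference time, your training images need the same pre-processing, or the model will be looking at unfamiliar input.

import CoreImage
import CoreVideo

// Turn a camera frame into an edge map. You could feed the result to Vision
// via VNImageRequestHandler(ciImage:options:) instead of the raw frame.
func edgeMap(of pixelBuffer: CVPixelBuffer) -> CIImage? {
    let input = CIImage(cvPixelBuffer: pixelBuffer)
    // CIEdges is a built-in Core Image filter; intensity exaggerates the edges.
    let filter = CIFilter(name: "CIEdges")
    filter?.setValue(input, forKey: kCIInputImageKey)
    filter?.setValue(4.0, forKey: kCIInputIntensityKey)
    return filter?.outputImage
}

Frame differencing works along the same lines, e.g. compositing the current frame over the previous one with the built-in ‘CIDifferenceBlendMode’ filter.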

Optional Tip 2: Because our model is so small, it is incredibly efficient. For 60 FPS real-time applications, this means you still have plenty of computing power left to run other models. You could also run this model multiple times on different areas of the camera image to figure out the location of your object (i.e. object detection and localization via a faux Region-based Convolutional Neural Network, R-CNN).
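To make Tip 2 concrete, here’s a rough sketch of that faux R-CNN idea, reusing the visionModel from the Step 4 sketch. It assumes your ‘open hand’ tag is literally named "Five-Hand"; swap in your own tag string.

import Vision
import CoreVideo

// Classify a coarse grid of regions and keep the one where "Five-Hand" scores highest.
func locateHand(in pixelBuffer: CVPixelBuffer, using visionModel: VNCoreMLModel) -> CGRect? {
    var bestRegion: CGRect?
    var bestConfidence: Float = 0

    // A coarse 3x3 grid of overlapping regions, in Vision's normalized coordinates.
    let gridSize = 3
    let regionSize: CGFloat = 0.5
    let step = (1 - regionSize) / CGFloat(gridSize - 1)

    for row in 0..<gridSize {
        for col in 0..<gridSize {
            let region = CGRect(x: CGFloat(col) * step, y: CGFloat(row) * step,
                                width: regionSize, height: regionSize)

            // Classify only this crop of the image.
            let request = VNCoreMLRequest(model: visionModel)
            request.regionOfInterest = region

            let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer,
                                                orientation: .right,
                                                options: [:])
            try? handler.perform([request])

            if let results = request.results as? [VNClassificationObservation],
               let fiveHand = results.first(where: { $0.identifier == "Five-Hand" }),
               fiveHand.confidence > bestConfidence {
                bestConfidence = fiveHand.confidence
                bestRegion = region
            }
        }
    }
    // 0.8 is an arbitrary threshold; tune it against your own tests.
    return bestConfidence > 0.8 ? bestRegion : nil
}

Nine extra classifications per frame is still cheap for a model this small, though you’d want to run them off the main thread.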

In the future, we’ll be diving into advanced tools and techniques for Training and Converting ML models. This will augment our control and understanding of our own models.

Next: Custom Training and Conversion

Until Next Time!

▶ CONVERSION… (WIP)

▶ TRAINING… (WIP)
