Building a real-time object recognizer for iOS

Using CoreML and Swift

Image credit: Apple

One of the exciting features announced at WWDC 2017 was CoreML. It’s Apple’s framework for integrating machine learning into your app, with everything running offline 😉.

Core ML lets you integrate a broad variety of machine learning model types into your app. In addition to supporting extensive deep learning with over 30 layer types, it also supports standard models such as tree ensembles, SVMs, and generalized linear models. Because it’s built on top of low level technologies like Metal and Accelerate, Core ML seamlessly takes advantage of the CPU and GPU to provide maximum performance and efficiency. You can run machine learning models on the device so data doesn’t need to leave the device to be analyzed. (Source: Apple Documentation on Machine Learning)

The importance of CoreML becomes clear when you look closely at how and where the prediction happens. Until now, apps that used machine learning typically ran their predictions on a hosted server. For an object recognition app, you had to capture a frame on the device, send it to the prediction engine, wait until the image was completely uploaded to the server, and finally receive the output. This approach has two main issues: network delay and user privacy. With CoreML, all of this processing can happen on the device, which reduces both.

Building from scratch

Let’s use CoreML to implement a simple on-device object recognizer. I’ll go through the important steps without covering the basics of iOS or Swift.

The first thing we need is Xcode 9 and a device running iOS 11.

If you are not familiar with machine learning, take a look at the brief introduction here. Or you can get a very high-level overview from here.

Machine Learning

It’s the technology that gives computers the ability to learn, without the solution to a problem being explicitly coded.

There are basically two processes involved here: training and prediction.

Training is the process where we give the model different sets of inputs (and the corresponding outputs) so that it can learn the patterns in them. The trained model is then given an input it has not seen before and predicts an output based on its earlier observations.
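To make the two phases concrete, here is a toy sketch in plain Swift. It is a nearest-centroid classifier that I made up purely for illustration (the type name, feature values and labels are all hypothetical) and it has nothing to do with how CoreML or Inception work internally. It only shows the shape of "train on labelled inputs, then predict on an unseen input".

```swift
// Toy illustration only: a nearest-centroid classifier, unrelated to CoreML's API.
struct NearestCentroidClassifier {
    // One averaged feature vector per label, filled in by `train`.
    private(set) var centroids: [String: [Double]] = [:]

    // Training: learn from inputs paired with their known outputs (labels).
    mutating func train(_ examples: [(features: [Double], label: String)]) {
        var sums: [String: [Double]] = [:]
        var counts: [String: Int] = [:]
        for example in examples {
            let current = sums[example.label]
                ?? Array(repeating: 0.0, count: example.features.count)
            sums[example.label] = zip(current, example.features).map { $0.0 + $0.1 }
            counts[example.label, default: 0] += 1
        }
        centroids = [:]
        for (label, sum) in sums {
            let n = Double(counts[label] ?? 1)
            centroids[label] = sum.map { $0 / n }
        }
    }

    // Prediction: label an input the model has not seen before
    // by picking the label whose centroid is closest.
    func predict(_ features: [Double]) -> String? {
        func squaredDistance(_ a: [Double], _ b: [Double]) -> Double {
            return zip(a, b).reduce(0.0) { sum, pair in
                sum + (pair.0 - pair.1) * (pair.0 - pair.1)
            }
        }
        let closest = centroids.min { lhs, rhs in
            squaredDistance(lhs.value, features) < squaredDistance(rhs.value, features)
        }
        return closest?.key
    }
}

// Made-up training data: two features per example.
var classifier = NearestCentroidClassifier()
classifier.train([
    (features: [1.0, 1.0], label: "small"),
    (features: [2.0, 1.5], label: "small"),
    (features: [9.0, 8.0], label: "large"),
    (features: [10.0, 9.5], label: "large"),
])
let guess = classifier.predict([1.5, 1.0])
```

A pre-trained model like the one used below has already been through the training phase, so the app only ever performs the prediction step.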

Choosing a model

So the first thing we have to do is select a good model for our project. There are lots of pre-trained models available for image recognition, or you can train your own model to get a better experience.

Good models are available in CoreML format from Apple’s Machine Learning portal. If you have your own model, you can convert it to a CoreML-supported model using the Core ML Tools (coremltools) package available from Apple.

I chose the Inception v3 model available on the Apple portal.

Inception v3 — Detects the dominant objects present in an image from a set of 1000 categories such as trees, animals, food, vehicles, people, and more.

Creating iOS project

Create a basic iOS project in Swift with a single view controller that contains a video preview layer and a label.

Getting frames from the video preview

Getting the current frame from the camera works the usual way, with nothing specific to CoreML. This is explained in this invasivecode article.
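If you would rather see the idea than read a separate article, the capture setup can be sketched roughly like this. The class name `FrameGrabber` and its `onFrame` callback are names I am inventing here, not something taken from the linked article or from my repo, so treat this as one possible arrangement of the standard AVFoundation calls.

```swift
import AVFoundation

// Hypothetical helper: forwards every camera frame to `onFrame` as a CVPixelBuffer.
final class FrameGrabber: NSObject, AVCaptureVideoDataOutputSampleBufferDelegate {
    let session = AVCaptureSession()
    // Called on the background capture queue for each delivered frame.
    var onFrame: ((CVPixelBuffer) -> Void)?
    private let queue = DispatchQueue(label: "frame-grabber.video")

    // Wires the default camera to a video data output.
    // Returns false if no camera is available or the session rejects it.
    func configure() -> Bool {
        session.beginConfiguration()
        defer { session.commitConfiguration() }

        guard let camera = AVCaptureDevice.default(for: .video),
              let input = try? AVCaptureDeviceInput(device: camera),
              session.canAddInput(input) else { return false }
        session.addInput(input)

        let output = AVCaptureVideoDataOutput()
        // Drop frames that arrive while a previous one is still being processed.
        output.alwaysDiscardsLateVideoFrames = true
        output.setSampleBufferDelegate(self, queue: queue)
        guard session.canAddOutput(output) else { return false }
        session.addOutput(output)
        return true
    }

    // AVCaptureVideoDataOutputSampleBufferDelegate
    func captureOutput(_ output: AVCaptureOutput,
                       didOutput sampleBuffer: CMSampleBuffer,
                       from connection: AVCaptureConnection) {
        guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }
        onFrame?(pixelBuffer)
    }
}
```

In the view controller you would call `configure()`, attach an `AVCaptureVideoPreviewLayer(session:)` for the preview, and then call `session.startRunning()`. The app also needs an `NSCameraUsageDescription` entry in Info.plist, and because frames arrive on a background queue, any label update has to be dispatched to the main queue.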

Using Inception v3 for prediction

Consider our Inception model as a black box: given an input image, it returns the probability of the image belonging to each of the categories it knows.

Download the model from Apple’s portal and drag and drop it (Inceptionv3.mlmodel) into your project. You can see the model description in Xcode.

Inceptionv3.mlmodel in the model viewer of Xcode

You can see that the model takes a 299x299 pixel image as input and gives two outputs:

  • Most likely category the image falls into
  • List of probabilities of each category

We can use either of these outputs to determine the category. I used the first one, which is a String, and printed it directly on the screen.

You can also see that Xcode creates a Swift class (Inceptionv3.swift) directly from the mlmodel file. You don’t have to make any extra changes for this.
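If you prefer the second output, you have to pick the winners yourself. Here is a small sketch of how that could look, assuming the probabilities arrive as a dictionary from category name to probability. The function name and the sample values are made up for illustration and are not real model output.

```swift
// Returns the `count` most probable categories, highest probability first.
// `probabilities` mirrors the model's second output (category name -> probability).
func topCategories(in probabilities: [String: Double],
                   count: Int = 3) -> [(label: String, probability: Double)] {
    return probabilities
        .sorted { $0.value > $1.value }
        .prefix(max(0, count))
        .map { (label: $0.key, probability: $0.value) }
}

// Made-up sample values, not real model output.
let sample = ["tabby cat": 0.71, "tiger cat": 0.18, "lynx": 0.06, "teapot": 0.01]
let best = topCategories(in: sample, count: 2)
```

Showing the top two or three categories with their probabilities can be more informative than a single label when the model is unsure.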


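We can make use of the prediction API that Xcode generates for the model: a prediction(image:) method on the generated Inceptionv3 class, whose result carries the two outputs listed above.

Prediction is as simple as calling that method with an input image.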
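As a rough sketch, here is what one prediction could look like. To keep the snippet self-contained I wrote it against the generic `MLModel` API instead of the generated class. The function name is hypothetical, and the feature names `"image"` and `"classLabel"` are my assumption of what the model viewer shows, so check them against your copy of the model.

```swift
import CoreML
import CoreVideo

// Runs one prediction and returns the most likely category, if any.
// Assumption: the model's input feature is named "image" and its
// String output is named "classLabel", as shown in Xcode's model viewer.
func dominantCategory(in pixelBuffer: CVPixelBuffer,
                      using model: MLModel) throws -> String? {
    let input = try MLDictionaryFeatureProvider(
        dictionary: ["image": MLFeatureValue(pixelBuffer: pixelBuffer)]
    )
    let output = try model.prediction(from: input)
    return output.featureValue(for: "classLabel")?.stringValue
}

// With the class Xcode generates from Inceptionv3.mlmodel, the same call is typed:
//     let label = try Inceptionv3().prediction(image: pixelBuffer).classLabel
```

The generated class is the more convenient route inside the app, since it gives you typed properties instead of string keys. Its underlying `MLModel` is available through the `model` property if you want to use a helper like the one above.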

However, it requires a CVPixelBuffer rather than a UIImage as input. The conversion is explained beautifully here, in the section ‘Machine learning and Vision’.

I have created a UIImage extension that abstracts this conversion together with the resize step.
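The extension itself lives in the repo, so what follows is not that code but a minimal sketch of the same idea: scale an image to the model's input size and render it into a pixel buffer. The function name is hypothetical, and I work on `CGImage` here (you can get one from `UIImage.cgImage`) so that the snippet depends only on CoreGraphics and CoreVideo.

```swift
import CoreGraphics
import CoreVideo

// Hypothetical helper: draws `image` scaled to `width` x `height`
// into a newly created 32-bit BGRA pixel buffer.
func makePixelBuffer(from image: CGImage,
                     width: Int = 299,
                     height: Int = 299) -> CVPixelBuffer? {
    let attributes = [
        kCVPixelBufferCGImageCompatibilityKey as String: true,
        kCVPixelBufferCGBitmapContextCompatibilityKey as String: true,
    ] as CFDictionary

    var buffer: CVPixelBuffer?
    let status = CVPixelBufferCreate(kCFAllocatorDefault, width, height,
                                     kCVPixelFormatType_32BGRA, attributes, &buffer)
    guard status == kCVReturnSuccess, let pixelBuffer = buffer else { return nil }

    // The base address is only valid while the buffer is locked.
    CVPixelBufferLockBaseAddress(pixelBuffer, [])
    defer { CVPixelBufferUnlockBaseAddress(pixelBuffer, []) }

    // Bitmap context that writes straight into the pixel buffer's memory.
    guard let context = CGContext(
        data: CVPixelBufferGetBaseAddress(pixelBuffer),
        width: width,
        height: height,
        bitsPerComponent: 8,
        bytesPerRow: CVPixelBufferGetBytesPerRow(pixelBuffer),
        space: CGColorSpaceCreateDeviceRGB(),
        bitmapInfo: CGImageAlphaInfo.premultipliedFirst.rawValue
            | CGBitmapInfo.byteOrder32Little.rawValue
    ) else { return nil }

    // Stretches the image to fill the buffer; the aspect ratio is not preserved.
    context.draw(image, in: CGRect(x: 0, y: 0, width: width, height: height))
    return pixelBuffer
}
```

Two things this sketch deliberately ignores: it stretches instead of cropping, and going through `cgImage` drops the orientation stored in a `UIImage`, so a rotated photo will reach the model rotated.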

Final Architecture


The app correctly recognized the object in almost all of the inputs provided.

Get the complete code from the repo — ⭐️ if you like it.