An Introduction to ARKit — Vision Integration

Mohammed Ibrahim
5 min read · Jul 4, 2018


This part of the series will be about using Vision, Apple’s computer vision and machine learning framework, with ARKit.

  1. Image Tracking
  2. World Mapping
  3. Object Scanning
  4. Vision Integration

The Basics

With ARKit, you can integrate Vision with AR in real-time to produce seamless and effective user experiences.

What is Vision?

Vision, on its own, is a computer vision framework that uses an ML model to classify what the phone’s camera is seeing, based on the data the model was trained on.

For example, you can gather a bunch of photos of different electrical components such as motors or resistors, give them labels, and train an MLImageClassifier model to identify those components.

What I’m hoping to teach you today is how to incorporate Vision with AR, so that you can use ARKit to give real-time feedback on what Vision finds.

For example, say I wanted to make an app that identified where my toast came from (an example from WWDC). I would train an ML model to identify different kinds of toast, set it up with Vision, and then use AR to put a toast icon on top of the toast once a match has been made.

Creating a CoreML Image Classifier

The first thing we need in our project is an MLImageClassifier.

Creating an image classifier:

  1. Take photos of different items. These can be different types of tools, different types of fruit, or even different brands of pop — whatever you want.
  2. Group items of the same kind in a folder, with the name of the folder being the label itself. For example, if you were comparing different types of fruit (let’s say bananas and apples), you would put all the photos of bananas in a folder named “Banana” and all the photos of apples in a folder named “Apple”.
  3. Open an Xcode Playground and enter the following code:

import CreateMLUI

let builder = MLImageClassifierBuilder()
builder.showInLiveView()

4. Open the Live View and run the playground. The MLImageClassifierBuilder interface will appear, with a box asking you to drop images to begin training.

5. Put all of your labeled folders into one parent folder and drop that parent folder into the box. The model will then start training.

6. Test the model if you have a testing set. If you have more labeled images (sorted into folders the same way), you can drop them in after training to check your model’s accuracy.

7. Export the model as a .mlmodel
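
If you prefer, you can also train the model in code instead of through the drag-and-drop live view. Here is a minimal sketch using the CreateML framework in a macOS playground; the folder paths and the Toast.mlmodel file name are placeholders, not part of the workflow above:

import CreateML
import Foundation

// Point Create ML at a parent folder whose subfolders are named after the labels,
// e.g. Training/Banana and Training/Apple.
let trainingDir = URL(fileURLWithPath: "/path/to/Training")
let classifier = try MLImageClassifier(trainingData: .labeledDirectories(at: trainingDir))

// Optionally evaluate against a separately labeled testing set.
let testingDir = URL(fileURLWithPath: "/path/to/Testing")
let evaluation = classifier.evaluation(on: .labeledDirectories(at: testingDir))
print("Classification error: \(evaluation.classificationError)")

// Export the trained model as a .mlmodel for use in your iOS project.
try classifier.write(to: URL(fileURLWithPath: "/path/to/Toast.mlmodel"))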

Setting up Vision with your ML Model

The next thing you need to do is set up the Vision framework with your ML Model.

You will need to make a classification request object. Enter the following code, replacing Model() in the first line with the name of the class Xcode generated for your .mlmodel file, followed by “()”:

private lazy var classificationRequest: VNCoreMLRequest = {
    do {
        let model = try VNCoreMLModel(for: Model().model)

        let request = VNCoreMLRequest(model: model, completionHandler: { [weak self] request, error in
            self?.processClassifications(for: request, error: error)
        })
        request.imageCropAndScaleOption = .centerCrop
        return request
    } catch {
        fatalError("Failed to load Vision ML model: \(error)")
    }
}()

Process Classifications (Method)

The code snippet above refers to the processClassifications(for:error:) method. This method handles the completion of the Vision request and chooses which results to display or return. Let’s write the method now:

The first part checks that the passed-in request object actually has results, i.e. that they are not nil. If it doesn’t, the method prints the error and exits early:

guard let results = request.results else {
    print("Unable to classify image.\n\(error!.localizedDescription)")
    return
}

Next, we create a classifications constant. This holds the request’s results, cast to an array of VNClassificationObservation objects:

let classifications = results as! [VNClassificationObservation]

The next thing to do is filter the results and pick out the best one: the first result that the model is at least 50% certain of.

let bestResult = classifications.first(where: { result in result.confidence > 0.5 })

You can then choose what to do with that result.

To get the label of the result, split the identifier of bestResult on commas and take the .first component. Note that first(where:) returns an optional, so bestResult needs to be unwrapped or accessed with optional chaining:

let result = bestResult?.identifier.split(separator: ",").first

Please note that all of the above is encapsulated in the method processClassifications(for request: VNRequest, error: Error?).
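
For reference, here is the whole method assembled from the pieces above. This is just a sketch; what you do with the final label (printing it, placing an AR annotation, and so on) is up to your app:

func processClassifications(for request: VNRequest, error: Error?) {
    guard let results = request.results else {
        print("Unable to classify image.\n\(error!.localizedDescription)")
        return
    }
    let classifications = results as! [VNClassificationObservation]

    // Take the first result the model is at least 50% confident about.
    guard let bestResult = classifications.first(where: { $0.confidence > 0.5 }),
          let label = bestResult.identifier.split(separator: ",").first else {
        return
    }

    // Use the label however your app needs, for example to place an AR icon
    // over the recognized object.
    print("Identified: \(label) (confidence: \(bestResult.confidence))")
}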

Classifying What You See (Method)

The next thing we need to do is write a method that is responsible for actually running classification requests on everything the camera sees.

This is the classifyCurrentImage() method. Let’s go over how to make it.

The first thing you will need to do is declare a variable of type CVPixelBuffer. This buffer holds the raw pixels of the camera frame currently shown on screen, in a form that Vision can digest.

Declare the variable outside the method itself.

var currentBuffer: CVPixelBuffer?

Let’s begin the classifyCurrentImage() method. Everything we do from here will be inside the method. The first thing you need to do is work out the orientation in which the camera image should be interpreted. Simply create a constant from the orientation the device is currently in:

let orientation = CGImagePropertyOrientation(UIDevice.current.orientation)
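
One caveat: CGImagePropertyOrientation has no built-in initializer that takes a UIDeviceOrientation, so the line above relies on a small extension like the one in Apple’s ARKit and Vision sample code. A version of that mapping looks like this:

import ImageIO
import UIKit

extension CGImagePropertyOrientation {
    // The camera sensor captures in landscape, so portrait device orientations
    // map to rotated image orientations.
    init(_ deviceOrientation: UIDeviceOrientation) {
        switch deviceOrientation {
        case .portraitUpsideDown: self = .left
        case .landscapeLeft:      self = .up
        case .landscapeRight:     self = .down
        default:                  self = .right
        }
    }
}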

Next, create a request handler. A request handler, in this case a VNImageRequestHandler , is an object that processes one or more image analysis requests pertaining to a single image.

Here, we pass in the currentBuffer variable we created earlier, outside the method.

let requestHandler = VNImageRequestHandler(cvPixelBuffer: currentBuffer!, orientation: orientation)

Lastly, we perform a classification request using the request handler we just created, and release the pixel buffer afterwards (freeing memory and letting the next frame be processed).

do {
    defer { self.currentBuffer = nil }
    try requestHandler.perform([self.classificationRequest])
} catch {
    print("Error: Vision request failed with error \"\(error)\"")
}
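
Assembled, the method might look like this. The serial dispatch queue is an optional addition modeled on Apple’s sample code (not part of the walkthrough above) that keeps Vision work off the main thread; its label is a placeholder:

// Optional: a serial queue so Vision work stays off the main thread.
private let visionQueue = DispatchQueue(label: "com.example.visionQueue")

private func classifyCurrentImage() {
    // Interpret the pixel buffer using the device's current orientation.
    let orientation = CGImagePropertyOrientation(UIDevice.current.orientation)

    let requestHandler = VNImageRequestHandler(cvPixelBuffer: currentBuffer!,
                                               orientation: orientation)
    visionQueue.async {
        do {
            // Release the buffer when done so the next frame can be processed.
            defer { self.currentBuffer = nil }
            try requestHandler.perform([self.classificationRequest])
        } catch {
            print("Error: Vision request failed with error \"\(error)\"")
        }
    }
}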

The Final Step

The final thing you need to do is call classifyCurrentImage() from a session delegate method such as func session(_ session: ARSession, didUpdate frame: ARFrame). That way the method is called on every frame, constantly tracking and identifying what the camera sees.

Good practice is to also make sure that the device’s tracking state is normal. In other words, check that the device isn’t shaking and that nothing in front of the camera is moving so rapidly that classification would fail. Calling the method only when the phone is still and tracking is optimal avoids additional, unsuccessful requests, which saves battery and keeps performance up.

if case .normal = frame.camera.trackingState {
    self.currentBuffer = frame.capturedImage
    classifyCurrentImage()
}
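
In context, the delegate method might look like this. The currentBuffer == nil check is an extra guard (borrowed from Apple’s sample) that makes sure a new request only starts after the previous buffer has been released:

// ARSessionDelegate callback, called once per frame.
func session(_ session: ARSession, didUpdate frame: ARFrame) {
    // Only classify when tracking is stable and the previous request has finished
    // (classifyCurrentImage() sets currentBuffer back to nil when it is done).
    guard currentBuffer == nil, case .normal = frame.camera.trackingState else {
        return
    }
    currentBuffer = frame.capturedImage
    classifyCurrentImage()
}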

That’s it! Vision on its own is a powerful and efficient framework. With AR, you can now make your classification applications more immersive and fun, giving the end user a better experience.

Have fun with it! Please leave a comment or note if you have any questions or feedback. Thank you!
