Computer Vision

Live Face Tracking on iOS using Vision Framework

Anurag Ajwani
Onfido Product and Tech
11 min read · May 28, 2019


Have you wondered how apps such as Snapchat add props to faces on screen? Or how they change your face in funny ways? They do so by first detecting where each face is, and then where each facial feature is.

In this post I will show you how to detect faces and their features using the Vision framework in an iOS app. We will receive live frames from the front camera of an iOS device, analyse each frame using the Vision framework’s face detection, and finally display the detected face and its features on the screen.

In this post we won’t solve computer vision challenges ourselves or write any computer vision algorithms. Instead we will leverage functionality provided by Apple that already solves these challenges.

What is Vision Framework?

Introduced at WWDC 2017, the Vision framework is a module offered by Apple to iOS developers. It contains functionality that solves common computer vision challenges.

Why use Vision Framework?

The solutions offered by the Vision framework are not new. In fact many of the solutions initially offered by the Vision framework were already present in the CoreImage framework, which Apple has shipped since iOS 5, whereas Vision is only available from iOS 11. Additionally, all the functionality we will be using in this post is supported by both CoreImage and Vision.
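For comparison, here is a rough sketch of what face detection looks like with CoreImage’s CIDetector. We won’t use it in this tutorial, and the helper function name below is my own:

import CoreImage

// Hypothetical helper, for comparison only: face detection with CoreImage,
// available since iOS 5.
func detectFacesWithCoreImage(in image: CIImage) -> [CIFaceFeature] {
    let detector = CIDetector(ofType: CIDetectorTypeFace,
                              context: nil,
                              options: [CIDetectorAccuracy: CIDetectorAccuracyHigh])
    let faces = detector?.features(in: image) as? [CIFaceFeature] ?? []
    for face in faces {
        // bounds is in the image's coordinate space (origin at the bottom-left)
        print("face at \(face.bounds), has left eye: \(face.hasLeftEyePosition)")
    }
    return faces
}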

So why use the Vision framework at all if it’s only supported by recent iOS versions?

Apple claims that the underlying implementations of the Vision algorithms are more accurate and less prone to returning false positives or negatives. The framework leverages the latest machine learning (deep learning) and computer vision techniques, which improve both results and performance.

Additionally, active users on iOS versions prior to iOS 11 account for less than 3.5% based on Mixpanel’s data (the number can vary based on your target audience). At the time of writing, iOS 12 accounts for nearly 90% of users.

Getting Started

In this section we will create an iOS app project from scratch. The app will:

  1. Stream the front camera feed onto the screen
  2. Detect faces and draw bounding boxes on screen

Time to dive into the code. Let’s start by creating a new project from scratch. Open Xcode and then from the menu select File > New > Project… Next, select the Single View App template and click Next.

Name the project FaceTracker and then click Next. Finally store the project wherever convenient for you and then click Finish.

The template will create the necessary code with the right configuration to run the app and display a blank screen.

1. Streaming the front camera feed onto the screen

As the first step we want to be able to stream the camera feed from the front camera to the screen.

As part of the project setup from the Single View App template, Xcode will have created a file named ViewController.swift. This file contains the ViewController class which is the controller for the blank screen presented when you run the app. Right now it does nothing.

Let’s add the camera feed to the ViewController. First we require access to the front camera. To do so we will make use of the AVFoundation framework provided by Apple on the iOS platform. AVFoundation gives us access to the camera and lets us output the camera frames in the format we need for processing.

To gain access to the AVFoundation framework add the following line after the import UIKit statement in ViewController.swift:

import AVFoundation

Next we need to create an instance of a class called AVCaptureSession. Within the ViewController class add the following line:

private let captureSession = AVCaptureSession()

This class coordinates multiple inputs, such as the microphone and camera, into multiple outputs, such as video. For this post we only need a single input (the front camera) and a single output (raw frames).

Next let’s add the front camera as an input to our captureSession. Add the following function to your ViewController:

private func addCameraInput() {
    // Find the first available front-facing camera device.
    guard let device = AVCaptureDevice.DiscoverySession(
        deviceTypes: [.builtInWideAngleCamera, .builtInDualCamera, .builtInTrueDepthCamera],
        mediaType: .video,
        position: .front).devices.first else {
            fatalError("No front camera device found, please make sure to run FaceTracker on an iOS device and not a simulator")
    }
    // Wrap the camera in a capture input and add it to the session.
    let cameraInput = try! AVCaptureDeviceInput(device: device)
    self.captureSession.addInput(cameraInput)
}

The function above starts by fetching the front camera device. Note that the iOS simulator doesn’t have access to the Mac’s camera, so if no camera is found the app will crash.

We aren’t managing camera permissions. If the user denies the app permission to the camera the app will crash.
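In a production app you would check and request camera permission before configuring the capture session, rather than letting the app crash. A minimal sketch of what that could look like (the function name requestCameraAccessIfNeeded is my own, not part of this tutorial’s project):

private func requestCameraAccessIfNeeded() {
    switch AVCaptureDevice.authorizationStatus(for: .video) {
    case .authorized:
        break // access already granted, safe to configure the capture session
    case .notDetermined:
        AVCaptureDevice.requestAccess(for: .video) { granted in
            // called on an arbitrary queue; hop to the main queue before touching UI
            print("camera access granted: \(granted)")
        }
    default:
        // .denied or .restricted: show an alert pointing the user to Settings
        print("camera access unavailable")
    }
}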

Next we create a device input from the camera and finally add it to our capture session.

Let’s call addCameraInput to carry out the action of adding the front camera to the capture session. Add the following line at the end of the viewDidLoad function:

self.addCameraInput()

We aren’t done with camera access yet. In order for an app to access the camera, it must declare in its Info.plist file that it requires the camera. Open Info.plist and add a new entry to the property list: for the key use NSCameraUsageDescription and for the value enter Required for front camera access.

Info.plist file
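If you prefer editing Info.plist as source (right-click the file and choose Open As > Source Code), the same entry looks roughly like this:

<key>NSCameraUsageDescription</key>
<string>Required for front camera access</string>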

Now that we have the front camera as an input, we have to display its feed on screen. For this task we are going to make use of the AVCaptureVideoPreviewLayer class, a subclass of CALayer used for displaying the camera feed. Let’s add it as a new property of our ViewController. Add the following line:

private lazy var previewLayer = AVCaptureVideoPreviewLayer(session: self.captureSession)

The property is declared lazy because it references captureSession, which must be initialised before the preview layer. The lazy keyword defers the initialisation until the property is first accessed, by which point captureSession is already available.

Next we have to add the previewLayer as a sublayer of the container UIView of our ViewController. Add the following function to do so:

private func showCameraFeed() {
    self.previewLayer.videoGravity = .resizeAspectFill
    self.view.layer.addSublayer(self.previewLayer)
    self.previewLayer.frame = self.view.frame
}

Let’s call this function. At the end of viewDidLoad add the following line:

self.showCameraFeed()

Next we need to adapt the preview layer’s frame when the container’s view frame changes; it can potentially change at different points of the UIViewController instance lifecycle. Add the following function to do so:

override func viewDidLayoutSubviews() {
    super.viewDidLayoutSubviews()
    self.previewLayer.frame = self.view.frame
}

Finally we have to tell the captureSession to start coordinating its inputs, preview and outputs. At the end of viewDidLoad add the following line:

self.captureSession.startRunning()

Run the app and watch the front camera feed!

2. Detect faces and draw bounding boxes on screen

For the second part we will continuously extract live images from the camera feed and run face detection on each image. If a face and its features are detected, we will draw a bounding box onto the screen.

Let’s first extract the live camera feed images. For this task we need our captureSession to output each frame, which is what AVCaptureVideoDataOutput does. Within the ViewController class create an instance of AVCaptureVideoDataOutput by adding the following line:

private let videoDataOutput = AVCaptureVideoDataOutput()

Next let’s add videoDataOutput as an output to our captureSession. Additionally let’s tell the videoDataOutput to deliver each frame to our ViewController. Add the following function to ViewController class:

private func getCameraFrames() {
    // Ask for frames as 32-bit BGRA pixel buffers.
    self.videoDataOutput.videoSettings = [(kCVPixelBufferPixelFormatTypeKey as NSString): NSNumber(value: kCVPixelFormatType_32BGRA)] as [String: Any]
    self.videoDataOutput.alwaysDiscardsLateVideoFrames = true
    // Deliver each frame to the ViewController on a background serial queue.
    self.videoDataOutput.setSampleBufferDelegate(self, queue: DispatchQueue(label: "camera_frame_processing_queue"))
    self.captureSession.addOutput(self.videoDataOutput)
    guard let connection = self.videoDataOutput.connection(with: AVMediaType.video),
        connection.isVideoOrientationSupported else { return }
    connection.videoOrientation = .portrait
}

At this point the function above will produce a compiler error.

In the getCameraFrames function we have told the videoDataOutput to give the ViewController the camera frames. However in order to do so ViewController must conform to the AVCaptureVideoDataOutputSampleBufferDelegate protocol. Add AVCaptureVideoDataOutputSampleBufferDelegate to the ViewController declaration right after UIViewController. Your ViewController declaration must now look like:

class ViewController: UIViewController, AVCaptureVideoDataOutputSampleBufferDelegate {
....

The error should now have disappeared.

Note that in getCameraFrames we told videoDataOutput to send live camera frames to our ViewController, and also on which queue to deliver them. We created an instance of DispatchQueue that delivers each frame on a serial queue that is not the main one. The main thread is used to render views, and it is good practice to offload intensive tasks from it. We’ll come back to this point when we want to draw where the face and its features are on the screen.
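The pattern we will follow for the rest of the post is: do the heavy frame processing on that background serial queue, and hop back to the main queue for anything that touches the UI. A rough illustration of the pattern (not code to add to the project):

// Illustration only: process frames off the main thread, update UI on it.
let processingQueue = DispatchQueue(label: "camera_frame_processing_queue")
processingQueue.async {
    // heavy work, e.g. running a Vision request on the latest camera frame
    DispatchQueue.main.async {
        // UI work, e.g. adding CAShapeLayer drawings to the view
    }
}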

Next let’s add the function to receive the frames from the captureSession:

func captureOutput(
    _ output: AVCaptureOutput,
    didOutput sampleBuffer: CMSampleBuffer,
    from connection: AVCaptureConnection) {
    print("did receive frame")
}

The captureOutput function will receive the frames from the videoDataOutput.

So far in this section we have created a function that configures the video output and a function that receives its frames. However we still need to call the getCameraFrames function. Before self.captureSession.startRunning() in viewDidLoad add the following line:

self.getCameraFrames()
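For reference, after adding this call your viewDidLoad should look roughly like this, assuming you added the calls in the order described above:

override func viewDidLoad() {
    super.viewDidLoad()
    self.addCameraInput()
    self.showCameraFeed()
    self.getCameraFrames()
    self.captureSession.startRunning()
}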

Run the app and watch the console (from the menu select View > Debug Area > Activate Console).

console log

Whilst the app is running the console will log did receive frame continuously.

Now that we have the image from the camera feed, we need to process it and run face detection on it. We will do so using the Vision framework’s VNDetectFaceLandmarksRequest. To access Vision we must first import it in the file using it. At the top of the ViewController.swift file, below the other import statements, add the following line:

import Vision

Next let’s add a new function to detect faces and face features:

private func detectFace(in image: CVPixelBuffer) {
    let faceDetectionRequest = VNDetectFaceLandmarksRequest(completionHandler: { (request: VNRequest, error: Error?) in
        DispatchQueue.main.async {
            if let results = request.results as? [VNFaceObservation], results.count > 0 {
                print("did detect \(results.count) face(s)")
            } else {
                print("did not detect any face")
            }
        }
    })
    let imageRequestHandler = VNImageRequestHandler(cvPixelBuffer: image, orientation: .leftMirrored, options: [:])
    try? imageRequestHandler.perform([faceDetectionRequest])
}

The function above creates a face detection request and performs it on the camera frame passed in. For now we check whether one or more faces were detected in the image and print the number of detected faces to the console.

Let’s call our new face detection function and then test our FaceTracker app. Change the implementation of the captureOutput function to the following:

func captureOutput(_ output: AVCaptureOutput,
                   didOutput sampleBuffer: CMSampleBuffer,
                   from connection: AVCaptureConnection) {
    // Extract the pixel buffer (the raw image) from the sample buffer.
    guard let frame = CMSampleBufferGetImageBuffer(sampleBuffer) else {
        debugPrint("unable to get image from sample buffer")
        return
    }
    self.detectFace(in: frame)
}

Run the app and watch the console.

console log

We’re approaching the final steps. Next we will draw the bounding box of each detected face on the screen. Each returned result contains a property named boundingBox for the observed face. We will take each face in turn and extract its bounding box.

Before we add the function to draw the faces’ bounding boxes, let’s first create a new property to hold the drawings in our ViewController instance:

private var drawings: [CAShapeLayer] = []

We will use this property to reference any drawings on screen.

Add the following functions to handle the results from face detection:

private func handleFaceDetectionResults(_ observedFaces: [VNFaceObservation]) {
    self.clearDrawings()
    let facesBoundingBoxes: [CAShapeLayer] = observedFaces.map({ (observedFace: VNFaceObservation) -> CAShapeLayer in
        let faceBoundingBoxOnScreen = self.previewLayer.layerRectConverted(fromMetadataOutputRect: observedFace.boundingBox)
        let faceBoundingBoxPath = CGPath(rect: faceBoundingBoxOnScreen, transform: nil)
        let faceBoundingBoxShape = CAShapeLayer()
        faceBoundingBoxShape.path = faceBoundingBoxPath
        faceBoundingBoxShape.fillColor = UIColor.clear.cgColor
        faceBoundingBoxShape.strokeColor = UIColor.green.cgColor
        return faceBoundingBoxShape
    })
    facesBoundingBoxes.forEach({ faceBoundingBox in self.view.layer.addSublayer(faceBoundingBox) })
    self.drawings = facesBoundingBoxes
}

private func clearDrawings() {
    self.drawings.forEach({ drawing in drawing.removeFromSuperlayer() })
}

handleFaceDetectionResults starts off by clearing any drawings on screen. As the face changes position, the old bounding box drawings are no longer correct, so we remove them before drawing the face’s new position.

Note that the face observation result returns a bounding box with the location of the face in the image. However the image coordinate space differs from the screen’s, so we have to convert the face location from image coordinates to screen coordinates. For this, Apple provides a conversion function on the AVCaptureVideoPreviewLayer instance, layerRectConverted(fromMetadataOutputRect:), which converts from image coordinates to screen coordinates.

Let’s call our new handleFaceDetectionResults from the detectFace function. Change the detectFace implementation to the following:

private func detectFace(in image: CVPixelBuffer) {
    let faceDetectionRequest = VNDetectFaceLandmarksRequest(completionHandler: { (request: VNRequest, error: Error?) in
        DispatchQueue.main.async {
            if let results = request.results as? [VNFaceObservation] {
                self.handleFaceDetectionResults(results)
            } else {
                self.clearDrawings()
            }
        }
    })
    let imageRequestHandler = VNImageRequestHandler(cvPixelBuffer: image, orientation: .leftMirrored, options: [:])
    try? imageRequestHandler.perform([faceDetectionRequest])
}

Run the app and play with it. You should now see a green box around all faces.

Face with bounding box drawn

As a final step let’s draw some face features. I won’t cover all the face features available; however, what we cover here can be applied to any of them. For this post we will draw the eyes on screen. We can access the face feature paths through the landmarks property of each VNFaceObservation result. Add the following functions to handle the face landmarks from face detection:

private func drawFaceFeatures(_ landmarks: VNFaceLandmarks2D, screenBoundingBox: CGRect) -> [CAShapeLayer] {
    var faceFeaturesDrawings: [CAShapeLayer] = []
    if let leftEye = landmarks.leftEye {
        let eyeDrawing = self.drawEye(leftEye, screenBoundingBox: screenBoundingBox)
        faceFeaturesDrawings.append(eyeDrawing)
    }
    if let rightEye = landmarks.rightEye {
        let eyeDrawing = self.drawEye(rightEye, screenBoundingBox: screenBoundingBox)
        faceFeaturesDrawings.append(eyeDrawing)
    }
    // draw other face features here
    return faceFeaturesDrawings
}

private func drawEye(_ eye: VNFaceLandmarkRegion2D, screenBoundingBox: CGRect) -> CAShapeLayer {
    let eyePath = CGMutablePath()
    let eyePathPoints = eye.normalizedPoints
        .map({ eyePoint in
            CGPoint(
                x: eyePoint.y * screenBoundingBox.height + screenBoundingBox.origin.x,
                y: eyePoint.x * screenBoundingBox.width + screenBoundingBox.origin.y)
        })
    eyePath.addLines(between: eyePathPoints)
    eyePath.closeSubpath()
    let eyeDrawing = CAShapeLayer()
    eyeDrawing.path = eyePath
    eyeDrawing.fillColor = UIColor.clear.cgColor
    eyeDrawing.strokeColor = UIColor.green.cgColor
    return eyeDrawing
}

The functions above convert each eye, if detected, into a drawing. Note that in the drawEye function we have to convert each point of the eye contour to screen points, as we did for the face bounding box. However these points are normalized relative to the face’s bounding box, and I wasn’t able to find a convenience function to convert each relative point to screen points easily. Thus in the function above we do the conversion manually.

Let’s call our new functions. Change the implementation of handleFaceDetectionResults to the following:

private func handleFaceDetectionResults(_ observedFaces: [VNFaceObservation]) {
    self.clearDrawings()
    let facesBoundingBoxes: [CAShapeLayer] = observedFaces.flatMap({ (observedFace: VNFaceObservation) -> [CAShapeLayer] in
        let faceBoundingBoxOnScreen = self.previewLayer.layerRectConverted(fromMetadataOutputRect: observedFace.boundingBox)
        let faceBoundingBoxPath = CGPath(rect: faceBoundingBoxOnScreen, transform: nil)
        let faceBoundingBoxShape = CAShapeLayer()
        faceBoundingBoxShape.path = faceBoundingBoxPath
        faceBoundingBoxShape.fillColor = UIColor.clear.cgColor
        faceBoundingBoxShape.strokeColor = UIColor.green.cgColor
        var newDrawings = [CAShapeLayer]()
        newDrawings.append(faceBoundingBoxShape)
        if let landmarks = observedFace.landmarks {
            newDrawings = newDrawings + self.drawFaceFeatures(landmarks, screenBoundingBox: faceBoundingBoxOnScreen)
        }
        return newDrawings
    })
    facesBoundingBoxes.forEach({ faceBoundingBox in self.view.layer.addSublayer(faceBoundingBox) })
    self.drawings = facesBoundingBoxes
}

And that’s it! 🎉 Run the app and check out the results!

Face with bounding box and eye contours drawing

Summary

In this post we have learnt to:

  • Stream the camera feed from our iOS devices to the screen
  • Handle live images from the camera in our app
  • Use the Vision framework to process the image and detect face and face features
  • Convert image coordinates to screen coordinates
  • Draw onto the screen using CAShapeLayer

Final notes

In this post we have learned how to solve computer vision challenges without knowing anything about computer vision, by leveraging functionality that already solves them. However, in some cases built-in or off-the-shelf functionality might not solve your problem, and you might need to tackle the computer vision challenge yourself. In my previous post I showed how to do so by leveraging OpenCV, a C++ library that contains functionality aimed at real-time image processing.

You can find the full source code to this post here.

If you liked this post please don’t forget to clap. Stay tuned for more posts on iOS development! Follow me on Twitter or Medium!
