Understanding the Deep Learning Framework Behind Apple Face ID

Venkat Rajgopal
3 min read · Jun 26, 2018


Prior knowledge of neural networks and convolutional neural networks is required to follow this method.

Introduction

Convolutional Neural Networks (CNNs) are variants of regular neural networks. CNNs are inspired by biological processes: the connectivity pattern between neurons resembles the structure of the animal visual cortex. Since the visual cortex is one of the most powerful visual processing systems known, it seemed natural to mimic its behavior. The early work of Hubel and Wiesel [3] suggests that the visual cortex consists of a complex arrangement of neurons. Neurons in the visual cortex are sensitive to small sub-regions of the visual field, responding to stimuli only within a restricted region known as the “receptive field”. The receptive fields of different neurons partially overlap in such a way that together they cover the entire visual field.
In the early ’90s, Yann LeCun [4] first applied CNNs to character recognition tasks. CNNs gained much wider recognition following the work of A. Krizhevsky et al. [5], which won the ImageNet classification challenge in 2012 using a deep convolutional network.

Face ID

Face ID derives inspiration from OverFeat [1], which drew an equivalence between the fully connected layers of a neural network and convolutional layers with valid convolutions of filters of the same spatial dimensions as the input. In other words, the fully connected layers can be expressed as convolutions, which lets the same network slide over inputs of arbitrary size.
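The OverFeat equivalence can be checked numerically: a valid convolution of a filter with the same spatial size as its input has exactly one output position, and that output is the same dot product a fully connected layer would compute. The 8×8 size below is an arbitrary choice for illustration, not a detail from Apple's network:

```python
import numpy as np

# Hypothetical sizes; chosen only to illustrate the equivalence.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))   # one-channel input "tile"
w = rng.standard_normal((8, 8))   # filter with the same spatial size
b = 0.5

# Fully connected view: flatten input and weights, take a dot product.
fc_out = x.ravel() @ w.ravel() + b

# Convolutional view: a "valid" convolution of an 8x8 filter over an
# 8x8 input has a single output position; written as cross-correlation
# (the convention deep learning frameworks use), it is the same sum.
conv_out = np.sum(x * w) + b

print(np.isclose(fc_out, conv_out))
```

The two views agree to floating-point precision, which is why a classifier trained on fixed-size tiles can later be applied convolutionally.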

The architecture basically performs two important tasks:

  1. A binary classification to predict whether there is a face in the input.
  2. A regression to predict the bounding-box parameters that localize the face in the input.
[Figure: The deep convolutional network architecture for face detection.]
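The two tasks above can be sketched as two output "heads" on top of a shared CNN feature. The feature dimension, weights, and head shapes below are made up for illustration; Apple has not published the real architecture:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

feat_dim = 64                                     # assumed feature size
W_cls = rng.standard_normal(feat_dim) * 0.1       # face / no-face head
W_box = rng.standard_normal((4, feat_dim)) * 0.1  # (x, y, w, h) head

def heads(feature):
    """Map a shared CNN feature to the two predictions."""
    p_face = sigmoid(W_cls @ feature)   # task 1: binary classification
    box = W_box @ feature               # task 2: bounding-box regression
    return p_face, box

feature = rng.standard_normal(feat_dim)  # stand-in for a CNN feature
p_face, box = heads(feature)
print(0.0 < p_face < 1.0, box.shape)     # probability in (0,1), 4 box params
```

Both heads read the same feature, so the face/no-face decision and the box estimate share almost all of the computation.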

Training the network

There are, of course, several ways of training the network. One way is to create a large dataset of fixed-size images, corresponding to the smallest valid input to the network, so that each image produces a single output from the network.

The training set is balanced into two halves: one half carries positive labels (images containing a face) and the other half negative labels (images without a face).
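Balancing such a set amounts to sampling equal numbers from each class. A minimal sketch, with made-up file names and counts standing in for a real image collection:

```python
import random

# Hypothetical file lists; label 1 means "contains a face", 0 means not.
positives = [("face_%03d.png" % i, 1) for i in range(500)]
negatives = [("bg_%03d.png" % i, 0) for i in range(2000)]

# Take equal halves so the classifier sees both classes equally often.
n = min(len(positives), len(negatives))
random.seed(0)
train = random.sample(positives, n) + random.sample(negatives, n)
random.shuffle(train)

labels = [y for _, y in train]
print(len(train), sum(labels))  # 1000 samples, 500 of them positive
```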

The next step is to train a regressor on the positive class. For each positive sample, the true location (x, y, w, h) of the face is provided.

The network is then trained to optimize both tasks jointly. Once trained, the network can predict whether an image tile contains a face, and if so, it also provides the coordinates and scale of the face within the tile.
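One common way to train both tasks jointly is a combined loss: binary cross-entropy for the face/no-face decision plus a box term that only applies to positive samples. The weighting and the L2 box loss below are assumptions for illustration, not details from the Apple write-up:

```python
import numpy as np

def detection_loss(p_face, box_pred, y, box_true, lam=1.0):
    """Joint loss: cross-entropy on the face prediction plus an L2 box
    term gated by the label, so negatives contribute no box gradient.
    (`lam` and the L2 choice are assumptions, not Apple's settings.)"""
    eps = 1e-7
    p = np.clip(p_face, eps, 1 - eps)
    bce = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    box_l2 = y * np.sum((box_pred - box_true) ** 2)  # zero when y == 0
    return bce + lam * box_l2

# A confident, correct positive gives a near-zero loss...
loss_pos = detection_loss(0.999, np.zeros(4), 1, np.zeros(4))
# ...while confidently calling a background tile a face is penalized.
loss_neg = detection_loss(0.99, np.zeros(4), 0, np.zeros(4))
print(loss_pos < loss_neg)
```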

Due to its convolutional layout, the network can efficiently process an image of arbitrary size and produce a 2D output feature map.
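This is the payoff of the fully convolutional design: the same filter that scores one fixed-size tile can slide over a larger image, yielding one score per tile position. A plain-Python sliding-window sketch with illustrative sizes (real networks fuse this into the conv layers rather than looping):

```python
import numpy as np

rng = np.random.default_rng(2)
w = rng.standard_normal((8, 8))        # filter "trained" on 8x8 tiles
image = rng.standard_normal((32, 20))  # arbitrary-sized input image

# Valid cross-correlation: score every 8x8 window of the image.
H, W = image.shape
h, k = w.shape
out = np.empty((H - h + 1, W - k + 1))
for i in range(out.shape[0]):
    for j in range(out.shape[1]):
        out[i, j] = np.sum(image[i:i + h, j:j + k] * w)

print(out.shape)  # (25, 13): a 2D map, one score per tile position
```

Each entry of the output map is exactly what the fixed-size network would have produced on that tile, but the shared computation makes the dense scan far cheaper than cropping and re-running tiles one by one.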

That was the basic idea behind the deep convolutional network for face detection.

Hope this was a little useful. Do leave your comments.

PS: I hope to update this as and when I find more information.

Sources

  1. Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., and LeCun, Y. OverFeat: Integrated Recognition, Localization and Detection Using Convolutional Networks. arXiv:1312.6229.
  2. Face detection. https://machinelearning.apple.com
  3. Hubel, D. H., Wiesel, T. N. Receptive fields and functional architecture of monkey cortex. The Journal of Physiology. 1968 Mar;195(1):215–43.
  4. Convolutional Neural Networks (LeNet). DeepLearning 0.1 documentation. LISA Lab. http://deeplearning.net/.html
  5. Krizhevsky, A., Sutskever, I., Hinton, G. E. ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems 25, 1097–1105, 2012.
