
More than meets the eye — pupil

A short overview and comparison of current pupil detection methods on full face images.

Anneloes Ernest
8 min read · Jul 7, 2020


Why do we need pupil detection methods?

There is a growing trend of smartphone health apps that measure all kinds of personal data to provide insight into certain aspects of someone’s health. Some of these apps are more specialized, targeting a specific group of people. At Orikami, we develop digital biomarkers and implement them in mobile apps such as MS Sherpa. MS Sherpa personalizes healthcare for people with multiple sclerosis (MS) and measures the patient’s mobility, cognitive function and fatigue over time. One of these aspects, fatigue, can be assessed from the delay and the speed of an eye saccade after a visual stimulus (Finke et al. 2012). While the patient performs the eye saccade task on their smartphone, the front-facing camera records their face. Before we can determine the delay and speed of the saccade from this recording, we first need to find the location of the patient’s pupil. This is where pupil detection methods come in: they help us find the center of the pupil.

In this post, I will introduce four pupil detection methods and compare them to each other on four different datasets.

Which pupil detection methods are there?

Most pupil detection methods are designed for wearable eye trackers, which record the eyes up close with a high resolution camera and infrared light. These wearables are mostly used in visual attention research to estimate where the subject is looking. In a less controlled, casual environment, wearing such an eye tracker can be restrictive. In those cases, a simple webcam or smartphone camera can be used instead, although the recording then contains at least the entire face. Luckily, there are also pupil detection methods designed for full face images. With full face images, we always need to extract the eye regions first; usually a facial landmark estimator is used to predict the location of the eyes, as sketched below. Even some pupil detection methods for wearable eye trackers can perform well on these extracted eye regions.
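As a concrete example, here is a minimal sketch of that extraction step, assuming dlib’s 68-point landmark model and OpenCV; the post does not prescribe a specific estimator, and the padding value and model file name are placeholders.

```python
import cv2
import dlib
import numpy as np

# Assumption: dlib's 68-point model, in which landmarks 36-41 outline one
# eye and 42-47 the other. The .dat model file is downloaded separately.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def extract_eye_regions(image, padding=5):
    """Return the two cropped eye regions from a full face image."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None  # no face found, so no eye regions
    landmarks = predictor(gray, faces[0])
    eyes = []
    for start, end in [(36, 42), (42, 48)]:  # landmark ranges per eye
        pts = np.array([(landmarks.part(i).x, landmarks.part(i).y)
                        for i in range(start, end)], dtype=np.int32)
        x, y, w, h = cv2.boundingRect(pts)
        x0, y0 = max(x - padding, 0), max(y - padding, 0)
        eyes.append(image[y0:y + h + padding, x0:x + w + padding])
    return eyes
```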

The pupil detection methods that will be compared in this post are the gradient approach (Timm & Barth, 2011), the circular hollow kernel (CHK) model (Ince et al. 2019), the PUpil REconstructor (PuRe) (Santini et al. 2018) and DeepVOG (Yiu et al. 2019). The gradient approach and the CHK model are both designed for full face image data, whereas PuRe and DeepVOG were originally designed for wearable eye tracker data.

The gradient approach evaluates the increases and decreases in pixel intensity in all directions in the image. Under the assumption that the pupil is a dark feature on a lighter background in the eye region, the pupil boundary produces large changes in pixel intensity, and the image gradients there point away from the pupil center. The pupil center is therefore estimated as the point where most gradient vectors intersect. A simplified implementation follows the figure below.

Example of the output from the gradient approach (Timm & Barth, 2011)
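To make the idea tangible, below is a brute-force sketch of that objective, assuming NumPy and OpenCV: for every candidate center, it measures how well the displacement vectors towards strong-gradient pixels align with the gradients there. The paper uses a more efficient formulation; the gradient threshold and the darkness weighting here are my simplifications.

```python
import cv2
import numpy as np

def gradient_pupil_center(eye_gray):
    """Brute-force version of the Timm & Barth objective: the pupil center
    is the point whose displacement vectors to strong-gradient pixels best
    align with the gradients there. O(n^2) in pixels, so only suitable for
    small eye crops."""
    eye = eye_gray.astype(np.float64)
    gx = cv2.Sobel(eye, cv2.CV_64F, 1, 0)
    gy = cv2.Sobel(eye, cv2.CV_64F, 0, 1)
    mag = np.hypot(gx, gy)
    mask = mag > mag.mean()            # keep only strong gradients
    gx, gy = gx[mask] / mag[mask], gy[mask] / mag[mask]
    ys, xs = np.nonzero(mask)

    h, w = eye.shape
    weight = cv2.GaussianBlur(255.0 - eye, (5, 5), 0)  # favour dark centers
    best, center = -np.inf, (0, 0)
    for cy in range(h):
        for cx in range(w):
            dx, dy = xs - cx, ys - cy
            norm = np.hypot(dx, dy)
            norm[norm == 0] = 1        # avoid dividing by zero at the center
            dot = (dx * gx + dy * gy) / norm
            score = weight[cy, cx] * np.mean(np.maximum(dot, 0) ** 2)
            if score > best:
                best, center = score, (cx, cy)
    return center                      # (x, y) of the estimated pupil center
```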

The CHK model consists of an iris boundary, a pupil boundary and a pupil center. CHK resembles the gradient approach in that it also exploits large changes in pixel intensity. However, the CHK model only looks at large changes between edge pixels on a number of radials across the eye model, called the edge magnitudes. The eye model is evaluated for every pixel in the image, with that pixel as the candidate center, and the candidate with the largest number of edge magnitudes is chosen as the pupil center. A rough sketch of this voting idea follows.
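The sketch below is a loose reading of that voting scheme, not the paper’s exact kernel: for each candidate center, it samples intensities along radials at an assumed pupil radius and iris radius and counts the radials with a large intensity jump. The radii, number of rays and threshold are placeholder values that would need tuning to the eye-region scale.

```python
import numpy as np

def chk_pupil_center(eye_gray, r_pupil=8, r_iris=16, n_rays=16, thresh=20):
    """Simplified CHK-style voting: for every candidate center, sample
    intensities at the pupil and iris radii along n_rays radials and count
    the radials whose intensity jump (edge magnitude) exceeds thresh."""
    eye = eye_gray.astype(np.float64)
    h, w = eye.shape
    angles = np.linspace(0, 2 * np.pi, n_rays, endpoint=False)
    cos_a, sin_a = np.cos(angles), np.sin(angles)

    best_votes, center = -1, (0, 0)
    # keep candidates far enough from the border that both rings fit
    for cy in range(r_iris, h - r_iris):
        for cx in range(r_iris, w - r_iris):
            inner = eye[(cy + r_pupil * sin_a).astype(int),
                        (cx + r_pupil * cos_a).astype(int)]
            outer = eye[(cy + r_iris * sin_a).astype(int),
                        (cx + r_iris * cos_a).astype(int)]
            votes = np.count_nonzero(outer - inner > thresh)
            if votes > best_votes:
                best_votes, center = votes, (cx, cy)
    return center
```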

The next method, PuRe, is a purely edge-based algorithm. PuRe finds curved edge segments of all sizes in the eye region and predicts how likely each is to be part of the pupil boundary. The edge segment with the highest likelihood is fitted with an ellipse, and the center of that ellipse is taken as the pupil center. The implementation of PuRe used in the experiments is a third-party Python implementation. A toy version of the edge-and-ellipse idea follows the figure below.

An example output of the PuRe method (Santini et al. 2018)
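As a toy stand-in for that pipeline, not PuRe’s actual algorithm, one can detect edge segments with a Canny detector, fit an ellipse to each, and keep the best candidate. The scoring heuristic below (darkest ellipse interior) is my own simplification; PuRe’s confidence measure is considerably more elaborate.

```python
import cv2
import numpy as np

def edge_based_pupil_center(eye_gray):
    """Toy edge-and-ellipse pipeline: detect edge segments, fit an ellipse
    to each, and keep the one with the darkest interior. Expects an 8-bit
    grayscale eye crop."""
    edges = cv2.Canny(eye_gray, 50, 150)
    contours, _ = cv2.findContours(edges, cv2.RETR_LIST,
                                   cv2.CHAIN_APPROX_NONE)
    best_score, center = -np.inf, None
    for cnt in contours:
        if len(cnt) < 5:               # fitEllipse needs >= 5 points
            continue
        ellipse = cv2.fitEllipse(cnt)
        mask = np.zeros_like(eye_gray)
        cv2.ellipse(mask, ellipse, 255, -1)
        if cv2.countNonZero(mask) == 0:
            continue
        score = -cv2.mean(eye_gray, mask=mask)[0]  # darker interior = better
        if score > best_score:
            best_score, center = score, ellipse[0]
    return center                      # (x, y) or None if no candidate
```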

The last method, DeepVOG, is the only deep learning method in this comparison. DeepVOG is a U-net shaped convolutional neural network that segments the pupil. The output of the network is thresholded and fitted with an ellipse, and the center of that ellipse is predicted as the pupil center. The post-processing steps are sketched below the figure.

Example output of DeepVOG: a) input image, b) network prediction, c) thresholded prediction, d) extracted boundary, e) ellipse fitted to the boundary, f) ellipse displayed on the input image (Yiu et al. 2019)
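The post-processing in panels c) through f) can be sketched as follows, with the network’s probability map taken as given; the 0.5 threshold is an assumption, not DeepVOG’s documented value.

```python
import cv2
import numpy as np

def segmentation_to_center(prob_map, thresh=0.5):
    """Threshold the network's probability map, take the largest boundary,
    and fit an ellipse; its center is the pupil center."""
    binary = (prob_map > thresh).astype(np.uint8)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_NONE)
    if not contours:
        return None                    # nothing segmented: no prediction
    boundary = max(contours, key=cv2.contourArea)
    if len(boundary) < 5:              # fitEllipse needs >= 5 points
        return None
    ellipse = cv2.fitEllipse(boundary)
    return ellipse[0]                  # (x, y) pupil center
```

When the thresholded map is empty there is no boundary to fit, which is exactly the failure mode discussed below where DeepVOG cannot produce a meaningful prediction.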

Data

To compare the methods, we test them on four different datasets which mainly differ in image resolution. The first dataset, the BioID Database, is a low resolution dataset with 1521 images from 23 different subjects. The second dataset, the GI4E Database, is a medium to high resolution dataset and consists of 1236 images from 103 different subjects. Next, we have the IMM Frontal Face Database, the high resolution dataset, with 120 images from 12 different subjects. The last dataset is the Talking Face Video Dataset, which consists of 5000 images from a video of a single subject talking to the camera; it has a medium resolution. The specific resolutions can be found in the image below.

Example images from each dataset and corresponding resolution

Results

For each full face image, I have calculated the normalized pixel error (NPE) for the two eyes. The NPE is defined as the pixel distance between the true pupil center and the predicted pupil center, divided by the distance between the two true pupil centers. Normalizing the error this way gives the thresholds an anatomical meaning: an error smaller than 0.05 means the estimated center lies within the pupil boundary, and an error smaller than 0.10 means it lies within the iris boundary. For accurate pupil detection, we need a method that estimates the pupil center at least within the pupil boundary, so we are looking for the highest detection rate at an NPE of 0.05. Since both eyes need to be detected well, we use the NPE of the worse-predicted eye as the performance measure for the entire full face image. The results of the four methods on the four datasets are shown in the figure below.
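Before looking at the results, here is the metric exactly as defined above, as a small helper (the function and argument names are mine):

```python
import numpy as np

def normalized_pixel_error(pred_left, pred_right, true_left, true_right):
    """Worst-eye normalized pixel error (NPE) for one full face image:
    the larger of the two eye errors, each divided by the true
    inter-pupil distance."""
    inter_pupil = np.linalg.norm(np.subtract(true_left, true_right))
    err_left = np.linalg.norm(np.subtract(pred_left, true_left))
    err_right = np.linalg.norm(np.subtract(pred_right, true_right))
    return max(err_left, err_right) / inter_pupil
```

An image then counts as a detection at threshold t if this value is at most t.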

Detection rate of the methods per NPE value for all datasets

The first thing that stands out is that PuRe is heavily outperformed by the other three methods on all datasets. This is not entirely unexpected, since PuRe was designed for high resolution data from wearable eye trackers. DeepVOG, however, which was designed for the same purpose as PuRe, does give very good results. We should note that DeepVOG is sometimes unable to predict anything meaningful, since the output of the network is thresholded and the thresholded map can be empty. This is only problematic on BioID, where a quarter of the images could not be properly predicted. On the other datasets, DeepVOG’s lead in detection rate at the pupil boundary over the other methods is larger than the percentage of images DeepVOG skipped, so DeepVOG would still come out on top. The gradient approach comes second to DeepVOG, except on the BioID dataset, where the CHK model takes the lead.

The processing time of the methods is also worth taking into account. In the figure below, the processing times of all methods on all datasets are displayed. A note should be added about DeepVOG, since it is the only method that uses a GPU for the computation. We see that DeepVOG is not just very good in terms of performance, but also relatively fast. What also stands out is that the CHK model’s processing time depends heavily on the resolution, which makes sense since it is a pixel-based method. The implementation of the gradient approach uses a little cheat: the input images are resized to a maximum eye height, so that the processing time does not explode for higher resolution images like it does for the CHK model. A sketch of this resizing trick follows the figure.

Processing times per eye region for all methods and datasets
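That resizing trick might look like the sketch below; the cap of 60 pixels is a placeholder, since the post does not state the exact maximum eye height used.

```python
import cv2

def cap_eye_height(eye_img, max_height=60):
    """Downscale an eye crop so its height never exceeds max_height,
    keeping the aspect ratio. Returns the crop and the scale factor,
    which is needed to map a predicted center back to the original
    image coordinates."""
    h, w = eye_img.shape[:2]
    if h <= max_height:
        return eye_img, 1.0
    scale = max_height / h
    resized = cv2.resize(eye_img, (max(int(w * scale), 1), max_height))
    return resized, scale
```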

So…

In this post, I have given a small overview of current pupil detection methods and compared them on four datasets differing in resolution. The best performing method differs across datasets. For low resolution datasets like BioID, the CHK model performs best, whereas for higher resolution datasets like GI4E and IMM, DeepVOG performs best. If you would like to use pupil detection on your data, you should take its resolution into account when selecting a method.

Most smartphone front-facing cameras nowadays produce quite good quality images, and the GI4E and IMM datasets represent this resolution best. Therefore, for the fatigue task, DeepVOG is the most promising method. In cases where DeepVOG fails, another method like the gradient approach can be applied instead. DeepVOG could even be improved by enhancing its training data with eye regions extracted from full face images.

Dataset links
BioID: https://ftp.uni-erlangen.de/pub/facedb/readme.html
GI4E: http://www.unavarra.es/gi4e/databases/gi4e/
IMM: http://www2.imm.dtu.dk/pubdb/pubs/3943-full.html
TalkingFace: https://www-prima.inrialpes.fr/FGnet/data/01-TalkingFace/talking_face.html

References
Finke, C., Pech, L. M., Sömmer, C., Schlichting, J., Stricker, S., Endres, M., Ostendorf, F., Ploner, C. J., Brandt, A. U., & Paul, F. (2012). Dynamics of saccade parameters in multiple sclerosis patients with fatigue. Journal of neurology, 259(12), 2656–2663. DOI: https://doi.org/10.1007/s00415-012-6565-8.

Ince, I. F., Erdem, Y. S., Bulut, F., & Sharif, M. H. (2019). A low-cost pupil center localization algorithm based on maximized integral voting of circular hollow kernels. The Computer Journal, 62(7), 1001–1015. DOI: https://doi.org/10.1093/comjnl/bxy102.

Santini, T., Fuhl, W., & Kasneci, E. (2018). PuRe: Robust pupil detection for real-time pervasive eye tracking. Computer Vision and Image Understanding, 170, 40–50. DOI: https://doi.org/10.1016/j.cviu.2018.02.002.

Timm, F., & Barth, E. (2011). Accurate eye centre localisation by means of gradients. VISAPP, 11, 125–130. DOI: https://doi.org/10.5220/0003326101250130.

Yiu, Y. H., Aboulatta, M., Raiser, T., Ophey, L., Flanagin, V. L., zu Eulenburg, P., & Ahmadi, S. A. (2019). DeepVOG: Open-source pupil segmentation and gaze estimation in neuroscience using deep learning. Journal of neuroscience methods, 324, 108307. DOI: https://doi.org/10.1016/j.jneumeth.2019.05.016.
