A.I. Odyssey part 2. — Implementation Details

Julien Despois
4 min read · Jan 25, 2017


This is the follow-up to my story “Use your eyes and Deep Learning to command your computer”. Here, I’ll go into more detail about the implementation of the eye motion detection. So if you haven’t checked out the original post, you should do it now!

Finding the eyes

The main problem with the approach described in the article for finding the eyes is that Haar cascades, although very accurate, tend to have jitter in the position and shape of the bounding boxes. Even if this does not look like a problem (the eye always appears centered), it completely messes up the difference frames.

Gamma motion with no eye tracking (jitter)

The solution to this is to implement a minimal eye tracking algorithm: the bounding box is left unchanged if the newly detected one is close to the one in the previous frame. With some tuning of the parameters that determine what “close” means, I reached the point where the frame would not jitter anymore, except when the eye/head moved too much. A small sketch of this stabilization follows the gif below.

Gamma motion with eye tracking
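Here is a minimal sketch of that stabilization logic, assuming (x, y, w, h) bounding boxes; the threshold values are illustrative, not the exact parameters used in the project.

```python
# Minimal sketch of the bounding-box stabilization: keep the previous box
# unless the new detection moved or resized too much. Thresholds are assumptions.

def boxes_are_close(box_a, box_b, max_shift=5, max_resize=5):
    """Return True if the two (x, y, w, h) boxes differ by only a few pixels."""
    (xa, ya, wa, ha), (xb, yb, wb, hb) = box_a, box_b
    return (abs(xa - xb) <= max_shift and abs(ya - yb) <= max_shift and
            abs(wa - wb) <= max_resize and abs(ha - hb) <= max_resize)

def stabilize_box(detected_box, previous_box):
    """Keep the anchored box when the detection only jittered slightly."""
    if previous_box is not None and boxes_are_close(detected_box, previous_box):
        return previous_box   # small jitter: keep the anchored box
    return detected_box       # large move: accept the new box (this is a "shift")
```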

The downside of this technique is that the eye can drift inside the first anchored bounding box. More importantly, when the bounding box does change, it creates a “shift” in the image difference.

In the gif above, there is such a shift when the eyes come from the top right to the bottom left.

Eyes in current frame (Left), previous frame (Center) and difference (Right)

Note how the sudden jump to a new bounding box on the rightmost eye results in a huge shift in the difference while the eye itself did not move.

To alleviate this issue, I simply decided to skip the frames where a shift occurs. This removed the shifts and made the data cleaner, while keeping most of the information.

Ex. This means that if the bounding box changes between frames 22 and 23, we compute the difference frames [22–21] and [24–23], and skip [23–22].
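A sketch of that skipping logic, assuming the eye crops and a per-frame “the box changed here” flag are already available (cv2.absdiff stands in for whatever difference computation is used):

```python
import cv2

def difference_frames(eye_crops, box_changed):
    """eye_crops[i]: eye image at frame i; box_changed[i]: True if the bounding
    box changed between frames i-1 and i (a "shift")."""
    diffs = []
    for i in range(1, len(eye_crops)):
        if box_changed[i]:
            continue  # e.g. box changed between frames 22 and 23: drop [23-22]
        diffs.append(cv2.absdiff(eye_crops[i], eye_crops[i - 1]))
    return diffs
```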

Neural network

No pooling

I chose to use a single convolutional layer, with no pooling, as the images were pretty small (24px wide). This ensures that we keep as much information as possible.
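As an illustration, here is what such a network could look like in Keras; the filter count, kernel size, input height, dense size and number of output classes are assumptions, not the original hyperparameters.

```python
from tensorflow.keras import layers, models

# Hypothetical sketch of a single-conv-layer network with no pooling, so the
# small difference images keep all of their spatial detail. All sizes below
# (filters, kernel, input height, dense units, classes) are assumptions.
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(24, 24, 1)),
    layers.Flatten(),                       # no pooling: every activation is kept
    layers.Dense(64, activation="relu"),
    layers.Dense(4, activation="softmax"),  # one unit per eye-motion class (assumed)
])
```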

Sharing the weights

I could also have shared the weights between the two eyes, but I chose not to, just in case, because of the slight differences in pose and shape of the eyes. I did not have time to test whether this was helpful, but sharing the weights would have made the model lighter.
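For comparison, sharing the weights would look something like this with the Keras functional API: a single Conv2D layer applied to both eye inputs, so both eyes are filtered with the same kernels. Shapes and sizes are again assumptions.

```python
from tensorflow.keras import Input, layers, models

# Sketch of the shared-weights variant (not what was used in the project):
# one Conv2D layer applied to both eyes, roughly halving the conv weights.
left_eye = Input(shape=(24, 24, 1))
right_eye = Input(shape=(24, 24, 1))

shared_conv = layers.Conv2D(32, (3, 3), activation="relu")   # shared kernels
left_features = layers.Flatten()(shared_conv(left_eye))
right_features = layers.Flatten()(shared_conv(right_eye))

merged = layers.concatenate([left_features, right_features])
output = layers.Dense(4, activation="softmax")(merged)       # classes assumed

model = models.Model(inputs=[left_eye, right_eye], outputs=output)
```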

Making predictions & Multithreading

It is absolutely essential to run the classifier in a separate thread from the webcam/eye detection. That is because the model takes quite some time to make a prediction (tens to hundreds of milliseconds, but it matters) and we would miss what’s happening in the meantime[a]. As the eye motions are quick (~1 second), we want to capture them at the highest framerate possible, and then make predictions on the latest frames available[b].

[a] Imagine yourself trying to write down what someone is saying, but after each word you have to give your pen and paper to a friend to translate what you wrote. You wouldn’t be able to write full sentences.
[b] This time, you write down as much as you can and your friend occasionally peeks over your shoulder and translates the latest words you’ve written.

Both approaches will miss some information, but it makes more sense to make fewer predictions on full data than more predictions on patchy, inconsistent data.
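Here is a minimal sketch of that two-thread structure; grab_difference_frame(), classify() and handle_prediction() are hypothetical placeholders for the capture, model and command code, not functions from the original project.

```python
import threading
import time
from collections import deque

# The capture thread fills a rolling history of difference frames at full
# webcam framerate; the prediction thread periodically classifies a snapshot
# of that history. Window size and interval are assumptions.
frame_history = deque(maxlen=30)
history_lock = threading.Lock()

def capture_loop(grab_difference_frame):
    """Never blocked by the model: just keep appending the latest frames."""
    while True:
        frame = grab_difference_frame()
        with history_lock:
            frame_history.append(frame)

def prediction_loop(classify, handle_prediction, interval=0.3):
    """A few predictions per second, each on the latest frames available."""
    while True:
        time.sleep(interval)
        with history_lock:
            frames = list(frame_history)
        if frames:
            handle_prediction(classify(frames))  # slow call, off the capture thread

def start(grab_difference_frame, classify, handle_prediction):
    threading.Thread(target=capture_loop, args=(grab_difference_frame,), daemon=True).start()
    threading.Thread(target=prediction_loop, args=(classify, handle_prediction), daemon=True).start()
```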

Making the predictions

The goal of this project was to evaluate the feasibility of eye motion recognition with deep learning and a laptop webcam. As such, I did not spend much time on the last step of the process (using the predictions to trigger commands on the computer). However, I wanted the software to work as well as it could.

For that, I had to fight against the inaccuracy of the model predictions. 85+% accuracy is good, but when you make 3 predictions a second, it quickly becomes clear that it’s not enough (too many false positives). The solution was to average the model’s predictions over the duration of a motion.

This involved some sketchy FPS computations (see classifier.py) to match the window used for averaging with the time a motion stays in the frame history. The idea was that the prediction would be more accurate if it was made with the motion at the beginning, middle and end of the sequence.
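A sketch of that averaging step, assuming the classifier outputs a probability vector a few times per second; the window size and confidence threshold are illustrative, not the values from classifier.py.

```python
from collections import deque

import numpy as np

class PredictionSmoother:
    """Average the last few prediction vectors (roughly one motion's duration)
    and only report a class when the averaged confidence is high enough."""

    def __init__(self, window=3, threshold=0.8):   # ~1s at ~3 predictions/s (assumed)
        self.recent = deque(maxlen=window)
        self.threshold = threshold

    def update(self, probabilities):
        self.recent.append(np.asarray(probabilities))
        mean = np.mean(self.recent, axis=0)
        best = int(np.argmax(mean))
        return best if mean[best] >= self.threshold else None  # None = no trigger
```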

This works very well, but could use some fine tuning to make the model even better.

Final words

Thank you again for your interest and support!

Here’s the code again for reference:
