A Lip Reader

Kaotu
6 min read · Jun 21, 2022
(Image: https://livingwithhearingloss.com/)

Hello, my name is Kaotu, from the 2nd class of AI Builders. The project I chose to do is a lip reader.

AI Builders: https://ai-builders.github.io/showcase/2022/A-Lip-Reader

Github: https://github.com/Kaotu999/A_Lip_Reader

Images and data: https://drive.google.com/drive/folders/1ou9UV9noFoDxCDicWWlybMvMKKGr1y1d?usp=sharing

Try for yourself: https://colab.research.google.com/drive/1yIxchzbmw3uQNeY0d8lkNxCP08E8_BYT?usp=sharing

What is it?

My idea: a lip reader that takes as input a video file with corrupted audio or too much background noise, uses a CNN (convolutional neural network) to classify what the person is saying, and then adds subtitles to the video.

Why is this useful?

This project can solve many problems. As stated earlier, it lets you understand videos with corrupted audio or too much background noise. If developed further, it could be used for military purposes, for example reading criminals’ lips on CCTV to see what they are saying. It could also be used in medicine as an assistive tool for people with speech impairments.

Nowadays, the military trains its personnel to read lips, and professional lip readers reach only 20–60% accuracy. My goal is to make an AI that does a better job.

Let’s get started!

Preparing data

Option 1 : find a dataset

First, I found this dataset on kaggle.com: MIRACL-VC1. It was made specifically for lip reading, which was great!


The only problem was that the dataset consists of different people saying only 10 words and 10 phrases, meaning the model would only be able to predict and classify those 10 words and phrases and nothing else, which is not what I wanted.

Option 2 : create a dataset

So, I decided to create my own dataset, one that could handle any spoken word. How would I do that? With phonemes: I obviously couldn’t get a video of a person saying every word, but every word is built from a small set of phonemes.

Idea: https://youtu.be/28U6EwfKois

(Image: https://www.researchgate.net/)

The model will classify, from an image, which phoneme the person is speaking, and the phonemes can then be assembled into words. For my dataset, I found an hour-long video of a person reading the Bee Movie transcript, and I believed that should be enough.

The next task is to label each frame with the phoneme being spoken. For this I used a tool called Gentle, a forced aligner. It requires an audio file and a transcript, both of which I have, and returns a JSON file in which every phoneme is labelled with the exact time it was spoken.

After a little bit of research, I found out how to run Gentle: I had to use Docker and build it from the source code on GitHub.
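Roughly, the workflow looks like this, as a minimal sketch. I’m assuming Gentle’s Docker image is serving its HTTP API on the default port 8765, and the file names are placeholders:

```python
# Start the aligner first, e.g.:  docker run -p 8765:8765 lowerquality/gentle
import requests

with open("audio.wav", "rb") as audio, open("transcript.txt", "rb") as text:
    resp = requests.post(
        "http://localhost:8765/transcriptions?async=false",
        files={"audio": audio, "transcript": text},
    )

# Gentle returns the alignment as JSON: each word carries its phonemes
# together with start times and durations.
for word in resp.json()["words"]:
    print(word.get("alignedWord"), word.get("start"), word.get("phones"))
```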

After that, for the model’s efficiency, I decided that each frame of the video had to be cropped down to the person’s mouth before training or classifying, using the face_recognition library and OpenCV.
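A minimal sketch of that cropping step (the helper name, padding, and output size are my own illustrative choices):

```python
import cv2
import face_recognition
import numpy as np

def crop_mouth(frame_bgr, size=64, pad=10):
    """Crop an OpenCV (BGR) frame down to the speaker's lips."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)  # face_recognition expects RGB
    faces = face_recognition.face_landmarks(rgb)
    if not faces:
        return None  # no face found in this frame
    # The lip landmarks come back under 'top_lip' and 'bottom_lip'
    lips = np.array(faces[0]["top_lip"] + faces[0]["bottom_lip"], dtype=np.int32)
    x, y, w, h = cv2.boundingRect(lips)
    crop = frame_bgr[max(y - pad, 0):y + h + pad, max(x - pad, 0):x + w + pad]
    return cv2.resize(crop, (size, size))
```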

(Image: https://blogs.voanews.com)

Just like that, after a little bit of coding, I was able to successfully create the dataset: around 100,000 images cropped from the video, each labelled with its phoneme, across 40 classes in total.

However, the dataset is a little imbalanced because some phonemes show up far more often than others; for example, none (silence) appears around 1,000 times more often than the phoneme “zh”. I decided to leave it be, because it is natural that some sounds appear more often than others.

The cleaning code is in the Notebooks folder of the GitHub repo.

Training

CNN

First, I tried a CNN model from fastai, loading a pre-trained network with cnn_learner to make a simple image classifier.

A CNN, or convolutional neural network, is a deep learning algorithm that takes in an input image, assigns importance to various aspects and objects in the image, and learns to differentiate one from another (more info).

This is an image classification model, so the metric is accuracy.
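A minimal sketch of this setup, assuming the cropped images are organized into one folder per phoneme class (the resnet34 backbone is an assumption):

```python
from fastai.vision.all import *

# Hypothetical layout: mouth_crops/<phoneme>/<image>.jpg, one folder per class
dls = ImageDataLoaders.from_folder(
    "mouth_crops/", valid_pct=0.2, item_tfms=Resize(64)
)
learn = cnn_learner(dls, resnet34, metrics=accuracy)  # pre-trained backbone
learn.fine_tune(6)  # 6 epochs, as in the training run below
```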

After 6 epochs of training, the results were rather disappointing…

CNN+LSTM

I could have continued training the CNN model, but I wanted to try feeding the data into the model as a sequence: saying something is a sequence of lip movements, not a still photo of them. So, I decided to try a CNN+LSTM, where LSTM stands for Long Short-Term Memory.

Long Short-Term Memory (LSTM) networks are a type of recurrent neural network capable of learning order dependence in sequence prediction problems (more info).

I decided to use TensorFlow’s Sequential model and then added my own layers.

The model takes a sequence of 4 NumPy arrays (images), each 64x64, and classifies the sequence into one of the 40 classes. As with the CNN model, the metric is accuracy.
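A minimal sketch of such a model (the layer sizes here are illustrative, not the exact ones I trained, and I’m assuming grayscale frames and integer class labels):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    # 4 grayscale 64x64 frames form one input sequence
    layers.TimeDistributed(layers.Conv2D(32, 3, activation="relu"),
                           input_shape=(4, 64, 64, 1)),
    layers.TimeDistributed(layers.MaxPooling2D()),
    layers.TimeDistributed(layers.Conv2D(64, 3, activation="relu")),
    layers.TimeDistributed(layers.MaxPooling2D()),
    layers.TimeDistributed(layers.Flatten()),
    layers.LSTM(128),                       # learns the lip-movement sequence
    layers.Dense(40, activation="softmax")  # one output per phoneme class
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",  # integer labels assumed
              metrics=["accuracy"])
```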

The training, again, was very disappointing.

After a few hours (30 epochs, on top of the 10 I had trained before), the loss stopped decreasing and the accuracy stopped increasing (overfitting), so I stopped training.

I got 34% training accuracy and 32% validation accuracy.

Error Analysis: Why these methods got low accuracy

  • There are 40 classes.
  • Some phonemes have very little data for the model to learn from.
  • Some phonemes are very hard to differentiate, even for humans. A very simple example is “v” versus “f”, though there are a lot more phonemes that look alike.
(Images: mouth shapes for “f” and “v”)

Can you tell them apart? These problems are somewhat unfixable: naturally, some phonemes appear very rarely in everyday words, and some phonemes sound and look very similar.

Baseline

On the test set, the model got:

Taking the CNN model as the baseline (there is no other lip-reading model to compare against), the LSTM model has sadly failed to beat it.

Adding Subtitles

Converting the output phonemes into words is another very difficult task. At first, I thought I would build another NLP model to predict words from phonemes, but time was running out. So, I settled for a simple conversion using the CMU Pronouncing Dictionary, which maps every English word to its phonemes.
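A minimal sketch of that kind of reverse lookup, using NLTK’s copy of the dictionary (an assumption; any copy of CMUdict would do). Stress digits are stripped here, assuming the model predicts plain phonemes:

```python
import nltk
from nltk.corpus import cmudict

nltk.download("cmudict", quiet=True)

# Reverse map: phoneme sequence -> word. Homophones collide,
# so this keeps whichever word is seen first.
phones_to_word = {}
for word, prons in cmudict.dict().items():
    for pron in prons:
        key = tuple(p.rstrip("012") for p in pron)  # AE1 -> AE, etc.
        phones_to_word.setdefault(key, word)

print(phones_to_word.get(("K", "AE", "T")))  # e.g. 'cat' (or a homophone)
```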

Sadly, the phoneme sequences coming out of the model did not seem to fit any word in the dictionary, so I just added the phonemes themselves as subtitles.

Deployment

I used Google Colab to run Streamlit.
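One common recipe for exposing a Streamlit app from Colab, as a sketch (I’m assuming a localtunnel setup and a placeholder app.py; other tunneling options work too):

```python
# Colab notebook cells; lines starting with '!' run shell commands.
!pip install -q streamlit
!streamlit run app.py --server.port 8501 &>/dev/null &  # start the app in the background
!npx localtunnel --port 8501                            # prints a public URL for the app
```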

Try it for yourself

You have to upload your file to Google Drive before using it, which is not very convenient, but it is usable.

Overall, my model did not meet my expectations, but I wouldn’t call it a complete failure. After all, it’s about the journey, not the destination. I learned many things from building this project, and I am very grateful to be a part of AI Builders.

Thank you very much…
