Age and Gender Classification using MobileNets

Kinar R
Published in YML Innovation Lab
8 min read · Aug 8, 2018

With the advent of AI, visual understanding has become increasingly relevant to the computer vision community. Age and gender classification has been studied for quite some time, and efforts to improve its results have only intensified since the emergence of social platforms. This article illustrates how MobileNets, a family of low-latency, low-power, on-device computer vision models, can deliver a significant boost in performance and help us train a model for age and gender classification.

Traditionally, deep learning inference is performed in the cloud: when a user issues a request to classify an image, the image is first sent to a web service, a model server performs inference, the result is returned to the service, and eventually your phone receives it. With the rapid growth in mobile processing power and the rise of architectures such as MobileNets, the way inference is carried out is changing quickly. Although these mobile-oriented networks are usually not as accurate as the larger, more resource-intensive networks we have come to know and love, they really stand out when it comes to the resource/accuracy trade-off.

Now, we’ll look into leveraging these networks in order to classify the age and gender of a person.

Enter IMDB Faces

This dataset is claimed to be the largest publicly available dataset of face images with gender and age labels for training. However, due to its enormous size, we'll be using the cropped version of the dataset, which is significantly smaller (~7 GB).

A .mat file containing all the meta-information is provided, which can be loaded using either MATLAB or Python's SciPy library. The essential fields are as follows:

  • dob: date of birth (MATLAB serial date number)
  • photo_taken: year when the photo was taken
  • full_path: path to the image
  • gender: 0 for Female and 1 for Male, NaN if unknown
  • face_score: detector score (the higher the better).
  • second_face_score: detector score of the face with the second-highest score (useful for ignoring images with more than one face); NaN if no second face was detected

In our case, Python will be employed and SciPy’s loadmat method can be used to load the metadata.
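As a minimal sketch, assuming the cropped archive has been extracted to an imdb_crop directory (the file path is an assumption; the field access pattern follows the dataset's .mat layout):

from scipy.io import loadmat

# Load the annotation struct from the metadata file
meta = loadmat("imdb_crop/imdb.mat")
imdb = meta["imdb"][0, 0]

# Pull out the fields described above as 1-D arrays
dob = imdb["dob"][0]                          # MATLAB serial date numbers
photo_taken = imdb["photo_taken"][0]          # year each photo was taken
full_path = imdb["full_path"][0]              # relative image paths
gender_classes = imdb["gender"][0]            # 0 = female, 1 = male, NaN = unknown
face_score = imdb["face_score"][0]
second_face_score = imdb["second_face_score"][0]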

Cleaning up noisy labels

Looking at the metadata's description, it can be observed that there's a face score which can be used to judge the quality of the images being fed to the network. With this, a threshold can be set so that only images scoring above a certain value (say 3) are kept.

face_score_threshold = 3

Also, some annotations that should be numeric may be missing or invalid. To overcome this, NumPy's isnan function can be used to check which entries are non-numeric (NaN) and filter out such noisy labels.

face_score_mask = face_score > face_score_threshold
second_face_score_mask = np.isnan(second_face_score)  # keep images with only one detected face
unknown_gender_mask = np.logical_not(np.isnan(gender_classes))  # keep images whose gender is known

As for age, we make sure the computed age classes fall within a valid range (0 to 100).

age_classes = np.array([calc_age(photo_taken[i], dob[i]) for i in range(len(dob))])
valid_age_range = np.isin(age_classes, [x for x in range(101)])
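The calc_age helper used above isn't shown in the article; a minimal sketch, assuming dob is a MATLAB serial date number and that the photo was taken around mid-year, might look like this:

from datetime import datetime

def calc_age(taken, dob):
    # MATLAB serial dates start at year 0, so shift by 366 days to get a Python ordinal
    birth = datetime.fromordinal(max(int(dob) - 366, 1))
    # If the birthday falls in the first half of the year, assume it has already
    # passed by the time the photo was taken
    if birth.month < 7:
        return taken - birth.year
    return taken - birth.year - 1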

Now all these masks are combined to get a subset of the dataset containing a rich set of well annotated faces, mostly free from distortion.

mask = np.logical_and(face_score_mask, second_face_score_mask)
mask = np.logical_and(mask, unknown_gender_mask)
mask = np.logical_and(mask, valid_age_range)

Once the denoised annotation lists are ready, the images are paired with their respective (gender, age) tuples using a handy Python idiom: chaining dict and zip.

With this, a dictionary containing image names as keys and (gender, age) labels as values represents the ground truth data. These key-value pairs can then be loaded by a generator that scales the images to any desired size and applies transformations to them.

The data is then split into training (80%) and validation (20%) sets.
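Putting this together, a rough sketch of the pairing and split (the shuffling and the exact cut point are assumptions):

# Each full_path entry is a length-1 array, so unwrap it to a plain string
image_names = [str(p[0]) for p in full_path[mask]]
labels = list(zip(gender_classes[mask], age_classes[mask]))
ground_truth = dict(zip(image_names, labels))

# Shuffle the keys and split them 80/20 into training and validation sets
keys = list(ground_truth.keys())
np.random.shuffle(keys)
split_index = int(0.8 * len(keys))
train_keys, val_keys = keys[:split_index], keys[split_index:]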

Loading data

A custom image data generator for Keras is designed to load data in batches using the training and validation keys prepared in the preprocessing stage.

This is like any other generator except that it takes two annotated targets (gender and age) from the ground truth, one for each of the neural network's outputs. This implies that we have two label vectors that need to be one-hot encoded before being passed to the model. This can be done with to_categorical from Keras' NumPy utilities, which converts a class vector of integers into a binary class matrix.
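For example, assuming 2 gender classes and 101 age classes (0 to 100), a batch of integer labels (the batch_* names below are placeholders) could be encoded as:

from tensorflow.keras.utils import to_categorical

gender_targets = to_categorical(batch_gender_labels, num_classes=2)
age_targets = to_categorical(batch_age_labels, num_classes=101)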

When it comes to the images themselves, the generator applies augmentations such as variations in saturation, brightness, lighting, and contrast, as well as horizontal/vertical flip transformations.

The data can then be wrapped in a list of dictionaries.
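As a rough sketch of what such a generator might look like, using the Keras Sequence interface and omitting the augmentations for brevity (the class name and directory layout are assumptions, not the article's actual implementation):

import numpy as np
from tensorflow.keras.utils import Sequence, to_categorical
from tensorflow.keras.preprocessing.image import load_img, img_to_array

class FaceGenerator(Sequence):
    def __init__(self, keys, ground_truth, images_dir, batch_size=32, image_size=224):
        self.keys = keys
        self.ground_truth = ground_truth
        self.images_dir = images_dir
        self.batch_size = batch_size
        self.image_size = image_size

    def __len__(self):
        return int(np.ceil(len(self.keys) / self.batch_size))

    def __getitem__(self, idx):
        batch_keys = self.keys[idx * self.batch_size:(idx + 1) * self.batch_size]
        images, genders, ages = [], [], []
        for key in batch_keys:
            img = load_img(f"{self.images_dir}/{key}",
                           target_size=(self.image_size, self.image_size))
            images.append(img_to_array(img) / 255.0)
            gender, age = self.ground_truth[key]
            genders.append(int(gender))
            ages.append(int(age))
        return (np.array(images),
                [to_categorical(genders, num_classes=2),
                 to_categorical(ages, num_classes=101)])

Separate train and validation generator instances can then be built from the training and validation keys respectively.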

The Model

MobileNet is included among Keras' built-in applications. This convolutional neural network, excluding its top layer, serves as the base for the model, with a custom classification block replacing the removed layer. The block starts with Global Average Pooling, which provides a degree of translation invariance; as a side note, the pooling layer also allows the input image to be of any size. Overfitting is alleviated with dropout regularization at a mildly aggressive rate of 0.5, followed by a dense layer of size 1024 that mixes signals from the earlier layers and extracts higher-level features.

Model Architecture

Two additional dense layers with softmax activations, one for gender and one for age, are then connected to the fully connected layer to top off the network. These dense layers work like a traditional feedforward network, connecting the 1024 higher-level features from the previous layer to the final predictions for gender and age. With that, the model is ready for training.
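A sketch of this architecture in Keras (the 2-way and 101-way output sizes follow the gender and age label ranges used earlier; the ReLU on the 1024-unit layer is an assumption):

from tensorflow.keras.applications import MobileNet
from tensorflow.keras.layers import GlobalAveragePooling2D, Dropout, Dense
from tensorflow.keras.models import Model

# MobileNet base pretrained on ImageNet, with its top classification layer removed
base = MobileNet(input_shape=(224, 224, 3), include_top=False, weights="imagenet")

# Custom classification block
x = GlobalAveragePooling2D()(base.output)
x = Dropout(0.5)(x)
x = Dense(1024, activation="relu")(x)

# Two softmax heads: one for gender, one for age
gender_output = Dense(2, activation="softmax", name="gender")(x)
age_output = Dense(101, activation="softmax", name="age")(x)

model = Model(inputs=base.input, outputs=[gender_output, age_output])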

Training

The next step is to see what sort of accuracy can be achieved with this MobileNet-based configuration. We'll start by training the model with an input size of 224 x 224 x 3.

For optimization, Stochastic Gradient Descent (SGD) is used with an initial learning rate of 0.001; since we are fine-tuning weights pretrained on ImageNet, a small learning rate allows for faster, more stable convergence.

During training, a LearningRateScheduler callback decays the learning rate as the number of epochs increases.

Also, to help the model avoid overfitting, a ReduceLROnPlateau callback is added to reduce the learning rate when a monitored metric (say, validation loss) has stopped improving for a certain number of epochs.
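A sketch of the compilation and training step with these callbacks (the momentum value, the decay schedule, and the generator names are assumptions):

from tensorflow.keras.optimizers import SGD
from tensorflow.keras.callbacks import LearningRateScheduler, ReduceLROnPlateau

model.compile(optimizer=SGD(learning_rate=0.001, momentum=0.9),
              loss={"gender": "categorical_crossentropy", "age": "categorical_crossentropy"},
              metrics=["accuracy"])

# Illustrative schedule: halve the learning rate every 20 epochs
def schedule(epoch, lr):
    return lr * 0.5 if epoch > 0 and epoch % 20 == 0 else lr

callbacks = [
    LearningRateScheduler(schedule),
    ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=5),
]

model.fit(train_generator, validation_data=val_generator, epochs=70, callbacks=callbacks)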

The data is loaded from an input directory and then split into training and validation sets for the model.

  • input_path specifies the path to the .mat file
  • images_path is the directory where the images are located

Results

The following observations were made in terms of accuracy and loss after training the model for 70 epochs on a GTX 1080 Ti. As can be seen from the following curves, the model converges fairly quickly in about 10 epochs and then gradually begins to stabilize.

Accuracy
Loss

Trying it out on real images

Before testing the model on real faces, the faces first need to be localized. Detecting facial landmarks is another problem altogether, so we won't go deeper into the subject here; it is a subset of the shape prediction problem. Given an input image, a shape predictor attempts to localize key points of interest along the shape. In the context of facial landmarks, our goal is to detect the bounding box around a person's face.

For starters, we initialize dlib's face detector (based on the Histogram of Oriented Gradients) along with its facial landmark predictor.

import dlib

detector = dlib.get_frontal_face_detector()

More on Histograms of Oriented Gradients can be found via a link in the references.

The pre-trained facial landmark detector inside the dlib library is used to estimate the location of 68 (x, y)-coordinates that map to facial structures on the face.

To understand how dlib’s facial landmark detector works, indexes of the 68 coordinates can be visualized on the image below:

68 facial landmark coordinates from the iBUG 300-W dataset

These annotations are part of the 68 point iBUG 300-W dataset which the dlib facial landmark predictor was trained on. There are other datasets available such as HELEN that use a 194-point model. Irrespective of which dataset is used, the same dlib framework can be leveraged to generate the bounding boxes.

When it comes to detecting the faces, a rectangular bounding box is drawn around each face returned by dlib. This is achieved using the methods dlib provides for obtaining the coordinates of a detected face (left, top, right, and bottom) along with its height and width. The crops are then resized and preprocessed accordingly for consistency of evaluation.
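A minimal sketch of this detection and cropping step (the test image path and preprocessing details are assumptions; only dlib's rectangle accessors are taken as given):

import cv2
import numpy as np

image = cv2.imread("test.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

faces = []
for rect in detector(gray, 1):
    # dlib rectangles expose left/top/right/bottom coordinates
    x1, y1, x2, y2 = rect.left(), rect.top(), rect.right(), rect.bottom()
    crop = image[max(y1, 0):y2, max(x1, 0):x2]
    # Convert BGR to RGB to match the training pipeline, then resize and scale
    crop = cv2.cvtColor(crop, cv2.COLOR_BGR2RGB)
    crop = cv2.resize(crop, (224, 224)).astype("float32") / 255.0
    faces.append(crop)

faces = np.array(faces)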

The model may then be used to predict the age and gender as follows.
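One way this prediction step might look (taking the age estimate as the expected value over the 101-way softmax is an assumption; the argmax would work as well):

gender_probs, age_probs = model.predict(faces)

for g, a in zip(gender_probs, age_probs):
    gender = "Male" if np.argmax(g) == 1 else "Female"
    # Expected value over the age distribution gives a smoother estimate than the argmax
    age = int(np.sum(a * np.arange(101)))
    print(f"{gender}, {age} years old")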

Hopefully, you found the article to be a good read and useful in your quest for recognizing a person’s age and gender.

Updated to use a modern solution (Keras 3, KerasCV and TensorFlow)

References
