Age and gender estimation. Open-source projects overview. Simple project from scratch.

Pavel Chernov
13 min read · Apr 29, 2019


One day I decided to try something new in deep learning: simple age and gender estimation using the OpenCV library and pre-trained open-source models.

This is what I got in the end. Pretty close.

I provide a simple project to demonstrate how to implement age and gender estimation.

What do you do when you want to try something new in deep learning? Of course you search for articles and open-source projects first!

Existing Open-Source Projects for Gender and Age Estimation

Disclaimer: there are many more projects than those listed here, but I believe I have covered the most popular ones, those that appear on the first pages of search results.

How did I search

I have googled for:

  • gender age estimation
  • gender age opencv
  • gender age keras
  • gender age tensorflow
  • gender age caffemodel
  • gender age pytorch

I looked at only the first one or two pages of results. Then I excluded:

  • articles with restricted access,
  • projects without source code,
  • projects with source code written in a language other than Python,
  • projects that perform only age or gender estimation, not both,
  • project duplicates or copies.

After that, I dug into the source code to find details of the input image format, output format, model architecture, weights size, license, pre-trained model availability, etc.

If you want the final summary table — just scroll down!

List of existing projects

Here is what I’ve found for the topic:

Age and Gender Classification using MobileNets by Kinar Ravishankar.

  • Source code: https://github.com/KinarR/age-gender-estimator-keras
  • License: MIT
  • Framework: Keras/TensorFlow
  • Input: RGB images of any size, author used: 224x224x3
  • Output: gender: two binary classes, Male and Female, choose maximum. age: 21 classes, use softmax, choose maximum and multiply its index by 4.76, which gives you roughly a [0–100] years interval.
  • Model weights size: we can estimate it, as this model is based on MobileNet_v1_224, followed by one Dense(1024->1024) layer plus two output Dense(1024->1) layers. So there are approximately (4.24 MP + 1.05 MP) = 5.29 MP (MP = million parameters), which is about 21 Mb for float32.
  • Pre-trained model available: NO
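
To make this decoding scheme concrete, here is a tiny numpy sketch; the probability vectors below are made up for illustration and are not real model outputs:

```python
import numpy as np

# Hypothetical post-softmax outputs for a single face (made-up numbers)
gender_probs = np.array([0.3, 0.7])    # [Male, Female] -> choose maximum
age_probs = np.zeros(21)
age_probs[5] = 1.0                     # all probability mass on class index 5

gender = ['Male', 'Female'][int(gender_probs.argmax())]
age_years = age_probs.argmax() * 4.76  # class index scaled to roughly [0..100]
```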

How to build an age and gender multi-task predictor with deep learning in TensorFlow by Cole Murray

  • Source code: https://github.com/ColeMurray/age-gender-estimation-tutorial
  • License: unspecified
  • Framework: TensorFlow
  • Input: RGB images 224x224x3
  • Output: gender: two binary classes: Male and Female, choose maximum. age: vector of 101 classes probabilities for ages [0..100], choose maximum or use weighted sum.
  • Model weights size: we can estimate it from the model architecture: Conv(5x5, 3->32) -> MaxPool(2->1) -> Conv(5x5, 32->64) -> MaxPool(2->1) -> Conv(5x5, 64->128) -> MaxPool(2->1) -> Dense(28*28*128 -> 1024) -> Dense(1024 -> 101), Dense(1024 -> 2). 2400 + 51200 + 204800 + 102760448 + 103424 + 2048 = 103.1 MP, which is approximately 393 Mb.
  • Pre-trained model available: NO
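
The two decoding options for the 101-class age vector ("choose maximum or use weighted sum") can be sketched like this; the probabilities are made up for illustration:

```python
import numpy as np

# Made-up softmax output over ages 0..100 (not a real model prediction)
age_probs = np.zeros(101)
age_probs[[29, 30, 31]] = [0.25, 0.5, 0.25]

age_argmax = int(age_probs.argmax())                      # hard decision
age_expected = float((age_probs * np.arange(101)).sum())  # weighted sum
```

The weighted sum is the expected value of the distribution and usually gives a smoother estimate than the argmax.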

Predicting apparent Age and Gender from face picture : Keras + Tensorflow by Youness Mansar

  • Source code: https://github.com/CVxTz/face_age_gender
  • License: MIT
  • Framework: Keras/TensorFlow
  • Input: RGB images 224x224x3
  • Output: gender: one number in range [0..1], where 0 = Female, 1 = Male. age: 8 classes [0..2], [4..6], [8..12], [15..20], [25..32], [38..43], [48..53], [60..100], use softmax, choose maximum.
  • Model weights size: We can estimate it from model architecture: ResNet50 -> Dense(100) -> Dense(1). Approximately: 100 Mb.
  • Pre-trained model available: NO

SSR-Net: A Compact Soft Stagewise Regression Network for Age Estimation by Tsun-Yi Yang, Yi-Hsuan Huang, Yen-Yu Lin, Pi-Cheng Hsiu, Yung-Yu Chuang.

  • Source code: https://github.com/shamangary/SSR-Net
  • License: Apache License 2.0
  • Framework: Keras/TensorFlow
  • Input: RGB images 64x64x3
  • Output: gender: one number in range [0..1], where 0 = Female, 1 = Male. age: one number
  • Model weights size: gender: 0.32 Mb, age: 0.32 Mb,
  • Pre-trained model available: YES
  • Last models update: Apr 2018

Mxnet version implementation of SSR-Net for age and gender estimation by @wayen820

  • Source code: https://github.com/wayen820/gender_age_estimation_mxnet
  • License: unspecified
  • Framework: MXNET
  • Input: RGB image 112x112x3
  • Output: gender: one number in range [0..1], where 0 = Female, 1 = Male. age: one number.
  • Model weights size: gender: 3.94 Mb. age: 1.95 Mb.
  • Pre-trained model available: YES
  • Last models update: Oct 2018

Age and Gender Classification Using Convolutional Neural Networks by Gil Levi and Tal Hassner.

  • Source code: https://github.com/GilLevi/AgeGenderDeepLearning
  • License: as is
  • Framework: Caffe. But models could be loaded with OpenCV.
  • Input: 256x256x3
  • Output: gender: two binary classes: Male and Female, choose maximum. age: 8 classes: [0..2], [4..6], [8..12], [15..20], [25..32], [38..43], [48..53], [60..100], use softmax, choose maximum.
  • Model weights size: gender: 43.5 Mb, age: 43.5 Mb.
  • Pre-trained model available: YES, separate models for gender and age.
  • Last models update: Sep 2017

Age and Gender Deep Learning with TensorFlow (Rude Carnie) by Daniel Pressel

  • Source code: https://github.com/dpressel/rude-carnie
  • License: unspecified
  • Framework: TensorFlow
  • Input: RGB images 256x256x3
  • Output: gender: two binary classes: Male and Female, choose maximum. age: 8 classes: [0..2], [4..6], [8..12], [15..20], [25..32], [38..43], [48..53], [60..100], use softmax, choose maximum.
  • Model weights size: gender: inception_v3 based model — 166 Mb, age: inception_v3 based model — 166 Mb.
  • Pre-trained model available: YES, separate networks for gender and age.
  • Last models update: Apr/Feb 2017

Easy Real time gender age prediction from webcam video with Keras by Chengwei Zhang

  • Source code: https://github.com/Tony607/Keras_age_gender
  • License: unspecified
  • Framework: Keras/TensorFlow
  • Input: RGB images 64x64x3. Possibly, any size can be chosen.
  • Output: gender: one number [0..1], where 1 means Female. age: vector of 101 classes probabilities for ages [0..100], choose maximum or use weighted sum.
  • Model weights size: 186 Mb.
  • Pre-trained model available: YES
  • Last model update: Jan 2018

Age and Gender Estimation by Yusuke Uchida

  • Source code: https://github.com/yu4u/age-gender-estimation
  • License: MIT
  • Framework: Keras/TensorFlow
  • Input: RGB image of any size. Author used 32x32x3
  • Output: gender: one number [0..1], where 1 means Female. age: vector of 101 classes probabilities for ages [0..100], choose maximum or use weighted sum.
  • Model weights size: 187 Mb.
  • Pre-trained model available: YES
  • Last models update: Feb 2018

Age and gender estimation based on Convolutional Neural Network and TensorFlow by Boyuan Jiang

  • Source code: https://github.com/BoyuanJiang/Age-Gender-Estimate-TF
  • License: MIT
  • Framework: TensorFlow
  • Input: RGB image 160x160x3
  • Output: gender: one number, 0 = Female, 1 = Male. age: one number
  • Model weights size: 246.5 Mb.
  • Pre-trained model available: YES
  • Last models update: Nov 2017

Apparent Age and Gender Prediction in Keras by Sefik Ilkin Serengil

Multi output neural network in Keras (Age, gender and race classification) by Sanjaya Subedi

  • Source code: https://github.com/jangedoo/age-gender-race-prediction
  • License: unspecified
  • Framework: Keras/TensorFlow
  • Input: RGB image 198x198x3
  • Output: gender: one number, 0 = Male, 1 = Female. age: one number. race: vector of 5 classes: [‘White’, ‘Black’, ‘Asian’, ‘Indian’, ‘Others’]
  • Model weights size: unknown
  • Pre-trained model available: NO

Summary table

Summary table for age and gender estimation open-source projects

Note: I did not include the models' accuracy reported by the authors, because accuracy numbers are not comparable when different models are tested on different datasets!

Choosing model

I decided to choose the two most lightweight networks, which are able to process video on the fly using only an average CPU.

My choice is:

  1. No. 4, SSR-Net, which has separate models for gender and age, each of size only 0.32 Mb! They are very fast in comparison with other models.
  2. No. 6, the models by Gil Levi and Tal Hassner; these are also two separate models for gender and age, widely used by developers, at about 43 Mb each.

Of course, I would like to have one neural net for both gender and age estimation. Maybe I will spend some time and train such a model myself. In that case I would definitely use the stagewise training technique proposed by the SSR-Net authors.

This Project Architecture

Below is a simple project demonstrating how to implement age and gender estimation.

This simple program randomly chooses a video file from the videos directory.

Then it reads frames one by one in a loop until the end of the video or until the user presses the ESC key.

For each frame:

  1. Get a smaller, resized frame, as it is faster to process small images and this barely affects quality.
  2. Find faces on the small frame.
  3. Use the face coordinates from the small frame to extract face patches from the original (big) frame.
  4. Convert and adjust the face patches to the format the model expects, and construct a blob with all faces.
  5. Pass the blob of faces through the model(s) to get predicted genders and ages for all faces.
  6. Draw a rectangle around each face and a label with the estimated gender and age.
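
The steps above can be sketched roughly as follows; the three stage functions here are trivial stand-ins for the real findFaces, collectFaces and predictAgeGender shown later in the article, and the frame is a synthetic image rather than a video capture:

```python
import numpy as np

def findFaces(small):
    h, w = small.shape[:2]
    return [[w // 4, h // 4, w // 2, h // 2]]  # one fake [x1, y1, x2, y2] box

def collectFaces(frame, boxes):
    return [frame[y1:y2, x1:x2] for (x1, y1, x2, y2) in boxes]

def predictAgeGender(faces):
    return ['Female,25' for _ in faces]  # fake labels

frame = np.zeros((480, 640, 3), dtype=np.uint8)         # stand-in for cap.read()
small = frame[::2, ::2]                                 # step 1: shrink 2x
boxes_small = findFaces(small)                          # step 2: detect on small frame
boxes_orig = [[c * 2 for c in b] for b in boxes_small]  # step 3: rescale coords
faces = collectFaces(frame, boxes_orig)                 # step 4: extract face patches
labels = predictAgeGender(faces)                        # step 5: predict labels
```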

Below you may find some more details.

Initialization

The face detector is initialized based on the face_detector_kind argument:

# Initialize face detector
if face_detector_kind == 'haar':
    face_cascade = cv.CascadeClassifier('face_haar/haarcascade_frontalface_alt.xml')
else:
    face_net = cv.dnn.readNetFromTensorflow('face_net/opencv_face_detector_uint8.pb', 'face_net/opencv_face_detector.pbtxt')

The model to estimate age and gender is initialized based on the age_gender_kind argument:

# Load age and gender models
if age_gender_kind == 'ssrnet':
    # Setup global parameters
    face_size = 64
    face_padding_ratio = 0.10
    # Default parameters for SSR-Net
    stage_num = [3, 3, 3]
    lambda_local = 1
    lambda_d = 1
    # Initialize gender net
    gender_net = SSR_net_general(face_size, stage_num, lambda_local, lambda_d)()
    gender_net.load_weights('age_gender_ssrnet/ssrnet_gender_3_3_3_64_1.0_1.0.h5')
    # Initialize age net
    age_net = SSR_net(face_size, stage_num, lambda_local, lambda_d)()
    age_net.load_weights('age_gender_ssrnet/ssrnet_age_3_3_3_64_1.0_1.0.h5')
else:
    # Setup global parameters
    face_size = 227
    face_padding_ratio = 0.0
    # Initialize gender detector
    gender_net = cv.dnn.readNetFromCaffe('age_gender_net/deploy_gender.prototxt', 'age_gender_net/gender_net.caffemodel')
    # Initialize age detector
    age_net = cv.dnn.readNetFromCaffe('age_gender_net/deploy_age.prototxt', 'age_gender_net/age_net.caffemodel')
    # Class labels for gender_net and age_net outputs
    Genders = ['Male', 'Female']
    Ages = ['(0-2)', '(4-6)', '(8-12)', '(15-20)', '(25-32)', '(38-43)', '(48-53)', '(60-100)']

Reading video

Currently, the video stream is read from a random file in the videos directory.

import os
import time

import cv2 as cv
import numpy as np

# Initialize numpy random generator
np.random.seed(int(time.time()))

# Set video to load
videos = []
for file_name in os.listdir('videos'):
    file_name = 'videos/' + file_name
    if os.path.isfile(file_name) and file_name.endswith('.mp4'):
        videos.append(file_name)
source_path = videos[np.random.randint(len(videos))]

# Create a video capture object to read videos
cap = cv.VideoCapture(source_path)

Detecting faces

Generally, there are two common ways to detect faces:

  • using HAAR cascade,
  • using trained CNN model.

Of course, the CNN model is more accurate, but it requires more computational resources and runs slower.

In this project I decided to implement both and choose between them via the face_detector_kind argument.

Detecting faces with either the HAAR cascade or the ConvNet is very easy:

import math  # needed for the padding calculations below

def findFaces(img, confidence_threshold=0.7):
    # Get original width and height
    height = img.shape[0]
    width = img.shape[1]

    face_boxes = []
    if face_detector_kind == 'haar':
        # Get grayscale image
        gray = cv.cvtColor(img, cv.COLOR_BGR2GRAY)
        # Detect faces
        detections = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        for (x, y, w, h) in detections:
            padding_h = int(math.floor(0.5 + h * face_padding_ratio))
            padding_w = int(math.floor(0.5 + w * face_padding_ratio))
            x1, y1 = max(0, x - padding_w), max(0, y - padding_h)
            x2, y2 = min(x + w + padding_w, width - 1), min(y + h + padding_h, height - 1)
            face_boxes.append([x1, y1, x2, y2])
    else:
        # Convert input image to 3x300x300, as the model expects only 300x300 images
        blob = cv.dnn.blobFromImage(img, 1.0, (300, 300), mean=(104, 117, 123), swapRB=True, crop=False)
        # Pass blob through the model and get detected faces
        face_net.setInput(blob)
        detections = face_net.forward()
        for i in range(detections.shape[2]):
            confidence = detections[0, 0, i, 2]
            if confidence < confidence_threshold:
                continue
            x1 = int(detections[0, 0, i, 3] * width)
            y1 = int(detections[0, 0, i, 4] * height)
            x2 = int(detections[0, 0, i, 5] * width)
            y2 = int(detections[0, 0, i, 6] * height)
            padding_h = int(math.floor(0.5 + (y2 - y1) * face_padding_ratio))
            padding_w = int(math.floor(0.5 + (x2 - x1) * face_padding_ratio))
            x1, y1 = max(0, x1 - padding_w), max(0, y1 - padding_h)
            x2, y2 = min(x2 + padding_w, width - 1), min(y2 + padding_h, height - 1)
            face_boxes.append([x1, y1, x2, y2])
    return face_boxes

Please note the global variable face_padding_ratio, which determines how much to enlarge the face_box detected by either algorithm. Its value depends both on the face detection algorithm and on the age/gender estimation algorithm. Ideally, you should choose its value so that the faces you get are very similar to those the model was trained on.
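
As a toy example of the padding arithmetic (all the numbers below are chosen arbitrarily, not taken from the article's models):

```python
import math

# Toy numbers: a 50x60 detection inside a 640x480 frame
face_padding_ratio = 0.10
x, y, w, h = 100, 80, 50, 60
width, height = 640, 480

padding_w = int(math.floor(0.5 + w * face_padding_ratio))  # rounds to 5 pixels
padding_h = int(math.floor(0.5 + h * face_padding_ratio))  # rounds to 6 pixels
x1, y1 = max(0, x - padding_w), max(0, y - padding_h)
x2, y2 = min(x + w + padding_w, width - 1), min(y + h + padding_h, height - 1)
# the 50x60 box grows to 60x72, clipped to the frame if necessary
```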

Extracting faces patches

This is done in two steps:

  1. Convert face box coordinates from the small frame to the big original frame: box_orig.
  2. Get the part of the original frame specified by those coordinates: face_bgr.

We could, of course, extract faces from the small frame. The reason to extract patches from the big frame instead is that we want to preserve as much quality as possible. But keep in mind that this may also require slightly more computation.

def collectFaces(frame, face_boxes):
    # width/height are the small frame dimensions; width_orig/height_orig are
    # the original frame dimensions (globals set in the main loop)
    faces = []
    # Process faces
    for i, box in enumerate(face_boxes):
        # Convert box coordinates from the resized frame back to the original frame
        box_orig = [
            int(round(box[0] * width_orig / width)),
            int(round(box[1] * height_orig / height)),
            int(round(box[2] * width_orig / width)),
            int(round(box[3] * height_orig / height)),
        ]
        # Extract face box from the original frame w.r.t. image boundaries
        face_bgr = frame[
            max(0, box_orig[1]):min(box_orig[3] + 1, height_orig - 1),
            max(0, box_orig[0]):min(box_orig[2] + 1, width_orig - 1),
            :
        ]
        faces.append(face_bgr)
    return faces

Now the faces list contains face patches, all of different sizes.

Estimating age and gender

In most cases neural networks are designed to work in batch mode, i.e. they can process many input samples at once. This is especially useful at training time, as batch training usually helps models converge faster than stochastic training (one sample at a time).

But before we can feed all faces into a model, we must convert them into the format the model expects. At the very least, we should make all faces the same size and normalize their values.

SSR-Net expects input to be a tensor of size N x 64 x 64 x 3, where N is the number of faces, 64x64 is the height and width, and 3 stands for the color channels. Individual values in the tensor should be min-max normalized; please note the function call cv.normalize(blob[i, :, :, :], None, alpha=0, beta=255, norm_type=cv.NORM_MINMAX), which stretches each face patch to the [0..255] range.
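
For illustration, here is a pure-numpy equivalent of what that min-max normalization does to a toy patch (the patch values are made up):

```python
import numpy as np

# Toy 2x2 "patch"; NORM_MINMAX with alpha=0, beta=255 linearly stretches
# values so the minimum maps to 0 and the maximum to 255
patch = np.array([[10.0, 60.0], [110.0, 210.0]])
lo, hi = patch.min(), patch.max()
normalized = (patch - lo) * 255.0 / (hi - lo)
# normalized now spans 0..255 regardless of the original brightness range
```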

The ConvNet by Gil Levi and Tal Hassner expects input to be a tensor of size N x 3 x 227 x 227, where N is the number of faces, 3 is the number of color channels and 227x227 is the height and width. Individual channels should be mean-centered but not scaled; please note the parameters scalefactor=1.0 and mean=(78.4263377603, 87.7689143744, 114.895847746) in the cv.dnn.blobFromImages call, which do exactly that.

As noted, different models require different image preprocessing, so it is done as follows:

def predictAgeGender(faces):
    if age_gender_kind == 'ssrnet':
        # Convert faces to an N x 64 x 64 x 3 blob
        blob = np.empty((len(faces), face_size, face_size, 3))
        for i, face_bgr in enumerate(faces):
            blob[i, :, :, :] = cv.resize(face_bgr, (64, 64))
            blob[i, :, :, :] = cv.normalize(blob[i, :, :, :], None, alpha=0, beta=255, norm_type=cv.NORM_MINMAX)
        # Predict gender and age
        genders = gender_net.predict(blob)
        ages = age_net.predict(blob)
        # Construct labels
        labels = ['{},{}'.format('Male' if (gender >= 0.5) else 'Female', int(age))
                  for (gender, age) in zip(genders, ages)]
    else:
        # Convert faces to an N x 3 x 227 x 227 blob
        blob = cv.dnn.blobFromImages(faces, scalefactor=1.0, size=(227, 227),
                                     mean=(78.4263377603, 87.7689143744, 114.895847746), swapRB=False)
        # Predict gender
        gender_net.setInput(blob)
        genders = gender_net.forward()
        # Predict age
        age_net.setInput(blob)
        ages = age_net.forward()
        # Construct labels
        labels = ['{},{}'.format(Genders[gender.argmax()], Ages[age.argmax()])
                  for (gender, age) in zip(genders, ages)]
    return labels

That’s it.

Results

While implementing this project, I analyzed different articles and models for estimating a person's gender and age from an image.

I discovered that there are many good models with high accuracy that are nevertheless too big and slow to compute.

On the other hand, there are some small models with lower accuracy that can be used for real-time video processing.

I have successfully used two such models for real-time estimation of age and gender using only an average CPU:

  • SSR-Net by Tsun-Yi Yang, Yi-Hsuan Huang, Yen-Yu Lin, Pi-Cheng Hsiu, Yung-Yu Chuang.
  • ConvNet by Gil Levi and Tal Hassner.

The result is great. And it was fun to do!

Gender is estimated reliably, while age estimation fluctuates around the true value. And all of this is done in real time!

Future thoughts

Nowadays, cameras are getting cheaper and are placed literally everywhere. But we will never have enough people to watch all those cameras.

I believe there is demand for small and accurate models that can estimate and describe the content of a video stream in real time; models that could run on a Raspberry Pi or other small platforms.

But today researchers mostly concentrate on accuracy, not on the applicability of their models. A researcher gets more recognition if their model takes first place for accuracy in a Kaggle competition, but none if their model is the most efficient one, i.e. achieves decent results with significantly less computation. My thoughts here echo this article by Michał Marcinkiewicz: The Real World is not a Kaggle Competition.

Of course, one may argue that analyzing the content of a video is still a complex task, and complex tasks require tons of computation anyway.

But I see at least several ways to achieve high efficiency:

  1. Soft stagewise regression, as proposed by the authors of SSR-Net. I encourage you to read their article: it is a genuinely novel approach to network training. I believe that, suitably reformulated, their basic idea could be extended to other areas of deep learning: not only regression, but also classification, feature extraction, etc.
  2. Layer reuse, as proposed by Okan Kopuklu, Maryam Babaee, Stefan Hormann and Gerhard Rigoll in their article Convolutional Neural Networks with Layer Reuse. Why use many layers, each with its own parameters, if we can apply the same filters multiple times?
  3. Hidden unit reuse. I did not find any article, or even a mention, of this simple idea; please tell me if you know of one. The idea is described below.

Hidden unit reuse

A typical content analyzing pipeline consists of several modules running in sequence or in parallel.

For instance, in this simple project we have:

  1. Input frame -> ConvNet to detect faces -> faces
  2. faces -> ConvNet to estimate gender -> genders
  3. faces -> ConvNet to estimate age -> ages

Where 2 and 3 may run in parallel.

In more sophisticated projects we could also find:

  • Input frame -> ConvNet to recognize common objects -> COCO names
  • Input frame -> ConvNet for semantic segmentation -> segmented image mask

Note that each ConvNet typically consists of many sequential layers, but I suspect the first convolutional layers of different networks are very similar.

I believe that if you take two networks trained for different tasks, you will find similar filter weights in the first layers of both, as these act like basic edge-detection filters.

This means that in complex projects, similar filters process the same image several times.

First you apply these filters when finding faces in the image. Then you apply the same (or similar) filters again when detecting a person's gender, and yet again when estimating their age.

We can save processing time by getting rid of these redundant calculations and reusing the hidden units, i.e. the results of the first layers' filters applied to the input image.

Of course, it’s a little bit challenging as it requires:

  • specially choose pretrained first layers,
  • freeze their parameters when training rest of model layers,
  • extract hidden units values, which could be hard in some frameworks.
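
Here is a toy numpy sketch of the idea: one shared "first layer" convolution is computed once, and two different task heads then consume the same feature map. The kernel and the heads are made-up stand-ins, not parts of any real model:

```python
import numpy as np

def conv2d_valid(img, kernel):
    """Single-channel 2D convolution with 'valid' padding (naive loop version)."""
    h, w = img.shape
    k = kernel.shape[0]
    out = np.zeros((h - k + 1, w - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (img[i:i+k, j:j+k] * kernel).sum()
    return out

rng = np.random.default_rng(42)
image = rng.random((32, 32))
edge_kernel = np.array([[-1., 0., 1.],
                        [-1., 0., 1.],
                        [-1., 0., 1.]])  # a basic vertical-edge filter

features = conv2d_valid(image, edge_kernel)  # shared features, computed ONCE

# Two different "heads" reuse the same feature map instead of re-filtering:
gender_head_input = features.mean()          # stand-in for a gender head
age_head_input = (features ** 2).mean()      # stand-in for an age head
```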

That is it. Thank you for reading!

Note: source code and model weights can be found here.
