Lab Notes: Predicting Gender — Using Machine Learning to Predict Personal Demographics from Images

Mission Data
Mission Data Journal
23 min read · Mar 24, 2020

Summary

Our goal going into this project was to accurately determine the gender of a person based on an image of their face. Leveraging an existing database from the IMDB-WIKI project, we achieved 91.09% accuracy in our training phase and 91.17% in our testing phase, which seemed promising. On closer inspection, however, we learned that females were much more likely to be misclassified in all age ranges but one, and the dataset we used didn’t provide ethnicity, which meant we couldn’t test the accuracy of our predictions across different ethnic groups. All in all, a great first attempt, but not quite ready for prime time.

If you’d like to get into the details with us on how exactly we completed this challenge, or if you’re curious where we plan to take this project next, keep reading.

Introduction

This article is the first in a series in which we explore using machine learning to predict the demographics of people from images. We used a sample business case in which we’re tasked to “Determine marketing demographics from customer photos so that advertising and sales offerings can be tailored to each customer’s needs.” We’ll walk you through the process we took and examine the results, challenges, and next steps for this project.

Because specific demographics are not stated in the business case we started with a smaller goal of being able to identify the customer’s gender from a photo using a TensorFlow model built from scratch. To do this we needed to acquire a data set containing photos of people’s faces that were already annotated with data needed to determine the person’s age range and gender. Luckily the IMDB-WIKI project exists and contains 500k+ images of human faces with age and gender labels.

We will attempt to cover the following in detail, so you can recreate the process or apply one or more of our methods to your projects:

  • Arranging the Data: We provide details for setting up the database and preparing the data for testing and validation.
  • Wrangling, Exploring, and Preprocessing Data: We discuss translating the data into a format suitable for training our model.
  • Partitioning the Data: We walk you through our process of splitting the data into partitions that can be used for training, validation, and testing.
  • Training the Model: We explain how we approached building the machine learning training pipeline with TensorFlow and Keras, and we discuss our findings on the estimated accuracy of our model.
  • Testing the Model: We discuss the results of predictions from actual images and how our model does with images it has never seen.
  • Challenges and Next Steps: We discuss issues encountered and what our next steps will be to refine and extend our predictions for marketing demographics.

The main toolsets we used are:

  • Python 3 — an interpreted, general-purpose programming language
  • Pandas — a Python data analysis library, akin to Excel for programmers and data scientists
  • Numpy — a fast and powerful collection of Python utilities for scientific computing
  • TensorFlow 2.0 — a machine learning framework and library created by Google
  • Keras — a high-level neural networks API for machine learning that complements TensorFlow

The complete source code for this post can be found here: https://git.io/JvoFR.

We hope you enjoy exploring along with us! If you’re interested in this, or similar topics, check out the tag #MLdemographics for more from Mission Data on the subject.


Arranging the Data

We used a data set compiled by the IMDB-WIKI project that contains images of the 100,000 most popular actors on IMDb. The following description from the IMDB-WIKI project page explains how the data was compiled.

“Since the publicly available face image datasets are often of small to medium size, rarely exceeding tens of thousands of images, and often without age information, we decided to collect a large dataset of celebrities. For this purpose, we took the list of the most popular 100,000 actors as listed on the IMDb website and (automatically) crawled from their profiles date of birth, name, gender and all images related to that person. Additionally, we crawled all profile images from pages of people from Wikipedia with the same meta information. We removed the images without timestamp (the date when the photo was taken). Assuming that the images with single faces are likely to show the actor and that the timestamp and date of birth are correct, we were able to assign to each such image the biological (real) age. Of course, we can not vouch for the accuracy of the assigned age information. Besides wrong timestamps, many images are stills from movies — movies that can have extended production times. In total we obtained 460,723 face images from 20,284 celebrities from IMDb and 62,328 from Wikipedia, thus 523,051 in total.” (Rothe, 2016)

The complete data sets from the IMDB-WIKI project are very large — a whopping 272 GB for the data and images from both IMDb and Wikipedia. The project also offers much smaller face-only subsets — 7 GB for IMDb Face Only images and 1 GB for Wikipedia Face Only images. These Face Only data sets were perfect for our needs, and we chose to start with the IMDb Face Only data set.

Image metadata from the IMDb Face Only data set is stored in a MATLAB binary database file — which is great for the storage of variables, functions, arrays and other information — and contains the following fields:

  • dob — the Celebrity’s date of birth, expressed as a MATLAB serial date number
  • photo_taken — the year the photo was taken
  • full_path — the path to the corresponding image file
  • gender — the Celebrity’s gender expressed as 0 = female, 1 = male or NaN (Not a Number) if unknown
  • name — the Celebrity’s name
  • face_location — four coordinates specifying where in the image the face is located
  • face_score — a face detector score (the higher the better) with an Inf (infinity) value indicating that no face was found
  • second_face_score — a second face detector score where a value of NaN means no second face was detected
  • celeb_names — a list of alternate names for the Celebrity
  • celeb_id — a unique number associated with the Celebrity

Unfortunately, the MATLAB binary database format isn’t easily used by Python for data analytics, and we wanted to get the data into a Pandas DataFrame for easier manipulation. For this conversion, we turned to SciPy, a Python library for scientific, mathematical, and engineering computing. We used SciPy’s scipy.io.loadmat function to read the file and convert the data into a format Python can understand.
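The call itself is short. A minimal sketch, assuming the imdb.mat file from the IMDb Face Only download (the nested indexing reflects how loadmat unwraps MATLAB structs):

import scipy.io

# Load the MATLAB binary database into a dictionary of Numpy arrays.
mat = scipy.io.loadmat('imdb_crop/imdb.mat')

# loadmat wraps the MATLAB struct in nested object arrays; each metadata
# field (dob, photo_taken, full_path, gender, ...) lives under the 'imdb' key.
imdb = mat['imdb'][0, 0]
dob = imdb['dob'][0]
photo_taken = imdb['photo_taken'][0]
full_path = imdb['full_path'][0]
gender = imdb['gender'][0]
face_score = imdb['face_score'][0]
second_face_score = imdb['second_face_score'][0]

print('MATLAB database rows:', len(dob))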

MATLAB database rows: 460723

Adding the metadata records from the MATLAB database file to a Pandas DataFrame wasn’t a direct import, however; we needed to convert some of the data first.

The first conversion was calculating each person’s age, since age itself isn’t stored in the MATLAB database file. We used the dob (a MATLAB serial date number) and photo_taken (the year the photo was taken) fields to calculate an approximate age, +/- seven months. The function below was used to calculate the person’s age:
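The original snippet isn’t shown here, but a minimal reconstruction, assuming dob is a MATLAB serial date number and photo_taken is a four-digit year, might look like this:

from datetime import datetime, timedelta

def calculate_age(dob, photo_taken):
    # MATLAB serial date numbers count days from year 0; Python ordinals
    # count days from January 1 of year 1, hence the 366-day offset.
    birth_date = datetime.fromordinal(int(dob)) - timedelta(days=366)
    # Only the year the photo was taken is known, so the age is
    # approximate, +/- about seven months.
    return int(photo_taken) - birth_date.year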

We defined some helper functions to classify an age into an age range and to convert gender and age range identifiers to descriptions. The age range classification helped us verify that gender could be accurately predicted across all age ranges. The tables below show the classifications used, followed by a sketch of the helpers.

Integer Values for the Gender Label
+===============+================+
| Integer Value | Gender Meaning |
+===============+================+
| 0 | Female |
+---------------+----------------+
| 1 | Male |
+---------------+----------------+
Integer Values for the Age Range Label
+===============+===================+
| Integer Value | Age Range Meaning |
+===============+===================+
| 0 | < 18 |
+---------------+-------------------+
| 1 | 18 - 24 |
+---------------+-------------------+
| 2 | 25 - 34 |
+---------------+-------------------+
| 3 | 35 - 44 |
+---------------+-------------------+
| 4 | 45 - 54 |
+---------------+-------------------+
| 5 | 55 - 64 |
+---------------+-------------------+
| 6 | 65 - 74 |
+---------------+-------------------+
| 7 | 75 + |
+---------------+-------------------+
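A minimal sketch of these helpers; the names are hypothetical but mirror the tables above:

GENDER_NAMES = {0: 'Female', 1: 'Male'}
AGE_RANGE_NAMES = {0: '< 18', 1: '18 - 24', 2: '25 - 34', 3: '35 - 44',
                   4: '45 - 54', 5: '55 - 64', 6: '65 - 74', 7: '75 +'}

def classify_age_range(age):
    # Return the age range identifier from the table above for a given age.
    upper_bounds = [18, 25, 35, 45, 55, 65, 75]
    for range_id, bound in enumerate(upper_bounds):
        if age < bound:
            return range_id
    return 7  # 75 +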

Next, we needed to look at the face_score and second_face_score field values. We needed images containing only one face, and the face detection score had to have a decent value. Since the values for the face_score field were vague — the higher the better — we started with a minimum value of 2.0 and used Numpy functions to skip over records with a face_score value of Inf. We also needed to skip records that had a second_face_score value, or where the image was missing. While iterating through this data set, we also assigned an age range to each record, along with names and labels for both the age range and gender, so we could explore the data more easily.
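A sketch of that extraction loop, reusing the arrays and helpers above (the 2.0 threshold is the minimum face_score we chose; os.path.exists guards against missing image files):

import os
import numpy as np

MIN_FACE_SCORE = 2.0
records = []
for i in range(len(dob)):
    if np.isinf(face_score[i]) or face_score[i] < MIN_FACE_SCORE:
        continue  # no face found, or the detection score is too low
    if not np.isnan(second_face_score[i]):
        continue  # a second face was detected
    image_path = os.path.join('imdb_crop', full_path[i][0])
    if not os.path.exists(image_path):
        continue  # the image file is missing
    age = calculate_age(dob[i], photo_taken[i])
    range_id = classify_age_range(age)
    records.append({
        'file_path': image_path,
        'gender_id': gender[i],
        'gender_name': GENDER_NAMES.get(gender[i], 'Unknown'),
        'age': age,
        'age_range_id': range_id,
        # The original kept separate label and name columns; this sketch
        # fills both with the same string.
        'age_range_label': AGE_RANGE_NAMES[range_id],
        'age_range_name': AGE_RANGE_NAMES[range_id],
    })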

Wrangling, Exploring and Preprocessing Data

Now that our data was extracted from the MATLAB database file and placed into an array, we copied the data set to a Pandas DataFrame. Pandas is great for quickly analyzing data in a tabular format, and offers many tools for statistics and filtering right out of the box. So let’s look at the “shape” of our data.
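Building the DataFrame and inspecting its shape takes two calls:

import pandas as pd

df = pd.DataFrame(records)  # one column per key in our record dictionaries
df.info()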

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 136768 entries, 0 to 136767
Data columns (total 7 columns):
file_path 136768 non-null object
gender_id 134498 non-null float64
gender_name 136768 non-null object
age 136768 non-null int64
age_range_id 136768 non-null int64
age_range_label 136768 non-null object
age_range_name 136768 non-null object
dtypes: float64(1), int64(2), object(4)
memory usage: 7.3+ MB

The output from the Pandas DataFrame.info() function gives us a count of records, what fields are available, a count of non-null values for each field, the data type of each field, and the memory usage of the DataFrame. We can see from the “shape” of the data that we had 136,768 records, but the number of non-null gender_id records was only 134,498. Some of the records had a null gender_id value! Our gender_id field’s data type was also float64, even though our data was an integer.

Our next step was to remove all duplicate records, drop records that had fields with null values, and convert the gender_id field’s data type from float64 to int64.
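A sketch of that cleanup chain:

df = df.drop_duplicates()  # remove duplicate records
df = df.dropna()           # drop records with null values (e.g. gender_id)
df['gender_id'] = df['gender_id'].astype('int64')  # float64 to int64
df.info()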

<class 'pandas.core.frame.DataFrame'>
Int64Index: 134498 entries, 0 to 136767
Data columns (total 7 columns):
file_path 134498 non-null object
gender_id 134498 non-null int64
gender_name 134498 non-null object
age 134498 non-null int64
age_range_id 134498 non-null int64
age_range_label 134498 non-null object
age_range_name 134498 non-null object
dtypes: int64(3), object(4)
memory usage: 8.2+ MB

The “shape” of our data looked much better after a few adjustments: all record fields had data, and the data types were as expected. We lost an additional 2,270 records in the cleaning process but still had enough data to train our model.

Now that our data was clean, our next step was to look at the values of our data. The age, age_range_id, and gender_id fields are numeric and are great candidates for descriptive statistical analysis. The Pandas DataFrame.describe() function already provides this functionality out-of-the-box, so we ran that function to understand our data’s values better.

df.describe()

Descriptive Statistics for the Numeric Fields
+=======+===============+===============+=================+
| | gender_id | age | age_range_id |
+=======+===============+===============+=================+
| count | 134498.000000 | 134498.000000 | 134498.000000 |
+-------+---------------+---------------+-----------------+
| mean | 0.518119 | 36.055183 | 2.634998 |
+-------+---------------+---------------+-----------------+
| std | 0.499673 | 12.969357 | 1.351856 |
+-------+---------------+---------------+-----------------+
| min | 0.000000 | -30.000000 | 0.000000 |
+-------+---------------+---------------+-----------------+
| 25% | 0.000000 | 27.000000 | 2.000000 |
+-------+---------------+---------------+-----------------+
| 50% | 1.000000 | 34.000000 | 2.000000 |
+-------+---------------+---------------+-----------------+
| 75% | 1.000000 | 43.000000 | 3.000000 |
+-------+---------------+---------------+-----------------+
| max | 1.000000 | 134.000000 | 7.000000 |
+-------+---------------+---------------+-----------------+

The output from the Pandas DataFrame.describe() function gives the count, mean, standard deviation, minimum, percentiles (25%, 50%, and 75%) and maximum for each of the numeric fields in your DataFrame. The output told us the following story:

  • The age field has a mean value of around 36.06 years of age, with the smallest value being -30 and the largest value being 134, and a standard deviation of around 12.97 years. The percentiles tell us that 25% of the records are younger than or equal to 27 years of age, 50% of the records are younger than or equal to 34 years of age and 75% of the records are younger than or equal to 43 years of age. Our data set looks to trend towards a younger age group.
  • The age_range_id field has a mean value of around 2.63 (“35–44”), with the smallest value being 0 (“ < 18”) and the largest value being 7 (“75+”), and a standard deviation of around 1.35 age ranges. The percentiles tell us that 25% of the records are less than or equal to 2 (“25–34”), 50% of the records are less than or equal to 2 (“25–34”) and 75% of the records are less than or equal to 3 (“35–44”). Our data set looks to trend towards the “25–34” age range.
  • The gender_id field has a mean value of around 0.52, with the smallest being 0 (female) and the largest being 1 (male), and a standard deviation of around 0.4997. The percentiles tell us that 25% of the records are less than or equal to 0 (female), 50% of the records are less than or equal to 1 (male) and 75% of the records are less than or equal to 1 (male). Our data set looks almost evenly distributed between female and male images.

That gave us a lot of great information about our data! We did, however, want to remove records with a negative age. Since we were already thinking about age, we also removed records of people younger than 13 or older than 80. We then called the DataFrame.describe() function again to see how this impacted our data.
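The filter is a one-liner; the lower bound also removes the negative ages:

df = df[(df['age'] >= 13) & (df['age'] <= 80)]
df.describe()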

+=======+===============+===============+=================+
| | gender_id | age | age_range_id |
+=======+===============+===============+=================+
| count | 132077.000000 | 132077.000000 | 132077.000000 |
+-------+---------------+---------------+-----------------+
| mean | 0.519031 | 36.315225 | 2.660986 |
+-------+---------------+---------------+-----------------+
| std | 0.499640 | 12.307695 | 1.301717 |
+-------+---------------+---------------+-----------------+
| min | 0.000000 | 13.000000 | 0.000000 |
+-------+---------------+---------------+-----------------+
| 25% | 0.000000 | 27.000000 | 2.000000 |
+-------+---------------+---------------+-----------------+
| 50% | 1.000000 | 35.000000 | 3.000000 |
+-------+---------------+---------------+-----------------+
| 75% | 1.000000 | 43.000000 | 3.000000 |
+-------+---------------+---------------+-----------------+
| max | 1.000000 | 80.000000 | 7.000000 |
+-------+---------------+---------------+-----------------+

After removing records younger than 13 and older than 80 years of age, the minimum and maximum values for the age field are in line with expectations. The mean, standard deviation, and percentiles for the age, age_range_id, and gender_id fields were only fractionally impacted, indicating that only a small percentage of the data was removed — 2,421 records, or about 2%.

Now that we understood our data a little better, we wanted to visually explore how it was distributed across the age range and gender labels. While the Pandas library comes equipped with graphing capability, the histograms it generates are often difficult to read because the axes and data groups lack labels. We also wanted to share this data with our business team. As a compromise, we used Pandas to generate counts by data group and the charting capabilities of Google Docs to generate bar charts.

Generating counts for a group of data is easily accomplished by combining the Pandas DataFrame.groupby() and DataFrame.count() functions. Since these functions produce a Pandas Series object, we used the Series.to_frame() function to convert the series to a DataFrame and assigned the aggregated values to a column named count. We could then iterate through the records, perform a label name lookup, and print each row’s values in a format we could use in Google Docs.
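A sketch of both sets of counts shown below, using the column names from our DataFrame:

# Image counts by gender
counts = df.groupby('gender_name')['file_path'].count().to_frame('count')
for name, row in counts.iterrows():
    print(name, row['count'])

# Image counts by gender and age range
counts = df.groupby(['gender_name', 'age_range_label'])['file_path'].count().to_frame('count')
for (name, age_range), row in counts.iterrows():
    print(name, age_range, row['count'])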

Let’s look at the distribution results for gender.

Female 63525
Male 68552

We used the output above to generate the Google Docs Chart below:

The distribution of images by gender is almost balanced (48% female and 52% male), and this close distribution should give our model a high level of gender prediction accuracy.

Let’s look at the distribution results for the age range.

Female < 18 2568
Female 18 - 24 11954
Female 25 - 34 24771
Female 35 - 44 15662
Female 45 - 54 5346
Female 55 - 64 2264
Female 65 - 74 727
Female 75+ 233
Male < 18 1375
Male 18 - 24 5641
Male 25 - 34 18975
Male 35 - 44 21457
Male 45 - 54 12626
Male 55 - 64 5585
Male 65 - 74 2366
Male 75+ 527

We used the output above to generate the Google Docs Chart below:

The distribution of gender-specific images by age range is unbalanced with a higher representation of female images in the “25–34” age range and of male images in the “35–44” age range. This is an undesirable distribution for supervised learning because we are training a computer to understand the concept of an age range, instead of merely memorizing. An unbalanced distribution often results in biased predictions — much like when a young child first learns how to recognize a dog and initially refers to all cats as “dog” because they are similar in size and have fur, four legs, and a tail.

Since we wanted to get our initial result quickly, we didn’t spend time balancing the data across the various age ranges. We may revisit this process later, depending on business needs and the initial accuracy rate of our testing.

Our data was now cleaned and the final step with the data was to partition it into chunks that could be used for training, validation, and testing.


Partitioning the Data

The training chunk was used to train our model and the validation chunk was used by the machine learning training engine to validate that learning was actually occurring. This is necessary because machine learning is lazy. The machine learning training engine will always attempt to memorize results because this is the most efficient process. We provided a validation chunk to the machine learning training engine to validate that memorization wasn’t occurring, and that the model was learning concepts.

Finally, we created a testing chunk that was used to measure the accuracy of our model, after the training process has been completed. It was important that the machine learning training engine didn’t see this data and the model wasn’t influenced by that specific data.

The goal of the partitioning process was to:

  • Have an even distribution of records from each age range in each partition
  • Assign 10% of our data set to the testing partition
  • Distribute the remaining 90% of our data set so that:
    - 80% of the remaining data set is assigned to the training partition
    - 20% of the remaining data set is assigned to the validation partition

To do this we filtered our data by each age range and then used the sklearn.model_selection.train_test_split() function, from the scikit-learn Python predictive data analysis library, to create our splits.
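A sketch of the per-age-range split, mirroring the goals above (random_state is an assumption added for reproducibility):

import pandas as pd
from sklearn.model_selection import train_test_split

train_parts, validation_parts, test_parts = [], [], []
for range_id in sorted(df['age_range_id'].unique()):
    subset = df[df['age_range_id'] == range_id]
    # 10% of each age range goes to the testing partition...
    remainder, test = train_test_split(subset, test_size=0.10, random_state=42)
    # ...and the remaining 90% is split 80/20 into training and validation.
    train, validation = train_test_split(remainder, test_size=0.20, random_state=42)
    train_parts.append(train)
    validation_parts.append(validation)
    test_parts.append(test)

train_df = pd.concat(train_parts)
validation_df = pd.concat(validation_parts)
test_df = pd.concat(test_parts)

# Save each partition to a CSV file as a save point.
train_df.to_csv('train.csv', index=False)
validation_df.to_csv('validation.csv', index=False)
test_df.to_csv('test.csv', index=False)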

This gave us 95,090 records for training, 23,776 records for validation and 13,211 records for testing. Finally, we saved each partition to a CSV file as a save point, using the DataFrame.to_csv() function. We were now ready to start training our model.

Training the Model

Before we jumped into training our model we had to decide on a deep learning approach. Since we were dealing with images, a Convolutional Neural Network (CNN) makes the most sense for our problem set. So what is a Convolutional Neural Network anyway?

According to Wikipedia,

“CNN’s are regularized versions of multilayer perceptrons. Multilayer perceptrons usually mean fully connected networks, that is, each neuron in one layer is connected to all neurons in the next layer. The “fully-connectedness” of these networks makes them prone to overfitting data. Typical ways of regularization include adding some form of magnitude measurement of weights to the loss function. CNN takes a different approach towards regularization: they take advantage of the hierarchical pattern in data and assemble more complex patterns using smaller and simpler patterns. Therefore, on the scale of connectedness and complexity, CNNs are on the lower extreme.

Convolutional networks were inspired by biological processes in that the connectivity pattern between neurons resembles the organization of the animal visual cortex. Individual cortical neurons respond to stimuli only in a restricted region of the visual field known as the receptive field. The receptive fields of different neurons partially overlap such that they cover the entire visual field.

CNNs use relatively little pre-processing compared to other image classification algorithms. This means that the network learns the filters that in traditional algorithms were hand-engineered. This independence from prior knowledge and human effort in feature design is a major advantage.” (“Convolutional neural network,” n.d.)


A simplified explanation is that a CNN is a bunch of interconnected layers, kind of like the neurons in your brain, and they all work to solve a problem. The “problem” is ingested through a special input layer, like your eyes, and that data is forwarded to all those interconnected layers (the brain). When the “brain” finishes processing that data the solution is forwarded to an output layer, like your mouth, and gives us a prediction.

Our first step was figuring out how to feed our images and gender labels into our training process. Keras has the ImageDataGenerator() class, which accepts images and applies real-time data augmentation. Data augmentation is an important process, as it helps limit overfitting when training our model. Overfitting happens when a model learns (memorizes) the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data. Data augmentation randomly changes the data in ways that limit the model’s memorization of it.

We only needed to apply data augmentation to our training data, and chose to perform the following data augmentation methods:

  • Rescale — rescales pixel values into a format TensorFlow understands; we used this to convert each image to the matrix of floats our model expects
  • Shear Range — the shear angle in a counter-clockwise direction in degrees
  • Zoom Range — the range for random zoom
  • Horizontal Flip — randomly flips inputs horizontally

Our validation data only applies the Rescale augmentation to convert each image to a matrix of floats. The other augmentation processes aren’t needed because the model validates its training progress and accuracy against this data set.
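A sketch of the two generators (the shear and zoom values shown are assumptions; rescaling by 1/255 converts 8-bit pixels to the matrix of floats our model expects):

from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rescale=1. / 255,      # convert pixel values to floats in [0, 1]
    shear_range=0.2,       # shear angle in degrees (assumed value)
    zoom_range=0.2,        # range for random zoom (assumed value)
    horizontal_flip=True)  # randomly flip inputs horizontally

# Validation data only needs the rescale conversion.
validation_datagen = ImageDataGenerator(rescale=1. / 255)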

Since our data took the form of images on disk and columns in a Pandas DataFrame, we chose the Keras ImageDataGenerator.flow_from_dataframe() method, which lets you use a Pandas DataFrame as a data and label source. It also gives us a data batch generator.

Being able to generate data in batches is important for controlling the amount of memory used while training our model. The flow_from_dataframe() generator is also designed to keep the training process from waiting on the next batch of images, which keeps training time down.

We chose to start with a batch size of 128 images and created generators for our training and validation data sets. In each generator, we specify the Pandas DataFrame, the DataFrame’s column for the path to the image file, the DataFrame’s column for the image’s label (class), the desired image size, the batch size, and a class mode.

The class mode prepares the data to be trained. Since we have two labels/classes (female and male), we chose to use “categorical” since this allows us to work with multiple classes.

We also used the shuffle option for our training generator. This shuffles the order of the data and serves as an additional safeguard against overfitting.
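Putting those options together, a sketch of both batch flows (the 224 x 224 target size matches the model input shown in the next section):

train_generator = train_datagen.flow_from_dataframe(
    dataframe=train_df,
    x_col='file_path',         # column with the path to the image file
    y_col='gender_name',       # column with the image's label (class)
    target_size=(224, 224),    # desired image size
    batch_size=128,
    class_mode='categorical',  # supports multiple classes
    shuffle=True)              # shuffle the training data

validation_generator = validation_datagen.flow_from_dataframe(
    dataframe=validation_df,
    x_col='file_path',
    y_col='gender_name',
    target_size=(224, 224),
    batch_size=128,
    class_mode='categorical')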

Found 95090 validated image filenames belonging to 2 classes.
Found 23776 validated image filenames belonging to 2 classes.

Each of our generators loaded the expected number of images, and each correctly identified our two classes (female and male).

Now that we had a process to load data, it was time to build our model and pipeline.

For the model, we used a Keras Sequential model. This is a linear stack of layers and allows us to define each layer in the constructor. A Layer is a processing function, and we can stack different types of Layers together like building blocks to process our data as needed.

Our first layer was a Conv2D layer that accepted our images as input and output sixteen filters in the convolution. Because we’re working with images, a convolution is simply an element-wise multiplication of two matrices followed by a sum.

A MaxPooling2D layer followed, which down-samples its input. This down-sampling reduces the dimensions of the input and allows the training engine to make assumptions about features. In other words, it simplifies our data and gives the training engine a smaller thing to look at.

Conv2D and MaxPooling2D layers were repeated twice more, each with the Conv2D’s filter count increasing exponentially. As data moves through our layers the data patterns get more complex. The number of filters is increased with each layer so we can capture as many of those pattern combinations as possible.

We then included a Flatten layer to convert our data to a simpler vector output.

And finally, we included two Dense layers. A Dense layer is the actual machine learning network. Our first Dense layer was configured as a network with 512 units. Our second Dense layer was configured as a network of 2 units — this is our output layer and had one unit per class (female or male).

We chose to build a model that would allow us to explore results and didn’t spend time on optimization. There may be a better way to solve this classification problem but we chose to spend our time on generating initial results instead of optimizing layers, hyperparameters or approach.
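A sketch of a Sequential model consistent with the summary below; the 3 x 3 kernel sizes are inferred from the parameter counts, and the ReLU and softmax activations are assumptions:

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, Dense, Flatten, MaxPooling2D

model = Sequential([
    # Input layer: 224x224 RGB images, sixteen filters in the convolution.
    Conv2D(16, (3, 3), padding='same', activation='relu',
           input_shape=(224, 224, 3)),
    MaxPooling2D(),                  # down-sample to 112x112
    Conv2D(32, (3, 3), padding='same', activation='relu'),
    MaxPooling2D(),                  # down-sample to 56x56
    Conv2D(64, (3, 3), padding='same', activation='relu'),
    MaxPooling2D(),                  # down-sample to 28x28
    Flatten(),                       # 28 * 28 * 64 = 50,176-element vector
    Dense(512, activation='relu'),   # the main learning network
    Dense(2, activation='softmax'),  # output layer: one unit per class
])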

With our model defined we now had to compile it. The compile step lets us define additional properties, and we made use of the Adam optimizer. The Adam optimizer is an adaptive learning rate algorithm designed specifically for training deep neural networks; it leverages adaptive learning rate methods to find an individual learning rate for each parameter.

We used CategoricalCrossentropy for our loss function because we were classifying our images into multiple classes (female and male), and CategoricalCrossentropy defines a multi-class log loss.

We also specified that we want to track metrics for our model’s accuracy. This information will be used after we train our model to see how well we did.

Let’s compile our model and print out a summary.
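A sketch of the compile step (the trailing None in the output below is simply the return value of model.summary() being printed):

from tensorflow.keras.losses import CategoricalCrossentropy
from tensorflow.keras.optimizers import Adam

model.compile(
    optimizer=Adam(),                # adaptive learning rates per parameter
    loss=CategoricalCrossentropy(),  # multi-class log loss
    metrics=['accuracy'])            # track accuracy for later evaluation

print(model.summary())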

Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d (Conv2D) (None, 224, 224, 16) 448
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 112, 112, 16) 0
_________________________________________________________________
conv2d_1 (Conv2D) (None, 112, 112, 32) 4640
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 56, 56, 32) 0
_________________________________________________________________
conv2d_2 (Conv2D) (None, 56, 56, 64) 18496
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 28, 28, 64) 0
_________________________________________________________________
flatten (Flatten) (None, 50176) 0
_________________________________________________________________
dense (Dense) (None, 512) 25690624
_________________________________________________________________
dense_1 (Dense) (None, 2) 1026
=================================================================
Total params: 25,715,234
Trainable params: 25,715,234
Non-trainable params: 0
_________________________________________________________________
None

Our model is compiled and the summary information shows us the layers and the number of parameters at each stage.

We then started training our model. We chose to train our model for four epochs. An epoch is a complete pass through the entire training data set. During, and at the end of, each epoch we’ll receive feedback on how well the model is doing at the classification process. This feedback will show accuracy and loss for both the training and validation data sets.
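The training call itself is short; in TensorFlow 2, model.fit() accepts the generators directly:

history = model.fit(
    train_generator,
    epochs=4,  # four complete passes through the training data
    validation_data=validation_generator)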

Epoch 1/4
742/742 [==============================] - 712s 960ms/step - loss: 0.3910 - accuracy: 0.8273 - val_loss: 0.3001 - val_accuracy: 0.8774
Epoch 2/4
742/742 [==============================] - 720s 970ms/step - loss: 0.2812 - accuracy: 0.8862 - val_loss: 0.2685 - val_accuracy: 0.8935
Epoch 3/4
742/742 [==============================] - 716s 965ms/step - loss: 0.2568 - accuracy: 0.8996 - val_loss: 0.2455 - val_accuracy: 0.9062
Epoch 4/4
742/742 [==============================] - 722s 973ms/step - loss: 0.2443 - accuracy: 0.9055 - val_loss: 0.2343 - val_accuracy: 0.9109

At the end of our training, we had a model that gave us a 91.09% accuracy of gender prediction from our validation data set.

Let’s chart the progress of accuracy and loss from the training history for both our training and validation data sets.
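A typical way to chart the history object returned by model.fit() is with matplotlib (an addition to the toolsets listed earlier):

import matplotlib.pyplot as plt

epochs = range(len(history.history['accuracy']))

# Chart 1: training vs. validation accuracy
plt.figure()
plt.plot(epochs, history.history['accuracy'], label='Training accuracy')
plt.plot(epochs, history.history['val_accuracy'], label='Validation accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()

# Chart 2: training vs. validation loss
plt.figure()
plt.plot(epochs, history.history['loss'], label='Training loss')
plt.plot(epochs, history.history['val_loss'], label='Validation loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()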

The first chart shows training and validation accuracy. The x-axis is the number of epochs (zero-based) and the y-axis is our prediction accuracy.

The second chart shows training and validation loss. The x-axis is the number of epochs (zero-based) and the y-axis is our prediction loss.

These charts are used to see if our model needs additional training, if the model configuration is poor, or if our model is overfitting and has started to memorize results.

Both sets of lines for accuracy and loss continue to move in the same direction, are concave, and make incremental improvements. This suggests that our model configuration is sound and we haven’t hit overfitting. The validation accuracy line is still trending up and the validation loss line is still trending down when our last epoch finishes, which may indicate that additional accuracy is possible through additional epochs.

This approach looks promising.


Testing the Model

We did well for our first attempt at training our model. In the test phase, we ran predictions from the entire test data set. This is new data that the model had not seen during its training process, and helped us understand if our model continued to be as accurate as expected.

We iterated through each of the test data set records, loaded the record’s image and asked the model to predict the person’s gender. We then compared the predicted result with the actual result and placed the predicted value and whether the prediction was correct into a DataFrame for later analysis.
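A sketch of that prediction loop; mapping index 0 to female and 1 to male assumes the generators’ alphabetical class ordering:

import numpy as np
import pandas as pd
from tensorflow.keras.preprocessing.image import img_to_array, load_img

rows = []
for _, record in test_df.iterrows():
    img = load_img(record['file_path'], target_size=(224, 224))
    x = img_to_array(img) / 255.0                     # same rescaling as training
    probs = model.predict(np.expand_dims(x, axis=0))  # shape (1, 2)
    predicted_id = int(np.argmax(probs))              # 0 = Female, 1 = Male
    rows.append({
        'file_path': record['file_path'],
        'actual_id': record['gender_id'],
        'predicted_id': predicted_id,
        'correct': predicted_id == record['gender_id'],
    })

results_df = pd.DataFrame(rows)
print('Test accuracy: {:.2%}'.format(results_df['correct'].mean()))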

The overall accuracy result from our test data set was 91.17%, slightly higher than our model’s final validation accuracy but still close. So far our results were consistent with expectations.

Let’s look at our accuracy rates across each of the age ranges and explore the percentages of incorrect predictions by gender.

The top 3 age ranges for accuracy are “35–44” (92.35%), “45–54” (92.16%) and “55–64” (91.59%). The “35–44” age range is close to our original expectations, but we also thought the “25–34” age range would be higher due to the number of records.

We also see that female records have a higher percentage of incorrect predictions than male records in every age range except “< 18”.

To better understand why we’re seeing this behavior, additional exploration would be needed into the face_score values, the characteristics of the actual images, and data integrity. These tasks are very labor-intensive, so we’ll leave that work for later and instead focus on getting preliminary results to our business team for further requirements and approval.

As a finishing touch, we created a sampling of 30 random test data set results so our business team could see actual prediction and image data. Each sample shows the image and the predicted gender, color-coded green if the prediction was correct and red if incorrect.

Celebrity images and attributes used are from the IMDB-WIKI project. (Rothe, 2015)

After seeing the actual images from our random sampling of test data we saw one false negative from a data issue, and one image incorrectly predicted as male.

Challenges and Next Steps

We think we did well for our first attempt. The IMDB-WIKI project made our job easier since the data set was already created. We do wish the face_score ranking were better defined so that we understood its meaning.

Our data set was also predominantly from the age ranges of “25–34” and “35–44”. In future attempts, additional records should be found to even out the representation across all age ranges.

The IMDB-WIKI project also doesn’t provide ethnic group information. We’re not sure how accurate our model is across the various ethnic groups, and it would be great to have another way to slice the test results when looking at incorrect predictions.

Provided that our business team views our preliminary results as positive, our next step would be to include a prediction for age range. We may also want to work on increasing the accuracy of our gender prediction.

References

Rothe, R., Timofte, R., Van Gool, L. (2016). Deep expectation of real and apparent age from a single image without facial landmarks. International Journal of Computer Vision, 126 (2–4), 144–157. https://link.springer.com/article/10.1007/s11263-016-0940-3

Rothe, R., Timofte, R., Van Gool, L. (2015). IMDB-WIKI — 500k+ face images with age and gender labels. IMDB-WIKI. https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/

Convolutional neural network. (2020, March 10). In Wikipedia, The Free Encyclopedia. Retrieved from https://en.wikipedia.org/w/index.php?title=Convolutional_neural_network&oldid=944917484


Have a digital product you need designed and developed? Drop us a line.
