Developing an End-to-End Masked Face Detection Model with Superior Performance
Establishing a Solid Foundation for a Dependable Masked Face Recognition System
In a previous discussion, we explored the development of a dependable and highly accurate face recognition system. However, that particular system was limited to detecting and recognizing only unmasked faces. In this article, we will take the first steps towards constructing an end-to-end masked face recognition system by establishing a trustworthy and precise masked face detection model.
Our first step is to build a classifier that labels an input image as either masked or unmasked. This classification determines the path the algorithm follows next: if the image is classified as masked, the pipeline runs the masked face detection functions; otherwise it follows the standard unmasked face detection path.
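As a minimal sketch of this routing step (the three callables here are placeholders, not functions defined in this article — the classifier and the two detection branches are built later):

def route_face(image, classify_mask, detect_masked_face, detect_unmasked_face):
    # classify_mask returns True when the classifier predicts the "Masked" class;
    # the two detect_* callables are the masked and unmasked detection branches.
    if classify_mask(image):
        return detect_masked_face(image)
    return detect_unmasked_face(image)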
We will present two candidate solutions, discuss the advantages and disadvantages of each, and evaluate their performance. The candidates are ResNet50V2 and MobileNetV2, two popular convolutional neural network architectures widely used in computer vision. ResNet50V2, an improved version of ResNet, is known for its deep structure and the skip connections that make very deep networks trainable, while MobileNetV2 is designed for resource-constrained environments such as mobile devices and aims for a good trade-off between accuracy and computational efficiency through depth-wise separable convolutions and inverted residuals. Both architectures have demonstrated excellent performance on a wide range of image recognition and classification tasks, making them sensible starting points for our masked/unmasked classifier.
ResNet50V2 is a 50-layer variant of the ResNet architecture that has gained significant popularity in computer vision. Like the original ResNet, it is built from residual blocks: skip connections let the network bypass certain layers, allowing gradients to flow more easily during backpropagation and addressing the vanishing gradient problem. The V2 revision additionally applies batch normalization and the activation before each convolution (pre-activation), which further eases the training of very deep networks. This architecture achieves high accuracy on standard image recognition benchmarks, and its ability to train very deep networks makes it a valuable tool for researchers and practitioners working on complex computer vision problems.
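As a rough illustration of the idea (a simplified sketch, not the exact bottleneck block used inside ResNet50V2), a pre-activation residual block can be written in Keras as follows; it assumes the input already has the same number of channels as filters, so the skip connection can be added directly:

from tensorflow.keras import layers

def residual_block(x, filters):
    # Simplified pre-activation residual block: BN -> ReLU -> Conv, twice,
    # then add the unchanged input back in (the skip connection).
    # ResNet50V2's real blocks are bottlenecks with three convolutions
    # and projection shortcuts when shapes change.
    shortcut = x
    y = layers.BatchNormalization()(x)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    return layers.Add()([shortcut, y])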
MobileNetV2 is a lightweight convolutional neural network architecture designed specifically for resource-constrained environments. With the rise of mobile and embedded devices, MobileNetV2 offers a solution for efficient image recognition on these platforms. It focuses on achieving a balance between model accuracy and computational efficiency. One of its key features is the use of depth-wise separable convolutions, which significantly reduce the computational cost by separating the spatial and channel-wise convolutions.
MobileNetV2 also employs inverted residuals with linear bottlenecks to further enhance efficiency. Despite its compact size, MobileNetV2 has demonstrated impressive performance on various image classification tasks, making it a popular choice for deploying deep learning models on mobile devices with limited computational resources.
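For intuition, a heavily simplified inverted residual block with a linear bottleneck might look like the sketch below (the real MobileNetV2 blocks also include batch normalization and handle stride and channel changes more carefully):

from tensorflow.keras import layers

def inverted_residual_block(x, expansion, filters, stride=1):
    # Expand with a 1x1 convolution, filter spatially with a cheap depth-wise
    # convolution, then project back down with a linear (no activation) 1x1
    # convolution -- the "linear bottleneck".
    in_channels = x.shape[-1]
    y = layers.Conv2D(expansion * in_channels, 1, padding="same")(x)
    y = layers.ReLU(6.0)(y)
    y = layers.DepthwiseConv2D(3, strides=stride, padding="same")(y)
    y = layers.ReLU(6.0)(y)
    y = layers.Conv2D(filters, 1, padding="same")(y)
    if stride == 1 and in_channels == filters:
        y = layers.Add()([x, y])  # residual connection only when shapes match
    return y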
Implementation:
1-) Data Collection: For training our AI model, we have gathered a dataset comprising 2165 samples of masked faces and an equal number of samples of unmasked faces. Additionally, we have set aside 500 masked face samples and 500 unmasked face samples specifically for testing purposes. The careful collection and labeling of this data is of paramount importance in training our AI model effectively.
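The training script below assumes the dataset is organized into one sub-directory per class (e.g., Masked/ and Unmasked/) under separate train and test folders; the paths are specific to the author's machine. A small sanity-check sketch for counting the samples in that assumed layout:

import os
from imutils import paths

# Count the images found for each class in the assumed directory layout:
# <dataset_root>/train/<class_name>/... and <dataset_root>/test/<class_name>/...
dataset_root = "/home/jawabreh/Desktop/MFDD/dataset"  # adjust to your own path
for split in ("train", "test"):
    for class_name in sorted(os.listdir(os.path.join(dataset_root, split))):
        images = list(paths.list_images(os.path.join(dataset_root, split, class_name)))
        print("{}/{}: {} images".format(split, class_name, len(images)))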
2-) Training: To begin, we will train the ResNet50V2 model using the Python code below, which relies on the TensorFlow and Keras libraries. The ResNet50V2 base network comes pre-trained on the ImageNet dataset. The following example trains our ResNet50V2 classifier:
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input
from tensorflow.keras.preprocessing.image import img_to_array
from tensorflow.keras.preprocessing.image import load_img
from tensorflow.keras.applications import ResNet50V2
from tensorflow.keras.layers import AveragePooling2D
from tensorflow.keras.utils import to_categorical
from sklearn.preprocessing import LabelBinarizer
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model
from imutils import paths
import tensorflow as tf
import numpy as np
import os
# initialize the initial learning rate, number of epochs to train for, and batch size
INIT_LR = 1e-4
EPOCHS = 20
BS = 32
# grab the list of images in our dataset directory, then initialize the list of data (i.e., images) and class labels
dataset_path = "/home/jawabreh/Desktop/MFDD/dataset/train"
print("[INFO] loading images...")
imagePaths = list(paths.list_images(dataset_path))
data = []
labels = []
# loop over the image paths
for imagePath in imagePaths:
    # extract the class label from the directory name
    label = imagePath.split(os.path.sep)[-2]
    # load the input image (224x224) and preprocess it
    image = load_img(imagePath, target_size=(224, 224))
    image = img_to_array(image)
    image = preprocess_input(image)
    # update the data and labels lists, respectively
    data.append(image)
    labels.append(label)
# convert the data and labels to NumPy arrays
data = np.array(data, dtype="float32")
labels = np.array(labels)
# perform one-hot encoding on the labels
lb = LabelBinarizer()
labels = lb.fit_transform(labels)
labels = to_categorical(labels)
# use all of the loaded images for training (a separate directory is used for testing)
trainX = data
trainY = labels
# construct the training image generator for data augmentation
aug = ImageDataGenerator(
    rotation_range=20,
    zoom_range=0.15,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.15,
    horizontal_flip=True,
    fill_mode="nearest")
# load the ResNet50V2 network, ensuring the head FC layer sets are
# left off
baseModel = ResNet50V2(weights="imagenet", include_top=False,
    input_tensor=Input(shape=(224, 224, 3)))
# construct the head of the model that will be placed on top of
# the base model
headModel = baseModel.output
headModel = AveragePooling2D(pool_size=(7, 7))(headModel)
headModel = Flatten(name="flatten")(headModel)
headModel = Dense(128, activation="relu")(headModel)
headModel = Dropout(0.5)(headModel)
headModel = Dense(2, activation="softmax")(headModel)
# place the head FC model on top of the base model (this will become
# the actual model we will train)
model = Model(inputs=baseModel.input, outputs=headModel)
# loop over all layers in the base model and freeze them so they will
# *not* be updated during the first training process
for layer in baseModel.layers:
    layer.trainable = False
# compile our model
print("[INFO] compiling model...")
opt = tf.keras.optimizers.legacy.Adam(lr=INIT_LR, decay=INIT_LR / EPOCHS)
model.compile(loss="binary_crossentropy", optimizer=opt, metrics=["accuracy"])
print("[INFO] training head...")
H = model.fit(
    aug.flow(trainX, trainY, batch_size=BS),
    steps_per_epoch=len(trainX) // BS,
    epochs=EPOCHS)
print("[INFO] saving mask detector model...")
model.save("ResNet50V2_model", save_format="h5")
print("[INFO] MODEL SAVED")
And below is a Python example of training our MobileNetV2 model in the same way:
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input
from tensorflow.keras.preprocessing.image import img_to_array
from tensorflow.keras.preprocessing.image import load_img
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.layers import AveragePooling2D
from tensorflow.keras.utils import to_categorical
from sklearn.preprocessing import LabelBinarizer
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model
from imutils import paths
import tensorflow as tf
import numpy as np
import os
# initialize the initial learning rate, number of epochs to train for, and batch size
INIT_LR = 1e-4
EPOCHS = 20
BS = 32
# grab the list of images in our dataset directory, then initialize the list of data (i.e., images) and class labels
dataset_path = "/home/jawabreh/Desktop/MFDD/dataset/train"
print("[INFO] loading images...")
imagePaths = list(paths.list_images(dataset_path))
data = []
labels = []
# loop over the image paths
for imagePath in imagePaths:
    # extract the class label from the directory name
    label = imagePath.split(os.path.sep)[-2]
    # load the input image (224x224) and preprocess it
    image = load_img(imagePath, target_size=(224, 224))
    image = img_to_array(image)
    image = preprocess_input(image)
    # update the data and labels lists, respectively
    data.append(image)
    labels.append(label)
# convert the data and labels to NumPy arrays
data = np.array(data, dtype="float32")
labels = np.array(labels)
# perform one-hot encoding on the labels
lb = LabelBinarizer()
labels = lb.fit_transform(labels)
labels = to_categorical(labels)
# use all of the loaded images for training (a separate directory is used for testing)
trainX = data
trainY = labels
# construct the training image generator for data augmentation
aug = ImageDataGenerator(
    rotation_range=20,
    zoom_range=0.15,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.15,
    horizontal_flip=True,
    fill_mode="nearest")
# load the MobileNetV2 network, ensuring the head FC layer sets are
# left off
baseModel = MobileNetV2(weights="imagenet", include_top=False,
    input_tensor=Input(shape=(224, 224, 3)))
# construct the head of the model that will be placed on top of
# the base model
headModel = baseModel.output
headModel = AveragePooling2D(pool_size=(7, 7))(headModel)
headModel = Flatten(name="flatten")(headModel)
headModel = Dense(128, activation="relu")(headModel)
headModel = Dropout(0.5)(headModel)
headModel = Dense(2, activation="softmax")(headModel)
# place the head FC model on top of the base model (this will become
# the actual model we will train)
model = Model(inputs=baseModel.input, outputs=headModel)
# loop over all layers in the base model and freeze them so they will
# *not* be updated during the first training process
for layer in baseModel.layers:
    layer.trainable = False
# compile our model
print("[INFO] compiling model...")
opt = tf.keras.optimizers.legacy.Adam(lr=INIT_LR, decay=INIT_LR / EPOCHS)
model.compile(loss="binary_crossentropy", optimizer=opt, metrics=["accuracy"])
print("[INFO] training head...")
H = model.fit(
    aug.flow(trainX, trainY, batch_size=BS),
    steps_per_epoch=len(trainX) // BS,
    epochs=EPOCHS)
print("[INFO] saving mask detector model...")
model.save("MobileNetV2_model", save_format="h5")
print("[INFO] MODEL SAVED")
3-) Testing: Once we have completed the training process for both models, we will proceed to evaluate their performance and make a comparative analysis. For this purpose, we will utilize a dedicated test dataset consisting of 500 samples of masked faces and an equal number of samples of unmasked faces, as mentioned earlier.
By subjecting the models to this test set, we can measure how accurately each one classifies masked and unmasked faces. This evaluation phase is crucial for judging how effective and reliable the trained models would be in real-world scenarios. Below is a Python example of the evaluation script:
import cv2
import numpy as np
import os
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.image import img_to_array
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input
# load the trained model
model = load_model("MobileNetV2_model")
# define the input image size and the confidence threshold
IMG_SIZE = (224, 224)
CONF_THRESH = 0.5
# define the directory where the test data is stored
test_dir = "/home/jawabreh/Desktop/MFDD/dataset/test"
# define the names of the subdirectories containing the masked and unmasked images
masked_dir = "Masked"
unmasked_dir = "Unmasked"
# define the name of the output file to write the results to (tab-separated values, which Excel can open)
results_file = "results.tsv"
# initialize lists to store the ground truth labels and the predicted labels
ground_truth_labels = []
predicted_labels = []
file_names = []
# iterate over all the masked images in the test directory
for filename in os.listdir(os.path.join(test_dir, masked_dir)):
    # load the input image and preprocess it
    image = cv2.imread(os.path.join(test_dir, masked_dir, filename))
    image = cv2.resize(image, IMG_SIZE)
    image = img_to_array(image)
    image = preprocess_input(image)
    # make predictions on the input image using the trained model
    predictions = model.predict(np.expand_dims(image, axis=0))[0]
    # extract the probabilities for each class
    mask_prob = predictions[0]
    without_mask_prob = predictions[1]
    # determine the predicted class
    if mask_prob > without_mask_prob and mask_prob > CONF_THRESH:
        label = "Mask"
    else:
        label = "Unmasked"
    # add the ground truth and predicted labels to their respective lists
    ground_truth_labels.append("Mask")
    predicted_labels.append(label)
    file_names.append(filename)
# iterate over all the unmasked images in the test directory
for filename in os.listdir(os.path.join(test_dir, unmasked_dir)):
    # load the input image and preprocess it
    image = cv2.imread(os.path.join(test_dir, unmasked_dir, filename))
    image = cv2.resize(image, IMG_SIZE)
    image = img_to_array(image)
    image = preprocess_input(image)
    # make predictions on the input image using the trained model
    predictions = model.predict(np.expand_dims(image, axis=0))[0]
    # extract the probabilities for each class
    mask_prob = predictions[0]
    without_mask_prob = predictions[1]
    # determine the predicted class
    if without_mask_prob > mask_prob and without_mask_prob > CONF_THRESH:
        label = "Unmasked"
    else:
        label = "Mask"
    # add the ground truth and predicted labels to their respective lists
    ground_truth_labels.append("Unmasked")
    predicted_labels.append(label)
    file_names.append(filename)
# write the results to a tab-separated file
with open(results_file, "w") as f:
    f.write("Ground Truth Label\tPredicted Label\tFile Name\n")
    for i in range(len(ground_truth_labels)):
        f.write("{}\t{}\t{}\n".format(ground_truth_labels[i], predicted_labels[i], file_names[i]))
print("Results written to {}".format(results_file))
Based on these test results, we will calculate the recall, precision, and F1 score as follows:
Recall: Recall measures the ability of the model to correctly identify positive instances. It is calculated by dividing the true positives (TP) by the sum of true positives and false negatives (FN).
Recall = TP / (TP + FN)
Precision: Precision evaluates the accuracy of positive predictions made by the model. It is calculated by dividing the true positives (TP) by the sum of true positives and false positives (FP).
Precision = TP / (TP + FP)
F1 Score: The F1 score is the harmonic mean of precision and recall, providing a single balanced measure of the model's performance. It is calculated as twice the product of precision and recall divided by their sum.
F1 Score = 2 * ((Precision * Recall) / (Precision + Recall))
By calculating these metrics, we can gain insights into the performance of the model in terms of correctly identifying positive instances (recall), the accuracy of positive predictions (precision), and an overall balanced measure (F1 score). These metrics are valuable in assessing the effectiveness of the model and making informed decisions about its deployment.
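As a sketch, these metrics can be computed directly from the ground_truth_labels and predicted_labels lists built by the evaluation script above, for example with scikit-learn, treating "Mask" as the positive class:

from sklearn.metrics import precision_score, recall_score, f1_score

# Compute the metrics from the label lists produced by the evaluation script,
# treating "Mask" as the positive class.
precision = precision_score(ground_truth_labels, predicted_labels, pos_label="Mask")
recall = recall_score(ground_truth_labels, predicted_labels, pos_label="Mask")
f1 = f1_score(ground_truth_labels, predicted_labels, pos_label="Mask")
print("Precision: {:.4f}, Recall: {:.4f}, F1: {:.4f}".format(precision, recall, f1))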
Masked and Unmasked Face Detection
This code utilizes pre-trained models for masked face detection. It performs the following steps:
- Importing Libraries: The necessary libraries are imported, including TensorFlow, OpenCV, and NumPy, which provide functions for image processing, deep learning, and numerical operations.
- Command-line Arguments: The code uses argparse to parse command-line arguments. These arguments include the path to the input image, the directory containing the face detector model, the path to the trained face mask detector model, and the minimum confidence threshold for face detections.
- Loading Models: The code loads the face detector model and the face mask detector model. The face detector model is loaded from the provided “config.prototxt” and “weight.caffemodel” files using OpenCV’s dnn module. The face mask detector model is loaded using TensorFlow’s load_model() function.
- Image Preprocessing: The input image is loaded using OpenCV’s imread() function. The original image is copied, and its spatial dimensions are stored. A blob is created from the image using OpenCV’s dnn.blobFromImage() function, which performs preprocessing operations like mean subtraction and scaling.
- Face Detection: The blob is passed through the face detector model, and the detections are obtained using the net.forward() function. The confidence of each detection is checked against the minimum confidence threshold. If the confidence is higher, the bounding box coordinates are extracted, and the face region is cropped.
- Face Mask Detection: The cropped face region is converted from BGR to RGB color ordering and resized to 224x224 pixels. It is then preprocessed by converting it to an array, normalizing the pixel values, and adding an extra dimension to match the model’s input shape. The preprocessed face is passed through the face mask detector model, and predictions are obtained.
- Result Visualization: Based on the predictions, the code determines whether the face has a mask or not. It assigns a label (“Mask” or “No Mask”) and a color (green for masked and red for unmasked) to each bounding box. The probability and the label are displayed on the image using OpenCV’s putText() function. The bounding box is drawn using the rectangle() function.
- Output Display: The output image with bounding boxes and labels is displayed using OpenCV’s imshow() function. The image remains visible until a key is pressed.
- Main Function: The main function calls the mask_image() function, which executes the entire process.
By utilizing the pre-trained face detector and face mask detector models, this code provides an efficient way to detect and classify masked faces in an input image, enabling various applications in the domain of face recognition and public safety.
Below is a Python example of how to detect faces in an input image and classify each one as masked or unmasked:
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input
from tensorflow.keras.preprocessing.image import img_to_array
from tensorflow.keras.models import load_model
import numpy as np
import argparse
import cv2
import os
def mask_image():
    # construct the argument parser and parse the arguments
    ap = argparse.ArgumentParser()
    ap.add_argument("-i", "--image", required=True,
        help="path to input image")
    ap.add_argument("-f", "--face", type=str,
        default="face_detector",
        help="path to face detector model directory")
    ap.add_argument("-m", "--model", type=str,
        default="ResNet50V2_model",
        help="path to trained face mask detector model")
    ap.add_argument("-c", "--confidence", type=float, default=0.5,
        help="minimum probability to filter weak detections")
    args = vars(ap.parse_args())
    # load our serialized face detector model from disk
    print("[INFO] loading face detector model...")
    configPath = os.path.sep.join([args["face"], "config.prototxt"])
    weightsPath = os.path.sep.join([args["face"], "weight.caffemodel"])
    net = cv2.dnn.readNet(configPath, weightsPath)
    # load the face mask detector model from disk
    print("[INFO] loading face mask detector model...")
    model = load_model(args["model"])
    # load the input image from disk, clone it, and grab the image spatial
    # dimensions
    image = cv2.imread(args["image"])
    orig = image.copy()
    (h, w) = image.shape[:2]
    # construct a blob from the image
    blob = cv2.dnn.blobFromImage(image, 1.0, (300, 300),
        (104.0, 177.0, 123.0))
    # pass the blob through the network and obtain the face detections
    print("[INFO] computing face detections...")
    net.setInput(blob)
    detections = net.forward()
    # loop over the detections
    for i in range(0, detections.shape[2]):
        # extract the confidence (i.e., probability) associated with
        # the detection
        confidence = detections[0, 0, i, 2]
        # filter out weak detections by ensuring the confidence is
        # greater than the minimum confidence
        if confidence > args["confidence"]:
            # compute the (x, y)-coordinates of the bounding box for
            # the object
            box = detections[0, 0, i, 3:7] * np.array([w, h, w, h])
            (startX, startY, endX, endY) = box.astype("int")
            # ensure the bounding boxes fall within the dimensions of
            # the frame
            (startX, startY) = (max(0, startX), max(0, startY))
            (endX, endY) = (min(w - 1, endX), min(h - 1, endY))
            # extract the face ROI, convert it from BGR to RGB channel
            # ordering, resize it to 224x224, and preprocess it
            face = image[startY:endY, startX:endX]
            face = cv2.cvtColor(face, cv2.COLOR_BGR2RGB)
            face = cv2.resize(face, (224, 224))
            face = img_to_array(face)
            face = preprocess_input(face)
            face = np.expand_dims(face, axis=0)
            # pass the face through the model to determine if the face
            # has a mask or not
            (mask, withoutMask) = model.predict(face)[0]
            # determine the class label and color we'll use to draw
            # the bounding box and text
            label = "Mask" if mask > withoutMask else "No Mask"
            color = (0, 255, 0) if label == "Mask" else (0, 0, 255)
            # include the probability in the label
            label = "{}: {:.2f}%".format(label, max(mask, withoutMask) * 100)
            # display the label and bounding box rectangle on the output
            # frame
            cv2.putText(image, label, (startX, startY - 10),
                cv2.FONT_HERSHEY_SIMPLEX, 0.45, color, 2)
            cv2.rectangle(image, (startX, startY), (endX, endY), color, 2)
    # show the output image
    cv2.imshow("Output", image)
    cv2.waitKey(0)

if __name__ == "__main__":
    mask_image()
In the next article, we will see how this masked face detection and classification model can be used in a full masked face recognition pipeline.
Follow my GitHub for more projects:
https://github.com/Jawabreh0