Person Classification using CCTV Surveillance Video

Meinanto Yuriawan
Published in Nerd For Tech
4 min read · Jul 4, 2021

In this article, I want to share my Google Bangkit Capstone project. The project is an empty-house monitoring application that can send a notification if there is a human presence inside the house.

Sometimes, when we need to leave our house empty, such as for a vacation, we leave it unguarded. We can use a CCTV camera that is monitored through an app, but we have to supervise the footage all the time because conventional CCTV monitoring apps usually can’t detect human presence. My team designed a system to prevent crime from happening inside an empty house, using a Machine Learning method to classify whether or not there is a human in the scene.

CCTV footage of empty house

Image Classification

We use the image classification method to implement the solution that we designed. Image classification works by defining a set of target classes (objects to identify in images) and training a model to recognize them using labeled example photos.[1]

There are only 2 classes in this project, person and no person, so the project is a binary image classification task.

There are several steps to perform the classification:

  1. Prepare the dataset
  2. Build a neural network model
  3. Train the model using the collected dataset
  4. Test the model.

Prepare the Dataset

Based on the problem that we want to solve, we need surveillance CCTV footage from inside the house. I collected the data from YouTube videos and CCTV recordings, then sliced the videos into frames using this script:

Script for slicing video by frame
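The original gist is not embedded here; a minimal sketch of such a slicing script, assuming OpenCV (cv2) and hypothetical file paths, could look like this:

```python
import os
import cv2

VIDEO_PATH = "cctv_recording.mp4"   # hypothetical input video
OUTPUT_DIR = "frames"               # hypothetical output folder
SECONDS_PER_FRAME = 3               # keep one frame every 3 seconds

os.makedirs(OUTPUT_DIR, exist_ok=True)

cap = cv2.VideoCapture(VIDEO_PATH)
fps = cap.get(cv2.CAP_PROP_FPS)
frame_interval = max(int(fps * SECONDS_PER_FRAME), 1)

frame_idx, saved = 0, 0
while True:
    ret, frame = cap.read()
    if not ret:
        break
    # Save only one frame per interval to avoid capturing near-duplicate images
    if frame_idx % frame_interval == 0:
        cv2.imwrite(os.path.join(OUTPUT_DIR, f"frame_{saved:05d}.jpg"), frame)
        saved += 1
    frame_idx += 1

cap.release()
print(f"Saved {saved} frames")
```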

I customized the slicing to take 1 frame every 3 seconds to minimize capturing similar frames more than once. A total of 2,394 frames were acquired, then divided into 80% for training and 20% for validation. The path structure for the train and validation files is like this:
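The original folder-structure screenshot is not embedded here. Assuming the standard layout that Keras’ flow_from_directory expects, with one subfolder per class (the folder names below are my assumption), it looks roughly like this sketch:

```python
# Assumed folder layout (class folder names are hypothetical):
#
#   dataset/
#   ├── train/
#   │   ├── person/        # frames that contain a person
#   │   └── no_person/     # frames of the empty house
#   └── validation/
#       ├── person/
#       └── no_person/
```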

Build the Neural Network model

In this project, I use the transfer learning method to implement the image classification algorithm. InceptionV3 is used as the pre-trained model to classify an image.

Inception v3 is a widely-used image recognition model that has been shown to attain greater than 78.1% accuracy on the ImageNet dataset. The model is the culmination of many ideas developed by multiple researchers over the years. It is based on the original paper: “Rethinking the Inception Architecture for Computer Vision” by Szegedy, et al. [2]

Code snippet to import Inceptionv3
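The original gist is not embedded here either; a minimal sketch of the import, assuming the stock ImageNet weights that ship with Keras and the usual transfer-learning step of freezing the pre-trained layers, could look like this:

```python
from tensorflow.keras.applications.inception_v3 import InceptionV3

# Load InceptionV3 without its top classification layers,
# using a 256x256 RGB input (ImageNet weights are an assumption here)
pre_trained_model = InceptionV3(
    input_shape=(256, 256, 3),
    include_top=False,
    weights="imagenet",
)

# Freeze the pre-trained layers so only the new head is trained
for layer in pre_trained_model.layers:
    layer.trainable = False

pre_trained_model.summary()
```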

The code above imports the InceptionV3 model into our Colab notebook and prints the model summary to show how the layers are arranged. The input is set to 256x256 pixels with 3 color channels (RGB).

For this project, I only use the InceptionV3 model up to the ‘mixed7’ layer, then flatten its output to 1 dimension. The flattened output is fed into a dense layer with 1024 hidden units and a ReLU activation function.

I used a dropout layer with a rate of 20% to reduce overfitting, and the output is a single sigmoid unit. The model is compiled with RMSprop at a 0.0001 learning rate, binary cross-entropy loss (because it is binary classification), and an accuracy metric.
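Putting the description above together, a sketch of the model head could look like this (the use of the Keras functional API is my assumption):

```python
from tensorflow.keras import layers, Model
from tensorflow.keras.optimizers import RMSprop

# Take the output of the 'mixed7' layer as the feature-extractor cut-off
last_output = pre_trained_model.get_layer("mixed7").output

x = layers.Flatten()(last_output)                 # flatten to 1 dimension
x = layers.Dense(1024, activation="relu")(x)      # 1024 hidden units, ReLU
x = layers.Dropout(0.2)(x)                        # 20% dropout against overfitting
x = layers.Dense(1, activation="sigmoid")(x)      # single sigmoid output

model = Model(pre_trained_model.input, x)

model.compile(
    optimizer=RMSprop(learning_rate=1e-4),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
```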

Training

Before training the model, we need to upload the dataset to the notebook. To make it easier, I already uploaded the dataset to Google Drive, so it only needs to be mounted to our notebook.
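A minimal sketch of the mount step, with a hypothetical dataset path inside Drive:

```python
from google.colab import drive

# Mount Google Drive so the notebook can read the uploaded dataset
drive.mount("/content/drive")

# Hypothetical location of the dataset inside Drive
base_dir = "/content/drive/MyDrive/cctv_dataset"
train_dir = base_dir + "/train"
validation_dir = base_dir + "/validation"
```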

After the images are uploaded, we need to do image augmentation. I used ImageDataGenerator to generate batches of tensor image data with real-time data augmentation.
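The exact augmentation settings are not shown in this article, so the parameters and batch size below are assumptions; the overall pattern with ImageDataGenerator and flow_from_directory looks like this:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augment the training images on the fly; only rescale the validation images
train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
)
validation_datagen = ImageDataGenerator(rescale=1.0 / 255)

train_generator = train_datagen.flow_from_directory(
    train_dir,
    target_size=(256, 256),
    batch_size=32,
    class_mode="binary",
)
validation_generator = validation_datagen.flow_from_directory(
    validation_dir,
    target_size=(256, 256),
    batch_size=32,
    class_mode="binary",
)
```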

After everything is ready, let’s start training!
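The training itself is a single model.fit call on the two generators; a minimal sketch:

```python
# Train the head for 20 epochs on the augmented generators
history = model.fit(
    train_generator,
    validation_data=validation_generator,
    epochs=20,
    verbose=1,
)
```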

The training was done with 20 epochs; here are the results:

Training and Validation accuracy
Training and Validation loss

The model tends to be a little overfit at some points, as seen where the training accuracy is slightly higher than the validation accuracy. This is expected because of the lack of variety in the dataset. To solve this problem, adding more variety to the data should help.

Testing

The video testing is done by classifying the video frame by frame. To do that, we need to run this code:

Before we feed in each frame, we need to preprocess it. We have to make sure that the format is RGB and the frame is resized to 256x256 pixels. Then we normalize the frame values.
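The original testing snippet is not embedded here; a sketch of the frame-by-frame loop, assuming OpenCV and a hypothetical test video file, could look like this:

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("test_video.mp4")   # hypothetical test video

while True:
    ret, frame = cap.read()
    if not ret:
        break

    # OpenCV reads frames as BGR, so convert to RGB first
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    # Resize to the model's 256x256 input and normalize to [0, 1]
    resized = cv2.resize(rgb, (256, 256)) / 255.0
    batch = np.expand_dims(resized, axis=0)

    prediction = model.predict(batch)[0][0]
    label = "person" if prediction > 0.5 else "no person"
    print(label, prediction)

cap.release()
```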

If the prediction value is greater than 0.5, the frame is classified as containing a person. Here is the result:

Classification result

The model succeeded in classifying the person in the frame.

For this project, I have created an API using FastAPI to deploy the model on GCP. If you are interested, you can check it here.
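The actual API code is not shown in this article; a minimal sketch of what a FastAPI prediction endpoint could look like, where the endpoint name, response fields, and saved-model path are all assumptions:

```python
import io

import numpy as np
from fastapi import FastAPI, File, UploadFile
from PIL import Image
from tensorflow.keras.models import load_model

app = FastAPI()
model = load_model("person_classifier.h5")   # hypothetical saved model

@app.post("/predict")
async def predict(file: UploadFile = File(...)):
    # Read the uploaded image, convert to RGB, resize and normalize
    image = Image.open(io.BytesIO(await file.read())).convert("RGB")
    array = np.array(image.resize((256, 256))) / 255.0
    # Run the binary classifier and apply the 0.5 threshold
    score = float(model.predict(np.expand_dims(array, axis=0))[0][0])
    return {"person": score > 0.5, "score": score}
```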

Link to the notebook

I worked together with my friend on this project; feel free to visit https://fragitya.medium.com/person-classification-on-cctv-video-21bc0c7fe24e

Sources:

[1] https://developers.google.com/machine-learning/practica/image-classification

[2] https://cloud.google.com/tpu/docs/inception-v3-advanced
