Suspicious Human Activity Recognition from CCTV with LRCN model

Learn how to classify human activity from CCTV footage using LRCN model with Keras and TensorFlow in Python

Kunal Tulsidasani
5 min read · Feb 9, 2022

Introduction

In today's world, CCTV surveillance is the most basic and impactful security feature a premises can have. It can be found in hospitals, malls, universities, etc., and is the most common way of preventing and detecting unwanted activities. But imagine an academic campus with more than 100 CCTV cameras spread across buildings such as hostels, classrooms, canteens, sports areas and auditoriums. Manually monitoring every event on those cameras is impossible, and even after an event has happened, manually searching for it in the recorded video wastes a lot of time.

We will be creating a Long-term Recurrent Convolutional Network (LRCN) based system for an academic campus that monitors CCTV footage and detects non-suspicious activities (running, walking) and a suspicious activity (fighting). The system can be extended with an alarm that notifies the user whenever a suspicious activity is detected.

The Plan

  • Load Suspicious Human Activity Recognition Data.
  • Pre-process the data.
  • Build LRCN Model for Classification.
  • Evaluate the Model.

The complete project is on GitHub.

Run the code in your browser.

Suspicious Human Activity Recognition Data

The data has been compiled from two different datasets: the KTH Action dataset and the Video Fight Detection dataset.

  • KTH dataset — a database containing six types of human actions (walking, jogging, running, boxing, hand waving and hand clapping) performed several times by 25 subjects. All sequences were taken over homogeneous backgrounds with a static camera at a 25 fps frame rate. The sequences were down-sampled to a spatial resolution of 160x120 pixels and have a length of 4 seconds on average.
  • Video Fight Detection Dataset — a Kaggle dataset consisting of over 100 videos taken from movies and YouTube, which can be used for training on the suspicious behaviour (fighting).

100 videos of each action were taken: walking and running from the KTH dataset, fighting from the Kaggle dataset.

The final compiled dataset used: Drive Link

Set Dataset Variables

These are the variables used throughout the code: the height and width each frame is loaded at, the sequence length (the number of frames to consider per video), the dataset directory and the classes for classification. We also set a random seed, which will be used later when splitting the data into train and test sets.
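A minimal sketch of these variables (the names and the seed value here are illustrative, not the author's exact code):

import random
import numpy as np
import tensorflow as tf

# Size each frame will be resized to
IMAGE_HEIGHT, IMAGE_WIDTH = 64, 64

# Number of frames sampled from each video
SEQUENCE_LENGTH = 30

# Directory containing one sub-folder per class
DATASET_DIR = "dataset"

# Activities to classify; fighting is the suspicious one
CLASSES_LIST = ["running", "walking", "fighting"]

# Fix the random seeds so the train/test split is reproducible
SEED = 27
np.random.seed(SEED)
random.seed(SEED)
tf.random.set_seed(SEED)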

Data Preprocessing

Here is the process for converting a single video into a NumPy array that can be used for training the model:

  • Extracting Frames: Each video is read using the OpenCV library, and 30 frames are extracted from it at equal time intervals. Each frame is read as a 3D NumPy array of shape (height, width, 3), where the last dimension holds the RGB colour values of each pixel.
  • Resizing: Frame resizing is necessary when we need to increase or decrease the total number of pixels. We resized all frames to a width of 64 px and a height of 64 px so that the inputs to the architecture are uniform.
  • Normalization: Normalization helps the learning algorithm learn faster and capture the necessary features from the images. We normalized each resized frame by dividing it by 255, so that every pixel value lies between 0 and 1.
  • Store in NumPy Arrays: The sequence of 30 resized and normalized frames is stored in a NumPy array to be given as input to the model.

The following function performs the above steps and takes the video location as its parameter.
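The function itself is embedded as a gist in the original post; a minimal sketch of it, built on the variables above and OpenCV, could look like this:

import cv2
import numpy as np

def frames_extraction(video_path):
    # Holds the resized, normalized frames of one video
    frames_list = []
    video_reader = cv2.VideoCapture(video_path)

    # Total frame count, and the gap between the 30 sampled frames
    video_frames_count = int(video_reader.get(cv2.CAP_PROP_FRAME_COUNT))
    skip_frames_window = max(int(video_frames_count / SEQUENCE_LENGTH), 1)

    for frame_counter in range(SEQUENCE_LENGTH):
        # Jump to the next evenly spaced frame and read it
        video_reader.set(cv2.CAP_PROP_POS_FRAMES,
                         frame_counter * skip_frames_window)
        success, frame = video_reader.read()
        if not success:
            break

        # Resize to 64x64 and scale pixel values into [0, 1]
        resized_frame = cv2.resize(frame, (IMAGE_WIDTH, IMAGE_HEIGHT))
        frames_list.append(resized_frame / 255.0)

    video_reader.release()
    return frames_list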

To preprocess the whole dataset and load it into NumPy arrays, a second function is used which returns the features extracted from each video, the labels and the path of each file loaded.
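A sketch of that loader, assuming the dataset folder contains one sub-folder per class:

import os
import numpy as np

def create_dataset():
    features, labels, video_files_paths = [], [], []

    for class_index, class_name in enumerate(CLASSES_LIST):
        class_dir = os.path.join(DATASET_DIR, class_name)
        for file_name in os.listdir(class_dir):
            video_file_path = os.path.join(class_dir, file_name)
            frames = frames_extraction(video_file_path)

            # Keep only videos that yielded a full 30-frame sequence
            if len(frames) == SEQUENCE_LENGTH:
                features.append(frames)
                labels.append(class_index)
                video_files_paths.append(video_file_path)

    return np.asarray(features), np.array(labels), video_files_paths

features, labels, video_files_paths = create_dataset()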

The last preprocessing step is the encoding of the categories:

# Using Keras's to_categorical method to convert labels
# into one-hot-encoded vectors
from tensorflow.keras.utils import to_categorical

one_hot_encoded_labels = to_categorical(labels)

Data Shape

# Returns shape of features & labels
print(features.shape, labels.shape)

features — (#videos, #frames per video, height, width, RGB)

labels — (#videos,)

(300, 30, 64, 64, 3) (300,)

Train Test Split Data

Split the preprocessed data for training and testing:

  • 75% of the data is used for Training
  • 25% of the data is used for Testing
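One way to do this is with scikit-learn's train_test_split, reusing the seed set earlier (the variable names are illustrative):

from sklearn.model_selection import train_test_split

# Shuffle and split the data: 75% train, 25% test
features_train, features_test, labels_train, labels_test = train_test_split(
    features, one_hot_encoded_labels,
    test_size=0.25, shuffle=True, random_state=SEED)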

LRCN Model

Long-term recurrent convolutional network (LRCN): Donahue et al. proposed LRCNs, a class of architectures leveraging the strengths of rapid progress in CNNs for visual recognition problems and the growing desire to apply such models to time-varying inputs and outputs. An LRCN processes the variable-length visual input with a CNN, whose outputs are fed into a stack of recurrent sequence models (LSTMs), which finally produce a variable-length prediction. Both the CNN and LSTM weights are shared across time, resulting in a representation that scales to arbitrarily long sequences.

Long-term Recurrent Convolutional Networks for Visual Recognition and Description, Jeff Donahue, Lisa Anne Hendricks, Marcus Rohrbach, Subhashini Venugopalan, Sergio Guadarrama, Kate Saenko, Trevor Darrell, 2016

Building Model

We’ll create a basic LRCN model with four CNN layers followed by an LSTM layer. You can increase the complexity of the model on your own.
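A sketch of such a model in Keras (the filter counts and LSTM size are illustrative, not necessarily the author's exact configuration):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Conv2D, Dense, Dropout, Flatten,
                                     LSTM, MaxPooling2D, TimeDistributed)

model = Sequential([
    # TimeDistributed applies the same CNN layer to every frame
    TimeDistributed(Conv2D(16, (3, 3), padding='same', activation='relu'),
                    input_shape=(SEQUENCE_LENGTH, IMAGE_HEIGHT, IMAGE_WIDTH, 3)),
    TimeDistributed(MaxPooling2D((4, 4))),
    TimeDistributed(Dropout(0.25)),

    TimeDistributed(Conv2D(32, (3, 3), padding='same', activation='relu')),
    TimeDistributed(MaxPooling2D((4, 4))),
    TimeDistributed(Dropout(0.25)),

    TimeDistributed(Conv2D(64, (3, 3), padding='same', activation='relu')),
    TimeDistributed(MaxPooling2D((2, 2))),
    TimeDistributed(Dropout(0.25)),

    TimeDistributed(Conv2D(64, (3, 3), padding='same', activation='relu')),
    TimeDistributed(MaxPooling2D((2, 2))),

    # Flatten each frame's feature maps into a vector for the LSTM
    TimeDistributed(Flatten()),
    LSTM(32),
    Dense(len(CLASSES_LIST), activation='softmax'),
])

model.summary()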

Model Training

The model is trained on the training split of the data, with an early stopping callback.

Early stopping callback — a callback that stops training when the model's performance on a monitored metric stops improving or starts getting worse.
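A sketch of the training step with Keras's EarlyStopping callback (the patience, epoch and batch-size values are illustrative):

from tensorflow.keras.callbacks import EarlyStopping

# Stop once validation loss stops improving and keep the best weights
early_stopping = EarlyStopping(monitor='val_loss', patience=10,
                               restore_best_weights=True)

model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])

history = model.fit(features_train, labels_train,
                    epochs=50, batch_size=4,
                    validation_split=0.2,
                    callbacks=[early_stopping])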

Evaluation

Here is the accuracy graph while training:

Image by Kunal

The model's performance could be improved further with some hyperparameter tuning.

Accuracy on test data:
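A minimal way to obtain it, using the 25% test split held out earlier:

# model.evaluate returns the loss and the accuracy metric
loss, accuracy = model.evaluate(features_test, labels_test)
print(f"Accuracy = {accuracy * 100}")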

Accuracy = 82.66666666666667

The model achieved an accuracy of about 83%, which is not bad for a dataset of only 300 videos.

Future Work

The model is not yet trained on other suspicious actions such as fainting, burglary, etc., so it can still be improved by training it on more suspicious actions. Also, if the code is run on a high-end GPU, it can be used to process CCTV footage in near real-time.

Conclusion

We created an LRCN model for detecting activities such as fighting, walking and running in CCTV footage. The model was trained on 300 videos and achieved an accuracy of about 83%.

The complete project is on GitHub.

Run the code in your browser.

References

Long-term Recurrent Convolutional Networks for Visual Recognition and Description — Jeff Donahue, Lisa Anne Hendricks, Marcus Rohrbach, Subhashini Venugopalan, Sergio Guadarrama, Kate Saenko, Trevor Darrell, 2016.

You can connect with me on LinkedIn, or follow me on GitHub.

