Suspicious Human Activity Recognition from CCTV with LRCN model
Learn how to classify human activity in CCTV footage using an LRCN model with Keras and TensorFlow in Python
Introduction
In today's world, CCTV surveillance is the most basic and impactful security feature a premises can have. It can be found in hospitals, malls, universities and similar places, and remains the most common way of preventing and detecting unwanted activity. But imagine an academic campus with more than 100 CCTV cameras spread across multiple buildings such as hostels, classrooms, the canteen, sports areas and the auditorium. Manually monitoring every event on these cameras is impossible, and even after an event has happened, manually searching for it in the recorded video wastes a lot of time.
We will build a Long-term Recurrent Convolutional Network (LRCN) based system for an academic campus that monitors CCTV footage and distinguishes non-suspicious activities (running, walking) from suspicious activity (fighting). The system can be extended to raise an alarm that notifies the user whenever a suspicious activity is detected.
The Plan
- Load Suspicious Human Activity Recognition Data.
- Pre-process the data.
- Build LRCN Model for Classification.
- Evaluate the Model.
Suspicious Human Activity Recognition Data
The data has been compiled from two different datasets: the KTH Action dataset and the Video Fight Detection Dataset.
- KTH dataset — A database containing six types of human actions (walking, jogging, running, boxing, hand waving and hand clapping) performed several times by 25 subjects. All sequences were taken over homogeneous backgrounds with a static camera at a 25 fps frame rate. The sequences were down-sampled to a spatial resolution of 160x120 pixels and are 4 seconds long on average.
- Video Fight Detection Dataset — A Kaggle dataset consisting of over 100 videos taken from movies and YouTube, which can be used for training on the suspicious behaviour (fighting).
100 videos of each action were taken: walking and running from the KTH dataset, fighting from the Kaggle dataset.
The final compiled dataset used: Drive Link
Set Datasets Variable
These are the variables used throughout the code: the height and width each frame is resized to, the sequence length (the number of frames taken from each video), the dataset directory, and the classes for classification. We also fix a random seed, which will be used later in the train/test split of the data.
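A minimal sketch of these settings; the variable names and the seed value are illustrative, and `DATASET_DIR` should point to your own copy of the dataset:

```python
import random

import numpy as np
import tensorflow as tf

# Illustrative constants; adjust DATASET_DIR to your download location.
IMAGE_HEIGHT, IMAGE_WIDTH = 64, 64   # spatial size each frame is resized to
SEQUENCE_LENGTH = 30                 # number of frames sampled from each video
DATASET_DIR = "dataset"              # root folder with one sub-folder per class
CLASSES_LIST = ["walking", "running", "fighting"]

SEED = 27                            # fixed seed for a reproducible train/test split
np.random.seed(SEED)
random.seed(SEED)
tf.random.set_seed(SEED)
```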
Data Preprocessing
Process to convert a single video into a numpy array so that it can be used for training the model.
- Extracting Frames: Each video is read using the OpenCV library, and 30 frames are extracted from it at equal time intervals. Each frame is read as a 3D numpy array of dimensions (height, width, 3), where the last dimension holds the RGB colour values of each pixel.
- Resizing: Frame resizing is necessary when we need to increase or decrease the total number of pixels. We resize all frames to a width of 64 px and a height of 64 px to keep the inputs to the architecture uniform.
- Normalization: Normalizing helps the learning algorithm converge faster and capture the necessary features from the images. We normalize each resized frame by dividing it by 255, so that every pixel value lies between 0 and 1.
- Store in Numpy Arrays: The sequence of 30 resized and normalized frames is stored in a numpy array to be given as input to the model.
This function performs the above steps and takes the video location as its parameter.
To preprocess the whole dataset and load it into numpy arrays, the following function is used; it returns the features, the labels, and the path of each video loaded.
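A possible shape for this function, assuming a directory layout with one sub-folder per class and a `frames_extraction` helper like the one sketched above (the name `create_dataset` and the signature are assumptions):

```python
import os

import numpy as np


def create_dataset(dataset_dir, classes_list, sequence_length=30):
    """Return (features, labels, video_files_paths) for every video on disk.

    Relies on a frames_extraction() helper like the one sketched earlier.
    """
    features, labels, video_files_paths = [], [], []
    for class_index, class_name in enumerate(classes_list):
        class_dir = os.path.join(dataset_dir, class_name)
        for file_name in os.listdir(class_dir):
            video_path = os.path.join(class_dir, file_name)
            frames = frames_extraction(video_path)
            if len(frames) == sequence_length:  # drop videos that were too short
                features.append(frames)
                labels.append(class_index)
                video_files_paths.append(video_path)
    return np.asarray(features), np.array(labels), video_files_paths
```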
The last preprocessing step is the encoding of the categories:
# Using Keras's to_categorical method to convert labels
# into one-hot-encoded vectors
from tensorflow.keras.utils import to_categorical

one_hot_encoded_labels = to_categorical(labels)
Data Shape
# Returns shape of features & labels
print(features.shape, labels.shape)
features — (#videos, #Frames per video, height, width, RGB )
labels — (#videos,)
(300, 30, 64, 64, 3) (300,)
Train Test Split Data
Split the preprocessed data for training and testing:
- 75% of the data is used for Training
- 25% of the data is used for Testing
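A runnable sketch of the split using scikit-learn's `train_test_split`; the zero-filled arrays below are small stand-ins for the real `features` and `one_hot_encoded_labels` from the preprocessing step (the dataset has 300 videos, not 20):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the per-video shape from the tutorial.
features = np.zeros((20, 30, 64, 64, 3), dtype="float32")
one_hot_encoded_labels = np.eye(3, dtype="float32")[np.arange(20) % 3]

# 75% train / 25% test, shuffled with a fixed seed for reproducibility.
features_train, features_test, labels_train, labels_test = train_test_split(
    features, one_hot_encoded_labels, test_size=0.25,
    shuffle=True, random_state=27)

print(features_train.shape, features_test.shape)
# (15, 30, 64, 64, 3) (5, 30, 64, 64, 3)
```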
LRCN Model
Long-term recurrent convolutional networks (LRCNs), proposed by Donahue et al., are a class of architectures that combine the strengths of CNNs for visual recognition with recurrent models for time-varying inputs and outputs. An LRCN processes a variable-length visual input with a CNN, whose per-frame outputs are fed into a stack of recurrent sequence models (LSTMs), which finally produce a variable-length prediction. Both the CNN and LSTM weights are shared across time, resulting in a representation that scales to arbitrarily long sequences.
Building Model
We’ll create a basic LRCN model with 4 CNN layers followed by an LSTM layer. You may increase the complexity of the model on your own.
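One way such a model might look; the filter counts, pooling sizes and LSTM width below are illustrative choices, not necessarily the author's exact configuration. Each convolutional block is wrapped in `TimeDistributed` so it is applied to every frame of the 30-frame sequence, and the per-frame features are then fed to an LSTM:

```python
import numpy as np
from tensorflow.keras.layers import (LSTM, Conv2D, Dense, Dropout, Flatten,
                                     Input, MaxPooling2D, TimeDistributed)
from tensorflow.keras.models import Sequential

SEQUENCE_LENGTH, HEIGHT, WIDTH, N_CLASSES = 30, 64, 64, 3  # as set earlier

model = Sequential([
    Input(shape=(SEQUENCE_LENGTH, HEIGHT, WIDTH, 3)),

    # Four CNN blocks, each applied frame by frame via TimeDistributed.
    TimeDistributed(Conv2D(16, (3, 3), padding="same", activation="relu")),
    TimeDistributed(MaxPooling2D((4, 4))),
    TimeDistributed(Dropout(0.25)),

    TimeDistributed(Conv2D(32, (3, 3), padding="same", activation="relu")),
    TimeDistributed(MaxPooling2D((4, 4))),
    TimeDistributed(Dropout(0.25)),

    TimeDistributed(Conv2D(64, (3, 3), padding="same", activation="relu")),
    TimeDistributed(MaxPooling2D((2, 2))),
    TimeDistributed(Dropout(0.25)),

    TimeDistributed(Conv2D(64, (3, 3), padding="same", activation="relu")),
    TimeDistributed(MaxPooling2D((2, 2))),

    # Flatten each frame's feature map, then model the sequence with an LSTM.
    TimeDistributed(Flatten()),
    LSTM(32),
    Dense(N_CLASSES, activation="softmax"),
])
model.summary()
```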
Model Training
The model is to be trained on the training split of the data, with an early stopping callback.
Early stopping callback — A callback that stops model training once the monitored metric stops improving (or starts degrading), which avoids overfitting and wasted epochs.
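A runnable sketch of training with Keras's `EarlyStopping` callback. The tiny model and random arrays below are stand-ins so the snippet is self-contained; in the tutorial you would pass the real `model`, `features_train` and `labels_train`, and train for many more epochs:

```python
import numpy as np
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import LSTM, Dense, Flatten, Input, TimeDistributed
from tensorflow.keras.models import Sequential

# Stand-ins for the real training data and LRCN model from the earlier steps.
features_train = np.random.rand(8, 30, 64, 64, 3).astype("float32")
labels_train = np.eye(3, dtype="float32")[np.arange(8) % 3]
model = Sequential([Input(shape=(30, 64, 64, 3)),
                    TimeDistributed(Flatten()), LSTM(8),
                    Dense(3, activation="softmax")])

# Stop once validation loss has not improved for 15 epochs and keep
# the best weights seen so far.
early_stopping = EarlyStopping(monitor="val_loss", patience=15,
                               mode="min", restore_best_weights=True)

model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])

# epochs=2 keeps this demo quick; use far more (e.g. 70) in practice.
history = model.fit(features_train, labels_train,
                    epochs=2, batch_size=4, shuffle=True,
                    validation_split=0.25, callbacks=[early_stopping],
                    verbose=0)
```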
Evaluation
Here is the accuracy graph while training:
The model performance can be increased by some hyperparameter tuning.
Accuracy on test data:
Accuracy = 82.66666666666667
The model achieved an accuracy of about 83%, which is decent given only 300 training videos.
Future Work
The model is not yet trained on further suspicious actions such as fainting or burglary; it can still be improved by training it on more suspicious actions. Also, if the code is run on a high-end GPU, it can process CCTV footage in near real-time.
Conclusion
We created an LRCN model for detecting activities such as fighting, walking and running in CCTV footage. The model was trained on 300 videos and achieved an accuracy of about 83%.
You can connect with me on LinkedIn, or follow me on Github.