Unveiling the Power of Deep Learning in Violence Detection: A Comparative Study

Bassey Nta
13 min read · Mar 8, 2024


Introduction

The application of artificial intelligence algorithms to violence detection is of particular significance in low- and middle-income countries such as Nigeria, where it can play a vital role in tackling security and public safety challenges. This research adds to the growing body of datasets for violence detection, provides critical insights into the efficacy of different pre-trained models, and offers a potential breakthrough in addressing security issues. These models offer a tailor-made approach to enhancing surveillance systems and safety measures within the country. This article delves into the core findings and implications of the study.

Dataset

The study used three datasets. The first is the Real Life Violence Situations Dataset (RVSD), which contains 1,000 violence and 1,000 non-violence videos collected from YouTube: the violent videos capture real street-fight situations in several environments and conditions, while the non-violence videos cover a range of human actions such as sports, eating, and walking. These videos were broken down into 6,644 sequential frames.

The second is the Dark People Violent Dataset (DPVD), containing 5,747 random images collected as part of this research, and the third is the combination of both datasets (RVSD + DPVD), containing 12,093 images. The datasets reflect a concerted effort to align the research with the characteristics of violence prevalent in Nigeria, and their amalgamation serves as a foundation for training models that are sensitive to the unique manifestations of violence specific to the Nigerian context.

Table 1: Summary of Datasets
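
The sequential frames described above have to be extracted from the source videos before training. A minimal sketch of that preprocessing step with OpenCV (one of the libraries listed in the methodology) might look like the following; the paths, sampling step, and frame size are illustrative assumptions, not the study's exact settings.

```python
# Minimal sketch: sampling frames from a video with OpenCV.
# Paths, frame step, and image size are illustrative assumptions.
import os
import cv2

def extract_frames(video_path, out_dir, step=10, size=(224, 224)):
    """Save every `step`-th frame of a video as a resized JPEG."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    index, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:                      # end of video
            break
        if index % step == 0:
            frame = cv2.resize(frame, size)
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:05d}.jpg"), frame)
            saved += 1
        index += 1
    cap.release()
    return saved

# Hypothetical usage:
# extract_frames("videos/violence/fight_001.mp4", "frames/violence")
```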

Methodology:

The study executed each step of the research methodology, spanning data collection, preprocessing, and fine-tuning of deep neural networks (ResNet, DenseNet, Xception, and VGG-16) pre-trained on the ImageNet dataset. Utilizing the three datasets, the research aimed to establish a robust foundation for training and validating violence detection models. The work employed a comprehensive set of tools and technologies, featuring Python-based OpenCV, TensorFlow, PyTorch, and other libraries. The hardware, an Intel Core i5 processor, 16 GB of RAM, and an Nvidia GTX GPU with 4 GB of VRAM, provided the computational power required for the deep-learning tasks.
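
The article lists TensorFlow and PyTorch among the tools but does not show the model definition. As a minimal sketch, assuming a TensorFlow/Keras workflow, fine-tuning one of the ImageNet-pretrained backbones (here VGG-16) for the binary violence task could look like this; the classification head, learning rate, and frozen backbone are illustrative assumptions.

```python
# Minimal sketch of the transfer-learning setup, assuming TensorFlow/Keras;
# the head architecture and hyperparameters are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_violence_classifier(input_shape=(224, 224, 3)):
    # VGG-16 backbone pre-trained on ImageNet, without its original classifier.
    base = tf.keras.applications.VGG16(
        weights="imagenet", include_top=False, input_shape=input_shape
    )
    base.trainable = False  # freeze convolutional features for initial training

    model = models.Sequential([
        base,
        layers.GlobalAveragePooling2D(),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(1, activation="sigmoid"),  # violence vs. non-violence
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

The same pattern applies to the other backbones by swapping in tf.keras.applications.DenseNet121, ResNet50, or Xception.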

Testing and Validation

The core of the study revolved around rigorous testing and validation. The models were assessed based on:

  • Training and validation accuracy
  • Loss function
  • Classification reports
  • Confusion matrices

Each violence detection model was trained three times: once with DPVD, once with RVSD, and once with the combination of both datasets (RVSD + DPVD). Each dataset was divided into training and test sets, with 80% of the data used to train the system and 20% used to evaluate its performance. The images were trained in batches over 50 epochs. After each model was developed, Matplotlib was used to plot the training results, including accuracy, loss, the confusion matrix, and the classification report.
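
A minimal sketch of this split-train-plot loop is shown below. It assumes the extracted frames have been loaded into NumPy arrays X (images) and y (labels, 0 = non-violent, 1 = violent) and reuses the build function from the sketch above; the batch size, random seed, and [0, 1] scaling are assumptions.

```python
# Minimal sketch of the 80/20 split, 50-epoch training run, and metric plots.
# Assumes: X = frames resized to 224x224 and scaled to [0, 1], y = 0/1 labels.
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = build_violence_classifier()          # from the sketch above
history = model.fit(
    X_train, y_train,
    validation_data=(X_test, y_test),
    epochs=50,
    batch_size=32,
)

# Accuracy and loss curves over the 50 epochs.
for metric in ("accuracy", "loss"):
    plt.figure()
    plt.plot(history.history[metric], label=f"train {metric}")
    plt.plot(history.history[f"val_{metric}"], label=f"validation {metric}")
    plt.xlabel("epoch")
    plt.ylabel(metric)
    plt.legend()
    plt.show()
```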

Overview of Results

Real Life Violence Situations Dataset

DenseNet exhibited remarkable performance, with an initial training accuracy of 0.80 and validation accuracy of 0.95, steadily increasing to 0.99 in training and 0.98 in validation after 50 epochs. ResNet, starting at 0.50 training and 0.60 validation accuracy, showed improvement, reaching 87.3% in training and 86.4% in validation. VGG-16 began at 0.825 training and 0.900 validation accuracy, culminating in an impressive 99.2% training and 97.2% validation accuracy. Xception demonstrated an initial surge in training accuracy to 0.96, settling at 1.00, while validation accuracy ranged between 0.925 and 0.95, resulting in a final accuracy of 0.99 in training and 0.95 in validation after 50 epochs. Overall, these models showcase promising accuracy in violence detection on real-life violence data.

Fig 1: DenseNet Accuracy on RVSD
Fig 2: ResNet Accuracy on RVSD
Fig 3: VGG-16 Accuracy on RVSD
Fig 4: Xception Accuracy on RVSD

Dark People Violent Dataset

DenseNet displayed an initial training accuracy of 0.70 and validation accuracy of 0.90, escalating to 0.95–0.99 in training and maintaining 0.90–0.95 in validation. After 50 epochs, DenseNet achieved a final training accuracy of 98.6%, with a validation accuracy of 92.8%. ResNet, starting at 0.55 training and 0.625 validation accuracy, increased to 0.70–0.80 in training and 0.725–0.80 in validation, concluding with 79.6% training and 79.2% validation accuracy. VGG-16 started at 0.70 training and 0.85 validation accuracy, reaching 0.90–0.98 in training and 0.85–0.95 in validation. After 50 epochs, VGG-16 attained a final training accuracy of 97.9%, while validation accuracy was 92%. Xception, beginning at 0.75 training and 0.825 validation accuracy, progressed to 0.95–0.98 in training and 0.85–0.90 in validation. The final training accuracy for Xception was 98.6%, with a validation accuracy of 90.6% after 50 epochs.

Fig 5: DenseNet Accuracy on DPVD
Fig 6: ResNet Accuracy on DPVD
Fig 7: VGG-16 Accuracy on DPVD
Fig 8: Xception Accuracy on DPVD

Mixed Dataset (RVSD + DPVD)

In the evaluation on the mixed dataset, encompassing the Real Life Violence Situations Dataset (RVSD) and the Dark People Violent Dataset (DPVD), DenseNet exhibited robust performance with an initial training accuracy of approximately 0.75 and a validation accuracy of 0.90. As training progressed, DenseNet achieved impressive accuracy, peaking between 0.95 and 0.98, while validation accuracy maintained a high range between 0.90 and 0.95. After 50 epochs, DenseNet concluded with a final training accuracy of 98.3% and a validation accuracy of 96.9%. ResNet demonstrated moderate performance, starting at 0.50 and 0.575 for training and validation accuracy, respectively, and reaching a final training accuracy of 80.6% and a validation accuracy of 82.8%. VGG-16 showcased strong capabilities with initial accuracies of around 0.75 and 0.90, progressively improving to 0.95–1.00 in training and maintaining a steady 0.90–0.95 in validation. The final training accuracy for VGG-16 was 98.6%, and the validation accuracy was 96.1%. Xception displayed robust results with initial accuracies of 0.75 and 0.88, escalating to 0.95–1.00 during training and achieving a final training accuracy of 99.1% and validation accuracy of 94.8% after 50 epochs. These outcomes underscore the effectiveness of these deep learning models in handling diverse datasets, emphasizing their potential applicability in real-world scenarios.

Fig 9: DenseNet Accuracy on Mixed Dataset
Fig 10: ResNet Accuracy on Mixed Dataset
Fig 11: VGG-16 Accuracy on Mixed Dataset
Fig 12: Xception Accuracy on Mixed Dataset

Evaluation Based on Loss Function
Real Life Violence Situations Dataset

Examining the loss functions for the Real Life Violence Situations Dataset provides insightful observations on the training dynamics of the deep learning models. DenseNet's dynamic behavior and fluctuations, coupled with spikes at critical epochs, indicate potential challenges in convergence or optimization for this dataset. ResNet's erratic behavior, with multiple sharp spikes, suggests difficulty in stabilizing the training process, emphasizing the need for careful parameter tuning. VGG-16's fluctuating loss graphs, especially the sharp spike in validation loss at epoch 45, highlight sensitivity to certain dataset characteristics that requires attention during model training. Xception's fluctuations with gentle spikes indicate the model's responsiveness to training nuances but may pose challenges in fine-tuning. These insights underscore the importance of model-specific considerations and optimizations when applying deep learning to real-life violence data.

Fig 13: DenseNet Loss on RVSD
Fig 14: ResNet Loss on RVSD
Fig 15: VGG-16 Loss on RVSD
Fig 16: Xception Loss on RVSD
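
The article does not name the loss being plotted in these graphs. Assuming the standard setup for binary classification (a single sigmoid output trained with binary cross-entropy, which is an assumption rather than something the article states), the quantity shown in Figures 13 through 24 would be:

```latex
\mathcal{L}_{\mathrm{BCE}}
  = -\frac{1}{N}\sum_{i=1}^{N}
    \Big[\, y_i \log \hat{y}_i + (1 - y_i)\log\big(1 - \hat{y}_i\big) \Big]
```

where y_i is the ground-truth label of frame i (0 = non-violent, 1 = violent) and ŷ_i is the model's predicted probability of violence. Lower and smoother curves indicate more stable convergence.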

Dark People Violent Dataset

Analyzing the loss functions for the Dark People Violent Dataset reveals notable insights into the training dynamics of the various deep learning models. DenseNet exhibited erratic behavior with sharp spikes at critical epochs, suggesting challenges in convergence or optimization. ResNet displayed fluctuations and spikes, emphasizing potential difficulties in stabilizing the training process. VGG-16 showcased similarly erratic behavior, indicating potential sensitivity to the dataset. Xception demonstrated relatively smoother training but still experienced gentle spikes, implying subtle challenges in fine-tuning. These patterns suggest that the models may face complexities specific to the characteristics of the Dark People Violent Dataset.

Fig 17: DenseNet Loss on DPVD
Fig 18: ResNet Loss on DPVD
Fig 19: VGG-16 Loss on DPVD
Fig 20: Xception Loss on DPVD

Mixed Dataset

Analyzing the loss functions for the mixed dataset provides insights into the training dynamics of the deep learning models. DenseNet exhibits rapid fluctuations in both training and validation losses, with gentle spikes at epochs 31 and 45 in the validation loss graph, indicating potential challenges in convergence during these epochs. ResNet shows similarly erratic behavior with rapid fluctuations, featuring sharp spikes at epochs 28 and 42, along with gentle spikes at epochs 20, 35, and 38 in the validation loss graph. VGG-16 displays fluctuating loss graphs with sharp spikes at epochs 23, 35, 38, and 45 in the validation loss, suggesting sensitivity to certain dataset characteristics. Xception exhibits rapid fluctuations in both training and validation losses, with gentle spikes at epochs 23, 28, and 42 in the validation loss graph. These insights emphasize the nuanced nature of training dynamics when dealing with mixed datasets, requiring careful consideration in model development and optimization.

Fig 21: DenseNet Loss on Mixed Dataset
Fig 22: ResNet Loss on Mixed Dataset
Fig 23: VGG-16 Loss on Mixed Dataset
Fig 24: Xception Loss on Mixed Dataset

Evaluation Based on Classification Report (Precision, Recall, and F1-Score)

Real Life Violence Situations Dataset

In the evaluation based on precision, recall, and F1-score for the Real Life Violence Situations Dataset, the classification report summarizes the performance metrics of each implemented model trained on the dataset. The DenseNet model achieved remarkable results, with a precision, recall, and F1-score of 0.99, emphasizing its high accuracy and reliability in violence detection. ResNet demonstrated good performance with a precision of 0.88, a recall of 0.87, and an F1-score of 0.86. VGG-16 exhibited impressive precision, recall, and F1-score values of 0.97, highlighting its effectiveness in violence detection. Xception also displayed strong performance metrics, with precision, recall, and an F1-score of 0.96. These results indicate the models' competence in accurately classifying violent situations, with variations in performance metrics providing a comprehensive assessment of their strengths and areas for potential improvement.

Fig 25: Classification Report of Models on RVSD
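
A minimal sketch of how such a classification report can be generated with scikit-learn, reusing the trained model and held-out test split from the earlier sketches; the 0.5 decision threshold is an assumption.

```python
# Minimal sketch: per-class precision, recall, and F1-score with scikit-learn.
# Reuses `model`, `X_test`, and `y_test` from the training sketch above.
from sklearn.metrics import classification_report

probs = model.predict(X_test).ravel()     # sigmoid probabilities of violence
preds = (probs >= 0.5).astype(int)        # 1 = violent, 0 = non-violent

print(classification_report(
    y_test, preds, target_names=["non-violent", "violent"]
))
```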

Dark People Violent Dataset

In the evaluation based on precision, recall, and F1-score for the Dark People Violent Dataset, the classification report summarizes the performance of all implemented models trained on the collected dataset. The DenseNet model showcased commendable precision, recall, and F1-score values of 0.93, indicating its effectiveness in accurately identifying instances of violence. ResNet demonstrated good performance with a precision of 0.80, a recall of 0.80, and an F1-score of 0.79. VGG-16 exhibited strong precision, recall, and F1-score values of 0.92, emphasizing its reliability in violence detection. Xception also displayed robust performance metrics, with precision, recall, and an F1-score of 0.91. These results highlight the models' capability to effectively classify violent situations in the Dark People Violent Dataset, offering valuable insights into their strengths and areas for potential enhancement.

Fig 26: Classification Report of Models on DPVD

Mixed Dataset

For the mixed dataset (RVSD + DPVD), the classification report summarizes the performance of all implemented models trained on the combined data. The DenseNet model demonstrated robust performance with precision, recall, and F1-score values of 0.97, highlighting its efficiency in accurately identifying instances of violence. ResNet exhibited commendable results with precision, recall, and an F1-score of 0.83, showcasing its effectiveness in violence detection. VGG-16 showcased strong precision, recall, and F1-score values of 0.96, emphasizing its reliability across datasets. Xception displayed solid performance metrics, with a precision, recall, and F1-score of 0.95. These findings underscore the versatility and effectiveness of the models in detecting violence across diverse datasets, providing valuable insights into their overall performance in mixed scenarios.

Fig 27: Classification Report of Models on Mixed Dataset

Evaluation Based on Confusion Matrix

Real Life Violence Situations Dataset

The confusion matrix analysis for the RVSD reveals distinctive misclassification patterns among the implemented models. DenseNet showed a comparatively low number of misclassifications, with 4 non-violent images erroneously classified as violent and 13 violent images as non-violent. ResNet faced challenges, particularly with false positives, misclassifying over 100 non-violent images as violent and 29 violent images as non-violent. VGG-16 demonstrated moderate misclassifications, with 12 non-violent images mistakenly labeled as violent and 24 violent images as non-violent. Xception, on the other hand, displayed a more balanced performance, misclassifying 45 non-violent images as violent and 9 violent images as non-violent.

Fig 28: Confusion Matrix for DenseNet on RVSD
Fig 29: Confusion Matrix for ResNet on RVSD
Fig 30: Confusion Matrix for VGG-16 on RVSD
Fig 31: Confusion Matrix for Xception on RVSD
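
The confusion matrices above can be produced directly with scikit-learn. A minimal sketch, reusing y_test and preds from the classification-report sketch earlier:

```python
# Minimal sketch: computing and plotting a confusion matrix with scikit-learn.
# Reuses `y_test` and `preds` from the classification-report sketch above.
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(y_test, preds)      # rows: true labels, columns: predictions
ConfusionMatrixDisplay(
    cm, display_labels=["non-violent", "violent"]
).plot(cmap="Blues")
plt.show()
```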

Dark People Violent Dataset

In the assessment of the DPVD, the confusion matrices revealed noteworthy misclassification patterns across the implemented models. DenseNet exhibited a considerable number of misclassifications, with 63 non-violent images mistakenly labeled as violent and 19 violent images as non-violent. This suggests a sensitivity to false positives, requiring adjustments to improve precision. ResNet faced substantial challenges, particularly with false positives, misclassifying over 100 non-violent images as violent and 75 violent images as non-violent. The model's performance indicates the need for refinement, emphasizing precision enhancement to minimize false positives. VGG-16 demonstrated moderate misclassifications, with 63 non-violent images erroneously classified as violent and 28 violent images as non-violent. These misclassification patterns hint at areas for improvement, particularly in balancing precision and recall. Lastly, Xception displayed a mixed performance, misclassifying 63 non-violent images as violent and 44 violent images as non-violent.

Fig 32: Confusion Matrix for DenseNet on DPVD
Fig 33: Confusion Matrix for ResNet on DPVD
Fig 34: Confusion Matrix for VGG-16 on DPVD
Fig 35: Confusion Matrix for Xception on DPVD

Mixed Dataset

In the evaluation on the mixed dataset, the confusion matrices highlighted critical misclassification trends across the employed models. DenseNet exhibited misclassifications, with 31 non-violent images inaccurately categorized as violent and 44 violent images as non-violent, emphasizing the importance of refining the model to minimize both false positives and false negatives. ResNet faced considerable challenges, misclassifying over 200 non-violent images as violent and a similar number of violent images as non-violent, indicating the need for significant improvements to enhance precision and recall. VGG-16 demonstrated moderate misclassifications, with 48 non-violent images wrongly labeled as violent and 44 violent images as non-violent, suggesting opportunities for fine-tuning the model to achieve a more balanced performance. Xception showed a mix of misclassifications, with 75 non-violent images inaccurately categorized as violent and 49 violent images as non-violent.

Fig 36: Confusion Matrix for DenseNet on Mixed Dataset
Fig 37: Confusion Matrix for ResNet on Mixed Dataset
Fig 38: Confusion Matrix for VGG-16 on Mixed Dataset
Fig 39: Confusion Matrix for Xception on Mixed Dataset

Conclusion:

In this comprehensive exploration of deep learning algorithms for violence detection in images, the comparative study across multiple datasets and models has yielded valuable insights. The models, namely DenseNet, ResNet, VGG-16, and Xception, exhibited commendable accuracy, precision, recall, and F1-score metrics across the distinct datasets. Notably, DenseNet consistently outperformed the others, boasting an impressive average accuracy of 97.6%. However, the mixed-dataset evaluation uncovered nuanced challenges, emphasizing the need for ongoing model refinement to address misclassifications. The analysis of loss functions revealed dynamic behaviors, guiding further enhancements for robustness. While these models present promising tools for violence detection, it is crucial to acknowledge their limitations and continually optimize their performance in real-world scenarios.

Recommendation:

To propel this research forward, deploying the developed models in real-time scenarios should be a priority. Incorporating these models into web-based applications, police drones, or surveillance cameras would provide a practical assessment of their efficacy in dynamic, uncontrolled environments. The misclassification trends identified in the mixed dataset underline the importance of ongoing model optimization, potentially leveraging advanced techniques such as ensemble learning (sketched below). Furthermore, given the focus on violence detection in Nigeria and other low- and middle-income countries, concerted efforts should be made to expand and diversify the datasets, ensuring robust models that are sensitive to the nuances of diverse cultural and environmental contexts. Additionally, collaboration with local authorities and organizations can facilitate the integration of these models into existing surveillance infrastructure, contributing significantly to public safety efforts.
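
As a sketch of the ensemble direction mentioned above, the simplest option is soft voting: averaging the predicted violence probabilities of the four fine-tuned models and thresholding the mean. This is an illustrative assumption, not part of the study's implementation.

```python
# Minimal sketch of soft-voting ensemble inference over the four models.
# The model list, threshold, and inputs are illustrative assumptions.
import numpy as np

def ensemble_predict(models, images, threshold=0.5):
    """Average sigmoid outputs across models and threshold the mean."""
    probs = np.mean([m.predict(images).ravel() for m in models], axis=0)
    return (probs >= threshold).astype(int)

# Hypothetical usage:
# ensemble_predict([densenet, resnet, vgg16, xception], X_test)
```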

References

Real Life Violence Situations Dataset, Kaggle: https://www.kaggle.com/mohamedmustafa/real-life-violence-situations-dataset


