As online video proliferates, so does the risk of spreading violent content.
Automated software that mitigates this risk is becoming increasingly crucial to reduce the human effort spent on manual review.
This research offers a video-level solution for detecting scenes of violence in videos.
Recent advances in computer vision and deep learning inspired us to employ the Inflated 3D convolutional neural network (I3D) for this task. To this end, we introduce a novel video dataset, named Foi-Fight, for detecting violent content in videos.
The dataset is composed of 10,683 trimmed videos representing violent and non-violent scenes from different backgrounds.
The basic idea of our approach is to capture both spatial and temporal information from video content by training the I3D network on three input modalities: RGB, RGB difference, and optical flow, the latter extracted with LiteFlowNet.
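Of the three modalities, RGB difference is the simplest to illustrate. The sketch below assumes it is computed as the element-wise difference between consecutive frames, which is a common convention but is not specified here; the `Frame` type and the `rgb_difference` function are hypothetical names introduced only for illustration, and frames are represented as flat lists of channel values to keep the example dependency-free.

```python
from typing import List, Sequence

# Hypothetical representation: a frame flattened to one integer per pixel channel.
Frame = List[int]


def rgb_difference(frames: Sequence[Frame]) -> List[Frame]:
    """Return element-wise differences between consecutive frames.

    Assumption: difference t is frames[t + 1] - frames[t], so T input frames
    yield T - 1 difference frames. Values may be negative.
    """
    if len(frames) < 2:
        return []
    diffs: List[Frame] = []
    for prev, curr in zip(frames, frames[1:]):
        if len(prev) != len(curr):
            raise ValueError("all frames must have the same number of values")
        diffs.append([c - p for p, c in zip(prev, curr)])
    return diffs
```

Under this assumption, static regions produce values near zero while moving regions produce large magnitudes, which is why the modality can serve as an inexpensive motion cue alongside optical flow.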
Building on these results, we implement a segment-based sampling scheme designed to recognize violent scenes in untrimmed videos in real time. Furthermore, we implement a new inference strategy that improves neural network performance. As a benchmark, we collect 39 untrimmed videos totaling 2 hours and 45 minutes to evaluate different configurations of the proposed model for real-life applications. The empirical results demonstrate that, with our segment-based sampling, the model achieves high accuracy and robustness, reaching a maximum accuracy of 83.98%.
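To make the idea of segment-based sampling on untrimmed video concrete, the following is a minimal sketch and not the configuration evaluated in this work. It assumes that the video is divided into fixed-length sliding segments and that a fixed number of evenly spaced frames is drawn from each segment before classification. The function names and the parameters `segment_len`, `stride`, and `num_samples` are hypothetical.

```python
from typing import List


def segment_indices(num_frames: int, segment_len: int, stride: int) -> List[range]:
    """Split frame indices 0..num_frames-1 into sliding segments.

    Assumption: trailing frames that do not fill a whole segment are dropped.
    """
    if segment_len <= 0 or stride <= 0:
        raise ValueError("segment_len and stride must be positive")
    last_start = num_frames - segment_len
    return [range(s, s + segment_len) for s in range(0, last_start + 1, stride)]


def sample_from_segment(segment: range, num_samples: int) -> List[int]:
    """Pick num_samples evenly spaced frame indices from one segment."""
    if num_samples <= 0 or len(segment) == 0:
        raise ValueError("need a non-empty segment and num_samples > 0")
    step = len(segment) / num_samples
    # Take the centre of each of the num_samples equal-width bins.
    return [
        segment[min(int(i * step + step / 2), len(segment) - 1)]
        for i in range(num_samples)
    ]
```

For example, `segment_indices(10, 4, 3)` yields three overlapping segments starting at frames 0, 3, and 6, and each segment can be scored independently as soon as its last frame arrives, which is what makes a scheme of this kind suitable for streaming input.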