Video content is proliferating, and with it the risk of spreading violent material.
Mitigating this risk with automated software is becoming increasingly crucial for reducing human effort. This research offers a video-level solution for detecting violent scenes in videos.
Inspired by recent advances in deep learning for computer vision, we employ the Inflated 3D convolutional neural network (I3D) for this task. We also introduce a novel video dataset, named Foi-Fight, designed to support the detection of violent content in videos. The dataset is composed of 10,683 trimmed videos representing violent and non-violent scenes from different backgrounds.
The idea behind our approach is to capture both spatial and temporal information from video content by training the I3D network on three input modalities: RGB, RGB difference, and optical flow extracted with LiteFlowNet.
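As an illustration of the RGB difference modality, the following minimal sketch computes the difference between consecutive frames of a clip. It assumes frames are stored as a NumPy array of shape (T, H, W, 3); the function name `rgb_difference` and the array layout are hypothetical choices for this example and do not reflect the actual preprocessing pipeline.

```python
import numpy as np


def rgb_difference(frames: np.ndarray) -> np.ndarray:
    """Return frame-to-frame differences for a clip.

    frames: uint8 array shaped (T, H, W, 3) -- an assumed layout.
    The result has shape (T - 1, H, W, 3) and a signed dtype.
    """
    if frames.ndim != 4 or frames.shape[0] < 2:
        raise ValueError("expected at least two frames shaped (T, H, W, 3)")
    # Cast to a signed type first so that negative differences
    # do not wrap around as they would in uint8 arithmetic.
    signed = frames.astype(np.int16)
    return signed[1:] - signed[:-1]


# Toy clip: four 2x2 frames with constant intensities 0, 10, 5, 0.
clip = np.zeros((4, 2, 2, 3), dtype=np.uint8)
clip[1] = 10
clip[2] = 5
diff = rgb_difference(clip)
```

The differences retain the sign of the intensity change, which is why the sketch casts to a signed type before subtracting.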
Building on these results, we implement segment-based sampling designed to recognize violent scenes in untrimmed videos in real time. Furthermore, we implement a new inference strategy that improves the network's performance. As a benchmark, we collect 39 untrimmed videos totaling 2 hours and 45 minutes and use them to evaluate different configurations of the proposed model for real-life applications. The empirical results demonstrate that, with our segment-based sampling, the model achieves high accuracy and robustness, reaching a maximum accuracy of 83.98%.
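The general idea of segment-based processing of an untrimmed video can be sketched as follows. This is only an illustration under stated assumptions: the segment length, stride, decision threshold, and the helper names `split_into_segments` and `flag_violent_segments` are hypothetical, and `score_fn` stands in for a trained classifier; none of these details are specified above.

```python
from typing import Callable, List, Tuple

Segment = Tuple[int, int]  # half-open frame range [start, end)


def split_into_segments(num_frames: int, segment_len: int, stride: int) -> List[Segment]:
    """Cover a video of num_frames frames with fixed-length windows."""
    if segment_len <= 0 or stride <= 0:
        raise ValueError("segment_len and stride must be positive")
    if num_frames <= 0:
        return []
    if num_frames <= segment_len:
        return [(0, num_frames)]
    starts = list(range(0, num_frames - segment_len + 1, stride))
    # Add a final window aligned to the end so trailing frames are not dropped.
    if starts[-1] + segment_len < num_frames:
        starts.append(num_frames - segment_len)
    return [(s, s + segment_len) for s in starts]


def flag_violent_segments(
    num_frames: int,
    segment_len: int,
    stride: int,
    score_fn: Callable[[int, int], float],
    threshold: float = 0.5,  # assumed decision threshold
) -> List[Segment]:
    """Return the segments whose score reaches the threshold."""
    flagged = []
    for start, end in split_into_segments(num_frames, segment_len, stride):
        if score_fn(start, end) >= threshold:
            flagged.append((start, end))
    return flagged


# Toy scorer (hypothetical): pretends violence begins at frame 64.
def toy_score(start: int, end: int) -> float:
    return 0.9 if start >= 64 else 0.1


segments = split_into_segments(100, 32, 32)
flagged = flag_violent_segments(100, 32, 32, toy_score)
```

In this sketch each segment is scored independently, so segments can be processed as frames arrive rather than after the whole video has been read.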