Paper ID: 2405.02961

JOSENet: A Joint Stream Embedding Network for Violence Detection in Surveillance Videos

Pietro Nardelli, Danilo Comminiello

The increasing proliferation of video surveillance cameras and the escalating demand for crime prevention have intensified interest in the task of violence detection within the research community. Compared to other action recognition tasks, violence detection in surveillance videos presents additional issues, such as the wide variety of real fight scenes. Unfortunately, existing datasets for violence detection are relatively small in comparison to those for other action recognition tasks. Moreover, surveillance footage often features different individuals in each video and varying backgrounds for each camera. In addition, fast detection of violent actions in real-life surveillance videos is crucial to prevent adverse outcomes, thus necessitating models that are optimized for reduced memory usage and computational costs. These challenges complicate the application of traditional action recognition methods. To tackle all these issues, we introduce JOSENet, a novel self-supervised framework that provides outstanding performance for violence detection in surveillance videos. The proposed model processes two spatiotemporal video streams, namely RGB frames and optical flows, and incorporates a new regularized self-supervised learning approach for videos. JOSENet demonstrates improved performance compared to state-of-the-art methods, while utilizing only one-fourth of the frames per video segment and operating at a reduced frame rate. The source code is available at https://github.com/ispamm/JOSENet.

Submitted: May 5, 2024